Reporting Gets Stuck

Hi all,

One of our experiments is currently running a large staging campaign, and we’ve observed that after some time, retrieve jobs stop being reported and remain stuck in the ObjectStore. This issue doesn’t affect archive jobs as consistently, but we’ve noticed that they can occasionally get stuck too. In those cases, deleting the RetrieveReports, if any, allows them to be reported.

We’ve had to trigger the stages again, but eventually reporting gets stuck once more. The CTA logs clearly show that the files are successfully transferred to disk, but since reporting doesn’t happen, these files are not reflected in dCache.

At the moment, three tape movers have the Maintenance process enabled.

Has anyone observed a similar behaviour or have any suggestions on how to resolve or mitigate this?

Sample stuck jobs since last evening:

-rwx------. 1 cta tape 236 Aug  7 00:45 RetrieveQueueToReportForUser-J00665-DriveProcess-rz3-6,3,1-tpm303.desy.de-1094118-20250807-00:41:15-0-0
-rwx------. 1 cta tape 240 Aug  7 00:50 RetrieveQueueToReportForUser-J00728-DriveProcess-rz3-6,4,3-tpm306.desy.de-1073927-20250807-00:43:21-0-0
-rwx------. 1 cta tape 240 Aug  6 23:24 RetrieveQueueToReportForUser-J01278-DriveProcess-rz1-6,3,4-tpm105.desy.de-3191049-20250806-17:37:18-0-20
-rwx------. 1 cta tape 240 Aug  7 00:46 RetrieveQueueToReportForUser-J01462-DriveProcess-rz3-6,4,1-tpm303.desy.de-1092995-20250807-00:38:29-0-0
-rwx------. 1 cta tape 245 Aug  6 23:38 RetrieveQueueToReportForUser-J03480-DriveProcess-rz1-6,4,3-tpm106.desy.de-1836571-20250806-17:36:20-0-18
-rwx------. 1 cta tape 237 Aug  7 00:43 RetrieveQueueToReportForUser-J03774-DriveProcess-rz3-6,3,3-tpm305.desy.de-671364-20250807-00:40:11-0-0
-rwx------. 1 cta tape 239 Aug  7 00:51 RetrieveQueueToReportForUser-J03807-DriveProcess-rz3-6,4,4-tpm306.desy.de-1025146-20250806-21:52:54-0-8
-rwx------. 1 cta tape 238 Aug  7 00:48 RetrieveQueueToReportForUser-J03819-DriveProcess-rz3-6,3,3-tpm305.desy.de-671971-20250807-00:43:55-0-0
-rwx------. 1 cta tape 237 Aug  7 00:53 RetrieveQueueToReportForUser-J03825-DriveProcess-rz3-6,3,3-tpm305.desy.de-672982-20250807-00:49:28-0-0
-rwx------. 1 cta tape 240 Aug  7 00:37 RetrieveQueueToReportForUser-J03835-DriveProcess-rz3-6,2,4-tpm303.desy.de-1090876-20250807-00:33:22-0-0
-rwx------. 1 cta tape 244 Aug  7 00:52 RetrieveQueueToReportForUser-J03847-DriveProcess-rz3-6,2,3-tpm303.desy.de-1061703-20250806-23:39:01-0-0
-rwx------. 1 cta tape 238 Aug  7 00:50 RetrieveQueueToReportForUser-J03886-DriveProcess-rz3-6,3,3-tpm305.desy.de-587853-20250806-21:57:10-0-4
-rwx------. 1 cta tape 238 Aug  7 00:38 RetrieveQueueToReportForUser-J03893-DriveProcess-rz3-6,4,2-tpm306.desy.de-1071995-20250807-00:34:28-0-0
-rwx------. 1 cta tape 239 Aug  7 00:43 RetrieveQueueToReportForUser-J03896-DriveProcess-rz3-6,2,4-tpm303.desy.de-1093323-20250807-00:39:15-0-0
-rwx------. 1 cta tape 239 Aug  7 00:41 RetrieveQueueToReportForUser-J03906-DriveProcess-rz3-6,3,1-tpm303.desy.de-1088318-20250807-00:28:15-0-2
-rwx------. 1 cta tape 240 Aug  7 00:41 RetrieveQueueToReportForUser-J03937-DriveProcess-rz3-6,4,3-tpm306.desy.de-1072335-20250807-00:36:56-0-0
-rwx------. 1 cta tape 238 Aug  7 00:59 RetrieveQueueToReportForUser-J03943-DriveProcess-rz3-6,4,1-tpm303.desy.de-990021-20250806-22:18:27-0-0
-rwx------. 1 cta tape 240 Aug  7 00:56 RetrieveQueueToReportForUser-J03964-DriveProcess-rz3-6,3,1-tpm303.desy.de-956205-20250806-20:56:01-0-16
-rwx------. 1 cta tape 240 Aug  7 01:00 RetrieveQueueToReportForUser-J03991-DriveProcess-rz3-6,4,2-tpm306.desy.de-1074394-20250807-00:46:37-0-0
-rwx------. 1 cta tape 240 Aug  7 00:47 RetrieveQueueToReportForUser-J03998-DriveProcess-rz3-6,2,3-tpm303.desy.de-1095483-20250807-00:42:36-0-0
-rwx------. 1 cta tape 239 Aug  6 22:40 RetrieveQueueToReportForUser-J04517-DriveProcess-rz1-6,2,3-tpm103.desy.de-1022115-20250806-21:46:47-0-12
-rwx------. 1 cta tape 240 Aug  6 23:48 RetrieveQueueToReportForUser-J04552-DriveProcess-rz1-6,2,3-tpm103.desy.de-1096946-20250806-23:42:44-0-0
-rwx------. 1 cta tape 239 Aug  6 23:44 RetrieveQueueToReportForUser-J04554-DriveProcess-rz1-6,3,1-tpm103.desy.de-964966-20250806-19:33:28-0-16
-rwx------. 1 cta tape 248 Aug  6 23:17 RetrieveQueueToReportForUser-L07580-DriveProcess-rz1-2,2,1-tpm104.desy.de-656252-20250806-23:08:59-0-0
-rwx------. 1 cta tape 239 Aug  7 08:10 RetrieveQueueToReportForUser-L08343-DriveProcess-rz1-2,2,1-tpm104.desy.de-730087-20250807-06:29:03-0-0
-rwx------. 1 cta tape 239 Aug  7 00:42 RetrieveQueueToReportForUser-L08536-DriveProcess-rz1-2,1,3-tpm104.desy.de-646253-20250806-22:13:40-0-0

Thanks,
Mwai

Follow up:

Or perhaps it fails because the referenced object is missing from the ObjectStore? Dumping one of the RetrieveReport objects gives us the shard containing the RetrieveRequests, and dumping those gives an error because the object doesn’t exist:

Failed to dump object:
In ObjectStoreVFS::read, failed to open file for read: No such object for path /cta-cache/ctaObjectStore/RetrieveRequest-Frontend-cta-frontend03.desy.de-1552598-20250807-18:12:47-0-943

This is not the case for ArchiveReport objects: there the request is still in the ObjectStore, and is presumably only deleted after the reporting.
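To cross-check this at scale, something along the following lines could work. It is only a sketch on my side: it assumes the VFS objectstore backend (every object is a plain file under /cta-cache/ctaObjectStore, the path from the error above), that object names referenced by the serialized queue/shard objects show up as readable substrings in their raw bytes, and that shard object names contain "Shard". It is not based on the official objectstore tools:

```python
#!/usr/bin/env python3
# Sketch: for each queued "to report" object, find the RetrieveRequest names
# it (or a referenced shard) points to and flag the ones that no longer exist.
#
# Assumptions (mine, not verified against the CTA sources):
#  - the VFS objectstore backend is in use, so every object is a plain file
#    under OBJECTSTORE_DIR (path taken from the error message above);
#  - object names referenced by a serialized queue/shard object appear as
#    readable substrings in its raw bytes;
#  - shard object names contain the substring "Shard".
# NB: the string extraction is heuristic; confirm any hit by dumping the
# object, as above.
import os
import re

OBJECTSTORE_DIR = "/cta-cache/ctaObjectStore"
REQUEST_RE = re.compile(rb"RetrieveRequest-[\w.,:\-]+")
SHARD_RE = re.compile(rb"\w*Shard-[\w.,:\-]+")


def blob(name: str) -> bytes:
    """Return the raw bytes of an object, or b"" if it does not exist."""
    path = os.path.join(OBJECTSTORE_DIR, name)
    if not os.path.exists(path):
        return b""
    with open(path, "rb") as f:
        return f.read()


for queue in sorted(os.listdir(OBJECTSTORE_DIR)):
    if not queue.startswith("RetrieveQueueToReportForUser-"):
        continue
    data = blob(queue)
    # The queue object may only reference a shard; follow one level down.
    for shard in set(SHARD_RE.findall(data)):
        data += blob(shard.decode(errors="replace"))
    for req in sorted(set(REQUEST_RE.findall(data))):
        name = req.decode(errors="replace")
        if not os.path.exists(os.path.join(OBJECTSTORE_DIR, name)):
            print(f"{queue}: {name} is MISSING from the objectstore")
```

Every MISSING line would then correspond to a queued report entry pointing at a RetrieveRequest that no longer exists.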

As an example, a file is transferred to disk:

Aug  7 21:24:43.775036719 tpm102 cta-taped: LVL="INFO" PID="739903" TID="747421" MSG="File successfully transfered to disk" drive_name="rz1-2,3,1" instance="production" sched_backend="vfsUser" thread="DiskWrite" tapeDrive="rz1-2,3,1" tapeVid="L08567" mountId="20363" vo="petra3" tapePool="p05-osm" threadCount="10" threadID="0" fileId="42163716" dstURL="root://dcache-photon50.desy.de:1117/00004DD9232CD8984585BEDE5803C98126C1?oss.asize=30040160484" fSeq="704" actualURL="root://dcache-photon50.desy.de:1117/00004DD9232CD8984585BEDE5803C98126C1?oss.asize=30040160484&xrdcl.requuid=0f6bbba5-4970-4462-8a21-ede88cfb02ca" readWriteTime="110.233793" checksumingTime="11.7310240000001" waitDataTime="104.487109000009" waitReportingTime="0.000107" checkingErrorTime="0.00481200000000024" openingTime="0.021075" closingTime="0.002736" transferTime="122.010712" totalTime="226.480656" dataVolume="30040160484" globalPayloadTransferSpeedMBps="132.63896800087" diskPerformanceMBps="246.209205663844" openRWCloseToTransferTimeRatio="0.903671507137832"

and moments later, deleted:

Aug  7 21:24:43.819283584 tpm102 cta-taped: LVL="INFO" PID="739903" TID="747431" MSG="In OStoreDB::RetrieveJob::~RetrieveJob(): will leave the job owned after destruction." drive_name="rz1-2,3,1" instance="production" sched_backend="vfsUser" agentObject="DriveProcess-rz1-2,3,1-tpm102.desy.de-739903-20250807-21:03:14-0" jobObject="RetrieveRequest-Frontend-cta-frontend03.desy.de-1552598-20250807-18:12:47-0-943"

Nevertheless, the RetrieveReport is still in the ObjectStore:

-rwx------. 1 cta tape 237 Aug  7 21:24 RetrieveQueueToReportForUser-L08567-DriveProcess-rz1-2,3,1-tpm102.desy.de-739903-20250807-21:03:14-0-0

However, as mentioned in the earlier post, some do get reported immediately. Looking at file 42166842 with request ID RetrieveRequest-Frontend-cta-frontend03.desy.de-1519923-20250807-00:19:10-0-515:

Aug  7 00:33:32.316658703 tpm104 cta-taped: LVL="INFO" PID="665345" TID="671505" MSG="In OStoreDB::RetrieveMount::flushAsyncSuccessReports(), retrieve job successful" drive_name="rz1-2,2,1" instance="production" sched_backend="vfsUser" thread="RecallReportPacker" tapeDrive="rz1-2,2,1" mountId="105459" vo="petra3" tapePool="p05-osm" tapeVid="L08536" mountType="RetrieveForUser" fileId="42166842"

and, as we observe, the object is only removed after the reporting, in this case after a few hours:

Aug  7 10:30:43.501795390 tpm101 cta-taped: LVL="INFO" PID="889714" TID="889714" MSG="In OStoreDB::RetrieveJob::~RetrieveJob(): will leave the job owned after destruction." drive_name="rz1-2,4,2" instance="production" sched_backend="vfsUser" agentObject="Maintenance-tpm101.desy.de-889714-20250807-10:15:22-0" jobObject="RetrieveRequest-Frontend-cta-frontend03.desy.de-1519923-20250807-00:19:10-0-515"
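To eyeball that ordering across many files rather than one at a time, a small filter like the one below could help. It is just a sketch assuming the key="value" cta-taped log format shown above, and it only picks out the three messages quoted in this thread (matching fileId to the request object still has to be done by eye):

```python
#!/usr/bin/env python3
# Sketch: filter cta-taped logs (fed on stdin) down to the three messages
# quoted in this thread, so the transfer / report / destruction ordering per
# request can be eyeballed. Assumes the key="value" format shown above.
import re
import sys

PAIR_RE = re.compile(r'(\w+)="([^"]*)"')
INTERESTING = (
    "File successfully transfered to disk",        # spelling as in the logs
    "retrieve job successful",                     # flushAsyncSuccessReports()
    "will leave the job owned after destruction",  # ~RetrieveJob()
)

for line in sys.stdin:
    fields = dict(PAIR_RE.findall(line))
    msg = fields.get("MSG", "")
    if not any(pattern in msg for pattern in INTERESTING):
        continue
    timestamp = " ".join(line.split()[:3])  # e.g. "Aug 7 21:24:43.819283584"
    ident = fields.get("fileId") or fields.get("jobObject") or fields.get("requestObject", "?")
    print(f"{timestamp}  {ident}  {msg}")
```

Feeding it the taped logs of the drive and maintenance hosts involved should show whether the destructor message comes before or after the successful-report message for a given request.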

This might not be the clincher, but it could provide some useful clues.

Thank you

Hi Mwai,

Can you check in the logs for the maintenance processes to see if the reporting jobs are being picked up? Is it the case that CTA is trying to report the jobs and failing? Any errors or warnings?

Hi Michael,

I have a few pointers:
Lots of "Invalid request id" errors like the one below, so maybe a non-existent request:

Aug  5 07:16:02.424739920 tpm104 cta-taped: LVL="ERROR" PID="3737137" TID="3737137" MSG="In Scheduler::reportRetrieveJobsBatch(): failed to report." drive_name="rz1-2,2,1" instance="production" sched_backend="vfsUser" SubprocessName="maintenanceHandler" fileId="42168455" reportType="FailureReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Server responded with an error: [3000] Invalid request id code:400 errNo:3000 status:1"

and AsyncUpdateDelete fails; not sure whether this has anything to do with it:

Aug  5 21:09:18.357354702 tpm101 cta-taped: LVL="ERROR" PID="3400680" TID="3400717" MSG="In OStoreDB::RetrieveMount::flushAsyncSuccessReports(): async status update failed. Will leave job to garbage collection." drive_name="rz1-2,4,1" instance="production" sched_backend="vfsUser" thread="RecallReportPacker" tapeDrive="rz1-2,4,1" mountId="11454" vo="petra3" tapePool="p05-osm" fileId="42167469" requestObject="RetrieveRequest-Frontend-cta-frontend03.desy.de-1466152-20250805-19:22:41-0-412" exceptionMessage="In ObjectStoreVFS::read, failed to open file for read: No such object for path /cta-cache/ctaObjectStore/RetrieveRequest-Frontend-cta-frontend03.desy.de-1466152-20250805-19:22:41-0-412"

Maybe the first error means the request is missing in dCache? I have to find that out.

What does the async update do?

thanks

Hi Mwai,

The first error is a failed Failure Report. This means:

  • Something failed
  • CTA created a Failure Report
  • Delivery of the Failure Report also failed (possibly for the same reason as the original failure, or it could be unrelated)

Does the failed report show up with `cta-admin failedrequest ls`?

The second error is about a missing object in the objectstore. I would need more context to understand what’s going on. Perhaps a retrieve request was cancelled after it had already been processed?

These errors are not necessarily related to each other.

Hi Michael,

No, these don’t appear in the `failedrequest` list. The files are successfully transferred to disk; it is only the reporting step that fails, and not for all of them, but for most of those queued on the RetrieveQueueToReportForUser... objects.

So far, the observation is that the retrieve jobs referenced by the RetrieveQueueToReportForUser... objects are missing from the ObjectStore, which is perhaps why the reporting doesn’t happen. Not sure why they are missing in the first place, unless of course the requests are cancelled from dCache. We are currently working on reproducing the issue to better understand what’s going on, and I’ll report back.

Thanks

Hi Mwai,

The first example you quoted above was for a failed job, not just a failed report:

MSG="In Scheduler::reportRetrieveJobsBatch(): failed to report."
reportType="FailureReport"
exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Server responded with an error: [3000] Invalid request id code:400 errNo:3000 status:1"

The reason for the failed job is not given in this log line, but I would expect to find the failed job in the `failedrequest` list, where you can see all of the errors and retries for the job.

The message `Invalid request id code:400 errNo:3000 status:1` is from dCache. What could cause that?

The second example is a failed report for a successful data transfer:

MSG="In OStoreDB::RetrieveMount::flushAsyncSuccessReports(): async status update failed. Will leave job to garbage collection."
exceptionMessage="In ObjectStoreVFS::read, failed to open file for read: No such object for path /cta-cache/ctaObjectStore/RetrieveRequest-Frontend-cta-frontend03.desy.de-1466152-20250805-19:22:41-0-412"

This looks like a different case from the first one, and it appears to be the case under investigation.