We are seeing a very low rate of files being archived to tape, even though files are being read from EOS and written to tape. The issue seems to be caused by the following error message:
{"epoch_time":1764190023.326574545,"local_time":"2025-11-26T20:47:03+0000","hostname":"getafix-ts20","program":"cta-taped","log_level":"ERROR","pid":833340,"tid":843106,"message":"In ArchiveMount::reportJobsBatchTransferred(): got an exception","drive_name":"obelix_lto9_22","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto9_22","mountId":"3282291","vo":"storaged-ceda","tapePool":"offsite_lto8","successfulBatchSize":45,"exceptionMessageValue":"filesWrittenToTape: Failed to find archive file entry in the catalogue: archiveFileId=4383011486, diskInstanceName=eosantaresfac, diskFileId=210594737"}
We have confirmed that files are being read from EOS successfully and written to tape successfully, but the reporting failure means the files are not registered in the catalogue.
We are seeing this only for our facilities VO that has two tape copies for their data. We are seeing this error reported for both the first and second tape copy sessions. Files with a single tape copy for other VOs are going to tape as expected, at the expected rate, without any indication of this error.
We’re dealing with a fairly large backlog of facilities data to ingest at the moment (~100TB, ~10k files on the buffer). Although we are seeing a low rate of these errors (a few an hour), it seems to be enough to reduce the rate to tape to effectively zero and for the buffer to fill up. I assume this is because the tape sessions are long and the batches contain many files. We continue to see single copy files going to tape during this time without issue.
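For anyone wanting to quantify the "few an hour" figure on their own tape servers, here is a minimal sketch of how the errors could be tallied from the cta-taped JSON log. It assumes one JSON object per line with the same fields as the message above (`local_time`, `message`, `tapePool`, `exceptionMessageValue`); the function name `summarise` is just illustrative and not part of CTA.

```python
import json
import re
from collections import Counter
from datetime import datetime

# Message text copied from the error above.
TARGET = "In ArchiveMount::reportJobsBatchTransferred(): got an exception"
ARCHIVE_ID_RE = re.compile(r"archiveFileId=(\d+)")


def summarise(lines):
    """Count matching errors per (hour, tapePool) and collect archiveFileIds.

    Hypothetical helper: assumes each line is a JSON object shaped like
    the cta-taped log line quoted in this post.
    """
    per_hour = Counter()
    archive_ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(rec, dict) or rec.get("message") != TARGET:
            continue
        # local_time looks like 2025-11-26T20:47:03+0000
        ts = datetime.strptime(rec["local_time"], "%Y-%m-%dT%H:%M:%S%z")
        hour = ts.strftime("%Y-%m-%dT%H:00")
        per_hour[(hour, rec.get("tapePool", "?"))] += 1
        match = ARCHIVE_ID_RE.search(rec.get("exceptionMessageValue", ""))
        if match:
            archive_ids.add(int(match.group(1)))
    return per_hour, archive_ids
```

Passing an open log file to `summarise` (for example `summarise(open(path))`, with `path` being wherever your deployment writes the cta-taped log) returns the hourly counts per tape pool and the set of affected archive file IDs, which can then be checked against the catalogue.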
We are seeing some of the dual copy files going to tape and being registered in the catalog, but it’s at a rate of ~100MB/s, when the tape servers are reading files off the buffer at ~10GB/s.
I note that in the thread "In ArchiveMount::reportJobsBatchTransferred(): got an exception" (#3 by poliverc), there was a mention of issues with dual copy storage classes and ObjectStore locking timeouts. I wonder if we’re running into something similar?
I thought the issue was that the 2nd copy report needed the 1st copy to have already been recorded in the catalogue, but we are also seeing exceptions when reporting 1st copy mounts.
Details
CTA version: 5.11.10.0-1
Operating System and version: Rocky 9.6
Xrootd version: 5.8.0-2
Objectstore backend: 17.2.8-2