Two-copy archival reporting intermittently failing with "Failed to find archive file entry in the catalogue."

Yes, thank you. I was forgetting that.

Indeed, the request for the other copy is probably still queued.

Could you confirm if this is the case?
You can use the following commands:

  • cta-admin --jsonl ar ls: to check the archive routes and destination tapepool of copy 2.
  • cta-admin --jsonl sq: to check if there is an archive request queue for this tapepool.
  • cta-objectstore-dump-object --dumpbody <object_ref>: (if necessary) to look into the queue and inspect the requests. You must start at the root object.
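For the queue check, the `--jsonl` output can be filtered with a few lines of Python. This is only a sketch: the `tapepool` and `queuedFiles` field names below are illustrative placeholders, not necessarily the exact `cta-admin sq` schema.

```python
import json

def find_queue(jsonl_text, tapepool):
    """Return the summary record for a given tapepool from
    `cta-admin --jsonl sq` output (one JSON object per line).
    Field names are illustrative; adjust to the actual schema."""
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("tapepool") == tapepool:
            return entry
    return None

# Mocked output, just to show the shape of the check:
sample = "\n".join([
    '{"tapepool": "ceda_ts1160", "queuedFiles": 0}',
    '{"tapepool": "offsite_lto8", "queuedFiles": 4020}',
])
print(find_queue(sample, "offsite_lto8"))
```

If this returns a non-empty queue for the copy-2 tapepool, the request is still waiting for a mount.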

If the request is still queued, then something must be preventing the tape from getting mounted.

Best,
Joao

As I know for sure that the file’s storage class is a two-copy one and that there is an archive queue for the second copy pool, I paste the contents of the request object below. From what I can see, the tape mount sessions for the second copy pool have also been interrupted by the same exception.

[root@cta-adm-fac1 ~]# cta-objectstore-dump-object ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-2400-20251104-09:41:41-0-5016581
Object store path: rados://antares@cta:antares
Object name: ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-2400-20251104-09:41:41-0-5016581
Header dump:
{
 "type": "ArchiveRequest_t",
 "version": "0",
 "owner": "",
 "backupowner": "",
 "payload": "8LEEm83wqhCasgSlAaKxBBlzdG9yYWdlZF9jZWRhX21vdW50cG9saWN5qLEEBbCxBNgEuLEEBMCxBNgE0rEEJLKvBAljdGEtYWRtaW6CsAQLY3RhLWZyb250MDHQsAT+hfyjBtqxBCSyrwQJY3RhLWFkbWlugrAEC2N0YS1mcm9udDA00LAElMSbugbisQQiVXBkYXRlIEZhY2lsaXRpZXMgbW91bnQgcHJpb3JpdGllc5qzBAoKCAgBEgRaNhfvuLMEgICAgICAgICAAcizBACCtQRzwK0E2QTorQTZBOKuBGUvZW9zL2FudGFyZXNmYWMvcHJvZC9zdG9yYWdlZC1jZWRhL2JhZGMvY29udGFpbmVyL3Nwb3QtNjE1MjAtc29jMjUxMDAzL3Nwb3QtNjE1MjAtc29jMjUxMDAzLzExODQxNDc4OdK1BAkyMTExNTg3ODL6tQQNZW9zYW50YXJlc2ZhY4q2BNoBZW9zUXVlcnk6Ly9hbnRhcmVzLWVvczE1LnNjZC5ybC5hYy51ay8vZW9zL3dmZS9wYXNzd2Q/bWdtLnBjbWQ9ZXZlbnQmbWdtLmZpZD1jOTYwNmZlJm1nbS5sb2dpZD1jdGEmbWdtLmV2ZW50PXN5bmM6OmFyY2hpdmVkJm1nbS53b3JrZmxvdz1kZWZhdWx0Jm1nbS5wYXRoPS9kdW1teV9wYXRoJm1nbS5ydWlkPTAmbWdtLnJnaWQ9MCZjdGFfYXJjaGl2ZV9maWxlX2lkPTQzODQ4OTI1NzGStgTsAWVvc1F1ZXJ5Oi8vYW50YXJlcy1lb3MxNS5zY2QucmwuYWMudWsvL2Vvcy93ZmUvcGFzc3dkP21nbS5wY21kPWV2ZW50Jm1nbS5maWQ9Yzk2MDZmZSZtZ20ubG9naWQ9Y3RhJm1nbS5ldmVudD1zeW5jOjphcmNoaXZlX2ZhaWxlZCZtZ20ud29ya2Zsb3c9ZGVmYXVsdCZtZ20ucGF0aD0vZHVtbXlfcGF0aCZtZ20ucnVpZD0wJm1nbS5yZ2lkPTAmY3RhX2FyY2hpdmVfZmlsZV9pZD00Mzg0ODkyNTcxJm1nbS5lcnJtc2c9oLYEhoK2qCjytgQeUg1zdG9yYWdlZF9jZWRhWg1zdG9yYWdlZF9jZWRhwrcEnAFyb290Oi8vYW50YXJlcy1lb3MxNS5zY2QucmwuYWMudWsvL2Vvcy9hbnRhcmVzZmFjL3Byb2Qvc3RvcmFnZWQtY2VkYS9iYWRjL2NvbnRhaW5lci9zcG90LTYxNTIwLXNvYzI1MTAwMy9zcG90LTYxNTIwLXNvYzI1MTAwMy8xMTg0MTQ3ODk/ZW9zLmxmbj1meGlkOmM5NjA2ZmWSuAQJY2VkYV9vYl8ymrgEKLKvBA1lb3NhbnRhcmVzZmFjgrAEC2N0YS1mcm9udDA00LAE9beHywaiuASlAYCTAgGKkwILY2VkYV90czExNjCSkwIAmpMCAKCTAgOokwIBsJMCAbiTApjwzgHAkwICyJMCAtKTAmNKYW4gMjEgMTQ6NTY6NDUuNTg0ODcwIGdldGFmaXgtdHMwNCBJbiBBcmNoaXZlTW91bnQ6OnJlcG9ydEpvYnNCYXRjaFRyYW5zZmVycmVkKCk6IGdvdCBhbiBleGNlcHRpb27YkwIC4JMCAKK4BPcBgJMCAoqTAgxvZmZzaXRlX2x0bziSkwIAmpMCUURyaXZlUHJvY2Vzcy1vYmVsaXhfbHRvOV8xOS1nZXRhZml4LXRzMTkuc2NkLnJsLmFjLnVrLTIwMjQzNzUtMjAyNjAxMjMtMTQ6MDM6MTAtMKCTAgGokwIBsJMCAbiTAv6EzwHAkwICyJMCAtKTAmNKYW4gMjIgMDk6MzA6MjguMjIxMDE1IGdldGFmaXgtdHMwMyBJbiBBcmNoaXZlTW91bnQ6OnJlcG9ydEpvYnNCYXRjaFRyYW5zZmVycmVkKCk6IGdvdCBhbiBleGNlcHRpb27YkwIC4JMCAKi4BAC4uAQA4rgEGXN0b3JhZ2VkX2NlZGFfbW91bnRwb2xpY3k="
}
Body dump:
{
 "archivefileid": "4384892571",
 "mountpolicy": {
  "name": "storaged_ceda_mountpolicy",
  "archivepriority": "5",
  "archiveminrequestage": "600",
  "retrievepriority": "4",
  "retieveminrequestage": "600",
  "creationlog": {
   "username": "cta-admin",
   "host": "cta-front01",
   "time": "1686045438"
  },
  "lastmodificationlog": {
   "username": "cta-admin",
   "host": "cta-front04",
   "time": "1732698644"
  },
  "comment": "Update Facilities mount priorities"
 },
 "mountpolicyname": "storaged_ceda_mountpolicy",
 "checksumblob": "CggIARIEWjYX7w==",
 "creationtime": "9223372036854775808",
 "reconcilationtime": "0",
 "diskfileinfo": {
  "ownerUid": 601,
  "gid": 601,
  "path": "/eos/antaresfac/prod/storaged-ceda/badc/container/spot-61520-soc251003/spot-61520-soc251003/118414789"
 },
 "diskfileid": "211158782",
 "diskinstance": "eosantaresfac",
 "archivereporturl": "eosQuery://antares-eos15.scd.rl.ac.uk//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=c9606fe&mgm.logid=cta&mgm.event=sync::archived&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&cta_archive_file_id=4384892571",
 "archiveerrorreporturl": "eosQuery://antares-eos15.scd.rl.ac.uk//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=c9606fe&mgm.logid=cta&mgm.event=sync::archive_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&cta_archive_file_id=4384892571&mgm.errmsg=",
 "filesize": "10822189318",
 "requester": {
  "name": "storaged_ceda",
  "group": "storaged_ceda"
 },
 "srcurl": "root://antares-eos15.scd.rl.ac.uk//eos/antaresfac/prod/storaged-ceda/badc/container/spot-61520-soc251003/spot-61520-soc251003/118414789?eos.lfn=fxid:c9606fe",
 "storageclass": "ceda_ob_2",
 "creationlog": {
  "username": "eosantaresfac",
  "host": "cta-front04",
  "time": "1768020981"
 },
 "jobs": [
  {
   "copynb": 1,
   "tapepool": "ceda_ts1160",
   "archivequeueaddress": "",
   "owner": "",
   "status": "AJS_Complete",
   "totalretries": 1,
   "retrieswithinmount": 1,
   "lastmountwithfailure": "3389464",
   "maxtotalretries": 2,
   "maxretrieswithinmount": 2,
   "failurelogs": [
    "Jan 21 14:56:45.584870 getafix-ts04 In ArchiveMount::reportJobsBatchTransferred(): got an exception"
   ],
   "maxreportretries": 2,
   "totalreportretries": 0,
   "reportfailurelogs": []
  },
  {
   "copynb": 2,
   "tapepool": "offsite_lto8",
   "archivequeueaddress": "",
   "owner": "DriveProcess-obelix_lto9_19-getafix-ts19.scd.rl.ac.uk-2024375-20260123-14:03:10-0",
   "status": "AJS_ToTransferForUser",
   "totalretries": 1,
   "retrieswithinmount": 1,
   "lastmountwithfailure": "3392126",
   "maxtotalretries": 2,
   "maxretrieswithinmount": 2,
   "failurelogs": [
    "Jan 22 09:30:28.221015 getafix-ts03 In ArchiveMount::reportJobsBatchTransferred(): got an exception"
   ],
   "maxreportretries": 2,
   "totalreportretries": 0,
   "reportfailurelogs": []
  }
 ],
 "reportdecided": false,
 "isrepack": false,
 "isfailed": false
}
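For reference, the per-copy state buried in the dump can be summarised with a quick script. This operates on a reduced excerpt of the body JSON above (only the fields we care about):

```python
import json

# Reduced excerpt of the "Body dump" above: archive file id plus per-copy jobs.
body = json.loads('''
{
 "archivefileid": "4384892571",
 "jobs": [
  {"copynb": 1, "tapepool": "ceda_ts1160", "status": "AJS_Complete"},
  {"copynb": 2, "tapepool": "offsite_lto8", "status": "AJS_ToTransferForUser"}
 ]
}
''')

# One line per copy: which pool it targets and whether it still needs transferring.
for job in body["jobs"]:
    print(f'copy {job["copynb"]} -> {job["tapepool"]}: {job["status"]}')
```

Copy 1 is complete; copy 2 is still in AJS_ToTransferForUser, i.e. waiting to be (re)written to the offsite_lto8 pool.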

Yes, according to the field "failurelogs" something went wrong during ArchiveMount::reportJobsBatchTransferred().

Unfortunately, the following line does not tell us much…

"failurelogs": [
    "Jan 22 09:30:28.221015 getafix-ts03 In ArchiveMount::reportJobsBatchTransferred(): got an exception"
],

From what I can see, the tape mount sessions for the second copy pool have also been interrupted by the same exception.

This exception is a general “catch-all”. It could be caused by the same problem as before, or not. Unfortunately, it does not include the log parameters.

The logs themselves probably contain more information about what happened at Jan 22 09:30:28. Could you please share here what you find there at this timestamp?

Hi Joao,

Sorry for the delay, we’re in the middle of a LumOS upgrade and PM sessions for our libraries. We’ll have more time to look into this after today!

Here are the preceding cta-taped log lines for the exception you mentioned:

{"epoch_time":1769074226.188514776,"local_time":"2026-01-22T09:30:26+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564961,"message":"File successfully transmitted to drive","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"TapeWrite","tapeDrive":"obelix_lto8_05","tapeVid":"JL1808","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","mediaType":"LTO7M","logicalLibrary":"obelix_lto","mountType":"ArchiveForUser","vendor":"IBM","capacityInBytes":9000000000000,"fileId":4384917294,"fileSize":11221208943,"fSeq":216,"diskURL":"root://antares-eos15.scd.rl.ac.uk//eos/antaresfac/prod/storaged-ceda/badc/container/spot-61566-09/spot-61566-09/118415971?eos.lfn=fxid:c9610ec","readWriteTime":30.051813,"checksumingTime":6.54976300000001,"waitDataTime":0.0272340000000002,"waitReportingTime":0.000115,"transferTime":36.628925,"totalTime":36.632102,"dataVolume":11221208943,"headerVolume":480,"driveTransferSpeedMBps":306.321745418813,"payloadTransferSpeedMBps":306.321732315552,"reconciliationTime":0,"LBPMode":"LBP_Off"}
{"epoch_time":1769074228.134842006,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564961,"message":"Normal flush because thresholds was reached","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"TapeWrite","tapeDrive":"obelix_lto8_05","tapeVid":"JL1808","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","mediaType":"LTO7M","logicalLibrary":"obelix_lto","mountType":"ArchiveForUser","vendor":"IBM","capacityInBytes":9000000000000,"files":3,"bytes":33511960488,"flushTime":1.946577}
{"epoch_time":1769074228.136619696,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In cta::ArchiveMount::reportJobsBatchTransferred(): archive job successful","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","tapeVid":"JL1808","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","successfulBatchSize":3,"mountType":"ArchiveForUser","fileId":4384870054,"type":"ReportSuccessful"}
{"epoch_time":1769074228.137045146,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In cta::ArchiveMount::reportJobsBatchTransferred(): archive job successful","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","successfulBatchSize":3,"tapeVid":"JL1808","mountType":"ArchiveForUser","fileId":4384892571,"type":"ReportSuccessful"}
{"epoch_time":1769074228.137295593,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In cta::ArchiveMount::reportJobsBatchTransferred(): archive job successful","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","successfulBatchSize":3,"tapeVid":"JL1808","mountType":"ArchiveForUser","fileId":4384917294,"type":"ReportSuccessful"}
{"epoch_time":1769074228.159687895,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"ERROR","pid":564363,"tid":564962,"message":"In ArchiveMount::reportJobsBatchTransferred(): got an exception","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","successfulBatchSize":3,"exceptionMessageValue":"filesWrittenToTape: Failed to find archive file entry in the catalogue: archiveFileId=4384917294, diskInstanceName=eosantaresfac, diskFileId=211161324"}
{"epoch_time":1769074228.170202476,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","attemptNb":1,"queueObject":"ArchiveQueueToTransferForUser-offsite_lto8-Maintenance-getafix-ts22.scd.rl.ac.uk-1249893-20260108-09:28:38-0-2","rootFetchNoLockTime":0.003458,"rootRelockExclusiveTime":0.0,"rootRefetchTime":0.0,"addOrGetQueueandCommitTime":0.0,"rootUnlockExclusiveTime":0.0,"queueLockTime":0.001655,"queueFetchTime":0.000181}
{"epoch_time":1769074228.220086382,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","C":"ArchiveQueueToTransferForUser","tapepool":"offsite_lto8","containerAddress":"ArchiveQueueToTransferForUser-offsite_lto8-Maintenance-getafix-ts22.scd.rl.ac.uk-1249893-20260108-09:28:38-0-2","queueJobsBefore":4018,"queueBytesBefore":44459163849051,"queueJobsAfter":4019,"queueBytesAfter":44470632411278,"queueLockFetchTime":0.005763,"queueProcessAndCommitTime":0.043097,"asyncUpdateLaunchTime":0.000139,"asyncUpdateCompletionTime":0.002666,"requestsUpdatingTime":4.3e-05,"queueUnlockTime":0.000548}
{"epoch_time":1769074228.220607512,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In ArchiveJob::failTransfer(): requeued job for (potentially in-mount) retry.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","fileId":4384870054,"copyNb":2,"failureReason":"In ArchiveMount::reportJobsBatchTransferred(): got an exception","requestObject":"ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-2400-20251104-09:41:41-0-4993190","retriesWithinMount":1,"maxRetriesWithinMount":2,"totalRetries":1,"maxTotalRetries":2}
{"epoch_time":1769074228.220857119,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","agentObject":"DriveProcess-obelix_lto8_05-getafix-ts03.scd.rl.ac.uk-564363-20260122-09:26:32-0","jobObject":"ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-2400-20251104-09:41:41-0-4993190"}
{"epoch_time":1769074228.227530834,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","attemptNb":1,"queueObject":"ArchiveQueueToTransferForUser-offsite_lto8-Maintenance-getafix-ts22.scd.rl.ac.uk-1249893-20260108-09:28:38-0-2","rootFetchNoLockTime":0.00252,"rootRelockExclusiveTime":0.0,"rootRefetchTime":0.0,"addOrGetQueueandCommitTime":0.0,"rootUnlockExclusiveTime":0.0,"queueLockTime":0.001483,"queueFetchTime":0.000168}
{"epoch_time":1769074228.268375768,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","C":"ArchiveQueueToTransferForUser","tapepool":"offsite_lto8","containerAddress":"ArchiveQueueToTransferForUser-offsite_lto8-Maintenance-getafix-ts22.scd.rl.ac.uk-1249893-20260108-09:28:38-0-2","queueJobsBefore":4019,"queueBytesBefore":44470632411278,"queueJobsAfter":4020,"queueBytesAfter":44481454600596,"queueLockFetchTime":0.004607,"queueProcessAndCommitTime":0.034924,"asyncUpdateLaunchTime":0.000127,"asyncUpdateCompletionTime":0.002067,"requestsUpdatingTime":4e-05,"queueUnlockTime":0.000574}
{"epoch_time":1769074228.268855146,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In ArchiveJob::failTransfer(): requeued job for (potentially in-mount) retry.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","fileId":4384892571,"copyNb":2,"failureReason":"In ArchiveMount::reportJobsBatchTransferred(): got an exception","requestObject":"ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-2400-20251104-09:41:41-0-5016581","retriesWithinMount":1,"maxRetriesWithinMount":2,"totalRetries":1,"maxTotalRetries":2}
{"epoch_time":1769074228.269096585,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","agentObject":"DriveProcess-obelix_lto8_05-getafix-ts03.scd.rl.ac.uk-564363-20260122-09:26:32-0","jobObject":"ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-2400-20251104-09:41:41-0-5016581"}
{"epoch_time":1769074228.308992677,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","attemptNb":1,"queueObject":"ArchiveQueueToReportForUser-offsite_lto8-DriveProcess-obelix_lto8_05-getafix-ts03.scd.rl.ac.uk-564363-20260122-09:26:32-0-0","rootFetchNoLockTime":0.002582,"rootRelockExclusiveTime":0.007855,"rootRefetchTime":0.001895,"addOrGetQueueandCommitTime":0.016927,"rootUnlockExclusiveTime":0.001558,"queueLockTime":0.000594,"queueFetchTime":0.003366}
{"epoch_time":1769074228.318374083,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In ContainerAlgorithms::referenceAndSwitchOwnership(): Requeued a batch of elements.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","C":"ArchiveQueueToReportForUser","tapepool":"offsite_lto8","containerAddress":"ArchiveQueueToReportForUser-offsite_lto8-DriveProcess-obelix_lto8_05-getafix-ts03.scd.rl.ac.uk-564363-20260122-09:26:32-0-0","queueJobsBefore":0,"queueBytesBefore":0,"queueJobsAfter":1,"queueBytesAfter":11221208943,"queueLockFetchTime":0.035257,"queueProcessAndCommitTime":0.001284,"asyncUpdateLaunchTime":0.000127,"asyncUpdateCompletionTime":0.004139,"requestsUpdatingTime":7.2e-05,"queueUnlockTime":0.000599}
{"epoch_time":1769074228.318854450,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In ArchiveJob::failTransfer(): enqueued job for reporting","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"obelix_lto8_05","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","fileId":4384917294,"copyNb":2,"failureReason":"In ArchiveMount::reportJobsBatchTransferred(): got an exception","requestObject":"ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-2400-20251104-09:41:41-0-5042078","retriesWithinMount":1,"maxRetriesWithinMount":2,"totalRetries":2,"maxTotalRetries":2}
{"epoch_time":1769074228.319096204,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"INFO","pid":564363,"tid":564962,"message":"In OStoreDB::ArchiveJob::~ArchiveJob(): will leave the job owned after destruction.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","agentObject":"DriveProcess-obelix_lto8_05-getafix-ts03.scd.rl.ac.uk-564363-20260122-09:26:32-0","jobObject":"ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-2400-20251104-09:41:41-0-5042078"}
{"epoch_time":1769074228.319275910,"local_time":"2026-01-22T09:30:28+0000","hostname":"getafix-ts03","program":"cta-taped","log_level":"ERROR","pid":564363,"tid":564962,"message":"In MigrationReportPacker::WorkerThread::run(): Received a CTA exception while reporting archive mount results.","drive_name":"obelix_lto8_05","instance":"antares","sched_backend":"cephUser","thread":"ReportPacker","tapeDrive":"obelix_lto8_05","tapeVid":"JL1808","mountId":"3392126","vo":"storaged-ceda","tapePool":"offsite_lto8","exceptionMSG":"In ArchiveMount::reportJobsBatchTransferred(): got an exception"}

…with the same “Failed to find archive file entry in the catalogue” exception message that we noted before.
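For what it’s worth, entries like this can be fished out of the cta-taped JSON logs with a short script. This is only a sketch: it assumes one JSON record per line and uses the field names visible in the log excerpt above.

```python
import json

def catalogue_errors(log_lines):
    """Yield (local_time, exception detail) for ERROR records whose
    exception text mentions the missing catalogue entry."""
    for line in log_lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # skip partial or non-JSON lines
        if rec.get("log_level") != "ERROR":
            continue
        detail = rec.get("exceptionMessageValue", "") or rec.get("exceptionMSG", "")
        if "Failed to find archive file entry" in detail:
            yield rec["local_time"], detail

# Minimal sample record mimicking the log line above:
sample = ['{"local_time":"2026-01-22T09:30:28+0000","log_level":"ERROR",'
          '"exceptionMessageValue":"filesWrittenToTape: Failed to find archive '
          'file entry in the catalogue: archiveFileId=4384917294"}']
print(list(catalogue_errors(sample)))
```

Running this over a day of logs gives the affected archive file IDs and timestamps in one pass.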

We’re planning to dig into the cta-taped → DB communication to make sure that there isn’t something going wrong that could explain this, but we haven’t seen anything obvious so far.

Thanks,
Tom

Hi Joao,

We looked at this issue today together with our DBA team. By examining the verification loop in

https://gitlab.cern.ch/cta/CTA/-/blob/main/catalogue/rdbms/oracle/OracleTapeFileCatalogue.cpp#L312-322

we understood that the check is done against already existing entries in the ARCHIVE_FILE table for all the files listed in the TEMP_TAPE_FILE_INSERTION_BATCH table.

Clearly, then, for some reason - and as you suggested when we discussed at CERN - these entries are not created in the ARCHIVE_FILE table.

Can you please let us know when exactly in the file archival lifecycle, how, and from which client (i.e. the frontend or the tape server) the row/entry in the ARCHIVE_FILE table is created for a file that has been successfully written to EOS with a CLOSEW event?

Based on your pointers, we will have another session with our DBAs and try to understand the exact conditions under which the row creation in the ARCHIVE_FILE table fails in the first place.

Thanks,

George

Hi @afonso

Sorry for the nagging… do you think you can give us any clues regarding the above question on the ARCHIVE_FILE table?

Thanks,

George

Hi George,

My apologies, I had limited internet access during the last weeks.

we understood that the check is done against already existing entries in the ARCHIVE_FILE table for all the files listed in the TEMP_TAPE_FILE_INSERTION_BATCH table.
Clearly, then, for some reason - and as you suggested when we discussed at CERN - these entries are not created in the ARCHIVE_FILE table.

Yes, your understanding is correct. If an entry is missing from the ARCHIVE_FILE table, then something must have gone wrong at the insertion point.

This is the workflow that will result in a new entry to the ARCHIVE_FILE table:

  1. The CLOSEW event is processed by the CTA Frontend. No entries are added to, nor checked against, the ARCHIVE_FILE table here. All that the CTA Frontend does is create a new archive request and insert it into the scheduler DB (objectstore). The relevant code can be found here.
  2. During a mount, the archive requests will be popped from the scheduler DB and written to tape by the tape server. Once several files have been written, a flush can finally be done (synchronized tape mark).
  3. New entries are only written to the ARCHIVE_FILE table after this flush, because only at this point can the tape server be sure that the file was written to tape without errors. These entries are written to the table here, and the insert is allowed to fail if a unique constraint is violated (this is meant to allow multiple tape copies to be registered, despite the fact that all of them refer to the same archive file). This means that any violation of these unique constraints will be silently ignored (link to constraints).

My guess is that this is the problem you are seeing. If the following constraint is being violated, it would explain the behaviour:

CONSTRAINT ARCHIVE_FILE_DIN_DFI_UN UNIQUE(DISK_INSTANCE_NAME, DISK_FILE_ID),

Any other error should have been logged (assuming no bugs).

A bit later, as you mentioned correctly, a check is done against the ARCHIVE_FILE table (using the TEMP_TAPE_FILE_INSERTION_BATCH table) for all the archive file IDs that have been written to tape. Unless the constraint ARCHIVE_FILE_DIN_DFI_UN was violated, there would be no reason to assume that the archive file ID would not exist in the table at this point.
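The “silently ignored” part of step 3 can be demonstrated with a small SQLite sketch (the real catalogue runs on Oracle and the real schema is larger; the table and column names here are simplified):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ARCHIVE_FILE (
        ARCHIVE_FILE_ID INTEGER PRIMARY KEY,
        DISK_INSTANCE_NAME TEXT,
        DISK_FILE_ID TEXT,
        CONSTRAINT ARCHIVE_FILE_DIN_DFI_UN UNIQUE (DISK_INSTANCE_NAME, DISK_FILE_ID)
    )
""")

# First insert registers the archive file normally.
conn.execute(
    "INSERT OR IGNORE INTO ARCHIVE_FILE VALUES (1, 'eosantaresfac', '211158782')")
# A second insert with the same (instance, disk file id) but a DIFFERENT archive
# file id is dropped without any error being raised...
conn.execute(
    "INSERT OR IGNORE INTO ARCHIVE_FILE VALUES (2, 'eosantaresfac', '211158782')")
# ...so archive file id 2 never makes it into the table, and a later lookup
# for id 2 would fail exactly like the exception in the logs.
rows = conn.execute("SELECT ARCHIVE_FILE_ID FROM ARCHIVE_FILE").fetchall()
print(rows)
```

Nothing is logged for the dropped row, which is why a DIN/DFI collision would surface only later, as the “Failed to find archive file entry” exception.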

If this is not the problem, then there must be a bug in the code that is only triggered under rare circumstances (not seen yet at CERN).

If that is the case, would you be able to find a way to reproduce this bug on demand? With the current information it would be very challenging to pinpoint other possible causes.

Hi Joao,

Many thanks for your detailed reply!

I will ask our DBAs to look for violations of any of the constraints listed in https://gitlab.cern.ch:8443/cta/cta-catalogue-schema/-/blob/ae7671e0/common_catalogue_schema.sql#L363-L367 and see how far we get.

Best,

George

Hi Joao,

In case it is useful, here is the sequence of events that our DBAs reconstructed from the Oracle DB trace file (tracing was enabled on the DB side).

Job starts:

  1. UPDATE DRIVE_STATE
  2. SELECT DRIVE_NAME
  3. UPDATE TAPE (execution runs up to this point)
  4. INSERT INTO ARCHIVE_FILE (the condition on STORAGE_CLASS_NAME in the WHERE clause may be failing)
  5. INSERT INTO TEMP_TAPE_FILE_INSERTION_BATCH

Job completes.

The UPDATE TAPE is the exit point when the exception occurs in cta-taped.

There is no indication of a constraint violation, but the DBAs

suspect that the INSERT INTO ARCHIVE_FILE statement includes a condition on STORAGE_CLASS_NAME in the WHERE clause. If this condition is not met, the insert would not occur. However, we do not see any insert failure errors reported

Maybe this is indeed a bug, but I don’t think I would be able to reproduce it…

Best,

George

Hi George,

suspect that the INSERT INTO ARCHIVE_FILE statement includes a condition on STORAGE_CLASS_NAME in the WHERE clause. If this condition is not met, the insert would not occur. However, we do not see any insert failure errors reported

Great finding! I think this could be the problem!

When looking at the code here, I don’t find any check for the existence of the STORAGE_CLASS_NAME inside STORAGE_CLASS. The query would “work”, but nothing would be inserted.

Are you able to confirm whether this is indeed the case (by checking whether the storage class on EOS matches a storage class in the CTA catalogue)?
If this is indeed the problem, it could be because the storage class was removed from STORAGE_CLASS after the archive request was successfully queued but before it was written to tape…
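That failure mode is easy to sketch: an INSERT … SELECT that resolves the storage class id from STORAGE_CLASS simply inserts zero rows when the class name is absent, without raising any error. This is a SQLite sketch with simplified names, not the actual Oracle statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STORAGE_CLASS (STORAGE_CLASS_ID INTEGER, STORAGE_CLASS_NAME TEXT);
    CREATE TABLE ARCHIVE_FILE (ARCHIVE_FILE_ID INTEGER, STORAGE_CLASS_ID INTEGER);
""")
# 'ceda_ob_2' is deliberately NOT present in STORAGE_CLASS, mimicking a storage
# class deleted after the archive request was queued.
conn.execute("INSERT INTO STORAGE_CLASS VALUES (7, 'some_other_class')")

# The insert "succeeds" as a statement, but the SELECT matches no row,
# so no ARCHIVE_FILE entry is ever created -- and no error is reported.
conn.execute("""
    INSERT INTO ARCHIVE_FILE (ARCHIVE_FILE_ID, STORAGE_CLASS_ID)
    SELECT 4384917294, STORAGE_CLASS_ID
      FROM STORAGE_CLASS
     WHERE STORAGE_CLASS_NAME = 'ceda_ob_2'
""")
count = conn.execute("SELECT COUNT(*) FROM ARCHIVE_FILE").fetchone()[0]
print("rows in ARCHIVE_FILE:", count)
```

The later catalogue lookup for that archive file id would then fail with exactly the “Failed to find archive file entry in the catalogue” message seen in the cta-taped logs.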