Summary
This is a free form request to help understanding of some failures on archive which have not
been able to get to the bottom of.
Details
CTA Admin Client xrd-ssi-protobuf CTA Frontend Server xrd-ssi-protobuf Catalogue Schema DB Connection String Instance Name Status Scheduler Backend Name
5.11.11.0-1 v1.26 5.11.11.0-1 v1.26 15.0 postgresql:postgresql://cta:******@xxx.fnal.gov:5468/cta_prd?gssencmode=disable prd PRODUCTION cephUser
Object Store backend: ceph
Cache front end: dCache
Details of the issue
We have a customer generating a lot of small files. These files belong to a storage classs w/ number of copies set to 2. I see noticeable (still very small compared to the total number of archival requests) rate of file archive failures that I fail to explain.
dCache pool submits archive request to CTA gRPC frontend. The request contains file URI (root://…)_ served by the pool that plays a role of xrootd server. The request is scheduled and eventually taped pulls it from the object store queue and reads the file using the provided URI.
Some of the requests fail in peculiar way. E.g. in dCache billing I can see the following:
finish | errormessage
----------------------------+---------------------------------------------------------------------------------------------------
2026-06-06 03:20:21.701-05 | Jun 6 03:20:21.737420 tpsrvg1829 In ArchiveMount::reportJobsBatchTransferred(): got an exception
2026-06-06 23:24:11.066-05 | Jun 6 23:24:11.728561 tpsrvf1814 In ArchiveMount::reportJobsBatchTransferred(): got an exception
2026-06-08 20:12:46.145-05 | Jun 8 20:12:44.874688 tpsrvg1819 In ArchiveMount::reportJobsBatchTransferred(): got an exception
2026-06-10 06:13:15.471-05 | Flush was cancelled.
(4 rows)
Above are 4 attempts to store the file. 3 failed w/ "ArchiveMount::reportJobsBatchTransferred(): got an exception… Looking at the log files I see:
{"epoch_time":1781089995.096897006,"local_time":"2026-06-10T06:13:15-0500","hostname":"tpsrvf2217","program":"cta-taped","log_level":"ERROR","pid":954815,"tid":1336087,"message":"In ArchiveMount::reportJobsBatchTransferred(): got an exception","drive_name":"F2_F9B4D1","instance":"prd","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"F2_F9B4D1","mountId":"484851","vo":"nova","tapePool":"nova.nova_production","successfulBatchSize":20,"exceptionMessageValue":"commit problem committing the DB transaction: Database library reported: ERROR: duplicate key value violates unique constraint \"archive_file_din_dfi_un\"DETAIL: Key (disk_instance_name, disk_file_id)=(public_prd, 0000B534DEDC859C4BEA9BF89D68115CCA4E) already exists. (DB Result Status:7 SQLState:23505)"}
I was assuming that this is due to some miscommunication and dCache still thinks file is not on tape but CTA stored it already.
So, I have developed a daemon to catch these exceptions and “fix” files on dCache end so that they are fully furnishied as archived.
Flush was cancelled above is the result of this daemon’s activity. This seems to work well, but I caught this file to have only one copy:
# cta-admin tf ls --dfid 0000B534DEDC859C4BEA9BF89D68115CCA4E -i public_prd
archive id copy no vid fseq block id disk buffer disk fxid size checksum type checksum value storage class owner group creation time instance
4563192547 1 FB7392 18869 17009221 public_prd 0000B534DEDC859C4BEA9BF89D68115CCA4E 70.3M ADLER32 b5c8bfbf nova.nova_production@cta 1 1 2026-06-05 16:04 prd
Note the creation time 2026-06-05 16:04 , compare to the finish time of dCache store request 2026-06-06 03:20:21.701-05. Took a long time.
And… there is just one copy.
I do not know exactly what happens when 2 copies are made and when the entries in archive_file and tape_file are made - as soon as first copy made, or when the second copy is made. By all means it looks like the former.
Anyhow, something is glitching and second copy is not made. Again, this happens just in handful
of cases compared to 100 of thousands of files being stored daily.
I looked at the CTA log files for 0000B534DEDC859C4BEA9BF89D68115CCA4E (this is disk file id) and then converted them into time sorted CSV. There is huge amount of entries:
$ wc -l 0000B534DEDC859C4BEA9BF89D68115CCA4E.csv
17757
What worries me is that there is a lot of wasted activity like mounting, trasferring and then failing going on. And of course end result of not having a second copy.
The entries look like:
#local_time,epoch_time,tapeVid,fSeq,mountId,tapePool,message
2026-06-05T16:02:00-0500,1780693320.129447811,FB7392,18869,479839,null,Created tasks for migrating a file
2026-06-05T16:03:58-0500,1780693438.583108827,FB7392,null,479839,nova.nova_production,Opened disk file for read
2026-06-05T16:04:00-0500,1780693440.706893872,FB7392,null,479839,nova.nova_production,File successfully read from disk
2026-06-05T16:04:23-0500,1780693463.528033032,FB7392,18869,479839,nova.nova_production,File successfully transmitted to drive
2026-06-05T21:10:11-0500,1780711811.407383908,VRB857,21056,480068,null,Created tasks for migrating a file
2026-06-05T21:12:47-0500,1780711967.209886931,VRB857,null,480068,nova.nova_production_copy_1,Opened disk file for read
2026-06-05T21:12:49-0500,1780711969.887017068,VRB857,null,480068,nova.nova_production_copy_1,File successfully read from disk
2026-06-05T21:13:14-0500,1780711994.981748682,VRB857,21056,480068,nova.nova_production_copy_1,File successfully transmitted to drive
2026-06-06T03:15:45-0500,1780733745.600739093,VRB857,21056,480463,null,Created tasks for migrating a file
2026-06-06T03:15:45-0500,1780733745.632075464,VRB857,null,480463,nova.nova_production_copy_1,Opened disk file for read
2026-06-06T03:15:45-0500,1780733745.796284241,VRB857,null,480463,nova.nova_production_copy_1,Handler received SIGCHILD. Propagating to all handlers.
2026-06-06T03:15:45-0500,1780733745.796422288,VRB857,null,480463,nova.nova_production_copy_1,Propagated SIGCHILD.
2026-06-06T03:15:45-0500,1780733745.796561861,VRB857,null,480463,nova.nova_production_copy_1,Propagated SIGCHILD.
2026-06-06T03:15:45-0500,1780733745.796681364,VRB857,null,480463,nova.nova_production_copy_1,Propagated SIGCHILD.
2026-06-06T03:15:45-0500,1780733745.797007707,VRB857,null,480463,nova.nova_production_copy_1,Handler received SIGCHILD. Propagating to all handlers.
...
And then it sort of repeats:
2026-06-06T03:15:48-0500,1780733748.885509002,VRB857,null,480463,nova.nova_production_copy_1,Propagated SIGCHILD.
2026-06-06T03:18:03-0500,1780733883.189954801,VRB857,21056,480463,nova.nova_production_copy_1,File successfully transmitted to drive
2026-06-06T04:10:36-0500,1780737036.755888373,VRB600,39109,480522,null,Created tasks for migrating a file
2026-06-06T04:17:33-0500,1780737453.943368929,VRB600,null,480522,nova.nova_production,Opened disk file for read
2026-06-06T04:17:34-0500,1780737454.270872277,VRB600,null,480522,nova.nova_production,File successfully read from disk
2026-06-06T04:17:45-0500,1780737465.787209169,VRB600,39109,480522,nova.nova_production,File successfully transmitted to drive
2026-06-06T09:20:25-0500,1780755625.726665967,VRB857,21090,480876,nova.nova_production_copy_1,Created tasks for migrating a file
2026-06-06T09:22:41-0500,1780755761.715025817,VRB857,null,480876,nova.nova_production_copy_1,Opened disk file for read
2026-06-06T09:22:42-0500,1780755762.069282263,VRB857,null,480876,nova.nova_production_copy_1,File successfully read from disk
2026-06-06T09:22:53-0500,1780755773.441639831,VRB857,21090,480876,nova.nova_production_copy_1,File successfully transmitted to drive
2026-06-06T10:13:19-0500,1780758799.554305063,FB7392,18932,480938,nova.nova_production,Created tasks for migrating a file
2026-06-06T10:16:01-0500,1780758961.487517957,FB7392,null,480938,nova.nova_production,Opened disk file for read
2026-06-06T10:16:01-0500,1780758961.587893830,FB7392,null,480938,nova.nova_production,Handler received SIGCHILD. Propagating to all handlers.
2026-06-06T10:16:01-0500,1780758961.588050630,FB7392,null,480938,nova.nova_production,Propagated SIGCHILD.
2026-06-06T10:16:01-0500,1780758961.588190177,FB7392,null,480938,nova.nova_production,Propagated SIGCHILD.
2026-06-06T10:16:01-0500,1780758961.588302630,FB7392,null,480938,nova.nova_production,Propagated SIGCHILD.
2026-06-06T10:16:01-0500,1780758961.588571745,FB7392,null,480938,nova.nova_production,Handler received SIGCHILD. Propagating to all handlers.
...
and on and on it goes
The entire file is here:
The tail end of above file has:
2026-06-10T06:09:47-0500,1781089787.706949787,FB7392,null,484851,nova.nova_production,Propagated SIGCHILD.
2026-06-10T06:12:16-0500,1781089936.136735567,FB7392,20305,484851,nova.nova_production,File successfully transmitted to drive
2026-06-10T06:13:15-0500,1781089995.096897006,,null,484851,nova.nova_production,In ArchiveMount::reportJobsBatchTransferred(): got an exception
2026-06-10T06:13:16-0500,1781089996.330899192,,null,,null,In WorkflowEvent::WorkflowEvent(): received event.
2026-06-10T06:13:16-0500,1781089996.388848273,,null,,null,In WorkflowEvent::processDELETE(): archive file deleted.
Above tells me that cancelling request in dCache led to "WorkflowEvent::processDELETE(): archive file deleted.”. But the end result is that the file is on tape just in one copy.
My question really - does what I am experiencing ring any bell? What would be the issue that
triggered second copy failure. Looking at the logs I am seeing that CTA keeps trying to write 2 copies and fails because of
"commit problem committing the DB transaction: Database library reported: ERROR: duplicate key value violates unique constraint \"archive_file_din_dfi_un\"DETAIL: Key (disk_instance_name, disk_file_id)=(public_prd, 0000B534DEDC859C4BEA9BF89D68115CCA4E) already exists. (DB Result Status:7 SQLState:23505)”}
due to first copy “being there” in archive_file since Jun 5, 16:04
Thanks,
Dmitry