Understanding a failure on archive

dmitrylitvintsev · 11 June 2026 01:37

Summary

This is a free form request to help understanding of some failures on archive which have not
been able to get to the bottom of.

Details

  CTA Admin Client xrd-ssi-protobuf CTA Frontend Server xrd-ssi-protobuf Catalogue Schema                                                               DB Connection String Instance Name     Status Scheduler Backend Name 
5.11.11.0-1                   v1.26  5.11.11.0-1                   v1.26             15.0 postgresql:postgresql://cta:******@xxx.fnal.gov:5468/cta_prd?gssencmode=disable           prd PRODUCTION               cephUser

Object Store backend: ceph
Cache front end: dCache

Details of the issue

We have a customer generating a lot of small files. These files belong to a storage classs w/ number of copies set to 2. I see noticeable (still very small compared to the total number of archival requests) rate of file archive failures that I fail to explain.

dCache pool submits archive request to CTA gRPC frontend. The request contains file URI (root://…)_ served by the pool that plays a role of xrootd server. The request is scheduled and eventually taped pulls it from the object store queue and reads the file using the provided URI.

Some of the requests fail in peculiar way. E.g. in dCache billing I can see the following:

           finish           |                                           errormessage                                            
----------------------------+---------------------------------------------------------------------------------------------------
 2026-06-06 03:20:21.701-05 | Jun  6 03:20:21.737420 tpsrvg1829 In ArchiveMount::reportJobsBatchTransferred(): got an exception
 2026-06-06 23:24:11.066-05 | Jun  6 23:24:11.728561 tpsrvf1814 In ArchiveMount::reportJobsBatchTransferred(): got an exception
 2026-06-08 20:12:46.145-05 | Jun  8 20:12:44.874688 tpsrvg1819 In ArchiveMount::reportJobsBatchTransferred(): got an exception
 2026-06-10 06:13:15.471-05 | Flush was cancelled.
(4 rows)

Above are 4 attempts to store the file. 3 failed w/ "ArchiveMount::reportJobsBatchTransferred(): got an exception… Looking at the log files I see:

{"epoch_time":1781089995.096897006,"local_time":"2026-06-10T06:13:15-0500","hostname":"tpsrvf2217","program":"cta-taped","log_level":"ERROR","pid":954815,"tid":1336087,"message":"In ArchiveMount::reportJobsBatchTransferred(): got an exception","drive_name":"F2_F9B4D1","instance":"prd","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"F2_F9B4D1","mountId":"484851","vo":"nova","tapePool":"nova.nova_production","successfulBatchSize":20,"exceptionMessageValue":"commit problem committing the DB transaction: Database library reported: ERROR:  duplicate key value violates unique constraint \"archive_file_din_dfi_un\"DETAIL:  Key (disk_instance_name, disk_file_id)=(public_prd, 0000B534DEDC859C4BEA9BF89D68115CCA4E) already exists. (DB Result Status:7 SQLState:23505)"}

I was assuming that this is due to some miscommunication and dCache still thinks file is not on tape but CTA stored it already.

So, I have developed a daemon to catch these exceptions and “fix” files on dCache end so that they are fully furnishied as archived.

Flush was cancelled above is the result of this daemon’s activity. This seems to work well, but I caught this file to have only one copy:

# cta-admin tf ls --dfid  0000B534DEDC859C4BEA9BF89D68115CCA4E  -i public_prd
archive id copy no    vid  fseq block id disk buffer                            disk fxid  size checksum type checksum value            storage class owner group    creation time instance 
4563192547       1 FB7392 18869 17009221  public_prd 0000B534DEDC859C4BEA9BF89D68115CCA4E 70.3M       ADLER32       b5c8bfbf nova.nova_production@cta     1     1 2026-06-05 16:04      prd

Note the creation time 2026-06-05 16:04 , compare to the finish time of dCache store request 2026-06-06 03:20:21.701-05. Took a long time.

And… there is just one copy.

I do not know exactly what happens when 2 copies are made and when the entries in archive_file and tape_file are made - as soon as first copy made, or when the second copy is made. By all means it looks like the former.

Anyhow, something is glitching and second copy is not made. Again, this happens just in handful
of cases compared to 100 of thousands of files being stored daily.

I looked at the CTA log files for 0000B534DEDC859C4BEA9BF89D68115CCA4E (this is disk file id) and then converted them into time sorted CSV. There is huge amount of entries:

$ wc -l 0000B534DEDC859C4BEA9BF89D68115CCA4E.csv
17757

What worries me is that there is a lot of wasted activity like mounting, trasferring and then failing going on. And of course end result of not having a second copy.

The entries look like:

#local_time,epoch_time,tapeVid,fSeq,mountId,tapePool,message
2026-06-05T16:02:00-0500,1780693320.129447811,FB7392,18869,479839,null,Created tasks for migrating a file
2026-06-05T16:03:58-0500,1780693438.583108827,FB7392,null,479839,nova.nova_production,Opened disk file for read
2026-06-05T16:04:00-0500,1780693440.706893872,FB7392,null,479839,nova.nova_production,File successfully read from disk
2026-06-05T16:04:23-0500,1780693463.528033032,FB7392,18869,479839,nova.nova_production,File successfully transmitted to drive
2026-06-05T21:10:11-0500,1780711811.407383908,VRB857,21056,480068,null,Created tasks for migrating a file
2026-06-05T21:12:47-0500,1780711967.209886931,VRB857,null,480068,nova.nova_production_copy_1,Opened disk file for read
2026-06-05T21:12:49-0500,1780711969.887017068,VRB857,null,480068,nova.nova_production_copy_1,File successfully read from disk
2026-06-05T21:13:14-0500,1780711994.981748682,VRB857,21056,480068,nova.nova_production_copy_1,File successfully transmitted to drive
2026-06-06T03:15:45-0500,1780733745.600739093,VRB857,21056,480463,null,Created tasks for migrating a file
2026-06-06T03:15:45-0500,1780733745.632075464,VRB857,null,480463,nova.nova_production_copy_1,Opened disk file for read
2026-06-06T03:15:45-0500,1780733745.796284241,VRB857,null,480463,nova.nova_production_copy_1,Handler received SIGCHILD. Propagating to all handlers.
2026-06-06T03:15:45-0500,1780733745.796422288,VRB857,null,480463,nova.nova_production_copy_1,Propagated SIGCHILD.
2026-06-06T03:15:45-0500,1780733745.796561861,VRB857,null,480463,nova.nova_production_copy_1,Propagated SIGCHILD.
2026-06-06T03:15:45-0500,1780733745.796681364,VRB857,null,480463,nova.nova_production_copy_1,Propagated SIGCHILD.
2026-06-06T03:15:45-0500,1780733745.797007707,VRB857,null,480463,nova.nova_production_copy_1,Handler received SIGCHILD. Propagating to all handlers.
...

And then it sort of repeats:

2026-06-06T03:15:48-0500,1780733748.885509002,VRB857,null,480463,nova.nova_production_copy_1,Propagated SIGCHILD.
2026-06-06T03:18:03-0500,1780733883.189954801,VRB857,21056,480463,nova.nova_production_copy_1,File successfully transmitted to drive
2026-06-06T04:10:36-0500,1780737036.755888373,VRB600,39109,480522,null,Created tasks for migrating a file
2026-06-06T04:17:33-0500,1780737453.943368929,VRB600,null,480522,nova.nova_production,Opened disk file for read
2026-06-06T04:17:34-0500,1780737454.270872277,VRB600,null,480522,nova.nova_production,File successfully read from disk
2026-06-06T04:17:45-0500,1780737465.787209169,VRB600,39109,480522,nova.nova_production,File successfully transmitted to drive
2026-06-06T09:20:25-0500,1780755625.726665967,VRB857,21090,480876,nova.nova_production_copy_1,Created tasks for migrating a file
2026-06-06T09:22:41-0500,1780755761.715025817,VRB857,null,480876,nova.nova_production_copy_1,Opened disk file for read
2026-06-06T09:22:42-0500,1780755762.069282263,VRB857,null,480876,nova.nova_production_copy_1,File successfully read from disk
2026-06-06T09:22:53-0500,1780755773.441639831,VRB857,21090,480876,nova.nova_production_copy_1,File successfully transmitted to drive
2026-06-06T10:13:19-0500,1780758799.554305063,FB7392,18932,480938,nova.nova_production,Created tasks for migrating a file
2026-06-06T10:16:01-0500,1780758961.487517957,FB7392,null,480938,nova.nova_production,Opened disk file for read
2026-06-06T10:16:01-0500,1780758961.587893830,FB7392,null,480938,nova.nova_production,Handler received SIGCHILD. Propagating to all handlers.
2026-06-06T10:16:01-0500,1780758961.588050630,FB7392,null,480938,nova.nova_production,Propagated SIGCHILD.
2026-06-06T10:16:01-0500,1780758961.588190177,FB7392,null,480938,nova.nova_production,Propagated SIGCHILD.
2026-06-06T10:16:01-0500,1780758961.588302630,FB7392,null,480938,nova.nova_production,Propagated SIGCHILD.
2026-06-06T10:16:01-0500,1780758961.588571745,FB7392,null,480938,nova.nova_production,Handler received SIGCHILD. Propagating to all handlers.
...

and on and on it goes

The entire file is here:

The tail end of above file has:

2026-06-10T06:09:47-0500,1781089787.706949787,FB7392,null,484851,nova.nova_production,Propagated SIGCHILD.
2026-06-10T06:12:16-0500,1781089936.136735567,FB7392,20305,484851,nova.nova_production,File successfully transmitted to drive
2026-06-10T06:13:15-0500,1781089995.096897006,,null,484851,nova.nova_production,In ArchiveMount::reportJobsBatchTransferred(): got an exception
2026-06-10T06:13:16-0500,1781089996.330899192,,null,,null,In WorkflowEvent::WorkflowEvent(): received event.
2026-06-10T06:13:16-0500,1781089996.388848273,,null,,null,In WorkflowEvent::processDELETE(): archive file deleted.

Above tells me that cancelling request in dCache led to "WorkflowEvent::processDELETE(): archive file deleted.”. But the end result is that the file is on tape just in one copy.

My question really - does what I am experiencing ring any bell? What would be the issue that
triggered second copy failure. Looking at the logs I am seeing that CTA keeps trying to write 2 copies and fails because of

"commit problem committing the DB transaction: Database library reported: ERROR:  duplicate key value violates unique constraint \"archive_file_din_dfi_un\"DETAIL:  Key (disk_instance_name, disk_file_id)=(public_prd, 0000B534DEDC859C4BEA9BF89D68115CCA4E) already exists. (DB Result Status:7 SQLState:23505)”}

due to first copy “being there” in archive_file since Jun 5, 16:04

Thanks,
Dmitry

reichard · 17 June 2026 14:49

Hello @dmitrylitvintsev,

I have encountered this issue in the past as well.

There are two things that can trigger this behavior:

The pool that flushes to tape has a store timeout set.
- I believe you have set this correctly
- it can be disabled via st unset timeout on each pool
One copy of a two-copy archive route has been written to tape, but the specific dCache pool was restarted and lost the state of this flush. When the pool comes back online, it goes through its store list and flushes again - this is where things go wrong.

The second flush to tape behaves badly within CTA:

it gets a new archive ID and is written to two tapes (with success)
once this is done, CTA tries to register the newly written files in the SQL Tape DB, but this fails because the pnfsid already has an entry in the database
the newly written files are then marked as free, and the issue starts again
you can see a complete list of failed requests via cta-admin failedrequest ls and cta-objectstore-dump-object $objectid
you can also list and modify the object store via rados ls/df, etc.

How to fix:

Send a success message to dCache for the file so it will not retry repeatedly. This can be done with xrdfs: xrdfs "${host}:${port}" query opaquefile "/success/${pnfsid}?archiveid=${archiveid}"

Where:

host is the IP/hostname of the dCache disk pool (I/O endpoint of the dCache-CTA plugin)
port is the configured I/O port of the disk pool
example: root://1.2.3.4:7980/success/00001234567842854B4C890D3DB041FDB617?archiveid=4294981234

You can also get the URL from the failed objects:

cta-objectstore-dump-object $objectid | grep "archivereporturl"

Once you issue this request to the dCache-CTA plugin, dCache will mark the file as cached and will stop retrying. However, you will only have one copy then. To create the missing copy, start a repack for missing copies like this:

get a list of tapes with missing copies: cta-admin tape ls --missingfilecopies
for each tape with missing copies, run: cta-admin repack add --vid $TAPE --mountpolicy repack --justaddcopies

I think it would be very helpful if CTA checked the pnfsid when dCache submits new work. Currently, CTA simply generates a new archive ID and adds it to the object store.

Regards, Tino

dmitrylitvintsev · 18 June 2026 23:47

Hi Tito,

Thanks! This was very useful adfice. I will try that.

Dmitry

dmitrylitvintsev · 19 June 2026 17:01

Just one obsevaton tho: The frequency of these problems is much higher than can be explained by pool restarts. Pools get restarted rarely/
Here is the example of a file on pool that hasn’t been restarted. In billing:

# select datestamp, errorcode, errormessage from storageinfo where pnfsid = '0000A1520C2E71A14302B73DF35635DC1269';
         datestamp          | errorcode |                                           errormessage                                            
----------------------------+-----------+---------------------------------------------------------------------------------------------------
 2026-06-17 17:53:43.497-05 |     10023 | Jun 17 22:42:32.619119 tpsrvf1809 In ArchiveMount::reportJobsBatchTransferred(): got an exception
 2026-06-18 05:46:33.923-05 |     10023 | Jun 18 07:11:01.583868 tpsrvf2213 In ArchiveMount::reportJobsBatchTransferred(): got an exception
 2026-06-18 07:11:34.026-05 |     10006 | Flush was cancelled.

Same sequence of events as in my first message on this thread.
(Flush was cancelled is the result of my cta-nanny daemon action)

The pool has been running fine.

The cta entry says:

  cta-admin tf ls  --instance public_prd --dfid 0000A1520C2E71A14302B73DF35635DC1269
archive id copy no    vid   fseq block id disk buffer                            disk fxid size checksum type checksum value            storage class owner group    creation time instance 
4564736106       2 GC0271 102461 97606650  public_prd 0000A1520C2E71A14302B73DF35635DC1269 8.8M       ADLER32       ef8d5696 nova.nova_production@cta     1     1 2026-06-17 18:25      prd

First copy (well actually second!) created 2026-06-17 18:25 and first copy never made - started to generate these same duplicate key value violates. First failure apparently at Jun 17 22:42:32.619119.

So, the problem of not making second copy is far more frequent than can be explained by pool restarts.

reichard · 22 June 2026 15:27

There must be something going on - which did a second flush of the same pnfsid.
When I had this issue, I had just the problem, that I hat a store timeout set via st set timeout 14440 … the solution was st unset timeout there.

I have no other ideas, maybe try to search some failing pnfsid within the frontend + taped logs and try to find the source of the second flush.

Hopefully the second flush problem will be solved with the new postgres scheduler.

dmitrylitvintsev · 22 June 2026 17:38

Yes, I am digging the CTA logs looking for understanding. As for pool store/restore timeouts - unset since we migrated to CTA. Yes, the general hope is that DB based scheduler will solve this.