In ArchiveMount::reportJobsBatchTransferred(): got an exception

Hello,

I am observing “In ArchiveMount::reportJobsBatchTransferred(): got an exception” errors when archiving files to CTA. This is not a 100% failure rate; most of the archive calls are successful.

This is a dCache/CTA setup.

CTA version 5.10.11.0-1.el9.el9.x86_64
dCache version: 9.2
dCache CTA driver : 0.15.0

# cta-admin --json v| jq
[
  {
    "clientVersion": {
      "ctaVersion": "5.10.11.0-1.el9",
      "xrootdSsiProtobufInterfaceVersion": "v1.15"
    },
    "serverVersion": {
      "ctaVersion": "5.10.11.0-1.el9",
      "xrootdSsiProtobufInterfaceVersion": "v1.15"
    },
    "catalogueConnectionString": "postgresql:postgresql://cta:******@example.org:5432/cta",
    "catalogueVersion": "14.0",
    "isUpgrading": false
  }
]

(I put bogus names in the DB connection string.)

I am just asking for some guidance on getting to the bottom of the issue. What I observe is the following:

  • Based on the cta-taped log, a file seems to get written to CTA.
  • Then something fails, apparently during reporting, and the archive fails. After a while, dCache retries ad nauseam with the same result.

Example request dump:

# cta-objectstore-dump-object ArchiveRequest-Frontend-cmscta01.fnal.gov-3003176-20241127-11:32:04-0-96848
Object store path: rados://CTAQUEUE@rados.cta.CTAQUEUE:CTAQUEUE
Object name: ArchiveRequest-Frontend-cmscta01.fnal.gov-3003176-20241127-11:32:04-0-96848
Header dump:
{
 "type": "ArchiveRequest_t",
 "version": "0",
 "owner": "",
 "backupowner": "",
 "payload": "8LEE/aCYgBCasgR9orEEC2N0YV9pbml0aWFsqLEEAbCxBLAJuLEEAcCxBLAJ0rEEHrKvBAZlb3NkZXaCsAQIY21zY3RhMDHQsASBqNesBtqxBB6yrwQGZW9zZGV2grAECGNtc2N0YTAx0LAEyPDGrwbisQQUSW5pdGlhbCBtb3VudCBwb2xpY3maswQKCggIARIETdVqIbizBICAgICAgICAgAHIswQAgrUEMcCtBAHorQQB4q4EJS8wMDAwMEFCREE0NkIzMjQ1NDdGQUFERTNDRDYwODI5NURDQ0PStQQkMDAwMDBBQkRBNDZCMzI0NTQ3RkFBREUzQ0Q2MDgyOTVEQ0ND+rUEBmVvc2N0YYq2BGVlb3NRdWVyeTovL2Ntc2RhdGEzMjIuZm5hbC5nb3Y6MTA5Ny9zdWNjZXNzLzAwMDAwQUJEQTQ2QjMyNDU0N0ZBQURFM0NENjA4Mjk1RENDQz9hcmNoaXZlaWQ9NDI5NTM2NDczM5K2BFVlb3NRdWVyeTovL2Ntc2RhdGEzMjIuZm5hbC5nb3Y6MTA5Ny9lcnJvci8wMDAwMEFCREE0NkIzMjQ1NDdGQUFERTNDRDYwODI5NURDQ0M/ZXJyb3I9oLYEh6OsBPK2BBBSBHJvb3RaCGVvc3VzZXJzwrcERHJvb3Q6Ly9jbXNkYXRhMzIyLmZuYWwuZ292OjEwOTcvMDAwMDBBQkRBNDZCMzI0NTQ3RkFBREUzQ0Q2MDgyOTVEQ0NDkrgEE2Ntcy5jbXNEYXRhMjAyM0BjdGGauAQwsq8EBmVvc2N0YYKwBBppcHY0OjEzMS4yMjUuMTkwLjE3Mjo0MjM1MtCwBO3b8boGorgE5QKAkwIBipMCD2Ntcy5jbXNEYXRhMjAyM5KTAgCakwJeQXJjaGl2ZVF1ZXVlRmFpbGVkLWNtcy5jbXNEYXRhMjAyMy1NYWludGVuYW5jZS1mbXYyMjAyOS5mbmFsLmdvdi0yODAzLTIwMjQxMTAxLTEwOjI4OjMzLTAtMjcxM6CTAuYHqJMCArCTAgG4kwLbHsCTAgLIkwIC0pMCX0RlYyAxMyAxNDozMzowNi44NTg1NDAgZ212MTgwMTQgSW4gQXJjaGl2ZU1vdW50OjpyZXBvcnRKb2JzQmF0Y2hUcmFuc2ZlcnJlZCgpOiBnb3QgYW4gZXhjZXB0aW9u0pMCX0RlYyAxMyAxNTowMzoxNC4zNDg1OTQgZ212MTgwMTQgSW4gQXJjaGl2ZU1vdW50OjpyZXBvcnRKb2JzQmF0Y2hUcmFuc2ZlcnJlZCgpOiBnb3QgYW4gZXhjZXB0aW9u2JMCAuCTAgCouAQBuLgEAOK4BAtjdGFfaW5pdGlhbA=="
}
Body dump:
{
 "archivefileid": "4295364733",
 "mountpolicy": {
  "name": "cta_initial",
  "archivepriority": "1",
  "archiveminrequestage": "1200",
  "retrievepriority": "1",
  "retieveminrequestage": "1200",
  "creationlog": {
   "username": "eosdev",
   "host": "cmscta01",
   "time": "1704317953"
  },
  "lastmodificationlog": {
   "username": "eosdev",
   "host": "cmscta01",
   "time": "1710340168"
  },
  "comment": "Initial mount policy"
 },
 "mountpolicyname": "cta_initial",
 "checksumblob": "CggIARIETdVqIQ==",
 "creationtime": "9223372036854775808",
 "reconcilationtime": "0",
 "diskfileinfo": {
  "ownerUid": 1,
  "gid": 1,
  "path": "/00000ABDA46B324547FAADE3CD608295DCCC"
 },
 "diskfileid": "00000ABDA46B324547FAADE3CD608295DCCC",
 "diskinstance": "eoscta",
 "archivereporturl": "eosQuery://cmsdata322.fnal.gov:1097/success/00000ABDA46B324547FAADE3CD608295DCCC?archiveid=4295364733",
 "archiveerrorreporturl": "eosQuery://cmsdata322.fnal.gov:1097/error/00000ABDA46B324547FAADE3CD608295DCCC?error=",
 "filesize": "9113991",
 "requester": {
  "name": "root",
  "group": "eosusers"
 },
 "srcurl": "root://cmsdata322.fnal.gov:1097/00000ABDA46B324547FAADE3CD608295DCCC",
 "storageclass": "cms.cmsData2023@cta",
 "creationlog": {
  "username": "eoscta",
  "host": "ipv4:131.225.190.172:42352",
  "time": "1734110701"
 },
 "jobs": [
  {
   "copynb": 1,
   "tapepool": "cms.cmsData2023",
   "archivequeueaddress": "",
   "owner": "ArchiveQueueFailed-cms.cmsData2023-Maintenance-fmv22029.fnal.gov-2803-20241101-10:28:33-0-2713",
   "status": "AJS_Failed",
   "totalretries": 2,
   "retrieswithinmount": 1,
   "lastmountwithfailure": "3931",
   "maxtotalretries": 2,
   "maxretrieswithinmount": 2,
   "failurelogs": [
    "Dec 13 14:33:06.858540 gmv18014 In ArchiveMount::reportJobsBatchTransferred(): got an exception",
    "Dec 13 15:03:14.348594 gmv18014 In ArchiveMount::reportJobsBatchTransferred(): got an exception"
   ],
   "maxreportretries": 2,
   "totalreportretries": 0,
   "reportfailurelogs": []
  }
 ],
 "reportdecided": true,
 "isrepack": false,
 "isfailed": false
}

I am mostly curious whether this looks familiar and where I need to dig to figure out what is wrong. The cta-taped log does not show any failures.

OK. With the help of Mwai and Tigran, it looks like these errors are due to the presence of already archived files in the archive batch. They fail the whole batch.

Hello Dmitry,

What do you mean by “taped log does not show any failures”?

Could you search the tapeserver logs for the “In ArchiveMount::reportJobsBatchTransferred(): got an exception” error message? This log line should contain additional information about the reason for the failure. You can see what should be logged for that log line at: scheduler/ArchiveMount.cpp · main · cta / CTA · GitLab
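
For example, something along these lines should pull the embedded reason out of the JSON log lines (the log path and the exceptionMessageValue field name are assumptions based on what that code logs, so adjust to your setup):

# grep -h 'reportJobsBatchTransferred(): got an exception' /var/log/cta/cta-taped*.log | jq -r '.exceptionMessageValue'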

This exception happens when trying to queue for reporting the files that have successfully been written to tape, and as you state in your second comment, if there is a problem with a file, the entire batch will fail.

In the past we have experienced this issue with dual-copy storage classes, but it was related to an ObjectStore locking timeout. It looks like you are getting a different error, due to the presence of already archived files in the archive batch.

I meant to say that I haven't seen errors associated with disk_file_id = 00000ABDA46B324547FAADE3CD608295DCCC:

# grep '00000ABDA46B324547FAADE3CD608295DCCC' *.log  | grep -c "got an exception" 
0
[root@gmv18014 cta]# 

But if I grep for ERROR in general (or “got an exception”), I do see other disk_file_ids showing up:

{"epoch_time":1734362759.148497944,"local_time":"2024-12-16T09:25:59-0600","hostname":"gmv18014","program":"cta-taped","log_level":"ERROR","pid":2193083,"tid":2251909,"message":"In ArchiveMount::reportJobsBatchTransferred(): got an exception","drive_name":"G1_F3C3R3","instance":"preProdCMS","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"G1_F3C3R3","mountId":"4131","vo":"cms","tapePool":"cms.cmsData2023","exceptionMessageValue":"commit problem committing the DB transaction: Database library reported: ERROR:  duplicate key value violates unique constraint \"archive_file_din_dfi_un\"DETAIL:  Key (disk_instance_name, disk_file_id)=(eoscta, 0000AEC65EE5F77D43B89B1E759F9D0B4400) already exists. (DB Result Status:7 SQLState:23505)"}

And I see a bunch of entries like:

{"epoch_time":1734362759.190981927,"local_time":"2024-12-16T09:25:59-0600","hostname":"gmv18014","program":"cta-taped","log_level":"INFO","pid":2193083,"tid":2251909,"message":"In ArchiveJob::failTransfer(): requeued job for (potentially in-mount) retry.","drive_name":"G1_F3C3R3","instance":"preProdCMS","sched_backend":"cephUser","thread":"MainThread","tapeDrive":"G1_F3C3R3","mountId":"4131","vo":"cms","tapePool":"cms.cmsData2023","fileId":4295386504,"copyNb":1,"failureReason":"In ArchiveMount::reportJobsBatchTransferred(): got an exception","requestObject":"ArchiveRequest-Frontend-cmscta01.fnal.gov-3003176-20241127-11:32:04-0-121348","retriesWithinMount":1,"maxRetriesWithinMount":2,"totalRetries":1,"maxTotalRetries":2}

It has been understood that errors associated with 00000ABDA46B324547FAADE3CD608295DCCC seen in cta-objectstore-dump-object are caused by:

"commit problem committing the DB transaction: Database library reported: ERROR:  duplicate key value violates unique constraint \"archive_file_din_dfi_un\"DETAIL:  Key (disk_instance_name, disk_file_id)=(eoscta, 0000AEC65EE5F77D43B89B1E759F9D0B4400) already exists. (DB Result Status:7 SQLState:23505)"

That is, an attempt to archive a file that has already been archived. It was just not clear to me why that other file (00000ABDA46B324547FAADE3CD608295DCCC) was affected. But now I understand that the whole batch fails if one file fails. Right?

As for file 00000ABDA46B324547FAADE3CD608295DCCC, it has been successfully written:

{"epoch_time":1734175603.047356067,"local_time":"2024-12-14T05:26:43-0600","hostname":"gmv18014","program":"cta-taped","log_level":"INFO","pid":1266386,"tid":1321287,"message":"File successfully transmitted to drive","drive_name":"G1_F3C3R3","instance":"preProdCMS","sched_backend":"cephUser","thread":"TapeWrite","tapeDrive":"G1_F3C3R3","tapeVid":"FL2780","mountId":"3979","vo":"cms","tapePool":"cms.cmsData2023","mediaType":"LTO8","logicalLibrary":"TS4500G1_CTACMS","mountType":"ArchiveForUser","vendor":"Unknown","capacityInBytes":12000000000000,"fileId":4295372622,"fileSize":9113991,"fSeq":58645,"diskURL":"root://cmsdata322.fnal.gov:1097/00000ABDA46B324547FAADE3CD608295DCCC","readWriteTime":0.027945,"checksumingTime":0.004059,"waitDataTime":8e06,"waitReportingTime":0.000128,"transferTime":0.03214,"totalTime":0.032127,"dataVolume":9113991,"headerVolume":480,"driveTransferSpeedMBps":283.701279297787,"payloadTransferSpeedMBps":283.686338593706,"reconciliationTime":0,"LBPMode":"LBP_On"}

But the archive ID did not get registered in the DB:

# cta-admin tapefile ls --id 4295372622
Archive file with ID 4295372622 does not exist

And dCache received a failure to archive.
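
For completeness, the catalogue can also be checked by disk file ID directly; the table and column names below are inferred from the constraint in the error message (and the connection URI is the bogus one from above), so treat this as a sketch:

# psql "postgresql://cta:<password>@example.org:5432/cta" -c "SELECT archive_file_id FROM archive_file WHERE disk_instance_name = 'eoscta' AND disk_file_id = '00000ABDA46B324547FAADE3CD608295DCCC';"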

Hello Dmitry,

From the logs, it looks like CTA is writing the same file to tape twice.
However, it will only detect this afterwards, when it tries to register the file in the CTA catalogue and the insert is rejected due to a constraint in the DB:

commit problem committing the DB transaction: Database library reported: ERROR: duplicate key value violates unique constraint \"archive_file_din_dfi_un\"DETAIL: Key (disk_instance_name, disk_file_id)=(eoscta, 0000AEC65EE5F77D43B89B1E759F9D0B4400) already exists. (DB Result Status:7 SQLState:23505)

I think this problem may be caused by some difference in the way EOS and dCache behave when interacting with CTA.

In the case of EOS, we haven't seen this problem, because EOS only triggers one archive request for each file written to the disk buffer.

In the case of dCache, I'm not familiar with the behaviour, so the problem might be related to it.


Regarding the reason why this causes a batch to fail, I think it's simply a consequence of how the DB INSERT query is implemented.
It's optimised for speed, and for that it tries to register several archived files at once in the CTA Catalogue.
If one of them violates the constraint, I think it causes the whole transaction to fail and be reverted.
As a result, several files will be affected.

This could be considered a bug, but it would only be triggered if the disk buffer tries to archive the same file twice.
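
As a rough illustration of that failure mode (a toy table against a scratch database, not the actual CTA catalogue schema or the real registration query):

# psql <<'SQL'
BEGIN;
CREATE TABLE demo_archive_file (
  disk_instance_name TEXT,
  disk_file_id       TEXT,
  CONSTRAINT demo_din_dfi_un UNIQUE (disk_instance_name, disk_file_id)
);
INSERT INTO demo_archive_file VALUES ('eoscta', 'AAAA');
-- A single multi-row INSERT stands in for the batch registration: the
-- duplicate 'AAAA' violates the constraint, the transaction is aborted,
-- and 'BBBB' and 'CCCC' are rolled back along with it.
INSERT INTO demo_archive_file VALUES ('eoscta', 'BBBB'), ('eoscta', 'AAAA'), ('eoscta', 'CCCC');
COMMIT;
SQL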

Hi Joao,

The cause of this duplicate archival request: the dCache pool (aka dataserver) was restarted just as the write to tape completed successfully and the DB metadata was inserted on the CTA end, but dCache had not yet received the completion report, so it could not mark in its own metadata store that the file had been archived.

There could be many solutions to this. Obviously, CTA could respond to a duplicate archival request differently (e.g. by returning the file's archive_id right away instead of failing, or something along these lines). But I understand that EOS works a certain way with CTA and such changes could be disruptive or undesirable. I would love to discuss the solutions and even try to contribute the code to CTA if there is an agreement that it could be handled entirely on the CTA end. At the same time, we could also see what we can do on the dCache end to mitigate the effects of pool restarts.

Thank you,
Dmitry

Hi Dmitry,

Thank you for the clarification.

Indeed, if EOS is not running when the completion report is sent, it will also not register in its namespace that the file was archived.
In these rare occurrences the client will never know that the file was archived, and the data on tape will end up becoming “dark data”.

If I remember correctly, the expected procedure is for the client to delete the file and retry after a certain time, which will restart the whole process from the beginning.
EOS itself does not retry, unlike dCache.


I will discuss with the team what can be done to address this issue on the CTA side and come back to you.

One possible solution (but maybe not what you want) is for CTA to simply fail the request if the disk file ID already exists in the catalogue.
Re-triggering the completion report would probably require major changes in the queuing mechanism, which is something we want to avoid (at least until the new Postgres scheduler is complete).

Best regards,
Joao