Files fail to be archived

Hi,

Some files fail to be marked as ARCHIVED even after a CLOSEW request has been manually re-submitted for them (the pool name is na62Tape).

The sys.archive.error attribute has the following (uninformative) value:

sys.archive.error="Mar 22 11:38:26.040832 cta-ts12 In ArchiveMount::reportJobsBatchWritten(): got an exception"

I can see this object

ArchiveQueueFailed-na62Tape-Maintenance-getafix-ts20.scd.rl.ac.uk-1885-20240322-11:27:22-0-0

whose contents don't contain any information as to why the archival failed.

Do you know what I should do?

Thanks,

George

I also see this message on the tape server:

2024-03-22T12:26:01.968698+00:00 cta-ts11 cta-taped: LVL="ERROR" PID="21444" TID="31146" MSG="In MigrationReportPacker::WorkerThread::run(): Received a CTA exception while reporting archive mount results." thread="ReportPacker" tapeDrive="asterix_lto9_00" tapeVid="CL1546" mountId="1395528" exceptionMSG="In ArchiveMount::reportJobsBatchWritten(): got an exception"

Hello George,

Could you let us know which version of CTA you are running?
cta-admin --json v | jq

We are trying to track down where this message could be coming from. In the most up-to-date CTA code, this error output could only originate from the mock CTA classes used for testing.

Do you have more log information? How often does this happen? Do you face the same problem even if you restart cta-taped and retry?

Best Regards,
Jaroslav

Hi Jaroslav,

Thanks for the reply. I am running CTA 4.10.0-2:

[root@cta-adm ~]# cta-admin --json v | jq
[
  {
    "clientVersion": {
      "ctaVersion": "4.10.0-2",
      "xrootdSsiProtobufInterfaceVersion": "v1.9"
    },
    "serverVersion": {
      "ctaVersion": "4.10.0-2",
      "xrootdSsiProtobufInterfaceVersion": "v1.9"
    },
    "catalogueConnectionString": "oracle:cta/******@CTA_VEGA",
    "catalogueVersion": "14.0",
    "isUpgrading": false
  }
]
[root@cta-adm ~]#

Best,

George

Hello George,

Thank you for the release version; with that, I could indeed find the error message in the code.

Do you have any more log lines from around this error that you could share with us?
Does this happen systematically? Is there a way to reproduce the problem?
Do you face the same problem even if you restart cta-taped and retry?

Cheers,
Jaroslav

Hi Jaroslav,

Sorry for the delay. What keeps happening is the following:

  • There is a set of files that are consistently marked as failed archives (identified by running cta-admin --json fr ls)

  • I send the CLOSEW signal for these files

  • A tape is mounted on a tape server and a tape session starts. According to cta-taped.log, the file is reported as "File successfully transmitted to drive", but the tape session nevertheless completes with status="failure"

  • I can see an object named like ArchiveQueueFailed-na62Tape-Maintenance-getafix-ts20.scd.rl.ac.uk-1885-20240322-11:27:22-0-0, which I try to delete, but it keeps re-appearing. I think I have deleted it before the maximum number of archival retries was reached by CTA, but I am not sure whether this makes any difference.
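In case it helps anyone triaging the same situation, the first step above can be scripted. Below is a minimal sketch that summarizes failed requests, assuming the JSON shape of cta-admin --json fr ls -l shown further down in this thread; the sample record is hard-coded for illustration.

```python
import json
import subprocess  # only needed for the live command, shown commented out below

def summarize_failed(requests):
    """Yield (archiveId, path, totalretries, last failure log) per failed request."""
    for r in requests:
        logs = r.get("failurelogs") or []
        yield (r["af"]["archiveId"],
               r["af"]["df"]["path"],
               int(r.get("totalretries", 0)),
               logs[-1] if logs else "")

# Live usage (requires a CTA admin host):
# requests = json.loads(subprocess.run(
#     ["cta-admin", "--json", "fr", "ls", "-l"],
#     capture_output=True, text=True, check=True).stdout)

# Sample record in the shape shown later in this thread:
sample = [{
    "af": {"archiveId": "4335367681",
           "df": {"path": "/eos/antares/prod/na62.vo.gridpp.ac.uk/disk/prod/"
                          "v3.8.6/Kch2pipi0g_ib-126/pluto_c2_dr012037_r10903506.root"}},
    "totalretries": 2,
    "failurelogs": ["Mar 22 11:38:25.798892 cta-ts12 "
                    "In ArchiveMount::reportJobsBatchWritten(): got an exception"],
}]

for archive_id, path, retries, last_failure in summarize_failed(sample):
    print(archive_id, retries, last_failure)
```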

George

Hi George,

If the file is on disk but repeatedly fails to archive to tape, I would begin to suspect that the disk file has been corrupted. If you calculate the checksum of the file on disk, does it match the checksum in the EOS disk file metadata?
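One way to do that check locally is to compute the adler32 of the disk replica with Python's zlib and compare it against the 8-hex-digit checksum EOS reports (e.g. via eos fileinfo). A minimal sketch:

```python
import zlib

def file_adler32(path, chunk_size=1 << 20):
    """Compute the adler32 checksum of a file, streamed in 1 MiB chunks,
    and return it as 8 lowercase hex digits (the format EOS displays)."""
    checksum = 1  # adler32 seed value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")
```

Compare the returned string with the checksum field shown by EOS for the same file; a mismatch would point at disk-side corruption.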

If you inspect the failed file with cta-admin --json fr ls -l, you should see the full log of retry attempts for each failed file. Can you post one of those logs here?

Can you also attach the full dump of the ArchiveQueueFailed-... object?

Cheers,
Michael

Hi Michael,

The adler32 checksum of the disk replica matches the checksum in the EOS disk file metadata for the 5 files that I checked.

Here is the output of cta-admin --json fr ls -l for one of the files:

{
  "objectId": "ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-14392-20240305-18:19:32-0-1820934",
  "requestType": "ARCHIVE_REQUEST",
  "requester": {
    "username": "spsna62001",
    "groupname": "spsna62"
  },
  "af": {
    "archiveId": "4335367681",
    "diskInstance": "eosantares",
    "diskId": "858851396",
    "size": "958773777",
    "storageClass": "na62-tape0",
    "df": {
      "path": "/eos/antares/prod/na62.vo.gridpp.ac.uk/disk/prod/v3.8.6/Kch2pipi0g_ib-126/pluto_c2_dr012037_r10903506.root"
    },
    "creationTime": "1711106367"
  },
  "tapepool": "na62Tape",
  "copyNb": "1",
  "totalretries": 2,
  "failurelogs": [
    "Mar 22 11:27:44.511988 cta-ts12 In ArchiveMount::reportJobsBatchWritten(): got an exception",
    "Mar 22 11:38:25.798892 cta-ts12 In ArchiveMount::reportJobsBatchWritten(): got an exception"
  ],
  "totalreportretries": 0,
  "reportfailurelogs": []
}

and the contents of the object:

[root@cta-adm 23042024_fr_cleanup]# cta-objectstore-dump-object ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-14392-20240305-18:19:32-0-1820934
Object store path: rados://antares@cta:antares
Object name: ArchiveRequest-Frontend-cta-front04.scd.rl.ac.uk-14392-20240305-18:19:32-0-1820934
Header dump:
{
  "type": "ArchiveRequest_t",
  "version": "0",
  "owner": "",
  "backupowner": "",
  "payload": "8LEEgeyhkxCasgSgAaKxBBBuYTYyX21vdW50cG9saWN5qLEEAbCxBNgEuLEEAcCxBNgE0rEEMbKvBAljdGEtYWRtaW6CsAQYY3RhLWZyb250MDEuc2NkLnJsLmFjLnVr0LAE+LujkQbasQQksq8ECWN0YS1hZG1pboKwBAtjdGEtZnJvbnQwMdCwBOnwlqAG4rEEGU5BNjIgcHJlcHJvZCBtb3VudCBwb2xpY3maswQKCggIARIECNAUb7izBICAgICAgICAgAHIswQAgrUEesCtBOGVAuitBMSVAuKuBGovZW9zL2FudGFyZXMvcHJvZC9uYTYyLnZvLmdyaWRwcC5hYy51ay9kaXNrL3Byb2QvdjMuOC42L0tjaDJwaXBpMGdfaWItMTI2L3BsdXRvX2MyX2RyMDEyMDM3X3IxMDkwMzUwNi5yb2900rUECTg1ODg1MTM5Nvq1BAplb3NhbnRhcmVzirYE2wFlb3NRdWVyeTovL2FudGFyZXMtZW9zMDEuc2NkLnJsLmFjLnVrLy9lb3Mvd2ZlL3Bhc3N3ZD9tZ20ucGNtZD1ldmVudCZtZ20uZmlkPTMzMzEwODQ0Jm1nbS5sb2dpZD1jdGEmbWdtLmV2ZW50PXN5bmM6OmFyY2hpdmVkJm1nbS53b3JrZmxvdz1kZWZhdWx0Jm1nbS5wYXRoPS9kdW1teV9wYXRoJm1nbS5ydWlkPTAmbWdtLnJnaWQ9MCZjdGFfYXJjaGl2ZV9maWxlX2lkPTQzMzUzNjc2ODGStgTtAWVvc1F1ZXJ5Oi8vYW50YXJlcy1lb3MwMS5zY2QucmwuYWMudWsvL2Vvcy93ZmUvcGFzc3dkP21nbS5wY21kPWV2ZW50Jm1nbS5maWQ9MzMzMTA4NDQmbWdtLmxvZ2lkPWN0YSZtZ20uZXZlbnQ9c3luYzo6YXJjaGl2ZV9mYWlsZWQmbWdtLndvcmtmbG93PWRlZmF1bHQmbWdtLnBhdGg9L2R1bW15X3BhdGgmbWdtLnJ1aWQ9MCZtZ20ucmdpZD0wJmN0YV9hcmNoaXZlX2ZpbGVfaWQ9NDMzNTM2NzY4MSZtZ20uZXJybXNnPaC2BJH0lskD8rYEFVIKc3BzbmE2MjAwMVoHc3BzbmE2MsK3BKIBcm9vdDovL2FudGFyZXMtZW9zMDEuc2NkLnJsLmFjLnVrLy9lb3MvYW50YXJlcy9wcm9kL25hNjIudm8uZ3JpZHBwLmFjLnVrL2Rpc2svcHJvZC92My44LjYvS2NoMnBpcGkwZ19pYi0xMjYvcGx1dG9fYzJfZHIwMTIwMzdfcjEwOTAzNTA2LnJvb3Q/ZW9zLmxmbj1meGlkOjMzMzEwODQ0krgECm5hNjItdGFwZTCauAQlsq8ECmVvc2FudGFyZXOCsAQLY3RhLWZyb250MDTQsAS/0vWvBqK4BNUCgJMCAYqTAghuYTYyVGFwZZKTAgCakwJcQXJjaGl2ZVF1ZXVlRmFpbGVkLW5hNjJUYXBlLU1haW50ZW5hbmNlLWdldGFmaXgtdHMyMC5zY2QucmwuYWMudWstMTg4NS0yMDI0MDMyMi0xMToyNzoyMi0wLTCgkwLmB6iTAgKwkwIBuJMCr41VwJMCAsiTAgLSkwJbTWFyIDIyIDExOjI3OjQ0LjUxMTk4OCBjdGEtdHMxMiBJbiBBcmNoaXZlTW91bnQ6OnJlcG9ydEpvYnNCYXRjaFdyaXR0ZW4oKTogZ290IGFuIGV4Y2VwdGlvbtKTAltNYXIgMjIgMTE6Mzg6MjUuNzk4ODkyIGN0YS10czEyIEluIEFyY2hpdmVNb3VudDo6cmVwb3J0Sm9ic0JhdGNoV3JpdHRlbigpOiBnb3QgYW4gZXhjZXB0aW9u2JMCAuCTAgCouAQBuLgEAOK4BBBuYTYyX21vdW50cG9saWN5"
}
Body dump:
{
  "archivefileid": "4335367681",
  "mountpolicy": {
    "name": "na62_mountpolicy",
    "archivepriority": "1",
    "archiveminrequestage": "600",
    "retrievepriority": "1",
    "retieveminrequestage": "600",
    "creationlog": {
      "username": "cta-admin",
      "host": "cta-front01.scd.rl.ac.uk",
      "time": "1646845432"
    },
    "lastmodificationlog": {
      "username": "cta-admin",
      "host": "cta-front01",
      "time": "1678096489"
    },
    "comment": "NA62 preprod mount policy"
  },
  "mountpolicyname": "na62_mountpolicy",
  "checksumblob": "CggIARIECNAUbw==",
  "creationtime": "9223372036854775808",
  "reconcilationtime": "0",
  "diskfileinfo": {
    "ownerUid": 35553,
    "gid": 35524,
    "path": "/eos/antares/prod/na62.vo.gridpp.ac.uk/disk/prod/v3.8.6/Kch2pipi0g_ib-126/pluto_c2_dr012037_r10903506.root"
  },
  "diskfileid": "858851396",
  "diskinstance": "eosantares",
  "archivereporturl": "eosQuery://antares-eos01.scd.rl.ac.uk//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=33310844&mgm.logid=cta&mgm.event=sync::archived&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&cta_archive_file_id=4335367681",
  "archiveerrorreporturl": "eosQuery://antares-eos01.scd.rl.ac.uk//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=33310844&mgm.logid=cta&mgm.event=sync::archive_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&cta_archive_file_id=4335367681&mgm.errmsg=",
  "filesize": "958773777",
  "requester": {
    "name": "spsna62001",
    "group": "spsna62"
  },
  "srcurl": "root://antares-eos01.scd.rl.ac.uk//eos/antares/prod/na62.vo.gridpp.ac.uk/disk/prod/v3.8.6/Kch2pipi0g_ib-126/pluto_c2_dr012037_r10903506.root?eos.lfn=fxid:33310844",
  "storageclass": "na62-tape0",
  "creationlog": {
    "username": "eosantares",
    "host": "cta-front04",
    "time": "1711106367"
  },
  "jobs": [
    {
      "copynb": 1,
      "tapepool": "na62Tape",
      "archivequeueaddress": "",
      "owner": "ArchiveQueueFailed-na62Tape-Maintenance-getafix-ts20.scd.rl.ac.uk-1885-20240322-11:27:22-0-0",
      "status": "AJS_Failed",
      "totalretries": 2,
      "retrieswithinmount": 1,
      "lastmountwithfailure": "1394351",
      "maxtotalretries": 2,
      "maxretrieswithinmount": 2,
      "failurelogs": [
        "Mar 22 11:27:44.511988 cta-ts12 In ArchiveMount::reportJobsBatchWritten(): got an exception",
        "Mar 22 11:38:25.798892 cta-ts12 In ArchiveMount::reportJobsBatchWritten(): got an exception"
      ],
      "maxreportretries": 2,
      "totalreportretries": 0,
      "reportfailurelogs": []
    }
  ],
  "reportdecided": true,
  "isrepack": false,
  "isfailed": false
}

Hi George:

This indicates that the files were successfully written to tape, but reporting failed. What is the output if you check for the presence of the file on tape? cta-admin tapefile ls --id 4335367681

If reporting failed, the EOS metadata was not updated to indicate that the file is on tape. Take a look in the tape server logs around one of those timestamps (Mar 22 11:27:44.511988) and find out what the exception was.
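When digging through those logs, it can help to split each cta-taped line into its key="value" fields so you can pull out exceptionMSG and friends directly. A minimal sketch (the sample line is the one quoted earlier in this thread):

```python
import re

def parse_cta_log_line(line):
    """Extract the key="value" pairs of a cta-taped log line into a dict."""
    return dict(re.findall(r'(\w+)="([^"]*)"', line))

line = ('2024-03-22T12:26:01.968698+00:00 cta-ts11 cta-taped: LVL="ERROR" PID="21444" '
        'TID="31146" MSG="In MigrationReportPacker::WorkerThread::run(): Received a CTA '
        'exception while reporting archive mount results." thread="ReportPacker" '
        'tapeDrive="asterix_lto9_00" tapeVid="CL1546" mountId="1395528" '
        'exceptionMSG="In ArchiveMount::reportJobsBatchWritten(): got an exception"')

fields = parse_cta_log_line(line)
print(fields["LVL"], "-", fields["exceptionMSG"])
```

Filtering a whole log file to the lines whose timestamp matches the failure, then printing only MSG and exceptionMSG, narrows things down quickly.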

Michael