Hello everyone,
I’m Eli, and we are setting up a test instance of CTA+dCache at PIC, as we currently run Enstore+dCache in production.
We have managed to run archives and retrieves successfully (I can list the files with cta-admin tapefile ls -v [VID], I see them in the catalogue, etc.).
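For the record, this is the exact check we run, with VID01 as a placeholder tape VID (the --json/jq variant is only a convenience we use to count the copies, assuming your cta-admin build has the --json option; adjust to taste):
[root@ctatps001 ~]# cta-admin tapefile ls -v VID01
[root@ctatps001 ~]# cta-admin --json tapefile ls -v VID01 | jq length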
But in the cta-taped logs we always see an EOSReporter error, and this causes the transfers to end up in the failed requests queue even though they completed successfully.
Here is the error we find repeatedly (as I understand it, the report is retried at least twice per file):
Aug 2 12:05:49.560587 ctatps001 cta-taped: LVL="ERROR" PID="26153" TID="26153" MSG="In Scheduler::reportArchiveJobsBatch(): failed to report." SubprocessName="maintenanceHandler" fileId="222" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1"
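(In case anyone wants to reproduce the count, we simply grep these reports out of the taped log; the log path below is from our setup and is an assumption for other installs:)
[root@ctatps001 ~]# grep -c 'EOSReporter::AsyncQueryHandler::HandleResponse' /var/log/cta/cta-taped.log
[root@ctatps001 ~]# grep 'reportType="CompletionReport"' /var/log/cta/cta-taped.log | grep 'LVL="ERROR"' | tail -5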
From what I see when I dump the objects with cta-objectstore-dump-object, this EOSReporter tries to send the report to a URL (eosQuery:///…) pointing to our dCache pool, which it obviously cannot find.
[root@ctatps001 ~]# cta-objectstore-dump-object ArchiveRequest-Frontend-ctatps001.pic.es-26176-20220801-17:23:08-0-3
Object store path: file:///ctavfs
Object name: ArchiveRequest-Frontend-ctatps001.pic.es-26176-20220801-17:23:08-0-3
Header dump:
{
  "type": "ArchiveRequest_t",
  [...]
},
"mountpolicyname": "ctamp",
"checksumblob": "CggIARIEAQCtqQ==",
"creationtime": "9223372036854775808",
"reconcilationtime": "0",
"diskfileinfo": {
  "ownerUid": 1,
  "gid": 1,
  "path": "/0000BAFCB20151CF4EA4911A7FEFB599BD40"
},
"diskfileid": "0000BAFCB20151CF4EA4911A7FEFB599BD40",
"diskinstance": "cta",
"archivereporturl": "eosQuery://193.109.172.111:36639/success/0000BAFCB20151CF4EA4911A7FEFB599BD40?archiveid=222",
"archiveerrorreporturl": "eosQuery://193.109.172.111:36639/error/0000BAFCB20151CF4EA4911A7FEFB599BD40?error=",
"filesize": "1048576000",
"requester": {
  "name": "cta",
  "group": "cta"
},
"srcurl": "root://193.109.172.111:36639/0000BAFCB20151CF4EA4911A7FEFB599BD40",
"storageclass": "dteam.dteam@cta",
"creationlog": {
  "username": "cta",
  "host": "ipv4:193.109.172.111:56498",
  "time": "1659434678"
},
"jobs": [
  {
    "copynb": 1,
    "tapepool": "dteam",
    "archivequeueaddress": "",
    "owner": "ArchiveQueueFailed-dteam-Maintenance-ctatps001.pic.es-26153-20220801-17:23:03-0-6",
    "status": "AJS_Failed",
    "totalretries": 0,
    "retrieswithinmount": 0,
    "lastmountwithfailure": "0",
    "maxtotalretries": 2,
    "maxretrieswithinmount": 2,
    "failurelogs": [],
    "maxreportretries": 2,
    "totalreportretries": 2,
    "reportfailurelogs": [
      "Aug 2 12:05:34.559213 ctatps001 In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1",
      "Aug 2 12:05:49.560677 ctatps001 In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1"
    ]
  }
],
"reportdecided": false,
"isrepack": false,
"isfailed": true
}
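In case it helps to reproduce, this is roughly how we locate the request object before dumping it (assuming cta-objectstore-list is available in the objectstore tools of your install and reads the same objectstore configuration; otherwise the object name can also be taken from the cta-taped log lines for that fileId):
[root@ctatps001 ~]# cta-objectstore-list | grep ArchiveRequest
[root@ctatps001 ~]# cta-objectstore-dump-object ArchiveRequest-Frontend-ctatps001.pic.es-26176-20220801-17:23:08-0-3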
We have tried setting the logs to debug, but no additional relevant information appeared. I assume this EOSReporter only sends reports for archive transfers, as the error does not seem to happen with retrieves. This has happened to us both in version 4.3, which we were using previously, and in version 4.7.8, which we compiled a few weeks ago.
I also saw this issue (EOS not updated after successful archival), where the error is the same, but since we don’t use EOS in our deployment, I think the approach to fixing it is perhaps different.
At this point, what could I do to prevent our transfers from ending up in the failed requests queue? Could I manually change the URL of the reports somehow? Or is there any way to disable EOSReporter?
Or perhaps it should work even though we’re using dCache, and this is some misconfiguration on our part?
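For completeness, this is how we currently look at what lands in the failed requests queue, and my guess at how one would clear the entries by hand. I am not sure the rm line is the right syntax (or even the right approach), so please treat it as an assumption rather than something we have validated:
[root@ctatps001 ~]# cta-admin failedrequest ls
[root@ctatps001 ~]# cta-admin failedrequest rm -o [objectId]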
Apologies and thank you very much in advance,
Eli