EOSReporter fails with CTA+dCache

Hello everyone,

I'm Eli and we are setting up a test instance of CTA+dCache at PIC, as we currently run Enstore+dCache in production.

We have managed to do archives and retrieves successfully (I can list the files with cta-admin tapefile ls -v [VID], I see them in the catalogue, etc.).

But in the cta-taped logs we always see an EOSReporter error, and this causes transfers to appear in the failed requests queue, even if the transfer has completed successfully.
Here is the error we find repeatedly (as I understand it, it tries at least twice per file):

Aug 2 12:05:49.560587 ctatps001 cta-taped: LVL="ERROR" PID="26153" TID="26153" MSG="In Scheduler::reportArchiveJobsBatch(): failed to report." SubprocessName="maintenanceHandler" fileId="222" reportType="CompletionReport" exceptionMSG="In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1"

From what I see when I dump the objects with cta-objectstore-dump-object, this EOSReporter tries to send the report to a URL (eosQuery:///…) pointing to our dCache pool, which it apparently cannot reach.

[root@ctatps001 ~]# cta-objectstore-dump-object ArchiveRequest-Frontend-ctatps001.pic.es-26176-20220801-17:23:08-0-3
Object store path: file:///ctavfs
Object name: ArchiveRequest-Frontend-ctatps001.pic.es-26176-20220801-17:23:08-0-3
Header dump:
{
 "type": "ArchiveRequest_t",
[...]
 },
 "mountpolicyname": "ctamp",
 "checksumblob": "CggIARIEAQCtqQ==",
 "creationtime": "9223372036854775808",
 "reconcilationtime": "0",
 "diskfileinfo": {
  "ownerUid": 1,
  "gid": 1,
  "path": "/0000BAFCB20151CF4EA4911A7FEFB599BD40"
 },
 "diskfileid": "0000BAFCB20151CF4EA4911A7FEFB599BD40",
 "diskinstance": "cta",
 "archivereporturl": "eosQuery://193.109.172.111:36639/success/0000BAFCB20151CF4EA4911A7FEFB599BD40?archiveid=222",
 "archiveerrorreporturl": "eosQuery://193.109.172.111:36639/error/0000BAFCB20151CF4EA4911A7FEFB599BD40?error=",
 "filesize": "1048576000",
 "requester": {
  "name": "cta",
  "group": "cta"
 },
 "srcurl": "root://193.109.172.111:36639/0000BAFCB20151CF4EA4911A7FEFB599BD40",
 "storageclass": "dteam.dteam@cta",
 "creationlog": {
  "username": "cta",
  "host": "ipv4:193.109.172.111:56498",
  "time": "1659434678"
 },
 "jobs": [
  {
   "copynb": 1,
   "tapepool": "dteam",
   "archivequeueaddress": "",
   "owner": "ArchiveQueueFailed-dteam-Maintenance-ctatps001.pic.es-26153-20220801-17:23:03-0-6",
   "status": "AJS_Failed",
   "totalretries": 0,
   "retrieswithinmount": 0,
   "lastmountwithfailure": "0",
   "maxtotalretries": 2,
   "maxretrieswithinmount": 2,
   "failurelogs": [],
   "maxreportretries": 2,
   "totalreportretries": 2,
   "reportfailurelogs": [
    "Aug  2 12:05:34.559213 ctatps001 In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1",
    "Aug  2 12:05:49.560677 ctatps001 In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1"
   ]
  }
 ],
 "reportdecided": false,
 "isrepack": false,
 "isfailed": true
}
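
If I read the error and the archivereporturl above correctly, the failing report is essentially an XRootD file-system query against our pool endpoint. To illustrate what I think is happening, here is a minimal standalone sketch of that call (this is not the actual CTA code; the synchronous call, the opaque-file query code and the endpoint/path are my assumptions, taken from the URL in the dump and the error message):

// Rough sketch of the report call that appears to time out (assumptions:
// the endpoint is the host:port from archivereporturl and the query payload
// is the /success/...?archiveid=... path, sent as an "opaque file" query).
#include <XrdCl/XrdClBuffer.hh>
#include <XrdCl/XrdClFileSystem.hh>
#include <iostream>

int main() {
  XrdCl::FileSystem fs(XrdCl::URL("root://193.109.172.111:36639"));

  XrdCl::Buffer arg;
  arg.FromString("/success/0000BAFCB20151CF4EA4911A7FEFB599BD40?archiveid=222");

  XrdCl::Buffer* response = nullptr;
  // Synchronous variant for simplicity; cta-taped uses the asynchronous form
  // with a response handler (EOSReporter::AsyncQueryHandler).
  XrdCl::XRootDStatus st = fs.Query(XrdCl::QueryCode::OpaqueFile, arg, response);

  if (!st.IsOK()) {
    // In our setup this is where "Operation expired" shows up.
    std::cerr << "Report query failed: " << st.ToStr() << std::endl;
  } else {
    std::cout << "Report accepted, response: " << response->ToString() << std::endl;
  }
  delete response;
  return 0;
}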

We have tried setting the logs to debug, but no more relevant information appeared. I assume this EOSReporter only sends reports for archive transfers, as it doesn't seem to happen with retrieves. This has happened to us both with version 4.3, which we were using previously, and with version 4.7.8, which we compiled a few weeks ago.

I also saw this issue (EOS not update after succesful archival), where the error is the same, but since we don't use EOS in our deployment, I think the approach to the error is perhaps different.

At this point, what could I do to prevent our transfers from ending up in the failed requests queue? Could I manually change the URL of the reports somehow? Or is there any way to disable EOSReporter?
Or perhaps it should work even though we're using dCache, and it's some misconfiguration on our part?

Apologies and thank you very much in advance,

Eli

Hi Eli and welcome to the CTA Community!

archivereporturl and archiveerrorreporturl are set by the client code which calls the CTA Frontend. Here is the relevant snippet from the protobuf definitions:

message Transport {
  string dst_url              =  1;      //< transport destination URL
  string report_url           =  2;      //< URL to report successful archiving
  string error_report_url     =  3;      //< URL to report errors
}

These URLs specify callbacks for reporting a successful or failed archive request, so they are system-neutral. You can find a more detailed explanation of the workflow in the documentation (with EOS as the disk frontend).
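
To make this concrete, here is a small sketch of how a disk-system client could construct these two callback URLs before sending the archive request (illustration only: the helper names are hypothetical, the values are copied from your object dump above, and the real dcache-cta client is of course written in Java rather than C++):

#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical helper: build the values for Transport.report_url and
// Transport.error_report_url (they end up as archivereporturl and
// archiveerrorreporturl in the queued ArchiveRequest).
struct ReportCallbacks {
  std::string report_url;        // success callback
  std::string error_report_url;  // failure callback
};

ReportCallbacks makeArchiveCallbacks(const std::string& poolEndpoint,  // e.g. "193.109.172.111:36639"
                                     const std::string& diskFileId,
                                     std::uint64_t archiveFileId) {
  ReportCallbacks cb;
  cb.report_url = "eosQuery://" + poolEndpoint + "/success/" + diskFileId +
                  "?archiveid=" + std::to_string(archiveFileId);
  cb.error_report_url = "eosQuery://" + poolEndpoint + "/error/" + diskFileId + "?error=";
  return cb;
}

int main() {
  const ReportCallbacks cb =
      makeArchiveCallbacks("193.109.172.111:36639", "0000BAFCB20151CF4EA4911A7FEFB599BD40", 222);
  std::cout << cb.report_url << "\n" << cb.error_report_url << "\n";
}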

The dCache client should fill in these values when it sends an archive request to CTA, and provide an endpoint to handle the callbacks. Perhaps @tigran can elaborate if some configuration is needed on the dCache side?

Michael

Hi Eli,

Welcome to the community.

Can you paste the output of hsm ls and hsm show providers on one of your pools running the dcache plugin?

Many thanks,
mwai

Hi Mwai,

We have the latest version of the plugin we could find in your repo, but this was also happening with v0.4.0. The output is the following:

[dccore01-pps] (local) admin > \s dc111_1 hsm ls
cta(cta):dcache-cta
   cta-user            cta
   cta-group           cta
   cta-instance-name   cta
   cta-frontend-addr   192.168.20.35:17017
awss3(awss3):script
   command             /opt/awss3/tfg/s3hsm.py
   s3bucket            pic-test-tags
   s3config            /opt/awss3/tfg/dteam.yml
   c:puts              8
   c:gets              20
   c:removes           5

[dccore01-pps] (local) admin > \s dc111_1 hsm show providers 
PROVIDER   DESCRIPTION
dcache-cta dCache Nearline Storage Driver for CTA. Version: 0.6.0 2022-07-20T09:33:38Z
script     Calls out to an HSM integration script.                                    
copy       Copies files to and from another directory.                                
link       Hard links flushed files in another directory.                             
tar        Bundles files into tar archives (not ready for production).

Thank you both for your replies :slight_smile:

Thanks Eli.

Could you kindly attach a snippet of the dCache frontend log from when this error happens (it should be in /var/log/messages at the moment, unless you have created a logging rule)? Also, what's the output of show pinboard on the pool when the request is submitted and after the report fails?

mwai

Hi Mwai,

These are the logs of the dcache frontend when I queue an archive request:

Aug 18 09:24:57 ctatps001 cta-frontend-grpc: LVL="INFO" PID="23282" TID="23799" MSG="Archive request"
Aug 18 09:24:57 ctatps001 cta-frontend-grpc: LVL="DEBUG" PID="23282" TID="23799" MSG="Archive request for storageClass: dteam.eli@cta" remoteHost="ipv4:193.109.172.111:36794" request="archive"
Aug 18 09:24:57 ctatps001 cta-frontend-grpc: LVL="INFO" PID="23282" TID="23799" MSG="Checked request and got next archive file ID" remoteHost="ipv4:193.109.172.111:36794" request="archive" instance="cta" username="cta" groupname="cta" storageClass="dteam.eli@cta" fileID="000002D1643DF8D741DD9E77A557DD2F14B3" instanceName="cta" usergroup="cta" fileId="681" catalogueTime="0.008666" schedulerDbTime="0.008666"
Aug 18 09:24:57 ctatps001 cta-frontend-grpc: LVL="DEBUG" PID="23282" TID="23799" MSG="In AgentReference::appyAction(): added object to ownership." agentObject="Frontend-ctatps001.pic.es-23282-20220818-09:24:24-0" ownedObject="ArchiveRequest-Frontend-ctatps001.pic.es-23282-20220818-09:24:24-0-0"
Aug 18 09:24:57 ctatps001 cta-frontend-grpc: LVL="INFO" PID="23282" TID="23799" MSG="In OStoreDB::queueArchive(): recorded request for queueing (enqueueing posted to thread pool)." remoteHost="ipv4:193.109.172.111:36794" request="archive" instance="cta" groupname="cta" fileID="000002D1643DF8D741DD9E77A557DD2F14B3" archiveID="681" jobObject="ArchiveRequest-Frontend-ctatps001.pic.es-23282-20220818-09:24:24-0-0" fileId="681" diskInstance="cta" diskFilePath="/000002D1643DF8D741DD9E77A557DD2F14B3" diskFileId="000002D1643DF8D741DD9E77A557DD2F14B3" agentReferencingTime="0.001086" insertionTime="0.000242" taskPostingTime="0.000146" taskQueueSize="1" totalTime="0.001474"
Aug 18 09:24:57 ctatps001 cta-frontend-grpc: LVL="INFO" PID="23282" TID="23799" MSG="Queued archive request" remoteHost="ipv4:193.109.172.111:36794" request="archive" instance="cta" groupname="cta" fileID="000002D1643DF8D741DD9E77A557DD2F14B3" archiveID="681" instanceName="cta" storageClass="dteam.eli@cta" diskFileID="000002D1643DF8D741DD9E77A557DD2F14B3" fileSize="1048576000" fileId="681" tapePool1="dteam" policyName="ctamp" policyArchiveMinAge="1" policyArchivePriority="1" diskFilePath="/000002D1643DF8D741DD9E77A557DD2F14B3" diskFileOwnerUid="1" diskFileGid="1" archiveReportURL="eosQuery://193.[...]9E77A557DD2F14B3?archiveid=681" archiveErrorReportURL="eosQuery://193.[...]8D741DD9E77A557DD2F14B3?error=" creationHost="ipv4:193.109.172.111:36794" creationTime="1660807497" creationUser="cta" requesterName="cta" requesterGroup="cta" srcURL="root://193.109.[...]D1643DF8D741DD9E77A557DD2F14B3" catalogueTime="0.000073" schedulerDbTime="0.001797" checksumType="ADLER32" checksumValue="a9ad0001"
Aug 18 09:24:57 ctatps001 cta-frontend-grpc: LVL="INFO" PID="23282" TID="23799" MSG="Archive request for storageClass: dteam.eli@cta archiveFileId: 681 RequestID: ArchiveRequest-Frontend-ctatps001.pic.es-23282-20220818-09:24:24-0-0" remoteHost="ipv4:193.109.172.111:36794" request="archive" instance="cta" groupname="cta" fileID="000002D1643DF8D741DD9E77A557DD2F14B3" archiveID="681" reqId="ArchiveRequest-Frontend-ctatps001.pic.es-23282-20220818-09:24:24-0-0"
Aug 18 09:24:57 ctatps001 cta-frontend-grpc: LVL="INFO" PID="23282" TID="23595" MSG="In Helpers::getLockedAndFetchedQueue<ArchiveQueue>(): Successfully found and locked an archive queue." attemptNb="1" queueObject="ArchiveQueueToTransferForUser-dteam-Frontend-ctatps001.pic.es-23282-20220818-09:24:24-0-1" rootFetchNoLockTime="0.000098" rootRelockExclusiveTime="0.007161" rootRefetchTime="0.000089" addOrGetQueueandCommitTime="0.004076" rootUnlockExclusiveTime="0.000024" queueLockTime="0.000016" queueFetchTime="0.000072"

And this is the snippet from show pinboard on the pool when the transfer finishes. At the end there is an Xrootd error that seems related to the EOSReporter:

18 Aug 2022 09:24:26 [dc111_1-22] [door:GFTP-dccore01-pps-AAXmft7jFMg@gridftp-dccore01-ppsDomain GFTP-dccore01-pps-AAXmft7jFMg PoolAcceptFile 000002D1643DF8D741DD9E77A557DD2F14B3] Creating new entry for 000002D1643DF8D741DD9E77A557DD2F14B3
18 Aug 2022 09:24:35 [MoverMapTransferService-transfer-service-4] [door:GFTP-dccore01-pps-AAXmft7jFMg@gridftp-dccore01-ppsDomain GFTP-dccore01-pps-AAXmft7jFMg PoolAcceptFile 000002D1643DF8D741DD9E77A557DD2F14B3] Transfer finished: 1048576000 bytes transferred in 9.31 seconds = 107.41138560687433 MB/s
18 Aug 2022 09:24:57 [flush] [] Flushing SCI=dteam.eli@cta;def=false;exp=0;pend=0;maxTotal=0;waiting=1
18 Aug 2022 09:24:57 [cta-grpc-worker-18] [] 864718f9-66b9-4dc9-8b30-1e0ea91ebd32 : 000002D1643DF8D741DD9E77A557DD2F14B3 : archive id 681, request: ArchiveRequest-Frontend-ctatps001.pic.es-23282-20220818-09:24:24-0-0
18 Aug 2022 09:24:59 [cta-datamover-worker-6] [] Request /dcpool/vpool1/data/000002D1643DF8D741DD9E77A557DD2F14B3 scheduling time: 1392 ms
18 Aug 2022 09:24:59 [cta-datamover-worker-6] [] Opening /dcpool/vpool1/data/000002D1643DF8D741DD9E77A557DD2F14B3 for reading.
18 Aug 2022 09:25:07 [cta-datamover-worker-6] [] Closing file /dcpool/vpool1/data/000002D1643DF8D741DD9E77A557DD2F14B3. Transferred 1000 MiB in 8.62 s, 116 MiB/s
18 Aug 2022 09:25:44 [post-transfer-51] [door:GFTP-dccore01-pps-AAXmft7jFMg@gridftp-dccore01-ppsDomain GFTP-dccore01-pps-AAXmft7jFMg PoolAcceptFile 000002D1643DF8D741DD9E77A557DD2F14B3] Unable to send message to topic billing on partition null, with key null and value MoverInfoMessage{dataTransferred=1048576000, connectionTime=9310, protocolInfo=GFtp-2.0 193.109.175.252 36466, fileCreated=true, initiator='door:GFTP-dccore01-pps-AAXmft7jFMg@gridftp-dccore01-ppsDomain:1660807465967000', isP2p=false, transferPath='/upload/5/93210194-fef5-4e55-9550-64ee352a412f/random_12', readBw='2.264904382208157E9', writeBw='7.143263280660648E8', readIdle='PT8.809077211S', readActive='PT0.4628381S', writeIdle='PT6.99134782S', writeActive='PT2.258573326S'} InfoMessage{cellType='pool', messageType='transfer', cellAddress=dc111_1@dc111_1Domain, timeQueued=0, resultCode=0, message='', timestamp=1660807484238, transaction='null', transactionID=107, subject=Subject:
Principal: EmailAddressPrincipal[eplanas@pic.es]
Principal: Origin[193.109.175.252]
Principal: UidPrincipal[31601]
Principal: GidPrincipal[46003,primary]
Principal: FQANPrincipal[/dteam,primary]
Principal: FQANPrincipal[/dteam/NGI_IBERGRID]
Principal: GroupNamePrincipal[dteam001,primary]
Principal: UserNamePrincipal[dteam001]
Principal: /DC=org/DC=terena/DC=tcs/C=ES/O=PIC/CN=Elena Planas eplanas@pic.es
} : Topic billing not present in metadata after 60000 ms.
18 Aug 2022 09:25:45 [cta-datamover-worker-7] [] XROOD query: /success/000002D1643DF8D741DD9E77A557DD2F14B3?archiveid=681
18 Aug 2022 09:26:46 [cta-datamover-worker-7] [] Unable to send message to topic billing on partition null, with key null and value StorageInfoMessage{transferTime=48467} InfoMessage{cellType='pool', messageType='store', cellAddress=dc111_1@dc111_1Domain, timeQueued=0, resultCode=0, message='', timestamp=1660807497706, transaction='null', transactionID=108, subject=null} : Topic billing not present in metadata after 60000 ms.
18 Aug 2022 09:26:46 [cta-datamover-worker-7] [] Flushed 000002D1643DF8D741DD9E77A557DD2F14B3 to nearline storage: [cta://cta/000002D1643DF8D741DD9E77A557DD2F14B3?archiveid=681]
18 Aug 2022 09:26:46 [cta-datamover-worker-7] [] Successful flushing: 000002D1643DF8D741DD9E77A557DD2F14B3 : archive id: 681
18 Aug 2022 09:26:46 [cta-datamover-worker-7] [] XROOD query: /success/000002D1643DF8D741DD9E77A557DD2F14B3?archiveid=681
18 Aug 2022 09:26:46 [cta-datamover-worker-7] [] Xrootd-Error-Response: ErrorNr=3000 ErrorMsg=Invalid request id

Eli

Hi Eli,

According to the log I see

18 Aug 2022 09:25:45 [cta-datamover-worker-7] [] XROOD query: /success/000002D1643DF8D741DD9E77A557DD2F14B3?archiveid=681
18 Aug 2022 09:26:46 [cta-datamover-worker-7] [] Flushed 000002D1643DF8D741DD9E77A557DD2F14B3 to nearline storage: [cta://cta/000002D1643DF8D741DD9E77A557DD2F14B3?archiveid=681]
18 Aug 2022 09:26:46 [cta-datamover-worker-7] [] Successful flushing: 000002D1643DF8D741DD9E77A557DD2F14B3 : archive id: 681

and later again

18 Aug 2022 09:26:46 [cta-datamover-worker-7] [] XROOD query: /success/000002D1643DF8D741DD9E77A557DD2F14B3?archiveid=681
18 Aug 2022 09:26:46 [cta-datamover-worker-7] [] Xrootd-Error-Response: ErrorNr=3000 ErrorMsg=Invalid request id

For whatever reason, the success notification comes twice for the same file with archiveId 681. As the first notification removes the request from the flush queue, the second one fails with Invalid request id.
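
Schematically (hypothetical names only, not the actual dcache-cta driver code), the pool side behaves like this: the pending flush is kept keyed by the request id, the first success report completes and removes it, and a second report for the same id no longer matches anything:

#include <iostream>
#include <map>
#include <string>

// Pattern sketch: the first notification completes the pending flush; a
// repeated notification for the same archive id finds nothing and is rejected.
std::map<std::string, std::string> pendingFlushes;  // archive id -> pnfsid

std::string handleSuccessReport(const std::string& archiveId) {
  auto it = pendingFlushes.find(archiveId);
  if (it == pendingFlushes.end()) {
    return "kXR_ArgInvalid: Invalid request id";  // what the second query gets back
  }
  pendingFlushes.erase(it);  // the first notification completes the flush
  return "OK";
}

int main() {
  pendingFlushes["681"] = "000002D1643DF8D741DD9E77A557DD2F14B3";
  std::cout << handleSuccessReport("681") << '\n';  // OK
  std::cout << handleSuccessReport("681") << '\n';  // Invalid request id
}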

Do you always see this behavior, or only for some files?

@Mwai do we see similar behavior at DESY?

Hi Tigran,

Indeed, we see this behavior with every file we archive.

I thought the success notification came twice because EOSReporter also tries twice to get a successful report (the object dump above shows maxreportretries: 2).

Hi Eli,

We can set up a quick Zoom session and try to understand the issue, if that's okay with you. I will be in touch via email.

thanks,
mwai

Hello everyone,

After our Zoom session we concluded that it might be a problem with our XRootD version and/or our compilation process.
We have been testing the public distribution of RPMs, both version 4 (Index of /cta-4) and version 5 (Index of /cta-5), and the result is always the same.
All our archive transfers, even if they complete successfully, end up in the failed requests list. As Tigran said in a previous post, it seems that cta-taped always tries to report twice that the transfer has succeeded.

In the pinboard we can see the message has changed a bit, but it still looks like the same problem:

07 Dec 2022 12:14:47 [cta-datamover-worker-6] [] XROOD query: /success/00006D3F3CCCAC2B47678862137E8D122AB2?archiveid=204
07 Dec 2022 12:14:47 [cta-datamover-worker-6] [] Xrootd-Error-Response: [session ?][connection [id: 0xb1215ba2, L:/193.109.172.111:35801 - R:/192.168.20.35:56110]][request 3001 kXR_query](error 3000, kXR_ArgInvalid, Invalid request id).

When we dump the object with cta-objectstore-dump-object, the logs are the same as before:

   "reportfailurelogs": [
    "Dec  2 17:47:20.389938 ctatps001 In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1",
    "Dec  2 17:47:35.392166 ctatps001 In EOSReporter::AsyncQueryHandler::HandleResponse(): failed to XrdCl::FileSystem::Query() [ERROR] Operation expired code:206 errNo:0 status:1"

We have also tested different versions of the dcache-cta plugin: 0.6.0, 0.7.0 and now 0.8.0, and we have recently upgraded to dCache 8.2 as well.

If that works for you, could we set up a meeting sometime next week to take another look at this issue together?

Many thanks!

Hi, just writing again here to say that this issue has been solved!
It's a bit odd, but we found out that the Kafka test machines for our dCache test instance had been down for a long time. After turning them back on for some other work, we saw that, surprisingly, that was it: the EOSReporter error was gone.
I still don't quite understand what Kafka has to do with the reporting process between CTA and dCache, but it feels great to finally be rid of this problem!