Recall failure on basic testing


I’m trying to deconstruct what EOS and CTA do, partly for testing our own file format and partly with an eye towards our eventual migration. To that end I’ve been doing the following:

  1. Run prepare_tests to get the tapes inserted and labeled
  2. Make placeholders in EOS with a modified version of eos-import-files
  3. Insert the relevant rows in the CTA database for the files I’m writing
  4. Write a fake tape with MHVTL

I finally got all of that to where I think it is right, but when I run

xrdfs ${EOSINSTANCE} prepare -s /eos/ctaeos/cta/etc/group

I get the following error (when I had mismatched database fields, I got useful feedback):

[root@cms-testing scripts]# eos root://ctaeos attr ls eos/ctaeos/cta/etc/group                                                        
sys.retrieve.error="trigger workflow - synchronous workflow failed"                                                                   

Any idea what this synchronous workflow failed error means or how I can get more information on what actually failed?


I found this in the taped log:

Apr 8 23:19:19.836948 tpsrv01 cta-taped: LVL="ERROR" PID="180" TID="266" MSG="[ReadFile::position] - Reading HDR1" thread="DiskWrite" tapeDrive="VDSTK11" tapeVid="V01005" mountId="1" threadCount="10" threadID="1" fileId="1" dstURL="root://ctaeos.cta.svc.cluster.local//eos/ctaeos/cta/etc/group?eos.lfn=fxid:7&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&[](" fSeq="1" received_archiveFileID="1" expected_NSBLOCKId="0" received_NSBLOCKId="-1" failed_Status="1"

Does that suggest I don’t have my tape headers quite correct?

“Synchronous workflow failed” suggests an error in queuing the file for retrieve rather than the retrieval operation itself. However, the taped error suggests that a tape was in fact mounted. Are this error and the log message from the same attempt? What messages do you see in the CTA Frontend log?

The other place to look to see why something failed is the failed requests (cta-admin failedrequests ls), which lists all requests that CTA has given up on (all retry attempts exhausted). cta-admin --json fr ls -l will also give the error log messages associated with each request.
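As a convenience, the JSON output of cta-admin --json fr ls -l can be filtered with a short script. Note the field names below (requestId, failureLogs) and the sample record are illustrative assumptions for the sketch, not the actual cta-admin schema:

```python
import json

# Hypothetical sample of what `cta-admin --json fr ls -l` might emit;
# the real schema may differ -- these field names are assumptions.
sample = '''
[
  {"requestId": "RetrieveRequest-0-0", "tapeVid": "V01005",
   "failureLogs": ["[ReadFile::position] - Reading HDR1"]}
]
'''

def failed_request_errors(json_text):
    """Return (requestId, error message) pairs for every failure log entry."""
    pairs = []
    for req in json.loads(json_text):
        for msg in req.get("failureLogs", []):
            pairs.append((req.get("requestId"), msg))
    return pairs

for rid, msg in failed_request_errors(sample):
    print(rid, "->", msg)
```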

The taped error does indeed suggest that the virtual tape does not have a valid header.
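For reference, the HDR1 record the drive failed to read is a fixed 80-byte ANSI/ECMA-13-style tape label. A minimal sketch of building one follows; the field widths match the standard, but the concrete values (dates, system code, volume id) are placeholders, not what CTA actually writes:

```python
def hdr1(file_id: str, file_seq: int, file_set: str = "V01005") -> bytes:
    """Build an 80-byte ANSI/ECMA HDR1 label record (illustrative sketch)."""
    rec = (
        "HDR1"
        + file_id.ljust(17)[:17]   # file identifier
        + file_set.ljust(6)[:6]    # file set (volume) identifier
        + "0001"                   # file section number
        + f"{file_seq:04d}"        # file sequence number
        + "0001"                   # generation number
        + "00"                     # generation version number
        + " 22098"                 # creation date (space + YYDDD), placeholder
        + " 00000"                 # expiration date, placeholder
        + " "                      # accessibility
        + "000000"                 # block count (zero in a header label)
        + "CTA".ljust(13)          # system code, placeholder
        + " " * 7                  # reserved
    )
    assert len(rec) == 80          # every label record is exactly 80 bytes
    return rec.encode("ascii")
```

If the record the drive returns at that position is not 80 bytes starting with "HDR1", the reader fails exactly as in the taped log above.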

Thanks Michael. TL;DR I solved my problem. The “dd” command I was using was putting in a bunch of extra file marks, so no surprise CTA couldn’t read the header. Once I fixed that, all is working and I can recall a file from CTA that CTA never wrote.
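The failure mode is easy to model: a drive positions to file N by counting file marks, so every spurious mark shifts what the reader sees. A toy simulation of that positioning logic (not CTA code, and not the full AUL label layout):

```python
FM = object()  # sentinel representing a tape file mark

def read_file(tape, fseq):
    """Skip (fseq - 1) file marks, then return records up to the next mark."""
    i = 0
    for _ in range(fseq - 1):
        while tape[i] is not FM:
            i += 1
        i += 1  # step past the mark
    out = []
    while i < len(tape) and tape[i] is not FM:
        out.append(tape[i])
        i += 1
    return out

# Correctly written tape: label records, then a mark, then the data
good = ["VOL1...", "HDR1...", "HDR2...", FM, "data", FM]
# dd inserted an extra mark after the first record
bad = ["VOL1...", FM, "HDR1...", "HDR2...", FM, "data", FM]

print(read_file(good, 1))  # ['VOL1...', 'HDR1...', 'HDR2...']
print(read_file(bad, 1))   # ['VOL1...']  -- the header set is cut short
```

On the "bad" tape the reader never reaches HDR1 before hitting a mark, which matches the "Reading HDR1" error above.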

To answer your question about what was there, though: I didn’t get the failed requests, since my authentication had expired and by then I had noticed all the extra file marks, so I restarted. But in the frontend logs I see this:

Apr 8 23:19:17.524814 ctafrontend cta-frontend: LVL="INFO" PID="207" TID="756" MSG="In RequestMessage::process(): processing SSI event" user="ctaeos@ctafrontend" eventType="PREPARE" eosInstance="ctaeos" diskFilePath="/eos/ctaeos/cta/etc/**group**" diskFileId="7"

Apr 8 23:19:17.588637 ctafrontend cta-frontend: LVL="INFO" PID="207" TID="756" MSG="In OStoreDB::queueRetrieve(): recorded request for queueing (enqueueing posted to thread pool)." user="ctaeos@ctafrontend" tapeVid="V01005" jobObject="RetrieveRequest-Frontend-ctafrontend-207-20220408-23:07:51-0-0" fileId="1" diskInstance="ctaeos" diskFilePath="/eos/ctaeos/cta/etc/**group**" diskFileId="7" vidSelectionTime="0.008642" agentReferencingTime="0.000567" insertionTime="0.000271" taskPostingTime="0.000217" taskQueueSize="1" totalTime="0.009697"

Apr 8 23:19:17.588901 ctafrontend cta-frontend: LVL="INFO" PID="207" TID="756" MSG="Queued retrieve request" user="ctaeos@ctafrontend" fileId="1" instanceName="ctaeos" diskFilePath="/eos/ctaeos/cta/etc/**group**" diskFileOwnerUid="1000" diskFileGid="1000" dstURL="root://ctaeos.cta.svc.cluster.local//eos/ctaeos/cta/etc/**group**?eos.lfn=fxid:7&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&[](" errorReportURL="eosQuery://ctaeos.cta.svc.cluster.local//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=7&mgm.logid=cta&mgm.event=sync::retrieve_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&mgm.errmsg=" creationHost="ctafrontend" creationTime="1649452757" creationUser="ctaeos" requesterName="poweruser1" requesterGroup="powerusers" criteriaArchiveFileId="1" criteriaCreationTime="1649452723" criteriaDiskFileId="7" criteriaDiskFileOwnerUid="1000" criteriaDiskInstance="ctaeos" criteriaFileSize="387" reconciliationTime="1649452723" storageClass="ctaStorageClass" checksumType="ADLER32" checksumValue="723878c3" tapeTapefile0="(vid=V01005 fSeq=1 blockId=0 fileSize=387 copyNb=1 creationTime=1649452723)" selectedVid="V01005" verifyOnly="0" catalogueTime="0.009505" schedulerDbTime="0.052336" policyName="ctasystest" policyMinAge="1" policyPriority="1" retrieveRequestId="RetrieveRequest-Frontend-ctafrontend-207-20220408-23:07:51-0-0"

Apr 8 23:19:17.601829 ctafrontend cta-frontend: LVL="INFO" PID="207" TID="228" MSG="In OStoreDB::queueRetrieve_bottomHalf(): added job to queue (enqueueing finished)." tapeVid="V01005" queueObject="RetrieveQueueToTransferForUser-V01005-Frontend-ctafrontend-207-20220408-23:07:51-0-1" jobObject="RetrieveRequest-Frontend-ctafrontend-207-20220408-23:07:51-0-0" fileId="1" diskInstance="ctaeos" diskFilePath="/eos/ctaeos/cta/etc/**group**" diskFileId="7" requestLockTime="0.000021" queueingTime="0.011781" commitTime="0.000152" queueUnlockTime="0.000165" requestUnlockTime="0.000005" agentOwnershipResetTime="0.000275" totalTime="0.012399"

I grepped on the file name so it’s possible there was more which did actually indicate a failure.
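As an aside, log lines in this key="value" format can be pulled apart mechanically rather than grepped; a small sketch (the sample line below is abbreviated from the frontend log above):

```python
import re

def parse_log(line):
    """Extract key="value" pairs from a cta-frontend/cta-taped log line."""
    return dict(re.findall(r'(\w+)="([^"]*)"', line))

line = ('Apr 8 23:19:17.588901 ctafrontend cta-frontend: LVL="INFO" PID="207" '
        'TID="756" MSG="Queued retrieve request" fileId="1" tapeVid="V01005"')

fields = parse_log(line)
print(fields["MSG"], fields["tapeVid"])  # Queued retrieve request V01005
```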

Hi Eric,

I’m glad you figured out how to create a dummy tape and can read it back in CTA.

Those log messages indicate a successful retrieve queueing request, though the filename of the file being recalled looks strange: /eos/ctaeos/cta/etc/**group**, with asterisks in the name?

Also, the log messages do not correspond to the extended attributes on the file you listed above. That filename is /eos/ctaeos/cta/etc/group (without asterisks), so it appears not to be the same disk file. The timestamp of the queueing request is also different: in the file metadata, the queue object was created at 23:07:51, which is earlier than the log messages at 23:19:19.

Anyway, you solved your problem, so I guess there’s no need to dig any further.

I suspect the **group** is just leftover cruft from the colorizing grep I used. I’ll be more careful in the future; I didn’t notice it in my terminal.

Yes, that makes sense! So it was the same file.