Hi everyone,
I am having the strangest issue with prepare/stage requests: the request's dstURL points at localhost instead of at the URL of the MGM. The only change we made recently was upgrading our CTA stack (old and new software versions are listed at the end). Here is what happens:
xrdfs mgm prepare -s /eos/cta/repoa/data/new.tar # Issue stage request (run on mgm host)
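(For what it's worth, the replica state of the file can be checked on the MGM before and after the prepare; the path is just the test file from above.)
eos file info /eos/cta/repoa/data/new.tar # lists the disk replicas and tape copies of the file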
WFE.log:
210709 04:58:04 INFO WFE:1615 default SYNC::PREPARE /eos/cta/repoa/data/new.tar cta-frontend-ctafrontend.cta.svc.cluster.archive:10955 fxid=0002695a mgm.reqid="04460a3600d0:3cf7fcd8.60e69807:15:1625806684"
210709 04:58:05 INFO WFE:2460 protoWFEndPoint="cta-frontend-ctafrontend.cta.svc.cluster.archive:10955" protoWFResource="/ctafrontend" fullPath="/eos/cta/repoa/data/new.tar" event="sync::prepare" timeSpentMs=118 msg="Sent SSI protocol buffer request"
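(As far as I understand, the SYNC::PREPARE event above is driven by the sys.workflow extended attributes on the parent directory; they can be dumped on the MGM with the command below, included only for completeness — directory path reused from above.)
eos attr ls /eos/cta/repoa/data # lists sys.workflow.* and the other extended attributes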
cta-frontend.log:
Jul 9 04:58:05.059022 cta-frontend-ctafrontend-0 cta-frontend: LVL="INFO" PID="1" TID="12804" MSG="Queued retrieve request" fileId="21" instanceName="cta" diskFilePath="/eos/cta/repoa/data/new.tar" diskFileOwnerUid="48" diskFileGid="48" dstURL="root://localhost//eos/cta/repoa/data/new.tar?eos.lfn=fxid:2695a&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=default&oss.asize=754688000" errorReportURL="eosQuery://localhost//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=2695a&mgm.logid=cta&mgm.event=sync::retrieve_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&mgm.errmsg=" creationHost="10-54-0-208.cta-mgm.cta.svc.cluster.archive" creationTime="1625806684" creationUser="cta" requesterName="daemon" requesterGroup="daemon" criteriaArchiveFileId="21" criteriaCreationTime="1625726822" criteriaDiskFileId="158042" criteriaDiskFileOwnerUid="48" criteriaDiskInstance="cta" criteriaFileSize="754688000" reconciliationTime="1625726822" storageClass="single-copy-backup" checksumType="ADLER32" checksumValue="3dedc9e2" tapeTapefile0="(vid=A00764 fSeq=2 blockId=11 fileSize=754688000 copyNb=1 creationTime=1625726822)" selectedVid="A00764" verifyOnly="0" catalogueTime="0.025525" schedulerDbTime="0.091562" policyName="default-mp" policyMinAge="1" policyPriority="1" retrieveRequestId="RetrieveRequest-Frontend-cta-frontend-ctafrontend-0-1-20210708-06:31:09-0-1575"
The file is read from tape, but then fails to be written to disk.
cta-tapeserver.log:
Jul 9 03:05:03.560357 cta-tapeserver-ctatapeserver-0 cta-taped: LVL="ERROR" PID="18715" TID="18809" MSG="In RecallReportPacker::ReportError::execute(): failing retrieve job after exception." thread="RecallReportPacker" tapeDrive="DR0" mountId="21" failureLog="Jul 9 03:05:03.560250 cta-tapeserver-ctatapeserver-0 In XrootWriteFile::XrootWriteFile failed XrdCl::File::Open() on root://localhost//eos/cta/repoa/data/new.tar?eos.lfn=fxid:2695a&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=default&oss.asize=754688000 [FATAL] Connection error code:108 errNo:0 status:3" fileId="21"
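If I read the failure correctly, the tapeserver is simply being told to write to localhost, where no xrootd server is listening, so any open via root://localhost from that pod should fail the same way. A quick hand test from the tapeserver pod (the destination path here is a made-up scratch name, not the real staging URL):
xrdcp /etc/hosts root://localhost//eos/cta/repoa/data/xrdcp_test # expect the same connection error if nothing listens on localhost:1094
Replacing localhost with the MGM hostname is what I would expect the dstURL to contain.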
From what I understand, the MGM is responsible for constructing the dstURL in the protobuf request it sends to the CTA frontend, and that dstURL is derived from the space configuration. Here is our space config:
[root@eos-mgm-0 /]# eos space ls
┌──────────┬────────────────┬────────────┬────────────┬──────┬─────────┬───────────────┬──────────────┬─────────────┬─────────────┬──────┬──────────┬───────────┬───────────┬──────┬────────┬───────────┬──────┬────────┬───────────┐
│type │ name│ groupsize│ groupmod│ N(fs)│ N(fs-rw)│ sum(usedbytes)│ sum(capacity)│ capacity(rw)│ nom.capacity│ quota│ balancing│ threshold│ converter│ ntx│ active│ wfe│ ntx│ active│ intergroup│
└──────────┴────────────────┴────────────┴────────────┴──────┴─────────┴───────────────┴──────────────┴─────────────┴─────────────┴──────┴──────────┴───────────┴───────────┴──────┴────────┴───────────┴──────┴────────┴───────────┘
spaceview default 20 24 40 40 1.90 TB 239.96 TB 239.96 TB 0 B off off 20 on 2 0 on 10 0 off
spaceview tape 0 24 1 0 0 B 0 B 0 B 0 B off off 20 off 2 0 off 1 0 off
[root@eos-mgm-0 /]# eos space status default
------------------------------------------------------------------------------------
Space Variables
…
autorepair := on
balancer := off
balancer.node.ntx := 2
balancer.node.rate := 25
balancer.threshold := 20
converter := on
converter.ntx := 2
drainer.node.nfs := 5
drainer.node.ntx := 2
drainer.node.rate := 25
drainperiod := 86400
filearchivedgc := off
geobalancer := off
geobalancer.ntx := 10
geobalancer.threshold := 5
graceperiod := 86400
groupbalancer := off
groupbalancer.ntx := 10
groupbalancer.threshold := 5
groupmod := 24
groupsize := 20
quota := off
scan_disk_interval := 14400
scan_ns_interval := 259200
scan_ns_rate := 50
scaninterval := 604800
scanrate := 100
tgc.availbytes := 1000000000000000000
tgc.qryperiodsecs := 320
tgc.totalbytes := 0
tracker := off
wfe := on
wfe.interval := 10
wfe.ntx := 10
[root@eos-mgm-0 /]# eos space status tape
------------------------------------------------------------------------------------
Space Variables
…
autorepair := off
balancer := off
balancer.node.ntx := 2
balancer.node.rate := 25
balancer.threshold := 20
converter := off
converter.ntx := 2
drainer.node.nfs := 5
drainer.node.ntx := 2
drainer.node.rate := 25
drainperiod := 86400
filearchivedgc := off
geobalancer := off
geobalancer.ntx := 10
geobalancer.threshold := 5
graceperiod := 86400
groupbalancer := off
groupbalancer.ntx := 10
groupbalancer.threshold := 5
groupmod := 24
groupsize := 0
quota := off
scan_disk_interval := 14400
scan_ns_interval := 259200
scan_ns_rate := 50
scaninterval := 604800
scanrate := 100
tgc.availbytes := 0
tgc.qryperiodsecs := 320
tgc.totalbytes := 1000000000000000000
wfe := off
wfe.interval := 10
wfe.ntx := 1
I believe the default space is configured as the destination space for files coming back from tape, via the mgmofs.prepare.dest.space directive in xrd.cf.mgm.
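For context, the relevant part of our xrd.cf.mgm looks roughly like the snippet below. The endpoint and resource values match the protoWFEndPoint/protoWFResource reported in the WFE log above; the prepare.dest.space value is what I believe we set, so treat these lines as illustrative rather than a verbatim copy of our config:
# CTA workflow endpoint and destination space (reconstructed from the WFE log; dest space assumed)
mgmofs.protowfendpoint cta-frontend-ctafrontend.cta.svc.cluster.archive:10955
mgmofs.protowfresource /ctafrontend
mgmofs.prepare.dest.space default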
Versions:
cta: 4.0-2 (old: 3.1-5)
eos: 4.8.45 (old: 4.8.3)
xrootd: 4.12.8 (old: a mix of 4.12.8 and 4.12.3)
Has anyone run into this before, or does anyone have insight into how the dstURL is constructed?