Incorrect dstURL passed to cta on retrieve request

Hi everyone,
I am having the strangest issue when doing prepare/stage requests: the request's dstURL points at localhost instead of the URL of the MGM. The only change we made recently was upgrading our CTA stack; I will post the software versions (old and new) at the end. Here is what happens:

xrdfs mgm prepare -s /eos/cta/repoa/data/new.tar # Issue stage request (run on mgm host)

WFE.log:
210709 04:58:04 INFO WFE:1615 default SYNC::PREPARE /eos/cta/repoa/data/new.tar cta-frontend-ctafrontend.cta.svc.cluster.archive:10955 fxid=0002695a mgm.reqid="04460a3600d0:3cf7fcd8.60e69807:15:1625806684"
210709 04:58:05 INFO WFE:2460 protoWFEndPoint="cta-frontend-ctafrontend.cta.svc.cluster.archive:10955" protoWFResource="/ctafrontend" fullPath="/eos/cta/repoa/data/new.tar" event="sync::prepare" timeSpentMs=118 msg="Sent SSI protocol buffer request"

cta-frontend.log:
Jul 9 04:58:05.059022 cta-frontend-ctafrontend-0 cta-frontend: LVL=“INFO” PID=“1” TID=“12804” MSG=“Queued retrieve request” fileId=“21” instanceName=“cta” diskFilePath="/eos/cta/repoa/data/new.tar" diskFileOwnerUid=“48” diskFileGid=“48” dstURL=“root://localhost//eos/cta/repoa/data/new.tar?eos.lfn=fxid:2695a&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=default&oss.asize=754688000” errorReportURL=“eosQuery://localhost//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=2695a&mgm.logid=cta&mgm.event=sync::retrieve_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&mgm.errmsg=” creationHost=“10-54-0-208.cta-mgm.cta.svc.cluster.archive” creationTime=“1625806684” creationUser=“cta” requesterName=“daemon” requesterGroup=“daemon” criteriaArchiveFileId=“21” criteriaCreationTime=“1625726822” criteriaDiskFileId=“158042” criteriaDiskFileOwnerUid=“48” criteriaDiskInstance=“cta” criteriaFileSize=“754688000” reconciliationTime=“1625726822” storageClass=“single-copy-backup” checksumType=“ADLER32” checksumValue=“3dedc9e2” tapeTapefile0="(vid=A00764 fSeq=2 blockId=11 fileSize=754688000 copyNb=1 creationTime=1625726822)" selectedVid=“A00764” verifyOnly=“0” catalogueTime=“0.025525” schedulerDbTime=“0.091562” policyName=“default-mp” policyMinAge=“1” policyPriority=“1” retrieveRequestId=“RetrieveRequest-Frontend-cta-frontend-ctafrontend-0-1-20210708-06:31:09-0-1575”

The file is read from tape, but then fails to be written to disk.
cta-tapeserver.log:
Jul 9 03:05:03.560357 cta-tapeserver-ctatapeserver-0 cta-taped: LVL=“ERROR” PID=“18715” TID=“18809” MSG=“In RecallReportPacker::ReportError::execute(): failing retrieve job after exception.” thread=“RecallReportPacker” tapeDrive=“DR0” mountId=“21” failureLog=“Jul 9 03:05:03.560250 cta-tapeserver-ctatapeserver-0 In XrootWriteFile::XrootWriteFile failed XrdCl::File::Open() on root://localhost//eos/cta/repoa/data/new.tar?eos.lfn=fxid:2695a&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=default&oss.asize=754688000 [FATAL] Connection error code:108 errNo:0 status:3” fileId=“21”
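For what it's worth, the open failure is independent of the tape side and is reproducible by hand; from the tapeserver pod something like the following fails the same way (illustrative check, not taken from our logs):

xrdfs localhost stat /eos/cta/repoa/data/new.tar   # fails: nothing is listening for XRootD on localhost there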

From what I understand, the MGM is responsible for constructing the dstURL in the protobuf request, and that dstURL is built from the MGM's space configuration. Here is our space config:

[root@eos-mgm-0 /]# eos space ls
┌──────────┬────────────────┬────────────┬────────────┬──────┬─────────┬───────────────┬──────────────┬─────────────┬─────────────┬──────┬──────────┬───────────┬───────────┬──────┬────────┬───────────┬──────┬────────┬───────────┐
│type │ name│ groupsize│ groupmod│ N(fs)│ N(fs-rw)│ sum(usedbytes)│ sum(capacity)│ capacity(rw)│ nom.capacity│ quota│ balancing│ threshold│ converter│ ntx│ active│ wfe│ ntx│ active│ intergroup│
└──────────┴────────────────┴────────────┴────────────┴──────┴─────────┴───────────────┴──────────────┴─────────────┴─────────────┴──────┴──────────┴───────────┴───────────┴──────┴────────┴───────────┴──────┴────────┴───────────┘
spaceview default 20 24 40 40 1.90 TB 239.96 TB 239.96 TB 0 B off off 20 on 2 0 on 10 0 off
spaceview tape 0 24 1 0 0 B 0 B 0 B 0 B off off 20 off 2 0 off 1 0 off

[root@eos-mgm-0 /]# eos space status default

------------------------------------------------------------------------------------

Space Variables

autorepair := on
balancer := off
balancer.node.ntx := 2
balancer.node.rate := 25
balancer.threshold := 20
converter := on
converter.ntx := 2
drainer.node.nfs := 5
drainer.node.ntx := 2
drainer.node.rate := 25
drainperiod := 86400
filearchivedgc := off
geobalancer := off
geobalancer.ntx := 10
geobalancer.threshold := 5
graceperiod := 86400
groupbalancer := off
groupbalancer.ntx := 10
groupbalancer.threshold := 5
groupmod := 24
groupsize := 20
quota := off
scan_disk_interval := 14400
scan_ns_interval := 259200
scan_ns_rate := 50
scaninterval := 604800
scanrate := 100
tgc.availbytes := 1000000000000000000
tgc.qryperiodsecs := 320
tgc.totalbytes := 0
tracker := off
wfe := on
wfe.interval := 10
wfe.ntx := 10
[root@eos-mgm-0 /]# eos space status tape

------------------------------------------------------------------------------------

Space Variables

autorepair := off
balancer := off
balancer.node.ntx := 2
balancer.node.rate := 25
balancer.threshold := 20
converter := off
converter.ntx := 2
drainer.node.nfs := 5
drainer.node.ntx := 2
drainer.node.rate := 25
drainperiod := 86400
filearchivedgc := off
geobalancer := off
geobalancer.ntx := 10
geobalancer.threshold := 5
graceperiod := 86400
groupbalancer := off
groupbalancer.ntx := 10
groupbalancer.threshold := 5
groupmod := 24
groupsize := 0
quota := off
scan_disk_interval := 14400
scan_ns_interval := 259200
scan_ns_rate := 50
scaninterval := 604800
scanrate := 100
tgc.availbytes := 0
tgc.qryperiodsecs := 320
tgc.totalbytes := 1000000000000000000
wfe := off
wfe.interval := 10
wfe.ntx := 1

I believe the default space is configured as the space files are written back to on retrieve, via the mgmofs.prepare.dest.space directive in xrd.cf.mgm.
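For reference, that line for a default-space setup would look like this (illustrative; the value matches the eos.space=default visible in the dstURL above):

mgmofs.prepare.dest.space default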

Versions:
cta: 4.0-2 (old: 3.1-5)
eos: 4.8.45 (old: 4.8.3)
xrootd: 4.12.8 (old was a mix of 4.12.8 and 4.12.3)

Has anyone run into this before, or does anyone have insight into how the dstURL is constructed?

Hi Denis,

Do you have a log message from before the upgrade showing the value of dstURL?

Changing the CTA version won't change the dstURL. As you say, it is sent by the MGM, so possibly changing the EOS version could make a difference.

In the EOS code, the hostname that goes into dstURL can be the MGM hostname or its alias if set:

  // the alias, when configured, takes precedence over the plain hostname
  if (gOFS->MgmOfsAlias.length()) {
    mgmHostName = gOFS->MgmOfsAlias.c_str();
  } else if (gOFS->HostName != nullptr) {
    mgmHostName = gOFS->HostName;
  } // else error
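So it is worth checking whether an alias is configured anywhere for your MGM, e.g. (standard EOS config locations assumed, adjust to your setup):

grep -i alias /etc/xrd.cf.mgm /etc/sysconfig/eos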

I’ll ask the EOS team to see if someone has an idea why this might have changed.

Hi Denis,

What output did you get for mgmofs.alias when you started your MGM daemon? I am not aware of any modification on the EOS side that would lead to such an outcome.
Also, when the MGM starts, the MGM_HOST variable is printed in the logs. Can you paste that log line as well?
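Both are printed at MGM startup, so something like this should pull them out, assuming the default log location:

grep -e 'mgmofs.alias' -e 'MGM_HOST=' /var/log/eos/mgm/xrdlog.mgm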

Thanks,
Elvin

Hello Michael and Elvin,

Thanks very much for such a prompt reply. I dug around and found that mgmofs.alias was not set in xrd.cf.mgm at all, while EOS_MGM_ALIAS was set to localhost in /etc/sysconfig/eos. This produced these log lines:

=====> mgmofs.alias: localhost
210711 23:30:21 time=1626046221.785362 func=Configure level=NOTE logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eos-mgm-0.mgm.cta.svc.cluster.archive:1094 tid=00007f9cad75d780 source=XrdMgmOfsConfigure:1537 tident= sec= uid=0 gid=0 name= geo="" MGM_HOST=eos-mgm-0.mgm.cta.svc.cluster.archive MGM_PORT=1094 VERSION=4.8.45 RELEASE=1 *******

After setting EOS_MGM_ALIAS=eos-mgm-0.mgm.cta.svc.cluster.archive in /etc/sysconfig/eos, we get:

=====> mgmofs.alias: eos-mgm-0.mgm.cta.svc.cluster.archive
210711 23:46:44 time=1626047204.722153 func=Configure level=NOTE logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eos-mgm-0.mgm.cta.svc.cluster.archive:1094 tid=00007f70bff28780 source=XrdMgmOfsConfigure:1537 tident= sec= uid=0 gid=0 name= geo="" MGM_HOST=eos-mgm-0.mgm.cta.svc.cluster.archive MGM_PORT=1094 VERSION=4.8.45 RELEASE=1

cta-frontend.log:
Jul 11 23:51:42.445939 cta-frontend-ctafrontend-0 cta-frontend: LVL=“INFO” PID=“1” TID=“63842” MSG=“Queued retrieve request” fileId=“21” instanceName=“cta” diskFilePath="/eos/cta/repoa/data/new.tar" diskFileOwnerUid=“48” diskFileGid=“48” dstURL=“root://eos-mgm-0.mgm.cta.svc.cluster.archive//eos/cta/repoa/data/new.tar?eos.lfn=fxid:2695a&eos.ruid=0&eos.rgid=0&eos.injection=1&eos.workflow=retrieve_written&eos.space=default&oss.asize=754688000” errorReportURL=“eosQuery://eos-mgm-0.mgm.cta.svc.cluster.archive//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=2695a&mgm.logid=cta&mgm.event=sync::retrieve_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&mgm.errmsg=” creationHost=“10-54-0-38.cta-mgm.cta.svc.cluster.archive” creationTime=“1626047502” creationUser=“cta” requesterName=“daemon” requesterGroup=“daemon” criteriaArchiveFileId=“21” criteriaCreationTime=“1625726822” criteriaDiskFileId=“158042” criteriaDiskFileOwnerUid=“48” criteriaDiskInstance=“cta” criteriaFileSize=“754688000” reconciliationTime=“1625726822” storageClass=“single-copy-backup” checksumType=“ADLER32” checksumValue=“3dedc9e2” tapeTapefile0="(vid=A00764 fSeq=2 blockId=11 fileSize=754688000 copyNb=1 creationTime=1625726822)" selectedVid=“A00764” verifyOnly=“0” catalogueTime=“0.008985” schedulerDbTime=“0.025124” policyName=“default-mp” policyMinAge=“1” policyPriority=“1” retrieveRequestId=“RetrieveRequest-Frontend-cta-frontend-ctafrontend-0-1-20210708-06:31:09-0-32730”

Retrieves now work successfully. I suspect the change in behaviour came with the EOS version upgrade, as I cannot find anything else we changed (other than the upgrades).
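For anyone who runs into the same thing, the entire fix on our side was this one line in /etc/sysconfig/eos (our MGM's FQDN shown, substitute your own), followed by an MGM restart:

# /etc/sysconfig/eos
# was: EOS_MGM_ALIAS=localhost   (which ended up verbatim in dstURL and errorReportURL)
EOS_MGM_ALIAS=eos-mgm-0.mgm.cta.svc.cluster.archive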

Thank you again!

Denis