. You are right. And we already have encountered such tape server problems…
And now we find another problem about cta-send-closew.sh
or cta-send-event
command
.
We have cta-frontend and eos mgm server running on the same node, and EOS_MGM_URL
for user biyj is set to root://besh01./
. We know that cta and eos should be seperated in different nodes for production usage, but now we only have several nodes for test.
When a file is failed to archive, we use the command to trigger archival again:
$ cta-send-closew.sh /eos/laso/lto7/a.img
And the logs fromo cta-frontend
are:
210422 14:46:40 2450889 XrootdXeq: biyj.2478442:82@besh01 pvt IPv4 login as eoslaso
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Prepare():
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Resource name: /ctafrontend
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Resource user:
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Resource info:
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Hosts to avoid:
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Affinity: Default
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Resource options:
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Resource Client:
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Protocol: sss
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Name: eoslaso
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Host: besh01
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Vorg: NULL
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Role: NULL
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Grps: eoslaso
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Endorsements: NULL
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Creds: NULL
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Credslen: 0
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Rsvd: 0
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Tident:
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_open: /ctafrontend prepared.
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_write: 0:/ctafrontend rsz=1041 wsz=1041
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_Activate: 0:/ctafrontend [new wtReq] oucbuff rqsz=1041
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_DoIt: 0:/ctafrontend [begun xqReq] Calling service processor
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 Called RequestProc() constructor
210422 14:46:40 2452674 ssi_Pb::Service: pid:2447760 tid:140165189162752 ProcessRequest(): Binding Processor to Request
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_BindDone: 0:/ctafrontend [begun xqReq] Bind called; for request 0
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 Called Execute()
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_GetRequest: 0:/ctafrontend [bound xqReq] sz=1041
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 RequestProc(): received 1041 bytes
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_fctl: 0:/ctafrontend query resp status
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_fctl: 0:/ctafrontend resp not ready
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_Done: 0:/ctafrontend [bound xqReq] wtrsp sent; resp pend
{
"notification": {
"wf": {
"event": "CLOSEW",
"instance": {
"name": "eoslaso",
"url": "root://localhost:1094//eos/laso/lto7/a.img?eos.lfn=fxid:00000085"
},
"requesterInstance": "cta-send-event",
"verifyOnly": false,
"vid": ""
},
"cli": {
"user": {
"username": "biyj",
"groupname": "u07"
}
},
"transport": {
"dstUrl": "",
"reportUrl": "eosQuery://localhost:1094//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=00000085&mgm.logid=cta&mgm.event=sync::archived&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&cta_archive_file_id=290",
"errorReportUrl": "eosQuery://localhost:1094//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=00000085&mgm.logid=cta&mgm.event=sync::archive_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&cta_archive_file_id=290&mgm.errmsg="
},
"file": {
"fid": "133",
"pid": "0",
"owner": {
"uid": 12142,
"gid": 600
},
"size": "1073741824",
"mode": 0,
"lpath": "/eos/laso/lto7/a.img",
"xattr": {
"sys.vtrace": "[Thu Apr 22 13:10:52 2021] uid:12142[biyj] gid:600[u07] tident:biyj.2462824:365@localhost name:biyj dn: prot:unix host:localhost domain:localdomain geo: sudo:0",
"sys.eos.btime": "1619068252.863695240",
"sys.archive.file_id": "290",
"sys.utrace": "1c3e23aa-a329-11eb-a50c-38eaa7a302a6",
"sys.cta.archive.objectstore.id": "ArchiveRequest-Frontend-besh01-2447760-20210422-11:41:13-0-55",
"sys.archive.storage_class": "laso-l7"
},
"csb": {
"cs": [
{
"type": "ADLER32",
"value": "RKJo+g=="
}
]
}
}
},
"clientCtaVersion": "4.0-1",
"clientXrootdSsiProtobufInterfaceVersion": "v0.98"
}
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_RelReqBuff: 0:/ctafrontend [bound xqReq] called
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 RequestProc(): sending metadata:
{
"type": "RSP_SUCCESS",
"xattr": {
"sys.cta.objectstore.id": "ArchiveRequest-Frontend-besh01-2447760-20210422-11:41:13-0-81"
},
"messageTxt": "",
"showHeader": "NONE"
}
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_ProcessResponse: 0:/ctafrontend [bound xqReq] Response presented wtr=True
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_ProcessResponse: 0:/ctafrontend [bound doRsp] Resp data sz=0
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_WakeUp: 0:/ctafrontend [bound doRsp] respCBarg=10052000000f0
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_AttnInfo: 0:/ctafrontend 91 byte metadata (0x080112570a1673...) sent.
210422 14:46:40 2447769 biyj.2478442:82@besh01 Ssi_Done: 0:/ctafrontend [bound odRsp] resp sent; no additional data remains
210422 14:46:40 2447769 biyj.2478442:82@besh01 Ssi_Finalize: 0:/ctafrontend [done odRsp] Calling Finished(False)
210422 14:46:40 2447769 ssi_Pb::RequestProc: pid:2447760 tid:140165855815424 Called Finished()
210422 14:46:40 2452674 ssi_Pb::Service: pid:2447760 tid:140165189162752 ProcessRequest(): Unbinding Processor from Request
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_Dispose: 0:n/a [done odRsp] Recycling request...
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 Called ~RequestProc() destructor
210422 14:46:40 2448550 biyj.2478442:82@besh01 Ssi_close: /ctafrontend del=False
210422 14:46:40 2448550 XrootdXeq: biyj.2478442:82@besh01 disc 0:00:00
We can see that url for /eos/laso/lto7/a.img
is "url": "root://localhost:1094//eos/laso/lto7/a.img?eos.lfn=fxid:00000085"
, while correct one should be "url": "root://besh01.:1094//eos/laso/lto7/a.img?eos.lfn=fxid:00000085"
. This leads tape server into a staled status, trying to mount and read a local file a.img.
And by the way, we have to delete ArchiveRequest object from Ceph directly to remove the staled archive request
$ for i in $(rados ls -p eoscta_metadata -N cta-ns|grep "^ArchiveRequest"); do rados rm -p eoscta_metadata -N cta-ns $i;done
. Is there any proper way to cancel failed ArchiveRequest on the tape server?
Best regards,
Yujiang Bi