Path is empty in the output of command cta-admin tf ls -v TAPEID

Hi there,
We have deployed a test CTA instance and are trying to make it work properly. I noticed that the tapefile command gives us information about the files on a tape, like:

cta-admin tf ls -v V01006
archive id copy no    vid fseq block id instance disk fxid size checksum type checksum value storage class owner group    creation time path
       150       2 V01006    2       11  eoslaso        41 1.0G       ADLER32       21a9f19c      laso-raw 12142   600 2021-04-17 11:16 -
        81       2 V01006    3     4020  eoslaso        2b  541       ADLER32       2338a465      laso-raw 12142   600 2021-04-18 11:51 -
        82       2 V01006    4     4030  eoslaso        2c  595       ADLER32       dd01b757      laso-raw 12142   600 2021-04-18 12:00 -
       121       2 V01006    5     4040  eoslaso        35 1.0G       ADLER32       21a9f19c      laso-raw 12142   600 2021-04-18 15:09 -

Everything looks normal except that the path column is empty. What is the path column designed to record? Will it record the files’ EOS paths in the future? It would be more convenient if the path column showed the file’s absolute path in EOS.

Hi Yujiang Bi,

Looking up the file paths in EOS is a heavy operation. That is why it was decided to put it behind the -l option.

See this example with and without -l option:

[root@ctaproductionfrontend01 ~]# cta-admin tapefile ls -v L75687 | head -5
archive id copy no    vid fseq block id  instance disk fxid   size checksum type checksum value storage class  owner group    creation time path
1822775011       1 L75687    1        0 eosctacms  6ca556e3   6.2G       ADLER32       c9af3d1b           cms  22014  1399 2020-11-21 11:07 -
1822783433       1 L75687    2    23734 eosctacms  6ca577c9   6.1G       ADLER32       662b1eb0           cms  22014  1399 2020-11-21 11:54 -
1822774912       1 L75687    3    47132 eosctacms  6ca55680 716.8K       ADLER32       1a523c9a           cms 103031  1399 2020-11-21 11:07 -
1822783209       1 L75687    4    47144 eosctacms  6ca576e9   6.2G       ADLER32       24f085c1           cms  22014  1399 2020-11-21 11:53 -

[root@ctaproductionfrontend01 ~]# cta-admin tapefile ls -v L75687 -l | head -5
archive id copy no    vid fseq block id  instance disk fxid   size checksum type checksum value storage class  owner group    creation time path
1822775011       1 L75687    1        0 eosctacms  6ca556e3   6.2G       ADLER32       c9af3d1b           cms  22014  1399 2020-11-21 11:07 /eos/ctacms/archive/cms/store/data/Commissioning2020/VRRandom1/RAW/v1/000/338/731/00000/DCA04933-207D-A848-82C9-1EFF32F970B5.root
1822783433       1 L75687    2    23734 eosctacms  6ca577c9   6.1G       ADLER32       662b1eb0           cms  22014  1399 2020-11-21 11:54 /eos/ctacms/archive/cms/store/data/Commissioning2020/VRRandom11/RAW/v1/000/338/731/00000/B5725C9C-D253-234D-B7E0-76B9E1B67E9B.root
1822774912       1 L75687    3    47132 eosctacms  6ca55680 716.8K       ADLER32       1a523c9a           cms 103031  1399 2020-11-21 11:07 /eos/ctacms/archive/cms/store/logs/prod/2020/11/WMAgent/PromptReco_Run338632_MinimumBias/PromptReco_Run338632_MinimumBias-AlcaSkimLogCollect-b7g19n0537-1-logs.tar
1822783209       1 L75687    4    47144 eosctacms  6ca576e9   6.2G       ADLER32       24f085c1           cms  22014  1399 2020-11-21 11:53 /eos/ctacms/archive/cms/store/data/Commissioning2020/VRRandom12/RAW/v1/000/338/731/00000/11264488-31BF-A846-906B-0BD4135C6B40.root

Please note that in order for the -l option to work, the security certificate between the CTA frontend and the EOS instance has to be configured.

Best regards,

Vladimir Bahyl

Hi, Vladimir Bahyl,

Thanks for the explanation. So the command
cta-admin tapefile ls -v TAPEID -l
will look up each file’s EOS path on the fly, instead of reading a path that was recorded in the database when the file was archived, right?
If one tape has tens of thousands of files on it, this command may take a very long time. So why not add the file’s EOS path to the database when archiving it, to avoid looking up the EOS namespace?

Hi Yujiang Bi,

Originally there was a copy of the namespace data in the database, but we decided to remove it for two reasons:

  1. It introduces the problem of keeping the EOS namespace and the DB synchronised when metadata operations are carried out (move, rename).
  2. It massively increases the size of the database. At CERN we will soon have a billion files in the DB. If we include the full path for every file we would need an additional 1 TB of DB space. This reduces the number of rows which can be cached and would therefore slow down everyday DB operations.

If you just want a fast listing of files on tape, use tapefile ls without -l. If you want a detailed listing including filenames, use tapefile ls -l which puts a bit of load on the EOS MGM, but this is acceptable given the relative infrequency of this kind of query compared to normal archival and retrieval operations.
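If you need the EOS paths for a very large number of files, one option is to dump the listing once in JSON and process it offline. This is just a sketch, not a recipe (the exact JSON structure depends on the cta-admin version, so inspect the output on your own instance):

$ cta-admin --json tapefile ls -v L75687 -l > L75687-files.json   # one pass of EOS lookups, result reusable offline
$ jq length L75687-files.json                                     # number of entries, assuming the output is a JSON array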

As Vlado says, EOS has to be configured with gRPC and CTA has to be configured with the EOS gRPC token to allow CTA to query the namespace.
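For illustration, the CTA side of that configuration looks roughly like the sketch below. The key name, the file format and all values here are assumptions/placeholders; please check the CTA installation documentation for your version.

# In /etc/cta/cta-frontend-xrootd.conf: point the frontend at the list of EOS gRPC namespace endpoints
cta.ns.config /etc/cta/eos.grpc.keys.cfg

# In /etc/cta/eos.grpc.keys.cfg: one line per EOS instance, "<instance> <gRPC endpoint:port> <token>"
eoslaso besh01.example.org:50051 <grpc-token>

# On the EOS side, the same token has to be mapped to a sufficiently privileged identity
# (see the gRPC key mapping options of "eos vid").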

Cheers,

Michael

Hi, Michael,
We have configured EOS and CTA with gRPC, and it works. Storing the paths would indeed increase the size of the CTA database, as you said. We are still trying to understand EOS & CTA :sweat_smile:.

BTW, I’m confused by the cta-taped service behavior. One thing is that all tape drives are in the DOWN state when the cta-taped service starts or restarts, and we have to set them UP manually. Why are they not brought UP automatically? Or is there an option in cta-taped to control this?

Another thing is that cta-taped's log is written directly to /var/log/messages, which is not friendly for debugging. Will it be redirected to a separate log file like /var/log/cta/cta-taped.log in the future?

Best regards
Yujiang Bi

Hi Yujiang Bi,

I will take care of these 2 questions:

  1. It is simply too dangerous to put drives UP when the daemon restarts. Think about physical tape drives with cartridges: it happens that a server is rebooted by mistake while a tape drive has not been properly unloaded. By having the drives DOWN after a reboot, the operator has to put them UP manually, which implies checking that the tape drive is free (there is also a check inside the daemon) and ready to accept requests. There might be other reasons why the node is not yet ready for production (e.g. a reinstall), so having drives DOWN by default is a safety measure.

  2. Regarding the logs, at CERN we put them into a separate log file using the following in /etc/rsyslog.d/00-cta-daemons.conf:

# EPOCH nanosecond AND human readable
template(name="epochnanosecond4humans" type="string" string="[%TIMESTAMP:::date-unixtimestamp%.%timestamp:::date-subseconds%000] %TIMESTAMP%.%timestamp:::date-subseconds% %HOSTNAME% %syslogtag%%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n")

# CTA tape server daemon
# Keep locally as well as forward the logs to the CTALOG server
if $programname == 'cta-taped' then /var/log/cta/cta-taped.log;epochnanosecond4humans
if $programname == 'cta-taped' then @127.0.0.1:6140;epochnanosecond4humans
if $programname == 'cta-taped' then :omrelp:ctalog:20514
& stop

It was left up to the system administrator to decide what to do with the log files. We can share that rsyslog config file.
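If you go this route, do not forget to rotate the new file as well. A minimal logrotate sketch (the path and retention values are only examples, adapt them to your setup):

# /etc/logrotate.d/cta-taped (example only)
/var/log/cta/cta-taped.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    postrotate
        /usr/bin/systemctl kill -s HUP rsyslog.service >/dev/null 2>&1 || true
    endscript
}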

Best regards,

Vladimir Bahyl

Yujiang Bi,

There is one more thing worth knowing: you can use regular expressions to put drives UP or DOWN with the cta-admin drive command. Example:

[root@tpsrv028 rsyslog.d]# cta-admin drive ls | grep I1L
 IBM1L8 I1L80931 tpsrv316      Up              -     Free 49133      -            -     -     -      -     -       -        0        -   7 -
 IBM1L8 I1L80933 tpsrv318      Up              -     Free 26701      -            -     -     -      -     -       -        0        -   0 -
 IBM1L8 I1L80934 tpsrv319      Up              -     Free 78093      -            -     -     -      -     -       -        0        -   0 -
 IBM1L8 I1L80941 tpsrv312      Up              -     Free 41234      -            -     -     -      -     -       -        0        -   1 -
 IBM1L8 I1L80942 tpsrv313      Up              -     Free 13682      -            -     -     -      -     -       -        0        -   5 -
 IBM1L8 I1L80943 tpsrv314      Up ArchiveForUser Transfer  3329 L80396 r_atlas_prod ATLAS   368   1.1T 244.0   32507        0        -   1 -
[root@tpsrv028 rsyslog.d]# cta-admin drive up I1L8093.*
Drive I1L80934: set Up.
Drive I1L80933: set Up.
Drive I1L80931: set Up.

:smile: You are right. And we have already encountered such tape server problems…

And now we have found another problem, with the cta-send-closew.sh / cta-send-event command :sweat_smile:.

We have the cta-frontend and the EOS MGM running on the same node, and EOS_MGM_URL for user biyj is set to root://besh01./. We know that CTA and EOS should be separated onto different nodes for production use, but for now we only have a few nodes for testing.

When a file fails to archive, we use this command to trigger archival again:

$ cta-send-closew.sh /eos/laso/lto7/a.img

And the logs from cta-frontend are:

210422 14:46:40 2450889 XrootdXeq: biyj.2478442:82@besh01 pvt IPv4 login as eoslaso
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096 Prepare():
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096    Resource name: /ctafrontend
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096    Resource user: 
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096    Resource info: 
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096    Hosts to avoid: 
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096    Affinity: Default
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096    Resource options: 
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096    Resource Client: 
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Protocol:     sss
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Name:         eoslaso
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Host:         besh01
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Vorg:         NULL
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Role:         NULL
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Grps:         eoslaso
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Endorsements: NULL
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Creds:        NULL
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Credslen:     0
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Rsvd:         0
210422 14:46:40 2450889 ssi_Pb::Service: pid:2447760 tid:140165191268096       Tident:       
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_open: /ctafrontend prepared.
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_write: 0:/ctafrontend rsz=1041 wsz=1041
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_Activate: 0:/ctafrontend [new wtReq] oucbuff rqsz=1041
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_DoIt: 0:/ctafrontend [begun xqReq] Calling service processor
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 Called RequestProc() constructor
210422 14:46:40 2452674 ssi_Pb::Service: pid:2447760 tid:140165189162752 ProcessRequest(): Binding Processor to Request
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_BindDone: 0:/ctafrontend [begun xqReq] Bind called; for request 0
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 Called Execute()
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_GetRequest: 0:/ctafrontend [bound xqReq] sz=1041
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 RequestProc(): received 1041 bytes
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_fctl: 0:/ctafrontend query resp status
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_fctl: 0:/ctafrontend resp not ready
210422 14:46:40 2450889 biyj.2478442:82@besh01 Ssi_Done: 0:/ctafrontend [bound xqReq] wtrsp sent; resp pend
{
 "notification": {
  "wf": {
   "event": "CLOSEW",
   "instance": {
    "name": "eoslaso",
    "url": "root://localhost:1094//eos/laso/lto7/a.img?eos.lfn=fxid:00000085"
   },
   "requesterInstance": "cta-send-event",
   "verifyOnly": false,
   "vid": ""
  },
  "cli": {
   "user": {
    "username": "biyj",
    "groupname": "u07"
   }
  },
  "transport": {
   "dstUrl": "",
   "reportUrl": "eosQuery://localhost:1094//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=00000085&mgm.logid=cta&mgm.event=sync::archived&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&cta_archive_file_id=290",
   "errorReportUrl": "eosQuery://localhost:1094//eos/wfe/passwd?mgm.pcmd=event&mgm.fid=00000085&mgm.logid=cta&mgm.event=sync::archive_failed&mgm.workflow=default&mgm.path=/dummy_path&mgm.ruid=0&mgm.rgid=0&cta_archive_file_id=290&mgm.errmsg="
  },
  "file": {
   "fid": "133",
   "pid": "0",
   "owner": {
    "uid": 12142,
    "gid": 600
   },
   "size": "1073741824",
   "mode": 0,
   "lpath": "/eos/laso/lto7/a.img",
   "xattr": {
    "sys.vtrace": "[Thu Apr 22 13:10:52 2021] uid:12142[biyj] gid:600[u07] tident:biyj.2462824:365@localhost name:biyj dn: prot:unix host:localhost domain:localdomain geo: sudo:0",
    "sys.eos.btime": "1619068252.863695240",
    "sys.archive.file_id": "290",
    "sys.utrace": "1c3e23aa-a329-11eb-a50c-38eaa7a302a6",
    "sys.cta.archive.objectstore.id": "ArchiveRequest-Frontend-besh01-2447760-20210422-11:41:13-0-55",
    "sys.archive.storage_class": "laso-l7"
   },
   "csb": {
    "cs": [
     {
      "type": "ADLER32",
      "value": "RKJo+g=="
     }
    ]
   }
  }
 },
 "clientCtaVersion": "4.0-1",
 "clientXrootdSsiProtobufInterfaceVersion": "v0.98"
}
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_RelReqBuff: 0:/ctafrontend [bound xqReq] called
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 RequestProc(): sending metadata:
{
 "type": "RSP_SUCCESS",
 "xattr": {
  "sys.cta.objectstore.id": "ArchiveRequest-Frontend-besh01-2447760-20210422-11:41:13-0-81"
 },
 "messageTxt": "",
 "showHeader": "NONE"
}
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_ProcessResponse: 0:/ctafrontend [bound xqReq] Response presented wtr=True
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_ProcessResponse: 0:/ctafrontend [bound doRsp] Resp data sz=0
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_WakeUp: 0:/ctafrontend [bound doRsp] respCBarg=10052000000f0
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_AttnInfo: 0:/ctafrontend 91 byte metadata (0x080112570a1673...) sent.
210422 14:46:40 2447769 biyj.2478442:82@besh01 Ssi_Done: 0:/ctafrontend [bound odRsp] resp sent; no additional data remains
210422 14:46:40 2447769 biyj.2478442:82@besh01 Ssi_Finalize: 0:/ctafrontend [done odRsp] Calling Finished(False)
210422 14:46:40 2447769 ssi_Pb::RequestProc: pid:2447760 tid:140165855815424 Called Finished()
210422 14:46:40 2452674 ssi_Pb::Service: pid:2447760 tid:140165189162752 ProcessRequest(): Unbinding Processor from Request
210422 14:46:40 2452674 biyj.2478442:82@besh01 Ssi_Dispose: 0:n/a [done odRsp] Recycling request...
210422 14:46:40 2452674 ssi_Pb::RequestProc: pid:2447760 tid:140165189162752 Called ~RequestProc() destructor
210422 14:46:40 2448550 biyj.2478442:82@besh01 Ssi_close: /ctafrontend del=False
210422 14:46:40 2448550 XrootdXeq: biyj.2478442:82@besh01 disc 0:00:00

We can see that the url for /eos/laso/lto7/a.img is "url": "root://localhost:1094//eos/laso/lto7/a.img?eos.lfn=fxid:00000085", while the correct one should be "url": "root://besh01.:1094//eos/laso/lto7/a.img?eos.lfn=fxid:00000085". This leaves the tape server in a stalled state, trying to mount the tape and read a.img as a local file.

And by the way, we have to delete the ArchiveRequest objects from Ceph directly to remove the stalled archive requests:

$ for i in $(rados ls -p eoscta_metadata -N cta-ns|grep "^ArchiveRequest"); do rados rm -p eoscta_metadata -N cta-ns $i;done

Is there any proper way to cancel a failed ArchiveRequest on the tape server?

Best regards,
Yujiang Bi

A failed archive request (or retrieve request) should end up in the failed queue. Indeed, there is no automatic removal of failed requests; this is by design. The request will stay in this queue so that it can be investigated by an operator. However, there is a tool to clean up:

  • cta-admin failedrequest ls will list all jobs in the failed queue. You can filter by justarchived, justretrieved, tapepool or vid.
  • To see the full list of log messages associated with each failed job, use cta-admin --json failedrequest ls -l. You should see one log message per failed retry attempt (three tries per mount across two mounts = six tries in total).
  • Once your analysis is complete, you can remove failed jobs from the objectstore with cta-admin failedrequest rm (see the sketch below). It shouldn’t be necessary to use the rados low-level tools.
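A possible workflow, as a sketch using only the commands above (I leave out the exact selection options for rm; check the cta-admin help on your version):

$ cta-admin failedrequest ls                    # overview of all failed archive/retrieve jobs
$ cta-admin --json failedrequest ls -l | jq .   # full log messages per failed job, machine-readable
$ cta-admin failedrequest rm ...                # remove the investigated requests (selection options intentionally omitted)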

Cheers, Michael