Issue after upgrading to cta 5.8.4-1

Hello friends,

We have just upgraded our CTA deployment from 4.7.8-1 to 5.8.4-1 and are now running into issues writing files to tape. We did the same upgrade in dev (which is identical to prod, except that it uses mhvtl) and had no such issues.

We did the schema 11 → 12 upgrade as part of the software upgrade process.

We write a 1M file to eoscta, the tape servers attempt to write it to tape, and the drives produce the following error:

Mar  1 03:16:46.329283 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="353" TID="672" MSG="In ArchiveMount::reportJobsBatchWritten(): got an exception" thread="MainThread" tapeDrive="DR0" mountId="75705" exceptionMessageValue="filesWrittenToTape: filesWrittenToTape: File size mismatch: expected=1049713705, actual=1048576: archiveFileId=1, diskInstanceName=cta, diskFileId=14901694"
...
Mar  1 03:16:46.359053 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="353" TID="672" MSG="In MigrationReportPacker::WorkerThread::run(): Received a CTA exception while reporting archive mount results." thread="ReportPacker" tapeDrive="DR0" tapeVid="A01978" mountId="75705" exceptionMSG="In ArchiveMount::reportJobsBatchWritten(): got an exception"
...
Mar  1 03:20:42.238575 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="681" TID="849" MSG="In ArchiveMount::reportJobsBatchWritten(): got an exception" thread="MainThread" tapeDrive="DR0" mountId="75706" exceptionMessageValue="filesWrittenToTape: filesWrittenToTape: File size mismatch: expected=1049713705, actual=1048576: archiveFileId=1, diskInstanceName=cta, diskFileId=14901694"

The file in eoscta:

[root@eos-mgm-0 /]# eos fileinfo /eos/cta/upgrade/eosd/1Megabytezzz
  File: '/eos/cta/upgrade/eosd/1Megabytezzz'  Flags: 0644
  Size: 1048576
Status: healthy
Modify: Wed Mar  1 03:13:34 2023 Timestamp: 1677640414.902949973
Change: Wed Mar  1 03:13:34 2023 Timestamp: 1677640414.902949973
Access: Wed Mar  1 03:13:34 2023 Timestamp: 1677640414.873246971
 Birth: Wed Mar  1 03:13:34 2023 Timestamp: 1677640414.873246971
  CUid: 48 CGid: 48 Fxid: 00e361be Fid: 14901694 Pid: 5137160 Pxid: 004e6308
XStype: adler    XS: 6c fe 2e 56    ETAGs: "4000143024062464:6cfe2e56"
Layout: replica Stripes: 1 Blocksize: 4k LayoutId: 00100012 Redundancy: d1::t0
  #Rep: 1
┌───┬──────┬──────────────────────────────────────────────┬────────────────┬────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                                          host│      schedgroup│            path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴──────────────────────────────────────────────┴────────────────┴────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0        5 5002538b1082d420-0.fst.cta.svc.cluster.archive        default.5            /data     booted             rw      nodrain   online            crlt::crlt-v3

Checking the file on the FST:

[root@crlt-v3 ctafrontend]# kubectl -n cta exec -ti 5002538b1082d420-0 -c fst -- ls -l /data/000005d2/00e361be
-rwx------ 1 daemon daemon 1048576 Mar  1 03:13 /data/000005d2/00e361be
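
For completeness, the replica can also be checked directly against the size and adler32 that EOS reports above (this assumes the xrootd client tools, which provide xrdadler32, are installed in the fst container):

# the size should match the 1048576 from 'eos fileinfo' and the adler32 should match XS 6cfe2e56
kubectl -n cta exec -ti 5002538b1082d420-0 -c fst -- stat -c %s /data/000005d2/00e361be
kubectl -n cta exec -ti 5002538b1082d420-0 -c fst -- xrdadler32 /data/000005d2/00e361be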

I also tried with a 2M file, and we get this in the tapeserver logs:

Mar  1 03:35:07.686559 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="349" TID="671" MSG="In ArchiveMount::reportJobsBatchWritten(): got an exception" thread="MainThread" tapeDrive="DR1" mountId="75707" exceptionMessageValue="filesWrittenToTape: filesWrittenToTape: File size mismatch: expected=1048816260, actual=2097152: archiveFileId=2, diskInstanceName=cta, diskFileId=14901703"
Mar  1 03:35:07.716905 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="349" TID="671" MSG="In MigrationReportPacker::WorkerThread::run(): Received a CTA exception while reporting archive mount results." thread="ReportPacker" tapeDrive="DR1" tapeVid="A01978" mountId="75707" exceptionMSG="In ArchiveMount::reportJobsBatchWritten(): got an exception"

I am not sure where it is getting the expected size from.
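
To try to narrow it down, my plan is to simply grep for the bogus expected value across the CTA and EOS logs and see where it first appears (the paths below are just our deployment's defaults):

# look for the bogus expected size anywhere in the tape server, frontend and EOS logs
grep -rn '1049713705' /var/log/cta/ /var/log/eos/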

We rolled back to version 4.7.8-1 and it is working okay.

Do you have any insights into why this might be happening?

We run xrootd 5.5.1 and eoscta 5.1.8.

Thank you as always :slight_smile:

Warm Regards,

Denis

FST logs from when the tapeserver reads the 2M file:

230301 03:44:55 time=1677642295.067798 func=open                     level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:177              tident=cta.1219:1071@[::ffff:10.54.0.92] sec=unix  uid=0 gid=0 name=cta geo="" path=/eos/cta/upgrade/eosd/2Magain info=cap.msg=<...>&cap.sym=<...>&eos.clientinfo=zbase64:MDAwMDAwODJ4nBXIYQrDIAxA4avsAgtpi6MTchiNkQqlSowdu/3mv++92uTyaoTw2h6l+vihcuWpzockwkn7NqE4chaVNEfTUilJDuO02f9dVNjIIa5u2+O67BwZ8YmQuwFbgH4z8Dm6iUJQPsotfsG3+wFscSwy&eos.lfn=fxid:e361d7&mgm.blockchecksum=ignore&mgm.id=00e361d7&mgm.logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9&mgm.mtime=1677642024&mgm.replicahead=0&mgm.replicaindex=0 open_mode=0
230301 03:44:55 time=1677642295.068076 func=ProcessCapOpaque         level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:2743             tident=cta.1219:1071@[::ffff:10.54.0.92] sec=(null) uid=99 gid=99 name=(null) geo="" capability=&tapeenabled=1&mgm.access=read&mgm.ruid=2&mgm.rgid=2&mgm.uid=99&mgm.gid=99&mgm.path=/eos/cta/upgrade/eosd/2Magain&mgm.manager=eos-mgm-0.mgm.cta.svc.cluster.archive:1094&mgm.fid=00e361d7&mgm.cid=5137160&mgm.sec=sss|daemon|[::ffff:10.54.0.92]||daemon|||cta-taped&mgm.lid=1048594&mgm.bookingsize=1024&mgm.fsid=29&mgm.url0=root://5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095//&mgm.fsid0=29&cap.valid=1677645895
230301 03:44:55 time=1677642295.068134 func=open                     level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:256              tident=cta.1219:1071@[::ffff:10.54.0.92] sec=(null) uid=2 gid=2 name=nobody geo="" ns_path=/eos/cta/upgrade/eosd/2Magain fst_path=/data/000005d2/00e361d7
230301 03:44:55 time=1677642295.068278 func=open                     level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:596              tident=cta.1219:1071@[::ffff:10.54.0.92] sec=(null) uid=2 gid=2 name=nobody geo="" fst_path=/data/000005d2/00e361d7 open-mode=0 create-mode=180 layout-name=replica oss-opaque=&mgm.lid=1048594&mgm.bookingsize=0
230301 03:44:55 time=1677642295.068296 func=fileOpen                 level=INFO  logid=6e150132-b7e3-11ed-b13c-d64a2fd44e97 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=LocalIo:70                     tident=<service> sec=      uid=0 gid=0 name= geo="" flags=0 path=/data/000005d2/00e361d7
230301 03:44:55 time=1677642295.068348 func=open                     level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:696              tident=cta.1219:1071@[::ffff:10.54.0.92] sec=(null) uid=2 gid=2 name=nobody geo="" msg="layout size" disk_size=2097152 db_size= 2097152
230301 03:44:55 time=1677642295.068420 func=open                     level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:783              tident=cta.1219:1071@[::ffff:10.54.0.92] sec=(null) uid=2 gid=2 name=nobody geo="" open-duration=0.632ms path='/eos/cta/upgrade/eosd/2Magain' fxid=00e361d7 path::print=0.359ms creation::barrier=0.058ms layout::exists=0.006ms get::localfmd=0.012ms resync::localfmd=0.066ms clone::fst=0.000ms layout::open=0.010ms layout::opened=0.060ms layout::stat=0.000ms layout::stat=0.000ms fileio::object=0.022ms open::accountingt=0.035ms end=0.004ms open=0.632ms
230301 03:44:55 time=1677642295.068432 func=stat                     level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:1303             tident=cta.1219:1071@[::ffff:10.54.0.92] sec=      uid=2 gid=2 name=nobody geo="" path=/eos/cta/upgrade/eosd/2Magain fxid=00e361d7 size=2097152 mtime=1677642024.29223099
230301 03:44:55 time=1677642295.074257 func=VerifyChecksum           level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:3679             tident=cta.1219:1071@[::ffff:10.54.0.92] sec=      uid=2 gid=2 name=nobody geo="" msg="read checksum info" xs_type=adler xs_computed=3c495d2b xs_local=3c495d2b fxid=00e361d7 fsid=29
230301 03:44:55 time=1677642295.074781 func=_close                   level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:1543             tident=cta.1219:1071@[::ffff:10.54.0.92] sec=      uid=2 gid=2 name=nobody geo="" viaDelete=0 writeDelete=0
230301 03:44:55 time=1677642295.074796 func=VerifyChecksum           level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:3679             tident=cta.1219:1071@[::ffff:10.54.0.92] sec=      uid=2 gid=2 name=nobody geo="" msg="read checksum info" xs_type=adler xs_computed=3c495d2b xs_local=3c495d2b fxid=00e361d7 fsid=29
230301 03:44:55 time=1677642295.074864 func=_close                   level=INFO  logid=static.............................. unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:2020             tident= sec=(null) uid=99 gid=99 name=- geo="" msg="log=6e1492f6-b7e3-11ed-88aa-d66219132fc9&path=/eos/cta/upgrade/eosd/2Magain&fstpath=/data/000005d2/00e361d7&ruid=2&rgid=2&td=cta.1219:1071@[::ffff:10.54.0.92]&host=5002538b218cbc00-0.fst.cta.svc.cluster.archive&lid=1048594&fid=14901719&fsid=29&ots=1677642295&otms=67&cts=1677642295&ctms=74&nrc=8&nwc=0&rb=2097152&rb_min=262144&rb_max=262144&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=27.74&iot=7.038&idt=5.086&lrt=1.249&lrvt=0.000&lwt=0.000&ot=0.632&ct=0.071&rt=1.82&rvt=0.00&wt=0.00&osize=2097152&csize=2097152&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=sss&sec.name=daemon&sec.host=[::ffff:10.54.0.92]&sec.vorg=&sec.grps=daemon&sec.role=&sec.info=&sec.app=cta-taped"
230301 03:44:55 time=1677642295.074874 func=_close                   level=INFO  logid=6e1492f6-b7e3-11ed-88aa-d66219132fc9 unit=fst@5002538b218cbc00-0.fst.cta.svc.cluster.archive:1095 tid=00007f0c7d7fd700 source=XrdFstOfsFile:2251             tident=cta.1219:1071@[::ffff:10.54.0.92] sec=      uid=2 gid=2 name=nobody geo="" msg="done close" rc=0 errc=0

Hi Denis,

So as I understand it, the file size is correct in EOS, but an incorrect expected size is appearing in the tape server logs. Can you determine more precisely where the file size is changing to the incorrect value?

In /var/log/cta/cta-frontend.log, after each CLOSEW event you will see a "Queued archive request" message, which has a fileSize field. What value do you see there?
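
For example, something along these lines should pull out the relevant field (adjust the path if your frontend logs elsewhere):

# show the fileSize recorded for each queued archive request
grep 'Queued archive request' /var/log/cta/cta-frontend.log | grep -o 'fileSize=[^ ]*'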

Cheers,

Michael

Hi Michael,

Thanks very much for your quick reply! I upgraded the prod deployment again in order to retrieve the cta-frontend.log entries; however, the issue has disappeared. I cannot imagine what was different this time.

Sorry to waste your time.

Warm Regards,

Denis