Hello friends,
We have just upgraded our CTA deployment from 4.7.8-1 to 5.8.4-1 and are running into issues writing files to tape. We did the same upgrade in dev first (which is identical to prod, except that it uses mhvtl) and had no such issues there.
We did the schema 11 → 12 upgrade as part of the software upgrade process.
We write a 1M file to eoscta and observe the tapeservers attempt to write it to tape, at which point cta-taped logs the following errors:
Mar 1 03:16:46.329283 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="353" TID="672" MSG="In ArchiveMount::reportJobsBatchWritten(): got an exception" thread="MainThread" tapeDrive="DR0" mountId="75705" exceptionMessageValue="filesWrittenToTape: filesWrittenToTape: File size mismatch: expected=1049713705, actual=1048576: archiveFileId=1, diskInstanceName=cta, diskFileId=14901694"
...
Mar 1 03:16:46.359053 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="353" TID="672" MSG="In MigrationReportPacker::WorkerThread::run(): Received a CTA exception while reporting archive mount results." thread="ReportPacker" tapeDrive="DR0" tapeVid="A01978" mountId="75705" exceptionMSG="In ArchiveMount::reportJobsBatchWritten(): got an exception"
...
Mar 1 03:20:42.238575 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="681" TID="849" MSG="In ArchiveMount::reportJobsBatchWritten(): got an exception" thread="MainThread" tapeDrive="DR0" mountId="75706" exceptionMessageValue="filesWrittenToTape: filesWrittenToTape: File size mismatch: expected=1049713705, actual=1048576: archiveFileId=1, diskInstanceName=cta, diskFileId=14901694"
The file in eoscta:
[root@eos-mgm-0 /]# eos fileinfo /eos/cta/upgrade/eosd/1Megabytezzz
File: '/eos/cta/upgrade/eosd/1Megabytezzz' Flags: 0644
Size: 1048576
Status: healthy
Modify: Wed Mar 1 03:13:34 2023 Timestamp: 1677640414.902949973
Change: Wed Mar 1 03:13:34 2023 Timestamp: 1677640414.902949973
Access: Wed Mar 1 03:13:34 2023 Timestamp: 1677640414.873246971
Birth: Wed Mar 1 03:13:34 2023 Timestamp: 1677640414.873246971
CUid: 48 CGid: 48 Fxid: 00e361be Fid: 14901694 Pid: 5137160 Pxid: 004e6308
XStype: adler XS: 6c fe 2e 56 ETAGs: "4000143024062464:6cfe2e56"
Layout: replica Stripes: 1 Blocksize: 4k LayoutId: 00100012 Redundancy: d1::t0
#Rep: 1
┌───┬──────┬──────────────────────────────────────────────┬────────────────┬────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│ host│ schedgroup│ path│ boot│ configstatus│ drain│ active│ geotag│
└───┴──────┴──────────────────────────────────────────────┴────────────────┴────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
0 5 5002538b1082d420-0.fst.cta.svc.cluster.archive default.5 /data booted rw nodrain online crlt::crlt-v3
Checking the file on the FST:
[root@crlt-v3 ctafrontend]# kubectl -n cta exec -ti 5002538b1082d420-0 -c fst -- ls -l /data/000005d2/00e361be
-rwx------ 1 daemon daemon 1048576 Mar 1 03:13 /data/000005d2/00e361be
I also tried with a 2M file, and we got this in the tapeserver logs:
Mar 1 03:35:07.686559 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="349" TID="671" MSG="In ArchiveMount::reportJobsBatchWritten(): got an exception" thread="MainThread" tapeDrive="DR1" mountId="75707" exceptionMessageValue="filesWrittenToTape: filesWrittenToTape: File size mismatch: expected=1048816260, actual=2097152: archiveFileId=2, diskInstanceName=cta, diskFileId=14901703"
Mar 1 03:35:07.716905 ctatapeserver-crlt-v3-0 cta-taped: LVL="ERROR" PID="349" TID="671" MSG="In MigrationReportPacker::WorkerThread::run(): Received a CTA exception while reporting archive mount results." thread="ReportPacker" tapeDrive="DR1" tapeVid="A01978" mountId="75707" exceptionMSG="In ArchiveMount::reportJobsBatchWritten(): got an exception"
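We are not sure where the expected size is coming from: the 1M file is 1048576 bytes both in EOS and on the FST, which matches the "actual" value in the error.
We rolled back to version 4.7.8-1 and writing to tape works correctly again.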
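In case it helps anyone compare the numbers, here is a minimal Python sketch that pulls the sizes and IDs out of the cta-taped error lines. The names `parse_mismatch` and `to_fxid` are made up for illustration and are not part of CTA or EOS. It assumes only the message format shown in the logs above, and that the Fxid printed by eos fileinfo is the Fid as zero-padded hex (true for the 1M file: Fid 14901694 is Fxid 00e361be).

```python
import re

# Matches the "File size mismatch" text as it appears in the log lines above.
MISMATCH_RE = re.compile(
    r"File size mismatch: expected=(?P<expected>\d+), actual=(?P<actual>\d+): "
    r"archiveFileId=(?P<archive_file_id>\d+), "
    r"diskInstanceName=(?P<disk_instance>[^,\"]+), "
    r"diskFileId=(?P<disk_file_id>\d+)"
)


def parse_mismatch(line):
    """Return the mismatch fields from one log line, or None if absent."""
    match = MISMATCH_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "expected": int(fields["expected"]),
        "actual": int(fields["actual"]),
        "archive_file_id": int(fields["archive_file_id"]),
        "disk_instance": fields["disk_instance"],
        "disk_file_id": int(fields["disk_file_id"]),
    }


def to_fxid(disk_file_id):
    """Format a decimal file ID as 8-digit hex, like the Fxid in eos fileinfo."""
    return format(disk_file_id, "08x")


if __name__ == "__main__":
    sample = (
        'exceptionMessageValue="filesWrittenToTape: filesWrittenToTape: '
        "File size mismatch: expected=1049713705, actual=1048576: "
        'archiveFileId=1, diskInstanceName=cta, diskFileId=14901694"'
    )
    info = parse_mismatch(sample)
    print(info)
    print("fxid:", to_fxid(info["disk_file_id"]))
```

Running this over the lines above shows that diskFileId matches the Fid from eos fileinfo and that "actual" matches the real file size, so only the "expected" value is unexplained.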
Do you have any insights into why this might be happening?
We run xrootd 5.5.1 and eoscta 5.1.8.
Thank you, as always.
Warm Regards,
Denis