cta-verify-file not reliable?

Hi,

We are using cta-ops-verify-tape to carry out full and partial (10,1000,10) verification of tapes. We have observed the following:

  • Files that failed to be verified with the message
{"epoch_time":1777455703.653701997,"local_time":"2026-04-29T10:41:43+0100","hostname":"getafix-ts10","program":"cta-taped","log_level":"ERROR","pid":2531628,"tid":2535720,"message":"File verification failed","drive_name":"obelix_ts1170_31","instance":"antares","sched_backend":"cephUser","thread":"DiskWrite","tapeDrive":"obelix_ts1170_31","tapeVid":"TD4290","mountId":"3599726","vo":"storaged_dls","tapePool":"dia_structsurf","threadCount":10,"threadID":1,"fileId":4383897767,"dstURL":"file://dummy","fSeq":58743,"errorMessage":"In DriveGeneric::readBlock: Failed ST read (with checksum) Errno=5: Input/output error","readWriteTime":0.0,"checksumingTime":0.0,"waitDataTime":875.922771,"waitReportingTime":0.001098,"checkingErrorTime":0.0,"openingTime":0.0,"closingTime":0.0,"transferTime":0.0,"totalTime":0.0,"dataVolume":0,"globalPayloadTransferSpeedMBps":0.0,"diskPerformanceMBps":0.0,"openRWCloseToTransferTimeRatio":0.0}

were successfully extracted with cta-readtp

  • Running cta-verify-file directly on a single file generated the same error (along with warnings of “No tape block movement for too long”)
[root@getafix-ts14 ~]# cat /var/log/cta/cta-taped-obelix_ts1170_27.log | grep 4388468672
{"epoch_time":1778056954.407529676,"local_time":"2026-05-06T09:42:34+0100","hostname":"getafix-ts14","program":"cta-taped","log_level":"INFO","pid":3282061,"tid":3282216,"message":"Successfully positioned for reading","drive_name":"obelix_ts1170_27","instance":"antares","sched_backend":"cephUser","thread":"TapeRead","tapeDrive":"obelix_ts1170_27","tapeVid":"TD4345","mountId":"3616509","vo":"storaged_dls","tapePool":"diamond_i03","mediaType":"TS1170","logicalLibrary":"obelix_ts1170","mountType":"Retrieve","labelFormat":"0000","vendor":"IBM","capacityInBytes":50000000000000,"fileId":4388468672,"BlockId":146929882,"fSeq":6430,"dstURL":"file://dummy","isRepack":false,"isVerifyOnly":true}
{"epoch_time":1778058625.899924148,"local_time":"2026-05-06T10:10:25+0100","hostname":"getafix-ts14","program":"cta-taped","log_level":"WARN","pid":3282061,"tid":3282215,"message":"No tape block movement for too long during recalling","drive_name":"obelix_ts1170_27","instance":"antares","sched_backend":"cephUser","thread":"Watchdog","tapeDrive":"obelix_ts1170_27","tapeVid":"TD4345","mountId":"3616509","vo":"storaged_dls","tapePool":"diamond_i03","TimeSinceLastBlockMove":600.049411,"TimeSinceLastBlockMoveReport":1848.173852,"NoBlockMoveMaxSecs":600.0,"fileId":4388468672,"fSeq":6430}
{"epoch_time":1778059347.941049972,"local_time":"2026-05-06T10:22:27+0100","hostname":"getafix-ts14","program":"cta-taped","log_level":"WARN","pid":3282061,"tid":3282215,"message":"No tape block movement for too long during recalling","drive_name":"obelix_ts1170_27","instance":"antares","sched_backend":"cephUser","thread":"Watchdog","tapeDrive":"obelix_ts1170_27","tapeVid":"TD4345","mountId":"3616509","vo":"storaged_dls","tapePool":"diamond_i03","TimeSinceLastBlockMove":600.085557,"TimeSinceLastBlockMoveReport":722.04083,"NoBlockMoveMaxSecs":600.0,"fileId":4388468672,"fSeq":6430}
{"epoch_time":1778059655.712842825,"local_time":"2026-05-06T10:27:35+0100","hostname":"getafix-ts14","program":"cta-taped","log_level":"ERROR","pid":3282061,"tid":3282216,"message":"Error reading a file in TapeReadFileTask","drive_name":"obelix_ts1170_27","instance":"antares","sched_backend":"cephUser","thread":"TapeRead","tapeDrive":"obelix_ts1170_27","tapeVid":"TD4345","mountId":"3616509","vo":"storaged_dls","tapePool":"diamond_i03","mediaType":"TS1170","logicalLibrary":"obelix_ts1170","mountType":"Retrieve","labelFormat":"0000","vendor":"IBM","capacityInBytes":50000000000000,"fileId":4388468672,"BlockId":146929882,"fSeq":6430,"dstURL":"file://dummy","isRepack":false,"isVerifyOnly":true,"fileBlock":72,"ErrorMessage":"In DriveGeneric::readBlock: Failed ST read (with checksum) Errno=5: Input/output error"}
{"epoch_time":1778059655.713455697,"local_time":"2026-05-06T10:27:35+0100","hostname":"getafix-ts14","program":"cta-taped","log_level":"ERROR","pid":3282061,"tid":3282217,"message":"In DriveGeneric::readBlock: Failed ST read (with checksum) Errno=5: Input/output error","drive_name":"obelix_ts1170_27","instance":"antares","sched_backend":"cephUser","thread":"DiskWrite","tapeDrive":"obelix_ts1170_27","tapeVid":"TD4345","mountId":"3616509","vo":"storaged_dls","tapePool":"diamond_i03","threadCount":10,"threadID":0,"fileId":4388468672,"dstURL":"file://dummy","fSeq":6430,"received_archiveFileID":4388468672,"expected_NSBLOCKId":0,"received_NSBLOCKId":null,"failed_Status":true}
{"epoch_time":1778059655.714488000,"local_time":"2026-05-06T10:27:35+0100","hostname":"getafix-ts14","program":"cta-taped","log_level":"ERROR","pid":3282061,"tid":3282217,"message":"File verification failed","drive_name":"obelix_ts1170_27","instance":"antares","sched_backend":"cephUser","thread":"DiskWrite","tapeDrive":"obelix_ts1170_27","tapeVid":"TD4345","mountId":"3616509","vo":"storaged_dls","tapePool":"diamond_i03","threadCount":10,"threadID":0,"fileId":4388468672,"dstURL":"file://dummy","fSeq":6430,"errorMessage":"In DriveGeneric::readBlock: Failed ST read (with checksum) Errno=5: Input/output error","readWriteTime":0.0,"checksumingTime":0.0,"waitDataTime":2831.242039,"waitReportingTime":0.000987,"checkingErrorTime":0.0,"openingTime":0.0,"closingTime":0.0,"transferTime":0.0,"totalTime":0.0,"dataVolume":0,"globalPayloadTransferSpeedMBps":0.0,"diskPerformanceMBps":0.0,"openRWCloseToTransferTimeRatio":0.0}
{"epoch_time":1778059655.716557071,"local_time":"2026-05-06T10:27:35+0100","hostname":"getafix-ts14","program":"cta-taped","log_level":"ERROR","pid":3282061,"tid":3282227,"message":"In RecallReportPacker::ReportError::execute(): failing retrieve job after exception.","drive_name":"obelix_ts1170_27","instance":"antares","sched_backend":"cephUser","thread":"RecallReportPacker","tapeDrive":"obelix_ts1170_27","tapeVid":"TD4345","mountId":"3616509","vo":"storaged_dls","tapePool":"diamond_i03","failureLog":"May  6 10:27:35.715120 getafix-ts14 In DriveGeneric::readBlock: Failed ST read (with checksum) Errno=5: Input/output error","fileId":4388468672}
{"epoch_time":1778059655.783726379,"local_time":"2026-05-06T10:27:35+0100","hostname":"getafix-ts14","program":"cta-taped","log_level":"INFO","pid":3282061,"tid":3282227,"message":"In OStoreDB::RetrieveJob::failTransfer(): enqueued job for reporting","drive_name":"obelix_ts1170_27","instance":"antares","sched_backend":"cephUser","thread":"RecallReportPacker","tapeDrive":"obelix_ts1170_27","mountId":"3616509","vo":"storaged_dls","tapePool":"diamond_i03","fileId":4388468672,"copyNb":1,"failureReason":"May  6 10:27:35.715120 getafix-ts14 In DriveGeneric::readBlock: Failed ST read (with checksum) Errno=5: Input/output error","requestObject":"RetrieveRequest-Frontend-cta-front01.scd.rl.ac.uk-368749-20260413-11:31:35-0-155632","retriesWithinMount":1,"maxRetriesWithinMount":3,"totalRetries":1,"maxTotalRetries":6}
  • In the most recent case, the file that failed verification with the above message was not only extracted successfully with cta-readtp but also recalled successfully to EOS via xrdfs prepare -s
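For reference, the per-file extraction done above with cat and grep can also be done on the parsed JSON records, which avoids matching the ID inside unrelated fields. This is only a minimal sketch: the function names events_for_file and summarize are made up for illustration, the field names (fileId, log_level, thread, message, errorMessage/ErrorMessage) are taken from the log lines above, and SAMPLE holds abbreviated copies of those lines rather than real log files.

```python
import json

# Abbreviated copies of cta-taped log lines from this thread (most fields dropped).
SAMPLE = [
    '{"local_time":"2026-05-06T10:10:25+0100","log_level":"WARN","thread":"Watchdog",'
    '"message":"No tape block movement for too long during recalling",'
    '"fileId":4388468672,"fSeq":6430}',
    '{"local_time":"2026-05-06T10:27:35+0100","log_level":"ERROR","thread":"DiskWrite",'
    '"message":"File verification failed","fileId":4388468672,"fSeq":6430,'
    '"errorMessage":"In DriveGeneric::readBlock: Failed ST read (with checksum) '
    'Errno=5: Input/output error"}',
    '{"local_time":"2026-04-29T10:41:43+0100","log_level":"ERROR","thread":"DiskWrite",'
    '"message":"File verification failed","fileId":4383897767,"fSeq":58743}',
]


def events_for_file(lines, file_id):
    """Yield parsed JSON log records whose fileId equals file_id."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON record
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupted lines
        if rec.get("fileId") == file_id:
            yield rec


def summarize(rec):
    """Return a one-line summary: time, level, thread, message, error text."""
    # The logs above use both "errorMessage" and "ErrorMessage".
    err = rec.get("errorMessage") or rec.get("ErrorMessage") or ""
    head = (f'{rec.get("local_time")} {rec.get("log_level")} '
            f'{rec.get("thread")}: {rec.get("message")}')
    return f"{head} [{err}]" if err else head


if __name__ == "__main__":
    for record in events_for_file(SAMPLE, 4388468672):
        print(summarize(record))
```

To run it against a real log, pass an open file object (for example the drive log under /var/log/cta/) in place of SAMPLE.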

Our questions: does cta-verify-tape read a file from tape to a dummy destination using a different mechanism than the one used by cta-readtp and cta-taped? Is this mechanism a reliable indicator of a failed file verification, or is it prone to false positives?

Thanks,

George

Dear George,

In the output you provided, there is this clear error:

readBlock: Failed ST read (with checksum) Errno=5: Input/output error

What do you have in /var/log/messages from the st driver around the same time?

To reply to your question, cta-readtp and cta-taped use the same logic to read files from tape.
cta-verify-file doesn’t really read the files; it only submits the request. It is then cta-taped that does the reading.

Could it be that one tape drive has problems with this tape while the other one can read it without issues?

In any case, I would simply repack this tape.

Hope this helps. Cheers,

Vladimir
CERN

Hi Vlado,

Thanks for the prompt and indeed helpful reply.

For the very first error I mentioned above (Apr 29 10:41:43), I see the following in /var/log/messages:

Apr 29 10:41:43 getafix-ts10 kernel: st 15:0:1:0: [st1] Sense Key : Medium Error [current]
Apr 29 10:41:43 getafix-ts10 kernel: st 15:0:1:0: [st1] Add. Sense: Unrecovered read error

and likewise for a few other files from the same tape. The LumOS MLM health state for this tape switched to Average (yellow) and we are repacking this tape.

Looking at the last example tape, on which I also ran the standalone cta-verify-file, I see

May  6 01:58:47 getafix-ts10 kernel: st 15:0:1:0: [st1] Sense Key : Hardware Error [current]

May  6 10:27:35 getafix-ts14 kernel: st 15:0:1:0: [st1] Error 30000 (driver bt 0, host bt 0x3).

Interestingly, the LumOS MLM health state for this tape remains green. I suppose this error is not serious enough…? By the way, where can I find more information on the kernel st driver?

From the above I understand that we need to integrate the output of /var/log/messages into our verification procedure (as I think you do).

Best,

George

Hi George,

Checking /var/log/messages whenever you receive an Input/output error is simply best practice for your operations, not necessarily related to the CTA verification process.
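A minimal sketch of such a check, assuming the syslog format of the lines quoted earlier in this thread: given the local_time of a cta-taped error, it lists the kernel st driver lines logged within a time window around it. The function names, the 60-second default window, and the optional host filter are illustrative choices, not part of CTA. Syslog timestamps carry no year or time zone, so the sketch takes the year from the cta-taped event and assumes both logs use the same local time.

```python
import re
from datetime import datetime, timedelta

# Kernel st driver lines copied from this thread.
SYSLOG_SAMPLE = [
    "Apr 29 10:41:43 getafix-ts10 kernel: st 15:0:1:0: [st1] Sense Key : Medium Error [current]",
    "Apr 29 10:41:43 getafix-ts10 kernel: st 15:0:1:0: [st1] Add. Sense: Unrecovered read error",
    "May  6 01:58:47 getafix-ts10 kernel: st 15:0:1:0: [st1] Sense Key : Hardware Error [current]",
    "May  6 10:27:35 getafix-ts14 kernel: st 15:0:1:0: [st1] Error 30000 (driver bt 0, host bt 0x3).",
]

# "<Mon> <day> <HH:MM:SS> <host> kernel: st ..." (day may be space-padded).
ST_LINE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: st "
)


def cta_local_time(value):
    """Parse a cta-taped local_time such as 2026-04-29T10:41:43+0100.

    The UTC offset is dropped so the result can be compared with syslog
    timestamps (assumption: both logs are written in the same local time).
    """
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z").replace(tzinfo=None)


def syslog_time(ts, year):
    """Parse a syslog timestamp; %b assumes English month abbreviations."""
    return datetime.strptime(f"{year} {' '.join(ts.split())}", "%Y %b %d %H:%M:%S")


def st_messages_near(syslog_lines, event_time, window_seconds=60, host=None):
    """Return st driver lines within window_seconds of event_time."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in syslog_lines:
        match = ST_LINE.match(line)
        if not match:
            continue  # not a kernel st driver line
        if host is not None and match.group("host") != host:
            continue
        when = syslog_time(match.group("ts"), event_time.year)
        if abs(when - event_time) <= window:
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    failure = cta_local_time("2026-04-29T10:41:43+0100")
    for hit in st_messages_near(SYSLOG_SAMPLE, failure):
        print(hit)
```

Widening window_seconds also helps when the drive-side error may have been logged well before the read failure was reported.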

The other error lines you mention are 10 hours apart. Are you sure there is no Medium Error also reported sometime around that time?

Regarding the MLM, I do not know how the internal algorithm works - i.e. how many errors are needed until the cartridge health turns amber.

Cheers,

Vladimir