Sporadic Input/Output errors during tape labeling

eacciong · 6 March 2026 15:00

Hi,

We have been experiencing sporadic Input/output errors during the tape labeling process. These are brand new LTO-10 tapes.

Out of 150 tapes labeled, 3 have failed, all showing the following error:

2026-03-03 18:46:38 [INFO] [<lambda>] Mar 3 18:46:38.663287308 cta-tape-label: LVL="ERROR" PID="53528" TID="53528" MSG="Label session failed to label the tape" userName="root" tapeVid="S00088" tapeOldLabel="" force="false" tapeLoadTimeout="7200" tapeLabelError="[TapeLabelCmd::checkTapeLabel] - Reading VOL1: In DriveGeneric::readExactBlock: Failed ST read Errno=5: Input/output error"

Additionally, we have encountered another Input/output error while writing a file:

# jq 'select(.log_level == "ERROR")' cta-taped-IBMLIB2-LTO10-F04C3R1.log-20260306 
{
  "epoch_time": 1772743658.935952320,
  "local_time": "2026-03-05T21:47:38+0100",
  "hostname": "ctatps009",
  "program": "cta-taped",
  "log_level": "ERROR",
  "pid": 24591,
  "tid": 317784,
  "message": "An error occurred for this file. End of migrations.",
  "drive_name": "IBMLIB2-LTO10-F04C3R1",
  "instance": "prod",
  "sched_backend": "cephprod",
  "thread": "TapeWrite",
  "tapeDrive": "IBMLIB2-LTO10-F04C3R1",
  "tapeVid": "S00025",
  "mountId": "14",
  "vo": "vo-magic",
  "tapePool": "vo-magic.M2",
  "mediaType": "LTO10",
  "logicalLibrary": "IBMLIB2-LTO10",
  "mountType": "ArchiveForUser",
  "vendor": "Fujifilm",
  "capacityInBytes": 30000000000000,
  "fileId": 4295110189,
  "fileSize": 1055039040,
  "fSeq": 11331,
  "diskURL": "root://dc089.pic.es:39811/0000D0EFF15215DC4BD3A05BE053C345ED4B",
  "exceptionMessage": "In DriveGeneric::writeBlock: Failed ST write with crc32c Errno=5: Input/output error"
}
{
  "epoch_time": 1772743658.936201998,
  "local_time": "2026-03-05T21:47:38+0100",
  "hostname": "ctatps009",
  "program": "cta-taped",
  "log_level": "ERROR",
  "pid": 24591,
  "tid": 317785,
  "message": "In MigrationReportPacker::ReportError::execute(): failing archive job after exception.",
  "drive_name": "IBMLIB2-LTO10-F04C3R1",
  "instance": "prod",
  "sched_backend": "cephprod",
  "thread": "MainThread",
  "tapeDrive": "IBMLIB2-LTO10-F04C3R1",
  "mountId": "14",
  "vo": "vo-magic",
  "tapePool": "vo-magic.M2",
  "failureLog": "Mar  5 21:47:38.936070 ctatps009 In DriveGeneric::writeBlock: Failed ST write with crc32c Errno=5: Input/output error",
  "fileId": 4295110189
}
{
  "epoch_time": 1772743658.983086561,
  "local_time": "2026-03-05T21:47:38+0100",
  "hostname": "ctatps009",
  "program": "cta-taped",
  "log_level": "ERROR",
  "pid": 24591,
  "tid": 317784,
  "message": "Exception in TapeWriteSingleThread-TapeCleaning when unmounting/unloading the tape. Putting the drive down.",
  "drive_name": "IBMLIB2-LTO10-F04C3R1",
  "instance": "prod",
  "sched_backend": "cephprod",
  "thread": "TapeWrite",
  "tapeDrive": "IBMLIB2-LTO10-F04C3R1",
  "tapeVid": "S00025",
  "mountId": "14",
  "vo": "vo-magic",
  "tapePool": "vo-magic.M2",
  "exceptionMessage": "Could not close device file: /dev/nst1 Errno=5: Input/output error"
}
{
  "epoch_time": 1772757897.554570250,
  "local_time": "2026-03-06T01:44:57+0100",
  "hostname": "ctatps009",
  "program": "cta-taped",
  "log_level": "ERROR",
  "pid": 1797,
  "tid": 1797,
  "message": "In CleanerSession::exceptionThrowingExecute(), failed to clean the Drive with a tape mounted. Disabling the tape.",
  "drive_name": "IBMLIB2-LTO10-F04C3R1",
  "instance": "prod",
  "sched_backend": "cephprod",
  "tapeVid": "S00025",
  "tapeDrive": "IBMLIB2-LTO10-F04C3R1",
  "logicalLibrary": "IBMLIB2-LTO10",
  "host": "ctatps009",
  "exceptionMsg": "Failed ST ioctl (MTREW) in DriveGeneric::rewind Errno=5: Input/output error"
}
{
  "epoch_time": 1772757897.560436059,
  "local_time": "2026-03-06T01:44:57+0100",
  "hostname": "ctatps009",
  "program": "cta-taped",
  "log_level": "ERROR",
  "pid": 1797,
  "tid": 1797,
  "message": "Cleaner failed, the drive is going down.",
  "drive_name": "IBMLIB2-LTO10-F04C3R1",
  "instance": "prod",
  "sched_backend": "cephprod",
  "tapeVid": "S00025",
  "tapeDrive": "IBMLIB2-LTO10-F04C3R1",
  "exceptionMessage": "Failed ST ioctl (MTREW) in DriveGeneric::rewind Errno=5: Input/output error"
}
{
  "epoch_time": 1772758060.547890955,
  "local_time": "2026-03-06T01:47:40+0100",
  "hostname": "ctatps009",
  "program": "cta-taped",
  "log_level": "ERROR",
  "pid": 1797,
  "tid": 1797,
  "message": "Aborting cta-taped on uncaught exception. Stack trace follows.",
  "drive_name": "IBMLIB2-LTO10-F04C3R1",
  "instance": "prod",
  "sched_backend": "cephprod",
  "exceptionMessage": "In SocketPair::send(): failed to send():  Errno=32: Broken pipe"
}
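
For anyone triaging a similar log, the ERROR records above can also be grouped per tape and per error text. This is only a minimal sketch: it assumes the cta-taped log holds one JSON object per line (the jq output above is pretty-printed, so the on-disk layout is an assumption), and the function name `summarize_errors` is hypothetical, not part of CTA.

```python
import json
from collections import Counter


def summarize_errors(lines):
    """Count ERROR records per (tapeVid, error text) pair.

    `lines` is any iterable of strings, assumed to hold one JSON
    object per line (an assumption about the log layout).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(record, dict) or record.get("log_level") != "ERROR":
            continue
        # The excerpt above uses both "exceptionMessage" and "exceptionMsg";
        # fall back to the plain "message" field when neither is present.
        text = (record.get("exceptionMessage")
                or record.get("exceptionMsg")
                or record.get("message", ""))
        # Some records carry no tapeVid; group those under "-".
        counts[(record.get("tapeVid", "-"), text)] += 1
    return counts
```

Passing an open file handle, for example `summarize_errors(open(path))`, returns a `Counter` whose `most_common()` lists the most frequent tape/error combinations first.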

We are running CTA version 5.11.14.0-1 and using AlmaLinux 9.7.

I wanted to ask if you are also seeing these errors.

Thanks in advance,

Esther

vlado · 8 March 2026 07:52

Hi Esther,

We do not have experience with LTO-10, but that shouldn’t matter much in this case.

Normally, when you see an input/output error in CTA, there is additional information in /var/log/messages around that time.

Do you see anything there that could better indicate what the issue is?

Also - anything in the library logs?

Best regards,

Vladimir

eacciong · 9 March 2026 10:33

Hi Vlado,

When we see errors during labeling, they are always reported in the messages log as a medium error:

2026-03-03T18:40:23.716132+01:00 ctatps010 kernel: st 13:0:0:0: [st0] Sense Key : Medium Error [current] 
2026-03-03T18:40:23.716270+01:00 ctatps010 kernel: st 13:0:0:0: [st0] Add. Sense: Recorded entity not found

It seems strange to have a media failure since these are new tapes. I also tried labeling tapes from a different purchase, and we saw the same error.

When this happens, the library reports faulty media:

Cartridge S00088LA had a read, write, or positioning error because of faulty media.

It also automatically opens a case indicating that drive diagnostics are required. However, IBM support informed us that the drive looks fine to them and they consider it a temporary media error.

I opened a specific case with IBM regarding this issue, and they advised me to update the LTO-10 drive firmware to the latest version (they are currently on the recommended version). However, even with the latest version, we are still seeing the error. They have now escalated the case to the development team.

Additionally, as I mentioned in the initial post, apart from the repeated errors during labeling, we also encountered an I/O error during a write operation. The messages log reported the following error:

2026-03-06T01:44:57.549145+01:00 ctatps009 kernel: qla2xxx [0000:98:00.1]-8802:13: Aborting from RISC nexus=13:0:0 sp=000000004d757196 cmd=000000009d42b7a0 handle=4c4
2026-03-06T01:44:57.549189+01:00 ctatps009 kernel: qla2xxx [0000:98:00.1]-587c:13: Abort command issued - hdl=4c4, type=8
2026-03-06T01:44:57.549197+01:00 ctatps009 kernel: qla2xxx [0000:98:00.1]-3822:13: FCP command status: 0x5-0x0 (0x80000) nexus=13:0:0 portid=000002 oxid=0x309 cdb=01000000000000000000 len=0x0 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=000000004d757196 cp=000000009d42b7a0.
2026-03-06T01:44:57.552495+01:00 ctatps009 kernel: qla2xxx [0000:98:00.1]-8803:13: Abort command mbx cmd=000000009d42b7a0, rval=0.
2026-03-06T01:44:57.552504+01:00 ctatps009 kernel: qla2xxx [0000:98:00.1]-801c:13: Abort command issued nexus=13:0:0 -- 2002.
2026-03-06T01:44:57.553134+01:00 ctatps009 kernel: st 13:0:0:0: [st1] Error 30000 (driver bt 0, host bt 0x3).

In this case, the library didn’t report any error; the drive just asked for a cleaning. In CTA, the drive went down indicating “[STALE] [cta-taped] INFO Shutdown”.

Thanks for your help,

Esther

vlado · 9 March 2026 20:15

Hello Esther,

Thank you for the additional information.

As I mentioned earlier, we do not have a lot of experience with LTO-10. As you may know, the head technology of those tape drives is very new and different from LTO-9. That is why I am not surprised that you see strange errors when deploying LTO-10 at scale. As with any new technology, it will take some time until the teething problems arising from various conditions in the (customer) field are resolved.

It is hard to find information about the Recorded Entity Not Found error, but I was able to locate this page:

which mentions:

Byte 12 ASC	Byte 13 ASCQ	Description
14	00	Recorded Entity Not Found - A space or Locate command failed because a format violation prevented the target from being found.
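
To relate the table row to what the kernel printed (Sense Key: Medium Error, Add. Sense: Recorded entity not found), here is a minimal sketch of decoding fixed-format SCSI sense data, where the sense key is the low nibble of byte 2 and ASC/ASCQ are bytes 12 and 13. The function name `decode_fixed_sense` is hypothetical, and the ASC/ASCQ table is deliberately limited to the one code discussed here.

```python
# Standard SCSI sense key names (low nibble of byte 2).
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0x7: "Data Protect",
    0x8: "Blank Check",
    0xB: "Aborted Command",
}

# Deliberately tiny table: only the code discussed in this thread.
ASC_ASCQ = {
    (0x14, 0x00): "Recorded entity not found",
}


def decode_fixed_sense(sense: bytes) -> dict:
    """Decode sense key, ASC and ASCQ from fixed-format sense data."""
    if len(sense) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    response_code = sense[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError(
            f"not fixed-format sense data (response code {response_code:#x})"
        )
    key = sense[2] & 0x0F
    asc, ascq = sense[12], sense[13]
    return {
        "sense_key": SENSE_KEYS.get(key, f"unknown ({key:#x})"),
        "asc": asc,
        "ascq": ascq,
        "additional_sense": ASC_ASCQ.get(
            (asc, ascq), f"not in table ({asc:02x}h/{ascq:02x}h)"
        ),
    }
```

A raw buffer with response code 70h, sense key 3h and ASC/ASCQ 14h/00h decodes to the same pair of strings that appear in the messages log.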

In any case, this is definitely for IBM to resolve - the CERN CTA team cannot really help here. As the library itself reports issues, you just need to push them to escalate the ticket to the engineers, given that these are new drives/cartridges.

Initially you mentioned that you have a problem with 3 tapes out of 150, but now you also wrote: “I also tried labeling tapes from a different purchase, and we saw the same error.” So how many tapes out of the total are affected?

What I would do is use the dd command to try to write some data at the beginning of the tape. Just something like:

mt -f /dev/nst0 rewind
dd if=/etc/<some small file> of=/dev/nst0 bs=256k

If the problem is really bad, you should be able to provoke it with this command. If it doesn’t fail, rewind again and try increasing the file size; if that succeeds, try to label the tape again.
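
The "rewind, write, then increase the size" loop could be scripted along these lines. This is only a sketch under assumptions: the function names, the block counts and the use of /dev/urandom as a data source are illustrative choices, not something from the commands above. It only builds the mt/dd command lines and does not run them, because executing them overwrites the beginning of the tape and is only appropriate for scratch cartridges.

```python
import shlex


def build_write_test(device, block_counts=(1, 16, 256, 4096),
                     block_size="256k", source="/dev/urandom"):
    """Return mt/dd command lines for write tests of increasing size.

    Every write is preceded by a rewind, so each test starts at the
    beginning of the tape. Nothing is executed here.
    """
    commands = []
    for count in block_counts:
        commands.append(["mt", "-f", device, "rewind"])
        commands.append([
            "dd",
            f"if={source}",
            f"of={device}",
            f"bs={block_size}",
            f"count={count}",
        ])
    return commands


def render(commands):
    """Render the command lists as a shell-quoted script, one per line."""
    return "\n".join(shlex.join(cmd) for cmd in commands)
```

Calling `print(render(build_write_test("/dev/nst0")))` gives a reviewable list of commands that can be run by hand, stopping at the first one that returns an I/O error.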

Regarding the second error, from the qla2xxx Fibre Channel HBA driver - I would guess that the error was so severe that the drive reset itself, which is why you lost the fibre connection. Could it be that?

Finally, my two comments:

As you are deploying new technology which has not yet been hardened by thousands of other customers, you need to double-check that the data written on those tapes is readable. I would initially read full tapes or at least put some verification in place.
If 3 tapes out of 150 have problems, use the standard Lifetime Warranty and have them replaced (by Dexxon).
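
As an illustration of the verification idea in the first comment, a read-back check can be as simple as comparing checksums of the original file and of the copy retrieved from tape to disk. This is a minimal sketch: the function names are hypothetical, Adler-32 is used only as an example checksum, and both paths are assumed to be regular files on disk rather than the tape device itself.

```python
import zlib


def adler32_of(path, block_size=256 * 1024):
    """Compute the Adler-32 checksum of a file, reading it block by block."""
    checksum = 1  # Adler-32 starts from 1
    with open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:
                break
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xFFFFFFFF


def verify_copy(original_path, readback_path):
    """Return True when both files have the same Adler-32 checksum."""
    return adler32_of(original_path) == adler32_of(readback_path)
```

A matching checksum shows that the tape copy could be read and agrees with the source; it says nothing about how many retries the drive needed, so the drive and library logs are still worth checking.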

Hope this helps. Best regards,

Vladimir
CERN

eacciong · 10 March 2026 16:53

Hi Vlado,

Thank you very much for your help.

For now, we don’t have LTO-10 in production yet. Before adding the drives to CTA, we performed performance tests writing with different file sizes and using various block sizes, and we didn’t encounter any errors. However, we only used a small set of tapes during those tests, which must be why we didn’t encounter this issue.

Currently, 4 tapes are affected: 3/150 from the first batch and 1/60 from the second purchase.
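
Expressed as rates, and assuming the denominators are the number of tapes labeled from each purchase, the figures work out as in this small sketch (the helper names are illustrative only):

```python
# Counts reported in this thread: (failed, labeled) per purchase.
batches = {"first batch": (3, 150), "second purchase": (1, 60)}


def failure_rate(failed, total):
    """Return the failed fraction as a percentage."""
    return 100.0 * failed / total


def overall_rate(batches):
    """Return (failed, total, percentage) summed over all batches."""
    failed = sum(f for f, _ in batches.values())
    total = sum(t for _, t in batches.values())
    return failed, total, failure_rate(failed, total)


for name, (failed, total) in batches.items():
    print(f"{name}: {failed}/{total} = {failure_rate(failed, total):.2f}%")
failed, total, rate = overall_rate(batches)
print(f"overall: {failed}/{total} = {rate:.2f}%")
```

That is 2.00% for the first batch, 1.67% for the second purchase, and 1.90% overall (4 of 210).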

Two were resolved with a labeling retry, but the others failed repeatedly with the same error across different drives.

With one of the tapes that was failing repeatedly, I performed the test you suggested, and writing data to the beginning of the tape worked without any errors. Following that, I attempted the labeling again and it worked fine.

Yes, it appears the connection was aborted due to a timeout when the drive failed to respond. After this error occurred, I uploaded the drive dump to the IBM case. They have escalated it and are currently reviewing it.

For now, we are still testing this technology, but we will certainly have to step up tape verification significantly once we are in production.

We have contacted Fujifilm, and in addition to replacing the tapes, they will also send them to the lab for analysis.

Thanks again. Best regards,

Esther