Problem Tape workflow

jeff · 24 May 2024 13:51

What is the workflow for a CTA operator to handle tapes with “bad” files? Are there CTA operator scripts available for automatic repack of such a tape?

Our current tape system, Enstore, handles it this way. Enstore reads the files from the start of the tape to the end. If a file is “bad”, we get errors of this type: transfer failed READ_ERROR FTT_EBLANK TAPE. When this happens, the drive goes down with the tape still in the drive. We must manually eject the tape, which can take up to 15 minutes. Then we mark the current file as “bad” and continue reading, starting at the next file.

Thanks,

Jeff (Fermilab)

mdavis · 24 May 2024 14:12

Hi Jeff,

Your question is quite broad and I’m not sure I can give a concise answer. There are many possible failure modes and we have a toolbox of different ways to deal with them. Here is a high-level overview:

As I understand it, you are only talking about read errors (not write errors). CTA will retry a few times on a read error: it will retry 3 times within the same mount. If that does not succeed, the retrieve request will be requeued and it will try another 3 times in a different tape drive. If that still doesn’t succeed, the request is failed. In any case, a failed read on one file will not prevent reading the other files on the tape.
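To make the retry policy above concrete, here is a minimal Python sketch of it. This is only a model of the behaviour as described in this post, not CTA code: the function name, the constants, and the `read_attempt` callback are all made up for illustration, and it assumes "3 times" means 3 attempts per mount.

```python
from typing import Callable

# Values taken from the description above; the names are hypothetical.
ATTEMPTS_PER_MOUNT = 3  # attempts within a single mount
MAX_MOUNTS = 2          # first mount, then one requeue to a different drive


def retrieve_with_retries(read_attempt: Callable[[int, int], bool]) -> str:
    """Model the retry policy for one file.

    read_attempt(mount, attempt) stands in for a real tape read and
    returns True when the read succeeds.
    """
    for mount in range(1, MAX_MOUNTS + 1):
        for attempt in range(1, ATTEMPTS_PER_MOUNT + 1):
            if read_attempt(mount, attempt):
                return f"succeeded (mount {mount}, attempt {attempt})"
        # All attempts in this mount failed: the request is requeued
        # and picked up again in a different drive (next loop iteration).
    return "failed"


# A file that only becomes readable on the second attempt in the second drive.
print(retrieve_with_retries(lambda mount, attempt: (mount, attempt) == (2, 2)))
# A file that is never readable ends up as a failed request.
print(retrieve_with_retries(lambda mount, attempt: False))
```

Note that the outcome applies to a single retrieve request; other files on the same tape are handled independently.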

Outside of the CTA tape server, the Tape Alerting System monitors the system for read errors. It decides if a drive is reporting too many errors (and should be put down) or if a tape is reporting too many errors (and should be disabled). In either case, an operator will be alerted and will follow a set of procedures to test (and put the tape back in operation) or to perform low-level data recovery (and then repack the tape). There was a CTA Workshop presentation on tape states in 2023 and a presentation on the Tape Alerting System at the last CTA Workshop.
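As a rough illustration of the drive-versus-tape decision described above, the sketch below groups read errors by drive and by tape. It is not how the Tape Alerting System is implemented: the heuristic (count distinct tapes per drive and distinct drives per tape), the thresholds, and all names are assumptions chosen only to show the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ReadError(NamedTuple):
    drive: str  # drive name (example values are made up)
    vid: str    # tape volume ID


# Hypothetical thresholds, not taken from any real TAS configuration.
DISTINCT_TAPES_LIMIT = 3   # a drive failing on this many tapes is suspect
DISTINCT_DRIVES_LIMIT = 2  # a tape failing in this many drives is suspect


def suggest_actions(errors: Iterable[ReadError]) -> Dict[str, List[str]]:
    """Return the drives and tapes an operator should be alerted about."""
    tapes_by_drive = defaultdict(set)
    drives_by_tape = defaultdict(set)
    for err in errors:
        tapes_by_drive[err.drive].add(err.vid)
        drives_by_tape[err.vid].add(err.drive)
    return {
        "drives_to_put_down": sorted(
            d for d, tapes in tapes_by_drive.items()
            if len(tapes) >= DISTINCT_TAPES_LIMIT
        ),
        "tapes_to_disable": sorted(
            v for v, drives in drives_by_tape.items()
            if len(drives) >= DISTINCT_DRIVES_LIMIT
        ),
    }


errors = [
    ReadError("drive1", "V00001"),
    ReadError("drive1", "V00002"),
    ReadError("drive1", "V00003"),
    ReadError("drive2", "V00003"),
]
print(suggest_actions(errors))
```

Counting distinct tapes and drives, rather than raw errors, is a design choice of this sketch: it avoids blaming a healthy drive for one bad tape that fails repeatedly in it.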

I hope that helps.

Michael

rbachman · 24 May 2024 14:41

Hi Jeff,
to build further on Michael’s answer, here is the workflow we use at CERN:

Once a tape has failed to be read/written in CTA and been caught by the Tape Alerting System, it falls to the operators to deal with it. In the general case the procedure looks something like this:

1. The problematic tape is repacked. This may be done in one of two ways:
   - Manually, using cta-admin repack:
     1. cta-admin tape ch --vid <vid> --full true --state REPACKING --reason "<user>: Repacking due to ..."
     2. cta-admin repack add --vid <vid> --mountpolicy <dedicated_high_priority_mount_policy_for_repack>
   - In an automated fashion, using the ATRESYS operator tool.
2. In most cases a repack will be sufficient to make the files on that tape available on new media. The old tape can then be tested using the cta-ops-admin tape mediacheck command (part of the ctaopsadmin operator tools) and, depending on the outcome, be re-inserted into the system or discarded.
3. If the repack has failed, such that not all files were copied to new media, then the easiest step is to retry it. We find that this may save time, and it should be attempted before (4). In ATRESYS a repack may be retried (cta-ops-repack-manager retry); if using cta-admin, the existing repack should be removed and then re-submitted. Should it still fail, retry again with a different tape drive or a different set of tape drive drivers (if available).
4. If multiple repacks have failed and the above options are exhausted, one may try to extract the files manually. We have a tool for this, but it is at present not ready to be published. Effectively, one uses it to select a set of files around the problematic area on the tape (AFTER repacking as many files as possible), and then the script dds the files to disk. From there they may be re-written to tape.
5. If (4) fails as well, then there is nothing more we can do on site. This is extremely rare. At this point we would inform the user of the issue, and make a decision on whether or not to contact the media vendor to have them attempt data recovery.
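For sites that want to script the manual repack from step 1, a small Python helper that assembles the two cta-admin commands could look like the sketch below. It uses only the subcommands and flags quoted in this post. The function name and the example VID, user, reason, and mount policy are made up, and the commands are only printed, not executed.

```python
import shlex
from typing import List


def manual_repack_commands(vid: str, user: str, reason: str,
                           mount_policy: str) -> List[List[str]]:
    """Build the two cta-admin invocations for a manual repack."""
    # Mark the tape as full and put it into the REPACKING state.
    change_state = [
        "cta-admin", "tape", "ch",
        "--vid", vid,
        "--full", "true",
        "--state", "REPACKING",
        "--reason", f"{user}: {reason}",
    ]
    # Queue the repack with a dedicated high-priority mount policy.
    add_repack = [
        "cta-admin", "repack", "add",
        "--vid", vid,
        "--mountpolicy", mount_policy,
    ]
    return [change_state, add_repack]


# Example values below are placeholders, not real tapes or policies.
for cmd in manual_repack_commands(
    vid="V01234",
    user="jdoe",
    reason="Repacking due to read errors",
    mount_policy="repack_high_priority",
):
    print(shlex.join(cmd))  # print only; review before running by hand
```

Keeping each command as an argument list means the reason string, which contains spaces, is passed as a single argument if the list is later handed to something like subprocess.run.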

fnal_jeff · 6 November 2024 16:20

From this fluentd config: monitoring/fluentd/cta-ops-tape-alerting-system.conf (master branch, cta / CTA Operations Utilities, GitLab)

Does CERN have their taped logs set to DEBUG or INFO in production?

rbachman · 7 November 2024 07:46

Hi Jeff,
CTA logging is generally configured to be INFO-level, both for the core CTA frontend/taped and for the operations utilities.

DEBUG may, however, still be mentioned in the monitoring config files, either to exclude these logs from monitoring while we troubleshoot (to avoid flooding the system with overly verbose output), or to send them to an appropriate place (sometimes just stdout).
In the case of the file you mention, DEBUG logs are never relevant for TAS, and so are excluded.