Cta-taped is killed after upgrading to 4.7.12-2


We upgraded our EOSCTA instance at RAL on Wed 11/1 to EOS 4.8.88 and CTA 4.7.12-2.

We have noticed that since the upgrade, cta-taped is killed/shut down across our tape servers.
It looks like the subprocess for one drive (we have 2 drives per tape server) is shut down after a fatal failure:

2023-01-13T11:39:06.310857+00:00 cta-ts06 cta-taped: LVL="CRIT" PID="888" TID="888" MSG="In DriveHandler::processFatal(): shutting down after fatal failure." tapeDrive="asterix_ts1160_04" PreviousState="Unmounting" PreviousType="Archive" NewState="Fatal" NewType="Archive"

which results in a failed tape session on this drive:

2023-01-13T11:39:06.473805+00:00 cta-ts06 cta-taped: LVL="INFO" PID="888" TID="888" MSG="Tape session finished" tapeVid="CT4065" mountType="ArchiveForUser" mountId="487144" volReqId="487144" tapeDrive="asterix_ts1160_04" vendor="IBM" vo="atlas" mediaType="TS1160" tapePool="atlvalid" logicalLibrary="asterix_ts" capacityInBytes="20000000000000" Error_sessionKilled="1" killSignal="9" status="failure"

but then the cta-taped subprocess for the other drive is also shut down:

2023-01-13T11:39:06.611399+00:00 cta-ts06 cta-taped: LVL="INFO" PID="888" TID="888" MSG="In DriveHandler::shutdown(): simply killing the process." tapeDrive="asterix_ts1160_05"

and in the end even the MaintenanceHandler is stopped:
2023-01-13T11:44:07.490438+00:00 cta-ts06 cta-taped: LVL="INFO" PID="888" TID="888" MSG="Propagated SIGCHILD." SubprocessName="signalHandler"
2023-01-13T11:44:07.490725+00:00 cta-ts06 cta-taped: LVL="INFO" PID="888" TID="888" MSG="Propagated SIGCHILD." SubprocessName="drive:asterix_ts1160_04"
2023-01-13T11:44:07.491018+00:00 cta-ts06 cta-taped: LVL="INFO" PID="888" TID="888" MSG="Propagated SIGCHILD." SubprocessName="drive:asterix_ts1160_05"
2023-01-13T11:44:07.491309+00:00 cta-ts06 cta-taped: LVL="INFO" PID="888" TID="888" MSG="Maintenance subprocess exited. Will not spawn new one as we are shutting down." pid="891" eode="0"
2023-01-13T11:44:07.491591+00:00 cta-ts06 cta-taped: LVL="INFO" PID="888" TID="888" MSG="Propagated SIGCHILD." SubprocessName="maintenanceHandler"
2023-01-13T11:44:07.491868+00:00 cta-ts06 cta-taped: LVL="INFO" PID="888" TID="888" MSG="cta-taped exiting."

Can you please advise us on what to do? Is there a bug in the 4.7.12-2 cta-taped, and should we move to 4.7.13?

Many thanks,


Hello George,
from which CTA version did you upgrade to 4.7.12-2?
Could you also send the result of cta-admin --json version?


Hi Julien,

We upgraded from CTA 4.6.1-1. The output from cta-admin --json version is:

[root@cta-adm ~]# cta-admin --json version
[{"clientVersion":{"ctaVersion":"4.7.12-2","xrootdSsiProtobufInterfaceVersion":"v1.3"},"serverVersion":{"ctaVersion":"4.7.12-2","xrootdSsiProtobufInterfaceVersion":"v1.3"},"catalogueConnectionString":"oracle:cta/******@CTA_VEGA","catalogueVersion":"12.0","isUpgrading":false}][root@cta-adm ~]#



Hello George,

The json string you provided seems to be fine…

Could you try upgrading to the latest publicly available version of CTA (4.7.14-1)? It includes several fixes. If the problem persists, then we will have to check in more detail.


Hi Julien,

Thanks for this. It is true that the issue showed up exactly after upgrading to 4.7.12. I checked the Release Notes for bug fixes relating to cta-taped but couldn't find anything relevant.

We will have a go with 4.7.14-1. Sanity check: as far as I can tell from the Release Notes, the v12.0 schema should be compatible, right?



Yes, we also did not find a bug fix clearly referring to this in the Release Notes. But maybe we are just not looking at the right logs… Anyway, it’s better to check 4.7.14-1 first.

Regarding the catalogue, you should have no problems. The schema is still v12.0 in all our latest releases.

Looking at the taped logs, it appears that the problem occurs when the tape daemon is trying to unload a tape. We can see in the logs that it fails to unload the tape within the configured number of attempts (nbAttempts). For whatever reason, the logic in CTA appears to be to stop the whole tape daemon. At CERN that is probably not an issue, as you only have one drive per server. At RAL and other sites we have multiple drives per server, so stopping the daemon when the other drive is working fine is not what we want. So:

  1. Is there a way to change the variable nbAttempts=10 to a greater value, or is this hard coded?
  2. Is there a way to configure cta-taped so that it keeps running after setting the failed drive down? Oddly, when the tape daemon fails to load a tape after 10 attempts, the daemon carries on running!

snippet of log -
2023-02-01T16:40:47.658328+00:00 cta-ts01 cta-taped: LVL="ERROR" PID="6565" TID="8021" MSG="Exception in TapeReadSingleThread-TapeCleaning when unmounting/unloading the tape. Putting the drive down." thread="TapeRead" tapeDrive="asterix_ts1160_09" tapeVid="CT2835" mountId="520863" exceptionMessage="Failed to dismount tape: vid=CT2835 slot=smc9: Failed to dismount tape in SCSI tape-library: vid=CT2835 librarySlot=smc9: Received error from rmcd after several fast retries: nbAttempts=10 rmcErrorStream=smc_dismount: SR017 - find_cartridge CT2835 failed : Not Ready to Ready Transition"
2023-02-01T16:40:47.660287+00:00 cta-ts01 cta-taped: LVL="INFO" PID="6563" TID="6563" MSG="In DriveHandler::processEvent(): changing session state" tapeDrive="asterix_ts1160_09" PreviousState="Unmounting" PreviousType="Retrieve" NewState="Fatal" NewType="Retrieve"
2023-02-01T16:40:47.660587+00:00 cta-ts01 cta-taped: LVL="CRIT" PID="6563" TID="6563" MSG="In DriveHandler::processFatal(): shutting down after fatal failure." tapeDrive="asterix_ts1160_09" PreviousState="Unmounting" PreviousType="Retrieve" NewState="Fatal" NewType="Retrieve"


Thank you for providing the detailed info! Concerning your questions:

  1. This value is hard coded, we will see if we can expose it to be configurable.
  2. It’s true that the current logic is to shut down cta-taped in case of a critical failure in one of the subprocesses. We will investigate the possibility of keeping the daemon running to support multi-drive setups.


Hi George,
we discussed the corresponding CTA development issue in this afternoon's development meeting.

Could you evaluate the following workaround for the multiple tape drive issue on your side and give us some feedback about it:

  • run cta-rmcd as you do today
  • run 1 cta-taped process per tape drive in its own container, each cta-taped process configured to only use one tape drive out of the ones that are connected to your box

This is already how we are running in CTA CI: the 2 cta-taped processes see all SCSI tape drives but each is configured to use only 1.
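For illustration, such a per-process single-drive setup boils down to giving each cta-taped instance its own TPCONFIG file listing just one drive. A hypothetical sketch (the device file and library slot below are example values, not taken from your site; check the column layout against your existing TPCONFIG):

```
# Hypothetical TPCONFIG for the cta-taped process dedicated to drive 04.
# Columns: DriveName  LogicalLibrary  DeviceFile  LibrarySlot
asterix_ts1160_04  asterix_ts  /dev/nst0  smc4
```

The second process would get an analogous file listing only asterix_ts1160_05.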

Let us know if this is acceptable for you: isolating the drive processes upon fatal failure is way trickier and will require some time.

Best regards,

Hi Julien,

Many thanks for these suggestions and for looking into this issue.

We discussed your proposed solution in our CTA project meeting today and agreed that it is definitely something worth trying in the near future (after we finish the migration of our last CASTOR instance to EOS/CTA).

Just a note of clarification: if I understand correctly what your proposed solution achieves is to prevent the “good” drive(s) from being brought down because of failures (related to tape mounts etc) seen in the “faulty” drive.

There is also the question of making cta-taped more tolerant of drive failures: instead of shutting down after, say, 10 tape mount failures, it could shut down after 20 failures, or this value could be made configurable as it was in CASTOR, as Tim mentioned above. Is this kind of change included in your development plans?





Hi George,
Indeed, the goal is to protect the good drive.
With this proposal you have 2 independent sets of processes forked by cta-taped: 1 per tape drive. Therefore, when something goes really wrong inside one set of subprocesses, the exceptions causing fatal issues are constrained to that specific set of processes and only affect one tape drive. Using containers would allow everyone to fall back on the same single-drive configuration and avoid an exponential combination of use cases depending on the number of connected tape drives. See the corresponding CTA development ticket: Keep cta-taped alive if at least one drive handler is alive.

For the tape mount operations, cta-rmcd is entirely independent: you just need to run one instance per tape server and it will be shared by the 2 local cta-taped containers.

Regarding the container technology, you could spawn a podman container with systemd that runs in userland as the cta:tape system user.
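As a rough sketch of what such a systemd-wrapped podman container could look like (the unit name, image name, device files and user/group mapping below are all assumptions for illustration, not values from our CI):

```ini
# Hypothetical /etc/systemd/system/cta-taped-drive04.service
[Unit]
Description=cta-taped container for tape drive asterix_ts1160_04
After=network-online.target cta-rmcd.service

[Service]
# Run in userland as the cta:tape system user.
User=cta
Group=tape
# Image name and device files are placeholders for illustration only:
# pass through the drive's st/sg devices and the host config/log dirs.
ExecStart=/usr/bin/podman run --rm --name cta-taped-drive04 \
    --device /dev/nst0 --device /dev/sg4 \
    -v /etc/cta:/etc/cta:ro -v /var/log/cta:/var/log/cta \
    my-registry.example.org/cta-taped:4.7.12-2
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

A second unit, pointing at the other drive's devices and its own single-drive configuration, would cover the second drive on the box.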

For your second question, RMC_MAXRQSTATTEMPTS was set to 10 by default in CASTOR and was changed into a configuration parameter on 2015-01-17 (in CASTOR commit 91e32fc3bbcbe2e32c05a9ab56f639171d94adf3). It has not been configurable since the early CTA days: at the beginning of 2017 it was hardcoded to its current value. We have created a CTA dev ticket for this issue, Make RMC maxRqstAttempts variable configurable; you can find the planning/priority details there.

Best regards,
Julien Leduc

Hi @jleduc

Sorry to come back to this. Can you please point me to some clues on how to run cta-taped in a container? You said above that

“This is already how we are running in CTA CI: the 2 cta-taped processes are seeing all SCSI tape drives but only configured to use 1 per process”

So, maybe you could point me to this documentation?



Hi George,
we are going to change the cta-taped strategy and code to deal with multiple tape drives:

  • 1 cta-taped service per tape drive with its own configuration file

This will greatly simplify the management of multiple tape drives on the same server.

It will look like systemctl start cta-taped@drive1.
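To illustrate the idea (the option names, paths and config file naming below are guesses for illustration, not the final design), such a per-drive service could be backed by a systemd template unit along these lines:

```ini
# Hypothetical /etc/systemd/system/cta-taped@.service template unit
[Unit]
Description=CTA tape daemon for tape drive %i
After=cta-rmcd.service

[Service]
# Each instance reads its own single-drive configuration file;
# the command-line option names here are illustrative only.
ExecStart=/usr/bin/cta-taped --foreground --config /etc/cta/cta-taped-%i.conf
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With a template like this, systemctl start cta-taped@drive1 would start the instance configured for drive1.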

We are currently working on gathering the use cases, operation model and writing specifications for this.

So this will become mainstream.

In the meantime, if you want to run cta-taped in containers, you can mount the cta-taped config file directory structure and log directory into a CC7 container and follow the installation instructions: Public RPMs installation instructions - EOSCTA Docs

You just need to make sure that cta-taped is not using syslog for its logs (just configure it and launch it like in CI: continuousintegration/docker/ctafrontend/cc7/opt/run/bin/taped.sh · main · cta / CTA · GitLab).

I can have a look at what I have on my side, but I launched some FSTs in podman pods this way.


Hi Julien,

Thanks for the reply, and for the reminder about multi-drive support in future CTA releases - I do remember that being mentioned in the recent pre-GDB meeting.

I am afraid I don’t know much about containers - this is an excellent opportunity to learn!

  • Do you have a container image that I can use for cta-taped, or do I need to build it (and how)?

  • I am not sure how the /opt/run/bin/taped.sh (with the cta-rmcd bits removed) can be used with podman.

  • How do you configure the containerised cta-taped to work with one drive? By using
    taped TpConfigPath /etc/cta/TPCONFIG and a separate TPCONFIG for each container?

Thanks again,