Should run cleaner but VID is missing

Hello Friends,

We have recently had a tape drive go down twice with the following reason:

[cta-taped] ERROR In DriveHandler::runChild(): Should run cleaner but VID is missing. Putting the drive down.

Every time this happened, there was a tape in the drive (different vid each time).

Looks like it’s hitting this: tapeserver/daemon/DriveHandler.cpp · v4.6.3-1 · cta / CTA · GitLab

I am not sure why, because m_previousVid.empty() should not evaluate to True is there is a tape in the drive?

We recently upgraded cta from 4.0-2 to 4.6.3, not sure if that had anything to do with it.

I will also paste some logs from the tapeserver itself. Maybe this is the root cause of the original failure: In SocketPair::receive(): connection reset by peer

Jun  6 10:05:03.686072 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="352" TID="352" MSG="In GarbageCollector::acquireTargets(): started tracking an untracked agent" SubprocessName="maintenanceHandler" agentAddress="ID="1" MSG="Computed new timeout" tapeDrive="DR9" TimeoutType="StateChange" LastStateChangeTime="8828389" LastHeartBeatTime="8828945" LastDataMovementTime="8828404" Now="8828945" Timeout="8828989"
Jun  6 10:04:20.414591 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="1" TID="1" MSG="Computed new timeout" tapeDrive="DR9" TimeoutType="StateChange" LastStateChangeTime="8828389" LastHeartBeatTime="8828945" LastDataMovementTime="8828404" Now="8828945" Timeout="8828989"
Jun  6 10:04:35.429144 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="1" TID="1" MSG="Computed new timeout" tapeDrive="DR9" TimeoutType="StateChange" LastStateChangeTime="8828389" LastHeartBeatTime="8828960" LastDataMovementTime="8828404" Now="8828960" Timeout="8828989"
Jun  6 10:04:35.429294 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="1" TID="1" MSG="Computed new timeout" tapeDrive="DR9" TimeoutType="StateChange" LastStateChangeTime="8828389" LastHeartBeatTime="8828960" LastDataMovementTime="8828404" Now="8828960" Timeout="8828989"
Jun  6 10:04:50.444027 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="1" TID="1" MSG="Computed new timeout" tapeDrive="DR9" TimeoutType="StateChange" LastStateChangeTime="8828389" LastHeartBeatTime="8828975" LastDataMovementTime="8828404" Now="8828975" Timeout="8828989"
Jun  6 10:04:50.444116 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="1" TID="1" MSG="Computed new timeout" tapeDrive="DR9" TimeoutType="StateChange" LastStateChangeTime="8828389" LastHeartBeatTime="8828975" LastDataMovementTime="8828404" Now="8828975" Timeout="8828989"
Jun  6 10:05:04.902678 cta-tapeserver-crlt-v5-0 cta-taped: LVL="WARN" PID="1" TID="1" MSG="In DriveHandler::processTimeout(): Killed subprocess." tapeDrive="DR9" SessionState="Mounting" SessionType="Archive" TimeoutType="StateChange" SessionTypeWhenTimeoutDecided="Archive" SessionStateWhenTimeoutDecided="Mounting" LastDataMovementTime="8828404" LastHeartbeatTime="8828975" LastStateChangeTime="8828389" Now="8828989" ThisTimeout="8828989" BeforeStateChangeTimeout_s="-0.013755" SubprocessId="11852"
Jun  6 10:05:20.344926 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="1" TID="1" MSG="In DriveHandler::processEvent(): Got a peer disconnect: closing socket and waiting for SIGCHILD" tapeDrive="DR9" Message="In SocketPair::receive(): connection reset by peer."
Jun  6 10:05:20.344998 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="1" TID="1" MSG="In signal handler, received SIGCHLD and propagations to other handlers" signal="Child exited" senderPID="11852" senderUID="0"
Jun  6 10:05:20.345019 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="1" TID="1" MSG="Handler received SIGCHILD. Propagating to all handlers." SubprocessName="signalHandler"
Jun  6 10:05:20.345029 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="1" TID="1" MSG="Propagated SIGCHILD." SubprocessName="signalHandler"
Jun  6 10:05:20.345231 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="1" TID="1" MSG="Drive subprocess crashed. Will spawn a new one." tapeDrive="DR9" pid="11852" IfSignaled="1" TermSignal="9" CoreDump="0"
Jun  6 10:05:20.345260 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="1" TID="1" MSG="Tape session finished" tapeVid="A02058" mountType="ArchiveForUser" mountId="508278" volReqId="508278" tapeDrive="DR9" vendor="IBM" vo="Shared" mediaType="LTO7" tapePool="crlt-piranha-tapes" logicalLibrary="crlt-piranha" capacityInBytes="6000000000000" stillOpenFileForThread0="root://eos-mgm-0.mgm.cta.svc.cluster.archive//eos/cta/aarnet-cloudstor/uow.edu.au/jpickard@uow.edu.au/data/d1/d182a422ab8960cae585984d16951735bac6855755f1bde1c9cfb32095fde30e?eos.lfn=fxid:9c076f" stillOpenFileForThread1="root://eos-mgm-0.mgm.cta.svc.cluster.archive//eos/cta/aarnet-cloudstor/uow.edu.au/jpickard@uow.edu.au/data/ca/cadf35a101959f272fdf7fab2dcf1ea880ec30292325e7f36e973cc4b2548685?eos.lfn=fxid:9c0770" Error_sessionKilled="1" killSignal="9" status="failure"
Jun  6 10:05:20.345322 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="1" TID="1" MSG="Propagated SIGCHILD." SubprocessName="drive:DR9"
Jun  6 10:05:20.345942 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="1" TID="1" MSG="Propagated SIGCHILD." SubprocessName="maintenanceHandler"
Jun  6 10:05:20.345967 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="1" TID="1" MSG="Subprocess handler requested forking" SubprocessName="drive:DR9"
Jun  6 10:05:20.345979 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="1" TID="1" MSG="Subprocess handler will fork" SubprocessName="drive:DR9"
Jun  6 10:05:20.347850 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="In child process. Running child." SubprocessName="drive:DR9"
Jun  6 10:05:20.348059 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="In DriveHandler::runChild(): will connect to object store backend." SubprocessName="drive:DR9" backendPath="rados://admin@eoscta_metadata:cta-ns"
Jun  6 10:05:20.348097 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="In DriveHandler::runChild(): will create agent entry. Enabling leaving non-empty agent behind." SubprocessName="drive:DR9" processName="DriveProcess-DR9"
Jun  6 10:05:20.366422 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="BackendRados::BackendRados() about to create a new context" contextId="0"
Jun  6 10:05:20.366953 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="BackendRados::BackendRados() context created. About to set namespace." contextId="0"
Jun  6 10:05:20.366969 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="BackendRados::BackendRados() namespace set. About to test access." contextId="0"
Jun  6 10:05:20.367832 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="BackendRados::BackendRados() about to create a new context" contextId="1"
Jun  6 10:05:20.367845 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="BackendRados::BackendRados() context created. About to set namespace." contextId="1"
...
...
Jun  6 10:05:20.424772 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="Started Rados worker thread" thread="radosWorker" threadID="121"
Jun  6 10:05:20.425064 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="Started Rados worker thread" thread="radosWorker" threadID="122"
Jun  6 10:05:20.425527 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="Started Rados worker thread" thread="radosWorker" threadID="123"
Jun  6 10:05:20.425749 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="Started Rados worker thread" thread="radosWorker" threadID="124"
Jun  6 10:05:20.425899 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="Started Rados worker thread" thread="radosWorker" threadID="125"
Jun  6 10:05:20.426377 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="Started Rados worker thread" thread="radosWorker" threadID="126"
Jun  6 10:05:20.426709 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="Started Rados worker thread" thread="radosWorker" threadID="127"
Jun  6 10:05:20.426927 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="In BackendRados::BackendRados(): created worker threads" workThreads="128"
Jun  6 10:05:20.439399 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="In DriveHandler::createCatalogue(): will get catalogue login information." SubprocessName="drive:DR9" fileCatalogConfigFile="/etc/cta/cta-catalogue.conf" caller="DriveHandler::runChild()"
Jun  6 10:05:20.439844 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="In DriveHandler::createCatalogue(): will connect to catalogue." SubprocessName="drive:DR9" fileCatalogConfigFile="/etc/cta/cta-catalogue.conf" caller="DriveHandler::runChild()"
Jun  6 10:05:20.441442 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="In DriveHandler::runChild(): will create scheduler." SubprocessName="drive:DR9"
Jun  6 10:05:20.441599 cta-tapeserver-crlt-v5-0 cta-taped: LVL="DEBUG" PID="12042" TID="12042" MSG="In DriveHandler::runChild(): will ping scheduler." SubprocessName="drive:DR9"
Jun  6 10:05:20.456842 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="In Scheduler::ping(): success." SubprocessName="drive:DR9" catalogueTime="0.014588" schedulerDbTime="0.000537" checkEnvironmentVariablesTime="0.000073"
Jun  6 10:05:20.456908 cta-tapeserver-crlt-v5-0 cta-taped: LVL="ERROR" PID="12042" TID="12042" MSG="In DriveHandler::runChild(): Should run cleaner but VID is missing. Putting the drive down." SubprocessName="drive:DR9"
Jun  6 10:05:20.460804 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="In Scheduler::setDesiredDriveState(): success." SubprocessName="drive:DR9" drive="DR9" up="down" force="no" reason="[cta-taped] ERROR In DriveHandler::runChild(): Should run cleaner but VID is missing. Putting the drive down." comment="" schedulerDbTime="0.000837"
Jun  6 10:05:20.463240 cta-tapeserver-crlt-v5-0 cta-taped: LVL="INFO" PID="12042" TID="12042" MSG="In Agent::removeAndUnregisterSelf(): Removed agent object." agentObject="DriveProcess-DR9-cta-tapeserver-crlt-v5-0-12042-20220606-10:05:20-0"

Thank you in advance,

Denis

Sorry, Denis, this is a regression. It’s fixed in 4.7.3-1.

Ahh, thanks heaps, Michael :slight_smile: Not to worry, 4.7.3-1 is on the horizon for us this month!

Thank you as always,

Warm Regards

Denis

G’day Michael,

After upgrading the 4.7.3-1, we are still seeing this issue. Any ideas?

Warm Regrads,

Denis

Hello Denis,

It appeared that our fix covered only a part of the problem. We are working on a final solution.

Cheers,
Vova

Thanks Vova :slight_smile: , much appreciated. Will patiently wait.

Hi Denis,

The fix you are waiting for is in CTA v4.7.6-1. We will deploy this release in production tomorrow.

Cheers,

Michael

Hello Michael,

Thanks for that! We will look to deploy that as soon as possible :slight_smile:

Warm Regards,

Denis Lujanski