[STALE] state means that the cta-taped process has not updated its state for many seconds (over a threshold). That is not normal, and definitely not normal for a running session.
It usually shows that there is some underlying problem.
In the past, we have seen on multiple occasions that the drives became STALE:
global scheduler lock acquisition time issue (link requires CERN sign-in)
investigate in the cta-taped logs which stage the archive session is in (is it waiting for a disk to provide data? is it still transferring data to the drive? is it generating a report that the data was successfully written?)
investigate what kind of / how many objects you have in the object store (is there corruption or a leak somewhere, with objects not being properly removed?)
An example of our object store cleanup is here (link requires CERN sign-in).
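To see which stage a session is in, grepping the structured log fields can help. A minimal self-contained sketch follows; the sample log lines, messages and the temp-file path are fabricated for illustration, so point the grep at your real cta-taped log instead:

```shell
# Stand-in for the real cta-taped log; the lines below are made up
# purely to illustrate the grep.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Aug 30 01:10:00 tpm302 cta-taped: LVL="INFO" MSG="Waiting for files from disk" tapeDrive="rz3-2,4,4" SessionState="Running"
Aug 30 01:12:00 tpm302 cta-taped: LVL="INFO" MSG="Flushing data to tape" tapeDrive="rz3-2,4,4" SessionState="Running"
EOF
# The latest entries for the affected drive show the current stage:
grep 'tapeDrive="rz3-2,4,4"' "$LOG" | tail -n 5
rm -f "$LOG"
```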
Since the drive states are in a database, could this hint a problem with the database?
Prior to CTA v4.6.0-1, drive states were kept in the objectstore. From v4.6.0 onwards, this functionality was moved to the DB. Can you confirm what version of cta-taped you are running?
In that case the drive states are in the DB and the issues with drive reporting that Vlado mentioned should be fixed. It’s strange to me that reinitialising the objectstore should have an effect on this issue.
When the drive reporting is stale, what is the last action taken by the drive (grep for the hostname in cta-taped.log and check the last log entries)?
When the transfer session finishes, does the drive reporting return to normal without a manual intervention?
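For the "last action" check, something along these lines works; the hostname, temp-file path and sample entries here are placeholders, so substitute the real tape server hostname and log location:

```shell
# Stand-in for cta-taped.log on the tape server (fabricated lines).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Aug 30 01:19:41 tpm302 cta-taped: LVL="WARN" MSG="In DriveHandler::processTimeout(): Killed subprocess." tapeDrive="rz3-2,4,4"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" MSG="Tape session finished" tapeDrive="rz3-2,4,4"
EOF
# The last few entries mentioning the host tell you what the drive
# was doing when its reporting went stale:
grep 'tpm302' "$LOG" | tail -n 3
rm -f "$LOG"
```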
Reinitialising works because the objectstore starts from a clean slate. Requests are then resubmitted and new tape sessions start, until they go STALE again.
A sample log snippet of one such killed session:
Aug 30 01:19:41 tpm302 cta-taped: LVL="WARN" PID="17727" TID="17727" MSG="In DriveHandler::processTimeout(): Killed subprocess." tapeDrive="rz3-2,4,4" SessionState="Running" SessionType="Archive" TimeoutType="DataMovement" SessionTypeWhenTimeoutDecided="Archive" SessionStateWhenTimeoutDecided="Running" LastDataMovementTime="1156317" LastHeartbeatTime="1157202" LastStateChangeTime="1150212" Now="1157217" ThisTimeout="1157217" BeforeDataMovementTimeout_s="-0.003234" BeforeHeartbeatTimeout_s="45.627177" SubprocessId="1488"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="In signal handler, received SIGCHLD and propagations to other handlers" signal="Child exited" senderPID="1488" senderUID="0"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Handler received SIGCHILD. Propagating to all handlers." SubprocessName="signalHandler"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="signalHandler"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="drive:rz3-2,3,2"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="drive:rz3-2,3,4"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="drive:rz3-2,4,3"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Drive subprocess crashed. Will spawn a new one." tapeDrive="rz3-2,4,4" pid="1488" IfSignaled="1" TermSignal="9" CoreDump="0"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Tape session finished" tapeVid="L09229" mountType="ArchiveForUser" mountId="12" volReqId="12" tapeDrive="rz3-2,4,4" vendor="HPE" vo="xfel" mediaType="LTO9" tapePool="FXE-2023" logicalLibrary="ctaltolib-rz3-1" capacityInBytes="18000000000000" wasTapeMounted="1" mountTime="16.219651" positionTime="93.672077" waitInstructionsTime="1.191164" waitFreeMemoryTime="0.000000" waitDataTime="3234.891367" waitReportingTime="0.027062" checksumingTime="410.242432" readWriteTime="2382.950788" flushTime="63.009445" unloadTime="0.000000" unmountTime="0.000000" encryptionControlTime="0.002331" transferTime="6092.312258" totalTime="7100.169509" deliveryTime="7100.169509" drainingTime="0.000000" dataVolume="1214453072268" filesCount="646" headerVolume="310080" payloadTransferSpeedMBps="171.045645" driveTransferSpeedMBps="171.045689" Error_sessionKilled="1" killSignal="9" status="failure"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="drive:rz3-2,4,4"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="maintenanceHandler"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Subprocess handler requested forking" SubprocessName="drive:rz3-2,4,4"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Subprocess handler will fork" SubprocessName="drive:rz3-2,4,4"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="24530" TID="24530" MSG="In child process. Running child." SubprocessName="drive:rz3-2,4,4"
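Reading the timestamps out of the processTimeout() WARN line above shows why the subprocess was killed. The values below are copied from that log line; the ~900 s DataMovement timeout is an inference from these numbers, not a confirmed configuration value:

```shell
# Values taken from the processTimeout() WARN line above.
NOW=1157217
LAST_DATA_MOVEMENT=1156317
LAST_HEARTBEAT=1157202
echo "seconds since last data movement: $((NOW - LAST_DATA_MOVEMENT))"
echo "seconds since last heartbeat:     $((NOW - LAST_HEARTBEAT))"
# BeforeDataMovementTimeout_s="-0.003234" means the DataMovement
# timeout (apparently ~900 s) had just expired, while the heartbeat
# timeout still had ~45 s of margin. In other words, the drive was
# still heartbeating but had stopped moving data, so the session
# was killed.
```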
Sometimes a successful transfer session finishes and jobs are queued to be reported, but the drive is only released after restarting another tape server, which then does the reporting. All tape servers have been configured to report.
We have only observed this recently, since we moved XFEL to CTA. So it may be too many requests, or how dCache is interacting with CTA. Restarting the dCache frontend has also been seen to release the drives, after which the queued reports get reported. Tigran has opened an issue about this.