How to resolve STALE drives

Hello,

Occasionally, some of our drives become stale during an ongoing archive session:

[root@cta-frontend ctaObjectStore]# cta-admin dr ls
        library     drive   host desired        request   status since    vid   tapepool   vo files  data  MB/s session priority activity          age reason
ctaltolib-rz3-1 rz3-2,3,2 tpm302      Up ArchiveForUser Transfer 31383 L09114 FXE-202302 xfel  4940 10.7T 339.6       5        0        -            9 -
ctaltolib-rz3-1 rz3-2,3,4 tpm302      Up ArchiveForUser Transfer 31423 L09206 HED-202302 xfel  9752 10.5T 332.1       6        0        -            2 -
ctaltolib-rz3-1 rz3-2,4,2 tpm301      Up ArchiveForUser Transfer  8976 L09200    archive xfel    71 25.2G  26.6      11        0        - 8176 [STALE] -
ctaltolib-rz3-1 rz3-2,4,3 tpm302      Up ArchiveForUser Transfer 31329 L09111 MID-202302 xfel   398  3.0T  96.4       7        0        -            2 -
ctaltolib-rz3-1 rz3-2,4,4 tpm302      Up ArchiveForUser Transfer  8997 L09229   FXE-2023 xfel   646  1.2T 171.0      12        0        - 2006 [STALE] -

Sometimes restarting cta-frontend-grpc or cta-taped helps temporarily, but ultimately reinitialising the objectstore is what resolves it most reliably.
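
For reference, the restarts we try look roughly like this (the systemd unit names are as deployed on our hosts and may differ elsewhere):

systemctl restart cta-taped           # on the affected tape server
systemctl restart cta-frontend-grpc   # on the frontend node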

Is there another way of dealing with this?

Mwai

Hello Mwai,

[STALE] means that the cta-taped process has not updated the drive's state for longer than a threshold of many seconds. That is not normal, and definitely not normal for a running session.
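
As a quick check, you can filter the stale rows straight out of the table you showed above:

cta-admin dr ls | grep -F '[STALE]'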

It usually shows that there is some underlying problem.

In the past, we have seen drives become STALE on multiple occasions.

I would suggest that you:

  • investigate in the cta-taped logs which stage the archive session is in (is it waiting for a disk to provide data? is it still transferring data to the drive? is it generating a report that the data was successfully written?)
  • investigate what kind of objects, and how many, you have in the objectstore (is there some corruption or a leak somewhere, with objects not being properly removed? see the sketch after this list)
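
A minimal sketch of the second check, assuming the cta-objectstore-tools package is installed and pointed at your backend (object naming varies between versions, so the prefix parsing below is an assumption):

# Count objectstore objects grouped by their name prefix:
cta-objectstore-list | sed 's/-.*//' | sort | uniq -c | sort -rn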

An example of our cleanup of the objectstore is linked here (the link requires a CERN sign-in).

Since the drive states are kept in a database, could this hint at a problem with the database?

Hope this helps. Best regards,

Vladimir

Hi @Mwai,

Prior to CTA v4.6.0-1, drive states were kept in the objectstore. From v4.6.0 onwards, this functionality was moved to the DB. Can you confirm what version of cta-taped you are running?

Michael

Hi @mdavis,

we are running v5.8.7-1

Thanks

In that case the drive states are in the DB and the issues with drive reporting that Vlado mentioned should be fixed. It’s strange to me that reinitialising the objectstore should have an effect on this issue.

When the drive reporting is stale, what is the last action taken by the drive (grep for the hostname in cta-taped.log and check the last log entries)?
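
For example (the log path is the usual default; the drive and host names come from your table above and should be adjusted):

# On tpm301, the host of the stale drive rz3-2,4,2:
grep 'rz3-2,4,2' /var/log/cta/cta-taped.log | tail -20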

When the transfer session finishes, does the drive reporting return to normal without a manual intervention?

Reinitialising works because the objectstore starts from a clean slate. Requests are then resubmitted and new tape sessions start, until they go STALE again.
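
For completeness, our reinitialisation is roughly the following sketch (the tool ships in cta-objectstore-tools; the path is illustrative, standing in for our real NFS mount):

systemctl stop cta-taped                             # on every tape server first
cta-objectstore-initialize /path/to/ctaObjectStore   # wipe and recreate
systemctl start cta-taped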

Here is a sample snippet of one such session being killed:

Aug 30 01:19:41 tpm302 cta-taped: LVL="WARN" PID="17727" TID="17727" MSG="In DriveHandler::processTimeout(): Killed subprocess." tapeDrive="rz3-2,4,4" SessionState="Running" SessionType="Archive" TimeoutType="DataMovement" SessionTypeWhenTimeoutDecided="Archive" SessionStateWhenTimeoutDecided="Running" LastDataMovementTime="1156317" LastHeartbeatTime="1157202" LastStateChangeTime="1150212" Now="1157217" ThisTimeout="1157217" BeforeDataMovementTimeout_s="-0.003234" BeforeHeartbeatTimeout_s="45.627177" SubprocessId="1488"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="In signal handler, received SIGCHLD and propagations to other handlers" signal="Child exited" senderPID="1488" senderUID="0"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Handler received SIGCHILD. Propagating to all handlers." SubprocessName="signalHandler"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="signalHandler"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="drive:rz3-2,3,2"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="drive:rz3-2,3,4"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="drive:rz3-2,4,3"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Drive subprocess crashed. Will spawn a new one." tapeDrive="rz3-2,4,4" pid="1488" IfSignaled="1" TermSignal="9" CoreDump="0"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Tape session finished" tapeVid="L09229" mountType="ArchiveForUser" mountId="12" volReqId="12" tapeDrive="rz3-2,4,4" vendor="HPE" vo="xfel" mediaType="LTO9" tapePool="FXE-2023" logicalLibrary="ctaltolib-rz3-1" capacityInBytes="18000000000000" wasTapeMounted="1" mountTime="16.219651" positionTime="93.672077" waitInstructionsTime="1.191164" waitFreeMemoryTime="0.000000" waitDataTime="3234.891367" waitReportingTime="0.027062" checksumingTime="410.242432" readWriteTime="2382.950788" flushTime="63.009445" unloadTime="0.000000" unmountTime="0.000000" encryptionControlTime="0.002331" transferTime="6092.312258" totalTime="7100.169509" deliveryTime="7100.169509" drainingTime="0.000000" dataVolume="1214453072268" filesCount="646" headerVolume="310080" payloadTransferSpeedMBps="171.045645" driveTransferSpeedMBps="171.045689" Error_sessionKilled="1" killSignal="9" status="failure"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="drive:rz3-2,4,4"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Propagated SIGCHILD." SubprocessName="maintenanceHandler"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Subprocess handler requested forking" SubprocessName="drive:rz3-2,4,4"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="17727" TID="17727" MSG="Subprocess handler will fork" SubprocessName="drive:rz3-2,4,4"
Aug 30 01:19:42 tpm302 cta-taped: LVL="INFO" PID="24530" TID="24530" MSG="In child process. Running child." SubprocessName="drive:rz3-2,4,4"

Sometimes a successful transfer session finishes and jobs are queued to be reported, but the drive is only released after restarting another tape server, which then does the reporting job. All tape servers have been configured to report.
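
One way we look for those leftover report queues (a sketch: the object-name prefix is an assumption based on our own objectstore dumps and may differ on other versions):

cta-objectstore-list | grep -c ArchiveQueueToReport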

We have only observed this recently, as we've just moved XFEL to CTA, so perhaps it is down to too many requests, or to how dCache interacts with CTA. Restarting the dCache frontend has also been seen to release the drives, after which the queued reports get reported. Tigran has opened an issue on the same.

Hi @Mwai,
which objectstore is backing your CTA instance?

Hi @jleduc, we use an NFS mount.