Understanding CleanUp/STALE drive states

Hi,

I would like to understand a bit more about what happens when a CTA drive enters the “CleanUp” state. Obviously, some kind of clean-up following the archival/retrieval is involved, but what exactly does it do?

More importantly, can you please give us some clues as to why a drive can get stuck in the “CleanUp” state? The (session?) age becomes a very large number and, as a result, the drive becomes STALE.

Is this indicative of a problem with the tape library? I was wondering whether we are hitting “taped RmcNetTimeout”, which at the moment has the compile-time default of 600 s. For the record, we have been running with “taped RmcRequestAttempts 30” for quite some time now.

Many thanks,

George

Hello George,

As you say, after a transfer session (either archive or retrieve) a cleaner session is run. In summary, this ‘cleaner’ is responsible for rewinding the tape, unmounting it, and logging tape alerts.

What does a very large number mean? And, where are you seeing this number?

I assume the ‘STALE’ state you are referring to comes from the cta-admin output? It is generated by cta-admin itself: when a drive takes longer to update its status than the configured drive timeout, the [STALE] string gets appended to the age of the session. This STALE string does not show in the JSON output of the command.
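To illustrate the rule described above, here is a minimal sketch of how such a marker could be produced. This is not the cta-admin source code: the function name `format_age`, its parameters, and the 600-second timeout used in the example are all hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of the [STALE] rule; not taken from cta-admin.
def format_age(seconds_since_update: int, drive_timeout: int) -> str:
    """Return the age column, tagged [STALE] if the drive has not
    updated its status within drive_timeout seconds."""
    if seconds_since_update > drive_timeout:
        return f"{seconds_since_update} [STALE]"
    return str(seconds_since_update)


# Assumed timeout of 600 s (illustrative value, not a known CTA default).
print(format_age(1736, 600))
print(format_age(30, 600))
```

With these assumed values, an age of 1736 seconds exceeds the timeout and gets the marker, while an age of 30 seconds is printed as a plain number.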

As for being ‘stuck in the cleanup state’: there is a default timeout of 300 seconds, which comes from tapeLoadTimeout in the taped configuration file. So, if there is a problem with the drive, the cleaner session will take tapeLoadTimeout seconds to fail, and in the meantime it may look like it is stuck.

For the RMC net timeout and request attempts, we use the compile-time defaults in production: 600 seconds and 10 attempts respectively.

Cheers,
Pablo.

Hi Pablo,

Thanks for your reply.

The large number I mentioned (and also the STALE state) can be seen in the output of the “cta-admin dr ls” command. For example,

obelix_ts obelix_ts1160_06 getafix-ts04 Up - CleanUp 1736 - - - - - - 18446744073709551615 0 - 1736 [STALE] -

We have set TapeLoadTimeout to a quite high value (2400). Likewise with WatchdogMountMaxSecs and WatchdogUnmountMaxSecs options.

I asked about RmcRequestAttempts and RmcNetTimeout first to see which values you use (which you mentioned, thanks!), but also to find out whether the second of these, RmcNetTimeout, has any relevance to the issues we are seeing.

Best,

George

Hi,

The high number you see is not the session age. It is the session ID number. There is a bug report on this: “Fix wrong session ID 18446744073709551615 (overflow)” (issue #695 in the cta / CTA project on GitLab). It will be fixed at some point, but it is not a priority. So, those values have nothing to do with the issue.
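As a side note on why that particular number looks suspicious: it is exactly the largest value an unsigned 64-bit integer can hold, which is also what -1 becomes when reinterpreted as unsigned 64-bit. The short check below only demonstrates that arithmetic; it makes no claim about which part of CTA produces the value.

```python
# The session ID shown in the "cta-admin dr ls" output above.
session_id = 18446744073709551615

# Largest value representable in an unsigned 64-bit integer.
UINT64_MAX = 2**64 - 1

# -1 masked to 64 bits, i.e. how -1 looks when read as unsigned.
minus_one_as_uint64 = -1 & 0xFFFFFFFFFFFFFFFF

print(session_id == UINT64_MAX)
print(session_id == minus_one_as_uint64)
```

Both comparisons print True, so the value is a wrapped or sentinel number rather than a real counter.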

Let me know if you have any other questions related to this.

Sorry for the inconvenience.

Thanks for clarifying.

Is RmcNetTimeout (set to 600) at all relevant to the issue?

George

That timeout should not be relevant to the issue.

If that timeout is being reached you should be able to see something in the rmcd logs.

Thanks for the confirmation, Pablo.