Understanding CleanUp/STALE drive states

george_patargias · 2 August 2024 11:21

Hi,

I would like to understand a bit more about what is happening when a CTA drive enters the “CleanUp” state. Obviously, some kind of clean-up that follows the archival/retrieval is involved, but what exactly is it?

More importantly, can you please give us some clues as to why a drive can get stuck in the “CleanUp” state? The (session?) age becomes a very large number and the result is that the drive becomes STALE.

Is this indicative of a problem with the tape library? I was wondering if we hit “taped RmcNetTimeout” - at the moment it has the compile time default, 600s. For the record, we have been running with “taped RmcRequestAttempts 30” for quite some time now.

Many thanks,

George

poliverc · 5 August 2024 08:18

Hello George,

as you say, after a transfer session (either archive or retrieve) a cleaner session is run. In summary, this ‘cleaner’ is responsible for rewinding the tape, unmounting it, and logging tape alerts.

What does a very large number mean? And, where are you seeing this number?

I assume the ‘STALE’ state you are referring to comes from the cta-admin output? It is something specific generated by cta-admin: when a drive takes longer to update its status than the configured drive timeout, the [STALE] string gets appended to the age of the session. This STALE string does not show in the JSON output of the command.
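To illustrate the rule described above, here is a minimal sketch in Python. It is not the cta-admin implementation: the function name, the argument names and the 600-second threshold are all hypothetical, chosen only to show how an age field could pick up the [STALE] marker.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper, not taken from the cta-admin source.
def format_age(last_update, now, stale_after_secs):
    """Return seconds since the last status update, marked [STALE] if too old."""
    age = int((now - last_update).total_seconds())
    if age > stale_after_secs:
        return f"{age} [STALE]"
    return str(age)

# Fixed reference time so the example is reproducible.
now = datetime(2024, 8, 5, 12, 0, 0, tzinfo=timezone.utc)

# 600 is an assumed threshold for illustration, not a documented default.
print(format_age(now - timedelta(seconds=1736), now, 600))
print(format_age(now - timedelta(seconds=30), now, 600))
```

The point of the sketch is that the marker depends only on how long ago the drive last updated its status, not on which state the drive is in.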

As for the drive being ‘stuck in the cleanup state’: there is a default timeout of 300 seconds, which comes from the tapeLoadTimeout option of the taped configuration file. So, if there is a problem with the drive, it will take tapeLoadTimeout seconds for the cleaner session to fail, and in the meantime it may look like it is stuck.

For the RMC net timeout and request attempts, we use the compile-time defaults in production, 600 and 10 respectively.

Cheers,
Pablo.

george_patargias · 5 August 2024 10:54

Hi Pablo,

Thanks for your reply.

The large number I mentioned (and also the STALE state) can be seen in the output of the “cta-admin dr ls” command. For example,

obelix_ts obelix_ts1160_06 getafix-ts04 Up - CleanUp 1736 - - - - - - 18446744073709551615 0 - 1736 [STALE] -

We have set TapeLoadTimeout to quite a high value (2400), and likewise the WatchdogMountMaxSecs and WatchdogUnmountMaxSecs options.

I asked about RmcRequestAttempts and RmcNetTimeout first to see what values you use (which you mentioned, thanks!), but also to find out whether the second of these, RmcNetTimeout, has any relevance to the issues we are seeing.

Best,

George

poliverc · 5 August 2024 12:40

Hi,

The high number you see is not the session age. It is the session ID number. There is a bug report on this: “Fix wrong session ID 18446744073709551615 (overflow)” (#695) in the cta / CTA issues on GitLab. It will be fixed at some point but it is not a priority. So, those values have nothing to do with the issue.
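For readers wondering where that exact number comes from: 18446744073709551615 is the largest value an unsigned 64-bit integer can hold. The short Python check below shows this, and shows that storing -1 in such a field wraps to the same value. Whether an unset or -1 session ID is the actual cause in CTA is an assumption here; the bug report is the place to confirm it.

```python
# Largest value representable in an unsigned 64-bit integer.
UINT64_MAX = 2**64 - 1
print(UINT64_MAX == 18446744073709551615)

# Reducing -1 modulo 2**64 mimics what happens when -1 is stored
# in an unsigned 64-bit field.
wrapped = -1 % 2**64
print(wrapped == UINT64_MAX)
```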

Let me know if you have any other questions related to this.

Sorry for the inconvenience.

george_patargias · 5 August 2024 13:00

Thanks for clarifying.

Is RmcNetTimeout (set to 600) at all relevant to the issue?

George

poliverc · 6 August 2024 06:39

That timeout should not be relevant to the issue.

If that timeout is being reached you should be able to see something in the rmcd logs.

george_patargias · 3 September 2024 08:52

Thanks for the confirmation, Pablo.

george_patargias · 5 February 2025 10:07

Hi Pablo,

Can you please explain the meaning of the following cta-taped directive?

# Maximum time allowed to determine a drive is ready, in seconds.
taped WatchdogCheckMaxSecs 240

Would increasing this value be of any help with the drives getting stuck in the CleanUp state and (often) becoming STALE?

For reference, we have been operating for a while with the following increased timeouts to deal with a tape library being slow at times.

# Maximum time allowed to mount a tape, in seconds.
taped WatchdogMountMaxSecs 2400
#
# Maximum time allowed to unmount a tape, in seconds.
taped WatchdogUnmountMaxSecs 2400

# Maximum time to wait for a tape to load, in seconds.
taped TapeLoadTimeout 2400

Thanks,

George

poliverc · 17 February 2025 08:57

Hi George,

This set of timeouts is used to determine how long the daemon should wait for the operations mentioned, and if the operation takes longer, to kill the session.

The STALE concept is an addition to the output of the cta-admin CLI (which can also be configured), but it is just a note to inform the operator that the drive has not updated its status in the catalogue for the configured amount of time.

What you should check is the reason why the drive is not reporting its state to the catalogue, e.g. the daemon has been shut down and not restarted.

Regards,
Pablo.

george_patargias · 27 February 2025 11:55

Hi Pablo,

Thanks for the reply. The reason why we think that the drive is not reporting its state is that the drive has not become available within the 1200 seconds (or whatever the timeout value is).

Do you see anything obviously wrong with the following values?

# The number of attempts a retriable RMC request should be issued.
taped RmcRequestAttempts 30

# Maximum time to wait for a tape to load, in seconds
taped TapeLoadTimeout 1200

# Maximum time allowed to determine a drive is ready, in seconds.
taped WatchdogCheckMaxSecs 240

# Maximum time allowed to schedule a single mount, in seconds.
taped WatchdogScheduleMaxSecs 600

# Maximum time allowed to mount a tape, in seconds.
taped WatchdogMountMaxSecs 1200

# Maximum time allowed to unmount a tape, in seconds.
taped WatchdogUnmountMaxSecs 1200

# Maximum time allowed to shutdown of a tape session, in seconds.
taped WatchdogShutdownMaxSecs 1500
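The directives above follow a `taped <Option> <value>` layout with `#` comment lines. As a minimal sketch, assuming only that layout, the Python below collects such lines into a dictionary so the values can be compared between tape servers. The function name is hypothetical and this is not how cta-taped itself reads its configuration.

```python
# Hypothetical helper for inspecting a config file; not part of CTA.
def parse_taped_options(text):
    """Collect `taped <Option> <value>` lines into a dict, skipping comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Only accept lines made of exactly: taped, option name, value.
        if len(parts) == 3 and parts[0] == "taped":
            options[parts[1]] = parts[2]
    return options

sample = """
# Maximum time to wait for a tape to load, in seconds
taped TapeLoadTimeout 1200

# Maximum time allowed to determine a drive is ready, in seconds.
taped WatchdogCheckMaxSecs 240
"""

opts = parse_taped_options(sample)
for name, value in sorted(opts.items()):
    print(f"{name} = {value}")
```

Values are kept as strings on purpose, since nothing here assumes every option is numeric.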

George

poliverc · 28 February 2025 11:18

The values look ok. If the drive is not reporting its state it is because something has gone wrong.

Is the daemon running? What do you see in the tapeserver logs?

Regards,
Pablo.