Garbage collector stuck in loop

ewv · 16 May 2025 14:30

We had some issues with a library yesterday. We now seem to be in a state where the garbage collector has taken over in the taped processes. We see a lot of messages like this in the logs:

"message":"In GarbageCollector::cleanupDeadAgent(): agent already deleted when trying to lock it. Skipping it."

However, there is plenty of work in the queue, plenty of tapes to write to, etc. But no work is getting picked up.

What can we do?

nbugel · 16 May 2025 15:59

Hi Eric,

As a first step to get the tapeservers to start working, you could disable the maintenance process on some of them. Make sure to leave at least a few maintenance processes alive or the system will grind to a halt.

Our recommendation would be to leave a couple of maintenance processes running to see if they can clean up over the weekend and reexamine the situation on Monday. If the objectstore does not return to a normal state, the nuclear option would be to wipe it, but this is of course not preferable.

It might be useful to take a look at the tapeserver logs to try and figure out how the objectstore got in this state. The various command-line tools provided by cta-objectstore-tools can help with this.

Cheers,
Niels

jleduc · 16 May 2025 16:47

Hi Eric,
If nothing is being processed, you should evaluate the impact/benefit of wiping the objectstore ASAP.

Once it is wiped, new requests coming to your production instance will be unblocked: this is what the operations side tends to go for.

Then you need to define a strategy to re-inject the requests lost in the deleted objectstore, starting with the archive side: wait a bit, then re-inject the files that are not on tape and whose creation timestamp means they were in the deleted objectstore.
This can be easier for FTS use cases, as these should usually overwrite files that are not on tape after 24h.
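
A minimal sketch of that selection step, in Python. It assumes you can export a list of files from the disk buffer with a creation time and an "on tape" flag; the `FileRecord` type, its field names, and the `lookback` window are hypothetical and not part of any CTA or EOS tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    # Hypothetical record; fill it from whatever listing your disk system provides.
    path: str
    created: datetime
    on_tape: bool


def select_for_reinjection(files, wipe_time, lookback):
    """Return files that are not on tape and were created in the
    window [wipe_time - lookback, wipe_time], i.e. files whose archive
    requests were most likely in the objectstore when it was wiped."""
    window_start = wipe_time - lookback
    return [
        f for f in files
        if not f.on_tape and window_start <= f.created <= wipe_time
    ]
```

The `lookback` value is a judgment call: it should cover the period during which requests were queued but not processed, which the tapeserver logs can help estimate.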

Best regards,
Julien Leduc

ewv · 21 May 2025 18:54

Turns out a library problem caused all the writable tapes to become disabled.