Tape aware GC on FST stopped working

Hi,

We noticed that the tape-aware GC running on EOS FST (retrieve space) nodes (cta-fst-gcd) stopped working after the upgrade to EOS 5.2.23 and CTA 5.10.10.1: we can see “stagerrm” lines in the log up until the date of the upgrade, but not after that.

Initially, I thought that this happened because we had not provided

xrdsecssskt=/etc/cta/cta-taped.sss.keytab

in /etc/cta/cta-fst-gcd.conf
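For context, the relevant lines of /etc/cta/cta-fst-gcd.conf are along these lines (a sketch only: the values shown are placeholders rather than our actual settings, and only the parameters discussed in this thread are listed):

[main]
gc_age_secs = 7200
absolute_max_age_secs = 604800
xrdsecssskt = /etc/cta/cta-taped.sss.keytab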

But there was still no GC activity even after providing cta-taped.sss.keytab, and this is probably not too surprising, as the garbage collector was working even without an SSS key (which I don’t quite understand…).

So, on all retrieve EOS nodes, we have disk replicas that date as far back as 20/06 and which for some reason are “invisible” to the FST GC.
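One quick way to list old replicas directly on an FST (a hedged sketch; /eos/data-sdb is just one of the mount points shown in the logs below, and the 7-day cutoff is arbitrary) is:

find /eos/data-sdb -type f -mtime +7 | head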

Do you have any suggestions?

Thanks,

George

Hello George,

the cta-fst-gcd is a simple Python script that runs as a separate service on the FSTs. The provided systemd service file is, by default, set not to restart the service. Can you check that the service is running? You should also be able to see the process.
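A quick way to check both (assuming the service is installed under the name cta-fst-gcd, as in this thread):

systemctl status cta-fst-gcd
ps aux | grep '[c]ta-fst-gcd'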

If the script is running, could you provide some logs?

Cheers, Pablo.

Hi Pablo,

Thanks for the reply.

The service is indeed running, and the only logging we see is the following. The file systems appear to be scanned, but the files they hold seem “invisible” to the garbage collector: for some reason, the script does not seem to see the available FS space or the age of the files.

2024/06/26 00:02:11.711000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Number of local file systems is 15"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 0: path=/eos/data-sdb eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 1: path=/eos/data-sdc eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 2: path=/eos/data-sdd eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 3: path=/eos/data-sde eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 4: path=/eos/data-sdf eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 5: path=/eos/data-sdg eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 6: path=/eos/data-sdh eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 7: path=/eos/data-sdi eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 8: path=/eos/data-sdj eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 9: path=/eos/data-sdk eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 10: path=/eos/data-sdl eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 11: path=/eos/data-sdm eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 12: path=/eos/data-sdn eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 13: path=/eos/data-sdo eos_space=retrieve"
2024/06/26 00:02:11.712000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Local file system 14: path=/eos/data-sdp eos_space=retrieve"
2024/06/26 00:07:11.745000 antares-eos08.scd.rl.ac.uk INFO cta-fst-gcd:LVL="INFO" PID="2620973" TID="2620973" MSG="Number of local file systems is 15"

If you know the name or fxid of one of the files that is not being removed from the FSTs, could you check what the MGM reports about that file with the eos fileinfo command (fileinfo — EOS CITRINE documentation)? And could you also look at the creation and modification time of that file directly under /eos/data-sdX/… ?
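For example (the fxid below is a placeholder, and the on-disk path under the mount point depends on your layout):

eos fileinfo fxid:0001a2b3
stat /eos/data-sdX/<subdir>/<fxid>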

The logs look OK. What are the currently configured limits to trigger the garbage collection?

Hi Pablo,

Thanks for this. Indeed, by checking the creation time (ctime) of a few “old” disk replicas, we realised that it was being updated for some reason. After a bit of investigation, we established that the reason the ctime was updated was the scanning that EOS runs in the background to check file integrity (space.scanrate and space.scaninterval were not set to zero).

After every scan, at least one piece of the replica’s metadata was changed (user.eos.timestamp). With EOS 5.2, all the metadata that used to be stored in the FST LevelDB is now kept as the replica’s extended attributes, and any change to these also updates the file’s ctime.
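This is easy to see directly on an FST (a hedged illustration; the path is a placeholder for an actual replica):

getfattr -d -m 'user.eos\.' /eos/data-sdb/<subdir>/<fxid>
stat -c 'ctime: %z' /eos/data-sdb/<subdir>/<fxid>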

We have now disabled EOS file scanning for the default and the retrieve space, and we hope to see stagerrm events in the cta-fst-gcd logs soon (after gc_age_secs).
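Something like this should show them once they start appearing (the log path below is only a placeholder; substitute the log_file configured for cta-fst-gcd):

grep stagerrm /var/log/eos/fst/cta-fst-gcd.log | tail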

George

Hi again,

Even after disabling the EOS FS scan, the ScanDir thread still updated the ctime of files in the retrieve space. For the time being, we have worked around this by increasing scan_disk_interval to a value far larger than either gc_age_secs or absolute_max_age_secs, i.e. 7 days.
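For reference, 7 days is 604800 seconds; a per-filesystem setting along these lines (sketched here on the assumption that scan_disk_interval can be set via eos fs config, in the same way as scaninterval further down in this thread) would be:

eos fs config <fsid> scan_disk_interval=604800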

We have opened a thread in the EOS forum about FS scanning: Disable EOS FS scanning - #4 by tombyrne - EOS Community

Probably a better solution would be for the CTA FST garbage collection to look at the mtime instead of the ctime. Is this a change you could consider?

Thanks,

George

Hello George,

we don’t use the EOS FS scan, as it can hinder the performance of the buffers. Disabling it (scaninterval 0 and scanrate 0) stops it from running, which solves the problem. What changes did you apply to disable the EOS FS scan?

We have a presentation from 2023 on how we set up the EOS instances (slide 5): https://indico.cern.ch/event/1103358/contributions/4760648/attachments/2404348/4112646/220309-EOSWS_Running%20an%20EOS%20instance%20with%20tape%20on%20the%20back.pdf

Cheers,
Pablo.

Hi Pablo,

Yes, we did what Julien suggested in his presentation:

[root@cta-adm ~]# eos space config retrieve space.scaninterval=0
success: setting scaninterval=0
[root@cta-adm ~]# eos space config retrieve space.scanrate=0
success: setting scanrate=0

EOS developers (Elvin) told us that the above commands will only affect filesystems added to the retrieve space in the future. To affect the existing ones, we were advised to run

eos space config retrieve fs.scaninterval=0

we also tried

eos space config retrieve fs.scanrate=0

which, it seems, was not enough…

George

Hi,

could you also try changing the configuration of the individual filesystems? (fs — EOS CITRINE documentation)

First check the values of the fs entries with eos fs status <fsid>; you can get the fsid from the eos fs ls command. scaninterval and scanrate should be 0. If they are not, try the following commands:

eos fs config <fsid> scaninterval=0
eos fs config <fsid> scanrate=0
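If that works, a small loop can apply it to every filesystem in the retrieve space (a sketch only; it assumes the monitoring output of eos fs ls contains id=<fsid> tokens, so double-check the output format of your EOS version first):

for fsid in $(eos fs ls -m retrieve | tr ' ' '\n' | grep '^id=' | cut -d= -f2); do
  eos fs config "$fsid" scaninterval=0
  eos fs config "$fsid" scanrate=0
done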

If this does not fix the problem, could you provide the output of the eos fs ls command?

Cheers.