Garbage collection on EOS space

Hello,

Just a last check on the GC setup on the default and retrieve spaces.

Is “filearchivedgc” turned on only on the default (ingest) space?

Based on the notes from the RAL-CERN meeting (https://codimd.web.cern.ch/MPE_HYmBSMCWpfzHpz4FsA#), I have the following comments regarding the space.tgc.XXX flags.

  • space.tgc.availbytes - should be set high on default space to keep buffer empty? What about retrieve?
  • space.tgc.qryperiodsecs - needs to be [FST publish.interval+5s] and EOS_FST_DELETE_QUERY_INTERVAL*2, any reason to change from that?
  • space.tgc.totalbytes - should be set close to total capacity of the space (but lower to allow TGC to run in case of HW failures?)

Are these comments valid also for the retrieve space?

We don’t have EOS_FST_DELETE_QUERY_INTERVAL defined in /etc/sysconfig/eos_env. What is a reasonable value for this option?

Also, I can’t see that we have set publish.interval for any of our EOS nodes. How do we set it and to what value?

Many thanks,

George

Hi George,

I’ll write the conclusion here so others in the community can see it. Julien says the following:

… all retrieve space GC should be performed by the fst-gcd and eviction. This allows us to monitor useful space in retrieve. Configuring the MGM GCD would mean operating at full space, and then you have no idea if something is wrong when looking at retrieve space occupancy. Operationally this is really important.

So, your current configurations for the MGM garbage collector are correct. They are the default configuration, which basically switches off MGM garbage collection for each of the EOS spaces it controls.

Agreeing with Julien, you now need to make sure your FST garbage collectors are configured. Their role here is to act as a safety net in the event that a file is not automatically evicted from disk via FTS. You should be configuring your FSTs like so:

[itctabuild02] ~ > ssh root@eosctafst0110.cern.ch cat /etc/cta/cta-fst-gcd.conf
###
# Puppet generated
###

# There must always be a main section
[main]

log_file = /var/log/eos/cta-fst-gcd/cta-fst-gcd.log ; Path of garbage collector log file
mgm_host = eosctaatlas.cern.ch ; Fully qualified host name of EOS MGM
eos_spaces = retrieve ; Space separated list of the names of the EOS spaces to be garbage collected
eos_space_to_min_free_bytes = retrieve:100000000000 ; Minimum number of free bytes a filesystem should have
gc_age_secs = 14400 ; Age at which a file can be considered for garbage collection
absolute_max_age_secs = 86400 ; Age at which a file will be considered for garbage collection no matter the amount of free space
query_period_secs = 310 ; Delay in seconds between free space queries to the local file systems
main_loop_period_secs = 300 ; Period in seconds of the main loop of the cta-fst-gcd daemon
xrdsecssskt = /etc/eos.keytab ; Path to simple shared secret to authenticate with EOS MGM
[itctabuild02] ~ >
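
To make the roles of these settings concrete, here is a minimal Python sketch that parses a trimmed copy of the [main] section above and applies a decision rule inferred purely from the inline comments. The helper names load_settings and is_gc_candidate are hypothetical, and the rule itself is an assumption about how the thresholds combine, not the actual cta-fst-gcd implementation.

```python
import configparser

# Trimmed copy of the [main] section shown above (values unchanged).
CONF_TEXT = """
[main]
eos_spaces = retrieve ; Space separated list of the names of the EOS spaces to be garbage collected
eos_space_to_min_free_bytes = retrieve:100000000000 ; Minimum number of free bytes a filesystem should have
gc_age_secs = 14400 ; Age at which a file can be considered for garbage collection
absolute_max_age_secs = 86400 ; Age at which a file will be considered no matter the free space
"""


def load_settings(text):
    """Parse the config text into plain Python values (hypothetical helper)."""
    # ';' starts an inline comment in this file, so configparser must be told.
    parser = configparser.ConfigParser(inline_comment_prefixes=(";",))
    parser.read_string(text)
    main = parser["main"]
    min_free = {}
    for pair in main["eos_space_to_min_free_bytes"].split():
        space, value = pair.split(":")
        min_free[space] = int(value)
    return {
        "spaces": main["eos_spaces"].split(),
        "min_free_bytes": min_free,
        "gc_age_secs": main.getint("gc_age_secs"),
        "absolute_max_age_secs": main.getint("absolute_max_age_secs"),
    }


def is_gc_candidate(settings, space, free_bytes, file_age_secs):
    """Assumed rule, inferred from the config comments only."""
    if space not in settings["spaces"]:
        return False  # this space is not handled by the FST GC at all
    if file_age_secs >= settings["absolute_max_age_secs"]:
        return True  # old enough to go regardless of free space
    low_on_space = free_bytes < settings["min_free_bytes"].get(space, 0)
    return low_on_space and file_age_secs >= settings["gc_age_secs"]
```

Under this reading, a file in the retrieve space that is older than 4 hours becomes a candidate only once the filesystem has less than 100 GB free, while anything older than 24 hours is a candidate regardless.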

Cheers,

Steve

Thanks a lot for this Steve.

I will leave the tgc values alone and configure the FST GC for both spaces as per your examples.

One more thing. How exactly are we going to ensure that the FST will facilitate the eviction of files from the buffer? From this ticket, https://its.cern.ch/jira/browse/FTSCFG-16, which recently came to our attention, it looks like we need to register our CTA instance with the FTS ActiveMQ topics, a service running (only) at CERN. Is this the only way to set up FST eviction?

Best,

George

Hi George,

Your question: Is “filearchivedgc” turned on only on the default (ingest) space?
Answer: Yes.

Your question: Are these comments valid also for the retrieve space?
Answer: Yes; however, space.tgc.availbytes should be set to 0. Setting it to 0 effectively disables the MGM garbage collector, which is what we have now agreed with Julien Leduc. You will only be using the FST garbage collectors at RAL.
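
One way to picture why a threshold of 0 switches the collector off is the Python sketch below. It assumes, as a simplification, that the MGM garbage collector only acts when a space's available bytes fall below space.tgc.availbytes; it ignores space.tgc.totalbytes and every other setting, and the function name is hypothetical.

```python
def mgm_tgc_would_collect(space_avail_bytes, tgc_availbytes):
    """Assumed, simplified trigger: collect only when the space has
    fewer available bytes than the configured threshold."""
    return space_avail_bytes < tgc_availbytes


# With the threshold at 0, no non-negative amount of available space can
# be below it, so under this assumption the collector never triggers.
for avail in (0, 10**9, 10**12):
    print(avail, mgm_tgc_would_collect(avail, 0))
```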

Your question: We don’t have EOS_FST_DELETE_QUERY_INTERVAL defined in the /etc/sysconfig/eos_env. What is a reasonable value for this option?
Answer: Please set it to 30 seconds, for example:

[root@eosctafst0217 ~]# grep EOS_FST_DELETE_QUERY_INTERVAL /etc/sysconfig/eos_env
EOS_FST_DELETE_QUERY_INTERVAL="30"
[root@eosctafst0217 ~]# 

Your question: Also, I can’t see that we have set publish.interval for any of our EOS nodes. How do we set it and to what value?
Answer: You cannot see publish.interval unless you set it. This is a feature of EOS. publish.interval does not need to be set, so I would personally leave it. However, if you do decide to explicitly set it to its default value of 10 seconds, then please execute the following:

eos node config NODE publish.interval=10
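
For anyone who wants to sanity-check space.tgc.qryperiodsecs against these two values, the rule George quoted from the meeting notes can be sketched in Python as below. This reads the rule as "at least as large as both terms", which is an interpretation rather than something confirmed in this thread, and the function name is hypothetical.

```python
def min_qryperiodsecs(publish_interval_secs, fst_delete_query_interval_secs):
    """Assumed lower bound: the larger of (publish.interval + 5 s) and
    (EOS_FST_DELETE_QUERY_INTERVAL * 2)."""
    return max(publish_interval_secs + 5, fst_delete_query_interval_secs * 2)


# publish.interval = 10 s and EOS_FST_DELETE_QUERY_INTERVAL = 30 s,
# the two values given in this reply.
print(min_qryperiodsecs(10, 30))
```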

Cheers,

Steve

Hi George,

Your question: How exactly are we going to ensure that the FST will facilitate the eviction of files from the buffer?
Answer: This is Julien’s area of expertise. If I understand correctly, the current solution is indeed to use FTS ActiveMQ topics.

Do you have a problem with using FTS ActiveMQ topics at RAL?

Cheers,

Steve

Hi George,

I have just spoken with Julien via Mattermost and the conclusion is that if RAL only use the xrootd protocol with CTA then FTS will be able to automatically evict files from the retrieve space without the need for using FTS ActiveMQ topics.

Does this help RAL?

Cheers,

Steve

Hi Steve,

Thanks, I think it helps but I will discuss more with the other team members.

Does the automatic FST-mediated file eviction also apply to the default space (ingest buffer)?

Best,

George

Hi George,

The default EOS space should have the “File archived” garbage collector switched on. Specifically the following command should have been executed:

eos space config default space.filearchivedgc=on

With the “File archived” garbage collector on, there is no need for FTS to evict files from the default space.

Regards,

Steve

Hi Steve,

Thanks a lot for this.

We need to understand a bit more how EOS/CTA and FTS interact in the context of garbage collection.

You mentioned that the EOS/CTA instances at CERN rely on FTS ActiveMQ topics to facilitate garbage collection in the retrieve space. We would have a problem using FTS ActiveMQ topics at RAL if this were the only means of achieving garbage collection: it is not ideal to rely exclusively on the availability of a remote service for such a vital task, namely freeing up disk space in a buffer of limited size that is part of a production system at RAL.

However, since the safety net of garbage collection at the FST level (cta-fst-gcd) is in place, we could (conceivably) use FTS ActiveMQ topics. You mentioned that “if RAL only use the xrootd protocol with CTA then FTS will be able to automatically evict files from the retrieve space without the need for using FTS ActiveMQ topics”.

Given that RAL will not only use the xrootd protocol with CTA but also, to an increasing extent, the http protocol (gridftp will be phased out this year), does this mean that RAL will have to use FTS ActiveMQ topics after all?

Best,

George

Hi George,

Support for buffer eviction via HTTP will come, but this is a WLCG-level development with a timescale of around 1 year (my own estimate). RAL has to decide whether it wants to support HTTP in the meantime, with some other eviction mechanism. Remember, however, that right now there is no HTTP support for staging either, so what you’re proposing to offer is xrootd staging followed by HTTP transfer. Is that really a use case for you?

Oliver.

Hi Oliver,

We are only trying to understand whether or not there is a requirement to use FTS ActiveMQ topics, depending on which protocols will be used with CTA (Julien mentioned above that if xrootd is the only protocol to be used with CTA then the answer is no).

I think I now understand that “the only protocol” means to exclude things like gridFTP and S3 and not HTTP which runs on top of xrootd.

George

Hi George,

If you want to support HTTP, you need a way to evict each file from the buffer after it is transferred out. You have the following options:

  1. Wait until there is an API that allows FTS to do this for you
  2. Use FTS ActiveMQ topics
  3. Find another solution of some kind

Cheers,

Oliver.