EOS node performance bottleneck

Hello,

We are carrying out performance tests on one of our new EOS nodes (EOS 5.3.10-1 installed) with 8 x SSDs and a 100Gb card, and we are persistently hitting a bottleneck for write-only workloads (~70Gb/s), which also affects reads in mixed I/O workloads.

All NIC tuning has been done and we have established that it is the kernel page cache that somehow gets in the way. Given that the page cache generally improves storage node performance, there must be something wrong with the way it is configured on our node.

My question to you: have you modified any page cache kernel settings on the EOS nodes that are used as the CTA buffer? For example, do you use write-through (the default, I think) or write-back caching? Have you changed any other parameter in /sys/block/<dev>/queue/?

Thanks,

George

Dear George,

I did not have to tweak anything for the current Run-3 generation of hardware; I am getting 25Gb/s write + 25Gb/s read, i.e. 50Gb/s worth of I/O, with machines that are really old by today’s standards.

Before going to the EOS layer, did you try to run a series of microbenchmarks to qualify your new hardware? I usually spend quite some time dd-writing or reading large byte streams on the raw device (flushing the cache between microbenchmarks) to qualify HDDs/SSDs/NVMes and to understand potential bottlenecks at the disk/flash or HBA level when multiple storage subsystems are involved. Doing the same at the FS level should then give you performance similar to EOS long byte streams.
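A minimal sketch of such a qualification pass, assuming a raw NVMe device (the device name, block size and counts below are placeholders, and writing to the raw device destroys its contents):

```shell
# WARNING: writes directly to the raw device and destroys its contents.
# DEV, block size and count are placeholders -- adapt to your hardware.
DEV=/dev/nvme0n1

# Sequential write of a large byte stream, bypassing the page cache
dd if=/dev/zero of="$DEV" bs=1M count=100000 oflag=direct status=progress

# Drop the page cache between microbenchmarks so reads hit the media
sync && echo 3 > /proc/sys/vm/drop_caches

# Sequential read-back
dd if="$DEV" of=/dev/null bs=1M count=100000 iflag=direct status=progress
```

Repeating the same pattern against a large file on the mounted filesystem (instead of the raw device) then shows what the filesystem layer itself costs.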

Changing cache settings is not the way to reproduce production performance, as caching in memory would not be enough for anything close to production: files stay on disk servers for 20-30 minutes, and this would not fit in any kernel cache layer.

Do you have some dd flushed-to-storage numbers/graphs? Like writing 1TB per NVMe in parallel to 1, 2, 3, … systems at once, then reading back, and so on?

Best regards,

Julien Leduc

Hi Julien,

Thanks for the reply.

Before going to the EOS layer, we did run some standard FIO streaming tests on the node’s SSDs, and we got the advertised performance for a single drive and nice linear scaling for multiple drives. It is only after we ran into the performance issues with EOS that we re-ran FIO, now using buffered (instead of direct) I/O, and we confirmed that the bottleneck is indeed related to the kernel page cache.
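For reference, a pair of hypothetical fio invocations contrasting the two modes (the file path, sizes and job counts are made up for illustration):

```shell
# Direct I/O: bypasses the page cache (matches the initial qualification runs)
fio --name=seqwrite-direct --filename=/srv/ssd1/fio.test --rw=write \
    --bs=1M --size=50G --numjobs=4 --direct=1 --group_reporting

# Buffered I/O: goes through the page cache and reproduces the bottleneck;
# end_fsync=1 forces dirty pages to storage so throughput is not overstated
fio --name=seqwrite-buffered --filename=/srv/ssd1/fio.test --rw=write \
    --bs=1M --size=50G --numjobs=4 --direct=0 --end_fsync=1 --group_reporting
```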

For the EOS testing we ran a series of increasingly parallel XRootD transfers - 2, 4, 8, 16, 32, 64 and 128 - from two different client machines that are on the same network and also have 100Gb NICs.

Pasting some Grafana screenshots from representative tests:

  • Write only tests

Please note the performance degradation after the peak with 32 concurrent transfers

  • Write/read test

  • Write/read test with the total number of EOS filesystems doubled (i.e. 2 XFS partitions per SSD) to see if this improves performance (it doesn’t)

Thanks for the suggestion to run microbenchmarks (what are these, anyway?). I will discuss with the rest of the team and get back to you.

Best

George

Before getting into local microbenchmarks, let me ping the EOS devops team, as they have some 100Gb machines on their side (unlike the CTA service) and may have some valuable hints for EOS tweaking.

Julien

Hi @george_patargias ,

You may try direct I/O with EOS if you wish.
One can activate direct I/O by adding the following extended attribute on the destination directory:

eos attr set sys.forced.iotype:w=direct /eos/path/to/dest/dir

If you want it at the space level (better if you want to apply it for every transfer), you can set it like eos space config <spacename> space.policy.iotype=w=direct (4.5. Interfaces — EOS DIOPSIDE documentation).

Please tell us if that improves the performance!

Otherwise, could we organize a Zoom call for live debugging and testing?

Cheers,
Cedric from the EOS team

You can try direct I/O in EOS to get better throughput. The buffer cache becomes a bottleneck around 8-10 GB/s once it is full. I have seen the same in benchmarking on 100Gb hardware.

Here is the documentation:
```
# configure default FST iotype
eos space config default space.policy.iotype:w=direct
```

Magic, thank you so much Cedric and Andreas!

We will give this a try and get back to you.


Whoops, I just saw there’s a typo in my reply: eos space config <spacename> space.policy.iotype:w=direct.

Thanks again Cedric. We ran a couple of performance tests after setting eos space config default space.policy.iotype:w=direct.

  • Write only test

  • Read/write test

The performance certainly looks better(!) but I am a bit troubled by the large network I/O fluctuations observed with an increased number of parallel transfers. Is this something to be expected?

Can it be that the following kernel settings are incompatible with the direct I/O EOS space policy?

vm.dirty_background_ratio = 20
vm.dirty_ratio = 80
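For context, these are the two writeback knobs in question; they can be read back from /proc/sys (the values are percentages of RAM):

```shell
# dirty_background_ratio: % of RAM at which background flusher threads
# start writing dirty pages back to disk
cat /proc/sys/vm/dirty_background_ratio

# dirty_ratio: % of RAM at which a process doing writes is itself forced
# into (blocking) writeback
cat /proc/sys/vm/dirty_ratio
```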

Best,

George

Hi @george_patargias ,

As you are using direct I/O, there is no page cache involved anymore. So I guess those parameters do not need to be tweaked → tweaking them should not have any effect.

How parallel are these transfers? Do you transfer multiple batches, one after the other?

What can also be done is to try iperf3 tests between your clients and your server. That would allow you to take EOS out of the loop and really test the network between your clients and your server.
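A minimal iperf3 sketch for this (the hostname and stream count are placeholders; a single TCP stream usually cannot fill a 100Gb link, so several parallel streams are used):

```shell
# On the EOS node (server side):
iperf3 -s

# On a client: 8 parallel TCP streams (-P 8) for 60 seconds (-t 60);
# increase the stream count if the link is not saturated
iperf3 -c eos-node.example.org -P 8 -t 60
```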

Cedric

Thanks for the confirmation about the parameters Cedric.

We run the following loop (on two different client hosts)

for i in 1 2 4 8 16 32 64; do echo "Starting $i transfers"; timeout 600 ./run_parallel_writes.sh $i 100; sleep 60; done

which initiates an increasing number (1, 2, 4, …) of parallel batches of XRootD transfers; each batch consists of 100 serial transfers.
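For the record, run_parallel_writes.sh is roughly equivalent to the following sketch (hypothetical; the actual script may differ, and the host, source file and destination path are placeholders):

```shell
#!/bin/bash
# Hypothetical sketch: $1 parallel batches, each doing $2 serial XRootD copies.
NBATCH=$1
NFILES=$2
for b in $(seq 1 "$NBATCH"); do
  (
    for f in $(seq 1 "$NFILES"); do
      # -f overwrites any file left over from a previous run
      xrdcp -f /data/testfile \
        "root://eos-node.example.org//eos/test/batch${b}_file${f}"
    done
  ) &
done
wait   # return once all batches have finished
```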

We have already run iperf3 tests between the clients and the server, and we established that 8 x iperf3 transfers from one of the new EOS nodes can saturate the 100G NIC of another. So, we are happy with the NIC. We are also happy with the direct FIO tests on the SSDs, as we are obtaining the advertised performance.

Best,

George

OK thanks for your input.

I think we can organize a call so that we try different things live.
What do you think?

This week I’m available only tomorrow, Wednesday the 8th, but next week I’m available the whole week. Don’t hesitate to reach me via email at cedric.caffy@cern.ch if you want to organize a Zoom chat.

Cheers,
Cedric

Hi @george_patargias and @tom.byrne ,

Any news from the tests? Is it better now?

Thanks!

Cheers,
Cedric

Hi Cedric,

Sorry for the delay. When I first tried to reproduce the performance (64 parallel writes) that we saw during our Zoom chat - having set the direct I/O space policy for writes and assigned 8 scheduling groups to each EOS filesystem - the network inbound rate was again fluctuating, and eos fs ls --io showed that some SSDs were more loaded than others.

I rebooted and reinstalled the node and tried again, and now we got the write performance we were expecting.

Not sure why we didn’t get it on the first attempt, but let’s hope it stays this way!

I am now going to try running 32 reads/32 writes and play with space.policy.iopriority as you showed us. Just to confirm: for space.policy.iopriority to work, we need to have the bfq scheduler configured on the SSDs, right? We currently have none, which may explain why the change didn’t work during the chat.

By the way, I read that bfq is good for HDDs and slow SSDs, and kyber is recommended for fast SSDs. Is the kyber scheduler supported by EOS?
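For what it’s worth, the active scheduler can be inspected and switched per device at runtime, e.g. (the device name is a placeholder; the bracketed entry in the output is the active scheduler):

```shell
# List available schedulers; the active one is shown in brackets,
# e.g. "[none] mq-deadline kyber bfq"
cat /sys/block/sda/queue/scheduler

# Switch to bfq at runtime (not persistent across reboots; a udev rule
# or tuned profile is needed to make it stick)
echo bfq > /sys/block/sda/queue/scheduler
```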

Hi George,

Thanks for your reply. I think you should probably not use I/O priorities for your use case. CTA does not use them, and I guess direct I/O is good enough.

From this website: @Large Research - A Systematic Configuration Space Exploration of the Linux Kyber I/O Scheduler

We report 11 observations and make 5 guidelines that indicate that (i) Kyber can deliver up to 26.3% lower read latency than the None scheduler with interfering write workloads; (ii) with a file system, Kyber can be configured to deliver up to 35.9% lower read latency at the cost of 34.5%–50.3% lower write throughput, allowing users to make a trade-off between read latency and write throughput; and (iii) Kyber leads to performance losses when Kyber is used with multiple throughput-bound workloads and the SSD is not the bottleneck.

Let’s try to keep the configuration simple, stupid.

Can you try real workflows with this machine? I do not know if you have a preproduction CTA instance, but it might be worth looking at how this machine behaves with loads from tape drives and reads from users.

Cheers,
Cedric

Hi Cedric,

Thanks for these comments. We have nodes of this generation in production - although with non-direct I/O and with a single scheduling group - and the real loads coming from the tape servers are never high enough to exceed 10Gb/s either way.

Please have a look at the following network I/O screenshot from the most recent test on the node that is deployed on our preprod instance

  • The test starts with 64 parallel writes and runs for about 15 mins. Performance is good.
  • Then only 32 parallel writes are running for another 5 mins. Performance remains good.
  • After 5 mins, and while the 32 parallel writes are still running, I initiate 32 parallel reads. The read performance initially looks good but then becomes flaky.
  • Then I stop all transfers for about 10 mins and restart 32 writes and 32 reads. The read performance is even more flaky, and the write performance also shows signs of degradation.
  • I stop all transfers for a few mins and restart 32 writes and 32 reads. The write performance is not so bad, but the read performance continues to be flaky.

I did notice disk “hot spotting” during the last two sets of tests: the SSD with the lowest occupancy (although not more than 2% lower than the other SSDs) at some point attracted an abnormally high number of reads and writes, 4-5 times higher than the other SSDs. For example:

antares-eos30.gridpp.rl.ac.uk:1095     28        default.1            undef       0.68       828.54      1452.93          0          0          0      3      2  79.82      6.13 TB      7.68 TB          633    375.07 M          0  12517   3983 MB
 antares-eos30.gridpp.rl.ac.uk:1095     29        default.2            undef       0.80      1443.04      1355.01          0          0          0      0      2  80.08      6.15 TB      7.68 TB          635    375.07 M          0  12401   3752 MB
 antares-eos30.gridpp.rl.ac.uk:1095     30        default.3            undef       0.55       667.64      1489.61          0          0          0      1      2  79.82      6.13 TB      7.68 TB          633    375.07 M          0  12261   4603 MB
 antares-eos30.gridpp.rl.ac.uk:1095     31        default.4            undef       0.65       880.38      1467.66          0          0          0      4      2  79.69      6.12 TB      7.68 TB          632    375.07 M          0  12247   4616 MB
 antares-eos30.gridpp.rl.ac.uk:1095     32        default.5            undef       0.26       104.42      1126.51          0          0          0      2      2  79.95      6.14 TB      7.68 TB          634    375.07 M          0  12472   3847 MB
 antares-eos30.gridpp.rl.ac.uk:1095     33        default.6            undef       0.66      1187.38      1234.67          0          0          0      2      3  79.95      6.14 TB      7.68 TB          634    375.07 M          0  12338   4031 MB
 antares-eos30.gridpp.rl.ac.uk:1095     34        default.7            undef       0.98      2447.79      1414.38          0          0          0      1      3  79.82      6.13 TB      7.68 TB          633    375.07 M          0  12515   4637 MB
 antares-eos30.gridpp.rl.ac.uk:1095     35        default.8            undef       1.00      1308.21      1288.48          0          0          0     19     16  77.61      5.96 TB      7.68 TB          616    375.07 M          0  12313   3741 MB

Towards the end of the test, I noticed that the SSD “hot spotting” disappeared for a short time (I/O was more balanced across the SSDs), which correlated immediately with the maxing-out of the iftop rates. As this lasted only a few secs, you can’t see it in the above screenshot.

From your suggestions, I thought the assignment of a separate scheduling group to each SSD/filesystem would ensure a balanced distribution of transfers. Is this a bug in the EOS scheduler?

I am running EOS 5.3.10

Apologies, I just found out about the EOS scheduler - 4.6. MGM Microservices — EOS DIOPSIDE documentation (hopefully this is the right page).

As far as I can see, I don’t have any scheduling enabled:

[root@cta-adm-preprodtier1 ~]# eos space status default  | grep balancer
balancer                         := off
balancer.node.ntx                := 2
balancer.node.rate               := 25
balancer.threshold               := 20
geobalancer                      := off
geobalancer.ntx                  := 10
geobalancer.threshold            := 5
groupbalancer                    := off
groupbalancer.engine             := std
groupbalancer.file_attempts      := 50
groupbalancer.max_file_size      := 16G
groupbalancer.max_threshold      := 0
groupbalancer.min_file_size      := 1G
groupbalancer.min_threshold      := 0
groupbalancer.ntx                := 10
groupbalancer.threshold          := 5

Is it the first one - balancer - that I need?

Hi George,

Thanks for the news.

Nope, the balancer balances data so that the disks are filled to around the same level.

I’m going to ask my colleague who implemented the scheduler in EOS.

Cheers,
Cedric

Many thanks. Just for reference, I haven’t touched any of these either - 4.8. Using EOS — EOS DIOPSIDE documentation

Yes, that’s why I wanted to talk to my colleague; it is exactly related to that.

You may try them → roundrobin, random, … and see if you see an improvement. But we are not sure it will actually work, because you do not have that many SSDs in the machine.