EOS SSD read performance and implications for CTA tape drive throughput

Hi all,

We’re starting to throw some data at our CTA instance, and have noticed an odd asymmetry between our archive and retrieve rates, which seems to be linked to the “single file xrdcp rates” from our CTA EOS cluster.

Our setup for testing has a single node EOS with 15 SSD filesystems in the ‘default’ (archive) space, and a single tape server and drive. This node has been tested with FIO and could achieve 500+ MB/s read or write performance on a single drive, and the performance scaled as expected when testing more drives.

When testing EOS on the same node, a single xrdcp write could achieve the expected rate, e.g. ~500MB/s. In comparison, xrdcp reads could only hit ~260MB/s with a single stream. Both reads and writes could happily saturate the node’s network when doing multiple transfers (although it took more reads to saturate: 13 rd vs 6 wr). I experimented with multi-stream xrdcp (the -S option), and could achieve reasonable read rates compared to the underlying disks with ~8 streams.
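
For reference, the single- and multi-stream read tests were plain copies along these lines (the hostname and file path here are placeholders rather than our real ones):

    # single-stream read out of EOS, discarding the data
    xrdcp root://eos-node.example.org//eos/cta/archive/test10g - > /dev/null

    # the same read with 8 parallel streams
    xrdcp -S 8 root://eos-node.example.org//eos/cta/archive/test10g - > /dev/null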

We then moved on to verifying the archive and recall rates of the whole setup, and we saw the same asymmetry. A single tape drive in ‘retrieve’ would write data back to EOS at 400MB/s (our tape drives’ line speed), but archival was capped at ~260MB/s per tape drive. We tested a variety of file sizes (2GB-20GB), and although cta-taped often reported a faster ‘driveTransferSpeedMBps’ for the smaller files, the network rate from the EOS node to the tape server never exceeded that 260MB/s (i.e. the rate of a single xrdcp).

This raised a few questions:

  1. Has anyone got any ideas as to why our EOS read performance is half of what we expect? We believe our EOS setup is similar to the suggested CTA EOS setup (1 stripe, 1 replica, 4k block size, scanning disabled, etc.), but we may be missing something obvious.
  2. What sort of single file xrdcp read and write performance do other CTA users see from their SSD EOS clusters?
  3. Does cta-taped copy files from EOS into its buffer in a serial fashion? i.e. does the single-file read performance of EOS need to be greater than the tape drive’s write performance in order to saturate the tape drive and keep the buffer full, or can cta-taped copy files into its memory buffer in parallel (assuming there is space)?
  4. Is there a way to configure cta-taped to do multi-stream xrootd transfers from EOS? This may not be the right way to solve this problem, but it would be interesting to test.

Thanks for reading, happy to give more(!) detail if needed.

Cheers,
Tom

Hi Tom,
I did plenty of performance measurements to validate the various CERN eoscta instances and hardware models.

FIO tests the raw disk performance underneath, and you should not lose a single bit/s of per-xrootd-stream performance on top of that. I prefer parallel dds to synthetic benchmarks because optimizing EOS for CTA is simple: you just need to measure streamed IOs.

You really need to understand why your 1-stream read performance is just half the nominal speed of your raw SSDs. The reason this matters is simple: if a file is larger than the memory dedicated to file caching in cta-taped, it will be streamed from EOS to the tape at the rate of 1 xrootd stream. 250MB/s is lower than any drive write speed, and therefore the drive efficiency would be much lower.

This is really bad as the reason to go for SSDs is precisely to use tape drives as efficiently as possible.

You need to focus on xrdcp read out of eos performance.

  1. identify the file on the SSD and read it with dd to /dev/null several times, immediately after it has been written (do not forget to flush the caches between your reads); see the example after this list. FIO is nice, but dd tells you what you need from real EOS files, in the same conditions as xrdcp reading the file out of the SSD.

  2. look at the low-level IOs on the problematic disk: use your favorite tool for this (I used sar in the past as it makes it easy to get the per-device counters you need; again, see below). There could be another process reading data at the same time as your xrdcp stream out to the tape server.

  3. you said you have 15 SSDs and you tested with 6 write streams: what about the performance of 1 write stream per SSD, scaling from 1 to 15 of these? And then reading back with dd using 1 and then 15 streams on different SSDs? There can be bottlenecks in the SAS/SATA controllers or the PCI bus at these rates.
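
For point 1, something along these lines (the data file path is just an illustration, pick a real file on one of the FST mounts):

    # drop the page cache so dd really reads from the SSD
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/data/fs01/000000a1/0012abcd of=/dev/null bs=1M

and for point 2, e.g. per-device counters once per second while the transfer runs:

    sar -d -p 1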

These bandwidth-driven servers always have bottlenecks: just measure them at all levels: individual SSD performance, SAS controller/PCI performance, then the network.

What network card is attached to these machines?

Regarding your questions:

  1. I defined all these configurations for EOSCTA instances because I was bitten by each of them… I may have forgotten something. Could you add a dump of something like eos -j fs ls default | jq '.result[0]', anonymizing the uuid and host?
  2. SSD performance, i.e. a single xrdcp stream should reach the raw SSD rate (see above)
  3. cta-admin --json dr ls | jq '.[].driveConfig[] | select(.key == "NbDiskThreads")' will tell you how many xrdcp streams cta-taped is using (the compile-time default is 10)
  4. more streams per file can just lower performance

Cheers,
Julien

Hi Julien,

Thanks for the detailed message. Your first point was (unfortunately) very helpful: a quick check with dd confirmed that the dd streaming rate of files on the SSDs is identical to the xrdcp rate, which I was not expecting. I’ll go back around the loop and work out what changed between my original testing and the current EOS setup.

I’ll come back to the rest of your points if that doesn’t fix the issue, but thanks again for all the information!

Cheers,
Tom

The difference in dd and FIO performance was due to not using direct IO with dd. With direct IO (bypassing the page cache), I get the expected ‘SSD performance’ from both benchmarks.
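
For reference, the comparison boiled down to (the file path is a placeholder for a real file on one of the EOS SSDs):

    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/data/fs01/testfile of=/dev/null bs=1M               # buffered: the slow ~260MB/s rate
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/data/fs01/testfile of=/dev/null bs=1M iflag=direct  # O_DIRECT: the expected 500+MB/s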

I’m now investigating two things:

a) should buffered IO be this slow?
b) should the fst be bypassing the page cache for IO?

Any thoughts would be appreciated!

Regarding your dd tests: did you go for a raw device-level test first? dd of=/dev/sdXX ... and vice versa, to assess the raw device read/write performance?

Then what about the filesystem’s contribution to the performance degradation?
I suggest you try both ext4 and xfs and run dd over the filesystem, something like the sketch below.
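
Roughly the following, with direct IO to keep the page cache out of the picture (sdX is a placeholder, and obviously only do the destructive raw-device write on a scratch disk):

    # raw device, no filesystem
    dd if=/dev/zero of=/dev/sdX bs=1M count=20480 oflag=direct
    dd if=/dev/sdX of=/dev/null bs=1M count=20480 iflag=direct

    # then the same through a filesystem, e.g. xfs
    mkfs.xfs -f /dev/sdX && mount /dev/sdX /mnt/ssdtest
    dd if=/dev/zero of=/mnt/ssdtest/bigfile bs=1M count=20480 oflag=direct
    dd if=/mnt/ssdtest/bigfile of=/dev/null bs=1M iflag=direct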

I am using xfs for my SSD filesystems on the eoscta FSTs. I am not sure this is written down anywhere, but it gives you a baseline performance check.

Then one note on buffered IO and eoscta instances: if the latency between the disk write and the tape read is low enough, the disk server is likely to find the file in memory, with no need to read the disk.

I have seen this happen quite often during low-throughput retrieve or conversion-driven workflows (the latency for archival is too high for files to be served from memory…).

I tested both the raw device and an XFS filesystem on top of it, and got the same performance in both cases.

Your points about using the OS page cache to your advantage in eoscta make a lot of sense. I spent some time trying to understand the differences between buffered and unbuffered IO from our SSDs’ perspective, as you suggested (I was using iostat as I am more familiar with it).

The main difference I saw was that buffered IO resulted in significantly smaller IO requests hitting the disk, and small queue sizes (avgrq-sz and avgqu-sz in iostat -x terms), which led to the slow IO and poor disk utilisation.

I found that increasing the readahead size (/sys/block/sdX/queue/read_ahead_kb) for the EOS SSDs significantly improved buffered IO speed. It was set at 128KB, which matched the request sizes reported by iostat for buffered IO. I increased it to the max configured IO size (512KB) and the xrdcp speed rose to ~350MB/s. Increasing it further (1MB+) resulted in xrdcp reads hitting 550+MB/s (as the queue sizes increased).
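
For the record, the change itself is just the following (sdX being each of the EOS data disks; note that this is not persistent across reboots, so it will need a udev rule or similar to survive):

    cat /sys/block/sdX/queue/read_ahead_kb          # was 128
    echo 1024 > /sys/block/sdX/queue/read_ahead_kb
    # equivalent via blockdev, which counts in 512-byte sectors
    blockdev --setra 2048 /dev/sdX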

EOS can now keep a tape drive happily writing at line speed. Many thanks for your help!

Out of interest, do you know what read_ahead_kb you are running on your SSDs? Since all requests are going to be xrdcp reads, which I think default to 8MB, I was wondering if that was a good number to aim for, but there may be a reason for a smaller value. It would also be good to know if you think there is any downside to running with read_ahead_kb set higher than the default!

Cheers
Tom

Good to hear your streaming performance is reaching nominal device speed and is symmetrical.

read_ahead_kb is set to 4096 on our SSDs: this helped a lot with spinners to avoid expensive seeks, and it is worth keeping it high for SSDs too, as we need to maximize the individual stream performance of large files per SSD to/from an army of tape drives.

It would be interesting to measure the impact of this number by looking at the iops and the individual IO size for reads, per SSD device.

But you should be good to go.

Just make sure you measure the various bottlenecks inside your machine while these servers are not yet in production: 1 write stream per SSD, scaling from 1 to 15 SSDs, then the same for reads, and finally 15 reads + 15 writes, all while measuring with sar or any other IO performance tool. This should allow you to measure the full throughput you can get out of your SSD subsystem through the PCIe + SAS controller.
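
In shell terms, roughly the following while sar runs in another terminal, scaling the number of parallel streams from 1 up to 15 (the mount points are placeholders for your 15 SSD filesystems):

    # one direct-IO write stream per SSD
    for n in $(seq -w 1 15); do
        dd if=/dev/zero of=/data/fs$n/ddtest bs=1M count=20480 oflag=direct &
    done
    wait

    # then one read stream per SSD (flush the cache first), then both together
    sync && echo 3 > /proc/sys/vm/drop_caches
    for n in $(seq -w 1 15); do
        dd if=/data/fs$n/ddtest of=/dev/null bs=1M iflag=direct &
    done
    wait

    # per-device throughput every second
    sar -d -p 1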

This baseline may help for later optimizations and it is easier to measure before the servers are in production.

Cheers,
Julien

Cool, glad to hear you have a similar readahead size set.

I ran a quick loop through various values for read_ahead_kb yesterday when I was trying to settle on a value, from 128KB to 64MB in powers of two, repeatedly dd'ing a large file (and clearing caches etc in between).
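
The loop was roughly (device and file names are placeholders; the dd is deliberately buffered, since readahead only applies to page-cache IO):

    for ra in 128 256 512 1024 2048 4096 8192 16384 32768 65536; do
        echo $ra > /sys/block/sdX/queue/read_ahead_kb
        sync && echo 3 > /proc/sys/vm/drop_caches
        dd if=/data/fs01/bigfile of=/dev/null bs=1M
    done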

The general trend was the request size increasing and the iops decreasing with increasing readahead, until the readahead hit max_sectors_kb. I believe max_sectors_kb is the maximum IO size the underlying drive will be given. Note: I had it set to 1024kB for this test, but it defaults to 512kB on our SSDs; the performance is very similar, so I’m going to leave it at 512kB in production.

After that, the IOPS, read rate, queue size and request latency start increasing with increasing readahead (as each ‘read’ is broken up into more operations); this continues until the drive is saturated (a readahead of 32MB in my test), at which point the latency and queue size continue to climb with no further increase in IOPS or read rate.

read_ahead_kb   r/s      rMB/s     avg req size (kB)  avg Q size  avg wait (ms)  % util
128             2164.5   277.056   128                0.74        0.34           73.83
256             1177.75  301.504   256                0.77        0.65           75.5
512             590.25   302.208   512                0.74        1.25           73.25
1024            294      301.056   1024               0.88        3              82.55
2048            324.69   332.4808  1024               1.78        5.48           84.01
4096            392.5    401.658   1023.33            3.06        7.8            78.92
8192            443.75   454.4     1024               6.77        15.23          85.9
16384           490.27   502.0409  1024               15.47       31.55          92.89
32768           538      550.912   1024               43.32       79.8           100
65536           529.68   542.387   1024               59.43       113.58         99.75

I noted that dd seemed to be very single-threaded, so I needed a large readahead to get the queue size high enough to saturate the drive. For our drives, the queue size required for saturation with this workload seems to be ~20. A single-stream xrdcp seems to be slightly better at filling the queue than dd, so saturation occurs at ~4-8MB readahead on these drives with xrdcp, with the queue size sitting around 23. In general, it feels like you shouldn’t need to set a readahead larger than max_sectors_kb, but you then rely on the application being able to queue up enough IO to saturate the drive, which doesn’t seem to be the case for dd or xrdcp.

In terms of whole-machine benchmarking, I have run FIO tests (with unbuffered IO) on combinations of drives and combinations of workloads. The only disk bottleneck observed is the HBA’s PCIe 3.0 x8 interface, which caps out at ~7GB/s. This limit is well above the networking capability of the node (25Gig), so I’m not too concerned about it. I’m going to double-check a few of the results with buffered IO, to confirm that we see similar performance to my previous testing, especially now that I’ve understood the difference between buffered and non-buffered IO performance.
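
The per-drive FIO jobs were along these lines (the mount point is a placeholder), repeated across 1 to 15 filesystems and with read, write and mixed variants:

    fio --name=seqread --directory=/data/fs01 --rw=read --bs=1M --direct=1 \
        --ioengine=libaio --iodepth=32 --size=20G --runtime=60 --time_based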

Thanks again for all your help!
Tom

Hi Julien,

In a large-scale recall test from LHCb that started yesterday, we are not getting EOS outbound rates much above 300MB/s, although Tom had set read_ahead_kb to 8192 on all retrieve buffer nodes.

The retrieve space is filling up and we are relatively confident that there isn’t a bottleneck between EOS and our Ceph disk cluster.

Do we need to reduce/increase the above read_ahead_kb value?

Thanks,

George

Hi George,
if you have a separate retrieve space and no colliding archival activity on the disks used for retrieve, you should have a look at the read/write accesses to the SSDs.
There could be some scanning activity from the EOS side, or something else. This is why the first thing to check is that the read/write SSD activity corresponds to what you expect, given the number of recalling tape drives and transfers.
One other thing would be to have a look at the retrieve space configuration and compare it with the default space: just make sure that they have the same scanning and other config, e.g. with the commands below.
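
For example, dump one filesystem from each space (using whatever your retrieve space is called) and diff the attributes, in particular the scan-related ones:

    eos -j fs ls retrieve | jq '.result[0]'
    eos -j fs ls default  | jq '.result[0]'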

Julien

Hi Julien,

Thanks for the reply. After a bit more poking, it looks like the link between our EOS and Echo in the FTS may not have been configured with enough concurrent transfers to empty the buffer at the rate it was being filled by the tape drives. After the changes we discussed, I think the read and write performance of our SSD EOS cluster is now looking good.

Cheers,
Tom

Hi Tom,
thanks for the feedback and the confirmation that everything is fine on the raw hardware performance side.

Cheers,
Julien