I did a couple of tests to evaluate the impact of filesystem overhead and
block size, so here are some preliminary results. I'm running a more
extensive set of tests, but some of this seems interesting.
I did two sets of tests:
1) fio test on raw devices
2) fio tests on ext4/xfs with different fs block size
Both sets of tests were executed with varying iodepth (1, 2, 4, ...) and
number of processes (1, 8).
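To give an idea of the setup, each combination boils down to a fio run
like the ones below - this is just a sketch, the device path, file size
and runtime are placeholders rather than the exact parameters used:

    # raw device test (direct I/O against the block device itself)
    fio --name=raw-randwrite --filename=/dev/nvme0n1 --direct=1 \
        --ioengine=libaio --rw=randwrite --bs=4k \
        --iodepth=4 --numjobs=1 \
        --runtime=60 --time_based --group_reporting

    # filesystem test (same parameters, but against a file on the mounted fs)
    fio --name=fs-randwrite --directory=/mnt/test --size=16G --direct=1 \
        --ioengine=libaio --rw=randwrite --bs=4k \
        --iodepth=4 --numjobs=1 \
        --runtime=60 --time_based --group_reporting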
The results are attached - a CSV file with the results, and a PDF with
pivot tables showing them in a more readable format.
1) raw device tests
The results for raw devices have regular patterns, with smaller blocks
giving better performance - particularly for read workloads. For write
workloads, it's similar, except that 4K blocks perform better than 1-2K
ones (this applies especially to the NVMe device).
2) fs tests
This shows how the tests perform on ext4/xfs filesystems with different
block sizes (1K-4K).
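Varying the filesystem block size comes down to the mkfs step - roughly
like this (again just a sketch, the device path and mount point are
placeholders and all other mkfs options are left at the defaults here):

    # ext4 with 1K / 2K / 4K blocks (-b is the fs block size in bytes)
    mkfs.ext4 -b 1024 /dev/sdb1
    mkfs.ext4 -b 2048 /dev/sdb1
    mkfs.ext4 -b 4096 /dev/sdb1

    # xfs equivalents
    mkfs.xfs -f -b size=1024 /dev/sdb1
    mkfs.xfs -f -b size=2048 /dev/sdb1
    mkfs.xfs -f -b size=4096 /dev/sdb1

    mount /dev/sdb1 /mnt/test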
Overall the patterns are fairly similar to the raw devices. There are a
couple of strange things, though. For example, ext4 often behaves like
this on the "write" (i.e. sequential write) benchmark:
fs block \ bs       1K       2K       4K       8K      16K      32K
--------------------------------------------------------------------
         1024    33374    28290    27286    26453    22341    19568
         2048    33420    38595    75741    63790    48474    33474
         4096    33959    38913    73949    63940    49217    33017
It's somewhat expected that 1-2K I/O blocks perform worse than 4K (the
raw device behaves the same way), but notice how the behavior differs
depending on the fs block size. With 2K and 4K fs blocks the throughput
improves, but with 1K fs blocks it just goes down. For higher iodepth
values this is even more visible:
fs block \ bs       1K       2K       4K       8K      16K      32K
--------------------------------------------------------------------
         1024    34879    25708    24744    23937    22527    19357
         2048    31648    50348   282696   236118   121750    60646
         4096    34273    39890   273395   214817   135072    66943
The interesting thing is that xfs does not have this issue.
Furthermore, it seems interesting to compare IOPS on a filesystem to the
raw device, which might be seen as the "best case" without the fs
overhead. The "comparison" attachments do exactly that.
There are two interesting observations here:
1) ext4 seems to have some issue with 1-2K random writes (randrw and
randwrite tests) on filesystems with larger 2-4K blocks. Consider for
example this:
fs block \ bs       1K       2K       4K       8K      16K      32K
--------------------------------------------------------------------
         1024   214765   143564   108075    83098    58238    38569
         2048    66010   216287   260116   214541   113848    57045
         4096    66656    64155   268141   215860   109175    54877
Again, xfs does not behave like this.
2) Interestingly enough, some cases can actually perform better on a
filesystem than directly on the raw device - I'm not sure what the
explanation is, but it only happens on the SSD RAID (not on the NVMe),
and with higher iodepth values.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company