Re: pgcon unconference / impact of block size on performance - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: pgcon unconference / impact of block size on performance
Msg-id: 58f299fd-2812-61f8-d089-9836e8bea333@enterprisedb.com
In response to: RE: pgcon unconference / impact of block size on performance (Jakub Wartak <Jakub.Wartak@tomtom.com>)
List: pgsql-hackers
I did a couple of tests to evaluate the impact of filesystem overhead
and block size, so here are some preliminary results. I'm still running
a more extensive set of tests, but some of this already seems
interesting.

I did two sets of tests:

1) fio test on raw devices

2) fio tests on ext4/xfs with different fs block size

Both sets of tests were executed with varying iodepth (1, 2, 4, ...) and
number of processes (1, 8).
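
To make that a bit more concrete, the whole sweep looks roughly like the
sketch below. This is just an illustration - the device path, mount
point, file size, runtime and the exact workload list are placeholders,
not the precise setup used for the attached results (and it of course
overwrites whatever is on the device):

# Rough sketch of the benchmark sweep: optionally (re)create a filesystem
# with a given block size, then run fio for every combination of io block
# size, iodepth and number of processes. Paths/sizes/runtime are placeholders.
import subprocess

DEV = "/dev/nvme0n1"       # placeholder block device (gets overwritten!)
MNT = "/mnt/fio-test"      # placeholder mount point
IO_BLOCK_SIZES = ["1k", "2k", "4k", "8k", "16k", "32k"]
IODEPTHS = [1, 2, 4, 8, 16, 32]
NUMJOBS = [1, 8]
WORKLOADS = ["read", "write", "randread", "randwrite", "randrw"]

def mkfs(fstype, fs_block):
    """Create ext4/xfs with the requested fs block size and mount it."""
    if fstype == "ext4":
        subprocess.run(["mkfs.ext4", "-F", "-b", str(fs_block), DEV], check=True)
    else:  # xfs
        subprocess.run(["mkfs.xfs", "-f", "-b", f"size={fs_block}", DEV], check=True)
    subprocess.run(["mount", DEV, MNT], check=True)

def run_fio(target, workload, bs, iodepth, numjobs):
    """Run a single fio job with direct I/O against a raw device or a file."""
    subprocess.run([
        "fio",
        f"--name={workload}-{bs}-{iodepth}-{numjobs}",
        f"--filename={target}",
        f"--rw={workload}",
        f"--bs={bs}",
        f"--iodepth={iodepth}",
        f"--numjobs={numjobs}",
        "--direct=1",
        "--ioengine=libaio",
        "--size=16G",                      # placeholder
        "--time_based", "--runtime=60",    # placeholder
        "--group_reporting",
        "--output-format=json",            # collect/parse the output as needed
    ], check=True)

# 1) raw device tests - point fio directly at the block device
for workload in WORKLOADS:
    for bs in IO_BLOCK_SIZES:
        for iodepth in IODEPTHS:
            for numjobs in NUMJOBS:
                run_fio(DEV, workload, bs, iodepth, numjobs)

# 2) fs tests - recreate the filesystem with each block size and run
# against a file on it
for fstype in ("ext4", "xfs"):
    for fs_block in (1024, 2048, 4096):
        mkfs(fstype, fs_block)
        for workload in WORKLOADS:
            for bs in IO_BLOCK_SIZES:
                for iodepth in IODEPTHS:
                    for numjobs in NUMJOBS:
                        run_fio(MNT + "/fio.data", workload, bs, iodepth, numjobs)
        subprocess.run(["umount", MNT], check=True)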

The results are attached - a CSV file with the results, and a PDF with
pivot tables showing them in a more readable format.


1) raw device tests

The results for raw devices have regular patterns, with smaller blocks
giving better performance - particularly for read workloads. For write
workloads, it's similar, except that 4K blocks perform better than 1-2K
ones (this applies especially to the NVMe device).


2) fs tests

This shows how the tests perform on ext4/xfs filesystems with different
block sizes (1K-4K). Overall the patterns are fairly similar to the raw
devices. There are a couple of strange things, though.

For example, ext4 often behaves like this on the "write" (i.e.
sequential write) benchmark (the rows are the fs block size in bytes,
the columns are the io block size):

   fs block     1K       2K       4K       8K      16K      32K
  --------------------------------------------------------------
   1024      33374    28290    27286    26453    22341    19568
   2048      33420    38595    75741    63790    48474    33474
   4096      33959    38913    73949    63940    49217    33017

It's somewhat expected that 1-2K blocks perform worse than 4K (the raw
device behaves the same way), but notice how the behavior differs
depending on the fs block. For 2K and 4K fs blocks the throughput
improves, but for the 1K fs block it just goes down. For higher iodepth values
this is even more visible:

   fs block     1K      2K       4K       8K      16K     32K
  ------------------------------------------------------------
   1024      34879   25708    24744    23937    22527   19357
   2048      31648   50348   282696   236118   121750   60646
   4096      34273   39890   273395   214817   135072   66943

The interesting thing is that xfs does not have this issue.

Furthermore, it seems interesting to compare iops on a filesystem to the
raw device, which might be seen as the "best case" without the fs
overhead. The "comparison" attachments do exactly that.

There are two interesting observations here:

1) ext4 seems to have some issue with 1-2K random writes (randrw and
randwrite tests) with larger 2-4K filesystem blocks. Consider for
example this:

   fs block      1K        2K        4K        8K      16K      32K
  ------------------------------------------------------------------
   1024      214765    143564    108075     83098    58238    38569
   2048       66010    216287    260116    214541   113848    57045
   4096       66656     64155    268141    215860   109175    54877

Again, xfs does not behave like this.

2) Interestingly enough, some cases can actually perform better on a
filesystem than directly on the raw device - I'm not sure what the
explanation is, but it only happens on the SSD RAID (not on the NVMe),
and only with higher iodepth values.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
