pgcon unconference / impact of block size on performance - Mailing list pgsql-hackers

Hi,

At one of the pgcon unconference sessions a couple days ago, I presented
a bunch of benchmark results comparing performance with different
data/WAL block sizes. Most of the OLTP results showed significant gains
(up to 50%) with smaller (4k) data pages.

This opened a long discussion about possible explanations - I claimed
one of the main factors is the adoption of flash storage, due to pretty
fundamental differences between HDD and SSD systems. But the discussion
concluded with an agreement to continue investigating this, so here's an
attempt to support the claim with some measurements/data.

Let me present results of low-level fio benchmarks on a couple different
HDD and SSD drives. This should eliminate any postgres-related influence
(e.g. full-page writes, FPW) and demonstrate the inherent HDD/SSD
differences.

Each of the PDF pages shows results for five basic workloads:

 - random read
 - random write
 - random r/w
 - sequential read
 - sequential write

The charts on the left show IOPS, the charts on the right bandwidth. The
x-axis shows I/O depth - number of concurrent I/O requests or queue
length, with values 1, 2, 4, 8, 64, 128. And each "group" shows results
for a different page size (1K, 2K, 4K, 8K, 16K, 32K). The colored page
size is the default value (8K).

This makes it clear how a page size affects performance (IOPS and BW)
for a given I/O depth, and also the impact of higher I/O depth.
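
To make the test matrix concrete, here is a minimal sketch that generates
one fio command line per combination of workload, block size and I/O
depth listed above. The exact fio options used for the attached results
are not stated in this message, so the I/O engine, direct I/O, file size,
runtime and the test file path are all assumptions chosen only for
illustration.

```python
from itertools import product

# Workloads, block sizes and I/O depths as listed in the text above.
# The mapping to fio's --rw values is the standard one.
WORKLOADS = {
    "random read": "randread",
    "random write": "randwrite",
    "random r/w": "randrw",
    "sequential read": "read",
    "sequential write": "write",
}
BLOCK_SIZES = ["1k", "2k", "4k", "8k", "16k", "32k"]
IO_DEPTHS = [1, 2, 4, 8, 64, 128]


def fio_commands(filename="/path/to/testfile", runtime_s=60):
    """Return one fio command line per (workload, block size, I/O depth).

    filename, runtime_s, --ioengine, --direct and --size are assumed
    values, not the settings used for the original measurements.
    """
    cmds = []
    for rw, bs, depth in product(WORKLOADS.values(), BLOCK_SIZES, IO_DEPTHS):
        cmds.append(
            f"fio --name={rw}-{bs}-qd{depth} --filename={filename} "
            f"--rw={rw} --bs={bs} --iodepth={depth} "
            f"--ioengine=libaio --direct=1 --size=1g "
            f"--time_based --runtime={runtime_s}"
        )
    return cmds


if __name__ == "__main__":
    for cmd in fio_commands():
        print(cmd)
```

Each printed line is one data point in the charts: a fixed workload,
page size and I/O depth.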


I do think the difference between HDD and SSD storage is pretty clearly
visible (even though there is some variability between the SSD devices).

IMHO the crucial difference is that for HDD, the page size has almost no
impact on IOPS (in the random workloads). If you look at the random read
results, the page size does not matter - once you fix the I/O depth, the
results are pretty much exactly the same. For the random write test it's
even clearer, because the I/O depth does not matter and you get 350 IOPS
no matter the page size or I/O depth.

This makes perfect sense, because for "spinning rust" the dominant part
is seeking to the right part of the platter. And once you've seeked to
the right place, it does not matter much if you read 1K or 32K - the
cost is much lower than the seek.
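
A rough way to see this is a toy service-time model: each random request
pays a fixed positioning cost (seek plus rotational latency) and then a
transfer cost proportional to the page size. The 8 ms positioning time
and 150 MB/s transfer rate below are assumed round numbers for
illustration, not values measured on the drives in the attached results.

```python
def hdd_random_iops(page_bytes, positioning_ms=8.0, transfer_mb_s=150.0):
    """Estimate random IOPS at I/O depth 1 for a simple HDD model.

    positioning_ms and transfer_mb_s are assumed, illustrative values.
    """
    # Time to move the data once the head is in place, in milliseconds.
    transfer_ms = page_bytes / (transfer_mb_s * 1_000_000) * 1000
    # One request takes positioning + transfer; IOPS is the inverse.
    return 1000.0 / (positioning_ms + transfer_ms)


if __name__ == "__main__":
    for kb in (1, 2, 4, 8, 16, 32):
        print(f"{kb:>2}K pages: {hdd_random_iops(kb * 1024):6.1f} IOPS")
```

With these assumptions the transfer time stays a small fraction of the
positioning time even for 32K pages, so the estimated IOPS barely moves
across page sizes.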

And the economy is pretty simple - you can't really improve IOPS, but
you can improve bandwidth by using larger pages. If you do 350 IOPS, it
can be either 350kB/s with 1K pages or 11MB/s with 32KB pages.
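
The arithmetic is simply bandwidth = IOPS x page size. A minimal sketch
using the 350 IOPS figure from the random write results (the function
name is just for illustration):

```python
def bandwidth_kb_s(iops, page_kb):
    """Bandwidth in kB/s when every I/O request transfers one page."""
    return iops * page_kb


if __name__ == "__main__":
    # 350 IOPS is the HDD random write figure quoted in the text.
    for page_kb in (1, 2, 4, 8, 16, 32):
        bw = bandwidth_kb_s(350, page_kb)
        print(f"{page_kb:>2}K pages: {bw:>6} kB/s")
```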

So we'd gain very little by using smaller pages, and larger pages
improve bandwidth - not just for random tests, but sequential too. And
8KB seems like a reasonable compromise - bandwidth with 32KB pages is
better, but with higher I/O depths (8 or more) we get pretty close,
likely due to hitting SATA limits.


Now, compare this to the SSD. There are some differences between the
models, manufacturers, interface etc. but the impact of page size on
IOPS is pretty clear. On the Optane you can get +20-30% by using 4K
pages, on the Samsung it's even more, etc. This means that workloads
dominated by random I/O get significant benefit from smaller pages.

Another consequence of this is that for sequential workloads, the
difference between page sizes is smaller, because the higher IOPS
reached with smaller pages offsets part of the bandwidth advantage of
larger pages.


If you imagine two extremes:

  1) different pages yield the same IOPS

  2) different pages yield the same bandwidth

then old-school HDDs are pretty close to (1), while future storage
systems (persistent memory) are likely close to (2).

This matters, because various trade-offs we've made in the past are
reasonable for (1), but will be inefficient for (2). And as the results
I shared during the pgcon session suggest, we might do so much better
even for current SSDs, which are somewhere between (1) and (2).
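
The two extremes can be sketched as two idealized devices. The baseline
numbers (350 IOPS, 11200 kB/s) are placeholders reused from the HDD
example earlier in this message, not measurements of any persistent
memory device.

```python
def extreme_same_iops(page_kb, iops=350):
    """Extreme (1): IOPS is fixed, so bandwidth grows with page size.

    Returns (iops, bandwidth in kB/s).
    """
    return iops, iops * page_kb


def extreme_same_bandwidth(page_kb, bandwidth_kb_s=11200):
    """Extreme (2): bandwidth is fixed, so IOPS grows as pages shrink.

    Returns (iops, bandwidth in kB/s).
    """
    return bandwidth_kb_s / page_kb, bandwidth_kb_s


if __name__ == "__main__":
    for page_kb in (4, 8, 32):
        print(page_kb, extreme_same_iops(page_kb),
              extreme_same_bandwidth(page_kb))
```

Under (1), halving the page size halves bandwidth and gains nothing in
IOPS. Under (2), halving the page size doubles IOPS at no bandwidth
cost, which is why a larger default page stops being an obvious win.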


The other important factor is the native SSD page, which is similar to
sectors on HDD. SSDs however don't allow in-place updates, and have to
reset/rewrite the whole native page. It's actually more complicated,
because the reset happens at a much larger scale (~8MB block), so it
does matter how quickly we "dirty" the data. The consequence is that
using data pages smaller than the native page (depends on the device,
but seems 4K is the common value) either does not help or actually hurts
the write performance.
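
As a very simplified illustration of why sub-native pages don't help,
assume the device always rewrites whole native pages. This sketch
ignores the larger erase blocks, garbage collection and everything else
the drive firmware does, and the 4K native page size is an assumption
that depends on the device.

```python
import math


def device_bytes_written(page_bytes, native_bytes=4096):
    """Bytes the device rewrites for one data page write.

    Assumes writes are rounded up to whole native pages; native_bytes
    is an assumed value.
    """
    return math.ceil(page_bytes / native_bytes) * native_bytes


def write_amplification(page_bytes, native_bytes=4096):
    """Ratio of bytes rewritten by the device to bytes requested."""
    return device_bytes_written(page_bytes, native_bytes) / page_bytes


if __name__ == "__main__":
    for kb in (1, 2, 4, 8, 16, 32):
        print(f"{kb:>2}K pages: x{write_amplification(kb * 1024):.1f}")
```

In this model a 1K or 2K write still costs a full 4K native page, so
going below the native page size only adds overhead per byte written.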

All the SSD results show this behavior - the Optane and Samsung nicely
show that 4K is much better (in random write IOPS) than 8K, but 1-2K
pages make it worse.


I'm sure there are other important factors - for example, once the
very expensive "seek" cost is eliminated (SSDs can do 10k-100k IOPS
easily, while HDDs did ~100-400 IOPS), other steps start to play a much
bigger role. I wouldn't be surprised if memcpy() started to matter, for
example.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company