pgcon unconference / impact of block size on performance - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: pgcon unconference / impact of block size on performance
Msg-id: b4861449-6c54-ccf8-e67c-c039228cdc6d@enterprisedb.com
List: pgsql-hackers
Hi,

At one of the pgcon unconference sessions a couple of days ago, I presented a bunch of benchmark results comparing performance with different data/WAL block sizes. Most of the OLTP results showed significant gains (up to 50%) with smaller (4K) data pages.

This opened a long discussion about possible explanations - I claimed one of the main factors is the adoption of flash storage, due to pretty fundamental differences between HDD and SSD systems. But the discussion concluded with an agreement to continue investigating this, so here's an attempt to support the claim with some measurements/data.

Let me present results of low-level fio benchmarks on a couple of different HDD and SSD drives. This should eliminate any postgres-related influence (e.g. FPW), and demonstrate the inherent HDD/SSD differences.

Each of the PDF pages shows results for five basic workloads:

- random read
- random write
- random r/w
- sequential read
- sequential write

The charts on the left show IOPS, the charts on the right bandwidth. The x-axis shows the I/O depth - the number of concurrent I/O requests (queue length), with values 1, 2, 4, 8, 64, 128. And each "group" shows results for a different page size (1K, 2K, 4K, 8K, 16K, 32K), with the default page size (8K) highlighted in color. This makes it clear how page size affects performance (IOPS and bandwidth) for a given I/O depth, and also what the impact of a higher I/O depth is.

I do think the difference between HDD and SSD storage is pretty clearly visible (even though there is some variability between the SSD devices).

IMHO the crucial difference is that for HDD, the page size has almost no impact on IOPS (in the random workloads). If you look at the random read results, the page size does not matter - once you fix the I/O depth, the results are pretty much exactly the same. For the random write test it's even clearer: you get ~350 IOPS no matter the page size or the I/O depth.

This makes perfect sense, because for "spinning rust" the dominant cost is seeking to the right part of the platter. Once you've seeked to the right place, it does not matter much whether you read 1K or 32K - that cost is much lower than the seek itself.

The economics are therefore pretty simple - you can't really improve IOPS, but you can improve bandwidth by using larger pages. At 350 IOPS, that's either 350 kB/s with 1K pages or ~11 MB/s with 32K pages. So we'd gain very little by using smaller pages, while larger pages improve bandwidth - not just in the random tests, but in the sequential ones too. And 8K seems like a reasonable compromise - bandwidth with 32K pages is better, but at higher I/O depths (8 or more) we get pretty close, likely due to hitting SATA limits.

Now, compare this to the SSDs. There are some differences between the models, manufacturers, interfaces etc., but the impact of page size on IOPS is pretty clear. On the Optane you can get +20-30% by using 4K pages, on the Samsung it's even more, etc. This means that workloads dominated by random I/O get a significant benefit from smaller pages.

Another consequence is that for sequential workloads the difference between page sizes is smaller, because the better IOPS achieved by smaller pages reduces the gap in bandwidth.

If you imagine two extremes:

1) different page sizes yield the same IOPS
2) different page sizes yield the same bandwidth

then old-school HDDs are pretty close to (1), while future storage systems (persistent memory) are likely close to (2).
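To illustrate the difference between the two extremes, here's a quick Python sketch. It's a toy model, not a measurement - the 350 IOPS figure and the derived ~11 MB/s come from the HDD random-write numbers above, everything else is just arithmetic:

    # extreme (1): IOPS is fixed, bandwidth scales with page size (HDD-like)
    # extreme (2): bandwidth is fixed, IOPS scales inversely (PMEM-like)

    HDD_IOPS = 350                       # random-write IOPS, page size irrelevant
    FIXED_BW_MB = HDD_IOPS * 32 / 1024   # ~11 MB/s, the 32K bandwidth above

    print("page |  (1) IOPS-bound        |  (2) BW-bound")
    for kb in (1, 2, 4, 8, 16, 32):
        bw = HDD_IOPS * kb / 1024        # MB/s when IOPS is the limit
        iops = FIXED_BW_MB * 1024 / kb   # IOPS when bandwidth is the limit
        print(f"{kb:3}K | {HDD_IOPS:6} IOPS {bw:5.1f} MB/s"
              f" | {iops:6.0f} IOPS {FIXED_BW_MB:4.1f} MB/s")

Under (1) larger pages are essentially free bandwidth, so 8K or 32K pages make sense. Under (2) smaller pages are essentially free IOPS, so the same choice leaves a lot of random I/O performance on the table.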
This matters, because various trade-offs we've made in the past are reasonable for (1), but will be inefficient for (2). And as the results I shared during the pgcon session suggest, we might do much better even on current SSDs, which sit somewhere between (1) and (2).

The other important factor is the native SSD page, which is similar to a sector on HDD. SSDs, however, don't allow in-place updates, and have to reset/rewrite the whole native page. It's actually more complicated than that, because the reset happens at a much larger scale (a ~8MB block), so it also matters how quickly we "dirty" the data. The consequence is that using data pages smaller than the native page (which depends on the device, but 4K seems to be the common value) either does not help or actually hurts write performance - the toy model below tries to make this concrete. All the SSD results show this behavior: the Optane and Samsung nicely show that 4K is much better (in random write IOPS) than 8K, but 1K or 2K pages make it worse.

I'm sure there are other important factors - for example, with the very expensive "seek" cost eliminated (SSDs easily do 10k-100k IOPS, while HDDs did ~100-400), other steps start to play a much bigger role. I wouldn't be surprised if memcpy() started to matter, for example.
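Here's the simplified Python model of the native-page effect. The 4K native page follows the observation above; the 100k programs/sec limit is a made-up device constant, and real FTLs buffer and coalesce writes (and erase at ~8MB granularity), so this only captures the first-order effect:

    NATIVE_KB = 4               # assumed native flash page (4K seems common)
    PROGRAMS_PER_SEC = 100_000  # hypothetical limit on native-page programs

    for page_kb in (1, 2, 4, 8, 16, 32):
        # each write programs ceil(page / native page) native pages, min 1
        programs = max(1, -(-page_kb // NATIVE_KB))
        iops = PROGRAMS_PER_SEC / programs
        wasted = max(0, NATIVE_KB - page_kb) / NATIVE_KB
        print(f"{page_kb:3}K page: {programs} program(s)/write,"
              f" ~{iops:8,.0f} IOPS, {wasted:4.0%} of a program wasted")

The model says write IOPS stops improving below 4K - each 1K or 2K write still consumes a full native-page program, wasting 75% or 50% of it - which matches what the Optane and Samsung results show.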
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company