Re: pgcon unconference / impact of block size on performance - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pgcon unconference / impact of block size on performance
Date
Msg-id 62160038-cf65-72a6-4738-343454d72e87@enterprisedb.com
Whole thread Raw
In response to RE: pgcon unconference / impact of block size on performance  (Jakub Wartak <Jakub.Wartak@tomtom.com>)
Responses Re: pgcon unconference / impact of block size on performance
RE: pgcon unconference / impact of block size on performance
List pgsql-hackers
On 6/6/22 16:27, Jakub Wartak wrote:
> Hi Tomas,
> 
>> Hi,
>>
>> At on of the pgcon unconference sessions a couple days ago, I presented a
>> bunch of benchmark results comparing performance with different data/WAL
>> block size. Most of the OLTP results showed significant gains (up to 50%) with
>> smaller (4k) data pages.
> 
> Nice. I just saw this
https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference , do
you have any plans for publishing those other graphs too (e.g. WAL block
size impact)?
> 

Well, there's plenty of charts in the github repositories, including the
charts I think you're asking for:

https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/xeon/20220406-fpw/16/heatmap-tps.png

https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/i5/20220427-fpw/16/heatmap-io-tps.png


I admit the charts may not be documented very clearly :-(

>> This opened a long discussion about possible explanations - I claimed one of the
>> main factors is the adoption of flash storage, due to pretty fundamental
>> differences between HDD and SSD systems. But the discussion concluded with an
>> agreement to continue investigating this, so here's an attempt to support the
>> claim with some measurements/data.
>>
>> Let me present results of low-level fio benchmarks on a couple different HDD
>> and SSD drives. This should eliminate any postgres-related influence (e.g. FPW),
>> and demonstrates inherent HDD/SSD differences.
>> All the SSD results show this behavior - the Optane and Samsung nicely show
>> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it
>> worse.
>>
> [..]
> Can you share what Linux kernel version, what filesystem , it's
> mount options and LVM setup were you using if any(?)
> 

The PostgreSQL benchmarks were with 5.14.x kernels, with either ext4 or
xfs filesystems.

i5 uses LVM on the 6x SATA SSD devices, with this config:

bench ~ # mdadm --detail /dev/md0
/dev/md0:
           Version : 0.90
     Creation Time : Thu Feb  8 15:05:49 2018
        Raid Level : raid0
        Array Size : 586106880 (558.96 GiB 600.17 GB)
      Raid Devices : 6
     Total Devices : 6
   Preferred Minor : 0
       Persistence : Superblock is persistent

       Update Time : Thu Feb  8 15:05:49 2018
             State : clean
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 512K

Consistency Policy : none

              UUID : 24c6158c:36454b38:529cc8e5:b4b9cc9d (local to host
bench)
            Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       8       65        4      active sync   /dev/sde1
       5       8       81        5      active sync   /dev/sdf1

bench ~ # mount | grep md0
/dev/md0 on /mnt/raid type xfs
(rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=16,swidth=96,noquota)


and the xeon just uses ext4 on the device directly:

/dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)


> I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @ 
> ~900GB (AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio
> only, fio deviations: runtime -> 1min, 64GB file, 1 iteration only.
> Results are attached, w/o graphs.
> 
>> Now, compare this to the SSD. There are some differences between 
>> the models, manufacturers, interface etc. but the impact of page
>> size on IOPS is pretty clear. On the Optane you can get +20-30% by
>> using 4K pages, on the Samsung it's even more, etc. This means that
>> workloads dominated by random I/O get significant benefit from
>> smaller pages.
> 
> Yup, same here, reproduced, 1.42x faster on writes:
> [root@x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
> [root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}'  | sort -n
> 1k/1.txt: bw=24162KB/s, iops=24161,
> 2k/1.txt: bw=47164KB/s, iops=23582,
> 4k/1.txt: bw=280450KB/s, iops=70112, <<<
> 8k/1.txt: bw=393082KB/s, iops=49135,
> 16k/1.txt: bw=393103KB/s, iops=24568,
> 32k/1.txt: bw=393283KB/s, iops=12290,
>
> BTW it's interesting to compare to your's Optane 900P result (same 
> two high bars for IOPS @ 4,8kB), but in my case it's even more import
> to select 4kB so it behaves more like Samsung 860 in your case
> 

Thanks. Interesting!

> # 1.41x on randreads
> [root@x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
> [root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
> 1k/1.txt: bw=169938KB/s, iops=169937,
> 2k/1.txt: bw=376653KB/s, iops=188326,
> 4k/1.txt: bw=691529KB/s, iops=172882, <<<
> 8k/1.txt: bw=976916KB/s, iops=122114,
> 16k/1.txt: bw=990524KB/s, iops=61907,
> 32k/1.txt: bw=974318KB/s, iops=30447,
> 
> I think that the above just a demonstration of device bandwidth 
> saturation: 32k*30k IOPS =~ 1GB/s random reads. Given that DB would
> be tuned @ 4kB for app(OLTP), but once upon a time Parallel Seq
> Scans "critical reports" could only achieve 70% of what it could
> achieve on 8kB, correct? (I'm assuming most real systems are really
> OLTP but with some reporting/data exporting needs).
> 

Right, that's roughly my thinking too. Also, OLAP queries often do a lot
of random I/O, due to index scans etc.

I also wonder how is this related to filesystem page size - in all the
benchmarks I did I used the default (4k), but maybe it'd behave if the
filesystem page matched the data page.

> One way or another it would be very nice to be able to select the
> tradeoff using initdb(1) without the need to recompile, which then
> begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
> DB page size, ...).>
> Do you envision any plans for this we still in a need to gather more
> info exactly why this happens? (perf reports?)
> 

Not sure I follow. Plans for what? Something that calibrates cost
parameters? That might be useful, but that's a rather separate issue
from what's discussed here - page size, which needs to happen before
initdb (at least with how things work currently).

The other issue (e.g. with effective_io_concurrency) is that it very
much depends on the access pattern - random pages and sequential pages
will require very different e_i_c values. But again, that's something to
discuss in a separate thread (e.g. [1])

[1]: https://postgr.es/m/Yl92RVoXVfs+z2Yj@momjian.us

> Also have you guys discussed on that meeting any long-term future 
> plans on storage layer by any chance ? If sticking to 4kB pages on 
> DB/page size/hardware sector size, wouldn't it be possible to win
> also disabling FPWs in the longer run using uring (assuming O_DIRECT
> | O_ATOMIC one day?)>
> I recall that Thomas M. was researching O_ATOMIC, I think he wrote
> some of that pretty nicely in [1] 
> 
> [1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO

No, no such discussion - at least no in this unconference slot.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: oat_post_create expected behavior
Next
From: David Geier
Date:
Subject: Re: Assertion failure with barriers in parallel hash join