RE: pgcon unconference / impact of block size on performance - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: RE: pgcon unconference / impact of block size on performance
Date:
Msg-id: AS8PR07MB8249D56663CE82B5826D64B2F6A29@AS8PR07MB8249.eurprd07.prod.outlook.com
In response to: pgcon unconference / impact of block size on performance (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: pgcon unconference / impact of block size on performance
List: pgsql-hackers
Hi Tomas,

> Hi,
>
> At one of the pgcon unconference sessions a couple days ago, I presented a
> bunch of benchmark results comparing performance with different data/WAL
> block size. Most of the OLTP results showed significant gains (up to 50%)
> with smaller (4k) data pages.

Nice. I just saw https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference - do you have any plans for publishing the other graphs too (e.g. WAL block size impact)?

> This opened a long discussion about possible explanations - I claimed one of
> the main factors is the adoption of flash storage, due to pretty fundamental
> differences between HDD and SSD systems. But the discussion concluded with
> an agreement to continue investigating this, so here's an attempt to support
> the claim with some measurements/data.
>
> Let me present results of low-level fio benchmarks on a couple different HDD
> and SSD drives. This should eliminate any postgres-related influence (e.g.
> FPW), and demonstrates inherent HDD/SSD differences.
> All the SSD results show this behavior - the Optane and Samsung nicely show
> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make
> it worse.
> [..]

Can you share which Linux kernel version, which filesystem, its mount options, and the LVM setup (if any) you were using? I've hastily tried your script on a 4 VCPU / 32 GB RAM / 1x NVMe device @ ~900GB box (AWS i3.xlarge): kernel 5.x, ext4 defaults, no LVM, libaio only; fio deviations: runtime -> 1 min, 64 GB file, 1 iteration only. Results are attached, w/o graphs.

> Now, compare this to the SSD. There are some differences between the models,
> manufacturers, interface etc. but the impact of page size on IOPS is pretty
> clear. On the Optane you can get +20-30% by using 4K pages, on the Samsung
> it's even more, etc. This means that workloads dominated by random I/O get
> significant benefit from smaller pages.

Yup, same here, reproduced, 1.42x faster on writes:

[root@x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
[root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}' | sort -n
1k/1.txt: bw=24162KB/s, iops=24161,
2k/1.txt: bw=47164KB/s, iops=23582,
4k/1.txt: bw=280450KB/s, iops=70112, <<<
8k/1.txt: bw=393082KB/s, iops=49135,
16k/1.txt: bw=393103KB/s, iops=24568,
32k/1.txt: bw=393283KB/s, iops=12290,

BTW it's interesting to compare to your Optane 900P result (same two high bars for IOPS @ 4/8kB), but in my case it's even more important to select 4kB, so it behaves more like the Samsung 860 in your case.

# 1.41x on randreads
[root@x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
[root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
1k/1.txt: bw=169938KB/s, iops=169937,
2k/1.txt: bw=376653KB/s, iops=188326,
4k/1.txt: bw=691529KB/s, iops=172882, <<<
8k/1.txt: bw=976916KB/s, iops=122114,
16k/1.txt: bw=990524KB/s, iops=61907,
32k/1.txt: bw=974318KB/s, iops=30447,

I think the above is just a demonstration of device bandwidth saturation: 32k * ~30k IOPS =~ 1 GB/s of random reads. Given that the DB would be tuned @ 4kB for the app (OLTP), the occasional Parallel Seq Scans for "critical reports" could then only achieve ~70% of what they would achieve on 8kB, correct? (I'm assuming most real systems are really OLTP, but with some reporting/data exporting needs.) One way or another it would be very nice to be able to select the tradeoff using initdb(1) without the need to recompile, which then begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency, DB page size, ...).
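Just to illustrate what I have in mind (a rough sketch only - the device path, file size and runtime are made-up placeholders, not a proposal for the actual tooling): such a calibration could boil down to a plain fio sweep over the candidate page sizes on the target mount, picking whatever maximizes random IOPS, e.g.:

for bs in 1k 2k 4k 8k 16k 32k; do
    echo "== bs=$bs =="
    # random-write sweep with O_DIRECT, same queue depth as above
    fio --name=calibrate --filename=/mnt/nvme/fio.calibrate \
        --rw=randwrite --bs=$bs --ioengine=libaio --iodepth=128 \
        --direct=1 --size=8G --runtime=60 --time_based \
        --group_reporting | grep -i iops
done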
Do you envision any plans for this, or are we still in need of gathering more info on exactly why this happens (perf reports?)? Also, did you guys discuss at that meeting any long-term future plans for the storage layer, by any chance? If we stick to 4kB pages across DB page size / hardware sector size, wouldn't it also be possible to win by disabling FPWs in the longer run using uring (assuming O_DIRECT | O_ATOMIC one day)? I recall that Thomas M. was researching O_ATOMIC; I think he wrote some of that up pretty nicely in [1].

[1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
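FWIW, on the hardware sector size point: checking what the NVMe device actually advertises is a quick sanity check (sketch; the device name is just the one from my i3.xlarge box):

cat /sys/block/nvme0n1/queue/logical_block_size   # sector size the kernel submits I/O in
cat /sys/block/nvme0n1/queue/physical_block_size  # sector size the media really uses
nvme id-ns /dev/nvme0n1 | grep -i lbaf            # other LBA formats the drive supports (needs nvme-cli)

Whether a 4kB write is then really atomic end-to-end (through the filesystem and across power failure) is of course exactly what the AtomicIO research in [1] is about.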