Re: pgcon unconference / impact of block size on performance - Mailing list pgsql-hackers
From | Tomas Vondra
---|---
Subject | Re: pgcon unconference / impact of block size on performance
Date |
Msg-id | 62160038-cf65-72a6-4738-343454d72e87@enterprisedb.com
In response to | RE: pgcon unconference / impact of block size on performance (Jakub Wartak <Jakub.Wartak@tomtom.com>)
Responses | Re: pgcon unconference / impact of block size on performance; RE: pgcon unconference / impact of block size on performance
List | pgsql-hackers
On 6/6/22 16:27, Jakub Wartak wrote:
> Hi Tomas,
>
>> Hi,
>>
>> At one of the pgcon unconference sessions a couple days ago, I presented
>> a bunch of benchmark results comparing performance with different
>> data/WAL block size. Most of the OLTP results showed significant gains
>> (up to 50%) with smaller (4k) data pages.
>
> Nice. I just saw this
> https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference ,
> do you have any plans for publishing those other graphs too (e.g. WAL
> block size impact)?
>

Well, there's plenty of charts in the github repositories, including the
charts I think you're asking for:

https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/xeon/20220406-fpw/16/heatmap-tps.png

https://github.com/tvondra/pg-block-bench-pgbench/blob/master/process/heatmaps/i5/20220427-fpw/16/heatmap-io-tps.png

I admit the charts may not be documented very clearly :-(

>> This opened a long discussion about possible explanations - I claimed
>> one of the main factors is the adoption of flash storage, due to pretty
>> fundamental differences between HDD and SSD systems. But the discussion
>> concluded with an agreement to continue investigating this, so here's
>> an attempt to support the claim with some measurements/data.
>>
>> Let me present results of low-level fio benchmarks on a couple different
>> HDD and SSD drives. This should eliminate any postgres-related influence
>> (e.g. FPW), and demonstrates inherent HDD/SSD differences.
>>
>> All the SSD results show this behavior - the Optane and Samsung nicely
>> show that 4K is much better (in random write IOPS) than 8K, but 1-2K
>> pages make it worse.
>>
> [..]
>
> Can you share what Linux kernel version, what filesystem, its mount
> options and LVM setup you were using, if any?
>

The PostgreSQL benchmarks were with 5.14.x kernels, with either ext4 or
xfs filesystems. The i5 machine uses a software RAID0 (md) across the 6x
SATA SSD devices, with this config:

bench ~ # mdadm --detail /dev/md0
/dev/md0:
           Version : 0.90
     Creation Time : Thu Feb 8 15:05:49 2018
        Raid Level : raid0
        Array Size : 586106880 (558.96 GiB 600.17 GB)
      Raid Devices : 6
     Total Devices : 6
   Preferred Minor : 0
       Persistence : Superblock is persistent

       Update Time : Thu Feb 8 15:05:49 2018
             State : clean
    Active Devices : 6
   Working Devices : 6
    Failed Devices : 0
     Spare Devices : 0

        Chunk Size : 512K

Consistency Policy : none

              UUID : 24c6158c:36454b38:529cc8e5:b4b9cc9d (local to host bench)
            Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1
       4       8       65        4      active sync   /dev/sde1
       5       8       81        5      active sync   /dev/sdf1

bench ~ # mount | grep md0
/dev/md0 on /mnt/raid type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=16,swidth=96,noquota)

and the xeon just uses ext4 on the device directly:

/dev/nvme0n1p1 on /mnt/data type ext4 (rw,relatime)
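FWIW each of those fio data points essentially boils down to a single fio
run for a given block size, roughly like this (just a simplified sketch -
the actual runs are driven by the benchmarking script, so the exact
parameters and file layout differ a bit):

  # sweep random-write IOPS over block sizes (illustrative parameters)
  for bs in 1k 2k 4k 8k 16k 32k; do
    fio --name=randwrite-$bs --filename=/mnt/data/fio.data --size=64g \
        --rw=randwrite --bs=$bs --ioengine=libaio --iodepth=128 \
        --direct=1 --time_based --runtime=60 --group_reporting \
        > randwrite-$bs.txt
  done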
> I've hastily tried your script on 4VCPU/32GB RAM/1xNVMe device @ ~900GB
> (AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio only, fio
> deviations: runtime -> 1min, 64GB file, 1 iteration only. Results are
> attached, w/o graphs.
>
>> Now, compare this to the SSD. There are some differences between the
>> models, manufacturers, interface etc. but the impact of page size on
>> IOPS is pretty clear. On the Optane you can get +20-30% by using 4K
>> pages, on the Samsung it's even more, etc. This means that workloads
>> dominated by random I/O get significant benefit from smaller pages.
>
> Yup, same here, reproduced, 1.42x faster on writes:
>
> [root@x ~]# cd libaio/nvme/randwrite/128/   # 128=queue depth
> [root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}' | sort -n
> 1k/1.txt:  bw=24162KB/s,  iops=24161,
> 2k/1.txt:  bw=47164KB/s,  iops=23582,
> 4k/1.txt:  bw=280450KB/s, iops=70112,  <<<
> 8k/1.txt:  bw=393082KB/s, iops=49135,
> 16k/1.txt: bw=393103KB/s, iops=24568,
> 32k/1.txt: bw=393283KB/s, iops=12290,
>
> BTW it's interesting to compare to your Optane 900P result (same two
> high bars for IOPS @ 4,8kB), but in my case it's even more important
> to select 4kB, so it behaves more like the Samsung 860 in your case.
>

Thanks. Interesting!

> # 1.41x on randreads
> [root@x ~]# cd libaio/nvme/randread/128/   # 128=queue depth
> [root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
> 1k/1.txt:  bw=169938KB/s, iops=169937,
> 2k/1.txt:  bw=376653KB/s, iops=188326,
> 4k/1.txt:  bw=691529KB/s, iops=172882,  <<<
> 8k/1.txt:  bw=976916KB/s, iops=122114,
> 16k/1.txt: bw=990524KB/s, iops=61907,
> 32k/1.txt: bw=974318KB/s, iops=30447,
>
> I think the above is just a demonstration of device bandwidth
> saturation: 32k * 30k IOPS =~ 1GB/s random reads. Given that the DB
> would be tuned @ 4kB for the app (OLTP), the occasional Parallel Seq
> Scans ("critical reports") would then only achieve ~70% of what they
> could achieve on 8kB, correct? (I'm assuming most real systems are
> really OLTP but with some reporting/data exporting needs.)
>

Right, that's roughly my thinking too. Also, OLAP queries often do a lot
of random I/O, due to index scans etc.

I also wonder how this is related to the filesystem page size - in all
the benchmarks I did I used the default (4k), but maybe it'd behave
differently if the filesystem page matched the data page.

> One way or another it would be very nice to be able to select the
> tradeoff using initdb(1) without the need to recompile, which then begs
> for some initdb --calibrate /mnt/nvme (effective_io_concurrency, DB page
> size, ...).
>
> Do you envision any plans for this, or do we still need to gather more
> info on exactly why this happens? (perf reports?)
>

Not sure I follow. Plans for what? Something that calibrates cost
parameters? That might be useful, but that's a rather separate issue from
what's discussed here - page size, which needs to happen before initdb
(at least with how things work currently).

The other issue (e.g. with effective_io_concurrency) is that it very much
depends on the access pattern - random pages and sequential pages will
require very different e_i_c values. But again, that's something to
discuss in a separate thread (e.g. [1]).

[1]: https://postgr.es/m/Yl92RVoXVfs+z2Yj@momjian.us

> Also, have you guys discussed at that meeting any long-term future plans
> for the storage layer, by any chance? If sticking to 4kB for the DB page
> size / hardware sector size, wouldn't it also be possible to win by
> disabling FPWs in the longer run, using io_uring (assuming O_DIRECT |
> O_ATOMIC one day)?
>
> I recall that Thomas M. was researching O_ATOMIC, I think he wrote some
> of that pretty nicely in [1]
>
> [1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO

No, no such discussion - at least not in this unconference slot.
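FWIW, if anyone wants to experiment with non-default page sizes in the
meantime, both the data and WAL page sizes are compile-time options, so
it means building custom binaries - roughly like this (just a sketch,
sizes are in kB, paths are entirely up to you):

  ./configure --with-blocksize=4 --with-wal-blocksize=4 --prefix=/opt/pg-4k
  make && make install
  /opt/pg-4k/bin/initdb -D /opt/pg-4k/data   # cluster built with 4kB pages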
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company