From: Jakub Wartak
Subject: RE: pgcon unconference / impact of block size on performance
Date:
Msg-id: AS8PR07MB8249D56663CE82B5826D64B2F6A29@AS8PR07MB8249.eurprd07.prod.outlook.com
In response to: pgcon unconference / impact of block size on performance (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
Hi Tomas,

> Hi,
>
> At one of the pgcon unconference sessions a couple of days ago, I presented a
> bunch of benchmark results comparing performance with different data/WAL
> block size. Most of the OLTP results showed significant gains (up to 50%) with
> smaller (4k) data pages.

Nice. I just saw https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference. Do you have any plans for publishing those other graphs too (e.g. WAL block size impact)?

> This opened a long discussion about possible explanations - I claimed one of the
> main factors is the adoption of flash storage, due to pretty fundamental
> differences between HDD and SSD systems. But the discussion concluded with an
> agreement to continue investigating this, so here's an attempt to support the
> claim with some measurements/data.
>
> Let me present results of low-level fio benchmarks on a couple different HDD
> and SSD drives. This should eliminate any postgres-related influence (e.g. FPW),
> and demonstrates inherent HDD/SSD differences.
> All the SSD results show this behavior - the Optane and Samsung nicely show
> that 4K is much better (in random write IOPS) than 8K, but 1-2K pages make it
> worse.
>
[..]
Can you share what Linux kernel version, what filesystem and its mount options, and what LVM setup (if any) you were using?
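
Something along these lines would be enough to capture it (the mount point and device below are just examples from my own box):

[root@x ~]# uname -r                                      # kernel version
[root@x ~]# findmnt -no SOURCE,FSTYPE,OPTIONS /mnt/nvme   # filesystem + mount options
[root@x ~]# lsblk -o NAME,TYPE,SIZE /dev/nvme0n1          # shows any LVM/dm layers in the stack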

I've hastily tried your script on a 4 vCPU / 32GB RAM / 1x NVMe device @ ~900GB (AWS i3.xlarge), kernel 5.x, ext4 defaults, no LVM, libaio only, with these fio deviations: runtime -> 1min, 64GB file, 1 iteration only. Results are attached, w/o graphs.
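
For reference, per block size each run boiled down to roughly the following invocation (my own reconstruction to show the deviations above, not your exact script; the file path is just an example):

[root@x ~]# fio --name=randwrite --filename=/mnt/nvme/fio.data --size=64G \
      --direct=1 --ioengine=libaio --iodepth=128 --rw=randwrite --bs=4k \
      --runtime=60 --time_based --group_reporting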

> Now, compare this to the SSD. There are some differences between the models, manufacturers, interface etc. but the
> impact of page size on IOPS is pretty clear. On the Optane you can get +20-30% by using 4K pages, on the Samsung it's
> even more, etc. This means that workloads dominated by random I/O get significant benefit from smaller pages.

Yup, same here, reproduced, 1.42x faster on writes:
[root@x ~]# cd libaio/nvme/randwrite/128/ # 128=queue depth
[root@x 128]# grep -r "write:" * | awk '{print $1, $4, $5}'  | sort -n
1k/1.txt: bw=24162KB/s, iops=24161,
2k/1.txt: bw=47164KB/s, iops=23582,
4k/1.txt: bw=280450KB/s, iops=70112, <<<
8k/1.txt: bw=393082KB/s, iops=49135,
16k/1.txt: bw=393103KB/s, iops=24568,
32k/1.txt: bw=393283KB/s, iops=12290,
BTW it's interesting to compare to your Optane 900P result (same two high bars for IOPS @ 4/8kB), but in my case it's even more important to select 4kB, so it behaves more like the Samsung 860 in your case.

# 1.41x on randreads
[root@x ~]# cd libaio/nvme/randread/128/ # 128=queue depth
[root@x 128]# grep -r "read :" | awk '{print $1, $5, $6}' | sort -n
1k/1.txt: bw=169938KB/s, iops=169937,
2k/1.txt: bw=376653KB/s, iops=188326,
4k/1.txt: bw=691529KB/s, iops=172882, <<<
8k/1.txt: bw=976916KB/s, iops=122114,
16k/1.txt: bw=990524KB/s, iops=61907,
32k/1.txt: bw=974318KB/s, iops=30447,

I think the above is just a demonstration of device bandwidth saturation: 32k * 30k IOPS =~ 1GB/s of random reads. Given that the DB would be tuned @ 4kB for the app (OLTP), once in a while Parallel Seq Scan "critical reports" could only achieve ~70% of what they could achieve on 8kB, correct? (I'm assuming most real systems are really OLTP but with some reporting/data exporting needs.) One way or another it would be very nice to be able to select the tradeoff using initdb(1) without the need to recompile, which then begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency, DB page size, ...).
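
To spell out that back-of-the-envelope check (numbers taken from the 32k randread row above):

[root@x 128]# echo $(( 32 * 30447 ))    # bs in KiB * IOPS
974304                                  # KiB/s, i.e. ~1 GB/s, matching the bw= column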

Do you envision any plans for this, or are we still in need of gathering more info on exactly why this happens (perf reports)?

Also, have you guys discussed at that meeting any long-term future plans for the storage layer, by any chance? If sticking to 4kB pages for the DB page size / hardware sector size, wouldn't it also be possible to win by disabling FPWs in the longer run using uring (assuming O_DIRECT | O_ATOMIC one day)?
I recall that Thomas M. was researching O_ATOMIC; I think he wrote some of that up pretty nicely in [1].

[1] - https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
