RE: pgcon unconference / impact of block size on performance - Mailing list pgsql-hackers

From	Jakub Wartak
Subject	RE: pgcon unconference / impact of block size on performance
Date	June 7, 2022 09:46:51
Msg-id	PR3PR07MB8243877BA3B42D37641EDD58F6A59@PR3PR07MB8243.eurprd07.prod.outlook.com
In response to	Re: pgcon unconference / impact of block size on performance (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses	Re: pgcon unconference / impact of block size on performance
List	pgsql-hackers

Hi Tomas,

> Well, there's plenty of charts in the github repositories, including the charts I
> think you're asking for:

Thanks.

> I also wonder how is this related to filesystem page size - in all the benchmarks I
> did I used the default (4k), but maybe it'd behave if the filesystem page matched
> the data page.

That may be it - using fio on raw NVMe device (without fs/VFS at all) shows:

[root@x libaio-raw]# grep -r -e 'write:' -e 'read :' *
nvme/randread/128/1k/1.txt:  read : io=7721.9MB, bw=131783KB/s, iops=131783, runt= 60001msec [b]
nvme/randread/128/2k/1.txt:  read : io=15468MB, bw=263991KB/s, iops=131995, runt= 60001msec [b] 
nvme/randread/128/4k/1.txt:  read : io=30142MB, bw=514408KB/s, iops=128602, runt= 60001msec [b]
nvme/randread/128/8k/1.txt:  read : io=56698MB, bw=967635KB/s, iops=120954, runt= 60001msec
nvme/randwrite/128/1k/1.txt:  write: io=4140.9MB, bw=70242KB/s, iops=70241, runt= 60366msec [a]
nvme/randwrite/128/2k/1.txt:  write: io=8271.5MB, bw=141161KB/s, iops=70580, runt= 60002msec [a]
nvme/randwrite/128/4k/1.txt:  write: io=16543MB, bw=281164KB/s, iops=70291, runt= 60248msec
nvme/randwrite/128/8k/1.txt:  write: io=22924MB, bw=390930KB/s, iops=48866, runt= 60047msec
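
As a quick sanity check on the numbers above, the reported IOPS should roughly equal bandwidth divided by block size. The following is only a minimal Python sketch: it assumes that the fourth path component (1k, 2k, ...) is the fio block size, and the names RAW_OUTPUT, LINE_RE and parse_fio_lines are made up for this illustration, not part of fio.

```python
import re

# Summary lines copied from the raw-device grep output above
# (the trailing [a]/[b] annotations are dropped; the parser ignores them anyway).
RAW_OUTPUT = """\
nvme/randread/128/1k/1.txt:  read : io=7721.9MB, bw=131783KB/s, iops=131783, runt= 60001msec
nvme/randread/128/2k/1.txt:  read : io=15468MB, bw=263991KB/s, iops=131995, runt= 60001msec
nvme/randread/128/4k/1.txt:  read : io=30142MB, bw=514408KB/s, iops=128602, runt= 60001msec
nvme/randread/128/8k/1.txt:  read : io=56698MB, bw=967635KB/s, iops=120954, runt= 60001msec
nvme/randwrite/128/1k/1.txt:  write: io=4140.9MB, bw=70242KB/s, iops=70241, runt= 60366msec
nvme/randwrite/128/2k/1.txt:  write: io=8271.5MB, bw=141161KB/s, iops=70580, runt= 60002msec
nvme/randwrite/128/4k/1.txt:  write: io=16543MB, bw=281164KB/s, iops=70291, runt= 60248msec
nvme/randwrite/128/8k/1.txt:  write: io=22924MB, bw=390930KB/s, iops=48866, runt= 60047msec
"""

# Assumption: path layout is <device>/<workload>/<something>/<block size>k/<run>.txt
LINE_RE = re.compile(
    r"^[^/]+/(?P<rw>[^/]+)/[^/]+/(?P<bs>\d+)k/[^:]+:"
    r"\s+(?:read|write)\s*:\s+io=[^,]+,\s+bw=(?P<bw>\d+)KB/s,\s+iops=(?P<iops>\d+)"
)


def parse_fio_lines(text):
    """Return one dict per fio summary line that matches LINE_RE."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append({
                "rw": m["rw"],
                "bs_kb": int(m["bs"]),
                "bw_kbs": int(m["bw"]),
                "iops": int(m["iops"]),
            })
    return rows


rows = parse_fio_lines(RAW_OUTPUT)
for r in rows:
    derived = r["bw_kbs"] / r["bs_kb"]  # KB/s divided by KB per I/O = I/Os per second
    print(f'{r["rw"]:9s} {r["bs_kb"]}k  iops={r["iops"]:6d}  bw/bs={derived:9.1f}')
```

If the block-size assumption holds, the two columns agree to within rounding for every line, which also makes the flat ~70k randwrite IOPS at 1k/2k/4k easy to see.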

So, I've found out two interesting things while playing with raw vs ext4:
a) Without ext4 I always get ~70k IOPS for randwrite, even on 1k, 2k and 4k (so, as expected, this was the impact of the
ext4 4kB default fs block size that you were thinking about, when fio 1k was hitting the ext4 4kB block)
 
b) Another thing that you could also include in testing: I've spotted a couple of times that single-threaded fio might
be the limiting factor (numjobs=1 by default), so I've tried with numjobs=2,group_reporting=1 and got the output below
on ext4 defaults, even while dropping caches (echo 3) on each loop iteration -- something that I cannot explain (ext4
direct I/O caching effect? how's that even possible? reproduced several times even with numjobs=1). The point being
that 206643 1kB IOPS @ ext4 direct-io > 131783 1kB IOPS @ raw smells like some caching effect, because for randwrite it
does not happen. I've triple-checked with iostat -x... it cannot be any internal device cache, as with direct I/O that
doesn't happen:

[root@x libaio-ext4]# grep -r -e 'write:' -e 'read :' *
nvme/randread/128/1k/1.txt:  read : io=12108MB, bw=206644KB/s, iops=206643, runt= 60001msec [b]
nvme/randread/128/2k/1.txt:  read : io=18821MB, bw=321210KB/s, iops=160604, runt= 60001msec [b]
nvme/randread/128/4k/1.txt:  read : io=36985MB, bw=631208KB/s, iops=157802, runt= 60001msec [b]
nvme/randread/128/8k/1.txt:  read : io=57364MB, bw=976923KB/s, iops=122115, runt= 60128msec
nvme/randwrite/128/1k/1.txt:  write: io=1036.2MB, bw=17683KB/s, iops=17683, runt= 60001msec [a, as before]
nvme/randwrite/128/2k/1.txt:  write: io=2023.2MB, bw=34528KB/s, iops=17263, runt= 60001msec [a, as before]
nvme/randwrite/128/4k/1.txt:  write: io=16667MB, bw=282977KB/s, iops=70744, runt= 60311msec [reproduced benefit, as per earlier email]
nvme/randwrite/128/8k/1.txt:  write: io=22997MB, bw=391839KB/s, iops=48979, runt= 60099msec
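
To put the raw and ext4 runs side by side, a small Python sketch like the one below computes the ext4/raw IOPS ratio per workload and block size. The IOPS values are copied from the two grep outputs in this mail; the names RAW_IOPS, EXT4_IOPS and ext4_vs_raw are made up for this illustration.

```python
# (workload, block size in kB) -> IOPS, copied from the raw-device output
RAW_IOPS = {
    ("randread", 1): 131783, ("randread", 2): 131995,
    ("randread", 4): 128602, ("randread", 8): 120954,
    ("randwrite", 1): 70241, ("randwrite", 2): 70580,
    ("randwrite", 4): 70291, ("randwrite", 8): 48866,
}

# Same keys, copied from the ext4 output
EXT4_IOPS = {
    ("randread", 1): 206643, ("randread", 2): 160604,
    ("randread", 4): 157802, ("randread", 8): 122115,
    ("randwrite", 1): 17683, ("randwrite", 2): 17263,
    ("randwrite", 4): 70744, ("randwrite", 8): 48979,
}


def ext4_vs_raw(raw, ext4):
    """Return ext4 IOPS divided by raw IOPS for every key present in both."""
    return {key: ext4[key] / raw[key] for key in raw if key in ext4}


ratios = ext4_vs_raw(RAW_IOPS, EXT4_IOPS)
for (rw, bs_kb), ratio in sorted(ratios.items()):
    print(f"{rw:9s} {bs_kb}k  ext4/raw = {ratio:.2f}")
```

A ratio well below 1 for randwrite at 1k and 2k matches point a), a ratio near 1 at 4k and 8k shows ext4 keeping up with the raw device, and a ratio above 1 for small randread is the unexplained effect from point b).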

> > One way or another it would be very nice to be able to select the
> > tradeoff using initdb(1) without the need to recompile, which then
> > begs for some initdb --calibrate /mnt/nvme (effective_io_concurrency,
> > DB page size, ...). Do you envision any plans for this, or are we still
> > in need of gathering more info on exactly why this happens? (perf reports?)
> >
> 
> Not sure I follow. Plans for what? Something that calibrates cost parameters?
> That might be useful, but that's a rather separate issue from what's discussed
> here - page size, which needs to happen before initdb (at least with how things
> work currently).
[..]

Sorry, I was too far-fetched and assumed you guys were talking very long term.

-J.
