RE: pgcon unconference / impact of block size on performance - Mailing list pgsql-hackers
From: Jakub Wartak
Subject: RE: pgcon unconference / impact of block size on performance
Date:
Msg-id: PR3PR07MB824303DFFA61592568BE5BA1F6A49@PR3PR07MB8243.eurprd07.prod.outlook.com
In response to: Re: pgcon unconference / impact of block size on performance (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: pgcon unconference / impact of block size on performance
List: pgsql-hackers
Hi, got some answers!

TL;DR: for fio it would make sense to use many stress files (instead of 1), and likewise numjobs ~ VCPUs, to avoid various pitfalls.

> >> The really puzzling thing is why is the filesystem so much slower for
> >> smaller pages. I mean, why would writing 1K be 1/3 of writing 4K?
> >> Why would a filesystem have such effect?
> >
> > Ha! I don't care at this point as 1 or 2kB seems too small to handle
> > many real world scenarios ;) [..]
>
> Independently of that, it seems like an interesting behavior and it might
> tell us something about how to optimize for larger pages.

OK, curiosity won:

- With randwrite on ext4 directio using 4kB, the avgqu-sz reaches ~90-100 (close to fio's 128 queue depth?) and I'm getting ~70k IOPS [with maxdepth=128].
- With randwrite on ext4 directio using 1kB, the avgqu-sz is just 0.7 and I'm getting just ~17-22k IOPS [with maxdepth=128]. -> Conclusion: something is being locked, preventing the queue from building up.
- With randwrite on ext4 directio using 4kB, the avgqu-sz reaches ~2.3 (so something is queued) and I'm also getting ~70k IOPS with the minimal possible maxdepth=4. -> Conclusion: I just need to split the lock contention by 4.

The 1kB (slow) profile's top function is aio_write() -> .... -> iov_iter_get_pages() -> internal_get_user_pages_fast(), and there's sadly plenty of "lock" keywords inside {related to the memory manager, padding to full page size, inode locking}. One can also find some articles / commits related to it [1], which didn't make me feel good to be honest, as fio was using just 1 file (even while I'm on kernel 5.10.x). So I've switched to 4 files and numjobs=4 and easily got 60k IOPS; contention solved, whatever it was :) So I would assume PostgreSQL (with its splitting of data files on 1GB boundaries by default and its multiprocess architecture) should be relatively safe from such ext4 inode(?)/mm(?) contention even with the smallest 1kB block sizes on Direct I/O some day.
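The TL;DR above can be sketched as a fio job file; this is a sketch only, with the directory, sizes, iodepth and the 4-job count being illustrative assumptions for a 4-VCPU box, not the exact file I ran:

```ini
# Sketch: several stress files and numjobs ~ VCPUs, so that no single
# ext4 inode serializes the writes. With directory= set and no filename=,
# fio creates a separate data file per job.
[global]
directory=/mnt/nvme/fio
direct=1
ioengine=libaio
runtime=20
group_reporting=1

[job]
rw=randwrite
bs=1k
iodepth=32
size=16GB
numjobs=4
```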
[1] - https://www.phoronix.com/scan.php?page=news_item&px=EXT4-DIO-Faster-DBs

> > Both scenarios (raw and fs) have had direct=1 set. I just cannot understand
> > how having direct I/O enabled (which disables caching) achieves better read
> > IOPS on ext4 than on raw device... isn't it a contradiction?
>
> Thanks for the clarification. Not sure what might be causing this. Did you
> use the same parameters (e.g. iodepth) in both cases?

Explanation: it's the CPU scheduler migrations mixing up the performance results during the fio runs (as you have in your framework). Various VCPUs seem to have varying max IOPS characteristics (sic!) and the CPU scheduler seems to be unaware of it. At least on 1kB and 4kB block sizes this happens; also notice that some VCPUs [XXXX marker] don't reach 100% CPU yet achieve almost twice the result, while cores 0 and 3 do reach 100% and lack the CPU power to perform more. The only thing that I don't get is that it doesn't make sense from the extended lscpu output (but maybe it's AWS Xen mixing up the real CPU mappings, who knows).

[root@x ~]# for((x=0; x<=3; x++)) ; do echo "$x:"; taskset -c $x fio fio.ext4 | grep -e 'read :' -e 'cpu '; done
0:
  read : io=2416.8MB, bw=123730KB/s, iops=123730, runt= 20001msec
  cpu  : usr=42.98%, sys=56.52%, ctx=2317, majf=0, minf=41
[XXXX: 100% cpu bottleneck and just 123k IOPS]
1:
  read : io=4077.9MB, bw=208774KB/s, iops=208773, runt= 20001msec
  cpu  : usr=29.47%, sys=51.43%, ctx=2993, majf=0, minf=42
[XXXX, some idle power and 208k IOPS just by switching to core1...]
2:
  read : io=4036.7MB, bw=206636KB/s, iops=206636, runt= 20001msec
  cpu  : usr=31.00%, sys=52.41%, ctx=2815, majf=0, minf=42
[XXXX]
3:
  read : io=2398.4MB, bw=122791KB/s, iops=122791, runt= 20001msec
  cpu  : usr=44.20%, sys=55.20%, ctx=2522, majf=0, minf=41

[root@x ~]# for((x=0; x<=3; x++)) ; do echo "$x:"; taskset -c $x fio fio.raw | grep -e 'read :' -e 'cpu '; done
0:
  read : io=2512.3MB, bw=128621KB/s, iops=128620, runt= 20001msec
  cpu  : usr=47.62%, sys=51.58%, ctx=2365, majf=0, minf=42
1:
  read : io=4070.2MB, bw=206748KB/s, iops=206748, runt= 20159msec
  cpu  : usr=29.52%, sys=42.86%, ctx=2808, majf=0, minf=42
[XXXX]
2:
  read : io=4101.3MB, bw=209975KB/s, iops=209975, runt= 20001msec
  cpu  : usr=28.05%, sys=45.09%, ctx=3419, majf=0, minf=42
[XXXX]
3:
  read : io=2519.4MB, bw=128985KB/s, iops=128985, runt= 20001msec
  cpu  : usr=46.59%, sys=52.70%, ctx=2371, majf=0, minf=41

[root@x ~]# lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       yes    3000.0000 1200.0000
1   0    0      1    1:1:1:0       yes    3000.0000 1200.0000
2   0    0      0    0:0:0:0       yes    3000.0000 1200.0000
3   0    0      1    1:1:1:0       yes    3000.0000 1200.0000

[root@x ~]# lscpu | grep -e ^Model -e ^NUMA -e ^Hyper
NUMA node(s):        1
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Hypervisor vendor:   Xen
NUMA node0 CPU(s):   0-3

[root@x ~]# diff -u fio.raw fio.ext4
--- fio.raw     2022-06-08 12:32:26.603482453 +0000
+++ fio.ext4    2022-06-08 12:32:36.071621708 +0000
@@ -1,5 +1,5 @@
 [global]
-filename=/dev/nvme0n1
+filename=/mnt/nvme/fio/data.file
 size=256GB
 direct=1
 ioengine=libaio

[root@x ~]# cat fio.raw
[global]
filename=/dev/nvme0n1
size=256GB
direct=1
ioengine=libaio
runtime=20
numjobs=1
group_reporting=1
[job]
rw=randread
iodepth=128
bs=1k
size=64GB
[root@x ~]#

-J.
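As a closing note, the avgqu-sz figures quoted earlier are consistent with Little's law (in-flight I/Os ≈ IOPS × average latency). A small sketch, where the latencies are back-derived from the reported IOPS and queue depths rather than measured:

```python
# Little's law: average number of in-flight I/Os ~= IOPS * average latency (s).
# The latencies below are back-derived from the avgqu-sz and IOPS figures
# quoted in the mail above; they are illustrative, not measured values.

def avg_queue_depth(iops: float, avg_latency_s: float) -> float:
    """Estimate the average queue depth (avgqu-sz) from throughput and latency."""
    return iops * avg_latency_s

# 4kB randwrite: ~70k IOPS at ~1.3 ms per I/O gives a queue depth around 90,
# close to the avgqu-sz ~90-100 seen with maxdepth=128.
print(avg_queue_depth(70_000, 0.0013))

# 1kB randwrite: ~20k IOPS with avgqu-sz ~0.7 implies ~35 us per I/O, i.e.
# the requests are effectively serialized by the lock instead of queuing up.
print(avg_queue_depth(20_000, 0.000035))
```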