Re: pg_stat_io_histogram - Mailing list pgsql-hackers
| From | Jakub Wartak |
|---|---|
| Subject | Re: pg_stat_io_histogram |
| Date | |
| Msg-id | CAKZiRmw7-vkmYKo6Z64jUHpA-7Z71S_dM6GCV+X2jL_xL5t7Aw@mail.gmail.com |
| In response to | Re: pg_stat_io_histogram (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
| Responses | Re: pg_stat_io_histogram |
| List | pgsql-hackers |
On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
[..]
> > Of course most of the I/O calls today are hitting page cache, so one would
> > expect they'll be < 128us most of the time
>
> Have you measured whether overhead is measurable when hitting the page cache?
> I'd hope that it doesn't, due to io combining amortizing the cost somewhat. But
> it seems worth measuring.
Not yet; I first wanted to hear whether I'm sailing in some plainly stupid
direction with this idea or implementation (e.g. that
INSTR_TIME_GET_MICROSEC() was a really stupid omission on my side).
I'll try to perform this overhead measurement, hopefully with v3, once
we settle on how to do that bit shifting/clz().
[..]
Here's the answer: on a properly isolated perf test run (my old & legacy &
predictable 4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no turboboost,
proper warmup, no THP, cpupower D0, no physical I/O, ~22k pread64() calls/sec
combined, all served from the VFS cache), pinned to a single NUMA node via
numactl --membind=0 --cpunodebind=0 and measured using:
pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S

master + track_io_timing=on, 60s warmup and then 3x runs:
tps = 44615.603668
tps = 44556.191492
tps = 44813.793981
avg = 44662

master + track_io_timing=on + patch, 60s warmup and then 3x runs:
tps = 44441.879384
tps = 44403.101737
tps = 45036.747418
avg = 44627

So that's 44627/44662 ≈ 99.92% of master (literally no overhead), and it
yields a picture like this:
postgres=# select bucket_latency_us, bucket_count
from pg_stat_io_histogram
where
bucket_count > 0 and
backend_type = 'client backend' and
io_type = 'read' and
context = 'normal'
order by bucket_latency_us;
bucket_latency_us | bucket_count
-------------------+--------------
[0,9) | 273455
[8,17) | 3820379
[16,33) | 29359
[32,65) | 585
[64,129) | 467
[128,257) | 419
[256,513) | 15828
[512,1025) | 89
So one can also see that the 0.25..0.5ms bucket is larger there (the initial
reading of the data from the physical device); that's a hardware RAID-1 with
2x Intel D3-S4510. And if I do pg_buffercache_evict_all() + vm.drop_caches +
pg_stat_reset_shared(), I get this picture (note this one is for bulkreads):
postgres=# select bucket_latency_us, bucket_count
from pg_stat_io_histogram
where
bucket_count > 0 and
backend_type = 'client backend' and context='bulkread';
bucket_latency_us | bucket_count
-------------------+--------------
[0,9) | 102555
[8,17) | 70
[16,33) | 3938
[32,65) | 6763
[64,129) | 5206
[128,257) | 9392
[256,513) | 10959
[512,1025) | 21372
[1024,2049) | 502
[2048,4097) | 34
[4096,8193) | 2
[8192,16385) | 2
[16384,32769) | 7
So there's clearly a distinction between reading from the VFS cache and
hitting physical I/O. Now we probably just need to settle on the proper
get_bucket_index() implementation.
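
To make the clz() part concrete, below is a minimal standalone sketch of what
get_bucket_index() could look like. The 8us first boundary and the
power-of-two doubling are only inferred from the bucket labels above, and
NBUCKETS is a made-up cap, so treat it as an illustration of the technique
rather than what the patch actually does; in-tree one would presumably use
pg_leftmost_one_pos64() from port/pg_bitutils.h instead of the GCC/Clang
builtin.

/*
 * Sketch only: the 8us first boundary, the doubling buckets and NBUCKETS
 * are assumptions inferred from the sample output above, not the patch's
 * actual definitions.
 */
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 16                     /* hypothetical number of slots */

static inline int
get_bucket_index(uint64_t latency_us)
{
    int         idx;

    if (latency_us < 8)
        return 0;                       /* everything in [0,8) lands in bucket 0 */

    /*
     * floor(log2(latency_us)) via count-leading-zeros, rebased so that
     * [8,16) -> 1, [16,32) -> 2, and so on.
     */
    idx = (63 - __builtin_clzll(latency_us)) - 2;

    /* clamp anything huge into the last bucket */
    return idx < NBUCKETS - 1 ? idx : NBUCKETS - 1;
}

int
main(void)
{
    /* spot checks against the bucket labels shown above */
    printf("%d %d %d %d\n",
           get_bucket_index(3),         /* 0  -> [0,8)           */
           get_bucket_index(8),         /* 1  -> [8,16)          */
           get_bucket_index(300),       /* 6  -> [256,512)       */
           get_bucket_index(20000));    /* 12 -> [16384,32768)   */
    return 0;
}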
-J.