Re: pg_stat_io_histogram - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: pg_stat_io_histogram
Msg-id CAKZiRmxNVMOV7x5c_Amqk=2mmYJOqsfHgE8N8O9jGjgfBYa8kQ@mail.gmail.com
In response to Re: pg_stat_io_histogram  (Andres Freund <andres@anarazel.de>)
On Mon, Jan 26, 2026 at 4:08 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote:
> > I'm proposing that we add pg_stat_io_histogram that would track/show I/O
> > latencies profile, so we could quickly identify I/O outliers.
>
> I think that's an interesting feature!

Hi Andres, thanks for the kind words and review!

[..]
>
> For measuring particularly stuck things, I've been wondering about having a
> regular timer that starts to collect more information if stuck in a place for
> a while. That would probably end up being lower overhead than constantly
> measuring... But it would also be a lot more work.

Well, if something is really stuck, I think the wait events already cover us
there, don't they? One can argue whether they carry enough information (for me
they mostly do, but I'm trying to squeeze some more stuff into them in a nearby
thread [1]; BTW it's kind of "blocked" on that 56-bit relfilenode idea/question,
any thoughts on that?)

> > Right now the patch does not include per-backend/PID tracking, hopefully if
> > there will be interest in this, I'll add it, but I would like to first hear
> > if that's a good idea. The current implementation uses fast bucket
> > calculation to avoid overheads and tries to cover most useful range of
> > devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and
> > abnormally high latency too as from time to time I'm try to help with I/O
> > stuck for *seconds*, usually a sign of some I/O multipath issues, device
> > resetting, or hypervisor woes).
>
> Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
> NVMe disk you'll get < 10us, after all.

I was blind and concentrated way too much on badly-behaving I/O rather than
good I/O - let's call it I/O negativity bias 8)

v2 now has the minimum bucket lowered to 8us (but the max is then just ~131ms;
I didn't want it to use more than 64 bytes total, as 16*4B (uint32) = 64B while
16*8B (uint64) would already be 128B, so it's capped at 131072us right now).
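
To illustrate what that covers, here is a throwaway snippet I wrote for this
email (not code from the patch): it just prints the ranges that 16 doubling
buckets with an 8us floor would give.

    #include <stdio.h>
    #include <stdint.h>

    #define NBUCKETS   16
    #define MIN_LAT_US 8

    int
    main(void)
    {
        uint64_t    lower = 0;
        uint64_t    upper = MIN_LAT_US;

        for (int i = 0; i < NBUCKETS; i++)
        {
            if (i < NBUCKETS - 1)
                printf("bucket %2d: [%6llu, %6llu) us\n", i,
                       (unsigned long long) lower, (unsigned long long) upper);
            else
                printf("bucket %2d: [%6llu, inf) us\n", i,
                       (unsigned long long) lower);
            lower = upper;
            upper *= 2;
        }
        return 0;
    }

So the last "real" boundary is 131072us and anything slower just piles up in
the final bucket.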

> I see a few problems with the source of the latency measurements though:
>
> - The latency gathered is that it's quite heavily affected by scheduler
>   noise. If your process isn't scheduled because other processes are busy
>   doing stuff on the CPU, it's quite possible to get results many orders of
>   magnitude wrong.
>
> - With AIO, you're measuring wait time, and that time can be affected by other
>   IOs in the queue. That will often *drastically* overestimate IO latency
>   measured this way.
>
> I don't see how we can do better absent cooperation from the kernel (by
> putting lower-level measurements into io_uring completions, for example)
> though.  So maybe this is just how it has to be and we ought to just document
> it.

Right, I think this is a complex topic on its own; I've added a small section
to the docs. I didn't want to start the thread by undermining my own results,
but indeed I'm getting "bad" numbers - bad in the sense that the perceived
latency numbers do not match other observations:

E.g. the checkpointer's fsync/fdatasync latency is awful, even though I've been
using this to simulate latency (just 2ms!, but it ended up adding way more):
    dd if=/dev/zero of=/fs bs=1M count=1024
    losetup /dev/loop15 /fs
    echo "0 $(blockdev --getsz /dev/loop15) delay /dev/loop15 0 2" | \
        dmsetup create delay
    mkfs.ext4 /dev/mapper/delay
    mount /dev/mapper/delay /mnt

... and it has some quite interesting effects, e.g.:

- without "noatime", the impact is clearly visible on fsync

- even with noatime I was getting spikes of latencies above 131ms
  (sic!) with this:
    select pg_stat_reset_shared();
    pgbench -i -s 10 -p 1234 -h /tmp postgres
    checkpoint;

- I've created the attached bpftrace script to see the gap between the kernel
  and the uprobes, but it's not that high; a sample of the view:

     backend  |  object  | context |  io_type  | lat_us | lat_ms  | count
    ----------+----------+---------+-----------+--------+---------+-------
     checkpoi | relation | normal  | fsync     |  32768 |  32.768 |    42
     checkpoi | relation | normal  | fsync     |  65536 |  65.536 |     3
     checkpoi | relation | normal  | fsync     | 262144 | 262.144 |     1

  vs eBPF, which does not seem to see that; the worst case seems to be ~20us
  wasted (the gap is between user and kernel):

    [fdatasync] PID 54197  | Kernel: 12943  us | User: 12964  us | Lag: 21us
    [..but usually it's just:]
    [    fsync] PID 52266  | Kernel: 1711   us | User: 1714   us | Lag: 3us
    [    fsync] PID 52266  | Kernel: 19913  us | User: 19916  us | Lag: 3us
    [    fsync] PID 52266  | Kernel: 1993   us | User: 1996   us | Lag: 3us
    [    fsync] PID 52266  | Kernel: 1994   us | User: 1995   us | Lag: 1us
    [    fsync] PID 52266  | Kernel: 53734  us | User: 53736  us | Lag: 2us
    [    fsync] PID 52266  | Kernel: 8066   us | User: 8072   us | Lag: 6us
    [    fsync] PID 52266  | Kernel: 2107   us | User: 2109   us | Lag: 2us
    [    fsync] PID 52266  | Kernel: 1972   us | User: 1974   us | Lag: 2us
   (this is on the 2ms-delayed + noatime fs, with CONFIG_HZ=1000, on a laptop
   NVMe that's otherwise idle).

- in mdsyncfiletag() we seem to call pgstat_count_io_op_time() *after* a
  potential FileClose(), but I haven't witnessed a long close(); it's still
  just a context switch

- I've spotted that power management might be influencing it even further
  (that's not in the docs yet; I don't know whether I should add it there as
  the next item on the list)

> >  backend  |  object  |  context  |  io_type  | lat_us | lat_ms | count
> > ----------+----------+-----------+-----------+--------+--------+-------
> >  autovacu | relation | normal    | read      |    128 |  0.128 |    54
>
> Perhaps the latency should be represented as a range?

Cool idea, I hadn't even thought about this one! As of v2 it shows:

postgres=# select
    substring(backend_type,1,8) as btype,
    object, context, io_type, bucket_latency_us as lat_us, bucket_count as cnt
from pg_stat_get_io_histogram()
where
    bucket_count > 0
order by 1,2,3,4,5 ;
  btype   |  object  | context |  io_type  |   lat_us    | cnt
----------+----------+---------+-----------+-------------+-----
[..]
 checkpoi | relation | normal  | write     | [0,9)       |  33
 checkpoi | relation | normal  | write     | [8,17)      |   8
 checkpoi | relation | normal  | write     | [16,33)     |   1


> > Of course most of the I/O calls today are hitting page cache, so one would
> > expect they'll be < 128us most of the time
>
> Have you measured whether overhead is measurable when hitting the page cache?
> I'd hope that it doesn't, due to io combing amortizing the cost somewhat. But
> it seems worth measuring.

Not yet; I first wanted to hear whether I'm sailing in some plainly stupid
direction with the idea or implementation (e.g. that using
INSTR_TIME_GET_MICROSEC() was a really stupid oversight on my side).

I'll try to perform that overhead measurement, hopefully with v3, once we
settle on how to do the bit shifting/clz().

> I assume you made pgstat_get_io_op_name() return "hit?" because you don't
> expect that to ever be hit?

Yes - with my patch the count for "hit?" always seems to be 0. I'll need to
investigate that further.

> > +static inline int get_bucket_index(uint32_t val) {
> > +#define MIN_PG_STAT_IO_HIST_LATENCY 127
> > +     const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1;
> > +     /*
> > +      * hopefully calculated to be 25 by the compiler:
> > +      * clz(127) = clz(01111111b on uint32) = 25
> > +      */
> > +     const uint32_t min_latency_leading_zeros =
> > +             pg_leading_zero_bits32(MIN_PG_STAT_IO_HIST_LATENCY);
> > +
> > +     /*
> > +      * make sure the tmp value at least 127 (our minimum bucket size)
> > +      * as __builtin_clz might return undefined behavior when operating on 0
> > +      */
> > +     uint32_t tmp = val | MIN_PG_STAT_IO_HIST_LATENCY;
>
> > +     /* count leading zeros */
> > +     int leading_zeros = pg_leading_zero_bits32(tmp);
> > +
> > +     /* normalize the index */
> > +     uint32_t index = min_latency_leading_zeros - leading_zeros;
> > +
> > +     /* clamp it to the maximum */
> > +     return (index > max_index) ? max_index : index;
> > +}
>
> Wouldn't it be easier to handle the minimum latency by shifting right?

What you suggest sounds like equal-width buckets {1,2,3,4..ms} rather than
logarithmic buckets {1,2,4,8..ms}, or am I missing something? The patch as it
stands now has two #ifdef'ed implementations, with the bitwise-shifting one
working, but even in objdump at -O2 there are plenty of those jumps because of
the current code.

> I think we may also need to handle inputs that don't fit a uint32.

Fixed.

> [..] For things
> like a stopped VM or such we could see IOs that that don't fit into a uint32
> when measured in microseconds.  So perhaps I'd make the input to the bucket
> calc 64 bits, then shift to the minimum precision and mask to implement
> clamping.

> > @@ -152,6 +189,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> >               INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
> >                                          io_time);
> >
> > +             /* calculate the bucket_index based on latency in us */
> > +             bucket_index = get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time));
> > +             PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
> > +
> >               /* Add the per-backend count */
> >               pgstat_count_backend_io_op_time(io_object, io_context, io_op,
>
> It's annoying to have to convert to microseconds here, that's not free :(.

Oooops, fixed, get_bucket_index() now operates directly on nanos/int64.
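
Roughly like this (a simplified standalone sketch, not the exact v2 code; I'm
using __builtin_clzll() directly here just to keep the example self-contained,
and the boundaries end up as powers of two in nanoseconds, which is where the
slightly odd [0,9)/[8,17) ranges shown above come from):

    #include <stdint.h>

    #define PGSTAT_IO_HIST_BUCKETS 16
    /* first bucket absorbs everything below 2^13 ns (~8us) */
    #define MIN_LAT_MASK ((UINT64_C(1) << 13) - 1)

    static inline int
    get_bucket_index(int64_t lat_ns)
    {
        /* OR-ing the mask in avoids clz(0) and floors us at bucket 0 */
        uint64_t    v = (uint64_t) lat_ns | MIN_LAT_MASK;
        int         idx = (63 - __builtin_clzll(v)) - 12;

        /* clamp: everything from 2^27 ns (~134ms) up goes to the last bucket */
        return (idx >= PGSTAT_IO_HIST_BUCKETS) ? PGSTAT_IO_HIST_BUCKETS - 1 : idx;
    }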

[..]
> > +     HIST_IO_COL_COUNT,
> > +     HIST_IO_COL_RESET_TIME,
> > +     HIST_IO_NUM_COLUMNS
> > +} history_get_history_state;
>
> Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto.  I don't
> think this should be introduced in the middle of the pg_stat_io implementation.

Right, I've moved it to just before pg_stat_io_histogram_build_tuples().

>
> > +/*
> > + * pg_leading_zero_bits32
> > + *           Returns the number of leading 0-bits in x, starting at the most significant bit position.
> > + *           Word must not be 0 (as it is undefined behavior).
> > + */
> > +static inline int
> > +pg_leading_zero_bits32(uint32 word)
>
> Do we really need this in addition to the already existing
> pg_leftmost_one_pos32()? Particularly because that already has an msvc
> implementation...

Well, I would be all for removing it, but please see the get_bucket_index()
discussion above. I tried to get rid of it (though maybe I misunderstood
something), but in the end I think it's more elegant/faster to keep it so that
the code in get_bucket_index() stays readable, rather than throwing more
bitwise voodoo in there(?)
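
For comparison, the pg_leftmost_one_pos64() variant would look roughly like
this (same 2^13 ns floor as in the sketch earlier in this mail):

    #include "postgres.h"
    #include "port/pg_bitutils.h"

    #define PGSTAT_IO_HIST_BUCKETS 16      /* as in the patch */

    static inline int
    get_bucket_index_alt(int64 lat_ns)
    {
        uint64      v = (uint64) lat_ns | ((UINT64CONST(1) << 13) - 1);
        int         idx = pg_leftmost_one_pos64(v) - 12;

        return Min(idx, PGSTAT_IO_HIST_BUCKETS - 1);
    }

It works the same; I just find the explicit clz spelling a bit easier to follow
next to the clz(127)-style comments.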

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmyZzmOODYS6n8mns9zN4RcS3o9kfrdQDyeRupqaGp9PmQ%40mail.gmail.com
