Re: pg_stat_io_histogram - Mailing list pgsql-hackers
| From | Ants Aasma |
|---|---|
| Subject | Re: pg_stat_io_histogram |
| Date | |
| Msg-id | CANwKhkMwydw8P-H-tUzbmNsbk7B=buM+UDT4Rb90Y_OBhNQtdg@mail.gmail.com |
| In response to | Re: pg_stat_io_histogram (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
| Responses | Re: pg_stat_io_histogram |
| List | pgsql-hackers |
On Thu, 12 Feb 2026 at 10:35, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
> BTW how have you arrived with the "240x" number? We have 16 buckets for each
> of the object/context/type.
Sorry, I worded that poorly. I meant that we store a histogram for each combination, 240 in total.
> even then scanning an extra 25KB of mostly zeroes on every commit
> doesn't seem great. Maybe making the histogram accumulation
> conditional on the counter field being non-zero is enough to avoid any
> issues? I haven't yet constructed a benchmark to see if it's actually
> a problem or not. Select only pgbench with small shared buffers and
> scale that fits into page cache should be an adversarial use case
> while still being reasonably realistic.
> Earlier I've done some benchmarks (please see [1]) based on recommendations
> by Andres to keep io_combine_limit low for that and shared_buffers tiny.
> I'm getting too much noise to derive any results, and as this is related
> to I/O, probably even context switches start playing a role there... sadly we
> don't seem to have a performance farm to answer this.
I glossed over the first benchmark you did. That's pretty close to what I was talking about - exercising the stats collection path by having ~1 I/O served from page cache per trivial transaction. And the prewarm run should exercise the per-I/O overhead. If neither of them has any measurable overhead, then I can't think of a workload where it could be worse.
> TBH, I'm not sure how to progress with this, I mean we could, as you say:
> - reduce PgStat_PendingIO.pending_hist_time_buckets by removing
>   IOCONTEXT_NUM_TYPES
>   (not a big loss, just the loss of showing the BAS strategy)
I'm on the fence about this. For the actual problems I've had to diagnose it wouldn't have mattered. But latency differences between bulk and normal access might be useful for understanding benchmark results better. A 5x reduction in size is pretty big.
> - we could even further reduce PgStat_PendingIO.pending_hist_time_buckets
>   by removing IOOBJECT_NUM_TYPES, but those are just 3 and they might be
>   useful
WAL write vs. relation write is a very useful distinction for me.
> ... and are you saying to try to do this below thing too?
>
> @@ -288,8 +290,9 @@ pgstat_io_flush_cb(bool nowait)
>  		for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
> -			bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
> -				PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
> +			if(PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b] > 0)
> +				bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
> +					PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
I meant this:
@@ -287,6 +287,7 @@ pgstat_io_flush_cb(bool nowait)
 	bktype_shstats->times[io_object][io_context][io_op] +=
 		INSTR_TIME_GET_MICROSEC(time);
+	if (PendingIOStats.counts[io_object][io_context][io_op] > 0)
 	for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
 		bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
 			PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
Most object/context/op combinations will have a 0 count, so no point in actually looking at the histogram.
> .. but the main problem, even if I do all of that I won't be able to
> reliably measure the impact, probably the best I could say is
> "runs as well as master, +/- 3%".
> Could you somehow help me with this? I mean should we reduce the scope (remove
> context) and add that "if"?
I think if we only aggregate histograms conditionally, then having a ton of different histograms is less of a problem. Only the histograms that have any data will get accessed. The overhead is limited to the memory usage which I think is acceptable.
I'll run a few benchmarks on what I have available here to see if I can tease out anything beyond the "no effect, +/- 3% error margin" result we have today.
> I'm not familiar enough with the new stats infrastructure to tell
> whether it's a problem, but it seems odd that
> pgstat_flush_backend_entry_io() isn't modified to aggregate the
> histograms.
> Well, I'm doing this for the first time too, and my understanding is that
> pgstat_io.c::pgstat_io_flush_cb() flushes the global statistics
> (per backend type) while the per-individual-backend
> pgstat_flush_backend_entry_io() (from pgstat_backend.c) is more about
> per-PID backend stats (--> for: select * from pg_stat_get_backend_io(PID)).
> In terms of this patch, the per-backend-PID I/O histograms are not implemented
> yet, and I've raised this question earlier, but I'm starting to believe
> the answer is probably no, we should not implement those (more overhead
> for no apparent benefit, as most cases could be tracked down just with
> the overall per-backend-type stats).
I agree that per-PID histograms are probably not worth the extra work. But this left me wondering if we are allocating the whole set of histograms too many times. I don't think every place that uses PgStat_BktypeIO actually needs the histograms. I will need to dig around to understand this code a bit better.
Regards,
Ants Aasma