Re: pg_stat_io_histogram - Mailing list pgsql-hackers

From Jakub Wartak
Subject Re: pg_stat_io_histogram
Date
Msg-id CAKZiRmz-KH2a_qcwGeFwsoajCsQEd8jXyFLx0B0yKgkK=LhBLA@mail.gmail.com
In response to Re: pg_stat_io_histogram  (Andres Freund <andres@anarazel.de>)
On Mon, Mar 2, 2026 at 4:06 PM Andres Freund <andres@anarazel.de> wrote:

Hi Andres,

> On 2026-03-02 09:01:05 +0100, Jakub Wartak wrote:
> > On Thu, Feb 26, 2026 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > but I think having it in PgStat_BktypeIO is not great. This makes
> > > > > > PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
> > > > > > be half a megabyte bigger for no reason seems too wasteful.
> > > > >
> > > > > Yea, that's not awesome.
> > > >
> > > > Guys, a question: could you please explain what the drawbacks are of having
> > > > this semi-large (internal-only) stats snapshot of 0.5MB? I'm struggling to
> > > > understand two things:
> > > > a) 0.5MB is not a lot these days (OK, my 286 had 1MB back in the day ;))
> > >
> > > I don't really agree with that, I guess. And even if I did, it's one thing to
> > > use 0.5MB when you actually use it, it's quite another when most of that
> > > memory is never used.
> > >
> > >
> > > With the patch, *every* backend ends up with a substantially larger
> > > pgStatLocal. Before:
> > >
> > > nm -t d --size-sort -r -S src/backend/postgres|head -n20|less
> > > (the second column is the decimal size, third the type of the symbol)
> > >
> > > 0000000004131808 0000000000297456 r yy_transition
> > > ...
> > > 0000000003916352 0000000000054744 r UnicodeDecompMain
> > > 0000000021004896 0000000000052824 B pgStatLocal
> > > 0000000003850592 0000000000040416 r unicode_categories
> > > ...
> > >
> > > after:
> > > 0000000023220512 0000000000329304 B pgStatLocal
> > > 0000000018531648 0000000000297456 r yy_transition
> > > ...
> > >
> > > And because pgStatLocal is zero initialized data, it'll be on-demand-allocated
> > > in every single backend (whereas e.g. yy_transition is read-only shared).  So
> > > you're not talking about a one-time increase, you're multiplying it by the
> > > number of active connections.
> > >
> > > Now, it's true that most backends won't ever touch pgStatLocal.  However, most
> > > backends will touch Pending[Backend]IOStats, which also increased noticeably:
> > >
> > > before:
> > > 0000000021060960 0000000000002880 b PendingIOStats
> > > 0000000021057792 0000000000002880 b PendingBackendStats
> > >
> > > after:
> > > 0000000023568416 0000000000018240 b PendingIOStats
> > > 0000000023549888 0000000000018240 b PendingBackendStats
> > >
> > >
> > > Again, I think some increase here doesn't have to be fatal, but increasing it
> > > with mostly impossible-to-use memory seems like too much waste to me.
> > >
> > >
> > > This also increases the shared-memory usage of pgstats: Before it used ~300kB
> > > on a small system. That nearly doubles with this patch. But that's perhaps
> > > less concerning, given it's per-system, rather than per-backend memory usage.
> > >
> > >
> > >
> > > > b) how does it affect anything? Because testing shows it doesn't.
> > >
> > > Which of your testing would conceivably show the effect?  The concern here
> > > isn't really performance, it's that it increases our memory usage, which you'd
> > > only see having an effect if you are tight on memory or have a workload that
> > > is cache sensitive.
> > >
> >
> > Oh OK, now I understand the problem with pgStatLocal properly, thanks
> > for the detailed explanation! (though I'm still a little lost in the
> > woods of the pgstat infrastructure). Anyway, I agree that PgStat_IO
> > has become way too big, especially when the pg_stat_io(_histogram)
> > views are never actually accessed.
> >
> > How about the attached v6-0002? It now dynamically allocates the
> > PgStat_IO memory to avoid the memory cost (it is only allocated if
> > pgstat_io_snapshot_cb() is used). Is that the right path? And if so,
> > should it perhaps allocate from the pgStatLocal.snapshot.context
> > memory context instead?
>
> I think even the per-backend pending IO stats are too big. And for both
> pending stats, stored stats and snapshots, I still don't think I am OK with
> storing so many histograms that are not possible to use.  I think that needs
> to be fixed first.

v7-0001: no changes for quite some time.

Memory-reduction patches follow (I didn't want to squash them yet, so for
now they are separate):

v7-0002:
   As PendingBackendStats (the per-backend IO stats) was not collecting
   latency buckets at all (while sharing the same struct/typedef), I cloned
   the struct without the latency buckets. This shrinks the struct from
   18240 bytes back to 2880 bytes per backend (BSS), as on master.

v7-0003:
   Sadly, I couldn't easily make the backend-local recording into
   PendingIOStats allocate dynamically from within pgstat_count_io_op_time()
   on first use of a specific IO traffic type (i.e. per
   [IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES] cell), because
   any MemoryContextAlloc() there may happen inside a critical section, and
   that blows up.

   It's just +15kB per backend, so I hope it is OK to allocate it only when
   there is a real desire to use it (track_io_timing/track_wal_io_timing on)
   -- nm(1) then reports just 2888 bytes (so just a +8-byte pointer). The
   drawback is that setting the GUCs locally won't enable histogram
   collection immediately, only in newly spawned backends. That also meant I
   had to switch to a TAP test so it can be tested. I don't have a strong
   opinion on whether saving that +15kB is worth it for users not running
   with track_[wal_]io_timing.

v7-0004:
   (This was already sent with the previous message.) With the original v5,
   every backend had a big pgStatLocal (0000000000329304 B pgStatLocal) that
   sat there unused unless the pg_stat_io(_histogram) views were actually
   accessed. Now it is small (0000000000000984 B pgStatLocal), and the
   PgStat_Snapshot.PgStat_IO memory is allocated only when querying those
   views.

So with all three of the above combined, we are back to:
0000000011573376 0000000000002888 B PendingIOStats
0000000011570304 0000000000002880 b PendingBackendStats
0000000011569184 0000000000000984 B pgStatLocal

That's an actual saving over master itself:
0000000011577344 0000000000052824 B pgStatLocal
0000000011633408 0000000000002880 b PendingIOStats
0000000011630304 0000000000002880 b PendingBackendStats

> This also increases the shared-memory usage of pgstats: Before it used ~300kB
> on a small system. That nearly doubles with this patch. But that's perhaps
> less concerning, given it's per-system, rather than per-backend memory usage.

v7-0005:
   Skipping 4 backend types out of 17 means ignoring ~23% of backend types,
   and with a simple array I can get this down from ~592384 to ~519424 bytes
   of _total_ memory allocated for the 'Shared Memory Stats' shmem (this one
   was sent earlier).

v7-0006:
   We could reduce the total pgstats shmem further, down to ~482944 bytes,
   if we eliminated tracking of two more (IMHO useless) backend types:
   autovacuum_launcher and standalone_backend. Master is at 315904 bytes
   (so that's just ~163kB more according to pg_shm_allocations).

The patches probably still need some squashing, a pgindent run, etc.

-J.

