pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?) - Mailing list pgsql-hackers

From Andres Freund
Subject pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Date
Msg-id 20200124195226.lth52iydq2n2uilq@alap3.anarazel.de
Whole thread Raw
Responses Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
List pgsql-hackers
Hi,

Currently pg_stat_bgwriter.buffers_backend is pretty useless to gauge
whether backends are doing writes they shouldn't do. That's because it
counts things that are either unavoidably or unlikely doable by other
parts of the system (checkpointer, bgwriter).

In particular extending the file can not currently be done by any
another type of process, yet is counted. When using a buffer access
strategy it is also very likely that writes have to be done by the
'dirtying' backend itself, as the buffer will be reused soon after (when
not previously in s_b that is).

Additionally pg_stat_bgwriter.buffers_backend also counts writes done by
autovacuum et al.


I think it'd make sense to at least split buffers_backend into
buffers_backend_extend,
buffers_backend_write,
buffers_backend_write_strat

but it could also be worthwhile to expand it into
buffers_backend_extend,
buffers_{backend,checkpoint,bgwriter,autovacuum}_write
buffers_{backend,autovacuum}_write_stat

Possibly by internally, in contrast to SQL level, having just counter
arrays indexed by backend types.


It's also noteworthy that buffers_backend is accounted in an absurd
manner. One might think that writes are accounted from backend -> shared
memory or such. But instead it works like this:

1) backend flushes buffer in bufmgr.c, accounts for backend *write time*
2) mdwrite writes and  registers a sync request, which forwards the sync request to checkpointer
3) ForwardSyncRequest(), when not called by bgwriter, increments CheckpointerShmem->num_backend_writes
4) checkpointer, whenever doing AbsorbSyncRequests(), moves
   CheckpointerShmem->num_backend_writes to
   BgWriterStats.m_buf_written_backend (local memory!)
5) Occasionally it calls pgstat_send_bgwriter(), which sends the data to
   pgstat (which bgwriter also does)
6) Which then updates the shared memory used by the display functions

Worthwhile to note that backend buffer read/write *time* is accounted
differently. That's done via pgstat_send_tabstat().


I think there's very little excuse for the indirection via checkpointer,
besides architectually being weird, it actually requires that we
continue to wake up checkpointer over and over instead of optimizing how
and when we submit fsync requests.

As far as I can tell we're also simply not accounting at all for writes
done outside of shared buffers. All writes done directly through
smgrwrite()/extend() aren't accounted anywhere as far as I can tell.


I think we also count things as writes that aren't writes: mdtruncate()
is AFAICT counted as one backend write for each segment. Which seems
weird to me.


Lastly, I don't understand what the point of sending fixed size stats,
like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
I don't like it's architecture, we obviously need something like pgstat
to handle variable amounts of stats (database, table level etc
stats). But that doesn't at all apply to these types of global stats.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Mark Dilger
Date:
Subject: Re: making the backend's json parser work in frontend code
Next
From: Ranier Vilela
Date:
Subject: [PATCH] /src/backend/access/transam/xlog.c, tiny improvements