Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?) - Mailing list pgsql-hackers

From Andres Freund
Subject Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Date
Msg-id 20220706192047.sikr2vuwvpslqdrw@alap3.anarazel.de
Whole thread Raw
In response to Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
List pgsql-hackers
Hi,

On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
> From 2d089e26236c55d1be5b93833baa0cf7667ba38d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 28 Jun 2022 11:33:04 -0400
> Subject: [PATCH v22 1/3] Add BackendType for standalone backends
> 
> All backends should have a BackendType to enable statistics reporting
> per BackendType.
> 
> Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
> alphabetize the BackendTypes). Both the bootstrap backend and single
> user mode backends will have BackendType B_STANDALONE_BACKEND.
> 
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Discussion:
https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
> ---
>  src/backend/utils/init/miscinit.c | 17 +++++++++++------
>  src/include/miscadmin.h           |  5 +++--
>  2 files changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
> index eb43b2c5e5..07e6db1a1c 100644
> --- a/src/backend/utils/init/miscinit.c
> +++ b/src/backend/utils/init/miscinit.c
> @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
>  {
>      Assert(!IsPostmasterEnvironment);
>  
> +    MyBackendType = B_STANDALONE_BACKEND;

Hm. This is used for singleuser mode as well as bootstrap. Should we
split those? It's not like bootstrap mode really matters for stats, so
I'm inclined not to.


> @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
>       * out the initial relation mapping files.
>       */
>      RelationMapFinishBootstrap();
> +    // TODO: should this be done for bootstrap?
> +    pgstat_report_io_ops();

Hm. Not particularly useful, but also not harmful. But we don't need an
explicit call, because it'll be done at process exit too. At least I
think, it could be that it's different for bootstrap.



> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 2e146aac93..e6dbb1c4bb 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1712,6 +1712,9 @@ AutoVacWorkerMain(int argc, char *argv[])
>          recentXid = ReadNextTransactionId();
>          recentMulti = ReadNextMultiXactId();
>          do_autovacuum();
> +
> +        // TODO: should this be done more often somewhere in do_autovacuum()?
> +        pgstat_report_io_ops();
>      }

Don't think you need all these calls before process exit - it'll happen
via pgstat_shutdown_hook().

IMO it'd be a good idea to add pgstat_report_io_ops() to
pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
autovac worker get updated more regularly.


> diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
> index 91e6f6ea18..87e4b9e9bd 100644
> --- a/src/backend/postmaster/bgwriter.c
> +++ b/src/backend/postmaster/bgwriter.c
> @@ -242,6 +242,7 @@ BackgroundWriterMain(void)
>  
>          /* Report pending statistics to the cumulative stats system */
>          pgstat_report_bgwriter();
> +        pgstat_report_io_ops();
>  
>          if (FirstCallSinceLastCheckpoint())
>          {

How about moving the pgstat_report_io_ops() into
pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
unnecessary to have multiple pgstat_* calls in these places.



> +/*
> + * Flush out locally pending IO Operation statistics entries
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true. Writer processes are mutually excluded
> + * using LWLock, but readers are expected to use change-count protocol to avoid
> + * interference with writers.
> + *
> + * If nowait is true, this function returns true if the lock could not be
> + * acquired. Otherwise return false.
> + *
> + */
> +bool
> +pgstat_flush_io_ops(bool nowait)
> +{
> +    PgStat_IOPathOps *dest_io_path_ops;
> +    PgStatShared_BackendIOPathOps *stats_shmem;
> +
> +    PgBackendStatus *beentry = MyBEEntry;
> +
> +    if (!have_ioopstats)
> +        return false;
> +
> +    if (!beentry || beentry->st_backendType == B_INVALID)
> +        return false;
> +
> +    stats_shmem = &pgStatLocal.shmem->io_ops;
> +
> +    if (!nowait)
> +        LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
> +    else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
> +        return true;

Wonder if it's worth making the lock specific to the backend type?


> +    dest_io_path_ops =
> +        &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
> +

This could be done before acquiring the lock, right?


> +void
> +pgstat_io_ops_snapshot_cb(void)
> +{
> +    PgStatShared_BackendIOPathOps *stats_shmem = &pgStatLocal.shmem->io_ops;
> +    PgStat_IOPathOps *snapshot_ops = pgStatLocal.snapshot.io_path_ops;
> +    PgStat_IOPathOps *reset_ops;
> +
> +    PgStat_IOPathOps *reset_offset = stats_shmem->reset_offset;
> +    PgStat_IOPathOps reset[BACKEND_NUM_TYPES];
> +
> +    pgstat_copy_changecounted_stats(snapshot_ops,
> +            &stats_shmem->stats, sizeof(stats_shmem->stats),
> +            &stats_shmem->changecount);

This doesn't make sense - with multiple writers you can't use the
changecount approach (and you don't in the flush part above).


> +    LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> +    memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
> +    LWLockRelease(&stats_shmem->lock);

Which then also means that you don't need the reset offset stuff. It's
only there because with the changecount approach we can't take a lock to
reset the stats (since there is no lock). With a lock you can just reset
the shared state.


> +void
> +pgstat_count_io_op(IOOp io_op, IOPath io_path)
> +{
> +    PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
> +    PgStat_IOOpCounters *cumulative_counters =
> +            &cumulative_IOOpStats.data[io_path];

the pending_/cumultive_ prefix before an uppercase-first camelcase name
seems ugly...

> +    switch (io_op)
> +    {
> +        case IOOP_ALLOC:
> +            pending_counters->allocs++;
> +            cumulative_counters->allocs++;
> +            break;
> +        case IOOP_EXTEND:
> +            pending_counters->extends++;
> +            cumulative_counters->extends++;
> +            break;
> +        case IOOP_FSYNC:
> +            pending_counters->fsyncs++;
> +            cumulative_counters->fsyncs++;
> +            break;
> +        case IOOP_WRITE:
> +            pending_counters->writes++;
> +            cumulative_counters->writes++;
> +            break;
> +    }
> +
> +    have_ioopstats = true;
> +}

Doing two math ops / memory accesses every time seems off. Seems better
to maintain cumultive_counters whenever reporting stats, just before
zeroing pending_counters?


> +/*
> + * Report IO operation statistics
> + *
> + * This works in much the same way as pgstat_flush_io_ops() but is meant for
> + * BackendTypes like bgwriter for whom pgstat_report_stat() will not be called
> + * frequently enough to keep shared memory stats fresh.
> + * Backends not typically calling pgstat_report_stat() can invoke
> + * pgstat_report_io_ops() explicitly.
> + */
> +void
> +pgstat_report_io_ops(void)
> +{

This shouldn't be needed - the flush function above can be used.


> +    PgStat_IOPathOps *dest_io_path_ops;
> +    PgStatShared_BackendIOPathOps *stats_shmem;
> +
> +    PgBackendStatus *beentry = MyBEEntry;
> +
> +    Assert(!pgStatLocal.shmem->is_shutdown);
> +    pgstat_assert_is_up();
> +
> +    if (!have_ioopstats)
> +        return;
> +
> +    if (!beentry || beentry->st_backendType == B_INVALID)
> +        return;

Is there a case where this may be called where we have no beentry?

Why not just use MyBackendType?


> +    stats_shmem = &pgStatLocal.shmem->io_ops;
> +
> +    dest_io_path_ops =
> +        &stats_shmem->stats[backend_type_get_idx(beentry->st_backendType)];
> +
> +    pgstat_begin_changecount_write(&stats_shmem->changecount);

A mentioned before, the changecount stuff doesn't apply here. You need a
lock.


> +PgStat_IOPathOps *
> +pgstat_fetch_backend_io_path_ops(void)
> +{
> +    pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
> +    return pgStatLocal.snapshot.io_path_ops;
> +}
> +
> +PgStat_Counter
> +pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op)
> +{
> +    PgStat_IOOpCounters *counters = &cumulative_IOOpStats.data[io_path];
> +
> +    switch (io_op)
> +    {
> +        case IOOP_ALLOC:
> +            return counters->allocs;
> +        case IOOP_EXTEND:
> +            return counters->extends;
> +        case IOOP_FSYNC:
> +            return counters->fsyncs;
> +        case IOOP_WRITE:
> +            return counters->writes;
> +        default:
> +            elog(ERROR, "IO Operation %s for IO Path %s is undefined.",
> +                    pgstat_io_op_desc(io_op), pgstat_io_path_desc(io_path));
> +    }
> +}

There's currently no user for this, right? Maybe let's just defer the
cumulative stuff until we need it?


> +const char *
> +pgstat_io_path_desc(IOPath io_path)
> +{
> +    const char *io_path_desc = "Unknown IO Path";
> +

This should be unreachable, right?


> From f2b5b75f5063702cbc3c64efdc1e7ef3cf1acdb4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 4 Jul 2022 15:44:17 -0400
> Subject: [PATCH v22 3/3] Add system view tracking IO ops per backend type

> Add pg_stat_buffers, a system view which tracks the number of IO
> operations (allocs, writes, fsyncs, and extends) done through each IO
> path (e.g. shared buffers, local buffers, unbuffered IO) by each type of
> backend.

I think I like pg_stat_io a bit better? Nearly everything in here seems
to fit better in that.

I guess we could split out buffers allocated, but that's actually
interesting in the context of the kind of IO too.


>       <row>
>        <entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
>        <entry>One row only, showing statistics about WAL activity. See
> @@ -3595,7 +3604,102 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
>         <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
>        </para>
>        <para>
> -       Time at which these statistics were last reset
> +       Time at which these statistics were last reset.
> +      </para></entry>

Grammar critique time :)


> +CREATE VIEW pg_stat_buffers AS
> +SELECT
> +       b.backend_type,
> +       b.io_path,
> +       b.alloc,
> +       b.extend,
> +       b.fsync,
> +       b.write,
> +       b.stats_reset
> +FROM pg_stat_get_buffers() b;

Do we want to expose all data to all users? I guess pg_stat_bgwriter
does? But this does split things out a lot more...



> +    for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> +    {
> +        PgStat_IOOpCounters *counters = io_path_ops->data;
> +        Datum        backend_type_desc =
> +            CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));
> +            /* const char *log_name = GetBackendTypeDesc(idx_get_backend_type(i)); */
> +
> +        for (int j = 0; j < IOPATH_NUM_TYPES; j++)
> +        {
> +            Datum values[BUFFERS_NUM_COLUMNS];
> +            bool nulls[BUFFERS_NUM_COLUMNS];
> +            memset(values, 0, sizeof(values));
> +            memset(nulls, 0, sizeof(nulls));
> +
> +            values[BUFFERS_COLUMN_BACKEND_TYPE] = backend_type_desc;
> +            values[BUFFERS_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j));

Random musing: I wonder if we should start to use SQL level enums for
this kind of thing.


>  DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
>  DROP TABLE prevstats;
> +SELECT pg_stat_reset_shared('buffers');
> + pg_stat_reset_shared 
> +----------------------
> + 
> +(1 row)
> +
> +SELECT pg_stat_force_next_flush();
> + pg_stat_force_next_flush 
> +--------------------------
> + 
> +(1 row)
> +
> +SELECT write = 0 FROM pg_stat_buffers WHERE io_path = 'Shared' and backend_type = 'checkpointer';
> + ?column? 
> +----------
> + t
> +(1 row)


Don't think you can rely on that. The lookup of the view, functions
might have needed to load catalog data, which might have needed to evict
buffers.  I think you can do something more reliable by checking that
there's more written buffers after a checkpoint than before, or such.


Would be nice to have something testing that the ringbuffer stats stuff
does something sensible - that feels not entirely trivial.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: Add red-black tree missing comparison searches
Next
From: Alvaro Herrera
Date:
Subject: Re: EINTR in ftruncate()