From: Andres Freund
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Msg-id: 20210803181258.622v2q5lv4swwkdw@alap3.anarazel.de
In response to: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?) (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers
Hi,

On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:
> Thanks for the feedback!
> 
> I agree it makes sense to count strategy writes separately.
> 
> I thought about this some more, and I don't know if it makes sense to
> only count "avoidable" strategy writes.
> 
> This would mean that a backend writing out a buffer from the strategy
> ring when no clean shared buffers (as well as no clean strategy buffers)
> are available would not count that write as a strategy write (even
> though it is writing out a buffer from its strategy ring). But, it
> obviously doesn't make sense to count it as a regular buffer being
> written out. So, I plan to change this code.

What do you mean by "no clean shared buffers ... are available"?



> The most substantial missing piece of the patch right now is persisting
> the data across reboots.
> 
> The two places in the code I can see to persist the buffer action stats
> data are:
> 1) using the stats collector code (like in
> pgstat_read/write_statsfiles()
> 2) using a before_shmem_exit() hook which writes the data structure to a
> file and then read from it when making the shared memory array initially

I think it's pretty clear that we should go for option 1). Having two
mechanisms for persisting stats data is a bad idea.


> Also, I'm unsure how writing the buffer action stats out in
> pgstat_write_statsfiles() will work, since I think that backends can
> update their buffer action stats after we would have already persisted
> the data from the BufferActionStatsArray -- causing us to lose those
> updates.

I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd lose those stats). By the time
shutdown happens, they all need to have already reported their stats - so we
don't need to do anything to get the data to pgstats.c at shutdown time.
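
Roughly like this - a completely untested sketch, with the message struct,
mtype enum member and function name all invented for illustration:

    typedef struct PgStat_MsgBufferActions
    {
        PgStat_MsgHdr m_hdr;
        BackendType m_backend_type;
        uint64      m_allocs;
        uint64      m_extends;
        uint64      m_fsyncs;
        uint64      m_writes;
        uint64      m_writes_strat;
    } PgStat_MsgBufferActions;

    /*
     * Report this backend's accumulated counters to the stats collector.
     * Would hang off the existing backend shutdown path (i.e. the
     * pgstat_beshutdown_hook() callback), so every exiting backend
     * reports exactly once.
     */
    static void
    pgstat_report_buffer_actions(void)
    {
        PgStat_MsgBufferActions msg;
        volatile PgBackendStatus *beentry = MyBEEntry;

        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_BUFFERACTIONS);
        msg.m_backend_type = beentry->st_backendType;
        msg.m_allocs = pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
        msg.m_extends = pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
        msg.m_fsyncs = pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
        msg.m_writes = pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
        msg.m_writes_strat = pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
        pgstat_send(&msg, sizeof(msg));
    }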


> And, I don't think I can use pgstat_read_statsfiles() since the
> BufferActionStatsArray should have the data from the file as soon as the
> view containing the buffer action stats can be queried. Thus, it seems
> like I would need to read the file while initializing the array in
> CreateBufferActionStatsCounters().

Why would backends need to read that data back?


> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> index 55f6e3711d..96cac0a74e 100644
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
>          pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
>          pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
>          pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> -        pg_stat_get_buf_written_backend() AS buffers_backend,
> -        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> -        pg_stat_get_buf_alloc() AS buffers_alloc,
>          pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

Material for a separate patch, not this one. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.



> @@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
>  
>      LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
>  
> -    /* Count all backend writes regardless of if they fit in the queue */
> -    if (!AmBackgroundWriterProcess())
> -        CheckpointerShmem->num_backend_writes++;
> -
>      /*
>       * If the checkpointer isn't running or the request queue is full, the
>       * backend will have to perform its own fsync request.  But before forcing
> @@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
>           * Count the subset of writes where backends have to do their own
>           * fsync
>           */
> +        /* TODO: should we count fsyncs for all types of procs? */
>          if (!AmBackgroundWriterProcess())
> -            CheckpointerShmem->num_backend_fsync++;
> +            pgstat_increment_buffer_action(BA_Fsync);
> +

Yes, I think that'd make sense. Now that we can disambiguate the different
types of syncs between procs, I don't see the point of having a process-type
filter here. We just lose data...



>          /* don't set checksum for all-zero page */
> @@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>                      if (XLogNeedsFlush(lsn) &&
>                          StrategyRejectBuffer(strategy, buf))
>                      {
> +                        /*
> +                         * Unset the strat write flag, as we will not be writing
> +                         * this particular buffer from our ring out and may end
> +                         * up having to find a buffer from main shared buffers,
> +                         * which, if it is dirty, we may have to write out, which
> +                         * could have been prevented by checkpointing and background
> +                         * writing
> +                         */
> +                        StrategyUnChooseBufferFromRing(strategy);
> +
>                          /* Drop lock/pin and loop around for another buffer */
>                          LWLockRelease(BufferDescriptorGetContentLock(buf));
>                          UnpinBuffer(buf, true);
>                          continue;
>                      }

Could we combine this with StrategyRejectBuffer()? It seems a bit wasteful to
have two function calls into freelist.c when the second happens exactly when
the first returns true?
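
Something like this (untested sketch) - just fold the flag reset into the
reject path, assuming the patch's ring flag is freelist.c's existing
current_was_in_ring state:

    bool
    StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
    {
        /* We only do this in bulkwrite mode */
        if (strategy->btype != BAS_BULKWRITE)
            return false;

        /* Don't muck with behavior of normal buffer-replacement strategy */
        if (!strategy->current_was_in_ring ||
            strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
            return false;

        /*
         * Remove the dirty buffer from the ring; necessary to prevent an
         * infinite loop if all ring members are dirty.
         */
        strategy->buffers[strategy->current] = InvalidBuffer;

        /*
         * New part: the replacement victim will come from shared buffers,
         * so the eventual write should not be counted as a strategy write.
         */
        strategy->current_was_in_ring = false;

        return true;
    }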


> +
> +                    /*
> +                     * TODO: there is certainly a better way to write this
> +                     * logic
> +                     */
> +
> +                    /*
> +                     * The dirty buffer that will be written out was selected
> +                     * from the ring and we did not bother checking the
> +                     * freelist or doing a clock sweep to look for a clean
> +                     * buffer to use, thus, this write will be counted as a
> +                     * strategy write -- one that may be unnecessary without a
> +                     * strategy
> +                     */
> +                    if (StrategyIsBufferFromRing(strategy))
> +                    {
> +                        pgstat_increment_buffer_action(BA_Write_Strat);
> +                    }
> +
> +                        /*
> +                         * If the dirty buffer was one we grabbed from the
> +                         * freelist or through a clock sweep, it could have been
> +                         * written out by bgwriter or checkpointer, thus, we will
> +                         * count it as a regular write
> +                         */
> +                    else
> +                        pgstat_increment_buffer_action(BA_Write);

It seems this would be better solved by having an "bool *from_ring" or
GetBufferSource* parameter to StrategyGetBuffer().
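
I.e. something like (sketch):

    /* StrategyGetBuffer() tells the caller where the victim came from */
    BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
                                  uint32 *buf_state, bool *from_ring);

    /* then, in BufferAlloc(), once the victim has to be flushed: */
    pgstat_increment_buffer_action(from_ring ? BA_Write_Strat : BA_Write);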


> @@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
>      /*
>       * bufToWrite is either the shared buffer or a copy, as appropriate.
>       */
> +
> +    /*
> +     * TODO: consider that if we did not need to distinguish between a buffer
> +     * flushed that was grabbed from the ring buffer and written out as part
> +     * of a strategy which was not from main Shared Buffers (and thus
> +     * preventable by bgwriter or checkpointer), then we could move all calls
> +     * to pgstat_increment_buffer_action() here except for the one for
> +     * extends, which would remain in ReadBuffer_common() before smgrextend()
> +     * (unless we decide to start counting other extends). That includes the
> +     * call to count buffers written by bgwriter and checkpointer which go
> +     * through FlushBuffer() but not BufferAlloc(). That would make it
> +     * simpler. Perhaps instead we can find somewhere else to indicate that
> +     * the buffer is from the ring of buffers to reuse.
> +     */
>      smgrwrite(reln,
>                buf->tag.forkNum,
>                buf->tag.blockNum,

Can we just add a parameter to FlushBuffer indicating what the source of the
write is?
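
Sketch, with an invented enum (the exact set of values is just for
illustration):

    typedef enum IOSource
    {
        IOSOURCE_SHARED,        /* victim from freelist / clock sweep */
        IOSOURCE_STRATEGY,      /* victim reused from a strategy ring */
        IOSOURCE_BGWRITER,
        IOSOURCE_CHECKPOINTER
    } IOSource;

    static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
                            IOSource source);

Then the pgstat_increment_buffer_action() calls for writes can live inside
FlushBuffer(), keyed off the source, and the TODO above largely goes away.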


> @@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
>       * the rate of buffer consumption.  Note that buffers recycled by a
>       * strategy object are intentionally not counted here.
>       */
> -    pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
> +    pgstat_increment_buffer_action(BA_Alloc);
>  
>      /*
>       * First check, without acquiring the lock, whether there's buffers in the

> @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
>           */
>          *complete_passes += nextVictimBuffer / NBuffers;
>      }
> -
> -    if (num_buf_alloc)
> -    {
> -        *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
> -    }
>      SpinLockRelease(&StrategyControl->buffer_strategy_lock);
>      return result;
>  }

Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
suspect this patch shouldn't get rid of numBufferAllocs at the same time as
overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
that that's the case / how we can make that work.
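
For reference, BgBufferSync() paces itself on that counter, roughly like this
(paraphrasing bufmgr.c):

    /* in BgBufferSync() */
    strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);

    /* keep a smoothed estimate of the recent allocation rate... */
    if (smoothed_alloc <= (float) recent_alloc)
        smoothed_alloc = recent_alloc;
    else
        smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
            smoothing_samples;

    /* ...and aim to keep that many clean buffers ahead of the clock hand */
    upcoming_alloc_est = (int) (smoothed_alloc * bgwriter_lru_multiplier);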




> +void
> +pgstat_increment_buffer_action(BufferActionType ba_type)
> +{
> +    volatile PgBackendStatus *beentry   = MyBEEntry;
> +
> +    if (!beentry || !pgstat_track_activities)
> +        return;
> +
> +    if (ba_type == BA_Alloc)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
> +    else if (ba_type == BA_Extend)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
> +    else if (ba_type == BA_Fsync)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
> +    else if (ba_type == BA_Write)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
> +    else if (ba_type == BA_Write_Strat)
> +        pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
> +}

I don't think we want to use atomic increments here - they're *slow*. And
there can only ever be a single writer to a backend's stats. So just doing
something like
    pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
should do the trick.
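
Perhaps wrapped in a tiny helper (name invented) to keep the call sites
readable:

    /*
     * Unlocked increment: safe because each backend is the only writer of
     * its own counters; concurrent readers may see a slightly stale value,
     * which is fine for stats.
     */
    static inline void
    pgstat_counter_inc(pg_atomic_uint64 *counter)
    {
        pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
    }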


> +/*
> + * Called for a single backend at the time of death to persist its I/O stats
> + */
> +void
> +pgstat_record_dead_backend_buffer_actions(void)
> +{
> +    volatile PgBackendBufferActionStats *ba_stats;
> +    volatile    PgBackendStatus *beentry = MyBEEntry;
> +
> +    if (beentry->st_procpid != 0)
> +        return;
> +
> +    // TODO: is this correct? could there be a data race? do I need a lock?
> +    ba_stats = &BufferActionStatsArray[beentry->st_backendType];
> +    pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
> +    pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
> +    pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
> +    pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
> +    pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
> +}

I don't see a race, FWIW.

This is where I propose that we instead report the values up to the stats
collector, instead of having a separate array that we need to persist.
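
Sketch of the collector side, again with invented names (reusing the message
struct from the sketch above; the buffer_actions member of the global stats
is made up). The totals would then be written out and read back by the
existing pgstat_read/write_statsfiles() machinery, solving the persistence
question too:

    /* per-backend-type totals, part of the permanent stats data */
    typedef struct PgStat_BufferActions
    {
        uint64      allocs;
        uint64      extends;
        uint64      fsyncs;
        uint64      writes;
        uint64      writes_strat;
    } PgStat_BufferActions;

    static void
    pgstat_recv_buffer_actions(PgStat_MsgBufferActions *msg, int len)
    {
        PgStat_BufferActions *ba = &globalStats.buffer_actions[msg->m_backend_type];

        ba->allocs += msg->m_allocs;
        ba->extends += msg->m_extends;
        ba->fsyncs += msg->m_fsyncs;
        ba->writes += msg->m_writes;
        ba->writes_strat += msg->m_writes_strat;
    }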


> +/*
> + * Fill the provided values array with the accumulated counts of buffer actions
> + * taken by all backends of type backend_type (input parameter), both alive and
> + * dead. This is currently only used by pg_stat_get_buffer_actions() to create
> + * the rows in the pg_stat_buffer_actions system view.
> + */
> +void
> +pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
> +{
> +    int            i;
> +    volatile PgBackendStatus *beentry;
> +
> +    /*
> +     * Add stats from all exited backends
> +     */
> +    values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
> +    values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
> +    values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
> +    values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
> +    values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
> +
> +    /*
> +     * Loop through all live backends and count their buffer actions
> +     */
> +    // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
> +
> +    beentry = BackendStatusArray;
> +    for (i = 1; i <= MaxBackends; i++)
> +    {
> +        /* Don't count dead backends. They should already be counted */
> +        if (beentry->st_procpid == 0)
> +            continue;
> +        if (beentry->st_backendType != backend_type)
> +            continue;
> +
> +        values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
> +        values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
> +        values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
> +        values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
> +        values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
> +
> +        beentry++;
> +    }
> +}

It seems to make a bit more sense to have this sum up the stats for all
backend types at once.
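
E.g. (sketch, names invented; BACKEND_NUM_TYPES assumed to be the number of
BackendType values):

    /* one pass over the status array, filling totals for every type */
    void
    pgstat_sum_all_buffer_actions(uint64 totals[][BUFFER_ACTION_NUM_TYPES])
    {
        volatile PgBackendStatus *beentry = BackendStatusArray;

        for (int i = 0; i < MaxBackends; i++, beentry++)
        {
            uint64     *row;

            /* exited backends are already in the global totals */
            if (beentry->st_procpid == 0)
                continue;

            row = totals[beentry->st_backendType];
            row[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
            row[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
            row[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
            row[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
            row[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
        }
    }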

> +        /*
> +         * Currently, the only supported backend types for stats are the following.
> +         * If this were to change, pg_proc.dat would need to be changed as well
> +         * to reflect the new expected number of rows.
> +         */
> +        Datum values[BUFFER_ACTION_NUM_TYPES];
> +        bool nulls[BUFFER_ACTION_NUM_TYPES];

Ah ;)

Greetings,

Andres Freund


