Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?) - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Msg-id CAAKRu_avP+b+KEbCe1K0mdKhbJ5gVx8YzV37p+ukCGkqFY1HJA@mail.gmail.com
In response to Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Aug 3, 2021 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:
> > Thanks for the feedback!
> >
> > I agree it makes sense to count strategy writes separately.
> >
> > I thought about this some more, and I don't know if it makes sense to
> > only count "avoidable" strategy writes.
> >
> > This would mean that a backend writing out a buffer from the strategy
> > ring when no clean shared buffers (as well as no clean strategy buffers)
> > are available would not count that write as a strategy write (even
> > though it is writing out a buffer from its strategy ring). But, it
> > obviously doesn't make sense to count it as a regular buffer being
> > written out. So, I plan to change this code.
>
> What do you mean with "no clean shared buffers ... are available"?
>

I think I was talking about the scenario in which a backend using a
strategy does not find a clean buffer in the strategy ring and goes to
look in the freelist for a clean shared buffer and doesn't find one.

I was probably talking in circles up there. I think the current
patch counts the right writes in the right way, though.

>
>
> > The most substantial missing piece of the patch right now is persisting
> > the data across reboots.
> >
> > The two places in the code I can see to persist the buffer action stats
> > data are:
> > 1) using the stats collector code (like in
> > pgstat_read/write_statsfiles()
> > 2) using a before_shmem_exit() hook which writes the data structure to a
> > file and then read from it when making the shared memory array initially
>
> I think it's pretty clear that we should go for 1. Having two mechanisms for
> persisting stats data is a bad idea.

New version uses the stats collector.

>
>
> > Also, I'm unsure how writing the buffer action stats out in
> > pgstat_write_statsfiles() will work, since I think that backends can
> > update their buffer action stats after we would have already persisted
> > the data from the BufferActionStatsArray -- causing us to lose those
> > updates.
>
> I was thinking it'd work differently. Whenever a connection ends, it reports
> its data up to pgstats.c (otherwise we'd lose those stats). By the time
> shutdown happens, they all need to have already reported their stats - so
> we don't need to do anything to get the data to pgstats.c during shutdown
> time.
>

When you say "whenever a connection ends", what part of the code are you
referring to specifically?

Also, when you say "shutdown", do you mean a backend shutting down or
all backends shutting down (including postmaster) -- like pg_ctl stop?

>
> > And, I don't think I can use pgstat_read_statsfiles() since the
> > BufferActionStatsArray should have the data from the file as soon as the
> > view containing the buffer action stats can be queried. Thus, it seems
> > like I would need to read the file while initializing the array in
> > CreateBufferActionStatsCounters().
>
> Why would backends need to read that data back?
>

To get totals across restarts, but that doesn't matter now that I am
using the stats collector.

>
> > diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
> > index 55f6e3711d..96cac0a74e 100644
> > --- a/src/backend/catalog/system_views.sql
> > +++ b/src/backend/catalog/system_views.sql
> > @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
> >          pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
> >          pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
> >          pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> > -        pg_stat_get_buf_written_backend() AS buffers_backend,
> > -        pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> > -        pg_stat_get_buf_alloc() AS buffers_alloc,
> >          pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
>
> Material for a separate patch, not this. But if we're going to break
> monitoring queries anyway, I think we should consider also renaming
> maxwritten_clean (and perhaps a few others), because nobody understands what
> that is supposed to mean.
>
>

Do you mean I shouldn't remove anything from the pg_stat_bgwriter view?

>
> > @@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> >
> >       LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
> >
> > -     /* Count all backend writes regardless of if they fit in the queue */
> > -     if (!AmBackgroundWriterProcess())
> > -             CheckpointerShmem->num_backend_writes++;
> > -
> >       /*
> >        * If the checkpointer isn't running or the request queue is full, the
> >        * backend will have to perform its own fsync request.  But before forcing
> > @@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
> >                * Count the subset of writes where backends have to do their own
> >                * fsync
> >                */
> > +             /* TODO: should we count fsyncs for all types of procs? */
> >               if (!AmBackgroundWriterProcess())
> > -                     CheckpointerShmem->num_backend_fsync++;
> > +                     pgstat_increment_buffer_action(BA_Fsync);
> > +
>
> Yes, I think that'd make sense. Now that we can disambiguate the different
> types of syncs between procs, I don't see a point of having a process-type
> filter here. We just lose data...
>
>

Done

>
> >               /* don't set checksum for all-zero page */
> > @@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
> >                                       if (XLogNeedsFlush(lsn) &&
> >                                               StrategyRejectBuffer(strategy, buf))
> >                                       {
> > +                                             /*
> > +                                              * Unset the strat write flag, as we will not be writing
> > +                                              * this particular buffer from our ring out and may end
> > +                                              * up having to find a buffer from main shared buffers,
> > +                                              * which, if it is dirty, we may have to write out, which
> > +                                              * could have been prevented by checkpointing and background
> > +                                              * writing
> > +                                              */
> > +                                             StrategyUnChooseBufferFromRing(strategy);
> > +
> >                                               /* Drop lock/pin and loop around for another buffer */
> >                                               LWLockRelease(BufferDescriptorGetContentLock(buf));
> >                                               UnpinBuffer(buf, true);
> >                                               continue;
> >                                       }
>
> Could we combine this with StrategyRejectBuffer()? It seems a bit wasteful to
> have two function calls into freelist.c when the second happens exactly when
> the first returns true?
>
>
> > +
> > +                                     /*
> > +                                      * TODO: there is certainly a better way to write this
> > +                                      * logic
> > +                                      */
> > +
> > +                                     /*
> > +                                      * The dirty buffer that will be written out was selected
> > +                                      * from the ring and we did not bother checking the
> > +                                      * freelist or doing a clock sweep to look for a clean
> > +                                      * buffer to use, thus, this write will be counted as a
> > +                                      * strategy write -- one that may be unnecessary without a
> > +                                      * strategy
> > +                                      */
> > +                                     if (StrategyIsBufferFromRing(strategy))
> > +                                     {
> > +                                             pgstat_increment_buffer_action(BA_Write_Strat);
> > +                                     }
> > +
> > +                                             /*
> > +                                              * If the dirty buffer was one we grabbed from the
> > +                                              * freelist or through a clock sweep, it could have been
> > +                                              * written out by bgwriter or checkpointer, thus, we will
> > +                                              * count it as a regular write
> > +                                              */
> > +                                     else
> > +                                             pgstat_increment_buffer_action(BA_Write);
>
> It seems this would be better solved by having an "bool *from_ring" or
> GetBufferSource* parameter to StrategyGetBuffer().
>

I've addressed both of these in the new version.
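For readers following along, a hedged sketch of what those two suggestions amount to (the enum, function names, and signature here are illustrative stand-ins, not PostgreSQL's actual code): StrategyGetBuffer() reports via an out-parameter whether the victim buffer came from the strategy ring, and the caller uses that single flag to classify the write, with no second call into freelist.c:

```c
#include <stdbool.h>

/* Illustrative write classification, assuming StrategyGetBuffer() were
 * extended with something like a "bool *from_ring" out-parameter. */
typedef enum { BA_WRITE, BA_WRITE_STRAT } WriteKind;

static WriteKind classify_write(bool from_ring)
{
    /* A dirty buffer taken from the strategy ring counts as a strategy
     * write; one found via the freelist or clock sweep could have been
     * cleaned by bgwriter/checkpointer, so it is a regular write. */
    return from_ring ? BA_WRITE_STRAT : BA_WRITE;
}
```

When StrategyRejectBuffer() rejects the ring buffer, the same flag would simply be cleared before looping around for another buffer.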

>
> > @@ -2895,6 +2948,20 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
> >       /*
> >        * bufToWrite is either the shared buffer or a copy, as appropriate.
> >        */
> > +
> > +     /*
> > +      * TODO: consider that if we did not need to distinguish between a buffer
> > +      * flushed that was grabbed from the ring buffer and written out as part
> > +      * of a strategy which was not from main Shared Buffers (and thus
> > +      * preventable by bgwriter or checkpointer), then we could move all calls
> > +      * to pgstat_increment_buffer_action() here except for the one for
> > +      * extends, which would remain in ReadBuffer_common() before smgrextend()
> > +      * (unless we decide to start counting other extends). That includes the
> > +      * call to count buffers written by bgwriter and checkpointer which go
> > +      * through FlushBuffer() but not BufferAlloc(). That would make it
> > +      * simpler. Perhaps instead we can find somewhere else to indicate that
> > +      * the buffer is from the ring of buffers to reuse.
> > +      */
> >       smgrwrite(reln,
> >                         buf->tag.forkNum,
> >                         buf->tag.blockNum,
>
> Can we just add a parameter to FlushBuffer indicating what the source of the
> write is?
>

I just noticed this comment now, so I'll address that in the next
version. I rebased today and noticed merge conflicts, so, it looks like
v5 will be on its way soon anyway.

>
> > @@ -247,7 +257,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
> >        * the rate of buffer consumption.  Note that buffers recycled by a
> >        * strategy object are intentionally not counted here.
> >        */
> > -     pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
> > +     pgstat_increment_buffer_action(BA_Alloc);
> >
> >       /*
> >        * First check, without acquiring the lock, whether there's buffers in the
>
> > @@ -411,11 +421,6 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
> >                */
> >               *complete_passes += nextVictimBuffer / NBuffers;
> >       }
> > -
> > -     if (num_buf_alloc)
> > -     {
> > -             *num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
> > -     }
> >       SpinLockRelease(&StrategyControl->buffer_strategy_lock);
> >       return result;
> >  }
>
> Hm. Isn't bgwriter using the *num_buf_alloc value to pace its activity? I
> suspect this patch shouldn't get rid of numBufferAllocs at the same time as
> overhauling the stats stuff. Perhaps we don't need both - but it's not obvious
> that that's the case / how we can make that work.
>
>

I initially meant to add a function to the patch like
pg_stat_get_buffer_actions() but which took a BackendType and a
BufferActionType as parameters and returned a single value: the number
of buffer actions of that type performed by that type of backend.

Let's say I defined it like this:

uint64
pg_stat_get_backend_buffer_actions_stats(BackendType backend_type,
                                         BufferActionType ba_type)

Then, I intended to use that in StrategySyncStart() to set num_buf_alloc:
take the value returned by
pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER, BA_Alloc),
subtract StrategyControl->numBufferAllocs from it to get val, and then
add val back into StrategyControl->numBufferAllocs.

I think that would have the same behavior as the current code, though I'm
not sure if the performance would end up being better or worse. It
wouldn't be atomically incrementing StrategyControl->numBufferAllocs, but
it would do a few more atomic operations in StrategySyncStart() than
before. Also, we would do all the work done by
pg_stat_get_buffer_actions() in StrategySyncStart().

But that is called comparatively infrequently, right?
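The bookkeeping described above is a standard cumulative-to-delta conversion; a hedged stand-alone sketch with illustrative names (cumulative_allocs() stands in for pg_stat_get_backend_buffer_actions_stats(B_BG_WRITER, BA_Alloc), reported_allocs for StrategyControl->numBufferAllocs; locking omitted):

```c
#include <stdint.h>

static uint64_t fake_total;       /* test stand-in for the cumulative stats */
static uint64_t reported_allocs;  /* allocs already handed to the bgwriter */

/* Stand-in for reading the cumulative allocation count from the stats. */
static uint64_t cumulative_allocs(void)
{
    return fake_total;
}

/* What StrategySyncStart() would compute for *num_buf_alloc: the delta
 * between the cumulative total and what was already reported. */
static uint64_t allocs_since_last_call(void)
{
    uint64_t total = cumulative_allocs();
    uint64_t delta = total - reported_allocs;

    reported_allocs += delta;   /* now equal to total */
    return delta;
}
```

Each call returns only the allocations since the previous call, matching the behavior of the pg_atomic_exchange_u32(..., 0) it would replace.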

>
>
> > +void
> > +pgstat_increment_buffer_action(BufferActionType ba_type)
> > +{
> > +     volatile PgBackendStatus *beentry   = MyBEEntry;
> > +
> > +     if (!beentry || !pgstat_track_activities)
> > +             return;
> > +
> > +     if (ba_type == BA_Alloc)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.allocs, 1);
> > +     else if (ba_type == BA_Extend)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.extends, 1);
> > +     else if (ba_type == BA_Fsync)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.fsyncs, 1);
> > +     else if (ba_type == BA_Write)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes, 1);
> > +     else if (ba_type == BA_Write_Strat)
> > +             pg_atomic_add_fetch_u64(&beentry->buffer_action_stats.writes_strat, 1);
> > +}
>
> I don't think we want to use atomic increments here - they're *slow*. And
> there only ever can be a single writer to a backend's stats. So just doing
> something like
>     pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1)
> should do the trick.
>

Done
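For the archives, a stand-alone analog of the suggested pattern using C11 atomics rather than PostgreSQL's pg_atomic API (names here are illustrative): because each counter has exactly one writer (the owning backend), the increment can be a plain load-add-store, as in pg_atomic_write_u64(&var, pg_atomic_read_u64(&var) + 1), instead of a slow atomic read-modify-write, while concurrent readers still get torn-free 64-bit values from the atomic load:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t writes_counter;

/* Called only by the single writer (the owning backend), so no
 * read-modify-write atomicity is needed -- just atomic load and store. */
static void counter_increment(_Atomic uint64_t *var)
{
    uint64_t cur = atomic_load_explicit(var, memory_order_relaxed);

    atomic_store_explicit(var, cur + 1, memory_order_relaxed);
}

/* Safe from any concurrent reader: a single atomic load. */
static uint64_t counter_read(_Atomic uint64_t *var)
{
    return atomic_load_explicit(var, memory_order_relaxed);
}
```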

>
> > +/*
> > + * Called for a single backend at the time of death to persist its I/O stats
> > + */
> > +void
> > +pgstat_record_dead_backend_buffer_actions(void)
> > +{
> > +     volatile PgBackendBufferActionStats *ba_stats;
> > +     volatile        PgBackendStatus *beentry = MyBEEntry;
> > +
> > +     if (beentry->st_procpid != 0)
> > +             return;
> > +
> > +     // TODO: is this correct? could there be a data race? do I need a lock?
> > +     ba_stats = &BufferActionStatsArray[beentry->st_backendType];
> > +     pg_atomic_add_fetch_u64(&ba_stats->allocs, pg_atomic_read_u64(&beentry->buffer_action_stats.allocs));
> > +     pg_atomic_add_fetch_u64(&ba_stats->extends, pg_atomic_read_u64(&beentry->buffer_action_stats.extends));
> > +     pg_atomic_add_fetch_u64(&ba_stats->fsyncs, pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs));
> > +     pg_atomic_add_fetch_u64(&ba_stats->writes, pg_atomic_read_u64(&beentry->buffer_action_stats.writes));
> > +     pg_atomic_add_fetch_u64(&ba_stats->writes_strat, pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat));
> > +}
>
> I don't see a race, FWIW.
>
> This is where I propose that we instead report the values up to the stats
> collector, instead of having a separate array that we need to persist
>

Changed

>
> > +/*
> > + * Fill the provided values array with the accumulated counts of buffer actions
> > + * taken by all backends of type backend_type (input parameter), both alive and
> > + * dead. This is currently only used by pg_stat_get_buffer_actions() to create
> > + * the rows in the pg_stat_buffer_actions system view.
> > + */
> > +void
> > +pgstat_recount_all_buffer_actions(BackendType backend_type, Datum *values)
> > +{
> > +     int                     i;
> > +     volatile PgBackendStatus *beentry;
> > +
> > +     /*
> > +      * Add stats from all exited backends
> > +      */
> > +     values[BA_Alloc] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].allocs);
> > +     values[BA_Extend] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].extends);
> > +     values[BA_Fsync] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].fsyncs);
> > +     values[BA_Write] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes);
> > +     values[BA_Write_Strat] = pg_atomic_read_u64(&BufferActionStatsArray[backend_type].writes_strat);
> > +
> > +     /*
> > +      * Loop through all live backends and count their buffer actions
> > +      */
> > +     // TODO: see note in pg_stat_get_buffer_actions() about inefficiency of this method
> > +
> > +     beentry = BackendStatusArray;
> > +     for (i = 1; i <= MaxBackends; i++)
> > +     {
> > +             /* Don't count dead backends. They should already be counted */
> > +             if (beentry->st_procpid == 0)
> > +                     continue;
> > +             if (beentry->st_backendType != backend_type)
> > +                     continue;
> > +
> > +             values[BA_Alloc] += pg_atomic_read_u64(&beentry->buffer_action_stats.allocs);
> > +             values[BA_Extend] += pg_atomic_read_u64(&beentry->buffer_action_stats.extends);
> > +             values[BA_Fsync] += pg_atomic_read_u64(&beentry->buffer_action_stats.fsyncs);
> > +             values[BA_Write] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes);
> > +             values[BA_Write_Strat] += pg_atomic_read_u64(&beentry->buffer_action_stats.writes_strat);
> > +
> > +             beentry++;
> > +     }
> > +}
>
> It seems to make a bit more sense to have this sum up the stats for all
> backend types at once.

Changed.

>
> > +             /*
> > +              * Currently, the only supported backend types for stats are the following.
> > +              * If this were to change, pg_proc.dat would need to be changed as well
> > +              * to reflect the new expected number of rows.
> > +              */
> > +             Datum values[BUFFER_ACTION_NUM_TYPES];
> > +             bool nulls[BUFFER_ACTION_NUM_TYPES];
>
> Ah ;)
>

I just went ahead and made a row for each backend type.

- Melanie

