Re: Show WAL write and fsync stats in pg_stat_io - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Show WAL write and fsync stats in pg_stat_io
Date
Msg-id 20230809185233.k6xyvuthtauxsyr7@liskov
Whole thread Raw
In response to Re: Show WAL write and fsync stats in pg_stat_io  (Nazir Bilal Yavuz <byavuz81@gmail.com>)
Responses Re: Show WAL write and fsync stats in pg_stat_io
List pgsql-hackers
On Thu, Aug 03, 2023 at 04:38:41PM +0300, Nazir Bilal Yavuz wrote:
> Current status of the patch is:
> - 'WAL read' stats in xlogrecovery.c are added to pg_stat_io.
> - IOCONTEXT_INIT is added to count 'WAL init'. 'WAL init' stats are
> added to pg_stat_io.
> - pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
> - Working on which 'BackendType / IOContext / IOOp' should be banned
> in pg_stat_io.
> - Working on adding 'WAL read' to the xlogreader.c and walsender.c.
> - PendingWalStats.wal_sync and
> PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time are moved
> to pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively.

Cool! Thanks for the summary and for continuing to work on this.

> TODOs:
> - Documentation.
> - Thinking about how to interpret the different IOOps within the two
> IOContexts and discussing what would be useful to count.
> - Decide which 'BackendType / IOContext / IOOp' should not be tracked.
> - Adding 'WAL read' to the xlogreader.c and walsender.c. (This could
> be an another patch)

Yes, I would be explicit that you are not including WAL IO done exclusively in
the context of replication.

> - Adding WAIT_EVENT_WAL_COPY_* operations to pg_stat_io if needed.
> (This could be an another patch)

Yes, I think it makes sense as another patch.

> 
> On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I think it would be good to count WAL reads even though they are not
> > currently represented in pg_stat_wal. Here is a thread discussing this
> > [1].
> 
> I used the same implementation in the thread link [1]. I added 'WAL
> read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> xlogreader.c and walsender.c because they cause some failures on:
> '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> these. Also, I added Bharath to CC. I have a question about 'WAL
> read':
> 1. There are two places where 'WAL read' happens.
> a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
> time count is equal to XLOG_BLCKSZ but there are some cases it is not.
> For example
> - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
> - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
> These nbytes and count variables could be different from XLOG_BLCKSZ.
> 
> b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
> pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> 
> So, what should op_bytes be set to for 'WAL read' operations?

If there is any combination of BackendType and IOContext which will
always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
op_bytes. For other cases, we may have to consider using op_bytes 1 and
tracking reads and write IOOps in number of bytes (instead of number of
pages). I don't actually know if there is a clear separation by
BackendType for these different cases.

The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
treat op_bytes * number of reads as an approximation of the number of
bytes read. I don't actually know what makes more sense. I don't think I
would like having a number for bytes that is not accurate.

> > I think we will also want to add an IOContext for WAL initialization.
> > Then we can track how long is spent doing 'WAL init' (including filling
> > the WAL file with zeroes). XLogFileInitInternal() is likely where we
> > would want to add it. And op_bytes for this would likely be
> > wal_segment_size. I thought I heard about someone proposing adding WAL
> > init to pg_stat_wal, but I can't find the thread.
> 
> Done. I created a new IOCONTEXT_INIT IOContext for the 'WAL init'. I
> have a question there:
> 1. Some of the WAL processes happens at initdb (standalone backend
> IOCONTEXT_NORMAL/(IOOP_READ & IOOP_WRITE) and
> IOCONTEXT_INIT/(IOOP_WRITE & IOOP_FSYNC)). Since this happens at the
> initdb, AFAIK there is no way to set 'track_wal_io_timing' and
> 'track_io_timing' variables there. So, their timings appear as 0.
> Should I use IsBootstrapProcessingMode() to enable WAL io timings at
> the initdb or are they not that much important?

I don't have an opinion about this. I can see an argument for doing it
either way. We do track other IO during initdb in pg_stat_io.

> > Any that we decide not to count for now should be "banned" in
> > pgstat_tracks_io_op() for clarity. For example, if we create a separate
> > IOContext for WAL file init, I'm not sure what would count as an
> > IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.
> >
> > Also, I think there are some backend types which will not generate WAL
> > and we should determine which those are and skip those rows in
> > pgstat_tracks_io_object().
> 
> I agree, I am working on this. I have a couple of questions:
> 1. Can client backend and background worker do IOCONTEXT_NORMAL/IOOP_READ?

I don't know the answer to this.

> 2. Is there an easy way to check if 'BackendType / IOOBJECT_WAL' does
> specific IOOp operations?

I don't think there is a general answer to this. You'll have to look at
the code and think about specific things that backend might do that
would require WAL. I think we'll definitely need other community members
to check our work for the valid combinations.

Completing the matrix of valid combinations of BackendType, IOOp, and
IOContext and defining each one is the biggest area where we could use
help from community members.

As an additional TODO, I would explore adding some tests to prevent
accidental removal of the pg_stat_io WAL tracking.

I think we can easily test IOCONTEXT_NORMAL WAL writes in
src/test/regress/sql/stats.sql (perhaps it is worth checking that
synchronous_commit is on in the test). IOCONTEXT_NORMAL WAL fsyncs
should again be easy to test if synchronous_commit is on and fsync is
on.

I'm not sure how to reliably test WAL reads (given timing). Logically,
you can sum WAL reads before a crash is initiated in one of the tests in
the recovery suite, and then sum them after the db has restarted and
there should definitely be an increase in WAL reads, but I don't know if
we need to do something to guarantee that there will have been WAL reads
(to avoid test flakes).

I'm also not sure how to reliably test any IOCONTEXT_INIT operations. We
need a before and after and I can't think of a cheap operation to ensure
a new WAL segment is written to or fsyncd in between a before and after
for the purposes of testing.

> > diff --git a/src/backend/access/transam/xlog.c
> > b/src/backend/access/transam/xlog.c
> > index 8b0710abe6..2ee6c21398 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID
> > tli, bool flexible)
> >
> > I think we should likely follow the pattern of using
> > pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
> > elsewhere. You could pass the IOObject as a parameter to
> > pgstat_prepare_io_time() in order to determine if we should check
> > track_io_timing or track_wal_io_timing. And we'll want to check
> > track_wal_io_timing if IOObject is IOOBJECT_WAL in
> > pgstat_count_io_op_time().
> 
> Done. Instead of passing parameters to pgstat_prepare_io_time(), I
> used a slightly different implementation. I return the current time if
> there is a chance that any 'time' can be tracked.

Cool!

> From 574fdec6ed8073dbc49053e6933db0310c7c62f5 Mon Sep 17 00:00:00 2001
> From: Nazir Bilal Yavuz <byavuz81@gmail.com>
> Date: Thu, 3 Aug 2023 16:11:16 +0300
> Subject: [PATCH v2] Show WAL stats on pg_stat_io
> 
> This patch aims to showing WAL stats per backend on pg_stat_io view.
> 
> With this patch, it can be seen how many WAL operations it makes, their
> context, types and total timings per backend in pg_stat_io view.

In the commit message, I would describe what kinds of WAL IO this
patchset currently covers -- i.e. not streaming replication WAL IO.

> ---
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 60c0b7ec3af..ee7b85e18ca 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
>                  startoffset += written;
>              } while (nleft > 0);
>  

I'm not sure if the right location is here or in
pgstat_count_io_op_time(), but I would explain why you did not move
PendingWalStats.wal_writes counter into pg_stat_io code (and why you did
move the other PendingWalStats counters there.

> +            pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
> +                                    IOOP_WRITE, io_start, npages);
> +
>              npages = 0;
>  
>              /*
> @@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
>      int            fd;
>      int            save_errno;
>      int            open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
> +    instr_time    io_start;
>  
>      Assert(logtli != 0);
>  
> @@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
>                  (errcode_for_file_access(),
>                   errmsg("could not create file \"%s\": %m", tmppath)));
>  

Since you have two calls to pgstat_prepare_io_time() in this function, I
think it would be nice to have a comment above each to the effect of
"start timing writes for stats" and "start timing fsyncs for stats"

> +    io_start = pgstat_prepare_io_time();
> +
>      pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);

> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> index becc2bda62e..ee850af5514 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
>      XLogRecord *record;
>      bool        reachedRecoveryTarget = false;
>      TimeLineID    replayTLI;
> +    uint32        pgstat_report_wal_frequency = 0;
>  
>      /*
>       * Initialize shared variables for tracking progress of WAL replay, as if
> @@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
>               */
>              ApplyWalRecord(xlogreader, record, &replayTLI);
>  
> +            /*
> +             * Report pending statistics to the cumulative stats system once
> +             * every PGSTAT_REPORT_FREQUENCY times to not hinder performance.
> +             */
> +            if (pgstat_report_wal_frequency++ == PGSTAT_REPORT_FREQUENCY)
> +            {
> +                pgstat_report_wal(false);
> +                pgstat_report_wal_frequency = 0;
> +            }
> +

Is the above needed for your patch to work? What does it do? It should
probably be in a separate commit and should definitely have an
explanation.

> --- a/src/backend/utils/activity/pgstat_io.c
> +++ b/src/backend/utils/activity/pgstat_io.c
> @@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
>      Assert((unsigned int) io_op < IOOP_NUM_TYPES);
>      Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));

I would add a comment here explaining that pg_stat_wal doesn't count WAL
init or WAL reads.

> +    if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> +       io_op == IOOP_FSYNC)
> +        PendingWalStats.wal_sync += cnt;
> +
>      PendingIOStats.counts[io_object][io_context][io_op] += cnt;
>  
>      have_iostats = true;
>  }

> +/*
> + * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
> + * current time if there is a chance that any 'time' can be tracked.
> + */
>  instr_time
>  pgstat_prepare_io_time(void)
>  {
>      instr_time    io_start;
>  
> -    if (track_io_timing)
> +    if(track_io_timing || track_wal_io_timing)
>          INSTR_TIME_SET_CURRENT(io_start);
>      else
>          INSTR_TIME_SET_ZERO(io_start);

Since you asked me off-list why we had to do INSTR_TIME_SET_ZERO() and I
couldn't remember, it is probably worth a comment.

>  pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
>                          instr_time start_time, uint32 cnt)
>  {
> -    if (track_io_timing)
> +    if (pgstat_should_track_io_time(io_object, io_context))
>      {
>          instr_time    io_time;
>  
> @@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
>              pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));

Now that we are adding more if statements to this function, I think we
should start adding more comments.

We should explain what the different counters here are for e.g.
pgBufferUsage for EXPLAIN, PendingWalStats for pg_stat_wal.

We should also explain what is tracked for each and why it differs --
e.g. some track time and some don't, some track only reads or writes,
etc.

Also we should mention why we are consolidating them here. That is, we
want to eventually deduplicate these counters, so we are consolidating
them first. This also makes it easy to compare what is tracked for which
stats or instrumentation purpose.

And for those IO counters that we haven't moved here, we should mention
it is because they track at a different level of granularity or at a
different point in the call stack.

>              if (io_object == IOOBJECT_RELATION)
>                  INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> +            /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
> +            else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
> +                INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
>          }


Also, I would reorder the if statements to be in order of the enum
values (e.g. FSYNC, READ, WRITE).

>          else if (io_op == IOOP_READ)
>          {
> @@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
>              if (io_object == IOOBJECT_RELATION)
>                  INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
>          }
> +        else if (io_op == IOOP_FSYNC)
> +        {
> +            /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */

I wouldn't squeeze this comment here like this. It is hard to read

> +            if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
> +                INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);


> + * op_bytes can change according to IOObject and IOContext.
> + * Return BLCKSZ as default.
> + */
> +int
> +pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context)
> +{

Small typo in function name:
pgstat_get_io_op_btyes -> pgstat_get_io_op_bytes
I'd also mention why BLCKSZ is the default

> +    if (io_object == IOOBJECT_WAL)
> +    {
> +        if (io_context == IOCONTEXT_NORMAL)
> +            return XLOG_BLCKSZ;
> +        else if (io_context == IOCONTEXT_INIT)
> +            return wal_segment_size;
> +    }
> +
> +    return BLCKSZ;
> +}

> @@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
>      if (!pgstat_tracks_io_bktype(bktype))
>          return false;
>  
> +    /*
> +     * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
> +     * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
> +     */
> +    if (io_object == IOOBJECT_WAL &&
> +        (io_context != IOCONTEXT_NORMAL &&

Little bit of errant whitespace here.

>      /*
>       * Currently, IO on temporary relations can only occur in the
>       * IOCONTEXT_NORMAL IOContext.
> @@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
>      if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
>          return false;

I would expand on the comment to explain what NORMAL is for WAL -- what
we consider normal to be and why. And why it is different than INIT.

>  
> +    if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
> +       !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
> +       return false;
> +
> +    if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> +       !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
> +       return false;

These are the first "bans" that we have for an IOOp for a specific
combination of io_context and io_object. We should add a new comment for
this and perhaps consider what ordering makes most sense. I tried to
organize the bans from most broad to most specific at the bottom.

>  
> --- a/src/backend/utils/adt/pgstatfuncs.c
> +++ b/src/backend/utils/adt/pgstatfuncs.c
> @@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
>                   * and constant multipliers, once non-block-oriented IO (e.g.
>                   * temporary file IO) is tracked.
>                   */
> -                values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);

There's a comment above this in the code that says this is hard-coded to
BLCKSZ. That comment needs to be updated or removed (in lieu of the
comment in your pgstat_get_io_op_bytes() function).


> +                op_bytes = pgstat_get_io_op_btyes(io_obj, io_context);
> +                values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
>  

> +extern PGDLLIMPORT bool track_wal_io_timing;
> +extern PGDLLIMPORT int wal_segment_size;

These shouldn't be in two places (i.e. they are already in xlog.h and
you added them in pgstat.h. pg_stat_io.c includes bufmgr.h for
track_io_timing, so you can probably justify including xlog.h.


- Melanie



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Use of additional index columns in rows filtering
Next
From: Christoph Moench-Tegeder
Date:
Subject: Re: Adding a pg_servername() function