Re: Show WAL write and fsync stats in pg_stat_io - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Show WAL write and fsync stats in pg_stat_io
Date
Msg-id ZZ4qKwSNxGIyEtzi@paquier.xyz
Whole thread Raw
In response to Re: Show WAL write and fsync stats in pg_stat_io  (Nazir Bilal Yavuz <byavuz81@gmail.com>)
Responses Re: Show WAL write and fsync stats in pg_stat_io
List pgsql-hackers
On Wed, Jan 03, 2024 at 04:10:58PM +0300, Nazir Bilal Yavuz wrote:
> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
>> Apologies if my previous wording sounded confusing.  The idea I had in
>> mind was to keep op_bytes in pg_stat_io, and extend it so as a value
>> of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
>> as a number of bytes.
>
> Oh, I understand it now. Yes, that makes sense.
> I thought removing op_bytes completely ( as you said "This patch
> extends it with two more operation sizes, and there are even cases
> where it may be a variable" ) from pg_stat_io view then adding
> something like {read | write | extend}_bytes and {read | write |
> extend}_calls could be better, so that we don't lose any information.

But then you'd lose the possibility to analyze correlations between
the size and the number of the operations, which is something that
matters for more complex I/O scenarios.  This does not need to be
tackled in this patch, which is useful on its own, though I am really
wondering if this is required for the recent work done by Thomas.
Perhaps Andres, Thomas or Melanie could comment on that?

>> Yeah, a limitation like that may be acceptable for now.  Tracking the
>> WAL writer and WAL sender activities can be relevant in a lot of cases
>> even if we don't have the full picture for the WAL receiver yet.
>
> I added that and disabled B_WAL_RECEIVER backend with comments
> explaining why. v8 is attached.

I can see that's what you have been adding here, which should be OK:

> -    if (track_io_timing)
> +    /*
> +     * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp IOs
> +     * but these IOs are not countable for now because IOOP_READ IOs' op_bytes
> +     * (number of bytes per unit of I/O) might not be the same all the time.
> +     * The current implementation requires that the op_bytes must be the same
> +     * for the same IOObject, IOContext and IOOp. To avoid confusion, the
> +     * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
> +     * now.
> +     */
> +    if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> +        return;

This could be worded better, but that's one of these nits from me I
usually tweak when committing stuff.

> +/*
> + * Decide if IO timings need to be tracked.  Timings associated to
> + * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
> + * else rely on track_io_timing.
> + */
> +static bool
> +pgstat_should_track_io_time(IOObject io_object)
> +{
> +    if (io_object == IOOBJECT_WAL)
> +        return track_wal_io_timing;
> +
> +    return track_io_timing;
> +}

One thing I was also considering is if eliminating this routine would
make pgstat_count_io_op_time() more readable the result, but I cannot
get to that.

>          if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
>          {
> -            pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> +            if (io_object != IOOBJECT_WAL)
> +                pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> +
>              if (io_object == IOOBJECT_RELATION)
>                  INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
>              else if (io_object == IOOBJECT_TEMP_RELATION)
> @@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
>          }
>          else if (io_op == IOOP_READ)
>          {
> -            pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> +            if (io_object != IOOBJECT_WAL)
> +                pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> +
>              if (io_object == IOOBJECT_RELATION)
>                  INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
>              else if (io_object == IOOBJECT_TEMP_RELATION)

A second thing is if this would be better with more switch/cases, say:
switch (io_op):
{
    case IOOP_EXTEND:
    case IOOP_WRITE:
        switch (io_object):
    {
        case WAL:
                /* do nothing */
                break;
        case RELATION:
        case TEMP:
            .. blah ..
    }
        break;
    case IOOP_READ:
        switch (io_object):
    {
        .. blah ..
    }
        break;
}

Or just this one to make it clear that nothing happens for WAL
objects:
switch (io_object):
{
   case WAL:
       /* do nothing */
       break;
   case RELATION:
       switch (io_op):
       {
           case IOOP_EXTEND:
       case IOOP_WRITE:
           .. blah ..
       case IOOP_READ:
           .. blah ..
       }
       break;
   case TEMP:
       /* same switch as RELATION */
       break;
}

This duplicates a bit things, but at least in the second case it's
clear which counters are updated when I/O timings are tracked.  It's
OK by me if people don't like this suggestion, but that would avoid
bugs like the one I found upthread.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: pg_upgrade and logical replication
Next
From: Bharath Rupireddy
Date:
Subject: Re: Make NUM_XLOGINSERT_LOCKS configurable