Thread: Show WAL write and fsync stats in pg_stat_io

Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

28 June 2023, 10:09:14

Hi,

This is a WIP patch to add WAL write and fsync stats to pg_stat_io
view. There is a track_io_timing variable to control pg_stat_io
timings and a track_wal_io_timing variable to control WAL timings. I
couldn't decide on which logic to enable WAL timings on pg_stat_io.
For now, both pg_stat_io and track_wal_io_timing are needed to be
enabled to track WAL timings in pg_stat_io.

Also, if you compare WAL stats in pg_stat_wal and pg_stat_io; you can
come across differences. These differences are caused by the
background writer's WAL stats not being flushed. Because of that,
background writer's WAL stats are not seen in pg_stat_wal but in
pg_stat_io. I already sent a patch [1] to fix that.

[1] https://www.postgresql.org/message-id/CAN55FZ2FPYngovZstr%3D3w1KSEHe6toiZwrurbhspfkXe5UDocg%40mail.gmail.com

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

v1-0001-WIP-Show-WAL-write-and-fsync-stats-on-pg_stat_io.patch

Re: Show WAL write and fsync stats in pg_stat_io

From

Melanie Plageman

Date:

21 July 2023, 22:30:06

On Wed, Jun 28, 2023 at 6:09 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> This is a WIP patch to add WAL write and fsync stats to pg_stat_io
> view.

Thanks for working on this! I have some feedback on the content of the
patch as well as some items that I feel are missing.

I think it would be good to count WAL reads even though they are not
currently represented in pg_stat_wal. Here is a thread discussing this
[1].

Eventually, the docs will need an update as well. You can wait until a
later version of the patch to do this, but I would include it in a list
of the remaining TODOs in your next version.

I think we will also want to add an IOContext for WAL initialization.
Then we can track how long is spent doing WAL init (including filling
the WAL file with zeroes). XLogFileInitInternal() is likely where we
would want to add it. And op_bytes for this would likely be
wal_segment_size. I thought I heard about someone proposing adding WAL
init to pg_stat_wal, but I can't find the thread.

I think there is also an argument for counting WAL files recycled as
IOOP_REUSES. We should start thinking about how to interpret the
different IOOps within the two IOContexts and discussing what would be
useful to count. For example, should removing a logfile count as an
IOOP_EVICT? Maybe it is not directly related to "IO" enough or even an
interesting statistic, but we should think about what kinds of
IO-related WAL statistics we want to track.

Any that we decide not to count for now should be "banned" in
pgstat_tracks_io_op() for clarity. For example, if we create a separate
IOContext for WAL file init, I'm not sure what would count as an
IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.

Also, I think there are some backend types which will not generate WAL
and we should determine which those are and skip those rows in
pgstat_tracks_io_object().

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 8b0710abe6..2ee6c21398 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID
tli, bool flexible)

I think we should likely follow the pattern of using
pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
elsewhere. You could pass the IOObject as a parameter to
pgstat_prepare_io_time() in order to determine if we should check
track_io_timing or track_wal_io_timing. And we'll want to check
track_wal_io_timing if IOObject is IOOBJECT_WAL in
pgstat_count_io_op_time().

                     INSTR_TIME_SET_CURRENT(duration);

INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration,
start);

+                    pgstat_count_io_op_time(IOOBJECT_WAL,
IOCONTEXT_NORMAL, IOOP_WRITE, start, 1);
+                } else
+                {

Other users of pgstat_count_io_op_time()/io_op_n() which write multiple
pages at a time pass the number of pages in as the cnt parameter. (see
ExtendBufferedRelLocal() as an example). I think we want to do that for
WAL also. In this case, it would be the local variable "npages" and we
can do it outside of this loop.

It is true that the existing WAL stats count wal_writes here. However,
this is essentially counting write system calls, which is probably not
what we want for pg_stat_io. See [2] for a discussion about whether to
count blocks written back or writeback system calls for a previous
pg_stat_io feature. All of the other block-based IO statistics in
pg_stat_io count the number of blocks.

This being said, we probably want to just leave
PendingWalStats.wal_write++ here. We would normally move it into
pg_stat_io like we have with pgBufferUsage and the db IO stats that are
updated in pgstat_count_io_op_time(). This consolidation makes it easier
to eventually reduce the duplication. However, in this case, it seems
wal_write counts something we don't count in pg_stat_io, so it can
probably be left here. I would still move the
PendingWalStats.wal_write_time into pgstat_count_io_op_time(), since
that seems like it is the same as what will be in pg_stat_io.

Also, op_bytes for IOOBJECT_WAL/IOCONTEXT_NORMAL should be XLOG_BLCKSZ
(see comment in pg_stat_get_io() in pgstatfuncs.c). Those default to the
same value but can be made to be different.

+                    pgstat_count_io_op_n(IOOBJECT_WAL,
IOCONTEXT_NORMAL, IOOP_WRITE, 1);
                 }

                 PendingWalStats.wal_write++;

@@ -8233,6 +8237,10 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)

         INSTR_TIME_SET_CURRENT(duration);
         INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, duration, start);
+        pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
IOOP_FSYNC, start, 1);

I would wrap this line and check other lines to make sure they are not
too long.

+    } else
+    {
+        pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, 1);
     }

     PendingWalStats.wal_sync++;

Same feedback as above about using the prepare/count pattern used for
pg_stat_io elsewhere. In this case, you should be able to move
PendingWalStats.wal_sync into there as well.

diff --git a/src/backend/utils/activity/pgstat_io.c
b/src/backend/utils/activity/pgstat_io.c
@@ -350,6 +352,11 @@ pgstat_tracks_io_object(BackendType bktype,
IOObject io_object,
     if (!pgstat_tracks_io_bktype(bktype))
         return false;

+
+    if (io_context != IOCONTEXT_NORMAL &&
+        io_object == IOOBJECT_WAL)
+        return false;

We should add more restrictions. See the top of my email for details.

> There is a track_io_timing variable to control pg_stat_io
> timings and a track_wal_io_timing variable to control WAL timings. I
> couldn't decide on which logic to enable WAL timings on pg_stat_io.
> For now, both pg_stat_io and track_wal_io_timing are needed to be
> enabled to track WAL timings in pg_stat_io.

Hmm. I could see a case where someone doesn't want to incur the
overhead of track_io_timing for regular IO but does want to do so for
WAL because they are interested in a specific issue. I'm not sure
though. I could be convinced otherwise (based on relative overhead,
etc).

> Also, if you compare WAL stats in pg_stat_wal and pg_stat_io; you can
> come across differences. These differences are caused by the
> background writer's WAL stats not being flushed. Because of that,
> background writer's WAL stats are not seen in pg_stat_wal but in
> pg_stat_io. I already sent a patch [1] to fix that.

Cool! Thanks for doing that.

- Melanie

[1]
https://www.postgresql.org/message-id/flat/20230216191138.jotc73lqb7xhfqbi%40awork3.anarazel.de#eb4a641427fa1eb013e9ecdd8648e640
[2] https://www.postgresql.org/message-id/20230504165738.4e2hfoddoels542c%40awork3.anarazel.de

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

03 August 2023, 13:38:41

Hi,

Thanks for the review!

Current status of the patch is:
- 'WAL read' stats in xlogrecovery.c are added to pg_stat_io.
- IOCONTEXT_INIT is added to count 'WAL init'. 'WAL init' stats are
added to pg_stat_io.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned
in pg_stat_io.
- Working on adding 'WAL read' to the xlogreader.c and walsender.c.
- PendingWalStats.wal_sync and
PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time are moved
to pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Thinking about how to interpret the different IOOps within the two
IOContexts and discussing what would be useful to count.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Adding 'WAL read' to the xlogreader.c and walsender.c. (This could
be an another patch)
- Adding WAIT_EVENT_WAL_COPY_* operations to pg_stat_io if needed.
(This could be an another patch)

On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I think it would be good to count WAL reads even though they are not
> currently represented in pg_stat_wal. Here is a thread discussing this
> [1].

I used the same implementation in the thread link [1]. I added 'WAL
read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
xlogreader.c and walsender.c because they cause some failures on:
'!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
these. Also, I added Bharath to CC. I have a question about 'WAL
read':
1. There are two places where 'WAL read' happens.
a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
time count is equal to XLOG_BLCKSZ but there are some cases it is not.
For example
- in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
- in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
These nbytes and count variables could be different from XLOG_BLCKSZ.

b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);

So, what should op_bytes be set to for 'WAL read' operations?

> Eventually, the docs will need an update as well. You can wait until a
> later version of the patch to do this, but I would include it in a list
> of the remaining TODOs in your next version.

Done. I shared TODOs at the top.

> I think we will also want to add an IOContext for WAL initialization.
> Then we can track how long is spent doing 'WAL init' (including filling
> the WAL file with zeroes). XLogFileInitInternal() is likely where we
> would want to add it. And op_bytes for this would likely be
> wal_segment_size. I thought I heard about someone proposing adding WAL
> init to pg_stat_wal, but I can't find the thread.

Done. I created a new IOCONTEXT_INIT IOContext for the 'WAL init'. I
have a question there:
1. Some of the WAL processes happens at initdb (standalone backend
IOCONTEXT_NORMAL/(IOOP_READ & IOOP_WRITE) and
IOCONTEXT_INIT/(IOOP_WRITE & IOOP_FSYNC)). Since this happens at the
initdb, AFAIK there is no way to set 'track_wal_io_timing' and
'track_io_timing' variables there. So, their timings appear as 0.
Should I use IsBootstrapProcessingMode() to enable WAL io timings at
the initdb or are they not that much important?

> I think there is also an argument for counting WAL files recycled as
> IOOP_REUSES. We should start thinking about how to interpret the
> different IOOps within the two IOContexts and discussing what would be
> useful to count. For example, should removing a logfile count as an
> IOOP_EVICT? Maybe it is not directly related to "IO" enough or even an
> interesting statistic, but we should think about what kinds of
> IO-related WAL statistics we want to track.

I added that to TODOs.

> Any that we decide not to count for now should be "banned" in
> pgstat_tracks_io_op() for clarity. For example, if we create a separate
> IOContext for WAL file init, I'm not sure what would count as an
> IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.
>
> Also, I think there are some backend types which will not generate WAL
> and we should determine which those are and skip those rows in
> pgstat_tracks_io_object().

I agree, I am working on this. I have a couple of questions:
1. Can client backend and background worker do IOCONTEXT_NORMAL/IOOP_READ?
2. Is there an easy way to check if 'BackendType / IOOBJECT_WAL' does
specific IOOp operations?

> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index 8b0710abe6..2ee6c21398 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID
> tli, bool flexible)
>
> I think we should likely follow the pattern of using
> pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
> elsewhere. You could pass the IOObject as a parameter to
> pgstat_prepare_io_time() in order to determine if we should check
> track_io_timing or track_wal_io_timing. And we'll want to check
> track_wal_io_timing if IOObject is IOOBJECT_WAL in
> pgstat_count_io_op_time().

Done. Instead of passing parameters to pgstat_prepare_io_time(), I
used a slightly different implementation. I return the current time if
there is a chance that any 'time' can be tracked.

>                      INSTR_TIME_SET_CURRENT(duration);
>
> INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration,
> start);
>
> +                    pgstat_count_io_op_time(IOOBJECT_WAL,
> IOCONTEXT_NORMAL, IOOP_WRITE, start, 1);
> +                } else
> +                {
>
> Other users of pgstat_count_io_op_time()/io_op_n() which write multiple
> pages at a time pass the number of pages in as the cnt parameter. (see
> ExtendBufferedRelLocal() as an example). I think we want to do that for
> WAL also. In this case, it would be the local variable "npages" and we
> can do it outside of this loop.
>
> It is true that the existing WAL stats count wal_writes here. However,
> this is essentially counting write system calls, which is probably not
> what we want for pg_stat_io. See [2] for a discussion about whether to
> count blocks written back or writeback system calls for a previous
> pg_stat_io feature. All of the other block-based IO statistics in
> pg_stat_io count the number of blocks.
>
> This being said, we probably want to just leave
> PendingWalStats.wal_write++ here. We would normally move it into
> pg_stat_io like we have with pgBufferUsage and the db IO stats that are
> updated in pgstat_count_io_op_time(). This consolidation makes it easier
> to eventually reduce the duplication. However, in this case, it seems
> wal_write counts something we don't count in pg_stat_io, so it can
> probably be left here. I would still move the
> PendingWalStats.wal_write_time into pgstat_count_io_op_time(), since
> that seems like it is the same as what will be in pg_stat_io.

Done. I moved PendingWalStats.wal_sync and
PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time to
pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively. Because
of this change, pg_stat_wal's and pg_stat_io's
IOOBJECT_WAL/IOCONTEXT_NORMAL/IOOP_WRITE counts are different but the
rest are the same.

> Also, op_bytes for IOOBJECT_WAL/IOCONTEXT_NORMAL should be XLOG_BLCKSZ
> (see comment in pg_stat_get_io() in pgstatfuncs.c). Those default to the
> same value but can be made to be different.

Done.

> I would wrap this line and check other lines to make sure they are not
> too long.

Done.

>
> +    } else
> +    {
> +        pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, 1);
>      }
>
>      PendingWalStats.wal_sync++;
>
> Same feedback as above about using the prepare/count pattern used for
> pg_stat_io elsewhere. In this case, you should be able to move
> PendingWalStats.wal_sync into there as well.

Done.

> > There is a track_io_timing variable to control pg_stat_io
> > timings and a track_wal_io_timing variable to control WAL timings. I
> > couldn't decide on which logic to enable WAL timings on pg_stat_io.
> > For now, both pg_stat_io and track_wal_io_timing are needed to be
> > enabled to track WAL timings in pg_stat_io.
>
> Hmm. I could see a case where someone doesn't want to incur the
> overhead of track_io_timing for regular IO but does want to do so for
> WAL because they are interested in a specific issue. I'm not sure
> though. I could be convinced otherwise (based on relative overhead,
> etc).

Done. IOOBJECT_WAL uses track_wal_io_timing regardless of
track_io_timing for now.

> [1]
https://www.postgresql.org/message-id/flat/20230216191138.jotc73lqb7xhfqbi%40awork3.anarazel.de#eb4a641427fa1eb013e9ecdd8648e640
> [2] https://www.postgresql.org/message-id/20230504165738.4e2hfoddoels542c%40awork3.anarazel.de

In addition to these, are WAIT_EVENT_WAL_COPY_* operations needed to
be added to pg_stat_io? If the answer is yes, should I add them to the
current patch?

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

v2-0001-Show-WAL-stats-on-pg_stat_io.patch

Re: Show WAL write and fsync stats in pg_stat_io

From

Melanie Plageman

Date:

09 August 2023, 18:52:33

On Thu, Aug 03, 2023 at 04:38:41PM +0300, Nazir Bilal Yavuz wrote:
> Current status of the patch is:
> - 'WAL read' stats in xlogrecovery.c are added to pg_stat_io.
> - IOCONTEXT_INIT is added to count 'WAL init'. 'WAL init' stats are
> added to pg_stat_io.
> - pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
> - Working on which 'BackendType / IOContext / IOOp' should be banned
> in pg_stat_io.
> - Working on adding 'WAL read' to the xlogreader.c and walsender.c.
> - PendingWalStats.wal_sync and
> PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time are moved
> to pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively.

Cool! Thanks for the summary and for continuing to work on this.

> TODOs:
> - Documentation.
> - Thinking about how to interpret the different IOOps within the two
> IOContexts and discussing what would be useful to count.
> - Decide which 'BackendType / IOContext / IOOp' should not be tracked.
> - Adding 'WAL read' to the xlogreader.c and walsender.c. (This could
> be an another patch)

Yes, I would be explicit that you are not including WAL IO done exclusively in
the context of replication.

> - Adding WAIT_EVENT_WAL_COPY_* operations to pg_stat_io if needed.
> (This could be an another patch)

Yes, I think it makes sense as another patch.

> 
> On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > I think it would be good to count WAL reads even though they are not
> > currently represented in pg_stat_wal. Here is a thread discussing this
> > [1].
> 
> I used the same implementation in the thread link [1]. I added 'WAL
> read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> xlogreader.c and walsender.c because they cause some failures on:
> '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> these. Also, I added Bharath to CC. I have a question about 'WAL
> read':
> 1. There are two places where 'WAL read' happens.
> a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
> time count is equal to XLOG_BLCKSZ but there are some cases it is not.
> For example
> - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
> - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
> These nbytes and count variables could be different from XLOG_BLCKSZ.
> 
> b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
> pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> 
> So, what should op_bytes be set to for 'WAL read' operations?

If there is any combination of BackendType and IOContext which will
always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
op_bytes. For other cases, we may have to consider using op_bytes 1 and
tracking reads and write IOOps in number of bytes (instead of number of
pages). I don't actually know if there is a clear separation by
BackendType for these different cases.

The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
treat op_bytes * number of reads as an approximation of the number of
bytes read. I don't actually know what makes more sense. I don't think I
would like having a number for bytes that is not accurate.

> > I think we will also want to add an IOContext for WAL initialization.
> > Then we can track how long is spent doing 'WAL init' (including filling
> > the WAL file with zeroes). XLogFileInitInternal() is likely where we
> > would want to add it. And op_bytes for this would likely be
> > wal_segment_size. I thought I heard about someone proposing adding WAL
> > init to pg_stat_wal, but I can't find the thread.
> 
> Done. I created a new IOCONTEXT_INIT IOContext for the 'WAL init'. I
> have a question there:
> 1. Some of the WAL processes happens at initdb (standalone backend
> IOCONTEXT_NORMAL/(IOOP_READ & IOOP_WRITE) and
> IOCONTEXT_INIT/(IOOP_WRITE & IOOP_FSYNC)). Since this happens at the
> initdb, AFAIK there is no way to set 'track_wal_io_timing' and
> 'track_io_timing' variables there. So, their timings appear as 0.
> Should I use IsBootstrapProcessingMode() to enable WAL io timings at
> the initdb or are they not that much important?

I don't have an opinion about this. I can see an argument for doing it
either way. We do track other IO during initdb in pg_stat_io.

> > Any that we decide not to count for now should be "banned" in
> > pgstat_tracks_io_op() for clarity. For example, if we create a separate
> > IOContext for WAL file init, I'm not sure what would count as an
> > IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.
> >
> > Also, I think there are some backend types which will not generate WAL
> > and we should determine which those are and skip those rows in
> > pgstat_tracks_io_object().
> 
> I agree, I am working on this. I have a couple of questions:
> 1. Can client backend and background worker do IOCONTEXT_NORMAL/IOOP_READ?

I don't know the answer to this.

> 2. Is there an easy way to check if 'BackendType / IOOBJECT_WAL' does
> specific IOOp operations?

I don't think there is a general answer to this. You'll have to look at
the code and think about specific things that backend might do that
would require WAL. I think we'll definitely need other community members
to check our work for the valid combinations.

Completing the matrix of valid combinations of BackendType, IOOp, and
IOContext and defining each one is the biggest area where we could use
help from community members.

As an additional TODO, I would explore adding some tests to prevent
accidental removal of the pg_stat_io WAL tracking.

I think we can easily test IOCONTEXT_NORMAL WAL writes in
src/test/regress/sql/stats.sql (perhaps it is worth checking that
synchronous_commit is on in the test). IOCONTEXT_NORMAL WAL fsyncs
should again be easy to test if synchronous_commit is on and fsync is
on.

I'm not sure how to reliably test WAL reads (given timing). Logically,
you can sum WAL reads before a crash is initiated in one of the tests in
the recovery suite, and then sum them after the db has restarted and
there should definitely be an increase in WAL reads, but I don't know if
we need to do something to guarantee that there will have been WAL reads
(to avoid test flakes).

I'm also not sure how to reliably test any IOCONTEXT_INIT operations. We
need a before and after and I can't think of a cheap operation to ensure
a new WAL segment is written to or fsyncd in between a before and after
for the purposes of testing.

> > diff --git a/src/backend/access/transam/xlog.c
> > b/src/backend/access/transam/xlog.c
> > index 8b0710abe6..2ee6c21398 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID
> > tli, bool flexible)
> >
> > I think we should likely follow the pattern of using
> > pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
> > elsewhere. You could pass the IOObject as a parameter to
> > pgstat_prepare_io_time() in order to determine if we should check
> > track_io_timing or track_wal_io_timing. And we'll want to check
> > track_wal_io_timing if IOObject is IOOBJECT_WAL in
> > pgstat_count_io_op_time().
> 
> Done. Instead of passing parameters to pgstat_prepare_io_time(), I
> used a slightly different implementation. I return the current time if
> there is a chance that any 'time' can be tracked.

Cool!

> From 574fdec6ed8073dbc49053e6933db0310c7c62f5 Mon Sep 17 00:00:00 2001
> From: Nazir Bilal Yavuz <byavuz81@gmail.com>
> Date: Thu, 3 Aug 2023 16:11:16 +0300
> Subject: [PATCH v2] Show WAL stats on pg_stat_io
> 
> This patch aims to showing WAL stats per backend on pg_stat_io view.
> 
> With this patch, it can be seen how many WAL operations it makes, their
> context, types and total timings per backend in pg_stat_io view.

In the commit message, I would describe what kinds of WAL IO this
patchset currently covers -- i.e. not streaming replication WAL IO.

> ---
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 60c0b7ec3af..ee7b85e18ca 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
>                  startoffset += written;
>              } while (nleft > 0);
>  

I'm not sure if the right location is here or in
pgstat_count_io_op_time(), but I would explain why you did not move
PendingWalStats.wal_writes counter into pg_stat_io code (and why you did
move the other PendingWalStats counters there.

> +            pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
> +                                    IOOP_WRITE, io_start, npages);
> +
>              npages = 0;
>  
>              /*
> @@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
>      int            fd;
>      int            save_errno;
>      int            open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
> +    instr_time    io_start;
>  
>      Assert(logtli != 0);
>  
> @@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
>                  (errcode_for_file_access(),
>                   errmsg("could not create file \"%s\": %m", tmppath)));
>  

Since you have two calls to pgstat_prepare_io_time() in this function, I
think it would be nice to have a comment above each to the effect of
"start timing writes for stats" and "start timing fsyncs for stats"

> +    io_start = pgstat_prepare_io_time();
> +
>      pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);

> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> index becc2bda62e..ee850af5514 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
>      XLogRecord *record;
>      bool        reachedRecoveryTarget = false;
>      TimeLineID    replayTLI;
> +    uint32        pgstat_report_wal_frequency = 0;
>  
>      /*
>       * Initialize shared variables for tracking progress of WAL replay, as if
> @@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
>               */
>              ApplyWalRecord(xlogreader, record, &replayTLI);
>  
> +            /*
> +             * Report pending statistics to the cumulative stats system once
> +             * every PGSTAT_REPORT_FREQUENCY times to not hinder performance.
> +             */
> +            if (pgstat_report_wal_frequency++ == PGSTAT_REPORT_FREQUENCY)
> +            {
> +                pgstat_report_wal(false);
> +                pgstat_report_wal_frequency = 0;
> +            }
> +

Is the above needed for your patch to work? What does it do? It should
probably be in a separate commit and should definitely have an
explanation.

> --- a/src/backend/utils/activity/pgstat_io.c
> +++ b/src/backend/utils/activity/pgstat_io.c
> @@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
>      Assert((unsigned int) io_op < IOOP_NUM_TYPES);
>      Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));

I would add a comment here explaining that pg_stat_wal doesn't count WAL
init or WAL reads.

> +    if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> +       io_op == IOOP_FSYNC)
> +        PendingWalStats.wal_sync += cnt;
> +
>      PendingIOStats.counts[io_object][io_context][io_op] += cnt;
>  
>      have_iostats = true;
>  }

> +/*
> + * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
> + * current time if there is a chance that any 'time' can be tracked.
> + */
>  instr_time
>  pgstat_prepare_io_time(void)
>  {
>      instr_time    io_start;
>  
> -    if (track_io_timing)
> +    if(track_io_timing || track_wal_io_timing)
>          INSTR_TIME_SET_CURRENT(io_start);
>      else
>          INSTR_TIME_SET_ZERO(io_start);

Since you asked me off-list why we had to do INSTR_TIME_SET_ZERO() and I
couldn't remember, it is probably worth a comment.

>  pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
>                          instr_time start_time, uint32 cnt)
>  {
> -    if (track_io_timing)
> +    if (pgstat_should_track_io_time(io_object, io_context))
>      {
>          instr_time    io_time;
>  
> @@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
>              pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));

Now that we are adding more if statements to this function, I think we
should start adding more comments.

We should explain what the different counters here are for e.g.
pgBufferUsage for EXPLAIN, PendingWalStats for pg_stat_wal.

We should also explain what is tracked for each and why it differs --
e.g. some track time and some don't, some track only reads or writes,
etc.

Also we should mention why we are consolidating them here. That is, we
want to eventually deduplicate these counters, so we are consolidating
them first. This also makes it easy to compare what is tracked for which
stats or instrumentation purpose.

And for those IO counters that we haven't moved here, we should mention
it is because they track at a different level of granularity or at a
different point in the call stack.

>              if (io_object == IOOBJECT_RELATION)
>                  INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> +            /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
> +            else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
> +                INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
>          }


Also, I would reorder the if statements to be in order of the enum
values (e.g. FSYNC, READ, WRITE).

>          else if (io_op == IOOP_READ)
>          {
> @@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
>              if (io_object == IOOBJECT_RELATION)
>                  INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
>          }
> +        else if (io_op == IOOP_FSYNC)
> +        {
> +            /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */

I wouldn't squeeze this comment here like this. It is hard to read

> +            if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
> +                INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);


> + * op_bytes can change according to IOObject and IOContext.
> + * Return BLCKSZ as default.
> + */
> +int
> +pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context)
> +{

Small typo in function name:
pgstat_get_io_op_btyes -> pgstat_get_io_op_bytes
I'd also mention why BLCKSZ is the default

> +    if (io_object == IOOBJECT_WAL)
> +    {
> +        if (io_context == IOCONTEXT_NORMAL)
> +            return XLOG_BLCKSZ;
> +        else if (io_context == IOCONTEXT_INIT)
> +            return wal_segment_size;
> +    }
> +
> +    return BLCKSZ;
> +}

> @@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
>      if (!pgstat_tracks_io_bktype(bktype))
>          return false;
>  
> +    /*
> +     * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
> +     * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
> +     */
> +    if (io_object == IOOBJECT_WAL &&
> +        (io_context != IOCONTEXT_NORMAL &&

Little bit of errant whitespace here.

>      /*
>       * Currently, IO on temporary relations can only occur in the
>       * IOCONTEXT_NORMAL IOContext.
> @@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
>      if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
>          return false;

I would expand on the comment to explain what NORMAL is for WAL -- what
we consider normal to be and why. And why it is different than INIT.

>  
> +    if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
> +       !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
> +       return false;
> +
> +    if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> +       !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
> +       return false;

These are the first "bans" that we have for an IOOp for a specific
combination of io_context and io_object. We should add a new comment for
this and perhaps consider what ordering makes most sense. I tried to
organize the bans from most broad to most specific at the bottom.

>  
> --- a/src/backend/utils/adt/pgstatfuncs.c
> +++ b/src/backend/utils/adt/pgstatfuncs.c
> @@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
>                   * and constant multipliers, once non-block-oriented IO (e.g.
>                   * temporary file IO) is tracked.
>                   */
> -                values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);

There's a comment above this in the code that says this is hard-coded to
BLCKSZ. That comment needs to be updated or removed (in lieu of the
comment in your pgstat_get_io_op_bytes() function).


> +                op_bytes = pgstat_get_io_op_btyes(io_obj, io_context);
> +                values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
>  

> +extern PGDLLIMPORT bool track_wal_io_timing;
> +extern PGDLLIMPORT int wal_segment_size;

These shouldn't be in two places (i.e. they are already in xlog.h and
you added them in pgstat.h. pg_stat_io.c includes bufmgr.h for
track_io_timing, so you can probably justify including xlog.h.


- Melanie

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

20 September 2023, 07:57:48

Hi,

Thanks for the review!

Current status of the patch is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
- IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned in pg_stat_io.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time / PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() / pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Try to set op_bytes for BackendType / IOContext.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
- Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

I am adding tracking of BackendType / IOContext / IOOp as tables, empty cell means it is not decided yet:

IOCONTEXT_NORMAL / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╦═══════╗
║ IOCONTEXT_NORMAL ║ read ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum launcher ║ FALSE ║ FALSE ║ FALSE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum worker ║ FALSE ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ client backend ║ ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background worker ║ ║ ║ ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background writer ║ ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ checkpointer ║ ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ standalone backend ║ TRUE ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ startup ║ TRUE ║ ║ ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walreceiver ║ ║ ║ ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walsender ║ ║ ║ ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walwriter ║ ║ TRUE ║ TRUE ║
╚═════════════════════╩═══════╩═══════╩═══════╝

IOCONTEXT_WAL_INIT / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╗
║ IOCONTEXT_WAL_INIT ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum launcher ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum worker ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ client backend ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ background worker ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ background writer ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ checkpointer ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ standalone backend ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ startup ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ walreceiver ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ walsender ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ walwriter ║ ║ ║
╚═════════════════════╩═══════╩═══════╝

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> > On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > > I think it would be good to count WAL reads even though they are not
> > > currently represented in pg_stat_wal. Here is a thread discussing this
> > > [1].
> >
> > I used the same implementation in the thread link [1]. I added 'WAL
> > read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> > xlogreader.c and walsender.c because they cause some failures on:
> > '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> > these. Also, I added Bharath to CC. I have a question about 'WAL
> > read':
> > 1. There are two places where 'WAL read' happens.
> > a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
> > time count is equal to XLOG_BLCKSZ but there are some cases it is not.
> > For example
> > - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
> > - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
> > These nbytes and count variables could be different from XLOG_BLCKSZ.
> >
> > b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
> > pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> >
> > So, what should op_bytes be set to for 'WAL read' operations?
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

I agree. I will edit that later, added to TODOs.

>
> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
> treat op_bytes * number of reads as an approximation of the number of
> bytes read. I don't actually know what makes more sense. I don't think I
> would like having a number for bytes that is not accurate.

Yes, the prior one makes more sense to me.

>
> > Should I use IsBootstrapProcessingMode() to enable WAL io timings at
> > the initdb or are they not that much important?
>
> I don't have an opinion about this. I can see an argument for doing it
> either way. We do track other IO during initdb in pg_stat_io.

I didn't add it for now. It is an easy change, it could be added later.

>
> As an additional TODO, I would explore adding some tests to prevent
> accidental removal of the pg_stat_io WAL tracking.
>
> I think we can easily test IOCONTEXT_NORMAL WAL writes in
> src/test/regress/sql/stats.sql (perhaps it is worth checking that
> synchronous_commit is on in the test). IOCONTEXT_NORMAL WAL fsyncs
> should again be easy to test if synchronous_commit is on and fsync is
> on.
>
> I'm not sure how to reliably test WAL reads (given timing). Logically,
> you can sum WAL reads before a crash is initiated in one of the tests in
> the recovery suite, and then sum them after the db has restarted and
> there should definitely be an increase in WAL reads, but I don't know if
> we need to do something to guarantee that there will have been WAL reads
> (to avoid test flakes).
>
> I'm also not sure how to reliably test any IOCONTEXT_INIT operations. We
> need a before and after and I can't think of a cheap operation to ensure
> a new WAL segment is written to or fsyncd in between a before and after
> for the purposes of testing.

IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.

For the IOCONTEXT_NORMAL reads and IOCONTEXT_INIT tests, I couldn't find a way to avoid test flakes. I am open to suggestions. I added these to TODOs.

>
> > ---
> > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > index 60c0b7ec3af..ee7b85e18ca 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
> > startoffset += written;
> > } while (nleft > 0);
> >
>
> I'm not sure if the right location is here or in
> pgstat_count_io_op_time(), but I would explain why you did not move
> PendingWalStats.wal_writes counter into pg_stat_io code (and why you did
> move the other PendingWalStats counters there.
>
> > + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
> > + IOOP_WRITE, io_start, npages);
> > +
> > npages = 0;
> >
> > /*
> > @@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
> > int fd;
> > int save_errno;
> > int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
> > + instr_time io_start;
> >
> > Assert(logtli != 0);
> >
> > @@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
> > (errcode_for_file_access(),
> > errmsg("could not create file \"%s\": %m", tmppath)));
> >
>
> Since you have two calls to pgstat_prepare_io_time() in this function, I
> think it would be nice to have a comment above each to the effect of
> "start timing writes for stats" and "start timing fsyncs for stats"

Done.

>
> > + io_start = pgstat_prepare_io_time();
> > +
> > pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
>
> > diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> > index becc2bda62e..ee850af5514 100644
> > --- a/src/backend/access/transam/xlogrecovery.c
> > +++ b/src/backend/access/transam/xlogrecovery.c
> > @@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
> > XLogRecord *record;
> > bool reachedRecoveryTarget = false;
> > TimeLineID replayTLI;
> > + uint32 pgstat_report_wal_frequency = 0;
> >
> > /*
> > * Initialize shared variables for tracking progress of WAL replay, as if
> > @@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
> > */
> > ApplyWalRecord(xlogreader, record, &replayTLI);
> >
> > + /*
> > + * Report pending statistics to the cumulative stats system once
> > + * every PGSTAT_REPORT_FREQUENCY times to not hinder performance.
> > + */
> > + if (pgstat_report_wal_frequency++ == PGSTAT_REPORT_FREQUENCY)
> > + {
> > + pgstat_report_wal(false);
> > + pgstat_report_wal_frequency = 0;
> > + }
> > +
>
> Is the above needed for your patch to work? What does it do? It should
> probably be in a separate commit and should definitely have an
> explanation.

Done, I omit that part.

>
> > --- a/src/backend/utils/activity/pgstat_io.c
> > +++ b/src/backend/utils/activity/pgstat_io.c
> > @@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
> > Assert((unsigned int) io_op < IOOP_NUM_TYPES);
> > Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
>
> I would add a comment here explaining that pg_stat_wal doesn't count WAL
> init or WAL reads.

Done.

>
> > + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> > + io_op == IOOP_FSYNC)
> > + PendingWalStats.wal_sync += cnt;
> > +
> > PendingIOStats.counts[io_object][io_context][io_op] += cnt;
> >
> > have_iostats = true;
> > }
>
> > +/*
> > + * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
> > + * current time if there is a chance that any 'time' can be tracked.
> > + */
> > instr_time
> > pgstat_prepare_io_time(void)
> > {
> > instr_time io_start;
> >
> > - if (track_io_timing)
> > + if(track_io_timing || track_wal_io_timing)
> > INSTR_TIME_SET_CURRENT(io_start);
> > else
> > INSTR_TIME_SET_ZERO(io_start);
>
> Since you asked me off-list why we had to do INSTR_TIME_SET_ZERO() and I
> couldn't remember, it is probably worth a comment.

Done.

>
> > pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> > instr_time start_time, uint32 cnt)
> > {
> > - if (track_io_timing)
> > + if (pgstat_should_track_io_time(io_object, io_context))
> > {
> > instr_time io_time;
> >
> > @@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
>
> Now that we are adding more if statements to this function, I think we
> should start adding more comments.
>
> We should explain what the different counters here are for e.g.
> pgBufferUsage for EXPLAIN, PendingWalStats for pg_stat_wal.
>
> We should also explain what is tracked for each and why it differs --
> e.g. some track time and some don't, some track only reads or writes,
> etc.
>
> Also we should mention why we are consolidating them here. That is, we
> want to eventually deduplicate these counters, so we are consolidating
> them first. This also makes it easy to compare what is tracked for which
> stats or instrumentation purpose.
>
> And for those IO counters that we haven't moved here, we should mention
> it is because they track at a different level of granularity or at a
> different point in the call stack.

Done.

>
> > if (io_object == IOOBJECT_RELATION)
> > INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> > + /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
> > + else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
> > + INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
> > }
>
>
> Also, I would reorder the if statements to be in order of the enum
> values (e.g. FSYNC, READ, WRITE).

Done.

>
> > else if (io_op == IOOP_READ)
> > {
> > @@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> > if (io_object == IOOBJECT_RELATION)
> > INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
> > }
> > + else if (io_op == IOOP_FSYNC)
> > + {
> > + /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
>
> I wouldn't squeeze this comment here like this. It is hard to read

Done.

>
> > + if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
> > + INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);
>
>
> > + * op_bytes can change according to IOObject and IOContext.
> > + * Return BLCKSZ as default.
> > + */
> > +int
> > +pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context)
> > +{
>
> Small typo in function name:
> pgstat_get_io_op_btyes -> pgstat_get_io_op_bytes
> I'd also mention why BLCKSZ is the default

Done.

>
> > + if (io_object == IOOBJECT_WAL)
> > + {
> > + if (io_context == IOCONTEXT_NORMAL)
> > + return XLOG_BLCKSZ;
> > + else if (io_context == IOCONTEXT_INIT)
> > + return wal_segment_size;
> > + }
> > +
> > + return BLCKSZ;
> > +}
>
> > @@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
> > if (!pgstat_tracks_io_bktype(bktype))
> > return false;
> >
> > + /*
> > + * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
> > + * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
> > + */
> > + if (io_object == IOOBJECT_WAL &&
> > + (io_context != IOCONTEXT_NORMAL &&
>
> Little bit of errant whitespace here.

Done.

>
> > /*
> > * Currently, IO on temporary relations can only occur in the
> > * IOCONTEXT_NORMAL IOContext.
> > @@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
> > if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
> > return false;
>
> I would expand on the comment to explain what NORMAL is for WAL -- what
> we consider normal to be and why. And why it is different than INIT.

Done.

>
> >
> > + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
> > + !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
> > + return false;
> > +
> > + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> > + !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
> > + return false;
>
> These are the first "bans" that we have for an IOOp for a specific
> combination of io_context and io_object. We should add a new comment for
> this and perhaps consider what ordering makes most sense. I tried to
> organize the bans from most broad to most specific at the bottom.

Done.

>
> >
> > --- a/src/backend/utils/adt/pgstatfuncs.c
> > +++ b/src/backend/utils/adt/pgstatfuncs.c
> > @@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
> > * and constant multipliers, once non-block-oriented IO (e.g.
> > * temporary file IO) is tracked.
> > */
> > - values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
>
> There's a comment above this in the code that says this is hard-coded to
> BLCKSZ. That comment needs to be updated or removed (in lieu of the
> comment in your pgstat_get_io_op_bytes() function).

Done.

>
>
> > + op_bytes = pgstat_get_io_op_btyes(io_obj, io_context);
> > + values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
> >
>
> > +extern PGDLLIMPORT bool track_wal_io_timing;
> > +extern PGDLLIMPORT int wal_segment_size;
>
> These shouldn't be in two places (i.e. they are already in xlog.h and
> you added them in pgstat.h. pg_stat_io.c includes bufmgr.h for
> track_io_timing, so you can probably justify including xlog.h.

Done.

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

26 October 2023, 06:28:32

On Wed, Sep 20, 2023 at 10:57:48AM +0300, Nazir Bilal Yavuz wrote:
> Any kind of feedback would be appreciated.

This was registered in the CF, so I have given it a look.  Note that
0001 has a conflict with pgstat_count_io_op_time(), so it cannot be
applied.

+pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
+{
+    /*
+     * io times of IOOBJECT_WAL IOObject needs to be tracked when
+     * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
+     */
+    if (io_object == IOOBJECT_WAL)
+        return track_wal_io_timing;
+
+    return track_io_timing;

I can see the temptation to do that, but I have mixed feelings about
the approach of mixing two GUCs in a code path dedicated to pg_stat_io
where now we only rely on track_io_timing.  The result brings
confusion, while making pg_stat_io, which is itself only used for
block-based operations, harder to read.

The suggestion I am seeing here to have a pg_stat_io_wal (with a SRF)
is quite tempting, actually, creating a neat separation between the
existing pg_stat_io and pg_stat_wal (not a SRF), with a third view
that provides more details about the contexts and backend types for
the WAL stats with its relevant fields:
https://www.postgresql.org/message-id/CAAKRu_bM55pj3pPRW0nd_-paWHLRkOU69r816AeztBBa-N1HLA@mail.gmail.com

And perhaps just putting that everything that calls
pgstat_count_io_op_time() under track_io_timing is just natural?
What's the performance regression you would expect if both WAL and
block I/O are controlled by that, still one would expect only one of
them?

On top of that pg_stat_io is now for block-based I/O operations, so
that does not fit entirely in the picture, though I guess that Melanie
has thought more on the matter than me.  That may be also a matter of
taste.

+      /*  Report pending statistics to the cumulative stats system */
+      pgstat_report_wal(false);

This is hidden in 0001, still would be better if handled as a patch on
its own and optionally backpatch it as we did for the bgwriter with
e64c733bb1?

Side note: I think that we should spend more efforts in documenting
what IOContext and IOOp mean.  Not something directly related to this
patch, still this patch or things similar make it a bit harder which
part of it is used for what by reading pgstat.h.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

31 October 2023, 13:57:57

Hi,

Thank you for the feedback!

On Thu, 26 Oct 2023 at 09:28, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Sep 20, 2023 at 10:57:48AM +0300, Nazir Bilal Yavuz wrote:
> > Any kind of feedback would be appreciated.
>
> This was registered in the CF, so I have given it a look.  Note that
> 0001 has a conflict with pgstat_count_io_op_time(), so it cannot be
> applied.
>
> +pgstat_should_track_io_time(IOObject io_object, IOContext io_context)
> +{
> +       /*
> +        * io times of IOOBJECT_WAL IOObject needs to be tracked when
> +        * 'track_wal_io_timing' is set regardless of 'track_io_timing'.
> +        */
> +       if (io_object == IOOBJECT_WAL)
> +               return track_wal_io_timing;
> +
> +       return track_io_timing;
>
> I can see the temptation to do that, but I have mixed feelings about
> the approach of mixing two GUCs in a code path dedicated to pg_stat_io
> where now we only rely on track_io_timing.  The result brings
> confusion, while making pg_stat_io, which is itself only used for
> block-based operations, harder to read.
>
> The suggestion I am seeing here to have a pg_stat_io_wal (with a SRF)
> is quite tempting, actually, creating a neat separation between the
> existing pg_stat_io and pg_stat_wal (not a SRF), with a third view
> that provides more details about the contexts and backend types for
> the WAL stats with its relevant fields:
> https://www.postgresql.org/message-id/CAAKRu_bM55pj3pPRW0nd_-paWHLRkOU69r816AeztBBa-N1HLA@mail.gmail.com
>
> And perhaps just putting that everything that calls
> pgstat_count_io_op_time() under track_io_timing is just natural?
> What's the performance regression you would expect if both WAL and
> block I/O are controlled by that, still one would expect only one of
> them?

I will check these and I hope I will come back with something meaningful.

>
> +      /*  Report pending statistics to the cumulative stats system */
> +      pgstat_report_wal(false);
>
> This is hidden in 0001, still would be better if handled as a patch on
> its own and optionally backpatch it as we did for the bgwriter with
> e64c733bb1?

I thought about it again and found the use of
'pgstat_report_wal(false);' here wrong. This was mainly for flushing
WAL stats because of the WAL reads but pg_stat_wal doesn't have WAL
read stats, so there is no need to flush WAL stats here. I think this
should be replaced with 'pgstat_flush_io(false);'.

>
> Side note: I think that we should spend more efforts in documenting
> what IOContext and IOOp mean.  Not something directly related to this
> patch, still this patch or things similar make it a bit harder which
> part of it is used for what by reading pgstat.h.

I agree.

Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

06 November 2023, 12:35:01

Hi,

On Tue, 31 Oct 2023 at 16:57, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> On Thu, 26 Oct 2023 at 09:28, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > And perhaps just putting that everything that calls
> > pgstat_count_io_op_time() under track_io_timing is just natural?
> > What's the performance regression you would expect if both WAL and
> > block I/O are controlled by that, still one would expect only one of
> > them?
>
> I will check these and I hope I will come back with something meaningful.

I applied the patches on upstream postgres and then run pgbench for each available clock sources couple of times:

# Set fsync = off and track_io_timing = on

# pgbench -i -s 100 test
# pgbench -M prepared -c16 -j8 -f <( echo "SELECT pg_logical_emit_message(true, \:client_id::text, '1234567890');") -T60 test

Results are:

╔═════════╦═══════════════════════════════╦════════╗
║ ║ track_wal_io_timing ║ ║
╠═════════╬═══════════════╦═══════════════╬════════╣
║ clock ║ on ║ off ║ change ║
║ sources ║ ║ ║ ║
╠═════════╬═══════════════╬═══════════════╬════════╣
║ tsc ║ ║ ║ ║
║ ║ 514814.459170 ║ 519826.284139 ║ %1 ║
╠═════════╬═══════════════╬═══════════════╬════════╣
║ hpet ║ ║ ║ ║
║ ║ 132116.272121 ║ 141820.548447 ║ %7 ║
╠═════════╬═══════════════╬═══════════════╬════════╣
║ acpi_pm ║ ║ ║ ║
║ ║ 394793.092255 ║ 403723.874719 ║ %2 ║
╚═════════╩═══════════════╩═══════════════╩════════╝

Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

07 November 2023, 03:25:27

On Mon, Nov 06, 2023 at 03:35:01PM +0300, Nazir Bilal Yavuz wrote:
> Results are:
>
> ╔═════════╦═══════════════════════════════╦════════╗
> ║         ║      track_wal_io_timing      ║        ║
> ╠═════════╬═══════════════╦═══════════════╬════════╣
> ║  clock  ║       on      ║      off      ║ change ║
> ║ sources ║               ║               ║        ║
> ╠═════════╬═══════════════╬═══════════════╬════════╣
> ║   tsc   ║               ║               ║        ║
> ║         ║ 514814.459170 ║ 519826.284139 ║   %1   ║
> ╠═════════╬═══════════════╬═══════════════╬════════╣
> ║   hpet  ║               ║               ║        ║
> ║         ║ 132116.272121 ║ 141820.548447 ║   %7   ║
> ╠═════════╬═══════════════╬═══════════════╬════════╣
> ║ acpi_pm ║               ║               ║        ║
> ║         ║ 394793.092255 ║ 403723.874719 ║   %2   ║
> ╚═════════╩═══════════════╩═══════════════╩════════╝

Thanks for the tests.  That's indeed noticeable under this load.
Better to keep track_io_timing and track_wal_io_timing as two
separated beasts, at least that's clear.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Andres Freund

Date:

07 November 2023, 23:30:48

Hi,

On 2023-10-26 15:28:32 +0900, Michael Paquier wrote:
> On top of that pg_stat_io is now for block-based I/O operations, so
> that does not fit entirely in the picture, though I guess that Melanie
> has thought more on the matter than me.  That may be also a matter of
> taste.

I strongly disagree. A significant part of the design of pg_stat_io was to
make it possible to collect multiple sources of IO in a single view, so that
sysadmins don't have to look in dozens of places to figure out what is causing
what kind of IO.

We should over time collect all sources of IO in pg_stat_io. For some things
we might want to also have more detailed information in other views (e.g. it
doesn't make sense to track FPIs in pg_stat_io, but does make sense in
pg_stat_wal) - but that should be in addition, not instead of.

Greetings,

Andres Freund

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

08 November 2023, 00:52:16

On Tue, Nov 07, 2023 at 03:30:48PM -0800, Andres Freund wrote:
> I strongly disagree. A significant part of the design of pg_stat_io was to
> make it possible to collect multiple sources of IO in a single view, so that
> sysadmins don't have to look in dozens of places to figure out what is causing
> what kind of IO.

Okay.  Point taken.

> We should over time collect all sources of IO in pg_stat_io. For some things
> we might want to also have more detailed information in other views (e.g. it
> doesn't make sense to track FPIs in pg_stat_io, but does make sense in
> pg_stat_wal) - but that should be in addition, not instead of.

Sure.  I understand here that you mean the number of FPIs counted when
a record is inserted, different from the path where we decide to write
and/or flush WAL.  The proposed patch seems to be a bit inconsistent
regarding wal_sync_time, by the way.

By the way, if the write/sync quantities and times begin to be tracked
by pg_stat_io, I'd see a pretty good argument in removing the
equivalent columns in pg_stat_wal.  It looks like this would reduce
the confusion related to the handling of PendingWalStats added in
pgstat_io.c, for one.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Andres Freund

Date:

08 November 2023, 01:19:28

Hi,

On 2023-11-08 09:52:16 +0900, Michael Paquier wrote:
> By the way, if the write/sync quantities and times begin to be tracked
> by pg_stat_io, I'd see a pretty good argument in removing the
> equivalent columns in pg_stat_wal.  It looks like this would reduce
> the confusion related to the handling of PendingWalStats added in
> pgstat_io.c, for one.

Another approach would be to fetch the relevant columns from pg_stat_io in the
pg_stat_wal view. That'd avoid double accounting and breaking existing
monitoring.

Greetings,

Andres Freund

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

08 November 2023, 01:27:44

On Tue, Nov 07, 2023 at 05:19:28PM -0800, Andres Freund wrote:
> Another approach would be to fetch the relevant columns from pg_stat_io in the
> pg_stat_wal view. That'd avoid double accounting and breaking existing
> monitoring.

Yep, I'd be OK with that as well to maintain compatibility.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

08 November 2023, 05:59:07

On Wed, Nov 08, 2023 at 10:27:44AM +0900, Michael Paquier wrote:
> Yep, I'd be OK with that as well to maintain compatibility.

By the way, note that the patch is failing to apply, and that I've
switched it as waiting on author on 10/26.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Bharath Rupireddy

Date:

08 November 2023, 07:34:37

On Wed, Sep 20, 2023 at 1:28 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thanks for the review!
>
> Current status of the patch is:
> - IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
> - IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
> - IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
> - pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
> - Working on which 'BackendType / IOContext / IOOp' should be banned in pg_stat_io.
> - PendingWalStats.wal_sync and PendingWalStats.wal_write_time / PendingWalStats.wal_sync_time are moved to
pgstat_count_io_op_n()/ pgstat_count_io_op_time() respectively. 
>
> TODOs:
> - Documentation.
> - Try to set op_bytes for BackendType / IOContext.
> - Decide which 'BackendType / IOContext / IOOp' should not be tracked.
> - Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
> - Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

This patchset currently covers:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync.
- IOOBJECT_WAL / IOCONTEXT_INIT write and fsync.

doesn't cover:
- Streaming replication WAL IO.

Is there any plan to account for WAL read stats in the WALRead()
function which will cover walsenders i.e. WAL read by logical and
streaming replication, WAL read by pg_walinspect and so on? I see the
patch already covers WAL read stats by recovery in XLogPageRead(), but
not other page_read callbacks which will end up in WALRead()
eventually. If added, the feature at
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
can then extend it to cover WAL read from WAL buffer stats.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

09 November 2023, 09:35:46

Hi,

Thanks for all the feedback!

On Wed, 8 Nov 2023 at 08:59, Michael Paquier <michael@paquier.xyz> wrote:
>
> By the way, note that the patch is failing to apply, and that I've
> switched it as waiting on author on 10/26.

Here is an updated patchset in attachment. Rebased on the latest HEAD
and changed 'pgstat_report_wal(false)' to 'pgstat_flush_io(false)' in
xlogrecovery.c. I will share the new version of the patchset once I
address the feedback.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

09 November 2023, 11:39:26

Hi,

On Wed, 8 Nov 2023 at 04:19, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-11-08 09:52:16 +0900, Michael Paquier wrote:
> > By the way, if the write/sync quantities and times begin to be tracked
> > by pg_stat_io, I'd see a pretty good argument in removing the
> > equivalent columns in pg_stat_wal.  It looks like this would reduce
> > the confusion related to the handling of PendingWalStats added in
> > pgstat_io.c, for one.
>
> Another approach would be to fetch the relevant columns from pg_stat_io in the
> pg_stat_wal view. That'd avoid double accounting and breaking existing
> monitoring.

There are some differences between pg_stat_wal and pg_stat_io while
collecting WAL stats. For example in the XLogWrite() function in the
xlog.c file, pg_stat_wal counts wal_writes as write system calls. This
is not something we want for pg_stat_io since pg_stat_io counts the
number of blocks rather than the system calls, so instead incremented
pg_stat_io by npages.

Could that cause a problem since pg_stat_wal's behaviour will be
changed? Of course, as an alternative we could change pg_stat_io's
behaviour but in the end either pg_stat_wal's or pg_stat_io's
behaviour will be changed.

Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

20 November 2023, 07:47:19

On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote:
> There are some differences between pg_stat_wal and pg_stat_io while
> collecting WAL stats. For example in the XLogWrite() function in the
> xlog.c file, pg_stat_wal counts wal_writes as write system calls. This
> is not something we want for pg_stat_io since pg_stat_io counts the
> number of blocks rather than the system calls, so instead incremented
> pg_stat_io by npages.
>
> Could that cause a problem since pg_stat_wal's behaviour will be
> changed? Of course, as an alternative we could change pg_stat_io's
> behaviour but in the end either pg_stat_wal's or pg_stat_io's
> behaviour will be changed.

Yep, that could be confusing for existing applications that track the
information of pg_stat_wal.  The number of writes is not something
that can be correctly shared between both.  The timings for the writes
and the syncs could be shared at least, right?

This slightly relates to pgstat_count_io_op_n() in your latest patch,
where it feels a bit weird to see an update of
PendingWalStats.wal_sync sit in the middle of a routine dedicated to
pg_stat_io..  I am not completely sure what's the right balance here,
but I would try to implement things so as pg_stat_io paths does not
need to know about PendingWalStats.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

20 November 2023, 14:43:17

Hi,

Thanks for the feedback.

On Mon, 20 Nov 2023 at 10:47, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Nov 09, 2023 at 02:39:26PM +0300, Nazir Bilal Yavuz wrote:
> > There are some differences between pg_stat_wal and pg_stat_io while
> > collecting WAL stats. For example in the XLogWrite() function in the
> > xlog.c file, pg_stat_wal counts wal_writes as write system calls. This
> > is not something we want for pg_stat_io since pg_stat_io counts the
> > number of blocks rather than the system calls, so instead incremented
> > pg_stat_io by npages.
> >
> > Could that cause a problem since pg_stat_wal's behaviour will be
> > changed? Of course, as an alternative we could change pg_stat_io's
> > behaviour but in the end either pg_stat_wal's or pg_stat_io's
> > behaviour will be changed.
>
> Yep, that could be confusing for existing applications that track the
> information of pg_stat_wal.  The number of writes is not something
> that can be correctly shared between both.  The timings for the writes
> and the syncs could be shared at least, right?

Yes, the timings for the writes and the syncs should work. Another
question I have in mind is the pg_stat_reset_shared() function. When
we call it with 'io' it will reset pg_stat_wal's timings and when we
call it with 'wal' it won't reset them, right?

>
> This slightly relates to pgstat_count_io_op_n() in your latest patch,
> where it feels a bit weird to see an update of
> PendingWalStats.wal_sync sit in the middle of a routine dedicated to
> pg_stat_io..  I am not completely sure what's the right balance here,
> but I would try to implement things so as pg_stat_io paths does not
> need to know about PendingWalStats.

Write has block vs system calls differentiation but it is the same for
sync. Because of that I put PendingWalStats.wal_sync to pg_stat_io but
I agree that it looks a bit weird.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

21 November 2023, 00:26:00

On Mon, Nov 20, 2023 at 05:43:17PM +0300, Nazir Bilal Yavuz wrote:
> Yes, the timings for the writes and the syncs should work. Another
> question I have in mind is the pg_stat_reset_shared() function. When
> we call it with 'io' it will reset pg_stat_wal's timings and when we
> call it with 'wal' it won't reset them, right?

pg_stat_reset_shared() with a target is IMO a very edge case, so I'm
OK with the approach of resetting timings in pg_stat_wal even if 'io'
was implied because pg_stat_wal would feed partially from pg_stat_io.
I'd take that as a side-cost in favor of compatibility while making
the stats gathering cheaper overall.  I'm OK as well if people
counter-argue on this point, though that would mean to keep entirely
separate views with duplicated fields that serve the same purpose,
impacting all deployments because it would make the stats gathering
heavier for all.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

01 December 2023, 08:30:08

Hi,

On Wed, 8 Nov 2023 at 10:34, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Is there any plan to account for WAL read stats in the WALRead()
> function which will cover walsenders i.e. WAL read by logical and
> streaming replication, WAL read by pg_walinspect and so on? I see the
> patch already covers WAL read stats by recovery in XLogPageRead(), but
> not other page_read callbacks which will end up in WALRead()
> eventually. If added, the feature at
> https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
> can then extend it to cover WAL read from WAL buffer stats.

Yes, I am planning to create a patch for that after this patch is
done. Thanks for informing!

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

01 December 2023, 09:02:05

Hi,

Thanks for all the feedback. I am sharing the new version of the patchset.

Current status of the patchset is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write, fsync stats and their
tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats and their tests are added.
- Documentation is updated.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.

Updates & Discussion items:
- Try to set op_bytes for BackendType / IOContext: I think we don't
need this now, we will need this when we add streaming replication WAL
IOs.

- Decide which 'BackendType / IOContext / IOOp' should not be tracked:
-- IOOBJECT_WAL / IOCONTEXT_INIT + IOCONTEXT_NORMAL / write and fsync
IOs can be done on every backend that tracks IO statistics. Because of
that and since we have a pgstat_tracks_io_bktype(bktype) check, I
didn't add another check for this.
-- I found that only the standalone backend and startup backend do
IOOBJECT_WAL / IOCONTEXT_NORMAL / read IOs. So, I added a check for
that but I am not sure if there are more backends that do WAL reads on
WAL recovery.

- For the IOOBJECT_WAL / IOCONTEXT_INIT and IOOBJECT_WAL /
IOCONTEXT_NORMAL / read tests, I used initial WAL IOs to check these
stats. I am not sure if that is the correct way or enough to test
these stats.

- To not calculate WAL timings on pg_stat_wal and pg_stat_io view,
pg_stat_wal view's WAL timings are fetched from pg_stat_io. Since
these timings are fetched from pg_stat_io, pg_stat_reset_shared('io')
will reset pg_stat_wal's timings too.

- I didn't move 'PendingWalStats.wal_sync' out from the
'pgstat_count_io_op_n' function because they count the same thing
(block vs system calls) but I agree that this doesn't look good.

Any kind of feedback would be appreciated.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

05 December 2023, 06:16:02

On Fri, Dec 01, 2023 at 12:02:05PM +0300, Nazir Bilal Yavuz wrote:
> Thanks for all the feedback. I am sharing the new version of the patchset.
>
> - I didn't move 'PendingWalStats.wal_sync' out from the
> 'pgstat_count_io_op_n' function because they count the same thing
> (block vs system calls) but I agree that this doesn't look good.

-       if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+       if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE)

Unrelated diff.

+   if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+       io_op == IOOP_FSYNC)
+       PendingWalStats.wal_sync += cnt;

Nah, I really don't think that adding this dependency within
pg_stat_io is a good idea.

-   PendingWalStats.wal_sync++;
+   pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+                           io_start, 1);

This is the only caller where this matters, and the count is always 1.

+    no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER ||
+        bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND ||
+        bktype == B_BG_WORKER || bktype == B_BG_WRITER ||
+        bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER ||
+        bktype == B_WAL_SENDER || bktype == B_WAL_WRITER;
+
+    if (no_wal_normal_read &&
+        (io_object == IOOBJECT_WAL &&
+         io_op == IOOP_READ))
+        return false;

This may be more readable if an enum is applied, without a default
clause so as it would not be forgotten if a new type is added, perhaps
in its own little routine.

-   if (track_io_timing)
+   if (track_io_timing || track_wal_io_timing)
        INSTR_TIME_SET_CURRENT(io_start);
    else

This interface from pgstat_prepare_io_time() is not really good,
because we could finish by setting io_start in the existing code paths
calling this routine even if track_io_timing is false when
track_wal_io_timing is true.  Why not changing this interface a bit
and pass down a GUC (track_io_timing or track_wal_io_timing) as an
argument of the function depending on what we expect to trigger the
timings?

-    /* Convert counters from microsec to millisec for display */
-    values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-    values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
+    /*
+     * There is no need to calculate timings for both pg_stat_wal and
+     * pg_stat_io. So, fetch timings from pg_stat_io to make stats gathering
+     * cheaper. Note that, since timings are fetched from pg_stat_io;
+     * pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
+     *
+     * Convert counters from microsec to millisec for display
+     */
+    values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+                                                   IOCONTEXT_NORMAL,
+                                                   IOOP_WRITE));
+    values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
+                                                   IOCONTEXT_NORMAL,
+                                                   IOOP_FSYNC));

Perhaps it is simpler to remove these columns from pg_stat_get_wal()
and plug an SQL upgrade to the view definition of pg_stat_wal?

+int
+pgstat_get_io_op_bytes(IOObject io_object, IOContext io_context)

This interface looks like a good idea even if there is only one
caller.

Finding a good balance between the subroutines, the two GUCs, the
contexts, the I/O operation type and the objects is the tricky part of
this patch.  If the dependency to PendingWalStats is removed and if
the interface of pgstat_prepare_io_time is improved, things are a bit
cleaner, but it feels like we could do more..  Nya.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

12 December 2023, 11:29:03

Hi,

Thanks for the feedback! The new version of the patch is attached.

On Tue, 5 Dec 2023 at 09:16, Michael Paquier <michael@paquier.xyz> wrote:
>
> -       if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
> +       if (io_op == IOOP_EXTEND || io_op == IOOP_WRITE)
>
> Unrelated diff.

Done.

>
> +   if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> +       io_op == IOOP_FSYNC)
> +       PendingWalStats.wal_sync += cnt;
>
> Nah, I really don't think that adding this dependency within
> pg_stat_io is a good idea.
>
> -   PendingWalStats.wal_sync++;
> +   pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
> +                           io_start, 1);
>
> This is the only caller where this matters, and the count is always 1.

I reverted that, pgstat_count_io_op_n doesn't count
PendingWalStats.wal_sync now.

>
> +       no_wal_normal_read = bktype == B_AUTOVAC_LAUNCHER ||
> +               bktype == B_AUTOVAC_WORKER || bktype == B_BACKEND ||
> +               bktype == B_BG_WORKER || bktype == B_BG_WRITER ||
> +               bktype == B_CHECKPOINTER || bktype == B_WAL_RECEIVER ||
> +               bktype == B_WAL_SENDER || bktype == B_WAL_WRITER;
> +
> +       if (no_wal_normal_read &&
> +               (io_object == IOOBJECT_WAL &&
> +                io_op == IOOP_READ))
> +               return false;
>
> This may be more readable if an enum is applied, without a default
> clause so as it would not be forgotten if a new type is added, perhaps
> in its own little routine.

Done.

>
> -   if (track_io_timing)
> +   if (track_io_timing || track_wal_io_timing)
>         INSTR_TIME_SET_CURRENT(io_start);
>     else
>
> This interface from pgstat_prepare_io_time() is not really good,
> because we could finish by setting io_start in the existing code paths
> calling this routine even if track_io_timing is false when
> track_wal_io_timing is true.  Why not changing this interface a bit
> and pass down a GUC (track_io_timing or track_wal_io_timing) as an
> argument of the function depending on what we expect to trigger the
> timings?

Done in 0001.

>
> -       /* Convert counters from microsec to millisec for display */
> -       values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
> -       values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
> +       /*
> +        * There is no need to calculate timings for both pg_stat_wal and
> +        * pg_stat_io. So, fetch timings from pg_stat_io to make stats gathering
> +        * cheaper. Note that, since timings are fetched from pg_stat_io;
> +        * pg_stat_reset_shared('io') will reset pg_stat_wal's timings too.
> +        *
> +        * Convert counters from microsec to millisec for display
> +        */
> +       values[6] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
> +                                                                                                  IOCONTEXT_NORMAL,
> +                                                                                                  IOOP_WRITE));
> +       values[7] = Float8GetDatum(pg_stat_get_io_time(IOOBJECT_WAL,
> +                                                                                                  IOCONTEXT_NORMAL,
> +                                                                                                  IOOP_FSYNC));
>
> Perhaps it is simpler to remove these columns from pg_stat_get_wal()
> and plug an SQL upgrade to the view definition of pg_stat_wal?

Done in 0003 but I am not sure if that is what you expected.

> Finding a good balance between the subroutines, the two GUCs, the
> contexts, the I/O operation type and the objects is the tricky part of
> this patch.  If the dependency to PendingWalStats is removed and if
> the interface of pgstat_prepare_io_time is improved, things are a bit
> cleaner, but it feels like we could do more..  Nya.

I agree. The patch is not logically complicated but it is hard to
select the best way.

Any kind of feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

16 December 2023, 19:20:57

On Tue, Dec 12, 2023 at 02:29:03PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 5 Dec 2023 at 09:16, Michael Paquier <michael@paquier.xyz> wrote:
>> This interface from pgstat_prepare_io_time() is not really good,
>> because we could finish by setting io_start in the existing code paths
>> calling this routine even if track_io_timing is false when
>> track_wal_io_timing is true.  Why not changing this interface a bit
>> and pass down a GUC (track_io_timing or track_wal_io_timing) as an
>> argument of the function depending on what we expect to trigger the
>> timings?
>
> Done in 0001.

One thing that 0001 missed is an update of the header where the
function is declared.  I've edited a few things, and applied it to
start on this stuff.  The rest will have to wait a bit more..
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

25 December 2023, 06:20:58

On Sat, Dec 16, 2023 at 08:20:57PM +0100, Michael Paquier wrote:
> One thing that 0001 missed is an update of the header where the
> function is declared.  I've edited a few things, and applied it to
> start on this stuff.  The rest will have to wait a bit more..

I have been reviewing the whole, and spotted a couple of issues.

+     * At the end of the if case, accumulate time for the pg_stat_io.
+     */
+    if (pgstat_should_track_io_time(io_object, io_context))

There was a bug here.  WAL operations can do IOOP_WRITE or IOOP_READ,
and this would cause pgstat_count_buffer_read_time() and
pgstat_count_buffer_write_time() to be called, incrementing
pgStatBlock{Read,Write}Time, which would be incorrect when it comes to
a WAL page or a WAL segment.  I was wondering what to do here first,
but we could just avoid calling these routines when working on an
IOOBJECT_WAL as that's the only object not doing a buffer operation.

A comment at the top of pgstat_tracks_io_bktype() is incorrect,
because this patch adds the WAL writer sender in the I/O tracking.

+           case B_WAL_RECEIVER:
+           case B_WAL_SENDER:
+           case B_WAL_WRITER:
+               return false;

pgstat_tracks_io_op() now needs B_WAL_SUMMARIZER.

pgstat_should_track_io_time() is used only in pgstat_io.c, so it can
be static rather than published in pgstat.h.

pgstat_tracks_io_bktype() does not look correct to me.  Why is the WAL
receiver considered as something correct in the list of backend types,
while the intention is to *not* add it to pg_stat_io?  I have tried to
switche to the correct behavior of returning false for a
B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
freezes on its shutdown sequence.  Something weird is going on here.
Could you look at it?  See the XXX comment in the attached, which is
the same behavior as v6-0002.  It looks to me that the patch has
introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
incorrect way to avoid the root issue.

I have also spent more time polishing the rest, touching a few things
while reviewing.  Not sure that I see a point in splitting the tests
from the main patch.
--
Michael

Attachment

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

25 December 2023, 06:40:17

On Mon, Dec 25, 2023 at 03:20:58PM +0900, Michael Paquier wrote:
> pgstat_tracks_io_bktype() does not look correct to me.  Why is the WAL
> receiver considered as something correct in the list of backend types,
> while the intention is to *not* add it to pg_stat_io?  I have tried to
> switche to the correct behavior of returning false for a
> B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
> freezes on its shutdown sequence.  Something weird is going on here.
> Could you look at it?  See the XXX comment in the attached, which is
> the same behavior as v6-0002.  It looks to me that the patch has
> introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
> incorrect way to avoid the root issue.

Ah, that's because it would trigger an assertion failure:
TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
 io_context, io_op)"), File: "pgstat_io.c", Line: 89, PID: 6824
postgres: standby_local: walreceiver
(ExceptionalCondition+0xa8)[0x560d1b4dd38a]

And the backtrace just tells that this is the WAL receiver
initializing a WAL segment:
#5  0x0000560d1b3322c8 in pgstat_count_io_op_n
(io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
cnt=1) at pgstat_io.c:89
#6  0x0000560d1b33254a in pgstat_count_io_op_time
(io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
start_time=..., cnt=1) at pgstat_io.c:181
#7  0x0000560d1ae7f932 in XLogFileInitInternal (logsegno=3, logtli=1,
added=0x7ffd2733c6eb, path=0x7ffd2733c2e0 "pg_wal/00000001", '0'
<repeats 15 times>, "3")     at xlog.c:3115
#8  0x0000560d1ae7fc4e in XLogFileInit (logsegno=3, logtli=1) at
xlog.c:3215

Wouldn't it be simpler to just bite the bullet in this case and handle
WAL receivers in the IO tracking?
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

25 December 2023, 13:09:34

Hi,

Thanks for the review and feedback on your previous reply!

On Mon, 25 Dec 2023 at 09:40, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Dec 25, 2023 at 03:20:58PM +0900, Michael Paquier wrote:
> > pgstat_tracks_io_bktype() does not look correct to me.  Why is the WAL
> > receiver considered as something correct in the list of backend types,
> > while the intention is to *not* add it to pg_stat_io?  I have tried to
> > switche to the correct behavior of returning false for a
> > B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
> > freezes on its shutdown sequence.  Something weird is going on here.
> > Could you look at it?  See the XXX comment in the attached, which is
> > the same behavior as v6-0002.  It looks to me that the patch has
> > introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
> > incorrect way to avoid the root issue.
>
> Ah, that's because it would trigger an assertion failure:
> TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
>  io_context, io_op)"), File: "pgstat_io.c", Line: 89, PID: 6824
> postgres: standby_local: walreceiver
> (ExceptionalCondition+0xa8)[0x560d1b4dd38a]
>
> And the backtrace just tells that this is the WAL receiver
> initializing a WAL segment:
> #5  0x0000560d1b3322c8 in pgstat_count_io_op_n
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> cnt=1) at pgstat_io.c:89
> #6  0x0000560d1b33254a in pgstat_count_io_op_time
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> start_time=..., cnt=1) at pgstat_io.c:181
> #7  0x0000560d1ae7f932 in XLogFileInitInternal (logsegno=3, logtli=1,
> added=0x7ffd2733c6eb, path=0x7ffd2733c2e0 "pg_wal/00000001", '0'
> <repeats 15 times>, "3")     at xlog.c:3115
> #8  0x0000560d1ae7fc4e in XLogFileInit (logsegno=3, logtli=1) at
> xlog.c:3215

Correct.

>
> Wouldn't it be simpler to just bite the bullet in this case and handle
> WAL receivers in the IO tracking?

There is one problem and I couldn't decide how to solve it. We need to
handle read IO in WALRead() in xlogreader.c. How many bytes the
WALRead() function will read is controlled by a variable and it can be
different from XLOG_BLCKSZ. This is a problem because pg_stat_io's
op_bytes column is a constant.

Here are all WALRead() function calls:

1- read_local_xlog_page_guts() in xlogutils.c => WALRead(XLOG_BLCKSZ)
=> always reads XLOG_BLCKSZ.

2- summarizer_read_local_xlog_page() in walsummarizer.c =>
WALRead(XLOG_BLCKSZ) => always reads XLOG_BLCKSZ.

3- logical_read_xlog_page() in walsender.c => WALRead(XLOG_BLCKSZ) =>
always reads XLOG_BLCKSZ.

4- XLogSendPhysical() in walsender.c => WALRead(nbytes) =>  nbytes can
be different from XLOG_BLCKSZ.

5- WALDumpReadPage() in pg_waldump.c => WALRead(count) => count can be
different from XLOG_BLCKSZ.

4 and 5 are the problematic calls.

Melanie's answer to this problem on previous discussions:

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

Using op_bytes as 1 solves this problem but since it will be different
from the rest of the pg_stat_io view it could be hard to understand.
There is no clear separation by backends as it can be seen from the walsender.

>
> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
> treat op_bytes * number of reads as an approximation of the number of
> bytes read. I don't actually know what makes more sense. I don't think I
> would like having a number for bytes that is not accurate.

Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
pg_pread() call tries to read XLOG_BLCKSZ but it is not certain and we
don't count IO if it couldn't read XLOG_BLCKSZ. IMO, this is not as
important as the previous problem but it still is a problem.

I would be glad to hear opinions on these problems.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

26 December 2023, 00:06:04

On Mon, Dec 25, 2023 at 04:09:34PM +0300, Nazir Bilal Yavuz wrote:
> On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com> wrote:
>> If there is any combination of BackendType and IOContext which will
>> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
>> op_bytes. For other cases, we may have to consider using op_bytes 1 and
>> tracking reads and write IOOps in number of bytes (instead of number of
>> pages). I don't actually know if there is a clear separation by
>> BackendType for these different cases.
>
> Using op_bytes as 1 solves this problem but since it will be different
> from the rest of the pg_stat_io view it could be hard to understand.
> There is no clear separation by backends as it can be seen from the walsender.

I find the use of 1 in this context a bit confusing, because when
referring to a counter at N, then it can be understood as doing N
times a operation, but it would be much less than that.  Another
solution would be to use NULL (as a synonym of "I don't know") and
then document that in this case all the bigint counters of pg_stat_io
track the number of bytes rather than the number of operations?

>> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
>> treat op_bytes * number of reads as an approximation of the number of
>> bytes read. I don't actually know what makes more sense. I don't think I
>> would like having a number for bytes that is not accurate.
>
> Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
> pg_pread() call tries to read XLOG_BLCKSZ but it is not certain and we
> don't count IO if it couldn't read XLOG_BLCKSZ. IMO, this is not as
> important as the previous problem but it still is a problem.
>
> I would be glad to hear opinions on these problems.

Correctness matters a lot for monitoring, IMO.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

26 December 2023, 08:27:16

Hi,

On Tue, 26 Dec 2023 at 03:06, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Dec 25, 2023 at 04:09:34PM +0300, Nazir Bilal Yavuz wrote:
> > On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >> If there is any combination of BackendType and IOContext which will
> >> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> >> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> >> tracking reads and write IOOps in number of bytes (instead of number of
> >> pages). I don't actually know if there is a clear separation by
> >> BackendType for these different cases.
> >
> > Using op_bytes as 1 solves this problem but since it will be different
> > from the rest of the pg_stat_io view it could be hard to understand.
> > There is no clear separation by backends as it can be seen from the walsender.
>
> I find the use of 1 in this context a bit confusing, because when
> referring to a counter at N, then it can be understood as doing N
> times a operation, but it would be much less than that.  Another
> solution would be to use NULL (as a synonym of "I don't know") and
> then document that in this case all the bigint counters of pg_stat_io
> track the number of bytes rather than the number of operations?

Yes, that makes sense.

Maybe it is better to create a pg_stat_io_wal view like you said
before. We could remove unused columns and add op_bytes for each
writes and reads. Also, we can track both the number of bytes and the
number of the operations. This doesn't fully solve the problem but it
will be easier to modify it to meet our needs.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

26 December 2023, 10:10:41

On Tue, Dec 26, 2023 at 11:27:16AM +0300, Nazir Bilal Yavuz wrote:
> Maybe it is better to create a pg_stat_io_wal view like you said
> before. We could remove unused columns and add op_bytes for each
> writes and reads. Also, we can track both the number of bytes and the
> number of the operations. This doesn't fully solve the problem but it
> will be easier to modify it to meet our needs.

I am not sure while the whole point of the exercise is to have all the
I/O related data in a single view.  Something that I've also found a
bit disturbing yesterday while looking at your patch is the fact that
the operation size is guessed from the context and object type when
querying the view because now everything is tied to BLCKSZ.  This
patch extends it with two more operation sizes, and there are even
cases where it may be a variable.  Could it be a better option to
extend pgstat_count_io_op_time() so as callers can themselves give the
size of the operation?

The whole patch is kind of itself complicated enough, so I'd be OK to
discard the case of the WAL receiver for now.  Now, if we do so, the
code stack of pgstat_io.c should handle WAL receivers as something
entirely disabled until all the known issues are solved.  There is
still a lot of value in tracking WAL data associated to the WAL
writer, normal backends and WAL senders.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

26 December 2023, 12:35:52

Hi,

On Tue, 26 Dec 2023 at 13:10, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Dec 26, 2023 at 11:27:16AM +0300, Nazir Bilal Yavuz wrote:
> > Maybe it is better to create a pg_stat_io_wal view like you said
> > before. We could remove unused columns and add op_bytes for each
> > writes and reads. Also, we can track both the number of bytes and the
> > number of the operations. This doesn't fully solve the problem but it
> > will be easier to modify it to meet our needs.
>
> I am not sure while the whole point of the exercise is to have all the
> I/O related data in a single view.  Something that I've also found a
> bit disturbing yesterday while looking at your patch is the fact that
> the operation size is guessed from the context and object type when
> querying the view because now everything is tied to BLCKSZ.  This
> patch extends it with two more operation sizes, and there are even
> cases where it may be a variable.  Could it be a better option to
> extend pgstat_count_io_op_time() so as callers can themselves give the
> size of the operation?

Do you mean removing the op_bytes column and tracking the number of
bytes in reads, writes, and extends? If so, that makes sense to me but
I don't want to remove the number of operations; I believe that has a
value too. We can extend the pgstat_count_io_op_time() so it can both
track the number of bytes and the number of operations.
Also, it is not directly related to this patch but vectored IO [1] is
coming soon; so the number of operations could be wrong since vectored
IO could merge a couple of operations.

>
> The whole patch is kind of itself complicated enough, so I'd be OK to
> discard the case of the WAL receiver for now.  Now, if we do so, the
> code stack of pgstat_io.c should handle WAL receivers as something
> entirely disabled until all the known issues are solved.  There is
> still a lot of value in tracking WAL data associated to the WAL
> writer, normal backends and WAL senders.

Why can't we add comments and leave it as it is? Is it because this
could cause misunderstandings?

If we want to entirely disable it, we can add

if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
    return;

to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
calls are done by this function, then we can disable it at
pgstat_tracks_io_bktype().

[1]
https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

31 December 2023, 00:58:33

On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 26 Dec 2023 at 13:10, Michael Paquier <michael@paquier.xyz> wrote:
>> I am not sure while the whole point of the exercise is to have all the
>> I/O related data in a single view.  Something that I've also found a
>> bit disturbing yesterday while looking at your patch is the fact that
>> the operation size is guessed from the context and object type when
>> querying the view because now everything is tied to BLCKSZ.  This
>> patch extends it with two more operation sizes, and there are even
>> cases where it may be a variable.  Could it be a better option to
>> extend pgstat_count_io_op_time() so as callers can themselves give the
>> size of the operation?
>
> Do you mean removing the op_bytes column and tracking the number of
> bytes in reads, writes, and extends? If so, that makes sense to me but
> I don't want to remove the number of operations; I believe that has a
> value too. We can extend the pgstat_count_io_op_time() so it can both
> track the number of bytes and the number of operations.

Apologies if my previous wording sounded confusing.  The idea I had in
mind was to keep op_bytes in pg_stat_io, and extend it so as a value
of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
as a number of bytes.

> Also, it is not directly related to this patch but vectored IO [1] is
> coming soon; so the number of operations could be wrong since vectored
> IO could merge a couple of operations.

Hmm.  I have not checked this patch series so I cannot say for sure,
but we'd likely just want to track the number of bytes if a single
operation has a non-equal size rather than registering in pg_stat_io N
rows with different op_bytes, no?  I am looping in Thomas Munro in CC
for comments.

>> The whole patch is kind of itself complicated enough, so I'd be OK to
>> discard the case of the WAL receiver for now.  Now, if we do so, the
>> code stack of pgstat_io.c should handle WAL receivers as something
>> entirely disabled until all the known issues are solved.  There is
>> still a lot of value in tracking WAL data associated to the WAL
>> writer, normal backends and WAL senders.
>
> Why can't we add comments and leave it as it is? Is it because this
> could cause misunderstandings?
>
> If we want to entirely disable it, we can add
>
> if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
>     return;
>
> to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
> calls are done by this function, then we can disable it at
> pgstat_tracks_io_bktype().

Yeah, a limitation like that may be acceptable for now.  Tracking the
WAL writer and WAL sender activities can be relevant in a lot of cases
even if we don't have the full picture for the WAL receiver yet.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

03 January 2024, 13:10:58

Hi,

On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 26 Dec 2023 at 13:10, Michael Paquier <michael@paquier.xyz> wrote:
> >> I am not sure while the whole point of the exercise is to have all the
> >> I/O related data in a single view.  Something that I've also found a
> >> bit disturbing yesterday while looking at your patch is the fact that
> >> the operation size is guessed from the context and object type when
> >> querying the view because now everything is tied to BLCKSZ.  This
> >> patch extends it with two more operation sizes, and there are even
> >> cases where it may be a variable.  Could it be a better option to
> >> extend pgstat_count_io_op_time() so as callers can themselves give the
> >> size of the operation?
> >
> > Do you mean removing the op_bytes column and tracking the number of
> > bytes in reads, writes, and extends? If so, that makes sense to me but
> > I don't want to remove the number of operations; I believe that has a
> > value too. We can extend the pgstat_count_io_op_time() so it can both
> > track the number of bytes and the number of operations.
>
> Apologies if my previous wording sounded confusing.  The idea I had in
> mind was to keep op_bytes in pg_stat_io, and extend it so as a value
> of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
> as a number of bytes.

Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

> > Also, it is not directly related to this patch but vectored IO [1] is
> > coming soon; so the number of operations could be wrong since vectored
> > IO could merge a couple of operations.
>
> Hmm.  I have not checked this patch series so I cannot say for sure,
> but we'd likely just want to track the number of bytes if a single
> operation has a non-equal size rather than registering in pg_stat_io N
> rows with different op_bytes, no?

Yes, that is correct.

> I am looping in Thomas Munro in CC for comments.

Thanks for doing that.

> > If we want to entirely disable it, we can add
> >
> > if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> >     return;
> >
> > to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
> > calls are done by this function, then we can disable it at
> > pgstat_tracks_io_bktype().
>
> Yeah, a limitation like that may be acceptable for now.  Tracking the
> WAL writer and WAL sender activities can be relevant in a lot of cases
> even if we don't have the full picture for the WAL receiver yet.

I added that and disabled B_WAL_RECEIVER backend with comments
explaining why. v8 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

v8-0001-Show-WAL-stats-on-pg_stat_io-except-streaming-rep.patch

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

10 January 2024, 05:24:59

On Wed, Jan 03, 2024 at 04:10:58PM +0300, Nazir Bilal Yavuz wrote:
> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
>> Apologies if my previous wording sounded confusing.  The idea I had in
>> mind was to keep op_bytes in pg_stat_io, and extend it so as a value
>> of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
>> as a number of bytes.
>
> Oh, I understand it now. Yes, that makes sense.
> I thought removing op_bytes completely ( as you said "This patch
> extends it with two more operation sizes, and there are even cases
> where it may be a variable" ) from pg_stat_io view then adding
> something like {read | write | extend}_bytes and {read | write |
> extend}_calls could be better, so that we don't lose any information.

But then you'd lose the possibility to analyze correlations between
the size and the number of the operations, which is something that
matters for more complex I/O scenarios.  This does not need to be
tackled in this patch, which is useful on its own, though I am really
wondering if this is required for the recent work done by Thomas.
Perhaps Andres, Thomas or Melanie could comment on that?

>> Yeah, a limitation like that may be acceptable for now.  Tracking the
>> WAL writer and WAL sender activities can be relevant in a lot of cases
>> even if we don't have the full picture for the WAL receiver yet.
>
> I added that and disabled B_WAL_RECEIVER backend with comments
> explaining why. v8 is attached.

I can see that's what you have been adding here, which should be OK:

> -    if (track_io_timing)
> +    /*
> +     * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp IOs
> +     * but these IOs are not countable for now because IOOP_READ IOs' op_bytes
> +     * (number of bytes per unit of I/O) might not be the same all the time.
> +     * The current implementation requires that the op_bytes must be the same
> +     * for the same IOObject, IOContext and IOOp. To avoid confusion, the
> +     * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
> +     * now.
> +     */
> +    if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> +        return;

This could be worded better, but that's one of these nits from me I
usually tweak when committing stuff.

> +/*
> + * Decide if IO timings need to be tracked.  Timings associated to
> + * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
> + * else rely on track_io_timing.
> + */
> +static bool
> +pgstat_should_track_io_time(IOObject io_object)
> +{
> +    if (io_object == IOOBJECT_WAL)
> +        return track_wal_io_timing;
> +
> +    return track_io_timing;
> +}

One thing I was also considering is if eliminating this routine would
make pgstat_count_io_op_time() more readable the result, but I cannot
get to that.

>          if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
>          {
> -            pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> +            if (io_object != IOOBJECT_WAL)
> +                pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> +
>              if (io_object == IOOBJECT_RELATION)
>                  INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
>              else if (io_object == IOOBJECT_TEMP_RELATION)
> @@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
>          }
>          else if (io_op == IOOP_READ)
>          {
> -            pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> +            if (io_object != IOOBJECT_WAL)
> +                pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> +
>              if (io_object == IOOBJECT_RELATION)
>                  INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
>              else if (io_object == IOOBJECT_TEMP_RELATION)

A second thing is if this would be better with more switch/cases, say:
switch (io_op):
{
    case IOOP_EXTEND:
    case IOOP_WRITE:
        switch (io_object):
    {
        case WAL:
                /* do nothing */
                break;
        case RELATION:
        case TEMP:
            .. blah ..
    }
        break;
    case IOOP_READ:
        switch (io_object):
    {
        .. blah ..
    }
        break;
}

Or just this one to make it clear that nothing happens for WAL
objects:
switch (io_object):
{
   case WAL:
       /* do nothing */
       break;
   case RELATION:
       switch (io_op):
       {
           case IOOP_EXTEND:
       case IOOP_WRITE:
           .. blah ..
       case IOOP_READ:
           .. blah ..
       }
       break;
   case TEMP:
       /* same switch as RELATION */
       break;
}

This duplicates a bit things, but at least in the second case it's
clear which counters are updated when I/O timings are tracked.  It's
OK by me if people don't like this suggestion, but that would avoid
bugs like the one I found upthread.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

10 January 2024, 12:59:24

Hi,

On Wed, 10 Jan 2024 at 08:25, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Jan 03, 2024 at 04:10:58PM +0300, Nazir Bilal Yavuz wrote:
> >
> > I thought removing op_bytes completely ( as you said "This patch
> > extends it with two more operation sizes, and there are even cases
> > where it may be a variable" ) from pg_stat_io view then adding
> > something like {read | write | extend}_bytes and {read | write |
> > extend}_calls could be better, so that we don't lose any information.
>
> But then you'd lose the possibility to analyze correlations between
> the size and the number of the operations, which is something that
> matters for more complex I/O scenarios.  This does not need to be
> tackled in this patch, which is useful on its own, though I am really
> wondering if this is required for the recent work done by Thomas.
> Perhaps Andres, Thomas or Melanie could comment on that?

Yes, you are right.

> >> Yeah, a limitation like that may be acceptable for now.  Tracking the
> >> WAL writer and WAL sender activities can be relevant in a lot of cases
> >> even if we don't have the full picture for the WAL receiver yet.
> >
> > I added that and disabled B_WAL_RECEIVER backend with comments
> > explaining why. v8 is attached.
>
> I can see that's what you have been adding here, which should be OK:
>
> > -    if (track_io_timing)
> > +    /*
> > +     * B_WAL_RECEIVER backend does IOOBJECT_WAL IOObject & IOOP_READ IOOp IOs
> > +     * but these IOs are not countable for now because IOOP_READ IOs' op_bytes
> > +     * (number of bytes per unit of I/O) might not be the same all the time.
> > +     * The current implementation requires that the op_bytes must be the same
> > +     * for the same IOObject, IOContext and IOOp. To avoid confusion, the
> > +     * B_WAL_RECEIVER backend & IOOBJECT_WAL IOObject IOs are disabled for
> > +     * now.
> > +     */
> > +    if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> > +        return;
>
> This could be worded better, but that's one of these nits from me I
> usually tweak when committing stuff.

Thanks for doing that! Do you have any specific comments that can help
improve it?

> > +/*
> > + * Decide if IO timings need to be tracked.  Timings associated to
> > + * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
> > + * else rely on track_io_timing.
> > + */
> > +static bool
> > +pgstat_should_track_io_time(IOObject io_object)
> > +{
> > +    if (io_object == IOOBJECT_WAL)
> > +        return track_wal_io_timing;
> > +
> > +    return track_io_timing;
> > +}
>
> One thing I was also considering is if eliminating this routine would
> make pgstat_count_io_op_time() more readable the result, but I cannot
> get to that.

I could not think of a way to eliminate pgstat_should_track_io_time()
route without causing performance regressions. What do you think about
moving inside of 'pgstat_should_track_io_time(io_object) if check' to
another function and call this function from
pgstat_count_io_op_time()? This does not change anything but IMO it
increases the readability.

> >          if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
> >          {
> > -            pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +            if (io_object != IOOBJECT_WAL)
> > +                pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +
> >              if (io_object == IOOBJECT_RELATION)
> >                  INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
> >              else if (io_object == IOOBJECT_TEMP_RELATION)
> > @@ -139,7 +177,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> >          }
> >          else if (io_op == IOOP_READ)
> >          {
> > -            pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +            if (io_object != IOOBJECT_WAL)
> > +                pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
> > +
> >              if (io_object == IOOBJECT_RELATION)
> >                  INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
> >              else if (io_object == IOOBJECT_TEMP_RELATION)
>
> A second thing is if this would be better with more switch/cases, say:
> switch (io_op):
> {
>     case IOOP_EXTEND:
>     case IOOP_WRITE:
>         switch (io_object):
>         {
>             case WAL:
>                 /* do nothing */
>                 break;
>             case RELATION:
>             case TEMP:
>                 .. blah ..
>         }
>         break;
>     case IOOP_READ:
>         switch (io_object):
>         {
>             .. blah ..
>         }
>         break;
> }
>
> Or just this one to make it clear that nothing happens for WAL
> objects:
> switch (io_object):
> {
>    case WAL:
>        /* do nothing */
>        break;
>    case RELATION:
>        switch (io_op):
>        {
>            case IOOP_EXTEND:
>            case IOOP_WRITE:
>                .. blah ..
>            case IOOP_READ:
>                .. blah ..
>        }
>        break;
>    case TEMP:
>        /* same switch as RELATION */
>        break;
> }
>
> This duplicates a bit things, but at least in the second case it's
> clear which counters are updated when I/O timings are tracked.  It's
> OK by me if people don't like this suggestion, but that would avoid
> bugs like the one I found upthread.

I am more inclined towards the second one because it is more likely
that a new io_object will be introduced rather than a new io_op. So, I
think the second one is a bit more future proof.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Melanie Plageman

Date:

11 January 2024, 00:24:50

I have code review feedback as well, but I've saved that for my next email.

On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> > > On Tue, 26 Dec 2023 at 13:10, Michael Paquier <michael@paquier.xyz> wrote:
> > >> I am not sure while the whole point of the exercise is to have all the
> > >> I/O related data in a single view.  Something that I've also found a
> > >> bit disturbing yesterday while looking at your patch is the fact that
> > >> the operation size is guessed from the context and object type when
> > >> querying the view because now everything is tied to BLCKSZ.  This
> > >> patch extends it with two more operation sizes, and there are even
> > >> cases where it may be a variable.  Could it be a better option to
> > >> extend pgstat_count_io_op_time() so as callers can themselves give the
> > >> size of the operation?
> > >
> > > Do you mean removing the op_bytes column and tracking the number of
> > > bytes in reads, writes, and extends? If so, that makes sense to me but
> > > I don't want to remove the number of operations; I believe that has a
> > > value too. We can extend the pgstat_count_io_op_time() so it can both
> > > track the number of bytes and the number of operations.
> >
> > Apologies if my previous wording sounded confusing.  The idea I had in
> > mind was to keep op_bytes in pg_stat_io, and extend it so as a value
> > of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
> > as a number of bytes.
>
> Oh, I understand it now. Yes, that makes sense.
> I thought removing op_bytes completely ( as you said "This patch
> extends it with two more operation sizes, and there are even cases
> where it may be a variable" ) from pg_stat_io view then adding
> something like {read | write | extend}_bytes and {read | write |
> extend}_calls could be better, so that we don't lose any information.

Forgive me as I catch up on this thread.

Upthread, Michael says:

> I find the use of 1 in this context a bit confusing, because when
> referring to a counter at N, then it can be understood as doing N
> times a operation,

I didn't understand this argument, so I'm not sure if I agree or
disagree with it.

I think these are the three proposals for handling WAL reads:

1) setting op_bytes to 1 and the number of reads is the number of bytes
2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
number of calls to pg_pread() or similar
3) setting op_bytes to NULL and the number of reads is the number of
calls to pg_pread() or similar

Looking at the patch, I think it is still doing 2.

It would be good to list all our options with pros and cons (if only
because they are a bit spread throughout the thread now).

For an unpopular idea: we could add separate [IOOp]_bytes columns for
all those IOOps for which it would be relevant. It kind of stinks but
it would give us the freedom to document exactly what a single IOOp
means for each combination of BackendType, IOContext, IOObject, and
IOOp (as relevant) and still have an accurate number in the *bytes
columns. Everyone will probably hate us if we do that, though.
Especially because having bytes for the existing IOObjects is an
existing feature.

A separate question: suppose [1] goes in (to read WAL from WAL buffers
directly). Now, WAL reads are not from permanent storage anymore. Are
we only tracking permanent storage I/O in pg_stat_io? I also had this
question for some of the WAL receiver functions. Should we track any
I/O other than permanent storage I/O? Or did I miss this being
addressed upthread?

> > > Also, it is not directly related to this patch but vectored IO [1] is
> > > coming soon; so the number of operations could be wrong since vectored
> > > IO could merge a couple of operations.
> >
> > Hmm.  I have not checked this patch series so I cannot say for sure,
> > but we'd likely just want to track the number of bytes if a single
> > operation has a non-equal size rather than registering in pg_stat_io N
> > rows with different op_bytes, no?
>
> Yes, that is correct.

I do not like the idea of having basically GROUP BY op_bytes in the
view (if that is the suggestion).

In terms of what I/O we should track in a streaming/asynchronous
world, the options would be:

1) track read/write syscalls
2) track blocks of BLCKSZ submitted to the kernel
3) track bytes submitted to the kernel
4) track merged I/Os (after doing any merging in the application)

I think the debate was largely between 2 and 4. There was some
disagreement, but I think we landed on 2 because there is merging that
can happen at many levels in the storage stack (even the storage
controller). Distinguishing between whether or not Postgres submitted
2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
but I think it might be confusing for the Postgres user trying to
determine why their query is slow. It probably makes the most sense to
still track in block size.

No matter what solution we pick, you should get a correct number if
you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
rather, there should be some way of getting an accurate number in
bytes of the amount of a particular kind of I/O that has been done.

- Melanie

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

11 January 2024, 05:00:54

On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> I have code review feedback as well, but I've saved that for my next email.

Ah, cool.

> On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
>> Oh, I understand it now. Yes, that makes sense.
>> I thought removing op_bytes completely ( as you said "This patch
>> extends it with two more operation sizes, and there are even cases
>> where it may be a variable" ) from pg_stat_io view then adding
>> something like {read | write | extend}_bytes and {read | write |
>> extend}_calls could be better, so that we don't lose any information.
>
> Upthread, Michael says:
>
>> I find the use of 1 in this context a bit confusing, because when
>> referring to a counter at N, then it can be understood as doing N
>> times a operation,
>
> I didn't understand this argument, so I'm not sure if I agree or
> disagree with it.

Nazir has mentioned upthread one thing: what should we do for the case
where a combination of (io_object,io_context) does I/O with a
*variable* op_bytes, because that may be the case for the WAL
receiver?  For this case, he has mentioned that we should set op_bytes
to 1, but that's something I find confusing because it would mean that
we are doing read, writes or extends 1 byte at a time.  My suggestion
would be to use op_bytes = -1 or NULL for the variable case instead,
with reads, writes and extends referring to a number of bytes rather
than a number of operations.

> I think these are the three proposals for handling WAL reads:
>
> 1) setting op_bytes to 1 and the number of reads is the number of bytes
> 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> number of calls to pg_pread() or similar
> 3) setting op_bytes to NULL and the number of reads is the number of
> calls to pg_pread() or similar

3) could be a number of bytes, actually.

> Looking at the patch, I think it is still doing 2.

The patch disables stats for the WAL receiver, while the startup
process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
discard the variable case.

> For an unpopular idea: we could add separate [IOOp]_bytes columns for
> all those IOOps for which it would be relevant. It kind of stinks but
> it would give us the freedom to document exactly what a single IOOp
> means for each combination of BackendType, IOContext, IOObject, and
> IOOp (as relevant) and still have an accurate number in the *bytes
> columns. Everyone will probably hate us if we do that, though.
> Especially because having bytes for the existing IOObjects is an
> existing feature.

An issue I have with this one is that having multiple tuples for
each (object,context) if they have multiple op_bytes leads to
potentially a lot of bloat in the view.  That would be up to 8k extra
tuples just for the sake of op_byte's variability.

> A separate question: suppose [1] goes in (to read WAL from WAL buffers
> directly). Now, WAL reads are not from permanent storage anymore. Are
> we only tracking permanent storage I/O in pg_stat_io? I also had this
> question for some of the WAL receiver functions. Should we track any
> I/O other than permanent storage I/O? Or did I miss this being
> addressed upthread?

That's a good point.  I guess that this should just be a different
IOOp?  That's not a IOOP_READ.  A IOOP_HIT is also different.

> In terms of what I/O we should track in a streaming/asynchronous
> world, the options would be:
>
> 1) track read/write syscalls
> 2) track blocks of BLCKSZ submitted to the kernel
> 3) track bytes submitted to the kernel
> 4) track merged I/Os (after doing any merging in the application)
>
> I think the debate was largely between 2 and 4. There was some
> disagreement, but I think we landed on 2 because there is merging that
> can happen at many levels in the storage stack (even the storage
> controller). Distinguishing between whether or not Postgres submitted
> 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> but I think it might be confusing for the Postgres user trying to
> determine why their query is slow. It probably makes the most sense to
> still track in block size.
>
> No matter what solution we pick, you should get a correct number if
> you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> rather, there should be some way of getting an accurate number in
> bytes of the amount of a particular kind of I/O that has been done.

Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
writes or extends are counted as bytes, because we don't have a fixed
operation size for some (object,context) cases.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

11 January 2024, 11:18:54

Hi,

On Thu, 11 Jan 2024 at 08:01, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > I have code review feedback as well, but I've saved that for my next email.
>
> Ah, cool.
>
> > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
> >> Oh, I understand it now. Yes, that makes sense.
> >> I thought removing op_bytes completely ( as you said "This patch
> >> extends it with two more operation sizes, and there are even cases
> >> where it may be a variable" ) from pg_stat_io view then adding
> >> something like {read | write | extend}_bytes and {read | write |
> >> extend}_calls could be better, so that we don't lose any information.
> >
> > Upthread, Michael says:
> >
> >> I find the use of 1 in this context a bit confusing, because when
> >> referring to a counter at N, then it can be understood as doing N
> >> times a operation,
> >
> > I didn't understand this argument, so I'm not sure if I agree or
> > disagree with it.
>
> Nazir has mentioned upthread one thing: what should we do for the case
> where a combination of (io_object,io_context) does I/O with a
> *variable* op_bytes, because that may be the case for the WAL
> receiver?  For this case, he has mentioned that we should set op_bytes
> to 1, but that's something I find confusing because it would mean that
> we are doing read, writes or extends 1 byte at a time.  My suggestion
> would be to use op_bytes = -1 or NULL for the variable case instead,
> with reads, writes and extends referring to a number of bytes rather
> than a number of operations.

I agree but we can't do this only for the *variable* cases since
B_WAL_RECEIVER and other backends use the same
pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
is, if two backends use the same pgstat_count_io_op_time() function
call in the code; they must count the same thing (number of calls,
bytes, etc.). It could be better to count the number of bytes for all
the IOOBJECT_WAL IOs.

> > I think these are the three proposals for handling WAL reads:
> >
> > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > number of calls to pg_pread() or similar
> > 3) setting op_bytes to NULL and the number of reads is the number of
> > calls to pg_pread() or similar
>
> 3) could be a number of bytes, actually.

One important point is that we can't change only reads, if we decide
to count the number of bytes for the reads; writes and extends should
be counted as a number of bytes as well.

> > Looking at the patch, I think it is still doing 2.
>
> The patch disables stats for the WAL receiver, while the startup
> process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> discard the variable case.
>
> > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > all those IOOps for which it would be relevant. It kind of stinks but
> > it would give us the freedom to document exactly what a single IOOp
> > means for each combination of BackendType, IOContext, IOObject, and
> > IOOp (as relevant) and still have an accurate number in the *bytes
> > columns. Everyone will probably hate us if we do that, though.
> > Especially because having bytes for the existing IOObjects is an
> > existing feature.
>
> An issue I have with this one is that having multiple tuples for
> each (object,context) if they have multiple op_bytes leads to
> potentially a lot of bloat in the view.  That would be up to 8k extra
> tuples just for the sake of op_byte's variability.

Yes, that doesn't seem applicable to me.

> > A separate question: suppose [1] goes in (to read WAL from WAL buffers
> > directly). Now, WAL reads are not from permanent storage anymore. Are
> > we only tracking permanent storage I/O in pg_stat_io? I also had this
> > question for some of the WAL receiver functions. Should we track any
> > I/O other than permanent storage I/O? Or did I miss this being
> > addressed upthread?
>
> That's a good point.  I guess that this should just be a different
> IOOp?  That's not a IOOP_READ.  A IOOP_HIT is also different.

I think different IOContext rather than IOOp suits better for this.

> > In terms of what I/O we should track in a streaming/asynchronous
> > world, the options would be:
> >
> > 1) track read/write syscalls
> > 2) track blocks of BLCKSZ submitted to the kernel
> > 3) track bytes submitted to the kernel
> > 4) track merged I/Os (after doing any merging in the application)
> >
> > I think the debate was largely between 2 and 4. There was some
> > disagreement, but I think we landed on 2 because there is merging that
> > can happen at many levels in the storage stack (even the storage
> > controller). Distinguishing between whether or not Postgres submitted
> > 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> > but I think it might be confusing for the Postgres user trying to
> > determine why their query is slow. It probably makes the most sense to
> > still track in block size.
> >
> > No matter what solution we pick, you should get a correct number if
> > you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> > rather, there should be some way of getting an accurate number in
> > bytes of the amount of a particular kind of I/O that has been done.
>
> Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
> writes or extends are counted as bytes, because we don't have a fixed
> operation size for some (object,context) cases.

Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL
IOs and 2 for the other IOs.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Melanie Plageman

Date:

11 January 2024, 14:27:53

On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> On Thu, 11 Jan 2024 at 08:01, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
> > >> Oh, I understand it now. Yes, that makes sense.
> > >> I thought removing op_bytes completely ( as you said "This patch
> > >> extends it with two more operation sizes, and there are even cases
> > >> where it may be a variable" ) from pg_stat_io view then adding
> > >> something like {read | write | extend}_bytes and {read | write |
> > >> extend}_calls could be better, so that we don't lose any information.
> > >
> > > Upthread, Michael says:
> > >
> > >> I find the use of 1 in this context a bit confusing, because when
> > >> referring to a counter at N, then it can be understood as doing N
> > >> times a operation,
> > >
> > > I didn't understand this argument, so I'm not sure if I agree or
> > > disagree with it.
> >
> > Nazir has mentioned upthread one thing: what should we do for the case
> > where a combination of (io_object,io_context) does I/O with a
> > *variable* op_bytes, because that may be the case for the WAL
> > receiver?  For this case, he has mentioned that we should set op_bytes
> > to 1, but that's something I find confusing because it would mean that
> > we are doing read, writes or extends 1 byte at a time.  My suggestion
> > would be to use op_bytes = -1 or NULL for the variable case instead,
> > with reads, writes and extends referring to a number of bytes rather
> > than a number of operations.
>
> I agree but we can't do this only for the *variable* cases since
> B_WAL_RECEIVER and other backends use the same
> pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
> is, if two backends use the same pgstat_count_io_op_time() function
> call in the code; they must count the same thing (number of calls,
> bytes, etc.). It could be better to count the number of bytes for all
> the IOOBJECT_WAL IOs.

I'm a bit confused by this. pgstat_count_io_op_time() can check
MyBackendType. In fact, you do this to ban the wal receiver already.
It is true that you would need to count all wal receiver normal
context wal object IOOps in the variable way, but I don't see how
pgstat_count_io_op_time() is the limiting factor as long as the
callsite is always doing either the number of bytes or the number of
calls.

> > > I think these are the three proposals for handling WAL reads:
> > >
> > > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > > number of calls to pg_pread() or similar
> > > 3) setting op_bytes to NULL and the number of reads is the number of
> > > calls to pg_pread() or similar
> >
> > 3) could be a number of bytes, actually.
>
> One important point is that we can't change only reads, if we decide
> to count the number of bytes for the reads; writes and extends should
> be counted as a number of bytes as well.

Yes, that is true.

> > > Looking at the patch, I think it is still doing 2.
> >
> > The patch disables stats for the WAL receiver, while the startup
> > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> > discard the variable case.
> >
> > > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > > all those IOOps for which it would be relevant. It kind of stinks but
> > > it would give us the freedom to document exactly what a single IOOp
> > > means for each combination of BackendType, IOContext, IOObject, and
> > > IOOp (as relevant) and still have an accurate number in the *bytes
> > > columns. Everyone will probably hate us if we do that, though.
> > > Especially because having bytes for the existing IOObjects is an
> > > existing feature.
> >
> > An issue I have with this one is that having multiple tuples for
> > each (object,context) if they have multiple op_bytes leads to
> > potentially a lot of bloat in the view.  That would be up to 8k extra
> > tuples just for the sake of op_byte's variability.
>
> Yes, that doesn't seem applicable to me.

My suggestion (again not sure it is a good idea) was actually that we
remove op_bytes and add "write_bytes", "read_bytes", and
"extend_bytes". AFAICT, this would add columns not rows. In this
schema, read bytes for wal receiver could be counted in one way and
writes in another. We could document that, for wal receiver, the reads
are not always done in units of the same size, so the read_bytes /
reads could be thought of as an average size of read.

Even if we made a separate view for WAL I/O stats, we would still have
this issue of variable sized I/O vs block sized I/O and would probably
end up solving it with separate columns for the number of bytes and
number of operations.

> > > A separate question: suppose [1] goes in (to read WAL from WAL buffers
> > > directly). Now, WAL reads are not from permanent storage anymore. Are
> > > we only tracking permanent storage I/O in pg_stat_io? I also had this
> > > question for some of the WAL receiver functions. Should we track any
> > > I/O other than permanent storage I/O? Or did I miss this being
> > > addressed upthread?
> >
> > That's a good point.  I guess that this should just be a different
> > IOOp?  That's not a IOOP_READ.  A IOOP_HIT is also different.
>
> I think different IOContext rather than IOOp suits better for this.

That makes sense to me.

> > > In terms of what I/O we should track in a streaming/asynchronous
> > > world, the options would be:
> > >
> > > 1) track read/write syscalls
> > > 2) track blocks of BLCKSZ submitted to the kernel
> > > 3) track bytes submitted to the kernel
> > > 4) track merged I/Os (after doing any merging in the application)
> > >
> > > I think the debate was largely between 2 and 4. There was some
> > > disagreement, but I think we landed on 2 because there is merging that
> > > can happen at many levels in the storage stack (even the storage
> > > controller). Distinguishing between whether or not Postgres submitted
> > > 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> > > but I think it might be confusing for the Postgres user trying to
> > > determine why their query is slow. It probably makes the most sense to
> > > still track in block size.
> > >
> > > No matter what solution we pick, you should get a correct number if
> > > you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> > > rather, there should be some way of getting an accurate number in
> > > bytes of the amount of a particular kind of I/O that has been done.
> >
> > Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
> > writes or extends are counted as bytes, because we don't have a fixed
> > operation size for some (object,context) cases.
>
> Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL
> IOs and 2 for the other IOs.

We can do this. One concern I have is that much of WAL I/O is done in
XLOG_BLCKSZ, so it feels kind of odd for all WAL I/O to appear as if
it is being done in random chunks of bytes. We anticipated other
uniformly non-block-based I/O types where having 1 in op_bytes would
be meaningful. I didn't realize at the time that there would be
variable-sized and block-sized I/O mixed together for the same backend
type, io object, and io context.

- Melanie

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

12 January 2024, 13:23:26

Hi,

On Thu, 11 Jan 2024 at 17:28, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > On Thu, 11 Jan 2024 at 08:01, Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael@paquier.xyz> wrote:
> > > >> Oh, I understand it now. Yes, that makes sense.
> > > >> I thought removing op_bytes completely ( as you said "This patch
> > > >> extends it with two more operation sizes, and there are even cases
> > > >> where it may be a variable" ) from pg_stat_io view then adding
> > > >> something like {read | write | extend}_bytes and {read | write |
> > > >> extend}_calls could be better, so that we don't lose any information.
> > > >
> > > > Upthread, Michael says:
> > > >
> > > >> I find the use of 1 in this context a bit confusing, because when
> > > >> referring to a counter at N, then it can be understood as doing N
> > > >> times a operation,
> > > >
> > > > I didn't understand this argument, so I'm not sure if I agree or
> > > > disagree with it.
> > >
> > > Nazir has mentioned upthread one thing: what should we do for the case
> > > where a combination of (io_object,io_context) does I/O with a
> > > *variable* op_bytes, because that may be the case for the WAL
> > > receiver?  For this case, he has mentioned that we should set op_bytes
> > > to 1, but that's something I find confusing because it would mean that
> > > we are doing read, writes or extends 1 byte at a time.  My suggestion
> > > would be to use op_bytes = -1 or NULL for the variable case instead,
> > > with reads, writes and extends referring to a number of bytes rather
> > > than a number of operations.
> >
> > I agree but we can't do this only for the *variable* cases since
> > B_WAL_RECEIVER and other backends use the same
> > pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
> > is, if two backends use the same pgstat_count_io_op_time() function
> > call in the code; they must count the same thing (number of calls,
> > bytes, etc.). It could be better to count the number of bytes for all
> > the IOOBJECT_WAL IOs.
>
> I'm a bit confused by this. pgstat_count_io_op_time() can check
> MyBackendType. In fact, you do this to ban the wal receiver already.
> It is true that you would need to count all wal receiver normal
> context wal object IOOps in the variable way, but I don't see how
> pgstat_count_io_op_time() is the limiting factor as long as the
> callsite is always doing either the number of bytes or the number of
> calls.

Apologies for not being clear. Let me try to explain this by giving examples:

Let's assume that there are 3 different pgstat_count_io_op_time()
calls in the code base and they are labeled as 1, 2 and 3.

And let's' assume that: WAL receiver uses 1st and 2nd
pgstat_count_io_op_time(), autovacuum uses 2nd and 3rd
pgstat_count_io_op_time() and checkpointer uses 3rd
pgstat_count_io_op_time() to count IOs.

The 1st one is the only pgstat_count_io_op_time() call that must count
the number of bytes because of the variable cases and the others count
the number of calls or blocks.

a) WAL receiver uses both 1st and 2nd => 1st and 2nd
pgstat_count_io_op_time() must count the same thing => 2nd
pgstat_count_io_op_time() must count the number of bytes as well.

b) 2nd pgstat_count_io_op_time() started to count the number of bytes
=> Autovacuum will start to count the number of bytes => 2nd and 3rd
both are used by autocavuum => 3rd pgstat_count_io_op_time() must
count the number of bytes as well.

c) 3rd pgstat_count_io_op_time() started to count the number of bytes
=> Checkpointer will start to count the number of bytes.

The list goes on like this and if we don't have [write | read |
extend]_bytes, this effect will be multiplied.

> > > > I think these are the three proposals for handling WAL reads:
> > > >
> > > > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > > > number of calls to pg_pread() or similar
> > > > 3) setting op_bytes to NULL and the number of reads is the number of
> > > > calls to pg_pread() or similar
> > >
> > > 3) could be a number of bytes, actually.
> >
> > One important point is that we can't change only reads, if we decide
> > to count the number of bytes for the reads; writes and extends should
> > be counted as a number of bytes as well.
>
> Yes, that is true.
>
> > > > Looking at the patch, I think it is still doing 2.
> > >
> > > The patch disables stats for the WAL receiver, while the startup
> > > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> > > discard the variable case.
> > >
> > > > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > > > all those IOOps for which it would be relevant. It kind of stinks but
> > > > it would give us the freedom to document exactly what a single IOOp
> > > > means for each combination of BackendType, IOContext, IOObject, and
> > > > IOOp (as relevant) and still have an accurate number in the *bytes
> > > > columns. Everyone will probably hate us if we do that, though.
> > > > Especially because having bytes for the existing IOObjects is an
> > > > existing feature.
> > >
> > > An issue I have with this one is that having multiple tuples for
> > > each (object,context) if they have multiple op_bytes leads to
> > > potentially a lot of bloat in the view.  That would be up to 8k extra
> > > tuples just for the sake of op_byte's variability.
> >
> > Yes, that doesn't seem applicable to me.
>
> My suggestion (again not sure it is a good idea) was actually that we
> remove op_bytes and add "write_bytes", "read_bytes", and
> "extend_bytes". AFAICT, this would add columns not rows. In this
> schema, read bytes for wal receiver could be counted in one way and
> writes in another. We could document that, for wal receiver, the reads
> are not always done in units of the same size, so the read_bytes /
> reads could be thought of as an average size of read.

That looks like one of the best options to me. I suggested something
similar upthread and Michael's answer was:

> But then you'd lose the possibility to analyze correlations between
> the size and the number of the operations, which is something that
> matters for more complex I/O scenarios.  This does not need to be
> tackled in this patch, which is useful on its own, though I am really
> wondering if this is required for the recent work done by Thomas.
> Perhaps Andres, Thomas or Melanie could comment on that?


> Even if we made a separate view for WAL I/O stats, we would still have
> this issue of variable sized I/O vs block sized I/O and would probably
> end up solving it with separate columns for the number of bytes and
> number of operations.

Yes, I think it is more about flexibility and not changing the already
published pg_stat_io view.

> > > > A separate question: suppose [1] goes in (to read WAL from WAL buffers
> > > > directly). Now, WAL reads are not from permanent storage anymore. Are
> > > > we only tracking permanent storage I/O in pg_stat_io? I also had this
> > > > question for some of the WAL receiver functions. Should we track any
> > > > I/O other than permanent storage I/O? Or did I miss this being
> > > > addressed upthread?
> > >
> > > That's a good point.  I guess that this should just be a different
> > > IOOp?  That's not a IOOP_READ.  A IOOP_HIT is also different.
> >
> > I think different IOContext rather than IOOp suits better for this.
>
> That makes sense to me.
>
> > > > In terms of what I/O we should track in a streaming/asynchronous
> > > > world, the options would be:
> > > >
> > > > 1) track read/write syscalls
> > > > 2) track blocks of BLCKSZ submitted to the kernel
> > > > 3) track bytes submitted to the kernel
> > > > 4) track merged I/Os (after doing any merging in the application)
> > > >
> > > > I think the debate was largely between 2 and 4. There was some
> > > > disagreement, but I think we landed on 2 because there is merging that
> > > > can happen at many levels in the storage stack (even the storage
> > > > controller). Distinguishing between whether or not Postgres submitted
> > > > 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> > > > but I think it might be confusing for the Postgres user trying to
> > > > determine why their query is slow. It probably makes the most sense to
> > > > still track in block size.
> > > >
> > > > No matter what solution we pick, you should get a correct number if
> > > > you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> > > > rather, there should be some way of getting an accurate number in
> > > > bytes of the amount of a particular kind of I/O that has been done.
> > >
> > > Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
> > > writes or extends are counted as bytes, because we don't have a fixed
> > > operation size for some (object,context) cases.
> >
> > Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL
> > IOs and 2 for the other IOs.
>
> We can do this. One concern I have is that much of WAL I/O is done in
> XLOG_BLCKSZ, so it feels kind of odd for all WAL I/O to appear as if
> it is being done in random chunks of bytes. We anticipated other
> uniformly non-block-based I/O types where having 1 in op_bytes would
> be meaningful. I didn't realize at the time that there would be
> variable-sized and block-sized I/O mixed together for the same backend
> type, io object, and io context.

Correct. What is the lowest level that can use two different options?
I mean, could we use 3 for the WAL receiver, IOOP_READ, IOOBJECT_WAL,
IOCONTEXT_NORMAL and the 2 for the rest?

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

15 January 2024, 06:27:20

On Fri, Jan 12, 2024 at 04:23:26PM +0300, Nazir Bilal Yavuz wrote:
> On Thu, 11 Jan 2024 at 17:28, Melanie Plageman <melanieplageman@gmail.com> wrote:
>> Even if we made a separate view for WAL I/O stats, we would still have
>> this issue of variable sized I/O vs block sized I/O and would probably
>> end up solving it with separate columns for the number of bytes and
>> number of operations.
>
> Yes, I think it is more about flexibility and not changing the already
> published pg_stat_io view.

I don't know.  Adding more columns or changing op_bytes with an extra
mode that reflects on what the other columns mean is kind of the same
thing to me: we want pg_stat_io to report more modes so as all I/O can
be evaluated from a single view, but the complication is now that
everything is tied to BLCKSZ.

IMHO, perhaps we'd better put this patch aside until we are absolutely
*sure* of what we want to achieve when it comes to WAL, and I am
afraid that this cannot happen until we're happy with the way we
handle WAL reads *and* writes, including WAL receiver or anything that
has the idea of pulling its own page callback with
XLogReaderAllocate() in the backend.  Well, writes should be
relatively "easy" as things happen with XLOG_BLCKSZ, mainly, but
reads are the unknown part.

That also seems furiously related to the work happening with async I/O
or the fact that we may want to have in the view a separate meaning
for cached pages or pages read directly from disk.  The worst thing
that we would do is rush something into the tree and then have to deal
with the aftermath of what we'd need to deal with in terms of
compatibility depending on the state of the other I/O related work
when the new view is released.  That would not be fun for the users
and any hackers who would have to deal with that (aka mainly me if I
were to commit something), because pg_stat_io could mean something in
version N, still mean something entirely different in version N+1.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

17 January 2024, 12:20:39

Hi,

On Mon, 15 Jan 2024 at 09:27, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Jan 12, 2024 at 04:23:26PM +0300, Nazir Bilal Yavuz wrote:
> > On Thu, 11 Jan 2024 at 17:28, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >> Even if we made a separate view for WAL I/O stats, we would still have
> >> this issue of variable sized I/O vs block sized I/O and would probably
> >> end up solving it with separate columns for the number of bytes and
> >> number of operations.
> >
> > Yes, I think it is more about flexibility and not changing the already
> > published pg_stat_io view.
>
> I don't know.  Adding more columns or changing op_bytes with an extra
> mode that reflects on what the other columns mean is kind of the same
> thing to me: we want pg_stat_io to report more modes so as all I/O can
> be evaluated from a single view, but the complication is now that
> everything is tied to BLCKSZ.
>
> IMHO, perhaps we'd better put this patch aside until we are absolutely
> *sure* of what we want to achieve when it comes to WAL, and I am
> afraid that this cannot happen until we're happy with the way we
> handle WAL reads *and* writes, including WAL receiver or anything that
> has the idea of pulling its own page callback with
> XLogReaderAllocate() in the backend.  Well, writes should be
> relatively "easy" as things happen with XLOG_BLCKSZ, mainly, but
> reads are the unknown part.
>
> That also seems furiously related to the work happening with async I/O
> or the fact that we may want to have in the view a separate meaning
> for cached pages or pages read directly from disk.  The worst thing
> that we would do is rush something into the tree and then have to deal
> with the aftermath of what we'd need to deal with in terms of
> compatibility depending on the state of the other I/O related work
> when the new view is released.  That would not be fun for the users
> and any hackers who would have to deal with that (aka mainly me if I
> were to commit something), because pg_stat_io could mean something in
> version N, still mean something entirely different in version N+1.

I agree with your points. While the other I/O related work is
happening we can discuss what we should do in the variable op_byte
cases. Also, this is happening only for WAL right now but if we try to
extend pg_stat_io in the future, that problem possibly will rise
again. So, it could be good to come up with a general solution, not
only for WAL.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

18 January 2024, 01:22:26

On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> I agree with your points. While the other I/O related work is
> happening we can discuss what we should do in the variable op_byte
> cases. Also, this is happening only for WAL right now but if we try to
> extend pg_stat_io in the future, that problem possibly will rise
> again. So, it could be good to come up with a general solution, not
> only for WAL.

Okay, I've marked the patch as RwF for this CF.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

19 February 2024, 07:28:05

Hi,

On Thu, 18 Jan 2024 at 04:22, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > I agree with your points. While the other I/O related work is
> > happening we can discuss what we should do in the variable op_byte
> > cases. Also, this is happening only for WAL right now but if we try to
> > extend pg_stat_io in the future, that problem possibly will rise
> > again. So, it could be good to come up with a general solution, not
> > only for WAL.
>
> Okay, I've marked the patch as RwF for this CF.

I wanted to inform you that the 73f0a13266 commit changed all WALRead
calls to read variable bytes, only the WAL receiver was reading
variable bytes before.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

19 April 2024, 08:01:54

Hi,

On Mon, 19 Feb 2024 at 10:28, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Thu, 18 Jan 2024 at 04:22, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > > I agree with your points. While the other I/O related work is
> > > happening we can discuss what we should do in the variable op_byte
> > > cases. Also, this is happening only for WAL right now but if we try to
> > > extend pg_stat_io in the future, that problem possibly will rise
> > > again. So, it could be good to come up with a general solution, not
> > > only for WAL.
> >
> > Okay, I've marked the patch as RwF for this CF.
>
> I wanted to inform you that the 73f0a13266 commit changed all WALRead
> calls to read variable bytes, only the WAL receiver was reading
> variable bytes before.

I want to start working on this again if possible. I will try to
summarize the current status:

* With the 73f0a13266 commit, the WALRead() function started to read
variable bytes in every case. Before, only the WAL receiver was
reading variable bytes.

* With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
discussing what we have to do when this is merged. It is decided that
WALReadFromBuffers() does not call pgstat_report_wait_start() because
this function does not perform any IO [1]. We may follow the same
logic by not including these to pg_stat_io?

* With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
does not block anything related to putting WAL stats in pg_stat_io.

If I am not missing any new changes, the only problem is reading
variable bytes now. We have discussed a couple of solutions:

1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
that this means some variable byte I/O is happening.

I kind of dislike this solution because if the *only* read I/O is
happening in variable bytes, it will look like write and extend I/Os
are happening in variable bytes as well. As a solution, it could be
documented that only read I/Os could happen in variable bytes for now.

2- Use op_bytes_[read | write | extend] columns instead of one
op_bytes column, also use the first solution.

This can solve the first solution's weakness but it introduces two
more columns. This is more future proof compared to the first solution
if there is a chance that some variable I/O could happen in write and
extend calls as well in the future.

3- Create a new pg_stat_io_wal view to put WAL I/Os here instead of pg_stat_io.

pg_stat_io could remain untouchable and we will have flexibility to
edit this new view as much as we want. But the original aim of the
pg_stat_io is evaluating all I/O from a single view and adding a new
view breaks this aim.

I hope that I did not miss anything and my explanations are clear.

Any kind of feedback would be appreciated.

[1] https://www.postgresql.org/message-id/CAFiTN-sE7CJn-ZFj%2B-0Wv6TNytv_fp4n%2BeCszspxJ3mt77t5ig%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

24 April 2024, 08:37:21

Hi,

On Fri, 19 Apr 2024 at 11:01, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > On Thu, 18 Jan 2024 at 04:22, Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote:
> > > > I agree with your points. While the other I/O related work is
> > > > happening we can discuss what we should do in the variable op_byte
> > > > cases. Also, this is happening only for WAL right now but if we try to
> > > > extend pg_stat_io in the future, that problem possibly will rise
> > > > again. So, it could be good to come up with a general solution, not
> > > > only for WAL.
> > >
> > > Okay, I've marked the patch as RwF for this CF.

Since the last commitfest entry was returned with feedback, I created
a new commitfest entry: https://commitfest.postgresql.org/48/4950/

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Bharath Rupireddy

Date:

13 May 2024, 14:12:11

On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> > I wanted to inform you that the 73f0a13266 commit changed all WALRead
> > calls to read variable bytes, only the WAL receiver was reading
> > variable bytes before.
>
> I want to start working on this again if possible. I will try to
> summarize the current status:

Thanks for working on this.

> * With the 73f0a13266 commit, the WALRead() function started to read
> variable bytes in every case. Before, only the WAL receiver was
> reading variable bytes.
>
> * With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
> discussing what we have to do when this is merged. It is decided that
> WALReadFromBuffers() does not call pgstat_report_wait_start() because
> this function does not perform any IO [1]. We may follow the same
> logic by not including these to pg_stat_io?

Right. WALReadFromBuffers doesn't do any I/O.

Whoever reads WAL from disk (backends, walsenders, recovery process)
using pg_pread (XLogPageRead, WALRead) needs to be tracked in
pg_stat_io or some other view. If it were to be in pg_stat_io,
although we may not be able to distinguish WAL read stats at a backend
level (like how many times/bytes a walsender or recovery process or a
backend read WAL from disk), but it can help understand overall impact
of WAL read I/O at a cluster level. With this approach, the WAL I/O
stats are divided up - WAL read I/O and write I/O stats are in
pg_stat_io and pg_stat_wal respectively.

This makes me think if we need to add WAL read I/O stats also to
pg_stat_wal. Then, we can also add WALReadFromBuffers stats
hits/misses there. With this approach, pg_stat_wal can be a one-stop
view for all the WAL related stats. If needed, we can join info from
pg_stat_wal to pg_stat_io in system_views.sql so that the I/O stats
are emitted to the end-user via pg_stat_io.

> * With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
> does not block anything related to putting WAL stats in pg_stat_io.
>
> If I am not missing any new changes, the only problem is reading
> variable bytes now. We have discussed a couple of solutions:
>
> 1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
> that this means some variable byte I/O is happening.
>
> I kind of dislike this solution because if the *only* read I/O is
> happening in variable bytes, it will look like write and extend I/Os
> are happening in variable bytes as well. As a solution, it could be
> documented that only read I/Os could happen in variable bytes for now.

Yes, read I/O for relation and WAL can happen in variable bytes. I
think this idea seems reasonable and simple yet useful to know the
cluster-wide read I/O.

However, another point here is how the total number of bytes read is
represented with existing pg_stat_io columns 'reads' and 'op_bytes'.
It is known now with 'reads' * 'op_bytes', but with variable bytes,
how is read bytes calculated? Maybe add new columns
read_bytes/write_bytes?

> 2- Use op_bytes_[read | write | extend] columns instead of one
> op_bytes column, also use the first solution.
>
> This can solve the first solution's weakness but it introduces two
> more columns. This is more future proof compared to the first solution
> if there is a chance that some variable I/O could happen in write and
> extend calls as well in the future.

-1 as more columns impact the readability and usability.

> 3- Create a new pg_stat_io_wal view to put WAL I/Os here instead of pg_stat_io.
>
> pg_stat_io could remain untouchable and we will have flexibility to
> edit this new view as much as we want. But the original aim of the
> pg_stat_io is evaluating all I/O from a single view and adding a new
> view breaks this aim.

-1 as it defeats the very purpose of one-stop view pg_stat_io for all
kinds of I/O. PS: see my response above about adding both WAL write
I/O and read I/O stats to pg_stat_wal.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Show WAL write and fsync stats in pg_stat_io

From

Amit Kapila

Date:

28 May 2024, 00:48:40

On Mon, May 13, 2024 at 7:42 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > > I wanted to inform you that the 73f0a13266 commit changed all WALRead
> > > calls to read variable bytes, only the WAL receiver was reading
> > > variable bytes before.
> >
> > I want to start working on this again if possible. I will try to
> > summarize the current status:
>
> Thanks for working on this.
>
> > * With the 73f0a13266 commit, the WALRead() function started to read
> > variable bytes in every case. Before, only the WAL receiver was
> > reading variable bytes.
> >
> > * With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
> > discussing what we have to do when this is merged. It is decided that
> > WALReadFromBuffers() does not call pgstat_report_wait_start() because
> > this function does not perform any IO [1]. We may follow the same
> > logic by not including these to pg_stat_io?
>
> Right. WALReadFromBuffers doesn't do any I/O.
>
> Whoever reads WAL from disk (backends, walsenders, recovery process)
> using pg_pread (XLogPageRead, WALRead) needs to be tracked in
> pg_stat_io or some other view. If it were to be in pg_stat_io,
> although we may not be able to distinguish WAL read stats at a backend
> level (like how many times/bytes a walsender or recovery process or a
> backend read WAL from disk), but it can help understand overall impact
> of WAL read I/O at a cluster level. With this approach, the WAL I/O
> stats are divided up - WAL read I/O and write I/O stats are in
> pg_stat_io and pg_stat_wal respectively.
>
> This makes me think if we need to add WAL read I/O stats also to
> pg_stat_wal. Then, we can also add WALReadFromBuffers stats
> hits/misses there. With this approach, pg_stat_wal can be a one-stop
> view for all the WAL related stats.
>

If possible, let's have all the I/O stats (even for WAL) in
pg_stat_io. Can't we show the WAL data we get from buffers in the hits
column and then have read_bytes or something like that to know the
amount of data read?

--
With Regards,
Amit Kapila.

Re: Show WAL write and fsync stats in pg_stat_io

From

Nitin Jadhav

Date:

09 June 2024, 15:05:08

> If possible, let's have all the I/O stats (even for WAL) in
> pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> column and then have read_bytes or something like that to know the
> amount of data read?

The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a
database. It signifies the count of cache hits, or in other words, the
instances where data was located in the ‘shared_buffers’. As a result,
keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful
knowledge about the buffer cache’s efficiency and assist users in
making educated choices when fine-tuning their database. However, if
we include the hit count of WAL buffers in this, it may lead to
misleading interpretations for database tuning. If there’s something
I’ve overlooked that’s already been discussed, please feel free to
correct me.


Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Tue, May 28, 2024 at 6:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 13, 2024 at 7:42 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > >
> > > > I wanted to inform you that the 73f0a13266 commit changed all WALRead
> > > > calls to read variable bytes, only the WAL receiver was reading
> > > > variable bytes before.
> > >
> > > I want to start working on this again if possible. I will try to
> > > summarize the current status:
> >
> > Thanks for working on this.
> >
> > > * With the 73f0a13266 commit, the WALRead() function started to read
> > > variable bytes in every case. Before, only the WAL receiver was
> > > reading variable bytes.
> > >
> > > * With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
> > > discussing what we have to do when this is merged. It is decided that
> > > WALReadFromBuffers() does not call pgstat_report_wait_start() because
> > > this function does not perform any IO [1]. We may follow the same
> > > logic by not including these to pg_stat_io?
> >
> > Right. WALReadFromBuffers doesn't do any I/O.
> >
> > Whoever reads WAL from disk (backends, walsenders, recovery process)
> > using pg_pread (XLogPageRead, WALRead) needs to be tracked in
> > pg_stat_io or some other view. If it were to be in pg_stat_io,
> > although we may not be able to distinguish WAL read stats at a backend
> > level (like how many times/bytes a walsender or recovery process or a
> > backend read WAL from disk), but it can help understand overall impact
> > of WAL read I/O at a cluster level. With this approach, the WAL I/O
> > stats are divided up - WAL read I/O and write I/O stats are in
> > pg_stat_io and pg_stat_wal respectively.
> >
> > This makes me think if we need to add WAL read I/O stats also to
> > pg_stat_wal. Then, we can also add WALReadFromBuffers stats
> > hits/misses there. With this approach, pg_stat_wal can be a one-stop
> > view for all the WAL related stats.
> >
>
> If possible, let's have all the I/O stats (even for WAL) in
> pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> column and then have read_bytes or something like that to know the
> amount of data read?
>
> --
> With Regards,
> Amit Kapila.
>
>

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

13 June 2024, 08:51:59

Hi,

Thank you for looking into this! And, sorry for the late answer.

On Mon, 13 May 2024 at 17:12, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Apr 19, 2024 at 1:32 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > > I wanted to inform you that the 73f0a13266 commit changed all WALRead
> > > calls to read variable bytes, only the WAL receiver was reading
> > > variable bytes before.
> >
> > I want to start working on this again if possible. I will try to
> > summarize the current status:
>
> Thanks for working on this.
>
> > * With the 73f0a13266 commit, the WALRead() function started to read
> > variable bytes in every case. Before, only the WAL receiver was
> > reading variable bytes.
> >
> > * With the 91f2cae7a4 commit, WALReadFromBuffers() is merged. We were
> > discussing what we have to do when this is merged. It is decided that
> > WALReadFromBuffers() does not call pgstat_report_wait_start() because
> > this function does not perform any IO [1]. We may follow the same
> > logic by not including these to pg_stat_io?
>
> Right. WALReadFromBuffers doesn't do any I/O.
>
> Whoever reads WAL from disk (backends, walsenders, recovery process)
> using pg_pread (XLogPageRead, WALRead) needs to be tracked in
> pg_stat_io or some other view. If it were to be in pg_stat_io,
> although we may not be able to distinguish WAL read stats at a backend
> level (like how many times/bytes a walsender or recovery process or a
> backend read WAL from disk), but it can help understand overall impact
> of WAL read I/O at a cluster level. With this approach, the WAL I/O
> stats are divided up - WAL read I/O and write I/O stats are in
> pg_stat_io and pg_stat_wal respectively.
>
> This makes me think if we need to add WAL read I/O stats also to
> pg_stat_wal. Then, we can also add WALReadFromBuffers stats
> hits/misses there. With this approach, pg_stat_wal can be a one-stop
> view for all the WAL related stats. If needed, we can join info from
> pg_stat_wal to pg_stat_io in system_views.sql so that the I/O stats
> are emitted to the end-user via pg_stat_io.

I agree that the ultimate goal is seeing WAL I/O stats from one place.
There is a reply to this from Amit:

On Tue, 28 May 2024 at 03:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> If possible, let's have all the I/O stats (even for WAL) in
> pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> column and then have read_bytes or something like that to know the
> amount of data read?

I think it is better to have all the I/O stats in pg_stat_io like Amit
said. And, it makes sense to me to show 'WAL data we get from buffers'
in the hits column. Since, basically instead of doing I/O from disk;
we get data directly from WAL buffers. I think that fits the
explanation of the hits column in pg_stat_io, which is 'The number of
times a desired block was found in a shared buffer.' [1].

> > * With the b5a9b18cd0 commit, streaming I/O is merged but AFAIK this
> > does not block anything related to putting WAL stats in pg_stat_io.
> >
> > If I am not missing any new changes, the only problem is reading
> > variable bytes now. We have discussed a couple of solutions:
> >
> > 1- Change op_bytes to something like -1, 0, 1, NULL etc. and document
> > that this means some variable byte I/O is happening.
> >
> > I kind of dislike this solution because if the *only* read I/O is
> > happening in variable bytes, it will look like write and extend I/Os
> > are happening in variable bytes as well. As a solution, it could be
> > documented that only read I/Os could happen in variable bytes for now.
>
> Yes, read I/O for relation and WAL can happen in variable bytes. I
> think this idea seems reasonable and simple yet useful to know the
> cluster-wide read I/O.

I agree.

> However, another point here is how the total number of bytes read is
> represented with existing pg_stat_io columns 'reads' and 'op_bytes'.
> It is known now with 'reads' * 'op_bytes', but with variable bytes,
> how is read bytes calculated? Maybe add new columns
> read_bytes/write_bytes?
>
> > 2- Use op_bytes_[read | write | extend] columns instead of one
> > op_bytes column, also use the first solution.
> >
> > This can solve the first solution's weakness but it introduces two
> > more columns. This is more future proof compared to the first solution
> > if there is a chance that some variable I/O could happen in write and
> > extend calls as well in the future.
>
> -1 as more columns impact the readability and usability.

I did not understand the overall difference between what you suggested
(adding read_bytes/write_bytes columns) and my suggestion (adding
op_bytes_[read | write | extend] columns). They both introduce new
columns. Could you please explain what you suggested in more detail?

> > 3- Create a new pg_stat_io_wal view to put WAL I/Os here instead of pg_stat_io.
> >
> > pg_stat_io could remain untouchable and we will have flexibility to
> > edit this new view as much as we want. But the original aim of the
> > pg_stat_io is evaluating all I/O from a single view and adding a new
> > view breaks this aim.
>
> -1 as it defeats the very purpose of one-stop view pg_stat_io for all
> kinds of I/O. PS: see my response above about adding both WAL write
> I/O and read I/O stats to pg_stat_wal.

I agree.

[1] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-IO-VIEW

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

13 June 2024, 09:24:36

Hi,

Thank you for looking into this!

On Sun, 9 Jun 2024 at 18:05, Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:
>
> > If possible, let's have all the I/O stats (even for WAL) in
> > pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> > column and then have read_bytes or something like that to know the
> > amount of data read?
>
> The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a
> database. It signifies the count of cache hits, or in other words, the
> instances where data was located in the ‘shared_buffers’. As a result,
> keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful
> knowledge about the buffer cache’s efficiency and assist users in
> making educated choices when fine-tuning their database. However, if
> we include the hit count of WAL buffers in this, it may lead to
> misleading interpretations for database tuning. If there’s something
> I’ve overlooked that’s already been discussed, please feel free to
> correct me.

I think counting them as a hit makes sense. We read data from WAL
buffers instead of reading them from disk. And, WAL buffers are stored
in shared memory so I believe they can be counted as hits in the
shared buffers. Could you please explain how this change can 'lead to
misleading interpretations for database tuning' a bit more?

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Melanie Plageman

Date:

17 June 2024, 14:53:27

On Thu, Jun 13, 2024 at 5:24 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> On Sun, 9 Jun 2024 at 18:05, Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:
> >
> > > If possible, let's have all the I/O stats (even for WAL) in
> > > pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> > > column and then have read_bytes or something like that to know the
> > > amount of data read?
> >
> > The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a
> > database. It signifies the count of cache hits, or in other words, the
> > instances where data was located in the ‘shared_buffers’. As a result,
> > keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful
> > knowledge about the buffer cache’s efficiency and assist users in
> > making educated choices when fine-tuning their database. However, if
> > we include the hit count of WAL buffers in this, it may lead to
> > misleading interpretations for database tuning. If there’s something
> > I’ve overlooked that’s already been discussed, please feel free to
> > correct me.
>
> I think counting them as a hit makes sense. We read data from WAL
> buffers instead of reading them from disk. And, WAL buffers are stored
> in shared memory so I believe they can be counted as hits in the
> shared buffers. Could you please explain how this change can 'lead to
> misleading interpretations for database tuning' a bit more?

Perhaps Nitin was thinking of a scenario in which WAL hits are counted
as hits on the same IOObject as shared buffer hits. Since this thread
has been going on for awhile and we haven't recently had a schema
overview, I could understand if there was some confusion. For clarity,
I will restate that the current proposal is to count WAL buffer hits
for IOObject WAL, which means they will not be mixed in with shared
buffer hits.

And I think it makes sense to count WAL IOObject hits since increasing
wal_buffers can lead to more hits, right?

- Melanie

Re: Show WAL write and fsync stats in pg_stat_io

From

Nitin Jadhav

Date:

06 July 2024, 07:28:56

> Perhaps Nitin was thinking of a scenario in which WAL hits are counted
> as hits on the same IOObject as shared buffer hits. Since this thread
> has been going on for awhile and we haven't recently had a schema
> overview, I could understand if there was some confusion

Yes. I was considering a scenario where WAL hits are counted as hits
on the same IOObject as shared buffer hits.

> For clarity,
> I will restate that the current proposal is to count WAL buffer hits
> for IOObject WAL, which means they will not be mixed in with shared
> buffer hits.
>
> And I think it makes sense to count WAL IOObject hits since increasing
> wal_buffers can lead to more hits, right?

Thank you for the clarification. I agree with the proposal to count
WAL buffer hits for IOObject WAL separately from shared buffer hits.
This distinction will provide a more accurate representation.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Mon, Jun 17, 2024 at 8:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Jun 13, 2024 at 5:24 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > On Sun, 9 Jun 2024 at 18:05, Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:
> > >
> > > > If possible, let's have all the I/O stats (even for WAL) in
> > > > pg_stat_io. Can't we show the WAL data we get from buffers in the hits
> > > > column and then have read_bytes or something like that to know the
> > > > amount of data read?
> > >
> > > The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a
> > > database. It signifies the count of cache hits, or in other words, the
> > > instances where data was located in the ‘shared_buffers’. As a result,
> > > keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful
> > > knowledge about the buffer cache’s efficiency and assist users in
> > > making educated choices when fine-tuning their database. However, if
> > > we include the hit count of WAL buffers in this, it may lead to
> > > misleading interpretations for database tuning. If there’s something
> > > I’ve overlooked that’s already been discussed, please feel free to
> > > correct me.
> >
> > I think counting them as a hit makes sense. We read data from WAL
> > buffers instead of reading them from disk. And, WAL buffers are stored
> > in shared memory so I believe they can be counted as hits in the
> > shared buffers. Could you please explain how this change can 'lead to
> > misleading interpretations for database tuning' a bit more?
>
> Perhaps Nitin was thinking of a scenario in which WAL hits are counted
> as hits on the same IOObject as shared buffer hits. Since this thread
> has been going on for awhile and we haven't recently had a schema
> overview, I could understand if there was some confusion. For clarity,
> I will restate that the current proposal is to count WAL buffer hits
> for IOObject WAL, which means they will not be mixed in with shared
> buffer hits.
>
> And I think it makes sense to count WAL IOObject hits since increasing
> wal_buffers can lead to more hits, right?
>
> - Melanie

Re: Show WAL write and fsync stats in pg_stat_io

From

Bertrand Drouvot

Date:

27 January, 10:45:25

Hi,

On Fri, Jan 24, 2025 at 06:29:46PM +0300, Nazir Bilal Yavuz wrote:
> Hi,
> 
> Thanks for looking into this!
> 
> On Fri, 24 Jan 2025 at 17:20, Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > I did not look at the code yet but did a few tests.
> > I can see diff between pg_stat_wal and pg_stat_io, for example:
> >
> > "
> > postgres=# select pg_stat_reset_shared();
> >  pg_stat_reset_shared
> > ----------------------
> >
> > (1 row)
> >
> > postgres=# insert into bdt select a from generate_series(1,200000) a ;
> > INSERT 0 200000
> >
> > postgres=# select wal_bytes,stats_reset from pg_stat_wal;
> >  wal_bytes |          stats_reset
> > -----------+-------------------------------
> >   11800088 | 2025-01-24 14:17:28.507994+00
> > (1 row)
> >
> > postgres=# select sum(write_bytes),stats_reset from pg_stat_io where object = 'wal' group by stats_reset;
> >    sum    |          stats_reset
> > ----------+-------------------------------
> >  12853248 | 2025-01-24 14:17:28.507988+00
> > (1 row)
> >
> > Is that expected?
> 
> I am not sure, I think they need to be the same.

Yeah I think so (or at least we should document the reason(s) why (and how)
they could differ).

> I could not
> understand the cause of the difference at first glance. I will look
> into this and will come back to you.

Thanks.

> 
> Quick note: WAL stats in pg_stat_io view includes WAL initialization
> stats under object = 'wal' and context = 'init', your query may count
> these initialization stats too. So the correct query is:
> 
> 'select sum(write_bytes),stats_reset from pg_stat_io where object =
> 'wal' and context = 'normal' group by stats_reset;'.
>

Thanks for mentioning this filtering! 
 
> By saying that, this does not solve the problem; there is still a
> difference although you omit WAL initialization stats from the
> pg_stat_io.

Yeah. The "init" data was also empty on my side after the pg_stat_reset_shared()
and after the insertion in the table.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

27 January, 17:13:39

Hi,

On Mon, 27 Jan 2025 at 16:59, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> On Mon, 27 Jan 2025 at 03:52, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > In 0002, you are removing PendingWalStats.wal_write_time, which does
> > not seem OK to me because we should still aggregate this data for
> > track_wal_io_timing, no?
>
> We use PendingWalStats.wal_[ write | sync ]_time only to show timings
> on the pg_stat_wal view, right? And now these pg_stat_wal.wal_[ write
> | sync ]_time datas are fetched from the pg_stat_io view when the
> track_wal_io_timing is on. So, I think it is correct to remove these.
>
> I made a mistake while splitting the patches. The places where
> 'PendingWalStats.wal_[ write | sync ]_time are incremented (the code
> piece you shared)' are removed in 0002 (0001 now), but they should be
> removed in 0003 (0002 now) instead. This is corrected in v11.

Oops, I forgot to add one thing to the previous mail.

If we agree with removing PendingWalStats.wal_[ write | sync ]_time
variables, then it would make sense to remove PgStat_PendingWalStats
struct completely. We have that struct because [1] it is cheap to
store PendingWalStats.wal_[ write | sync ]_time as instr_time instead
of PgStat_Counter.

[1] ca7b3c4c00

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

29 January, 13:49:20

Hi,

On Mon, 27 Jan 2025 at 10:45, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Fri, Jan 24, 2025 at 06:29:46PM +0300, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > Thanks for looking into this!
> >
> > On Fri, 24 Jan 2025 at 17:20, Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > >
> > > I did not look at the code yet but did a few tests.
> > > I can see diff between pg_stat_wal and pg_stat_io, for example:
> > >
> > > "
> > > postgres=# select pg_stat_reset_shared();
> > >  pg_stat_reset_shared
> > > ----------------------
> > >
> > > (1 row)
> > >
> > > postgres=# insert into bdt select a from generate_series(1,200000) a ;
> > > INSERT 0 200000
> > >
> > > postgres=# select wal_bytes,stats_reset from pg_stat_wal;
> > >  wal_bytes |          stats_reset
> > > -----------+-------------------------------
> > >   11800088 | 2025-01-24 14:17:28.507994+00
> > > (1 row)
> > >
> > > postgres=# select sum(write_bytes),stats_reset from pg_stat_io where object = 'wal' group by stats_reset;
> > >    sum    |          stats_reset
> > > ----------+-------------------------------
> > >  12853248 | 2025-01-24 14:17:28.507988+00
> > > (1 row)
> > >
> > > Is that expected?
> >
> > I am not sure, I think they need to be the same.
>
> Yeah I think so (or at least we should document the reason(s) why (and how)
> they could differ).
>
> > I could not
> > understand the cause of the difference at first glance. I will look
> > into this and will come back to you.
>
> Thanks.

Sorry for replying late. It turns out that the difference is expected.
It is because pg_stat_wal.wal_bytes tracks the total amount of WAL
generated, while pg_stat_io tracks the I/Os that are written to disk.

From the docs of pg_stat_io: pg_stat_io view will contain one row for
each combination of backend type, target I/O object, and I/O context,
showing cluster-wide I/O statistics. Combinations which do not make
sense are omitted.

From the docs of pg_stat_wal.wal_bytes: Total amount of WAL generated in bytes.

It is explicitly said that pg_stat_io tracks I/O operations and
pg_stat_wal.wal_bytes tracks total amount of WAL generated in bytes. I
think this is clear enough. Do you think we still need additional
explanation?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Bertrand Drouvot

Date:

29 January, 18:16:13

Hi,

On Wed, Jan 29, 2025 at 02:57:21PM +0300, Nazir Bilal Yavuz wrote:
> I agree with you but it was discussed before in this thread [2]. It
> was decided to use both track_wal_io_timing and track_io_timing
> because of the overhead that track_wal_io_timing creates but we can
> still re-discuss it. Do you think that this discussion needs its own
> thread?
> 
> If we continue to discuss it in this thread, I am in favor of removing
> track_wal_io_timing and using track_io_timing for all types of I/Os.
> Like you said, this cross-dependency makes things more complex than
> they used to be. Downside of removing track_wal_io_timing is affecting
> people who:
> 
> 1- Want to track timings of only WAL I/Os.
> 2- Want to track timings of all IOs except WAL I/Os.
> 
> I think the first group is more important than the second because
> track_io_timing already creates overhead.

I'm -1 of removing track_wal_io_timing. I think that this code path is very
sensible to performance to not add extra overhead when not necessary asked for.

I think that's the main reason why ff99918c625 added this new GUC (looking at
the commit message). I'd feel more comfortable if we keep it.

That said, in:

+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+       if (io_object == IOOBJECT_WAL)
+               return track_wal_io_timing;
+
+       return track_io_timing;
+}

I think it would make sense to return "track_io_timing && track_wal_io_timing"
for the IOOBJECT_WAL case. That way it maintains track_io_timing as the master
switch for all I/O timing in pg_stat_io.

> One additional thing is that I think track_io_timing is a general
> word. When it exists, I do not expect there to be another GUC like
> track_wal_io_timing to track WAL I/Os' timings.

That's true but OTOH track_wal_io_timing is already there since years (it's not
like we are adding it).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

31 January, 11:29:31

Hi,

On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Wed, Jan 29, 2025 at 02:57:21PM +0300, Nazir Bilal Yavuz wrote:
> > I agree with you but it was discussed before in this thread [2]. It
> > was decided to use both track_wal_io_timing and track_io_timing
> > because of the overhead that track_wal_io_timing creates but we can
> > still re-discuss it. Do you think that this discussion needs its own
> > thread?
> >
> > If we continue to discuss it in this thread, I am in favor of removing
> > track_wal_io_timing and using track_io_timing for all types of I/Os.
> > Like you said, this cross-dependency makes things more complex than
> > they used to be. Downside of removing track_wal_io_timing is affecting
> > people who:
> >
> > 1- Want to track timings of only WAL I/Os.
> > 2- Want to track timings of all IOs except WAL I/Os.
> >
> > I think the first group is more important than the second because
> > track_io_timing already creates overhead.
>
> I'm -1 of removing track_wal_io_timing. I think that this code path is very
> sensible to performance to not add extra overhead when not necessary asked for.
>
> I think that's the main reason why ff99918c625 added this new GUC (looking at
> the commit message). I'd feel more comfortable if we keep it.

As Michael suggested, I will run a couple of benchmarks to see the
actual effect of this change. Then let's see if this affects anything.

>
> That said, in:
>
> +static bool
> +pgstat_should_track_io_time(IOObject io_object)
> +{
> +       if (io_object == IOOBJECT_WAL)
> +               return track_wal_io_timing;
> +
> +       return track_io_timing;
> +}
>
> I think it would make sense to return "track_io_timing && track_wal_io_timing"
> for the IOOBJECT_WAL case. That way it maintains track_io_timing as the master
> switch for all I/O timing in pg_stat_io.

I do not think that makes sense if we want to take performance into
account. This means if we want to track WAL IO timings, we must track
other IOs timings as well. Or perhaps did you mean that not fetching
pg_stat_wal's timings from pg_stat_io and instead: track_wal_io_timing
will track timings in the pg_stat_wal but it won't track the WAL IO
timings in the pg_stat_io unless track_io_timing is enabled?

> > One additional thing is that I think track_io_timing is a general
> > word. When it exists, I do not expect there to be another GUC like
> > track_wal_io_timing to track WAL I/Os' timings.
>
> That's true but OTOH track_wal_io_timing is already there since years (it's not
> like we are adding it).

Yes, this makes sense.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Bertrand Drouvot

Date:

03 February, 11:50:15

Hi,

On Mon, Feb 03, 2025 at 01:07:26PM +0900, Michael Paquier wrote:
> On Fri, Jan 31, 2025 at 11:29:31AM +0300, Nazir Bilal Yavuz wrote:
> > On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> >> I think that's the main reason why ff99918c625 added this new GUC (looking at
> >> the commit message). I'd feel more comfortable if we keep it.
> > 
> > As Michael suggested, I will run a couple of benchmarks to see the
> > actual effect of this change. Then let's see if this affects anything.
> 
> I've looked at bit at all that today, and something like the attached
> is what seems like the best streamlined version to me for the main
> feature.  I am also planning to run some short benchmarks with
> track_io_timing=on on HEAD and with the patch, then see the
> difference, without any relationship to track_wal_io_timing.

Thanks!

I've a few comments:

=== 1

+       pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+                                                       io_start, 1, wal_segment_size);

In case wal_init_zero is false, then we're only seeking to the end and write a
solitary byte. Then, is reporting "wal_segment_size" correct?

=== 2

+ /*
+  * Measure I/O timing to write WAL data, for pg_stat_wal
+  * and/or pg_stat_io.
+  */
+ start = pgstat_prepare_io_time(track_wal_io_timing || track_io_timing);

I think that makes sense done that way (as track_wal_io_timing does not have
any effect in pgstat_count_io_op_time()). Nit: maybe change the order in the
comment to reflect the code ordering? (I mean to say re-word to "for pg_stat_io
and/or pg_stat_wal). The order is ok in issue_xlog_fsync() though.

=== 3

What about adding a message in the doc as mentioned in [1]? (I'd not be surprised
if some people wonder why the "bytes" fields differ).

=== 4

pgstat_tracks_io_object() starts to be hard to read. I wonder if it could be
simplified with switch but that could be done after this one goes in.

=== 5

I think this patch will help simplify the per-backend WAL related patch, that's
nice. 

=== 6

I'll also do some benchmark on my side.

[1]: https://www.postgresql.org/message-id/Z5o9OQ0nwWD9tKTR%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Show WAL write and fsync stats in pg_stat_io

From

Bertrand Drouvot

Date:

03 February, 13:37:54

On Mon, Feb 03, 2025 at 08:50:15AM +0000, Bertrand Drouvot wrote:
> === 6
> 
> I'll also do some benchmark on my side.

So, I did some tests using:

c=1 && psql -c checkpoint -c 'select pg_switch_wal()' &&
pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t
20000

With 2 message size: 8192 and 10.

Here are the results (outliers removed and tsc clock source):

+------------+-------------+-------------+
| Version    | Msg Size 10 | Msg Size 8K |
+------------+-------------+-------------+
| With PATCH |    980 TPS  |    910 TPS  |
| On Master  |    980 TPS  |    910 TPS  |
+------------+-------------+-------------+

So the patch does not produce perf regression according to those tests.

Out of curiosity I also played a bit with the IO tracking (and hpet clock source)
and got:

+---------------------------------+-------------+-------------+-----------+
| Test Configuration              |    PATCH    |   MASTER    | % Change |
+---------------------------------+-------------+-------------+-----------+
| track_io_timing                 |                                      |
|   Message size 8192             |    805 TPS  |    810 TPS  |   -0.6%  |
|   Message size 10               |    860 TPS  |    860 TPS  |    0.0%  |
+---------------------------------+-------------+-------------+-----------+
| track_wal_io_timing             |                                      |
|   Message size 8192             |    810 TPS  |    810 TPS  |    0.0%  |
|   Message size 10               |    860 TPS  |    860 TPS  |    0.0%  |
+---------------------------------+-------------+-------------+-----------+
| track_wal_io + track_io         |                                      |
|   Message size 8192             |    800 TPS  |    800 TPS  |    0.0%  |
|   Message size 10               |    855 TPS  |    860 TPS  |   -0.6%  |
+---------------------------------+-------------+-------------+-----------+

Based on those results the patch does not show a noticable impact when IO timing
tracking is/are enabled.

FYI, It’s also worth noticing that if hpet is set then it also affect negatively
even if no timing tracking is set. It means that when track IO timing is/are
enabled the perf regression seen above are not fully related to having then
enabled but also (for a large part) to hpet vs tsc.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

03 February, 14:34:29

Hi,

On Mon, 3 Feb 2025 at 11:50, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Mon, Feb 03, 2025 at 01:07:26PM +0900, Michael Paquier wrote:
> > On Fri, Jan 31, 2025 at 11:29:31AM +0300, Nazir Bilal Yavuz wrote:
> > > On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
> > > <bertranddrouvot.pg@gmail.com> wrote:
> > >> I think that's the main reason why ff99918c625 added this new GUC (looking at
> > >> the commit message). I'd feel more comfortable if we keep it.
> > >
> > > As Michael suggested, I will run a couple of benchmarks to see the
> > > actual effect of this change. Then let's see if this affects anything.
> >
> > I've looked at bit at all that today, and something like the attached
> > is what seems like the best streamlined version to me for the main
> > feature.  I am also planning to run some short benchmarks with
> > track_io_timing=on on HEAD and with the patch, then see the
> > difference, without any relationship to track_wal_io_timing.
>
> Thanks!
>
> I've a few comments:

Thank you both for the v13 and the review!

> === 1
>
> +       pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
> +                                                       io_start, 1, wal_segment_size);
>
> In case wal_init_zero is false, then we're only seeking to the end and write a
> solitary byte. Then, is reporting "wal_segment_size" correct?

I think you are right. It would make sense to have two
pgstat_count_io_op_time() calls here. One for wal_segment_size and one
for solitary byte.

> === 2
>
> + /*
> +  * Measure I/O timing to write WAL data, for pg_stat_wal
> +  * and/or pg_stat_io.
> +  */
> + start = pgstat_prepare_io_time(track_wal_io_timing || track_io_timing);
>
> I think that makes sense done that way (as track_wal_io_timing does not have
> any effect in pgstat_count_io_op_time()). Nit: maybe change the order in the
> comment to reflect the code ordering? (I mean to say re-word to "for pg_stat_io
> and/or pg_stat_wal). The order is ok in issue_xlog_fsync() though.
>
> === 3
>
> What about adding a message in the doc as mentioned in [1]? (I'd not be surprised
> if some people wonder why the "bytes" fields differ).
>
> === 4
>
> pgstat_tracks_io_object() starts to be hard to read. I wonder if it could be
> simplified with switch but that could be done after this one goes in.
>
> === 5
>
> I think this patch will help simplify the per-backend WAL related patch, that's
> nice.

And I agree with the other comments you mentioned.

> === 6
>
> I'll also do some benchmark on my side.

Thanks!

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

04 February, 17:29:55

Hi,

On Tue, 4 Feb 2025 at 10:55, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Feb 03, 2025 at 02:34:29PM +0300, Nazir Bilal Yavuz wrote:
> > On Mon, 3 Feb 2025 at 11:50, Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
> >> === 1
> >>
> >> + pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
> >> + io_start, 1, wal_segment_size);
> >>
> >> In case wal_init_zero is false, then we're only seeking to the end and write a
> >> solitary byte. Then, is reporting "wal_segment_size" correct?
> >
> > I think you are right. It would make sense to have two
> > pgstat_count_io_op_time() calls here. One for wal_segment_size and one
> > for solitary byte.
>
> Ah, right. We can just use one call with the size written set
> depending on wal_init_zero, because this is still a IOOP_WRITE for a
> IOCONTEXT_INIT in both cases. Only the size changes as we are in
> XLogFileInitInternal().

Yes, that is better. Thanks!

> At the end, we want this patch and this data, and my benchmarcking is
> not showing much differences even if going through a workload with
> many pages, so I've used the version relying entirely on
> track_io_timing and applied it.

Thanks! My benchmark results are the same [1]. I could not see the noticable difference between master and patched version. I run the benchmark a couple of times, sometimes the patched version, sometimes the master version is better. I think the differences are due to margin of error. Sharing the latest benchmark run at the bottom [1].

> If we split these timings across more GUCs, one thing to consider
> would be a third GUC which is neither track_wal_io_timing nor
> track_io_timing to keep things independent, but I am not really
> convinced that's necessary.

I agree.

[1]
pgbench -n -c8 -j8-T60 -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', ${emit_bytes}));";)

with emit bytes being 10 and 8192 for the 'acpi_pm', 'hpet' and 'tsc' clock sources.

(I hope that table gets rendered correctly in the mailing list.)

╔══════════════╦════════════╦═════════════════╦═════════════════════╦════════╦═════════╦═══════════╗
║ clock source ║ emit_bytes ║ track_io_timing ║ track_wal_io_timing ║ master ║ patched ║ Change % ║
║ ║ ║ ║ ║ TPS ║ TPS ║ (Positive ║
║ ║ ║ ║ ║ ║ ║ means ║
║ ║ ║ ║ ║ ║ ║ better) ║
╠══════════════╬════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ off ║ off ║ 2164 ║ 2221 ║ 2.63 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ off ║ 2171 ║ 2246 ║ 3.45 ║
║ ║ 10 ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ off ║ on ║ 2243 ║ 2178 ║ -2.90 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ on ║ 2188 ║ 2148 ║ -1.83 ║
║ ╠════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ acpi_pm ║ ║ off ║ off ║ 2226 ║ 2172 ║ -2.43 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ off ║ 2219 ║ 2203 ║ -0.72 ║
║ ║ 8192 ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ off ║ on ║ 2168 ║ 2223 ║ 2.54 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ on ║ 2229 ║ 2204 ║ -1.12 ║
╠══════════════╬════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ ║ ║ ║ ║ ║
╠══════════════╬════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ ║ ║ ║ ║ ║
╠══════════════╬════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ off ║ off ║ 2087 ║ 2156 ║ 3.31 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ off ║ 2141 ║ 2089 ║ -2.43 ║
║ ║ 10 ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ off ║ on ║ 2085 ║ 2071 ║ -0.67 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ on ║ 2024 ║ 2052 ║ 1.38 ║
║ ╠════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ hpet ║ ║ off ║ off ║ 2141 ║ 2074 ║ -3.13 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ off ║ 2056 ║ 2079 ║ 1.12 ║
║ ║ 8192 ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ off ║ on ║ 2096 ║ 2087 ║ -0.43 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ on ║ 2103 ║ 2086 ║ -0.81 ║
╠══════════════╬════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ ║ ║ ║ ║ ║
╠══════════════╬════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ ║ ║ ║ ║ ║
╠══════════════╬════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ off ║ off ║ 2276 ║ 2252 ║ -1.05 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ off ║ 2213 ║ 2124 ║ -4.02 ║
║ ║ 10 ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ off ║ on ║ 2183 ║ 2250 ║ 3.07 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ on ║ 2287 ║ 2250 ║ -1.62 ║
║ ╠════════════╬═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ tsc ║ ║ off ║ off ║ 2178 ║ 2179 ║ 0.05 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ off ║ 2203 ║ 2187 ║ -0.73 ║
║ ║ 8192 ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ off ║ on ║ 2220 ║ 2240 ║ 0.90 ║
║ ║ ╠═════════════════╬═════════════════════╬════════╬═════════╬═══════════╣
║ ║ ║ on ║ on ║ 2102 ║ 2182 ║ 3.81 ║
╚══════════════╩════════════╩═════════════════╩═════════════════════╩════════╩═════════╩═══════════╝

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Tom Lane

Date:

05 February, 21:32:02

Michael Paquier <michael@paquier.xyz> writes:
> At the end, we want this patch and this data, and my benchmarcking is
> not showing much differences even if going through a workload with
> many pages, so I've used the version relying entirely on
> track_io_timing and applied it.

Locally, the test added by this commit fails like so:

diff -U3 /home/postgres/pgsql/src/test/regress/expected/stats.out /home/postgres
/pgsql/src/test/regress/results/stats.out
--- /home/postgres/pgsql/src/test/regress/expected/stats.out    2025-02-04 12:33
:07.456393545 -0500
+++ /home/postgres/pgsql/src/test/regress/results/stats.out     2025-02-05 13:08
:30.605638432 -0500
@@ -886,7 +886,7 @@
   WHERE context = 'normal' AND object = 'wal';
  ?column?
 ----------
- t
+ f
 (1 row)

 -----

This is pretty repeatable (not perfectly so) in a build with
--enable-debug --enable-cassert --enable-tap-tests --with-llvm
but it usually passes without --with-llvm.  System is fairly
up-to-date RHEL8 on x86_64.  No idea why the buildfarm isn't
unhappy.  Any pointers where to look?

            regards, tom lane

Re: Show WAL write and fsync stats in pg_stat_io

From

Tom Lane

Date:

06 February, 05:06:24

Nazir Bilal Yavuz <byavuz81@gmail.com> writes:
> On Wed, 5 Feb 2025 at 21:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Locally, the test added by this commit fails like so:

> My thoughts when adding this test was that startup process must do the
> WAL read I/O while server is starting, i.e.:

Oh!  The problem is much simpler than I thought, then, and not
particularly related to configure options.  I was misled by
not trying repeated tests, but:

$ ./configure ...
$ make; make install
$ initdb ...; pg_ctl start
$ make installcheck
... all OK ...
$ make installcheck
... fails ...
$ make installcheck
... fails ...
$ make installcheck
... fails ...
$ make installcheck
... fails ...

Basically, this test is broken because it assumes that the server
is freshly started.  That's not a scenario the buildfarm runs
the core tests under, unfortunately, but it is one commonly
used in hand testing.

            regards, tom lane

Re: Show WAL write and fsync stats in pg_stat_io

From

Tom Lane

Date:

06 February, 05:52:14

Michael Paquier <michael@paquier.xyz> writes:
> The queries for the normal context are not going to have this problem
> even if we have a pg_stat_reset_shared('io'), but the init context
> gets unstable, unfortunately.  I don't see a way through here in the
> main regression test suite, so how about moving these into
> 027_stream_regress.pl.  It is possible to query the WAL read on the
> standby of this test, and the write part for the init context on the
> primary.  The syncs are not relevant as TAP usually runs with
> fsync=off, so better to remove this part entirely.

Yeah, if we want to assume we can see stats counts left over from
initdb, we have to put this in a TAP test, though I dunno if that is
the most appropriate one.

Now that I've looked at the tests a bit, I'm also distressed
by this test pattern:

SELECT stats_reset AS slru_commit_ts_reset_ts FROM pg_stat_slru WHERE name = 'commit_timestamp' \gset
SELECT pg_stat_reset_slru();
SELECT stats_reset > :'slru_commit_ts_reset_ts'::timestamptz FROM pg_stat_slru WHERE name = 'commit_timestamp';

This assumes that the execution time of pg_stat_reset_slru() is more
than the system clock resolution.  I won't be surprised to see that
fail in the future.  We did discover recently that gettimeofday is
good to the microsecond on most modern platforms [1], but it won't
get any better than that, while our machines keep getting faster.
Just for reference, on my hardly-bleeding-edge-anymore workstation:

regression=# select clock_timestamp(), pg_stat_reset_slru(), clock_timestamp();
        clock_timestamp        | pg_stat_reset_slru |        clock_timestamp
-------------------------------+--------------------+-------------------------------
 2025-02-05 21:47:54.929221-05 |                    | 2025-02-05 21:47:54.929223-05
(1 row)

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/be0339cc-1ae1-4892-9445-8e6d8995a44d%40eisentraut.org

Re: Show WAL write and fsync stats in pg_stat_io

From

Ranier Vilela

Date:

18 March, 15:46:54

Hi.

Em seg., 3 de fev. de 2025 às 01:07, Michael Paquier <michael@paquier.xyz> escreveu:

On Fri, Jan 31, 2025 at 11:29:31AM +0300, Nazir Bilal Yavuz wrote:
> On Wed, 29 Jan 2025 at 18:16, Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
>> I think that's the main reason why ff99918c625 added this new GUC (looking at
>> the commit message). I'd feel more comfortable if we keep it.
>
> As Michael suggested, I will run a couple of benchmarks to see the
> actual effect of this change. Then let's see if this affects anything.

I've looked at bit at all that today, and something like the attached
is what seems like the best streamlined version to me for the main
feature. I am also planning to run some short benchmarks with
track_io_timing=on on HEAD and with the patch, then see the
difference, without any relationship to track_wal_io_timing.

The comment additions in pgstat_count_io_op_time() were worth a patch
of their own. This part has been applied as b998fedab74c, after a few
tweaks of my own.

Sorry, I couldn't find the email in this thread, linked to the commit:

a051e71

I think it left an oversight.

Copy and past thinko?

Attached a trivial patch.

best regards,

Ranier Vilela

Attachment

fix-possible-copy-and-paste-thinko.patch

Re: Show WAL write and fsync stats in pg_stat_io

From

Nazir Bilal Yavuz

Date:

18 March, 16:13:15

Hi,

On Tue, 18 Mar 2025 at 15:47, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Sorry, I couldn't find the email in this thread, linked to the commit:
> a051e71
>
> I think it left an oversight.
> Copy and past thinko?

Thanks for the report! Yes, it is an oversight on my part.

> Attached a trivial patch.

LGTM.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Show WAL write and fsync stats in pg_stat_io

From

Michael Paquier

Date:

19 March, 02:33:37

On Tue, Mar 18, 2025 at 04:13:15PM +0300, Nazir Bilal Yavuz wrote:
> Thanks for the report! Yes, it is an oversight on my part.

No worries.  It's most likely me while reviewing the whole as I've
reordered these blocks a bit while going through the patch.

This has little consequences for the end user, as the read, read_bytes
and read_time should show up as 0 instead of NULL in the pg_stat_io
view for the two WAL rows.  Still it is confusing to show zeros, and
it makes queries of pg_stat_io a bit cheaper.  Will fix in a bit.
--
Michael

Attachment

signature.asc

Re: Show WAL write and fsync stats in pg_stat_io

From

Ranier Vilela

Date:

19 March, 15:10:00

Em ter., 18 de mar. de 2025 às 20:34, Michael Paquier <michael@paquier.xyz> escreveu:

On Tue, Mar 18, 2025 at 04:13:15PM +0300, Nazir Bilal Yavuz wrote:
> Thanks for the report! Yes, it is an oversight on my part.

No worries. It's most likely me while reviewing the whole as I've
reordered these blocks a bit while going through the patch.

This has little consequences for the end user, as the read, read_bytes
and read_time should show up as 0 instead of NULL in the pg_stat_io
view for the two WAL rows. Still it is confusing to show zeros, and
it makes queries of pg_stat_io a bit cheaper. Will fix in a bit.

Thank you Michael.

best regards,

Ranier Vilela