Re: Show WAL write and fsync stats in pg_stat_io - Mailing list pgsql-hackers

From Nazir Bilal Yavuz
Subject Re: Show WAL write and fsync stats in pg_stat_io
Date
Msg-id CAN55FZ2obn1N09Hgf9mBWg5jmHuoJQ_2gOrtXTJ4Bi37di-_7Q@mail.gmail.com
Whole thread Raw
In response to Re: Show WAL write and fsync stats in pg_stat_io  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Show WAL write and fsync stats in pg_stat_io
List pgsql-hackers
Hi,

Thanks for the review and feedback on your previous reply!

On Mon, 25 Dec 2023 at 09:40, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Dec 25, 2023 at 03:20:58PM +0900, Michael Paquier wrote:
> > pgstat_tracks_io_bktype() does not look correct to me.  Why is the WAL
> > receiver considered as something correct in the list of backend types,
> > while the intention is to *not* add it to pg_stat_io?  I have tried to
> > switche to the correct behavior of returning false for a
> > B_WAL_RECEIVER, to notice that pg_rewind's test 002_databases.pl
> > freezes on its shutdown sequence.  Something weird is going on here.
> > Could you look at it?  See the XXX comment in the attached, which is
> > the same behavior as v6-0002.  It looks to me that the patch has
> > introduced an infinite loop tweaking pgstat_tracks_io_bktype() in an
> > incorrect way to avoid the root issue.
>
> Ah, that's because it would trigger an assertion failure:
> TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
>  io_context, io_op)"), File: "pgstat_io.c", Line: 89, PID: 6824
> postgres: standby_local: walreceiver
> (ExceptionalCondition+0xa8)[0x560d1b4dd38a]
>
> And the backtrace just tells that this is the WAL receiver
> initializing a WAL segment:
> #5  0x0000560d1b3322c8 in pgstat_count_io_op_n
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> cnt=1) at pgstat_io.c:89
> #6  0x0000560d1b33254a in pgstat_count_io_op_time
> (io_object=IOOBJECT_WAL, io_context=IOCONTEXT_INIT, io_op=IOOP_WRITE,
> start_time=..., cnt=1) at pgstat_io.c:181
> #7  0x0000560d1ae7f932 in XLogFileInitInternal (logsegno=3, logtli=1,
> added=0x7ffd2733c6eb, path=0x7ffd2733c2e0 "pg_wal/00000001", '0'
> <repeats 15 times>, "3")     at xlog.c:3115
> #8  0x0000560d1ae7fc4e in XLogFileInit (logsegno=3, logtli=1) at
> xlog.c:3215

Correct.

>
> Wouldn't it be simpler to just bite the bullet in this case and handle
> WAL receivers in the IO tracking?

There is one problem and I couldn't decide how to solve it. We need to
handle read IO in WALRead() in xlogreader.c. How many bytes the
WALRead() function will read is controlled by a variable and it can be
different from XLOG_BLCKSZ. This is a problem because pg_stat_io's
op_bytes column is a constant.

Here are all WALRead() function calls:

1- read_local_xlog_page_guts() in xlogutils.c => WALRead(XLOG_BLCKSZ)
=> always reads XLOG_BLCKSZ.

2- summarizer_read_local_xlog_page() in walsummarizer.c =>
WALRead(XLOG_BLCKSZ) => always reads XLOG_BLCKSZ.

3- logical_read_xlog_page() in walsender.c => WALRead(XLOG_BLCKSZ) =>
always reads XLOG_BLCKSZ.

4- XLogSendPhysical() in walsender.c => WALRead(nbytes) =>  nbytes can
be different from XLOG_BLCKSZ.

5- WALDumpReadPage() in pg_waldump.c => WALRead(count) => count can be
different from XLOG_BLCKSZ.

4 and 5 are the problematic calls.

Melanie's answer to this problem on previous discussions:

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

Using op_bytes as 1 solves this problem but since it will be different
from the rest of the pg_stat_io view it could be hard to understand.
There is no clear separation by backends as it can be seen from the walsender.

>
> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
> treat op_bytes * number of reads as an approximation of the number of
> bytes read. I don't actually know what makes more sense. I don't think I
> would like having a number for bytes that is not accurate.

Also, we have a similar problem in XLogPageRead() in xlogrecovery.c.
pg_pread() call tries to read XLOG_BLCKSZ but it is not certain and we
don't count IO if it couldn't read XLOG_BLCKSZ. IMO, this is not as
important as the previous problem but it still is a problem.

I would be glad to hear opinions on these problems.

--
Regards,
Nazir Bilal Yavuz
Microsoft



pgsql-hackers by date:

Previous
From: Amul Sul
Date:
Subject: Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression
Next
From: Michail Nikolaev
Date:
Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements