Thread: Re: pgsql: Add contrib/pg_walinspect.

Re: pgsql: Add contrib/pg_walinspect.

From: Michael Paquier

Hi Jeff,

On Fri, Apr 08, 2022 at 07:27:44AM +0000, Jeff Davis wrote:
> Add contrib/pg_walinspect.
>
> Provides similar functionality to pg_waldump, but from a SQL interface
> rather than a separate utility.

The tests of pg_walinspect look unstable:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47

 SELECT COUNT(*) >= 0 AS ok FROM
 pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');
- ok
-----
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40

This points at ReadNextXLogRecord(), though I would not blame this
test suite itself, as you create a physical replication slot beforehand.
Could this be an issue related to the addition of the circular WAL
decoding buffer, aka 3f1ce97?

I am adding an open item about that.

Thanks,
--
Michael


Re: pgsql: Add contrib/pg_walinspect.

From: Tom Lane

Michael Paquier <michael@paquier.xyz> writes:
> Could this be an issue related to the addition of the circular WAL
> decoding buffer, aka 3f1ce97?

I already whined about this at [1].

I've been wondering if the issue could be traced to topminnow's unusual
hardware properties, specifically that it has MAXALIGN 8 even though
it's only a 32-bit machine per sizeof(void *).  I think the only
other active buildfarm animal like that is my gaur ... but I've
failed to reproduce it on gaur.  Best guess at the moment is that
it's a timing issue that topminnow manages to reproduce often.

Anyway, as I said in the other thread, I can reproduce it on
topminnow's host.  Let me know if you have ideas about how to
home in on the cause.

            regards, tom lane

[1] https://www.postgresql.org/message-id/111657.1650910309%40sss.pgh.pa.us



Re: pgsql: Add contrib/pg_walinspect.

From: Michael Paquier

On Tue, Apr 26, 2022 at 01:25:14AM -0400, Tom Lane wrote:
> I've been wondering if the issue could be traced to topminnow's unusual
> hardware properties, specifically that it has MAXALIGN 8 even though
> it's only a 32-bit machine per sizeof(void *).  I think the only
> other active buildfarm animal like that is my gaur ... but I've
> failed to reproduce it on gaur.  Best guess at the moment is that
> it's a timing issue that topminnow manages to reproduce often.

I have managed to miss your message.  Let's continue the discussion
there, then.

> Anyway, as I said in the other thread, I can reproduce it on
> topminnow's host.  Let me know if you have ideas about how to
> home in on the cause.

Nice.  I have not been able to do so, but based on the lack of
reports from the buildfarm, that's not surprising.
--
Michael


Re: pgsql: Add contrib/pg_walinspect.

From: Thomas Munro

On Tue, Apr 26, 2022 at 5:36 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Apr 26, 2022 at 01:25:14AM -0400, Tom Lane wrote:
> > I've been wondering if the issue could be traced to topminnow's unusual
> > hardware properties, specifically that it has MAXALIGN 8 even though
> > it's only a 32-bit machine per sizeof(void *).  I think the only
> > other active buildfarm animal like that is my gaur ... but I've
> > failed to reproduce it on gaur.  Best guess at the moment is that
> > it's a timing issue that topminnow manages to reproduce often.
>
> I have managed to miss your message.  Let's continue the discussion
> there, then.

I think it's a bug in pg_walinspect, so I'll move the discussion back
here.  Here's one rather simple way to fix it, that has survived
running the test a thousand times (using a recipe that failed for me
quite soon, after 20-100 attempts or so; I never figured out how to
get the 50% failure rate reported by Tom).  Explanation in commit
message.  You can see that the comments near the first hunk already
contemplated this possibility, but just didn't try to handle it.

Another idea that I slept on, but rejected, is that the new WOULDBLOCK
return value introduced to support WAL prefetching could be used here
(it's a way of reporting a lack of data, different from errors).
Unfortunately it's not exposed to the XLogReadRecord() interface, as I
only intended it for use by XLogReadAhead().  I don't really think
it's a good idea to redesign that API at this juncture.

Maybe there is some other way I haven't considered -- is there a way
to get the LSN past the latest whole flushed record from shmem?


Re: pgsql: Add contrib/pg_walinspect.

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> I think it's a bug in pg_walinspect, so I'll move the discussion back
> here.  Here's one rather simple way to fix it, that has survived
> running the test a thousand times (using a recipe that failed for me
> quite soon, after 20-100 attempts or so; I never figured out how to
> get the 50% failure rate reported by Tom).

Not sure what we're doing differently, but plain "make check" in
contrib/pg_walinspect fails pretty consistently for me on gcc23.
I tried it again just now and got five failures in five attempts.

I then installed your patch and got the same failure, three times
out of three, so I don't think we're there yet.

Again, since I do have this problem in captivity, I'm happy
to spend some time poking at it.  But I could use a little
guidance where to poke, because I've not looked at any of
the WAL prefetch stuff.

            regards, tom lane



Re: pgsql: Add contrib/pg_walinspect.

From: Thomas Munro

On Wed, Apr 27, 2022 at 12:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > I think it's a bug in pg_walinspect, so I'll move the discussion back
> > here.  Here's one rather simple way to fix it, that has survived
> > running the test a thousand times (using a recipe that failed for me
> > quite soon, after 20-100 attempts or so; I never figured out how to
> > get the 50% failure rate reported by Tom).
>
> Not sure what we're doing differently, but plain "make check" in
> contrib/pg_walinspect fails pretty consistently for me on gcc23.
> I tried it again just now and got five failures in five attempts.

I tried on the /home filesystem (a slow NFS mount) and then inside a
directory on /tmp to get ext4 (I saw that Noah had somehow got onto a
local filesystem, based on the presence of "ext4" in the pathname, and I
was trying everything I could think of).  I used what I thought might
be some relevant starter configure options copied from the animal:

./configure --prefix=$HOME/install --enable-cassert --enable-debug
--enable-tap-tests CC="ccache gcc -mips32r2" CFLAGS="-O2
-funwind-tables" LDFLAGS="-rdynamic"

For me, make check always succeeds in contrib/pg_walinspect.  For me,
make installcheck fails if I do it enough times in a loop, somewhere
around the 20th loop or so, which I imagine has to do with WAL page
boundaries moving around.

for i in `seq 1 1000` ; do
  make -s installcheck || exit 1
done
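
That page-boundary sensitivity is easy to see with a little LSN arithmetic.
A purely illustrative sketch (Python rather than PostgreSQL code; the
constant assumes the default 8 kB WAL page size, XLOG_BLCKSZ):

```python
# Illustrative only -- not PostgreSQL source.  Assumes the default
# WAL page size of 8192 bytes (XLOG_BLCKSZ).
XLOG_BLCKSZ = 8192

def parse_lsn(text):
    """Parse an LSN from its textual '%X/%X' form into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def page_remaining(lsn):
    """Bytes left on the WAL page that contains lsn."""
    return XLOG_BLCKSZ - (lsn % XLOG_BLCKSZ)

# The start LSN from the failing runs sits only a few hundred bytes
# before a page boundary, so small shifts in how much WAL was written
# earlier move the surrounding records across the boundary between runs.
print(page_remaining(parse_lsn("0/1903E40")))  # 448 bytes to the boundary
```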

> I then installed your patch and got the same failure, three times
> out of three, so I don't think we're there yet.

Hrmph...  Are you sure you rebuilt the contrib module?   Assuming so,
maybe it's failing in a different way for you and me.  For me, it
always fails after this break is reached in xlogutils.c:

            /* If asked, let's not wait for future WAL. */
            if (!wait_for_wal)
                break;

If you add a log message there, do you see that?  For me, the patch
fixes it, because it teaches pg_walinspect that messageless errors are
a way of detecting end-of-data (due to the code above, introduced by
the pg_walinspect commit).
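
In other words, the convention the patch relies on looks like this (a
hedged sketch of the control flow only, in Python with invented names,
not the actual pg_walinspect code):

```python
# Sketch of the control flow only; the function and variable names are
# invented for illustration and are not real pg_walinspect symbols.

def read_records_till_end_of_wal(read_next):
    """read_next() returns a (record, errormsg) pair, mimicking
    XLogNextRecord(): a None record plus a message is a real error,
    while a None record with no message just means no more WAL."""
    records = []
    while True:
        record, errormsg = read_next()
        if record is None:
            if errormsg:
                raise RuntimeError("could not read WAL: " + errormsg)
            break  # messageless: end of available WAL, not an error
        records.append(record)
    return records
```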



Re: pgsql: Add contrib/pg_walinspect.

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> Hrmph...  Are you sure you rebuilt the contrib module?   Assuming so,
> maybe it's failing in a different way for you and me.  For me, it
> always fails after this break is reached in xlogutil.c:

>             /* If asked, let's not wait for future WAL. */
>             if (!wait_for_wal)
>                 break;

Hmm.  For me, that statement is not reached at all in successful
(make installcheck) runs.  In a failing run, it's reached with
wait_for_wal = false, after which we get the "could not read WAL"
failure.  Usually that happens twice, as per attached.

            regards, tom lane

diff --git a/contrib/pg_walinspect/Makefile b/contrib/pg_walinspect/Makefile
index 960530e..85dcb3a 100644
--- a/contrib/pg_walinspect/Makefile
+++ b/contrib/pg_walinspect/Makefile
@@ -15,7 +15,6 @@ REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_walinspect/walinspect.conf

 # Disabled because these tests require "wal_level=replica", which
 # some installcheck users do not have (e.g. buildfarm clients).
-NO_INSTALLCHECK = 1

 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cf5db23..33d2c73 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -373,6 +373,9 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
          */
         Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));

+        if(!*errormsg)
+          *errormsg = "decode_queue_head is null";
+
         return NULL;
     }

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4257026..4089efd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -955,6 +955,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
             if (loc <= read_upto)
                 break;

+            elog(LOG, "let's wait for wal: %d", wait_for_wal);
+
             /* If asked, let's not wait for future WAL. */
             if (!wait_for_wal)
                 break;
2022-04-27 03:40:09.253 CEST postmaster[32569] LOG:  starting PostgreSQL 15devel on mipsel-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10+deb8u1) 4.9.2, 32-bit
2022-04-27 03:40:09.254 CEST postmaster[32569] LOG:  listening on Unix socket "/tmp/pg_regress-g0HWzi/.s.PGSQL.51696"
2022-04-27 03:40:09.284 CEST startup[32574] LOG:  database system was shut down at 2022-04-27 03:40:09 CEST
2022-04-27 03:40:09.314 CEST postmaster[32569] LOG:  database system is ready to accept connections
2022-04-27 03:40:12.073 CEST client backend[32598] pg_regress/pg_walinspect ERROR:  WAL start LSN must be less than end LSN
2022-04-27 03:40:12.073 CEST client backend[32598] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info('0/1903EE8', '0/1903E40');
2022-04-27 03:40:12.075 CEST client backend[32598] pg_regress/pg_walinspect ERROR:  WAL start LSN must be less than end LSN
2022-04-27 03:40:12.075 CEST client backend[32598] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats('0/1903EE8', '0/1903E40');
2022-04-27 03:40:12.085 CEST client backend[32598] pg_regress/pg_walinspect LOG:  let's wait for wal: 0
2022-04-27 03:40:12.085 CEST client backend[32598] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal('0/1903E40');
2022-04-27 03:40:12.085 CEST client backend[32598] pg_regress/pg_walinspect ERROR:  could not read WAL at 0/1903E40: decode_queue_head is null
2022-04-27 03:40:12.085 CEST client backend[32598] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal('0/1903E40');
2022-04-27 03:40:12.091 CEST client backend[32598] pg_regress/pg_walinspect LOG:  let's wait for wal: 0
2022-04-27 03:40:12.091 CEST client backend[32598] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal('0/1903E40');
2022-04-27 03:40:12.091 CEST client backend[32598] pg_regress/pg_walinspect ERROR:  could not read WAL at 0/1903E40: decode_queue_head is null
2022-04-27 03:40:12.091 CEST client backend[32598] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal('0/1903E40');
2022-04-27 03:40:12.198 CEST postmaster[32569] LOG:  received fast shutdown request
2022-04-27 03:40:12.299 CEST postmaster[32569] LOG:  aborting any active transactions
2022-04-27 03:40:12.304 CEST postmaster[32569] LOG:  background worker "logical replication launcher" (PID 32577) exited with exit code 1
2022-04-27 03:40:12.306 CEST checkpointer[32572] LOG:  shutting down
2022-04-27 03:40:12.310 CEST checkpointer[32572] LOG:  checkpoint starting: shutdown immediate
2022-04-27 03:40:12.524 CEST checkpointer[32572] LOG:  checkpoint complete: wrote 784 buffers (4.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.199 s, sync=0.001 s, total=0.218 s; sync files=0, longest=0.000 s, average=0.000 s; distance=4197 kB, estimate=4197 kB
2022-04-27 03:40:12.838 CEST postmaster[32569] LOG:  database system is shut down

Re: pgsql: Add contrib/pg_walinspect.

From: Thomas Munro

On Wed, Apr 27, 2022 at 1:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Hrmph...  Are you sure you rebuilt the contrib module?   Assuming so,
> > maybe it's failing in a different way for you and me.  For me, it
> > always fails after this break is reached in xlogutil.c:
>
> >             /* If asked, let's not wait for future WAL. */
> >             if (!wait_for_wal)
> >                 break;
>
> Hmm.  For me, that statement is not reached at all in successful
> (make installcheck) runs.  In a failing run, it's reached with
> wait_for_wal = false, after which we get the "could not read WAL"
> failure.  Usually that happens twice, as per attached.

Ok, that's the same for me.  Next question: why does the patch I
posted not help?  For me, the error "could not read WAL at %X/%X",
seen on the BF log, is raised by ReadNextXLogRecord() in
pg_walinspect.c.  The patch removes that ereport() entirely (and
handles NULL in a couple of places).



Re: pgsql: Add contrib/pg_walinspect.

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Apr 27, 2022 at 12:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Not sure what we're doing differently, but plain "make check" in
>> contrib/pg_walinspect fails pretty consistently for me on gcc23.
>> I tried it again just now and got five failures in five attempts.

> I tried on the /home filesystem (a slow NFS mount) and then inside a
> directory on /tmp to get ext4 (I saw that Noah had somehow got onto a
> local filesystem, based on the present of "ext4" in the pathname and I
> was trying everything I could think of).  I used what I thought might
> be some relevant starter configure options copied from the animal:

> ./configure --prefix=$HOME/install --enable-cassert --enable-debug
> --enable-tap-tests CC="ccache gcc -mips32r2" CFLAGS="-O2
> -funwind-tables" LDFLAGS="-rdynamic"

Hmph.  I'm also running it on the /home filesystem, and I used
these settings:

export CC="ccache gcc -mips32r2"
export CFLAGS="-O2 -funwind-tables"
export LDFLAGS="-rdynamic"

./configure --enable-debug --enable-cassert --with-systemd --enable-nls --with-icu --enable-tap-tests
--with-system-tzdata=/usr/share/zoneinfo

plus uninteresting stuff like --prefix.  Now maybe some of these
other options affect this, but I'd be pretty surprised if so.
So I'm at a loss why it behaves differently for you.

            regards, tom lane



Re: pgsql: Add contrib/pg_walinspect.

From: Thomas Munro

On Wed, Apr 27, 2022 at 1:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Apr 27, 2022 at 1:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > Hrmph...  Are you sure you rebuilt the contrib module?   Assuming so,
> > > maybe it's failing in a different way for you and me.  For me, it
> > > always fails after this break is reached in xlogutil.c:
> >
> > >             /* If asked, let's not wait for future WAL. */
> > >             if (!wait_for_wal)
> > >                 break;
> >
> > Hmm.  For me, that statement is not reached at all in successful
> > (make installcheck) runs.  In a failing run, it's reached with
> > wait_for_wal = false, after which we get the "could not read WAL"
> > failure.  Usually that happens twice, as per attached.
>
> Ok, that's the same for me.  Next question: why does the patch I
> posted not help?  For me, the error "could not read WAL at %X/%X",
> seen on the BF log, is raised by ReadNextXLogRecord() in
> pg_walinspect.c.  The patch removes that ereport() entirely (and
> handles NULL in a couple of places).

BTW If you had your local change from debug.patch (upthread), that'd
defeat the patch.  I mean this:

+        if(!*errormsg)
+          *errormsg = "decode_queue_head is null";



Re: pgsql: Add contrib/pg_walinspect.

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> Ok, that's the same for me.  Next question: why does the patch I
> posted not help?

I improved the instrumentation a bit, and it looks like what is
happening is that loc > read_upto, causing that code to "break"
independently of wait_for_wal.  success.log is from "make installcheck"
immediately after initdb; fail.log is from "make check".
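
For the record, the decision that instrumentation traces boils down to
comparing LSNs as 64-bit integers (an illustrative restatement in Python,
not the actual C code from read_local_xlog_page_guts):

```python
# Illustrative restatement of the page-read decision; not the C code.

def parse_lsn(text):
    """Parse '%X/%X' LSN text into a 64-bit integer for comparison."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def page_read_decision(loc, read_upto, wait_for_wal):
    """loc <= read_upto: the requested page is already flushed, read it.
    Otherwise either give up (wait_for_wal false) or wait for more WAL."""
    if loc <= read_upto:
        return "read"
    if not wait_for_wal:
        return "give up"
    return "wait"

# In the failing trace, loc 0/1904034 is past read_upto 0/1904000 and
# wait_for_wal is 0, so the reader gives up.
```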

            regards, tom lane

diff --git a/contrib/pg_walinspect/Makefile b/contrib/pg_walinspect/Makefile
index 960530e..85dcb3a 100644
--- a/contrib/pg_walinspect/Makefile
+++ b/contrib/pg_walinspect/Makefile
@@ -15,7 +15,6 @@ REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_walinspect/walinspect.conf

 # Disabled because these tests require "wal_level=replica", which
 # some installcheck users do not have (e.g. buildfarm clients).
-NO_INSTALLCHECK = 1

 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cf5db23..33d2c73 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -373,6 +373,9 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
          */
         Assert(!XLogRecPtrIsInvalid(state->EndRecPtr));

+        if(!*errormsg)
+          *errormsg = "decode_queue_head is null";
+
         return NULL;
     }

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4257026..5ea59a3 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -951,6 +951,9 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,

         if (state->currTLI == currTLI)
         {
+          elog(LOG, "considering wait for wal: loc %X/%X read_upto %X/%X wait_for_wal %d",
+               LSN_FORMAT_ARGS(loc), LSN_FORMAT_ARGS(read_upto),
+               wait_for_wal);

             if (loc <= read_upto)
                 break;
2022-04-27 04:06:21.995 CEST [4503] LOG:  starting PostgreSQL 15devel on mipsel-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10+deb8u1) 4.9.2, 32-bit
2022-04-27 04:06:21.997 CEST [4503] LOG:  listening on IPv6 address "::1", port 5440
2022-04-27 04:06:21.997 CEST [4503] LOG:  listening on IPv4 address "127.0.0.1", port 5440
2022-04-27 04:06:22.000 CEST [4503] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5440"
2022-04-27 04:06:22.021 CEST [4523] LOG:  database system was shut down at 2022-04-27 04:06:11 CEST
2022-04-27 04:06:22.048 CEST [4503] LOG:  database system is ready to accept connections
2022-04-27 04:06:30.619 CEST [4738] ERROR:  WAL start LSN must be less than end LSN
2022-04-27 04:06:30.619 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info('0/1903EB0', '0/1903E08');
2022-04-27 04:06:30.621 CEST [4738] ERROR:  WAL start LSN must be less than end LSN
2022-04-27 04:06:30.621 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats('0/1903EB0', '0/1903E08');
2022-04-27 04:06:30.623 CEST [4738] LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.623 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_record_info('0/1903E08');
2022-04-27 04:06:30.623 CEST [4738] LOG:  considering wait for wal: loc 0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.623 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_record_info('0/1903E08');
2022-04-27 04:06:30.626 CEST [4738] LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.626 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info('0/1903E08', '0/1903EB0');
2022-04-27 04:06:30.626 CEST [4738] LOG:  considering wait for wal: loc 0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.626 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info('0/1903E08', '0/1903EB0');
2022-04-27 04:06:30.630 CEST [4738] LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.630 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal('0/1903E08');
2022-04-27 04:06:30.630 CEST [4738] LOG:  considering wait for wal: loc 0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.630 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal('0/1903E08');
2022-04-27 04:06:30.633 CEST [4738] LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.633 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats('0/1903E08', '0/1903EB0');
2022-04-27 04:06:30.634 CEST [4738] LOG:  considering wait for wal: loc 0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.634 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats('0/1903E08', '0/1903EB0');
2022-04-27 04:06:30.638 CEST [4738] LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.638 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal('0/1903E08');
2022-04-27 04:06:30.639 CEST [4738] LOG:  considering wait for wal: loc 0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.639 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal('0/1903E08');
2022-04-27 04:06:30.653 CEST [4738] LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.653 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_records_info('0/1903E08', '0/1903EB0')
                WHERE block_ref LIKE concat('%', '16391', '%') AND resource_manager = 'Heap';
2022-04-27 04:06:30.654 CEST [4738] LOG:  considering wait for wal: loc 0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.654 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_records_info('0/1903E08', '0/1903EB0')
                WHERE block_ref LIKE concat('%', '16391', '%') AND resource_manager = 'Heap';
2022-04-27 04:06:30.657 CEST [4738] LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.657 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_records_info('0/1903E08', '0/1903EB0')
                WHERE resource_manager = 'Heap' AND record_type = 'INSERT';
2022-04-27 04:06:30.658 CEST [4738] LOG:  considering wait for wal: loc 0/1903E08 read_upto 0/1903F58 wait_for_wal 0
2022-04-27 04:06:30.658 CEST [4738] STATEMENT:  SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_records_info('0/1903E08', '0/1903EB0')
                WHERE resource_manager = 'Heap' AND record_type = 'INSERT';
2022-04-27 04:06:37.735 CEST [4503] LOG:  received fast shutdown request
2022-04-27 04:06:37.829 CEST [4503] LOG:  aborting any active transactions
2022-04-27 04:06:37.835 CEST [4503] LOG:  background worker "logical replication launcher" (PID 4526) exited with exit code 1
2022-04-27 04:06:40.402 CEST [4521] LOG:  shutting down
2022-04-27 04:06:40.406 CEST [4521] LOG:  checkpoint starting: shutdown immediate
2022-04-27 04:06:41.063 CEST [4521] LOG:  checkpoint complete: wrote 784 buffers (4.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.218 s, sync=0.001 s, total=0.661 s; sync files=0, longest=0.000 s, average=0.000 s; distance=4197 kB, estimate=4197 kB
2022-04-27 04:06:43.573 CEST [4503] LOG:  database system is shut down
2022-04-27 04:04:08.677 CEST postmaster[3167] LOG:  starting PostgreSQL 15devel on mipsel-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10+deb8u1) 4.9.2, 32-bit
2022-04-27 04:04:08.677 CEST postmaster[3167] LOG:  listening on Unix socket "/tmp/pg_regress-S5YZoI/.s.PGSQL.51696"
2022-04-27 04:04:08.708 CEST startup[3172] LOG:  database system was shut down at 2022-04-27 04:04:08 CEST
2022-04-27 04:04:08.737 CEST postmaster[3167] LOG:  database system is ready to accept connections
2022-04-27 04:04:12.135 CEST client backend[3185] pg_regress/pg_walinspect ERROR:  WAL start LSN must be less than end LSN
2022-04-27 04:04:12.135 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info('0/1903EE8', '0/1903E40');
2022-04-27 04:04:12.137 CEST client backend[3185] pg_regress/pg_walinspect ERROR:  WAL start LSN must be less than end LSN
2022-04-27 04:04:12.137 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats('0/1903EE8', '0/1903E40');
2022-04-27 04:04:12.138 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1903F90 wait_for_wal 0
2022-04-27 04:04:12.138 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_record_info('0/1903E40');
2022-04-27 04:04:12.139 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1903E40 read_upto 0/1903F90 wait_for_wal 0
2022-04-27 04:04:12.139 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_record_info('0/1903E40');
2022-04-27 04:04:12.142 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.142 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info('0/1903E40', '0/1903EE8');
2022-04-27 04:04:12.143 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1903E40 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.143 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info('0/1903E40', '0/1903EE8');
2022-04-27 04:04:12.146 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.146 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal('0/1903E40');
2022-04-27 04:04:12.147 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1903E40 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.147 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal('0/1903E40');
2022-04-27 04:04:12.147 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1904034 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.147 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal('0/1903E40');
2022-04-27 04:04:12.147 CEST client backend[3185] pg_regress/pg_walinspect ERROR:  could not read WAL at 0/1903E40: decode_queue_head is null
2022-04-27 04:04:12.147 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal('0/1903E40');
2022-04-27 04:04:12.149 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.149 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats('0/1903E40', '0/1903EE8');
2022-04-27 04:04:12.149 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1903E40 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.149 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats('0/1903E40', '0/1903EE8');
2022-04-27 04:04:12.153 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.153 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal('0/1903E40');
2022-04-27 04:04:12.153 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1903E40 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.153 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal('0/1903E40');
2022-04-27 04:04:12.153 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1904034 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.153 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal('0/1903E40');
2022-04-27 04:04:12.153 CEST client backend[3185] pg_regress/pg_walinspect ERROR:  could not read WAL at 0/1903E40: decode_queue_head is null
2022-04-27 04:04:12.153 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal('0/1903E40');
2022-04-27 04:04:12.165 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.165 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_records_info('0/1903E40', '0/1903EE8')
                WHERE block_ref LIKE concat('%', '16391', '%') AND resource_manager = 'Heap';
2022-04-27 04:04:12.165 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1903E40 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.165 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_records_info('0/1903E40', '0/1903EE8')
                WHERE block_ref LIKE concat('%', '16391', '%') AND resource_manager = 'Heap';
2022-04-27 04:04:12.168 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1002000 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.168 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_records_info('0/1903E40', '0/1903EE8')
                WHERE resource_manager = 'Heap' AND record_type = 'INSERT';
2022-04-27 04:04:12.168 CEST client backend[3185] pg_regress/pg_walinspect LOG:  considering wait for wal: loc 0/1903E40 read_upto 0/1904000 wait_for_wal 0
2022-04-27 04:04:12.168 CEST client backend[3185] pg_regress/pg_walinspect STATEMENT:  SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_records_info('0/1903E40', '0/1903EE8')
                WHERE resource_manager = 'Heap' AND record_type = 'INSERT';
2022-04-27 04:04:12.297 CEST postmaster[3167] LOG:  received fast shutdown request
2022-04-27 04:04:12.299 CEST postmaster[3167] LOG:  aborting any active transactions
2022-04-27 04:04:12.306 CEST postmaster[3167] LOG:  background worker "logical replication launcher" (PID 3175) exited with exit code 1
2022-04-27 04:04:12.307 CEST checkpointer[3170] LOG:  shutting down
2022-04-27 04:04:12.310 CEST checkpointer[3170] LOG:  checkpoint starting: shutdown immediate
2022-04-27 04:04:12.536 CEST checkpointer[3170] LOG:  checkpoint complete: wrote 784 buffers (4.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.212 s, sync=0.001 s, total=0.230 s; sync files=0, longest=0.000 s, average=0.000 s; distance=4197 kB, estimate=4197 kB
2022-04-27 04:04:12.866 CEST postmaster[3167] LOG:  database system is shut down

Re: pgsql: Add contrib/pg_walinspect.

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> BTW If you had your local change from debug.patch (upthread), that'd
> defeat the patch.  I mean this:

> +        if(!*errormsg)
> +          *errormsg = "decode_queue_head is null";

Oh!  Okay, I'll retry without that.

            regards, tom lane



Re: pgsql: Add contrib/pg_walinspect.

From
Tom Lane
Date:
I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> BTW If you had your local change from debug.patch (upthread), that'd
>> defeat the patch.  I mean this:

>> +        if(!*errormsg)
>> +          *errormsg = "decode_queue_head is null";

> Oh!  Okay, I'll retry without that.

I've now done several runs with your patch and not seen the test failure.
However, I think we ought to rethink this API a bit rather than just
apply the patch as-is.  Even if it were documented, relying on
errormsg = NULL to mean something doesn't seem like a great plan.

            regards, tom lane



Re: pgsql: Add contrib/pg_walinspect.

From
Bharath Rupireddy
Date:
On Wed, Apr 27, 2022 at 8:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> >> BTW If you had your local change from debug.patch (upthread), that'd
> >> defeat the patch.  I mean this:
>
> >> +        if(!*errormsg)
> >> +          *errormsg = "decode_queue_head is null";
>
> > Oh!  Okay, I'll retry without that.
>
> I've now done several runs with your patch and not seen the test failure.
> However, I think we ought to rethink this API a bit rather than just
> apply the patch as-is.  Even if it were documented, relying on
> errormsg = NULL to mean something doesn't seem like a great plan.

Sorry for being late in the game, occupied with other stuff.

How about using private_data of XLogReaderState for
read_local_xlog_page_no_wait, something like this?

typedef struct ReadLocalXLogPageNoWaitPrivate
{
    bool end_of_wal;
} ReadLocalXLogPageNoWaitPrivate;

In read_local_xlog_page_no_wait:

            /* If asked, let's not wait for future WAL. */
            if (!wait_for_wal)
            {
                private_data->end_of_wal = true;
                break;
            }

    /*
     * Opaque data for callbacks to use.  Not used by XLogReader.
     */
    void       *private_data;

Regards,
Bharath Rupireddy.



Re: pgsql: Add contrib/pg_walinspect.

From
Bharath Rupireddy
Date:
On Wed, Apr 27, 2022 at 8:57 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Apr 27, 2022 at 8:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I wrote:
> > > Thomas Munro <thomas.munro@gmail.com> writes:
> > >> BTW If you had your local change from debug.patch (upthread), that'd
> > >> defeat the patch.  I mean this:
> >
> > >> +        if(!*errormsg)
> > >> +          *errormsg = "decode_queue_head is null";
> >
> > > Oh!  Okay, I'll retry without that.
> >
> > I've now done several runs with your patch and not seen the test failure.
> > However, I think we ought to rethink this API a bit rather than just
> > apply the patch as-is.  Even if it were documented, relying on
> > errormsg = NULL to mean something doesn't seem like a great plan.
>
> Sorry for being late in the game, occupied with other stuff.
>
> How about using private_data of XLogReaderState for
> read_local_xlog_page_no_wait, something like this?
>
> typedef struct ReadLocalXLogPageNoWaitPrivate
> {
>     bool end_of_wal;
> } ReadLocalXLogPageNoWaitPrivate;
>
> In read_local_xlog_page_no_wait:
>
>             /* If asked, let's not wait for future WAL. */
>             if (!wait_for_wal)
>            {
>                 private_data->end_of_wal = true;
>                 break;
>            }
>
>     /*
>      * Opaque data for callbacks to use.  Not used by XLogReader.
>      */
>     void       *private_data;

I found an easy way to reproduce this consistently (I think on any server):

I basically generated a huge WAL record (I used a fun extension that I
wrote - https://github.com/BRupireddy/pg_synthesize_wal, but one can
use pg_logical_emit_message as well):
session 1:
select * from pg_synthesize_wal_record(1*1024*1024); --> generate a 1 MB
WAL record first and make a note of the output lsn (lsn1)

session 2:
select * from pg_get_wal_records_info_till_end_of_wal(lsn1);
\watch 1

session 1:
select * from pg_synthesize_wal_record(1000*1024*1024);  --> generate
a ~1 GB WAL record, and we see ERROR:  could not read WAL at XXXXX in
session 2.

Delay the checkpoint (set checkpoint_timeout to 1h) so that the WAL
files are not recycled while we run the pg_walinspect functions; no
other changes are required from the default initdb settings on the
server.

And, Thomas's patch fixes the issue.

Regards,
Bharath Rupireddy.



Re: pgsql: Add contrib/pg_walinspect.

From
Bharath Rupireddy
Date:
On Wed, Apr 27, 2022 at 1:47 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > I've now done several runs with your patch and not seen the test failure.
> > > However, I think we ought to rethink this API a bit rather than just
> > > apply the patch as-is.  Even if it were documented, relying on
> > > errormsg = NULL to mean something doesn't seem like a great plan.
> >
> > Sorry for being late in the game, occupied with other stuff.
> >
> > How about using private_data of XLogReaderState for
> > read_local_xlog_page_no_wait, something like this?
> >
> > typedef struct ReadLocalXLogPageNoWaitPrivate
> > {
> >     bool end_of_wal;
> > } ReadLocalXLogPageNoWaitPrivate;
> >
> > In read_local_xlog_page_no_wait:
> >
> >             /* If asked, let's not wait for future WAL. */
> >             if (!wait_for_wal)
> >            {
> >                 private_data->end_of_wal = true;
> >                 break;
> >            }
> >
> >     /*
> >      * Opaque data for callbacks to use.  Not used by XLogReader.
> >      */
> >     void       *private_data;
>
> I found an easy way to reproduce this consistently (I think on any server):
>
> I basically generated huge WAL record (I used a fun extension that I
> wrote - https://github.com/BRupireddy/pg_synthesize_wal, but one can
> use pg_logical_emit_message as well)
> session 1:
> select * from pg_synthesize_wal_record(1*1024*1024); --> generate 1 MB
> of WAL record first and make a note of the output lsn (lsn1)
>
> session 2:
> select * from pg_get_wal_records_info_till_end_of_wal(lsn1);
> \watch 1
>
> session 1:
> select * from pg_synthesize_wal_record(1000*1024*1024);  --> generate
> ~1 GB of WAL record and we see ERROR:  could not read WAL at XXXXX in
> session 2.
>
> Delay the checkpoint (set checkpoint_timeout to 1hr) just not recycle
> the wal files while we run pg_walinspect functions, no other changes
> required from the default initdb settings on the server.
>
> And, Thomas's patch fixes the issue.

Here's v2 patch (up on Thomas's v1 at [1]) using private_data to set
the end of the WAL flag. Please have a look at it.

[1] https://www.postgresql.org/message-id/CA%2BhUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA%40mail.gmail.com

Regards,
Bharath Rupireddy.

Attachment

Re: pgsql: Add contrib/pg_walinspect.

From
Jeff Davis
Date:
On Wed, 2022-04-27 at 13:47 +0530, Bharath Rupireddy wrote:
> I found an easy way to reproduce this consistently (I think on any
> server):
> 
> I basically generated huge WAL record (I used a fun extension that I
> wrote - https://github.com/BRupireddy/pg_synthesize_wal, but one can
> use pg_logical_emit_message as well)

Thank you Bharath for creating the extension and the simple test case.

Thomas's patch solves the issue for me as well.

Tom, the debug patch you posted[0] seems to be setting the error
message if it's not already set. Thomas's patch uses the lack of a
message as a signal that we've reached the end of WAL. That explains
why you are still seeing the problem.

Obviously, that's a sign that Thomas's patch is not the cleanest
solution. But other approaches would be more invasive. I guess the
question is whether that's a good enough solution for now, and
hopefully we could improve the API later; or whether we need to come up
with something better.

When reviewing, I considered the inability to read old WAL and the
inability to read flushed-in-the-middle-of-a-record WAL as similar
kinds of errors that the user would need to deal with. But they are
different: the former can be avoided by creating a slot; the latter
can't be easily avoided, only retried.

Depending on the intended use cases, forcing the user to retry might be
reasonable, in which case we could consider this a test problem rather
than a real problem, and we might be able to do something simpler to
just stabilize the test.

Regards,
    Jeff Davis

[0] https://postgr.es/m/295868.1651024073@sss.pgh.pa.us





Re: pgsql: Add contrib/pg_walinspect.

From
Thomas Munro
Date:
On Wed, Apr 27, 2022 at 10:22 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Here's v2 patch (up on Thomas's v1 at [1]) using private_data to set
> the end of the WAL flag. Please have a look at it.

I don't have a strong view on whether it's better to use a NULL error
for this private communication between pg_walinspect.c and the
read_page callback it installs, or install a custom private_data to
signal this condition, or to give up on all this stuff completely and
just wait (see below for thoughts).  I'd feel better about both
no-wait options if the read_page callback in question were actually in
the contrib module, and not in the core code.  On the other hand, I'm
not too hung up about it because I'm really hoping to see the
get-rid-of-all-these-callbacks-and-make-client-code-do-the-waiting
scheme proposed by Horiguchi-san and Heikki developed further, to rip
much of this stuff out in a future release.

If you go with the private_data approach, a couple of small comments:

     if (record == NULL)
     {
+        ReadLocalXLogPageNoWaitPrivate *private_data;
+
+        /* return NULL, if end of WAL is reached */
+        private_data = (ReadLocalXLogPageNoWaitPrivate *)
+                            xlogreader->private_data;
+
+        if (private_data->end_of_wal)
+            return NULL;
+
         if (errormsg)
             ereport(ERROR,
                     (errcode_for_file_access(),
                      errmsg("could not read WAL at %X/%X: %s",
                             LSN_FORMAT_ARGS(first_record), errormsg)));
-        else
-            ereport(ERROR,
-                    (errcode_for_file_access(),
-                     errmsg("could not read WAL at %X/%X",
-                            LSN_FORMAT_ARGS(first_record))));
     }

I think you should leave that second ereport() call in, in this
variant of the patch, no?  I don't know if anything else raises errors
with no message, but if we're still going to treat them as silent
end-of-data conditions then you might as well go with the v1 patch.

Another option might be to abandon this whole no-wait concept and
revert 2258e76f's changes to xlogutils.c.  pg_walinspect already does
preliminary checks that LSNs are in range, so you can't enter a value
that will wait indefinitely, and it's interruptible (it's a 1ms
sleep/check loop, not my favourite programming pattern but that's
pre-existing code).  If you're unlucky enough to hit the case where
the LSN is judged to be valid but is in the middle of a record that
hasn't been totally flushed yet, it'll just be a bit slower to return
as we wait for the inevitable later flush(es) to happen.  The rest of
your record will *surely* be flushed pretty soon (or the flushing
backend panics the whole system and time ends).  I don't imagine this
is performance critical work, so maybe that'd be acceptable?



Re: pgsql: Add contrib/pg_walinspect.

From
Jeff Davis
Date:
On Thu, 2022-04-28 at 12:11 +1200, Thomas Munro wrote:
> 
> Another option might be to abandon this whole no-wait concept and
> revert 2258e76f's changes to xlogutils.c.  pg_walinspect already does
> preliminary checks that LSNs are in range, so you can't enter a value
> that will wait indefinitely, and it's interruptible (it's a 1ms
> sleep/check loop, not my favourite programming pattern but that's
> pre-existing code).  If you're unlucky enough to hit the case where
> the LSN is judged to be valid but is in the middle of a record that
> hasn't been totally flushed yet, it'll just be a bit slower to return
> as we wait for the inevitable later flush(es) to happen.  The rest of
> your record will *surely* be flushed pretty soon (or the flushing
> backend panics the whole system and time ends).  I don't imagine this
> is performance critical work, so maybe that'd be acceptable?

I'm inclined toward this option. I was a bit iffy on those xlogutils.c
changes to begin with, and they are causing a problem now, so I'd like
to avoid layering on more workarounds.

The time when we need to wait is very narrow, only in this case where
it's earlier than the flush pointer, and the flush pointer is in the
middle of a record that's not fully flushed. And as you say, we won't
be waiting very long in that case, because once we start to write a WAL
record it better finish soon.

Bharath, thoughts? When you originally introduced the nowait behavior,
I believe that was to solve the case where someone specifies an LSN
range well in the future, but we can still catch that and throw an
error if we see that it's beyond the flush pointer. Do you see a
problem with just doing that and getting rid of the nowait changes?

Regards,
    Jeff Davis





Re: pgsql: Add contrib/pg_walinspect.

From
Bharath Rupireddy
Date:
On Thu, Apr 28, 2022 at 8:41 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2022-04-28 at 12:11 +1200, Thomas Munro wrote:
> >
> > Another option might be to abandon this whole no-wait concept and
> > revert 2258e76f's changes to xlogutils.c.  pg_walinspect already does
> > preliminary checks that LSNs are in range, so you can't enter a value
> > that will wait indefinitely, and it's interruptible (it's a 1ms
> > sleep/check loop, not my favourite programming pattern but that's
> > pre-existing code).  If you're unlucky enough to hit the case where
> > the LSN is judged to be valid but is in the middle of a record that
> > hasn't been totally flushed yet, it'll just be a bit slower to return
> > as we wait for the inevitable later flush(es) to happen.  The rest of
> > your record will *surely* be flushed pretty soon (or the flushing
> > backend panics the whole system and time ends).  I don't imagine this
> > is performance critical work, so maybe that'd be acceptable?
>
> I'm inclined toward this option. I was a bit iffy on those xlogutils.c
> changes to begin with, and they are causing a problem now, so I'd like
> to avoid layering on more workarounds.
>
> The time when we need to wait is very narrow, only in this case where
> it's earlier than the flush pointer, and the flush pointer is in the
> middle of a record that's not fully flushed. And as you say, we won't
> be waiting very long in that case, because once we start to write a WAL
> record it better finish soon.
>
> Bharath, thoughts? When you originally introduced the nowait behavior,
> I believe that was to solve the case where someone specifies an LSN
> range well in the future, but we can still catch that and throw an
> error if we see that it's beyond the flush pointer. Do you see a
> problem with just doing that and getting rid of the nowait changes?

It's not just the flush ptr: without no-wait mode, the functions would
wait if the start/input lsn is, say, the current flush lsn minus 1, 2,
or more bytes (before the previous record). If the functions were to
wait, by how much time should they wait? A timeout? Forever? This is
the complexity we wanted to avoid. I would still vote for the
private_data approach, and if the end of WAL is reached when the flush
lsn falls in the middle of a record, let's just exit the loop and
report the results back to the client.

I addressed Thomas's review comment and attached the v3 patch. Please have a look.

Regards,
Bharath Rupireddy.

Attachment

Re: pgsql: Add contrib/pg_walinspect.

From
Jeff Davis
Date:
On Fri, 2022-04-29 at 10:46 +0530, Bharath Rupireddy wrote:
> It's not just the flush ptr, without no wait mode, the functions
> would
> wait if start/input lsn is, say current flush lsn - 1 or 2 or more
> (before the previous record) bytes. If the functions were to wait, by
> how much time should they wait? a timeout? forever?

I see, you're talking about the case of XLogFindNextRecord(), not
XLogReadRecord().

XLogFindNextRecord() is the only way to align the user-provided start
LSN on a valid record, but that calls XLogReadRecord(), which may wait
indefinitely. If there were a different entry point that just did the
alignment and skipped past continuation records, we could prevent it
from trying to read the next record if we are already at the flush
pointer. But without some tweak to that API, nowait is still needed.

Committed your v3 patch with minor modifications.

Regards,
    Jeff Davis