Thread: pg15b3: recovery fails with wal prefetch enabled

pg15b3: recovery fails with wal prefetch enabled

From
Justin Pryzby
Date:
An internal VM crashed last night due to OOM.

When I tried to start postgres, it failed like:

< 2022-08-31 08:44:10.495 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 08:44:10.609 CDT  >LOG:  request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT  >ERROR:  xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT  >FATAL:  checkpoint request failed

I was able to start it with -c recovery_prefetch=no, so it seems like
prefetch tried to do too much.  The VM runs centos7 under qemu.
I'm making a copy of the data dir in case it's needed.

-- 
Justin



Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> < 2022-08-31 08:44:10.495 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
> < 2022-08-31 08:44:10.609 CDT  >LOG:  request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> < 2022-08-31 08:44:10.609 CDT  >ERROR:  xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> < 2022-08-31 08:44:10.609 CDT  >FATAL:  checkpoint request failed
>
> I was able to start it with -c recovery_prefetch=no, so it seems like
> prefetch tried to do too much.  The VM runs centos7 under qemu.
> I'm making a copy of the data dir in case it's needed.

Hmm, a page with an LSN set 118208 bytes past the end of WAL.  It's a
vm fork page (which recovery prefetch should ignore completely).  Did
you happen to get a copy before the successful recovery?  After the
successful recovery, what LSN does that page have, and can you find
the references to it in the WAL with eg pg_waldump -R 1663/16681/2840
-F vm?  Have you turned FPW off (perhaps this is on ZFS?)?



Re: pg15b3: recovery fails with wal prefetch enabled

From
Justin Pryzby
Date:
On Thu, Sep 01, 2022 at 12:05:36PM +1200, Thomas Munro wrote:
> On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > < 2022-08-31 08:44:10.495 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
> > < 2022-08-31 08:44:10.609 CDT  >LOG:  request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT  >ERROR:  xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT  >FATAL:  checkpoint request failed
> >
> > I was able to start it with -c recovery_prefetch=no, so it seems like
> > prefetch tried to do too much.  The VM runs centos7 under qemu.
> > I'm making a copy of the data dir in case it's needed.
> 
> Hmm, a page with an LSN set 118208 bytes past the end of WAL.  It's a
> vm fork page (which recovery prefetch should ignore completely).  Did
> you happen to get a copy before the successful recovery?  After the
> successful recovery, what LSN does that page have, and can you find
> the references to it in the WAL with eg pg_waldump -R 1663/16681/2840
> -F vm?  Have you turned FPW off (perhaps this is on ZFS?)?

Yes, I have a copy that reproduces the issue:

#1  0x00000000009a0df4 in errfinish (filename=<optimized out>, filename@entry=0xa15535 "xlog.c", lineno=lineno@entry=2671, funcname=funcname@entry=0xa1da27 <__func__.22763> "XLogFlush") at elog.c:588
#2  0x000000000055f1cf in XLogFlush (record=19795985532144) at xlog.c:2668
#3  0x0000000000813b24 in FlushBuffer (buf=0x7fffdf1f8900, reln=<optimized out>) at bufmgr.c:2889
#4  0x0000000000817a5b in SyncOneBuffer (buf_id=buf_id@entry=7796, skip_recently_used=skip_recently_used@entry=false, wb_context=wb_context@entry=0x7fffffffcdf0) at bufmgr.c:2576
#5  0x00000000008181c2 in BufferSync (flags=flags@entry=358) at bufmgr.c:2164
#6  0x00000000008182f5 in CheckPointBuffers (flags=flags@entry=358) at bufmgr.c:2743
#7  0x00000000005587b2 in CheckPointGuts (checkPointRedo=19795985413936, flags=flags@entry=358) at xlog.c:6855
#8  0x000000000055feb3 in CreateCheckPoint (flags=flags@entry=358) at xlog.c:6534
#9  0x00000000007aceaa in CheckpointerMain () at checkpointer.c:455
#10 0x00000000007aac52 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at auxprocess.c:153
#11 0x00000000007b0bd8 in StartChildProcess (type=<optimized out>) at postmaster.c:5430
#12 0x00000000007b388f in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0xf139e0) at postmaster.c:1463
#13 0x00000000004986a6 in main (argc=7, argv=0xf139e0) at main.c:202

It's not on zfs, and FPW have never been turned off.

I should add that this instance has been pg_upgraded since v10.

BTW, base/16881 is the postgres DB (which has 43GB of logfiles imported from
CSV, plus 2GB of snapshots of pg_control_checkpoint, pg_settings,
pg_stat_bgwriter, pg_stat_database, pg_stat_wal).

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'main', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid  
---------------+----------+-------+-------+-------+---------+----------+---------+------------
 1201/1CDD1F98 |    -6200 |     1 |    44 |   424 |    8192 |     8192 |       4 | 3681043287
(1 row)

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'vm', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1201/1CAF84F0 |    -6010 |     0 |    24 |  8192 |    8192 |     8192 |       4 |         0

I found this in waldump (note that you had a typo - it's 16881).

[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
rmgr: Heap2       len (rec/tot):     64/   122, tx:          0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF3AF8, prev 1201/1CAF2788, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 0
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF3B70, prev 1201/1CAF3B38, desc: VISIBLE cutoff xid 3671427998 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 2
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF4DC8, prev 1201/1CAF3BB0, desc: VISIBLE cutoff xid 3672889900 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 4
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF5FB0, prev 1201/1CAF4E08, desc: VISIBLE cutoff xid 3679743844 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 5
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7320, prev 1201/1CAF5FF0, desc: VISIBLE cutoff xid 3679743844 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 6
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7398, prev 1201/1CAF7360, desc: VISIBLE cutoff xid 3679751919 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 11
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7410, prev 1201/1CAF73D8, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 17
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7488, prev 1201/1CAF7450, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 19
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7500, prev 1201/1CAF74C8, desc: VISIBLE cutoff xid 3645406844 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 23
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7578, prev 1201/1CAF7540, desc: VISIBLE cutoff xid 3669978567 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 24
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF75F0, prev 1201/1CAF75B8, desc: VISIBLE cutoff xid 0 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 25
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7668, prev 1201/1CAF7630, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 26
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF76E0, prev 1201/1CAF76A8, desc: VISIBLE cutoff xid 3671911724 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 27
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7758, prev 1201/1CAF7720, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 34
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF77D0, prev 1201/1CAF7798, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 35
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7EF8, prev 1201/1CAF7810, desc: VISIBLE cutoff xid 3672408544 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 37
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7F70, prev 1201/1CAF7F38, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 38
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF7FE8, prev 1201/1CAF7FB0, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 39
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF8078, prev 1201/1CAF8040, desc: VISIBLE cutoff xid 3678237783 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 41
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF80F0, prev 1201/1CAF80B8, desc: VISIBLE cutoff xid 3672408544 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 42
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF8168, prev 1201/1CAF8130, desc: VISIBLE cutoff xid 3680789266 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 43
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF81E0, prev 1201/1CAF81A8, desc: VISIBLE cutoff xid 3667994218 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 44
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF8258, prev 1201/1CAF8220, desc: VISIBLE cutoff xid 3680789266 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 45
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF82D0, prev 1201/1CAF8298, desc: VISIBLE cutoff xid 3673830395 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 48
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF8348, prev 1201/1CAF8310, desc: VISIBLE cutoff xid 0 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 50
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF83C0, prev 1201/1CAF8388, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 51
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF8438, prev 1201/1CAF8400, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 52
rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF84B0, prev 1201/1CAF8478, desc: VISIBLE cutoff xid 3678741092 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 53
pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0

I could send our WAL to you if that's desirable ..

-- 
Justin



Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Thu, Sep 1, 2022 at 12:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Yes, I have a copy that reproduces the issue:

That's good news.

So the last record touching that page was:

> rmgr: Heap2       len (rec/tot):     59/    59, tx:          0, lsn: 1201/1CAF84B0, prev 1201/1CAF8478, desc: VISIBLE cutoff xid 3678741092 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 53

I think the expected LSN for that page is past the end of that record,
so 0x1CAF84B0 + 59 = 0x1caf84eb which rounds up to 0x1CAF84F0, and
indeed we see that in the restored page when recovery succeeds.
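
To spell that arithmetic out (a minimal standalone sketch, not PostgreSQL
source; MAXALIGN8 stands in for the backend's MAXALIGN macro on a typical
64-bit build):

#include <stdint.h>
#include <stdio.h>

#define MAXALIGN8(x) (((x) + 7) & ~(uint64_t) 7)

int
main(void)
{
	uint64_t	rec_start = 0x12011CAF84B0;	/* lsn: 1201/1CAF84B0 */
	uint64_t	rec_len = 59;				/* len (rec/tot): 59/59 */
	uint64_t	page_lsn = MAXALIGN8(rec_start + rec_len);

	/* prints 1201/1CAF84F0, matching the LSN on the restored vm page */
	printf("%X/%08X\n", (unsigned) (page_lsn >> 32), (unsigned) page_lsn);
	return 0;
}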

Next question: why do we think the WAL finishes at 1201/1CADB730 while
running that checkpoint?  Looking...



Re: pg15b3: recovery fails with wal prefetch enabled

From
Justin Pryzby
Date:
Some more details, in case they're important:

First: the server has wal_compression=zstd (I wonder if something
doesn't allow/accommodate compressed FPI?)

I thought to mention that after compiling pg15 locally and forgetting to
use --with-zstd.

I compiled it to enable your debug logging, which wrote these during
recovery:

< 2022-08-31 21:17:01.807 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 21:17:01.903 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 21:17:02.029 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small

Also, pg_waldump seems to fail early with -w:
[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -w -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
rmgr: Heap2       len (rec/tot):     64/   122, tx:          0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0

Also, the VM has crashed with OOM before, while running pg15, with no issue in
recovery.  I haven't been able to track down the cause..

The VM is running: kernel-3.10.0-1160.66.1.el7.x86_64

pgsql is an ext4 FS (no tablespaces), which is a qemu block device
exposed like:

      <driver name='qemu' type='raw' cache='none' io='native'/>
      <target dev='vdg' bus='virtio'/>

It's nowhere near full:

/dev/vdc          96G    51G   46G  53% /var/lib/pgsql



Re: pg15b3: recovery fails with wal prefetch enabled

From
Kyotaro Horiguchi
Date:
At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > < 2022-08-31 08:44:10.495 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
> > < 2022-08-31 08:44:10.609 CDT  >LOG:  request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT  >ERROR:  xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT  >FATAL:  checkpoint request failed
> >
> > I was able to start it with -c recovery_prefetch=no, so it seems like
> > prefetch tried to do too much.  The VM runs centos7 under qemu.
> > I'm making a copy of the data dir in case it's needed.

Just for information, there was a bug in the overwrite-aborted-contrecord
feature which causes this kind of failure (an xlog flush request exceeding
the insertion bleeding edge). If it is that, it was fixed by 6672d79139
two days ago.

http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com
http://postgr.es/m/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > < 2022-08-31 08:44:10.495 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
> > > < 2022-08-31 08:44:10.495 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
> > > < 2022-08-31 08:44:10.609 CDT  >LOG:  request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> > > < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> > > < 2022-08-31 08:44:10.609 CDT  >ERROR:  xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> > > < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> > > < 2022-08-31 08:44:10.609 CDT  >FATAL:  checkpoint request failed
> > >
> > > I was able to start it with -c recovery_prefetch=no, so it seems like
> > > prefetch tried to do too much.  The VM runs centos7 under qemu.
> > > I'm making a copy of the data dir in case it's needed.
>
> Just for information, there was a bug in the overwrite-aborted-contrecord
> feature which causes this kind of failure (an xlog flush request exceeding
> the insertion bleeding edge). If it is that, it was fixed by 6672d79139
> two days ago.

Hmm.  Justin, when you built from source, which commit were you at?
If it's REL_15_BETA3, any chance you could cherry pick that change and
check what happens?  And without that, could you show what this logs
for good and bad recovery settings?

Attachment

Re: pg15b3: recovery fails with wal prefetch enabled

From
Justin Pryzby
Date:
On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote:
> On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> > > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > < 2022-08-31 08:44:10.495 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
> > > > < 2022-08-31 08:44:10.609 CDT  >LOG:  request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> > > > < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> > > > < 2022-08-31 08:44:10.609 CDT  >ERROR:  xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> > > > < 2022-08-31 08:44:10.609 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
> > > > < 2022-08-31 08:44:10.609 CDT  >FATAL:  checkpoint request failed
> > > >
> > > > I was able to start it with -c recovery_prefetch=no, so it seems like
> > > > prefetch tried to do too much.  The VM runs centos7 under qemu.
> > > > I'm making a copy of the data dir in case it's needed.
> >
> > Just for information, there was a bug in the overwrite-aborted-contrecord
> > feature which causes this kind of failure (an xlog flush request exceeding
> > the insertion bleeding edge). If it is that, it was fixed by 6672d79139
> > two days ago.
> 
> Hmm.  Justin, when you built from source, which commit were you at?
> If it's REL_15_BETA3,

No - it's:
commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE)
  
> If it's REL_15_BETA3, any chance you could cherry pick that change and
> check what happens?  And without that, could you show what this logs
> for good and bad recovery settings?

I wasn't sure what you meant by "without that", so here's a bunch of logs to
sift through:

At a203, with #define XLOGPREFETCHER_DEBUG_LEVEL NOTICE:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678
...
< 2022-08-31 23:31:38.690 CDT  >LOG:  redo starts at 1201/1B931F50
< 2022-08-31 23:31:40.204 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 23:31:40.307 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 23:31:40.493 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small
< 2022-08-31 23:31:40.721 CDT  >LOG:  redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.23 s, elapsed: 2.03 s
< 2022-08-31 23:31:41.452 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:31:41.698 CDT  >LOG:  request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 23:31:41.698 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:31:41.698 CDT  >ERROR:  xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 23:31:41.698 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:31:41.699 CDT  >FATAL:  checkpoint request failed
< 2022-08-31 23:31:41.699 CDT  >HINT:  Consult recent messages in the server log for details.
< 2022-08-31 23:31:41.704 CDT  >LOG:  startup process (PID 25046) exited with exit code 1
< 2022-08-31 23:31:41.704 CDT  >LOG:  terminating any other active server processes
< 2022-08-31 23:31:41.705 CDT  >LOG:  shutting down due to startup process failure
< 2022-08-31 23:31:41.731 CDT  >LOG:  database system is shut down

With your patch:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678
...
< 2022-08-31 23:34:22.897 CDT  >LOG:  redo starts at 1201/1B931F50
< 2022-08-31 23:34:23.146 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 23:34:23.147 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 23:34:23.268 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small
< 2022-08-31 23:34:23.323 CDT  >LOG:  redo done at 1201/1CADB300 system usage: CPU: user: 0.29 s, system: 0.12 s, elapsed: 0.42 s
< 2022-08-31 23:34:23.323 CDT  >LOG:  point 0: lastRec = 12011cadb300
< 2022-08-31 23:34:23.323 CDT  >LOG:  point 0: endOfLog = 12011cadb730
< 2022-08-31 23:34:23.323 CDT  >LOG:  XXX point 1: EndOfLog = 12011cadb730
< 2022-08-31 23:34:23.386 CDT  >LOG:  XXX point 2: EndOfLog = 12011cadb730
< 2022-08-31 23:34:23.387 CDT  >LOG:  XXX point 3: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:34:23.565 CDT  >LOG:  XXX point 4: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:34:23.606 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:34:23.767 CDT  >LOG:  request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 23:34:23.767 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:34:23.767 CDT  >ERROR:  xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 23:34:23.767 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:34:23.768 CDT  >FATAL:  checkpoint request failed

And without prefetch:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c recovery_prefetch=no
...
< 2022-08-31 23:37:08.792 CDT  >LOG:  redo starts at 1201/1B931F50
< 2022-08-31 23:37:09.269 CDT  >LOG:  invalid record length at 1201/1CD91010: wanted 24, got 0
< 2022-08-31 23:37:09.269 CDT  >LOG:  redo done at 1201/1CD90E48 system usage: CPU: user: 0.35 s, system: 0.11 s, elapsed: 0.47 s
< 2022-08-31 23:37:09.269 CDT  >LOG:  point 0: lastRec = 12011cd90e48
< 2022-08-31 23:37:09.269 CDT  >LOG:  point 0: endOfLog = 12011cd91010
< 2022-08-31 23:37:09.269 CDT  >LOG:  XXX point 1: EndOfLog = 12011cd91010
< 2022-08-31 23:37:09.350 CDT  >LOG:  XXX point 2: EndOfLog = 12011cd91010
< 2022-08-31 23:37:09.350 CDT  >LOG:  XXX point 3: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:37:09.420 CDT  >LOG:  XXX point 4: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:37:09.552 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:37:12.987 CDT  >LOG:  checkpoint complete: wrote 8030 buffers (49.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.403 s, sync=2.841 s, total=3.566 s; sync files=102, longest=2.808 s, average=0.028 s; distance=20860 kB, estimate=20860 kB
< 2022-08-31 23:37:13.077 CDT  >LOG:  database system is ready to accept connections

If I revert 6672d79139 (and roll back to the unrecovered state):

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 # -c recovery_prefetch=no
...
< 2022-08-31 23:42:40.592 CDT  >LOG:  redo starts at 1201/1B931F50
< 2022-08-31 23:42:42.168 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 23:42:42.238 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 23:42:42.405 CDT  >NOTICE:  suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small
< 2022-08-31 23:42:42.602 CDT  >LOG:  redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.25 s, elapsed: 2.01 s
< 2022-08-31 23:42:42.602 CDT  >LOG:  point 0: lastRec = 12011cadb300
< 2022-08-31 23:42:42.602 CDT  >LOG:  point 0: endOfLog = 12011cadb730
< 2022-08-31 23:42:42.602 CDT  >LOG:  XXX point 1: EndOfLog = 12011cadb730
< 2022-08-31 23:42:42.830 CDT  >LOG:  XXX point 2: EndOfLog = 12011cadb730
< 2022-08-31 23:42:42.830 CDT  >LOG:  XXX point 3: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:42:43.194 CDT  >LOG:  XXX point 4: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:42:43.266 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:42:43.425 CDT  >LOG:  request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 23:42:43.425 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:42:43.425 CDT  >ERROR:  xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 23:42:43.425 CDT  >CONTEXT:  writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:42:43.425 CDT  >FATAL:  checkpoint request failed
< 2022-08-31 23:42:43.425 CDT  >HINT:  Consult recent messages in the server log for details.
< 2022-08-31 23:42:43.431 CDT  >LOG:  startup process (PID 2415) exited with exit code 1
< 2022-08-31 23:42:43.431 CDT  >LOG:  terminating any other active server processes
< 2022-08-31 23:42:43.432 CDT  >LOG:  shutting down due to startup process failure
< 2022-08-31 23:42:43.451 CDT  >LOG:  database system is shut down

If I revert 6672d79139 and disable prefetch:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c recovery_prefetch=no
...
< 2022-08-31 23:43:25.711 CDT  >LOG:  redo starts at 1201/1B931F50
< 2022-08-31 23:43:26.178 CDT  >LOG:  invalid record length at 1201/1CD91010: wanted 24, got 0
< 2022-08-31 23:43:26.178 CDT  >LOG:  redo done at 1201/1CD90E48 system usage: CPU: user: 0.33 s, system: 0.11 s, elapsed: 0.46 s
< 2022-08-31 23:43:26.178 CDT  >LOG:  point 0: lastRec = 12011cd90e48
< 2022-08-31 23:43:26.178 CDT  >LOG:  point 0: endOfLog = 12011cd91010
< 2022-08-31 23:43:26.178 CDT  >LOG:  XXX point 1: EndOfLog = 12011cd91010
< 2022-08-31 23:43:26.369 CDT  >LOG:  XXX point 2: EndOfLog = 12011cd91010
< 2022-08-31 23:43:26.369 CDT  >LOG:  XXX point 3: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:43:26.433 CDT  >LOG:  XXX point 4: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:43:26.490 CDT  >LOG:  checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:43:29.519 CDT  >LOG:  checkpoint complete: wrote 8030 buffers (49.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.380 s, sync=2.492 s, total=3.086 s; sync files=102, longest=2.438 s, average=0.025 s; distance=20860 kB, estimate=20860 kB
< 2022-08-31 23:43:29.567 CDT  >LOG:  database system is ready to accept connections




Re: pg15b3: recovery fails with wal prefetch enabled

From
Kyotaro Horiguchi
Date:
At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in 
> On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote:
> > On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > Just for information, there was a bug in the overwrite-aborted-contrecord
> > > feature which causes this kind of failure (an xlog flush request exceeding
> > > the insertion bleeding edge). If it is that, it was fixed by 6672d79139
> > > two days ago.
> > 
> > Hmm.  Justin, when you built from source, which commit were you at?
> > If it's REL_15_BETA3,
> 
> No - it's:
> commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE)

It's newer than eb29fa3889 (6672d79139 on master), so that fix is already
included in your build.

> > If it's REL_15_BETA3, any chance you could cherry pick that change and
> > check what happens?  And without that, could you show what this logs
> > And without that, could you show what this logs
> > for good and bad recovery settings?
> 
> I wasn't sure what you meant by "without that", so here's a bunch of logs to
> sift through:

There's no need for cherry-picking..

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Thu, Sep 1, 2022 at 5:18 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
> > On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote:
> > > Hmm.  Justin, when you built from source, which commit were you at?
> > > If it's REL_15_BETA3,
> >
> > No - it's:
> > commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE)
>
> It's newer than eb29fa3889 (6672d79139 on master), so that fix is already
> included in your build.

Yeah.

> > I wasn't sure what you meant by "without that", so here's a bunch of logs to
> > sift through:

So it *looks* like it finished early (and without the expected
error?).  But it also looks like it replayed that record, according to
the page LSN.  So which is it?  Could you recompile with WAL_DEBUG
defined in pg_config_manual.h, and run recovery with wal_debug = on,
and see if it replays 1CAF84B0?



Re: pg15b3: recovery fails with wal prefetch enabled

From
Justin Pryzby
Date:
On Thu, Sep 01, 2022 at 05:35:23PM +1200, Thomas Munro wrote:
> So it *looks* like it finished early (and without the expected
> error?).  But it also looks like it replayed that record, according to
> the page LSN.  So which is it?  Could you recompile with WAL_DEBUG
> defined in pg_config_manual.h, and run recovery with wal_debug = on,
> and see if it replays 1CAF84B0?

This is with 6672d79139 un-reverted.

$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on 2>&1 |grep 1CAF84B0 || echo not found
not found

$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on -c recovery_prefetch=no 2>&1 |grep 1CAF84B0 || echo not found
< 2022-09-01 00:44:55.878 CDT  >LOG:  REDO @ 1201/1CAF8478; LSN 1201/1CAF84B0: prev 1201/1CAF8438; xid 0; len 2; blkref #0: rel 1663/16881/2840, blk 53 - Heap2/VACUUM: nunused 4
< 2022-09-01 00:44:55.878 CDT  >LOG:  REDO @ 1201/1CAF84B0; LSN 1201/1CAF84F0: prev 1201/1CAF8478; xid 0; len 5; blkref #0: rel 1663/16881/2840, fork 2, blk 0; blkref #1: rel 1663/16881/2840, blk 53 - Heap2/VISIBLE: cutoff xid 3678741092 flags 0x01
< 2022-09-01 00:44:55.878 CDT  >LOG:  REDO @ 1201/1CAF84F0; LSN 1201/1CAF8AC0: prev 1201/1CAF84B0; xid 0; len 2; blkref #0: rel 1663/16881/1259, blk 1 FPW, compression method: zstd - Heap/INPLACE: off 33
 

(Note that "compression method: zstd" is a local change to
xlog_block_info() which I just extracted from my original patch for
wal_compression, after forgetting to compile --with-zstd.  I'll mail
about that at a later time...).

-- 
Justin



Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Thu, Sep 1, 2022 at 5:52 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> compression method: zstd

Ahh, problem repro'd here with WAL compression.  More soon.



Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Thu, Sep 1, 2022 at 11:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Ahh, problem repro'd here with WAL compression.  More soon.

I followed some false pistes for a while there, but I finally figured
out what's happening here after Justin kindly shared some files
with me.  The active ingredient here is a setting of
maintenance_io_concurrency=0, which runs into a dumb accounting problem
of the fencepost variety and incorrectly concludes it's reached the
end early.  Setting it to 3 or higher allows his system to complete
recovery.  I'm working on a fix ASAP.



Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Fri, Sep 2, 2022 at 6:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... The active ingredient here is a setting of
> maintenance_io_concurrency=0, which runs into a dumb accounting problem
> of the fencepost variety and incorrectly concludes it's reached the
> end early.  Setting it to 3 or higher allows his system to complete
> recovery.  I'm working on a fix ASAP.

The short version is that when tracking the number of IOs in progress,
I had two steps in the wrong order in the algorithm for figuring out
whether IO is saturated.  Internally, the effect of
maintenance_io_concurrency is clamped to 2 or more, and that mostly
hides the bug until you try to replay a particular sequence like
Justin's with such a low setting.  Without that clamp, and if you set
it to 1, then several of our recovery tests fail.
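
As a toy illustration of that kind of ordering/fencepost mistake (this is
not the actual xlogprefetcher.c code, just the general shape of the
problem): if the in-flight counter is compared against the limit before the
slots freed by already-completed I/Os are handed back, a limit of 0 or 1
looks permanently saturated and the prefetcher gives up too early.

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
	int			inflight;	/* I/Os currently in progress */
	int			limit;		/* effective maintenance_io_concurrency */
} ToyQueue;

/* Buggy order: test saturation first, retire completed I/Os second. */
static bool
can_issue_more_buggy(ToyQueue *q, int completed)
{
	if (q->inflight >= q->limit)
		return false;			/* wrongly "saturated" with limit 0/1 */
	q->inflight -= completed;
	return true;
}

/* Fixed order: retire completed I/Os first, then compare. */
static bool
can_issue_more_fixed(ToyQueue *q, int completed)
{
	q->inflight -= completed;
	return q->inflight < q->limit;
}

int
main(void)
{
	/* One I/O was started and has since completed; the limit is 1. */
	ToyQueue	a = {1, 1};
	ToyQueue	b = {1, 1};

	/* buggy says 0 (stop prefetching), fixed says 1 (keep going) */
	printf("%d %d\n", can_issue_more_buggy(&a, 1), can_issue_more_fixed(&b, 1));
	return 0;
}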

That clamp was a bad idea.  What I think we really want is for
maintenance_io_concurrency=0 to disable recovery prefetching exactly
as if you'd set recovery_prefetch=off, and any other setting including
1 to work without clamping.

Here's the patch I'm currently testing.  It also fixes a related
dangling reference problem with very small maintenance_io_concurrency.

I had this more or less figured out on Friday when I wrote last, but I
got stuck on a weird problem with 026_overwrite_contrecord.pl.  I
think that failure case should report an error, no?  I find it strange
that we end recovery in silence.  That was a problem for the new
coding in this patch, because it is confused by XLREAD_FAIL without
queuing an error, and then retries, which clobbers the aborted recptr
state.  I'm still looking into that.

Attachment

Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Mon, Sep 5, 2022 at 1:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I had this more or less figured out on Friday when I wrote last, but I
> got stuck on a weird problem with 026_overwrite_contrecord.pl.  I
> think that failure case should report an error, no?  I find it strange
> that we end recovery in silence.  That was a problem for the new
> coding in this patch, because it is confused by XLREAD_FAIL without
> queuing an error, and then retries, which clobbers the aborted recptr
> state.  I'm still looking into that.

On reflection, it'd be better not to clobber any pre-existing error
there, but report one only if there isn't one already queued.  I've
done that in this version, which I'm planning to do a bit more testing
on and commit soonish if there are no comments/objections, especially
for that part.

I'll have to check whether a doc change is necessary somewhere to
advertise that maintenance_io_concurrency=0 turns off prefetching, but
IIRC that's kinda already implied.

I've tested quite a lot of scenarios including make check-world with
maintenance_io_concurrency = 0, 1, 10, 1000, and ALTER SYSTEM for all
relevant GUCs on a standby running large pgbench to check expected
effect on pg_stat_recovery_prefetch view and generate system calls.

Attachment

Re: pg15b3: recovery fails with wal prefetch enabled

From
Kyotaro Horiguchi
Date:
At Mon, 5 Sep 2022 13:28:12 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> I had this more or less figured out on Friday when I wrote last, but I
> got stuck on a weird problem with 026_overwrite_contrecord.pl.  I
> think that failure case should report an error, no?  I find it strange
> that we end recovery in silence.  That was a problem for the new
> coding in this patch, because it is confused by XLREAD_FAIL without
> queuing an error, and then retries, which clobbers the aborted recptr
> state.  I'm still looking into that.

+1 for showing any message for the failure, but I think we shouldn't
hide an existing message if any. And the error messages around are
just telling that "<some error happened> at RecPtr". So I think
"missing contrecord at RecPtr" is sufficient here.

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cdcacc7803..bfe332c014 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -907,6 +907,11 @@ err:
                 */
                state->abortedRecPtr = RecPtr;
                state->missingContrecPtr = targetPagePtr;
+
+               /* Put a generic error message if no particular cause is recorded. */
+               if (!state->errormsg_buf[0])
+                       report_invalid_record(state, "missing contrecord at %X/%X",
+                                                                 LSN_FORMAT_ARGS(RecPtr));
        }
 
        if (decoded && decoded->oversized)

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg15b3: recovery fails with wal prefetch enabled

From
Kyotaro Horiguchi
Date:
(the previous mail was crossing with yours..)

At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
me> +1 for showing any message for the failure, but I think we shouldn't
me> hide an existing message if any.

At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On reflection, it'd be better not to clobber any pre-existing error
> there, but report one only if there isn't one already queued.  I've
> done that in this version, which I'm planning to do a bit more testing
> on and commit soonish if there are no comments/objections, especially
> for that part.

It looks fine in this regard.  I still think that the message looks
somewhat internal.

me> And the error messages around are
me> just telling that "<some error happened> at RecPtr". So I think
me> "missing contrecord at RecPtr" is sufficient here.


> I'll have to check whether a doc change is necessary somewhere to
> advertise that maintenance_io_concurrency=0 turns off prefetching, but
> IIRC that's kinda already implied.
> 
> I've tested quite a lot of scenarios including make check-world with
> maintenance_io_concurrency = 0, 1, 10, 1000, and ALTER SYSTEM for all
> relevant GUCs on a standby running large pgbench to check expected
> effect on pg_stat_recovery_prefetch view and generate system calls.

+    if (likely(record = prefetcher->reader->record))

Isn't this a bit confusing?


+    if (likely(record = prefetcher->reader->record))
+    {
+        XLogRecPtr    replayed_up_to = record->next_lsn;
+
+        XLogReleasePreviousRecord(prefetcher->reader);
+

The likely condition is the prerequisite for
XLogReleasePreviousRecord.  But it is a little hard to read the
condition as "in case no previous record exists".  Since there is one
in most cases, can't we call XLogReleasePreviousRecord() unconditionally,
and have the function return true when the previous record exists and
false if not?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Mon, Sep 5, 2022 at 5:34 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> me> +1 for showing any message for the failure, but I think we shouldn't
> me> hide an existing message if any.
>
> At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> > On reflection, it'd be better not to clobber any pre-existing error
> > there, but report one only if there isn't one already queued.  I've
> > done that in this version, which I'm planning to do a bit more testing
> > on and commit soonish if there are no comments/objections, especially
> > for that part.
>
> It looks fine in this regard.  I still think that the message looks
> somewhat internal.

Thanks for looking!

> me> And the error messages around are
> me> just telling that "<some error happened> at RecPtr". So I think
> me> "missing contrecord at RecPtr" is sufficient here.

Ok, I've updated it like that.

> > I'll have to check whether a doc change is necessary somewhere to
> > advertise that maintenance_io_concurrency=0 turns off prefetching, but
> > IIRC that's kinda already implied.
> >
> > I've tested quite a lot of scenarios including make check-world with
> > maintenance_io_concurrency = 0, 1, 10, 1000, and ALTER SYSTEM for all
> > relevant GUCs on a standby running large pgbench to check expected
> > effect on pg_stat_recovery_prefetch view and generate system calls.
>
> +       if (likely(record = prefetcher->reader->record))
>
> Isn't this a bit confusing?
>
>
> +       if (likely(record = prefetcher->reader->record))
> +       {
> +               XLogRecPtr      replayed_up_to = record->next_lsn;
> +
> +               XLogReleasePreviousRecord(prefetcher->reader);
> +
>
> The likely condition is the prerequisite for
> XLogReleasePreviousRecord.  But it is a little hard to read the
> condition as "in case no previous record exists".  Since there is one
> in most cases, can't we call XLogReleasePreviousRecord() unconditionally,
> and have the function return true when the previous record exists and
> false if not?

We also need the LSN that is past that record.
XLogReleasePreviousRecord() could return it (or we could use
reader->EndRecPtr I suppose).  Thoughts on this version?

Attachment

Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Mon, Sep 5, 2022 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> > > On reflection, it'd be better not to clobber any pre-existing error
> > > there, but report one only if there isn't one already queued.  I've
> > > done that in this version, which I'm planning to do a bit more testing
> > > on and commit soonish if there are no comments/objections, especially
> > > for that part.

Well I was about to commit this, but beta4 just got stamped (but not
yet tagged).  I see now that Jonathan (with RMT hat on, CC'd) meant
commits should be in by the *start* of the 5th AoE, not the end.  So
the procedural/RMT question is whether it's still possible to close
this item in beta4.



Re: pg15b3: recovery fails with wal prefetch enabled

From
Kyotaro Horiguchi
Date:
At Mon, 5 Sep 2022 21:08:16 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in 
> We also need the LSN that is past that record.
> XLogReleasePreviousRecord() could return it (or we could use
> reader->EndRecPtr I suppose).  Thoughts on this version?

(Catching the gap...)

It is easier to read. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: pg15b3: recovery fails with wal prefetch enabled

From
"Jonathan S. Katz"
Date:
On 9/5/22 7:18 PM, Thomas Munro wrote:
> On Mon, Sep 5, 2022 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>> At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>>> At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
>>>> On reflection, it'd be better not to clobber any pre-existing error
>>>> there, but report one only if there isn't one already queued.  I've
>>>> done that in this version, which I'm planning to do a bit more testing
>>>> on and commit soonish if there are no comments/objections, especially
>>>> for that part.
> 
> Well I was about to commit this, but beta4 just got stamped (but not
> yet tagged).  I see now that Jonathan (with RMT hat on, CC'd) meant
> commits should be in by the *start* of the 5th AoE, not the end.  So
> the procedural/RMT question is whether it's still possible to close
> this item in beta4.

Presumably because Tom stamped it, the release is wrapped so it
wouldn't make Beta 4, but I defer to him to see if it can be included 
with the tag.

That said, if it doesn't make it for Beta 4, it would be in the next 
release (which is hopefully RC1).

Thanks,

Jonathan

Attachment

Re: pg15b3: recovery fails with wal prefetch enabled

From
Tom Lane
Date:
"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> On 9/5/22 7:18 PM, Thomas Munro wrote:
>> Well I was about to commit this, but beta4 just got stamped (but not
>> yet tagged).  I see now that Jonathan (with RMT hat on, CC'd) meant
>> commits should be in by the *start* of the 5th AoE, not the end.  So
>> the procedural/RMT question is whether it's still possible to close
>> this item in beta4.

> Presumably because Tom stamped it, the release is wrapped so it
> wouldn't make Beta 4, but I defer to him to see if it can be included 
> with the tag.

I already made the tarballs available to packagers, so adding this
would involve a re-wrap and great confusion.  In any case, I'm not
a fan of pushing fixes within a day or two of the wrap deadline,
let alone after it; you get inadequate buildfarm coverage when you
cut corners that way.  I think this one missed the boat.

            regards, tom lane



Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Tue, Sep 6, 2022 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
> > On 9/5/22 7:18 PM, Thomas Munro wrote:
> >> Well I was about to commit this, but beta4 just got stamped (but not
> >> yet tagged).  I see now that Jonathan (with RMT hat on, CC'd) meant
> >> commits should be in by the *start* of the 5th AoE, not the end.  So
> >> the procedural/RMT question is whether it's still possible to close
> >> this item in beta4.
>
> > Presumably because Tom stamped it, the release is wrapped so it
> > wouldn't make Beta 4, but I defer to him to see if it can be included
> > with the tag.
>
> I already made the tarballs available to packagers, so adding this
> would involve a re-wrap and great confusion.  In any case, I'm not
> a fan of pushing fixes within a day or two of the wrap deadline,
> let alone after it; you get inadequate buildfarm coverage when you
> cut corners that way.  I think this one missed the boat.

Got it.  Yeah I knew it was going to be a close thing with a problem
diagnosed on Thursday/Friday before a Monday wrap, even before I
managed to confuse myself about dates and times.  Thanks both.



Re: pg15b3: recovery fails with wal prefetch enabled

From
"Jonathan S. Katz"
Date:
On 9/5/22 10:03 PM, Thomas Munro wrote:
> On Tue, Sep 6, 2022 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>>> On 9/5/22 7:18 PM, Thomas Munro wrote:
>>>> Well I was about to commit this, but beta4 just got stamped (but not
>>>> yet tagged).  I see now that Jonathan (with RMT hat on, CC'd) meant
>>>> commits should be in by the *start* of the 5th AoE, not the end.  So
>>>> the procedural/RMT question is whether it's still possible to close
>>>> this item in beta4.
>>
>>> Presumably because Tom stamped it, the release is wrapped so it
>>> wouldn't make Beta 4, but I defer to him to see if it can be included
>>> with the tag.
>>
>> I already made the tarballs available to packagers, so adding this
>> would involve a re-wrap and great confusion.  In any case, I'm not
>> a fan of pushing fixes within a day or two of the wrap deadline,
>> let alone after it; you get inadequate buildfarm coverage when you
>> cut corners that way.  I think this one missed the boat.
> 
> Got it.  Yeah I knew it was going to be a close thing with a problem
> diagnosed on Thursday/Friday before a Monday wrap, even before I
> managed to confuse myself about dates and times.  Thanks both.

To close this loop, I added a section for "fixed before RC1" to Open 
Items since this is presumably the next release. We can include it there 
once committed.

Jonathan

Attachment

Re: pg15b3: recovery fails with wal prefetch enabled

From
Thomas Munro
Date:
On Wed, Sep 7, 2022 at 1:56 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> To close this loop, I added a section for "fixed before RC1" to Open
> Items since this is presumably the next release. We can include it there
> once committed.

Done yesterday.

To tie up a couple of loose ends from this thread:

On Thu, Sep 1, 2022 at 2:48 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Also, pg_waldump seems to fail early with -w:
> [pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -w -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
> rmgr: Heap2       len (rec/tot):     64/   122, tx:          0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
> pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0

That looks OK to me.  With or without -w, we get as far as
1201/1CD91010 and then hit zeroes.

On Thu, Sep 1, 2022 at 5:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> So it *looks* like it finished early (and without the expected
> error?).  But it also looks like it replayed that record, according to
> the page LSN.  So which is it?

The reason 1201/1CAF84B0 appeared on a page despite not having been
replayed (due to the bug) is just that vismap pages don't follow the
usual logging rules, and can be read in by heap records that don't
mention the vm page (and therefore no FPW).  So we can finish up
reading a page from disk with a future LSN on it.