Thread: pg15b3: recovery fails with wal prefetch enabled
An internal VM crashed last night due to OOM.  When I tried to start postgres, it failed like:

< 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed

I was able to start it with -c recovery_prefetch=no, so it seems like prefetch tried to do too much.  The VM runs centos7 under qemu.  I'm making a copy of the data dir in case it's needed.

--
Justin
On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
> < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed
>
> I was able to start it with -c recovery_prefetch=no, so it seems like
> prefetch tried to do too much.  The VM runs centos7 under qemu.
> I'm making a copy of the data dir in case it's needed.

Hmm, a page with an LSN set 118208 bytes past the end of WAL.  It's a vm fork page (which recovery prefetch should ignore completely).  Did you happen to get a copy before the successful recovery?  After the successful recovery, what LSN does that page have, and can you find the references to it in the WAL with eg pg_waldump -R 1663/16681/2840 -F vm?  Have you turned FPW off (perhaps this is on ZFS?)?
On Thu, Sep 01, 2022 at 12:05:36PM +1200, Thomas Munro wrote:
> On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
> > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed
> >
> > I was able to start it with -c recovery_prefetch=no, so it seems like
> > prefetch tried to do too much.  The VM runs centos7 under qemu.
> > I'm making a copy of the data dir in case it's needed.
>
> Hmm, a page with an LSN set 118208 bytes past the end of WAL.  It's a
> vm fork page (which recovery prefetch should ignore completely).  Did
> you happen to get a copy before the successful recovery?  After the
> successful recovery, what LSN does that page have, and can you find
> the references to it in the WAL with eg pg_waldump -R 1663/16681/2840
> -F vm?  Have you turned FPW off (perhaps this is on ZFS?)?

Yes, I have a copy that reproduces the issue:

#1  0x00000000009a0df4 in errfinish (filename=<optimized out>, filename@entry=0xa15535 "xlog.c", lineno=lineno@entry=2671, funcname=funcname@entry=0xa1da27 <__func__.22763> "XLogFlush") at elog.c:588
#2  0x000000000055f1cf in XLogFlush (record=19795985532144) at xlog.c:2668
#3  0x0000000000813b24 in FlushBuffer (buf=0x7fffdf1f8900, reln=<optimized out>) at bufmgr.c:2889
#4  0x0000000000817a5b in SyncOneBuffer (buf_id=buf_id@entry=7796, skip_recently_used=skip_recently_used@entry=false, wb_context=wb_context@entry=0x7fffffffcdf0) at bufmgr.c:2576
#5  0x00000000008181c2 in BufferSync (flags=flags@entry=358) at bufmgr.c:2164
#6  0x00000000008182f5 in CheckPointBuffers (flags=flags@entry=358) at bufmgr.c:2743
#7  0x00000000005587b2 in CheckPointGuts (checkPointRedo=19795985413936, flags=flags@entry=358) at xlog.c:6855
#8  0x000000000055feb3 in CreateCheckPoint (flags=flags@entry=358) at xlog.c:6534
#9  0x00000000007aceaa in CheckpointerMain () at checkpointer.c:455
#10 0x00000000007aac52 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at auxprocess.c:153
#11 0x00000000007b0bd8 in StartChildProcess (type=<optimized out>) at postmaster.c:5430
#12 0x00000000007b388f in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0xf139e0) at postmaster.c:1463
#13 0x00000000004986a6 in main (argc=7, argv=0xf139e0) at main.c:202

It's not on zfs, and FPW has never been turned off.  I should add that this instance has been pg_upgraded since v10.

BTW, base/16881 is the postgres DB (which has 43GB of logfiles imported from CSV, plus 2GB of snapshots of pg_control_checkpoint, pg_settings, pg_stat_bgwriter, pg_stat_database, pg_stat_wal).
postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'main', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+------------
 1201/1CDD1F98 |    -6200 |     1 |    44 |   424 |    8192 |     8192 |       4 | 3681043287
(1 row)

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'vm', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1201/1CAF84F0 |    -6010 |     0 |    24 |  8192 |    8192 |     8192 |       4 |         0
(1 row)

I found this in waldump (note that you had a typo - it's 16881).

[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
rmgr: Heap2  len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF3AF8, prev 1201/1CAF2788, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 0
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF3B70, prev 1201/1CAF3B38, desc: VISIBLE cutoff xid 3671427998 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 2
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF4DC8, prev 1201/1CAF3BB0, desc: VISIBLE cutoff xid 3672889900 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 4
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF5FB0, prev 1201/1CAF4E08, desc: VISIBLE cutoff xid 3679743844 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 5
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7320, prev 1201/1CAF5FF0, desc: VISIBLE cutoff xid 3679743844 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 6
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7398, prev 1201/1CAF7360, desc: VISIBLE cutoff xid 3679751919 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 11
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7410, prev 1201/1CAF73D8, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 17
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7488, prev 1201/1CAF7450, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 19
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7500, prev 1201/1CAF74C8, desc: VISIBLE cutoff xid 3645406844 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 23
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7578, prev 1201/1CAF7540, desc: VISIBLE cutoff xid 3669978567 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 24
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF75F0, prev 1201/1CAF75B8, desc: VISIBLE cutoff xid 0 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 25
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7668, prev 1201/1CAF7630, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 26
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF76E0, prev 1201/1CAF76A8, desc: VISIBLE cutoff xid 3671911724 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 27
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7758, prev 1201/1CAF7720, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 34
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF77D0, prev 1201/1CAF7798, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 35
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7EF8, prev 1201/1CAF7810, desc: VISIBLE cutoff xid 3672408544 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 37
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7F70, prev 1201/1CAF7F38, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 38
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF7FE8, prev 1201/1CAF7FB0, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 39
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8078, prev 1201/1CAF8040, desc: VISIBLE cutoff xid 3678237783 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 41
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF80F0, prev 1201/1CAF80B8, desc: VISIBLE cutoff xid 3672408544 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 42
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8168, prev 1201/1CAF8130, desc: VISIBLE cutoff xid 3680789266 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 43
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF81E0, prev 1201/1CAF81A8, desc: VISIBLE cutoff xid 3667994218 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 44
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8258, prev 1201/1CAF8220, desc: VISIBLE cutoff xid 3680789266 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 45
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF82D0, prev 1201/1CAF8298, desc: VISIBLE cutoff xid 3673830395 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 48
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8348, prev 1201/1CAF8310, desc: VISIBLE cutoff xid 0 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 50
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF83C0, prev 1201/1CAF8388, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 51
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF8438, prev 1201/1CAF8400, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 52
rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF84B0, prev 1201/1CAF8478, desc: VISIBLE cutoff xid 3678741092 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 53
pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0

I could send our WAL to you if that's desirable..

--
Justin
On Thu, Sep 1, 2022 at 12:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Yes, I have a copy that reproduces the issue:

That's good news.

So the last record touching that page was:

> rmgr: Heap2  len (rec/tot): 59/ 59, tx: 0, lsn: 1201/1CAF84B0, prev 1201/1CAF8478, desc: VISIBLE cutoff xid 3678741092 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 53

I think the expected LSN for that page is past the end of that record, so 0x1CAF84B0 + 59 = 0x1CAF84EB, which rounds up to 0x1CAF84F0, and indeed we see that in the restored page when recovery succeeds.

Next question: why do we think the WAL finishes at 1201/1CADB730 while running that checkpoint?  Looking...
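In case it helps, that arithmetic spelled out as a tiny standalone program (the MAXALIGN64 macro here is hand-rolled for illustration, not taken from the tree):

/* Sketch of the LSN arithmetic above: a page touched by a record gets
 * stamped with the LSN just past that record, i.e. the record's start
 * plus its total length, rounded up to the next 8-byte boundary (where
 * the following record begins). */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define MAXALIGN64(x) (((x) + UINT64_C(7)) & ~UINT64_C(7))

int
main(void)
{
    uint64_t start = 0x1CAF84B0;   /* lsn of the last VISIBLE record */
    uint64_t len = 59;             /* rec/tot length from pg_waldump */

    /* prints 1CAF84F0, matching the vm page's LSN after a good recovery */
    printf("%" PRIX64 "\n", MAXALIGN64(start + len));
    return 0;
}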
Some more details, in case they're important:

First: the server has wal_compression=zstd (I wonder if something doesn't allow/accommodate compressed FPI?).  I thought to mention that after compiling pg15 locally and forgetting to use --with-zstd.  I compiled it to enable your debug logging, which wrote these during recovery:

< 2022-08-31 21:17:01.807 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 21:17:01.903 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 21:17:02.029 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small

Also, pg_waldump seems to fail early with -w:

[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -w -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
rmgr: Heap2  len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0

Also, the VM has crashed with OOM before, while running pg15, with no issue in recovery.  I haven't been able to track down the cause..

The VM is running: kernel-3.10.0-1160.66.1.el7.x86_64

pgsql is an ext4 FS (no tablespaces), which is a qemu block device exposed like:
    <driver name='qemu' type='raw' cache='none' io='native'/>
    <target dev='vdg' bus='virtio'/>

It's nowhere near full:
/dev/vdc         96G   51G   46G  53% /var/lib/pgsql
At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
> > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed
> >
> > I was able to start it with -c recovery_prefetch=no, so it seems like
> > prefetch tried to do too much.  The VM runs centos7 under qemu.
> > I'm making a copy of the data dir in case it's needed.

Just for information, there was a recently fixed bug in the overwrite-aborted-contrecord feature, which causes this kind of failure (xlog flush request exceeds the insertion bleeding edge).  If it is that, it was fixed by 6672d79139 two days ago.

http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com
http://postgr.es/m/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
> > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed
> > >
> > > I was able to start it with -c recovery_prefetch=no, so it seems like
> > > prefetch tried to do too much.  The VM runs centos7 under qemu.
> > > I'm making a copy of the data dir in case it's needed.
>
> Just for information, there was a recently fixed bug in the
> overwrite-aborted-contrecord feature, which causes this kind of
> failure (xlog flush request exceeds the insertion bleeding edge).  If it is
> that, it was fixed by 6672d79139 two days ago.

Hmm.  Justin, when you built from source, which commit were you at?  If it's REL_15_BETA3, any chance you could cherry pick that change and check what happens?  And without that, could you show what this logs for good and bad recovery settings?
On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote:
> On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> > > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
> > > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed
> > > >
> > > > I was able to start it with -c recovery_prefetch=no, so it seems like
> > > > prefetch tried to do too much.  The VM runs centos7 under qemu.
> > > > I'm making a copy of the data dir in case it's needed.
> >
> > Just for information, there was a recently fixed bug in the
> > overwrite-aborted-contrecord feature, which causes this kind of
> > failure (xlog flush request exceeds the insertion bleeding edge).  If it is
> > that, it was fixed by 6672d79139 two days ago.
>
> Hmm.  Justin, when you built from source, which commit were you at?
> If it's REL_15_BETA3,

No - it's:
commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE)

> If it's REL_15_BETA3, any chance you could cherry pick that change and
> check what happens?  And without that, could you show what this logs
> for good and bad recovery settings?

I wasn't sure what you meant by "without that", so here's a bunch of logs to sift through:

At a203, with #define XLOGPREFETCHER_DEBUG_LEVEL NOTICE:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678
...
< 2022-08-31 23:31:38.690 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:31:40.204 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 23:31:40.307 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 23:31:40.493 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small
< 2022-08-31 23:31:40.721 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.23 s, elapsed: 2.03 s
< 2022-08-31 23:31:41.452 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:31:41.698 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:31:41.698 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:31:41.699 CDT >FATAL: checkpoint request failed
< 2022-08-31 23:31:41.699 CDT >HINT: Consult recent messages in the server log for details.
< 2022-08-31 23:31:41.704 CDT >LOG: startup process (PID 25046) exited with exit code 1
< 2022-08-31 23:31:41.704 CDT >LOG: terminating any other active server processes
< 2022-08-31 23:31:41.705 CDT >LOG: shutting down due to startup process failure
< 2022-08-31 23:31:41.731 CDT >LOG: database system is shut down

With your patch:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678
...
< 2022-08-31 23:34:22.897 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:34:23.146 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 23:34:23.147 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 23:34:23.268 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small
< 2022-08-31 23:34:23.323 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.29 s, system: 0.12 s, elapsed: 0.42 s
< 2022-08-31 23:34:23.323 CDT >LOG: point 0: lastRec = 12011cadb300
< 2022-08-31 23:34:23.323 CDT >LOG: point 0: endOfLog = 12011cadb730
< 2022-08-31 23:34:23.323 CDT >LOG: XXX point 1: EndOfLog = 12011cadb730
< 2022-08-31 23:34:23.386 CDT >LOG: XXX point 2: EndOfLog = 12011cadb730
< 2022-08-31 23:34:23.387 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:34:23.565 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:34:23.606 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:34:23.767 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 23:34:23.767 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:34:23.767 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 23:34:23.767 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:34:23.768 CDT >FATAL: checkpoint request failed

And without prefetch:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c recovery_prefetch=no
...
< 2022-08-31 23:37:08.792 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:37:09.269 CDT >LOG: invalid record length at 1201/1CD91010: wanted 24, got 0
< 2022-08-31 23:37:09.269 CDT >LOG: redo done at 1201/1CD90E48 system usage: CPU: user: 0.35 s, system: 0.11 s, elapsed: 0.47 s
< 2022-08-31 23:37:09.269 CDT >LOG: point 0: lastRec = 12011cd90e48
< 2022-08-31 23:37:09.269 CDT >LOG: point 0: endOfLog = 12011cd91010
< 2022-08-31 23:37:09.269 CDT >LOG: XXX point 1: EndOfLog = 12011cd91010
< 2022-08-31 23:37:09.350 CDT >LOG: XXX point 2: EndOfLog = 12011cd91010
< 2022-08-31 23:37:09.350 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:37:09.420 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:37:09.552 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:37:12.987 CDT >LOG: checkpoint complete: wrote 8030 buffers (49.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.403 s, sync=2.841 s, total=3.566 s; sync files=102, longest=2.808 s, average=0.028 s; distance=20860 kB, estimate=20860 kB
< 2022-08-31 23:37:13.077 CDT >LOG: database system is ready to accept connections

If I revert 6672d79139 (and roll back to the unrecovered state):

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678   # -c recovery_prefetch=no
...
< 2022-08-31 23:42:40.592 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:42:42.168 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0 is replayed, which truncates the relation
< 2022-08-31 23:42:42.238 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98 is replayed, which truncates the relation
< 2022-08-31 23:42:42.405 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8 is replayed, because the relation is too small
< 2022-08-31 23:42:42.602 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.25 s, elapsed: 2.01 s
< 2022-08-31 23:42:42.602 CDT >LOG: point 0: lastRec = 12011cadb300
< 2022-08-31 23:42:42.602 CDT >LOG: point 0: endOfLog = 12011cadb730
< 2022-08-31 23:42:42.602 CDT >LOG: XXX point 1: EndOfLog = 12011cadb730
< 2022-08-31 23:42:42.830 CDT >LOG: XXX point 2: EndOfLog = 12011cadb730
< 2022-08-31 23:42:42.830 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:42:43.194 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ab82310
< 2022-08-31 23:42:43.266 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:42:43.425 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 23:42:43.425 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:42:43.425 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 23:42:43.425 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 23:42:43.425 CDT >FATAL: checkpoint request failed
< 2022-08-31 23:42:43.425 CDT >HINT: Consult recent messages in the server log for details.
< 2022-08-31 23:42:43.431 CDT >LOG: startup process (PID 2415) exited with exit code 1
< 2022-08-31 23:42:43.431 CDT >LOG: terminating any other active server processes
< 2022-08-31 23:42:43.432 CDT >LOG: shutting down due to startup process failure
< 2022-08-31 23:42:43.451 CDT >LOG: database system is shut down

If I revert 6672d79139 and disable prefetch:

[pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c recovery_prefetch=no
...
< 2022-08-31 23:43:25.711 CDT >LOG: redo starts at 1201/1B931F50
< 2022-08-31 23:43:26.178 CDT >LOG: invalid record length at 1201/1CD91010: wanted 24, got 0
< 2022-08-31 23:43:26.178 CDT >LOG: redo done at 1201/1CD90E48 system usage: CPU: user: 0.33 s, system: 0.11 s, elapsed: 0.46 s
< 2022-08-31 23:43:26.178 CDT >LOG: point 0: lastRec = 12011cd90e48
< 2022-08-31 23:43:26.178 CDT >LOG: point 0: endOfLog = 12011cd91010
< 2022-08-31 23:43:26.178 CDT >LOG: XXX point 1: EndOfLog = 12011cd91010
< 2022-08-31 23:43:26.369 CDT >LOG: XXX point 2: EndOfLog = 12011cd91010
< 2022-08-31 23:43:26.369 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:43:26.433 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ae35b68
< 2022-08-31 23:43:26.490 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 23:43:29.519 CDT >LOG: checkpoint complete: wrote 8030 buffers (49.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.380 s, sync=2.492 s, total=3.086 s; sync files=102, longest=2.438 s, average=0.025 s; distance=20860 kB, estimate=20860 kB
< 2022-08-31 23:43:29.567 CDT >LOG: database system is ready to accept connections
At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
> On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote:
> > On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > Just for information, there was a recently fixed bug in the
> > > overwrite-aborted-contrecord feature, which causes this kind of
> > > failure (xlog flush request exceeds the insertion bleeding edge).  If it is
> > > that, it was fixed by 6672d79139 two days ago.
> >
> > Hmm.  Justin, when you built from source, which commit were you at?
> > If it's REL_15_BETA3,
>
> No - it's:
> commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE)

It's newer than eb29fa3889 (6672d79139 on master), so it already contains that fix.

> > If it's REL_15_BETA3, any chance you could cherry pick that change and
> > check what happens?  And without that, could you show what this logs
> > for good and bad recovery settings?
>
> I wasn't sure what you meant by "without that", so here's a bunch of logs to
> sift through:

So there's no need for cherry-picking..

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Sep 1, 2022 at 5:18 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Wed, 31 Aug 2022 23:47:53 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
> > On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote:
> > > Hmm.  Justin, when you built from source, which commit were you at?
> > > If it's REL_15_BETA3,
> >
> > No - it's:
> > commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE)
>
> It's newer than eb29fa3889 (6672d79139 on master), so it already contains
> that fix.

Yeah.

> > I wasn't sure what you meant by "without that", so here's a bunch of logs to
> > sift through:

So it *looks* like it finished early (and without the expected error?).  But it also looks like it replayed that record, according to the page LSN.  So which is it?  Could you recompile with WAL_DEBUG defined in pg_config_manual.h, and run recovery with wal_debug = on, and see if it replays 1CAF84B0?
On Thu, Sep 01, 2022 at 05:35:23PM +1200, Thomas Munro wrote:
> So it *looks* like it finished early (and without the expected
> error?).  But it also looks like it replayed that record, according to
> the page LSN.  So which is it?  Could you recompile with WAL_DEBUG
> defined in pg_config_manual.h, and run recovery with wal_debug = on,
> and see if it replays 1CAF84B0?

This is with 6672d79139 un-reverted.

$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on 2>&1 |grep 1CAF84B0 || echo not found
not found

$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no -c port=5678 -c wal_debug=on -c recovery_prefetch=no 2>&1 |grep 1CAF84B0 || echo not found
< 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF8478; LSN 1201/1CAF84B0: prev 1201/1CAF8438; xid 0; len 2; blkref #0: rel 1663/16881/2840, blk 53 - Heap2/VACUUM: nunused 4
< 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF84B0; LSN 1201/1CAF84F0: prev 1201/1CAF8478; xid 0; len 5; blkref #0: rel 1663/16881/2840, fork 2, blk 0; blkref #1: rel 1663/16881/2840, blk 53 - Heap2/VISIBLE: cutoff xid 3678741092 flags 0x01
< 2022-09-01 00:44:55.878 CDT >LOG: REDO @ 1201/1CAF84F0; LSN 1201/1CAF8AC0: prev 1201/1CAF84B0; xid 0; len 2; blkref #0: rel 1663/16881/1259, blk 1 FPW, compression method: zstd - Heap/INPLACE: off 33

(Note that "compression method: zstd" is a local change to xlog_block_info() which I just extracted from my original patch for wal_compression, after forgetting to compile --with-zstd.  I'll mail about that at a later time...)

--
Justin
On Thu, Sep 1, 2022 at 5:52 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> compression method: zstd

Ahh, problem repro'd here with WAL compression.  More soon.
On Thu, Sep 1, 2022 at 11:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Ahh, problem repro'd here with WAL compression.  More soon.

I followed some false leads for a while there, but I finally figured out what's happening here after Justin kindly shared some files with me.  The active ingredient here is a setting of maintenance_io_concurrency=0, which runs into a dumb accounting problem of the fencepost variety and incorrectly concludes it's reached the end early.  Setting it to 3 or higher allows his system to complete recovery.  I'm working on a fix ASAP.
On Fri, Sep 2, 2022 at 6:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... The active ingredient here is a setting of
> maintenance_io_concurrency=0, which runs into a dumb accounting problem
> of the fencepost variety and incorrectly concludes it's reached the
> end early.  Setting it to 3 or higher allows his system to complete
> recovery.  I'm working on a fix ASAP.

The short version is that when tracking the number of IOs in progress, I had two steps in the wrong order in the algorithm for figuring out whether IO is saturated.  Internally, the effect of maintenance_io_concurrency is clamped to 2 or more, and that mostly hides the bug until you try to replay a particular sequence like Justin's with such a low setting.  Without that clamp, and if you set it to 1, then several of our recovery tests fail.  That clamp was a bad idea.  What I think we really want is for maintenance_io_concurrency=0 to disable recovery prefetching exactly as if you'd set recovery_prefetch=off, and any other setting, including 1, to work without clamping.

Here's the patch I'm currently testing.  It also fixes a related dangling reference problem with very small maintenance_io_concurrency.

I had this more or less figured out on Friday when I wrote last, but I got stuck on a weird problem with 026_overwrite_contrecord.pl.  I think that failure case should report an error, no?  I find it strange that we end recovery in silence.  That was a problem for the new coding in this patch, because it is confused by XLREAD_FAIL without queuing an error, and then retries, which clobbers the aborted recptr state.  I'm still looking into that.
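To make the ordering problem concrete, here's a deliberately simplified toy model (invented names, not the actual xlogprefetcher.c code) of the kind of fencepost mistake I mean:

/*
 * Toy model of the accounting bug described above (invented names, not
 * PostgreSQL code).  With a tiny io_limit, testing "are we saturated?"
 * before retiring the IO that just completed wrongly reports saturation,
 * which is the sort of off-by-one ordering mistake that made the
 * prefetcher give up early.
 */
#include <stdbool.h>
#include <stdio.h>

static int ios_in_progress = 1;     /* one IO has just completed */

static bool
can_start_io_buggy(int io_limit)
{
    /* test first, account for the completed IO too late */
    bool saturated = (ios_in_progress >= io_limit);

    ios_in_progress--;
    return !saturated;
}

static bool
can_start_io_fixed(int io_limit)
{
    /* retire the completed IO first, then test */
    ios_in_progress--;
    return ios_in_progress < io_limit;
}

int
main(void)
{
    printf("buggy: %d\n", can_start_io_buggy(1));   /* 0: looks saturated */
    ios_in_progress = 1;
    printf("fixed: %d\n", can_start_io_fixed(1));   /* 1: can issue more IO */
    return 0;
}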
On Mon, Sep 5, 2022 at 1:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I had this more or less figured out on Friday when I wrote last, but I
> got stuck on a weird problem with 026_overwrite_contrecord.pl.  I
> think that failure case should report an error, no?  I find it strange
> that we end recovery in silence.  That was a problem for the new
> coding in this patch, because it is confused by XLREAD_FAIL without
> queuing an error, and then retries, which clobbers the aborted recptr
> state.  I'm still looking into that.

On reflection, it'd be better not to clobber any pre-existing error there, but report one only if there isn't one already queued.  I've done that in this version, which I'm planning to do a bit more testing on and commit soonish if there are no comments/objections, especially for that part.

I'll have to check whether a doc change is necessary somewhere to advertise that maintenance_io_concurrency=0 turns off prefetching, but IIRC that's kinda already implied.

I've tested quite a lot of scenarios including make check-world with maintenance_io_concurrency = 0, 1, 10, 1000, and ALTER SYSTEM for all relevant GUCs on a standby running large pgbench, to check the expected effect on the pg_stat_recovery_prefetch view and generated system calls.
At Mon, 5 Sep 2022 13:28:12 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> I had this more or less figured out on Friday when I wrote last, but I
> got stuck on a weird problem with 026_overwrite_contrecord.pl.  I
> think that failure case should report an error, no?  I find it strange
> that we end recovery in silence.  That was a problem for the new
> coding in this patch, because it is confused by XLREAD_FAIL without
> queuing an error, and then retries, which clobbers the aborted recptr
> state.  I'm still looking into that.

+1 for showing some message for the failure, but I think we shouldn't hide an existing message if there is one.  And the error messages around there just say "<some error happened> at RecPtr", so I think "missing contrecord at RecPtr" is sufficient here.

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cdcacc7803..bfe332c014 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -907,6 +907,11 @@ err:
 		 */
 		state->abortedRecPtr = RecPtr;
 		state->missingContrecPtr = targetPagePtr;
+
+		/* Put a generic error message if no particular cause is recorded. */
+		if (!state->errormsg_buf[0])
+			report_invalid_record(state, "missing contrecord at %X/%X",
+								  LSN_FORMAT_ARGS(RecPtr));
 	}
 
 	if (decoded && decoded->oversized)

--
Kyotaro Horiguchi
NTT Open Source Software Center
(the previous mail crossed with yours..)

At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
me> +1 for showing some message for the failure, but I think we shouldn't
me> hide an existing message if there is one.

At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> On reflection, it'd be better not to clobber any pre-existing error
> there, but report one only if there isn't one already queued.  I've
> done that in this version, which I'm planning to do a bit more testing
> on and commit soonish if there are no comments/objections, especially
> for that part.

It looks fine in this regard.  I still think that the message looks somewhat internal.

me> And the error messages around there just say
me> "<some error happened> at RecPtr", so I think
me> "missing contrecord at RecPtr" is sufficient here.

> I'll have to check whether a doc change is necessary somewhere to
> advertise that maintenance_io_concurrency=0 turns off prefetching, but
> IIRC that's kinda already implied.
>
> I've tested quite a lot of scenarios including make check-world with
> maintenance_io_concurrency = 0, 1, 10, 1000, and ALTER SYSTEM for all
> relevant GUCs on a standby running large pgbench, to check the expected
> effect on the pg_stat_recovery_prefetch view and generated system calls.

+	if (likely(record = prefetcher->reader->record))

Isn't this a bit confusing?

+	if (likely(record = prefetcher->reader->record))
+	{
+		XLogRecPtr	replayed_up_to = record->next_lsn;
+
+		XLogReleasePreviousRecord(prefetcher->reader);
+

The likely() condition is the prerequisite for XLogReleasePreviousRecord, but it is a little hard to read the condition as "in case no previous record exists".  Since there is one in most cases, can't we call XLogReleasePreviousRecord() unconditionally, and have the function return true when the previous record exists and false if not?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Sep 5, 2022 at 5:34 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> me> +1 for showing some message for the failure, but I think we shouldn't
> me> hide an existing message if there is one.
>
> At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> > On reflection, it'd be better not to clobber any pre-existing error
> > there, but report one only if there isn't one already queued.  I've
> > done that in this version, which I'm planning to do a bit more testing
> > on and commit soonish if there are no comments/objections, especially
> > for that part.
>
> It looks fine in this regard.  I still think that the message looks
> somewhat internal.

Thanks for looking!

> me> And the error messages around there just say
> me> "<some error happened> at RecPtr", so I think
> me> "missing contrecord at RecPtr" is sufficient here.

Ok, I've updated it like that.

> > I'll have to check whether a doc change is necessary somewhere to
> > advertise that maintenance_io_concurrency=0 turns off prefetching, but
> > IIRC that's kinda already implied.
> >
> > I've tested quite a lot of scenarios including make check-world with
> > maintenance_io_concurrency = 0, 1, 10, 1000, and ALTER SYSTEM for all
> > relevant GUCs on a standby running large pgbench, to check the expected
> > effect on the pg_stat_recovery_prefetch view and generated system calls.
>
> +	if (likely(record = prefetcher->reader->record))
>
> Isn't this a bit confusing?
>
> +	if (likely(record = prefetcher->reader->record))
> +	{
> +		XLogRecPtr	replayed_up_to = record->next_lsn;
> +
> +		XLogReleasePreviousRecord(prefetcher->reader);
> +
>
> The likely() condition is the prerequisite for
> XLogReleasePreviousRecord, but it is a little hard to read the
> condition as "in case no previous record exists".  Since there is one
> in most cases, can't we call XLogReleasePreviousRecord() unconditionally,
> and have the function return true when the previous record exists and
> false if not?

We also need the LSN that is past that record.  XLogReleasePreviousRecord() could return it (or we could use reader->EndRecPtr, I suppose).  Thoughts on this version?
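For illustration, a self-contained toy sketch of that alternative (the types and names here are simplified stand-ins, not the real xlogreader.h definitions):

/*
 * Toy sketch of the idea above: have the "release previous record" step
 * return the LSN just past the record it releases, so callers learn how
 * far replay has progressed without testing reader->record themselves.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct ToyDecodedRecord
{
    XLogRecPtr lsn;        /* where the record starts */
    XLogRecPtr next_lsn;   /* where the next record starts */
} ToyDecodedRecord;

typedef struct ToyReaderState
{
    ToyDecodedRecord *record;   /* most recently returned record, or NULL */
} ToyReaderState;

/* Returns the LSN past the released record, or InvalidXLogRecPtr if none. */
static XLogRecPtr
toy_release_previous_record(ToyReaderState *state)
{
    XLogRecPtr past_end = InvalidXLogRecPtr;

    if (state->record != NULL)
    {
        past_end = state->record->next_lsn;
        state->record = NULL;   /* the real code also reclaims decode buffer space */
    }
    return past_end;
}

int
main(void)
{
    ToyDecodedRecord rec = {0x1CAF84B0, 0x1CAF84F0};
    ToyReaderState reader = {&rec};

    /* prints 1CAF84F0, then 0 because nothing is left to release */
    printf("%llX\n", (unsigned long long) toy_release_previous_record(&reader));
    printf("%llX\n", (unsigned long long) toy_release_previous_record(&reader));
    return 0;
}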
On Mon, Sep 5, 2022 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> > > On reflection, it'd be better not to clobber any pre-existing error
> > > there, but report one only if there isn't one already queued.  I've
> > > done that in this version, which I'm planning to do a bit more testing
> > > on and commit soonish if there are no comments/objections, especially
> > > for that part.

Well I was about to commit this, but beta4 just got stamped (but not yet tagged).  I see now that Jonathan (with RMT hat on, CC'd) meant commits should be in by the *start* of the 5th AoE, not the end.  So the procedural/RMT question is whether it's still possible to close this item in beta4.
At Mon, 5 Sep 2022 21:08:16 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
> We also need the LSN that is past that record.
> XLogReleasePreviousRecord() could return it (or we could use
> reader->EndRecPtr, I suppose).  Thoughts on this version?

(Catching up with the thread...)

It is easier to read.  Thanks!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 9/5/22 7:18 PM, Thomas Munro wrote:
> On Mon, Sep 5, 2022 at 9:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>> At Mon, 05 Sep 2022 14:15:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>>> At Mon, 5 Sep 2022 16:54:07 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
>>>> On reflection, it'd be better not to clobber any pre-existing error
>>>> there, but report one only if there isn't one already queued.  I've
>>>> done that in this version, which I'm planning to do a bit more testing
>>>> on and commit soonish if there are no comments/objections, especially
>>>> for that part.
>
> Well I was about to commit this, but beta4 just got stamped (but not
> yet tagged).  I see now that Jonathan (with RMT hat on, CC'd) meant
> commits should be in by the *start* of the 5th AoE, not the end.  So
> the procedural/RMT question is whether it's still possible to close
> this item in beta4.

Presumably because Tom stamped it, the release is wrapped, so it wouldn't make Beta 4, but I defer to him to see if it can be included with the tag.

That said, if it doesn't make it for Beta 4, it would be in the next release (which is hopefully RC1).

Thanks,

Jonathan
"Jonathan S. Katz" <jkatz@postgresql.org> writes: > On 9/5/22 7:18 PM, Thomas Munro wrote: >> Well I was about to commit this, but beta4 just got stamped (but not >> yet tagged). I see now that Jonathan (with RMT hat on, CC'd) meant >> commits should be in by the *start* of the 5th AoE, not the end. So >> the procedural/RMT question is whether it's still possible to close >> this item in beta4. > Presumably because Tom stamped it, the released is wrapped so it > wouldn't make Beta 4, but I defer to him to see if it can be included > with the tag. I already made the tarballs available to packagers, so adding this would involve a re-wrap and great confusion. In any case, I'm not a fan of pushing fixes within a day or two of the wrap deadline, let alone after it; you get inadequate buildfarm coverage when you cut corners that way. I think this one missed the boat. regards, tom lane
On Tue, Sep 6, 2022 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
> > On 9/5/22 7:18 PM, Thomas Munro wrote:
> >> Well I was about to commit this, but beta4 just got stamped (but not
> >> yet tagged).  I see now that Jonathan (with RMT hat on, CC'd) meant
> >> commits should be in by the *start* of the 5th AoE, not the end.  So
> >> the procedural/RMT question is whether it's still possible to close
> >> this item in beta4.
>
> > Presumably because Tom stamped it, the release is wrapped, so it
> > wouldn't make Beta 4, but I defer to him to see if it can be included
> > with the tag.
>
> I already made the tarballs available to packagers, so adding this
> would involve a re-wrap and great confusion.  In any case, I'm not
> a fan of pushing fixes within a day or two of the wrap deadline,
> let alone after it; you get inadequate buildfarm coverage when you
> cut corners that way.  I think this one missed the boat.

Got it.  Yeah, I knew it was going to be a close thing with a problem diagnosed on Thursday/Friday before a Monday wrap, even before I managed to confuse myself about dates and times.  Thanks both.
On 9/5/22 10:03 PM, Thomas Munro wrote:
> On Tue, Sep 6, 2022 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Jonathan S. Katz" <jkatz@postgresql.org> writes:
>>> On 9/5/22 7:18 PM, Thomas Munro wrote:
>>>> Well I was about to commit this, but beta4 just got stamped (but not
>>>> yet tagged).  I see now that Jonathan (with RMT hat on, CC'd) meant
>>>> commits should be in by the *start* of the 5th AoE, not the end.  So
>>>> the procedural/RMT question is whether it's still possible to close
>>>> this item in beta4.
>>
>>> Presumably because Tom stamped it, the release is wrapped, so it
>>> wouldn't make Beta 4, but I defer to him to see if it can be included
>>> with the tag.
>>
>> I already made the tarballs available to packagers, so adding this
>> would involve a re-wrap and great confusion.  In any case, I'm not
>> a fan of pushing fixes within a day or two of the wrap deadline,
>> let alone after it; you get inadequate buildfarm coverage when you
>> cut corners that way.  I think this one missed the boat.
>
> Got it.  Yeah, I knew it was going to be a close thing with a problem
> diagnosed on Thursday/Friday before a Monday wrap, even before I
> managed to confuse myself about dates and times.  Thanks both.

To close this loop, I added a section for "fixed before RC1" to Open Items, since this is presumably the next release.  We can include it there once committed.

Jonathan
On Wed, Sep 7, 2022 at 1:56 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> To close this loop, I added a section for "fixed before RC1" to Open
> Items, since this is presumably the next release.  We can include it there
> once committed.

Done yesterday.

To tie up a couple of loose ends from this thread:

On Thu, Sep 1, 2022 at 2:48 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Also, pg_waldump seems to fail early with -w:
> [pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -w -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
> rmgr: Heap2  len (rec/tot): 64/ 122, tx: 0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
> pg_waldump: error: error in WAL record at 1201/1CD90E48: invalid record length at 1201/1CD91010: wanted 24, got 0

That looks OK to me.  With or without -w, we get as far as 1201/1CD91010 and then hit zeroes.

On Thu, Sep 1, 2022 at 5:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> So it *looks* like it finished early (and without the expected
> error?).  But it also looks like it replayed that record, according to
> the page LSN.  So which is it?

The reason 1201/1CAF84B0 appeared on a page despite not having been replayed (due to the bug) is just that vismap pages don't follow the usual logging rules, and can be read in by heap records that don't mention the vm page (and therefore carry no FPW for it).  So we can end up reading a page from disk with a future LSN on it.