Re: pg15b3: recovery fails with wal prefetch enabled - Mailing list pgsql-hackers
From | Justin Pryzby |
---|---|
Subject | Re: pg15b3: recovery fails with wal prefetch enabled |
Date | |
Msg-id | 20220901044753.GG31833@telsasoft.com Whole thread Raw |
In response to | Re: pg15b3: recovery fails with wal prefetch enabled (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: pg15b3: recovery fails with wal prefetch enabled
|
List | pgsql-hackers |
On Thu, Sep 01, 2022 at 04:22:20PM +1200, Thomas Munro wrote: > On Thu, Sep 1, 2022 at 3:08 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > At Thu, 1 Sep 2022 12:05:36 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in > > > On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait > > > > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, currentposition 1201/1CADB730 > > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm > > > > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730 > > > > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm > > > > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > > > > > > > I was able to start it with -c recovery_prefetch=no, so it seems like > > > > prefetch tried to do too much. The VM runs centos7 under qemu. > > > > I'm making a copy of the data dir in cases it's needed. > > > > Just for information, there was a fixed bug about > > overwrite-aborted-contrecord feature, which causes this kind of > > failure (xlog flush request exceeds insertion bleeding edge). If it is > > that, it has been fixed by 6672d79139 two-days ago. > > Hmm. Justin, when you built from source, which commit were you at? > If it's REL_15_BETA3, No - it's: commit a2039b1f8e90d26a7e2a115ad5784476bd6deaa2 (HEAD -> REL_15_STABLE, origin/REL_15_STABLE) > If it's REL_15_BETA3, any chance you could cherry pick that change and > check what happens? And without that, could you show what this logs > And without that, could you show what this logs > for good and bad recovery settings? I wasn't sure what mean by "without that" , so here's a bunch of logs to sift through: At a203, with #define XLOGPREFETCHER_DEBUG_LEVEL NOTICE: [pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no-c port=5678 ... < 2022-08-31 23:31:38.690 CDT >LOG: redo starts at 1201/1B931F50 < 2022-08-31 23:31:40.204 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0is replayed, which truncates the relation < 2022-08-31 23:31:40.307 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98is replayed, which truncates the relation < 2022-08-31 23:31:40.493 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8is replayed, because the relation is too small < 2022-08-31 23:31:40.721 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.23 s, elapsed:2.03 s < 2022-08-31 23:31:41.452 CDT >LOG: checkpoint starting: end-of-recovery immediate wait < 2022-08-31 23:31:41.698 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position1201/1CADB730 < 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:31:41.698 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730 < 2022-08-31 23:31:41.698 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:31:41.699 CDT >FATAL: checkpoint request failed < 2022-08-31 23:31:41.699 CDT >HINT: Consult recent messages in the server log for details. < 2022-08-31 23:31:41.704 CDT >LOG: startup process (PID 25046) exited with exit code 1 < 2022-08-31 23:31:41.704 CDT >LOG: terminating any other active server processes < 2022-08-31 23:31:41.705 CDT >LOG: shutting down due to startup process failure < 2022-08-31 23:31:41.731 CDT >LOG: database system is shut down With your patch: [pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no-c port=5678 ... < 2022-08-31 23:34:22.897 CDT >LOG: redo starts at 1201/1B931F50 < 2022-08-31 23:34:23.146 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0is replayed, which truncates the relation < 2022-08-31 23:34:23.147 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98is replayed, which truncates the relation < 2022-08-31 23:34:23.268 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8is replayed, because the relation is too small < 2022-08-31 23:34:23.323 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.29 s, system: 0.12 s, elapsed:0.42 s < 2022-08-31 23:34:23.323 CDT >LOG: point 0: lastRec = 12011cadb300 < 2022-08-31 23:34:23.323 CDT >LOG: point 0: endOfLog = 12011cadb730 < 2022-08-31 23:34:23.323 CDT >LOG: XXX point 1: EndOfLog = 12011cadb730 < 2022-08-31 23:34:23.386 CDT >LOG: XXX point 2: EndOfLog = 12011cadb730 < 2022-08-31 23:34:23.387 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ab82310 < 2022-08-31 23:34:23.565 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ab82310 < 2022-08-31 23:34:23.606 CDT >LOG: checkpoint starting: end-of-recovery immediate wait < 2022-08-31 23:34:23.767 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position1201/1CADB730 < 2022-08-31 23:34:23.767 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:34:23.767 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730 < 2022-08-31 23:34:23.767 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:34:23.768 CDT >FATAL: checkpoint request failed And without prefetch: [pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no-c port=5678 -c recovery_prefetch=no ... < 2022-08-31 23:37:08.792 CDT >LOG: redo starts at 1201/1B931F50 < 2022-08-31 23:37:09.269 CDT >LOG: invalid record length at 1201/1CD91010: wanted 24, got 0 < 2022-08-31 23:37:09.269 CDT >LOG: redo done at 1201/1CD90E48 system usage: CPU: user: 0.35 s, system: 0.11 s, elapsed:0.47 s < 2022-08-31 23:37:09.269 CDT >LOG: point 0: lastRec = 12011cd90e48 < 2022-08-31 23:37:09.269 CDT >LOG: point 0: endOfLog = 12011cd91010 < 2022-08-31 23:37:09.269 CDT >LOG: XXX point 1: EndOfLog = 12011cd91010 < 2022-08-31 23:37:09.350 CDT >LOG: XXX point 2: EndOfLog = 12011cd91010 < 2022-08-31 23:37:09.350 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ae35b68 < 2022-08-31 23:37:09.420 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ae35b68 < 2022-08-31 23:37:09.552 CDT >LOG: checkpoint starting: end-of-recovery immediate wait < 2022-08-31 23:37:12.987 CDT >LOG: checkpoint complete: wrote 8030 buffers (49.0%); 0 WAL file(s) added, 0 removed, 1recycled; write=0.403 s, sync=2.841 s, total=3.566 s; sync files=102, longest=2.808 s, average=0.028 s; distance=20860 kB,estimate=20860 kB < 2022-08-31 23:37:13.077 CDT >LOG: database system is ready to accept connections If I revert 6672d79139 (and roll back to the unrecovered state): [pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no-c port=5678 # -c recovery_prefetch=no ... < 2022-08-31 23:42:40.592 CDT >LOG: redo starts at 1201/1B931F50 < 2022-08-31 23:42:42.168 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958212 from block 156 until 1201/1C3965A0is replayed, which truncates the relation < 2022-08-31 23:42:42.238 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C39CC98is replayed, which truncates the relation < 2022-08-31 23:42:42.405 CDT >NOTICE: suppressing prefetch in relation 1663/16888/165958523 from block 23 until 1201/1C8643C8is replayed, because the relation is too small < 2022-08-31 23:42:42.602 CDT >LOG: redo done at 1201/1CADB300 system usage: CPU: user: 0.41 s, system: 0.25 s, elapsed:2.01 s < 2022-08-31 23:42:42.602 CDT >LOG: point 0: lastRec = 12011cadb300 < 2022-08-31 23:42:42.602 CDT >LOG: point 0: endOfLog = 12011cadb730 < 2022-08-31 23:42:42.602 CDT >LOG: XXX point 1: EndOfLog = 12011cadb730 < 2022-08-31 23:42:42.830 CDT >LOG: XXX point 2: EndOfLog = 12011cadb730 < 2022-08-31 23:42:42.830 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ab82310 < 2022-08-31 23:42:43.194 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ab82310 < 2022-08-31 23:42:43.266 CDT >LOG: checkpoint starting: end-of-recovery immediate wait < 2022-08-31 23:42:43.425 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position1201/1CADB730 < 2022-08-31 23:42:43.425 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:42:43.425 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730 < 2022-08-31 23:42:43.425 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm < 2022-08-31 23:42:43.425 CDT >FATAL: checkpoint request failed < 2022-08-31 23:42:43.425 CDT >HINT: Consult recent messages in the server log for details. < 2022-08-31 23:42:43.431 CDT >LOG: startup process (PID 2415) exited with exit code 1 < 2022-08-31 23:42:43.431 CDT >LOG: terminating any other active server processes < 2022-08-31 23:42:43.432 CDT >LOG: shutting down due to startup process failure < 2022-08-31 23:42:43.451 CDT >LOG: database system is shut down If I revert 6672d79139 and disable prefetch: [pryzbyj@template0 postgresql]$ sudo -u postgres ./tmp_install/usr/local/pgsql/bin/postgres -D /mnt/tmp/15/data -c logging_collector=no-c port=5678 -c recovery_prefetch=no ... < 2022-08-31 23:43:25.711 CDT >LOG: redo starts at 1201/1B931F50 < 2022-08-31 23:43:26.178 CDT >LOG: invalid record length at 1201/1CD91010: wanted 24, got 0 < 2022-08-31 23:43:26.178 CDT >LOG: redo done at 1201/1CD90E48 system usage: CPU: user: 0.33 s, system: 0.11 s, elapsed:0.46 s < 2022-08-31 23:43:26.178 CDT >LOG: point 0: lastRec = 12011cd90e48 < 2022-08-31 23:43:26.178 CDT >LOG: point 0: endOfLog = 12011cd91010 < 2022-08-31 23:43:26.178 CDT >LOG: XXX point 1: EndOfLog = 12011cd91010 < 2022-08-31 23:43:26.369 CDT >LOG: XXX point 2: EndOfLog = 12011cd91010 < 2022-08-31 23:43:26.369 CDT >LOG: XXX point 3: Insert->CurrBytePos = 11f39ae35b68 < 2022-08-31 23:43:26.433 CDT >LOG: XXX point 4: Insert->CurrBytePos = 11f39ae35b68 < 2022-08-31 23:43:26.490 CDT >LOG: checkpoint starting: end-of-recovery immediate wait < 2022-08-31 23:43:29.519 CDT >LOG: checkpoint complete: wrote 8030 buffers (49.0%); 0 WAL file(s) added, 0 removed, 1recycled; write=0.380 s, sync=2.492 s, total=3.086 s; sync files=102, longest=2.438 s, average=0.025 s; distance=20860 kB,estimate=20860 kB < 2022-08-31 23:43:29.567 CDT >LOG: database system is ready to accept connections
pgsql-hackers by date: