Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman - Mailing list pgsql-hackers

From Andres Freund
Subject Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Date
Msg-id 20220409005910.alw46xqmmgny2sgr@alap3.anarazel.de
Whole thread Raw
In response to Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> I tried adjusting the patch so it does guarantee that (as attached),
> and in two out of two tries it got past the archive_cleanup_command
> failure but then hung up waiting for standby2 to promote.

Adding

$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
just after
$node_standby2->start;

makes the tests pass here.


What is that second test really testing?

# Check the presence of temporary files specifically generated during
# archive recovery.  To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive.  As this requires at least two timeline
# switches, promote the existing standby first.  Then create a second
# standby based on the promoted one.  Finally, the second standby is
# promoted.

Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:

$node_standby2->init_from_backup($node_primary, $backup_name,
    has_restoring => 1);

It's created from the original primary, not the first standby, as the
description says...

Both nodes get promoted independently, in a run without valgrind:

standby:
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  INSERT @ 0/4000058:  - XLOG/END_OF_RECOVERY: tli 2; prev
tli1; time 2022-04-08 17:23:42.96686-07
 
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  xlog flush request 0/4000058; write 0/4000000; flush
0/4000000

standby2:
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  INSERT @ 0/4000058:  - XLOG/END_OF_RECOVERY: tli 3; prev
tli1; time 2022-04-08 17:23:43.307443->
 
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  xlog flush request 0/4000058; write 0/4000000; flush
0/4000000

except that standby2 can't choose tli 2 because it finds it used.

Sure looks like something is funky with that test.


But I think there's also something funky in the prefetching logic. I think it
may attempt restoring during prefetching somehow, even though there's code
that appears to try to prevent that?

on standby2 I can see replay progress like the following:
2022-04-08 17:02:12.310 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024488; LSN 0/30244C8: prev 0/3024448; xid 725;
len3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 60 flags 0x00
 
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/3024488 for Heap/INSERT: off 60 flags
0x00;blkref #0: rel 1663/5/16384, blk 4
 
2022-04-08 17:02:12.312 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp
"/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004"
"pg_wal/RECOVERYXLOG""
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not restore file "000000010000000000000004" from
archive:child process exited with exit code 1
 
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not open file "pg_wal/000000010000000000000004":
Nosuch file or directory
 
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/30244C8; LSN 0/3024508: prev 0/3024488; xid 725;
len3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 61 flags 0x00
 
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/30244C8 for Heap/INSERT: off 61 flags
0x00;blkref #0: rel 1663/5/16384, blk 4
 
2022-04-08 17:02:13.857 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp
"/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004"
"pg_wal/RECOVERYXLOG""
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not restore file "000000010000000000000004" from
archive:child process exited with exit code 1
 
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not open file "pg_wal/000000010000000000000004":
Nosuch file or directory
 
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024508; LSN 0/3024548: prev 0/30244C8; xid 725;
len3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 62 flags 0x00
 
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/3024508 for Heap/INSERT: off 62 flags
0x00;blkref #0: rel 1663/5/16384, blk 4
 
2022-04-08 17:02:15.415 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp
"/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004"
"pg_wal/RECOVERYXLOG""

note that we appear to wait between replaying of records, even though we
apparently have WAL for the next record!

And interestingly I'm not seeing the
"switched WAL source from stream to archive after failure"
lines I'd expect.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: Mingw task for Cirrus CI
Next
From: Robert Haas
Date:
Subject: Re: avoid multiple hard links to same WAL file after a crash