pg ignores wal files in pg_wal, and instead tries to load them from archive/primary - Mailing list pgsql-bugs

From hubert depesz lubaczewski
Subject pg ignores wal files in pg_wal, and instead tries to load them from archive/primary
Msg-id YzW+5v/VwbguW+XU@depesz.com
List pgsql-bugs
Hi,
we have the following situation:

1. primary on 14.5 that is *not* archiving (this is a temporary situation
   related to an ongoing upgrade process from pg 12) - all on ubuntu focal.
2. on new replica we run (via wrapper, but this doesn't seem to be
   related):
   pg_basebackup -D /var/lib/postgresql/14/main -c fast -v -P -U some-user -h sourcedb.hostname
3. after it is done, if the datadir was large enough, pg on replica
   doesn't replicate/catch up. From the logs:
2022-09-29 14:59:26.587 UTC,,,2355588,,6335b2ce.23f184,1,,2022-09-29 14:59:26 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-29 14:59:26.587 UTC,,,2355588,,6335b2ce.23f184,2,,2022-09-29 14:59:26 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0
4. if there is restore_command configured, it tries to read data from archive
   too, but the archive is non-existent.
5. the "missing" file is there, in pg_wal (I would assume that
   pg_basebackup copied it there):
   root@host# /bin/ls -c1 0* | wc -l
   1068
   root@host# /bin/ls -c1 0* | sort -V | head -n 1
   00000001000007E4000000A0
   root@host# /bin/ls -c1 0* | sort -V | tail -n 1
   00000001000007E800000092
   root@host# /bin/ls -c1 0* | sort -V | grep -n 00000001000007E800000067
   1043:00000001000007E800000067
   root@host# /bin/ls -c1 0* | sort -V | grep -n -C5 00000001000007E800000067
   1038-00000001000007E800000062
   1039-00000001000007E800000063
   1040-00000001000007E800000064
   1041-00000001000007E800000065
   1042-00000001000007E800000066
   1043:00000001000007E800000067
   1044-00000001000007E800000068
   1045-00000001000007E800000069
   1046-00000001000007E800000070
   1047-00000001000007E800000071
   1048-00000001000007E800000072
6. What's more - I straced startup process, and it does:
   a. opens the wal file (the problematic one)
   b. reads 8k from it
   c. closes it
   d. checks existence of finish.recovery trigger file (it doesn't exist)
   e. starts restore program (which fails).
   f. rinse and repeat
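
For reference, the segment name in the FATAL message can be derived from the LSN in the preceding LOG line, which is how the two log entries and the file found in pg_wal can be matched up. The following is a minimal sketch, assuming the default 16MB wal_segment_size (the report does not state the segment size); the function name wal_segment_name is hypothetical and not part of PostgreSQL:

```shell
# Hypothetical helper: map a timeline and an LSN such as 7E8/67000000
# to a WAL segment file name.
# Assumption: 16MB segments unless a size in bytes is given as $3.
wal_segment_name() {
    local timeline=$1 lsn=$2 seg_size=${3:-16777216}
    local hi=$((0x${lsn%%/*}))   # part before the slash: upper 32 bits
    local lo=$((0x${lsn##*/}))   # part after the slash: lower 32 bits
    local seg_no=$(( (hi << 32 | lo) / seg_size ))
    local per_id=$(( 0x100000000 / seg_size ))   # segments per 4GB of WAL
    # File name = timeline, then the segment number split into two
    # 8-digit hex fields.
    printf '%08X%08X%08X\n' "$timeline" \
        "$((seg_no / per_id))" "$((seg_no % per_id))"
}

wal_segment_name 1 7E8/67000000
```

Under that assumption, LSN 7E8/67000000 on timeline 1 falls in segment 00000001000007E800000067, the same file that shows up in the pg_wal listing above.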

What am I missing? What is wrong? How can I fix it? The problem is not fixing
*this server*, because we are in the process of upgrading LOTS and LOTS of servers,
and I need to know what is broken/how to work around it.


Currently our go-to fix is:
1. increase wal_keep_size to ~ 200GB
2. stand up replica
3. once it catches up decrease wal_keep_size to standard (for us) 16GB

but it is not really a nice solution.
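
Spelled out as commands, the workaround steps might look like the sketch below. It assumes the setting is changed with ALTER SYSTEM over a superuser psql connection on the primary (connection options omitted); the report does not say how the configuration is actually managed, so treat the mechanism as an assumption. wal_keep_size can be changed with a reload, so no restart is needed.

```shell
# Assumption: run on the primary as a superuser, before pg_basebackup starts.
psql -c "ALTER SYSTEM SET wal_keep_size = '200GB'"
psql -c "SELECT pg_reload_conf()"

# ... build the replica and wait until it has caught up ...

# Back to the usual value once the replica is streaming.
psql -c "ALTER SYSTEM SET wal_keep_size = '16GB'"
psql -c "SELECT pg_reload_conf()"
```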

Best regards,

depesz



