BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      18575
Logged by:          Georgy Silkly
Email address:      g.shelkovy@arenadata.io
PostgreSQL version: 16.3
Operating system:   docker alpine
Description:

Sometimes pg_rewind mistakenly assumes that nothing needs to be done, which results in the replica having data that is not on the master.

Steps to reproduce

script

```bash
#!/bin/sh -eux
echo "Goto script's directory"
cd "$(dirname "$0")"
echo "Stop master if exists"
test -f master/postmaster.pid && pg_ctl -w -D master stop -m fast || echo $?
echo "Remove master directory"
rm -rf master
echo "Create master directory"
mkdir -p master
echo "Initialize master, data checksums are mandatory"
pg_ctl initdb -D master -o "--data-checksums -N -A trust --wal-segsize 1"
echo "Debug to see recycling WAL, keep only 5 WAL files (because segsize 1MB)"
cat >>master/postgresql.conf <<EOF
hot_standby = on
logging_collector = on
log_min_messages = debug
wal_keep_size = 5MB
EOF
echo "Accept replication connections on master"
cat >> master/pg_hba.conf <<EOF
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
EOF
echo "Start master"
pg_ctl -w -D master start -o "-p 5000"
echo "Stop standby1 if exists"
test -f standby1/postmaster.pid && pg_ctl -w -D standby1 stop -m fast || echo $?
echo "Remove standby1 directory"
rm -rf standby1
echo "Base backup is taken with xlog files included"
pg_basebackup -D standby1 -p 5000 -X fetch --verbose --write-recovery-conf
echo "Start standby1"
pg_ctl -w -D standby1 start -o "-p 5001"
echo "Stop standby2 if exists"
test -f standby2/postmaster.pid && pg_ctl -w -D standby2 stop -m fast || echo $?
echo "Remove standby2 directory"
rm -rf standby2
echo "Base backup is taken with xlog files included"
pg_basebackup -D standby2 -p 5000 -X fetch --verbose --write-recovery-conf
echo "Start standby2"
pg_ctl -w -D standby2 start -o "-p 5002"
echo "Create table and insert lot of rows"
psql -d postgres -a --no-psqlrc -p 5000 <<EOF
create table a (a int, b int);
insert into a select i, i from generate_series(1, 100000) i;
EOF
echo "Stop master to recycle WAL files"
pg_ctl -w -D master stop -m fast
echo "Stop standby1 to recycle WAL files"
pg_ctl -w -D standby1 stop -m fast
echo "Stop standby2 to recycle WAL files"
pg_ctl -w -D standby2 stop -m fast
echo "Start master"
pg_ctl -w -D master start -o "-p 5000"
echo "Start standby1"
pg_ctl -w -D standby1 start -o "-p 5001"
echo "Start standby2"
pg_ctl -w -D standby2 start -o "-p 5002"
echo "Now promote standby1"
pg_ctl -w -D standby1 promote
echo "Wait until standby1 is promoted"
while ! pg_isready -p 5001; do sleep 1; done
echo "Insert some rows"
psql -d postgres -a --no-psqlrc -p 5000 <<EOF
insert into a select i, i from generate_series(1, 340) i;
EOF
echo "Standby2 node need to be stopped as well."
pg_ctl -w -D standby2 stop -m fast
echo "Stop the master after standby promotion"
pg_ctl -w -D master stop -m fast
echo "Standby1 node need to be stopped as well."
pg_ctl -w -D standby1 stop -m fast
echo "Do rewind using standby1 pgdata as source"
pg_rewind --progress --debug --source-pgdata=standby1 --target-pgdata=standby2
echo "Parameters for standby2 postgresql.conf"
cat >> standby2/postgresql.conf <<EOF
primary_conninfo = 'port=5001'
EOF
echo "Place standby2 signal file"
touch standby2/standby.signal
echo "Start standby1"
pg_ctl -w -D standby1 start -o "-p 5001"
echo "Start standby2"
pg_ctl -w -D standby2 start -o "-p 5002"
echo "Same query"
psql -d postgres -a --no-psqlrc -p 5001 <<EOF
select count(*) from a;
EOF
echo "Different results"
psql -d postgres -a --no-psqlrc -p 5002 <<EOF
select count(*) from a;
EOF
echo "Stop standby1"
pg_ctl -w -D standby1 stop -m fast
echo "Stop standby2"
pg_ctl -w -D standby2 stop -m fast
```

output

```bash
+ echo 'Goto script'"'"'s directory'
Goto script's directory
+ dirname /var/lib/postgresql/r.sh
+ cd /var/lib/postgresql
+ echo 'Stop master if exists'
Stop master if exists
+ test -f master/postmaster.pid
+ echo 1
1
+ echo 'Remove master directory'
Remove master directory
+ rm -rf master
+ echo 'Create master directory'
Create master directory
+ mkdir -p master
+ echo 'Initialize master, data checksums are mandatory'
Initialize master, data checksums are mandatory
+ pg_ctl initdb -D master -o '--data-checksums -N -A trust --wal-segsize 1'
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "ru_RU.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "russian".
Data page checksums are enabled.
fixing permissions on existing directory master ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ...
Asia/Yekaterinburg
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
Sync to disk skipped.
The data directory might become corrupt if the operating system crashes.

Success. You can now start the database server using:

    /usr/local/bin/pg_ctl -D master -l logfile start

+ echo 'Debug to see recycling WAL, keep only 5 WAL files (because segsize 1MB)'
Debug to see recycling WAL, keep only 5 WAL files (because segsize 1MB)
+ cat
+ echo 'Accept replication connections on master'
Accept replication connections on master
+ cat
+ echo 'Start master'
Start master
+ pg_ctl -w -D master start -o '-p 5000'
waiting for server to start....2024-08-07 17:14:59.431 +05 [19618] DEBUG: registering background worker "logical replication launcher"
2024-08-07 17:14:59.431 +05 [19618] DEBUG: mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Out of memory
2024-08-07 17:14:59.434 +05 [19618] DEBUG: dynamic shared memory system will support 674 segments
2024-08-07 17:14:59.434 +05 [19618] DEBUG: created dynamic shared memory control segment 1581037720 (26976 bytes)
2024-08-07 17:14:59.434 +05 [19618] DEBUG: max_safe_fds = 986, usable_fds = 1000, already_open = 4
2024-08-07 17:14:59.434 +05 [19618] LOG: redirecting log output to logging collector process
2024-08-07 17:14:59.434 +05 [19618] HINT: Future log output will appear in directory "log".
 done
server started
+ echo 'Stop standby1 if exists'
Stop standby1 if exists
+ test -f standby1/postmaster.pid
+ echo 1
1
+ echo 'Remove standby1 directory'
Remove standby1 directory
+ rm -rf standby1
+ echo 'Base backup is taken with xlog files included'
Base backup is taken with xlog files included
+ pg_basebackup -D standby1 -p 5000 -X fetch --verbose --write-recovery-conf
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/800028 on timeline 1
pg_basebackup: write-ahead log end point: 0/800100
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
+ echo 'Start standby1'
Start standby1
+ pg_ctl -w -D standby1 start -o '-p 5001'
waiting for server to start....2024-08-07 17:15:00.282 +05 [19631] DEBUG: registering background worker "logical replication launcher"
2024-08-07 17:15:00.282 +05 [19631] DEBUG: mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Out of memory
2024-08-07 17:15:00.286 +05 [19631] DEBUG: dynamic shared memory system will support 674 segments
2024-08-07 17:15:00.286 +05 [19631] DEBUG: created dynamic shared memory control segment 1807324640 (26976 bytes)
2024-08-07 17:15:00.286 +05 [19631] DEBUG: max_safe_fds = 986, usable_fds = 1000, already_open = 4
2024-08-07 17:15:00.286 +05 [19631] LOG: redirecting log output to logging collector process
2024-08-07 17:15:00.286 +05 [19631] HINT: Future log output will appear in directory "log".
 done
server started
+ echo 'Stop standby2 if exists'
Stop standby2 if exists
+ test -f standby2/postmaster.pid
+ echo 1
1
+ echo 'Remove standby2 directory'
Remove standby2 directory
+ rm -rf standby2
+ echo 'Base backup is taken with xlog files included'
Base backup is taken with xlog files included
+ pg_basebackup -D standby2 -p 5000 -X fetch --verbose --write-recovery-conf
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/900028 on timeline 1
pg_basebackup: write-ahead log end point: 0/900100
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
+ echo 'Start standby2'
Start standby2
+ pg_ctl -w -D standby2 start -o '-p 5002'
waiting for server to start....2024-08-07 17:15:01.123 +05 [19643] DEBUG: registering background worker "logical replication launcher"
2024-08-07 17:15:01.123 +05 [19643] DEBUG: mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Out of memory
2024-08-07 17:15:01.126 +05 [19643] DEBUG: dynamic shared memory system will support 674 segments
2024-08-07 17:15:01.126 +05 [19643] DEBUG: created dynamic shared memory control segment 2259555868 (26976 bytes)
2024-08-07 17:15:01.126 +05 [19643] DEBUG: max_safe_fds = 986, usable_fds = 1000, already_open = 4
2024-08-07 17:15:01.126 +05 [19643] LOG: redirecting log output to logging collector process
2024-08-07 17:15:01.126 +05 [19643] HINT: Future log output will appear in directory "log".
 done
server started
+ echo 'Create table and insert lot of rows'
Create table and insert lot of rows
+ psql -d postgres -a --no-psqlrc -p 5000
create table a (a int, b int);
CREATE TABLE
insert into a select i, i from generate_series(1, 100000) i;
INSERT 0 100000
+ echo 'Stop master to recycle WAL files'
Stop master to recycle WAL files
+ pg_ctl -w -D master stop -m fast
waiting for server to shut down....2024-08-07 17:15:01.650 +05 [19619] DEBUG: logger shutting down
 done
server stopped
+ echo 'Stop standby1 to recycle WAL files'
Stop standby1 to recycle WAL files
+ pg_ctl -w -D standby1 stop -m fast
waiting for server to shut down....2024-08-07 17:15:01.731 +05 [19632] DEBUG: logger shutting down
 done
server stopped
+ echo 'Stop standby2 to recycle WAL files'
Stop standby2 to recycle WAL files
+ pg_ctl -w -D standby2 stop -m fast
waiting for server to shut down....2024-08-07 17:15:01.831 +05 [19644] DEBUG: logger shutting down
 done
server stopped
+ echo 'Start master'
Start master
+ pg_ctl -w -D master start -o '-p 5000'
waiting for server to start....2024-08-07 17:15:01.872 +05 [19660] DEBUG: registering background worker "logical replication launcher"
2024-08-07 17:15:01.872 +05 [19660] DEBUG: mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Out of memory
2024-08-07 17:15:01.876 +05 [19660] DEBUG: dynamic shared memory system will support 674 segments
2024-08-07 17:15:01.876 +05 [19660] DEBUG: created dynamic shared memory control segment 2766949016 (26976 bytes)
2024-08-07 17:15:01.876 +05 [19660] DEBUG: max_safe_fds = 986, usable_fds = 1000, already_open = 4
2024-08-07 17:15:01.876 +05 [19660] LOG: redirecting log output to logging collector process
2024-08-07 17:15:01.876 +05 [19660] HINT: Future log output will appear in directory "log".
 done
server started
+ echo 'Start standby1'
Start standby1
+ pg_ctl -w -D standby1 start -o '-p 5001'
waiting for server to start....2024-08-07 17:15:01.979 +05 [19670] DEBUG: registering background worker "logical replication launcher"
2024-08-07 17:15:01.979 +05 [19670] DEBUG: mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Out of memory
2024-08-07 17:15:01.982 +05 [19670] DEBUG: dynamic shared memory system will support 674 segments
2024-08-07 17:15:01.982 +05 [19670] DEBUG: created dynamic shared memory control segment 1083876086 (26976 bytes)
2024-08-07 17:15:01.982 +05 [19670] DEBUG: max_safe_fds = 986, usable_fds = 1000, already_open = 4
2024-08-07 17:15:01.983 +05 [19670] LOG: redirecting log output to logging collector process
2024-08-07 17:15:01.983 +05 [19670] HINT: Future log output will appear in directory "log".
 done
server started
+ echo 'Start standby2'
Start standby2
+ pg_ctl -w -D standby2 start -o '-p 5002'
waiting for server to start....2024-08-07 17:15:02.085 +05 [19679] DEBUG: registering background worker "logical replication launcher"
2024-08-07 17:15:02.085 +05 [19679] DEBUG: mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Out of memory
2024-08-07 17:15:02.088 +05 [19679] DEBUG: dynamic shared memory system will support 674 segments
2024-08-07 17:15:02.088 +05 [19679] DEBUG: created dynamic shared memory control segment 4052049392 (26976 bytes)
2024-08-07 17:15:02.088 +05 [19679] DEBUG: max_safe_fds = 986, usable_fds = 1000, already_open = 4
2024-08-07 17:15:02.089 +05 [19679] LOG: redirecting log output to logging collector process
2024-08-07 17:15:02.089 +05 [19679] HINT: Future log output will appear in directory "log".
 done
server started
+ echo 'Now promote standby1'
Now promote standby1
+ pg_ctl -w -D standby1 promote
waiting for server to promote....
done
server promoted
+ echo 'Wait until standby1 is promoted'
Wait until standby1 is promoted
+ pg_isready -p 5001
/tmp:5001 - accepting connections
+ echo 'Insert some rows'
Insert some rows
+ psql -d postgres -a --no-psqlrc -p 5000
insert into a select i, i from generate_series(1, 340) i;
INSERT 0 340
+ echo 'Standby2 node need to be stopped as well.'
Standby2 node need to be stopped as well.
+ pg_ctl -w -D standby2 stop -m fast
waiting for server to shut down....2024-08-07 17:15:02.411 +05 [19680] DEBUG: logger shutting down
 done
server stopped
+ echo 'Stop the master after standby promotion'
Stop the master after standby promotion
+ pg_ctl -w -D master stop -m fast
waiting for server to shut down....2024-08-07 17:15:02.517 +05 [19661] DEBUG: logger shutting down
 done
server stopped
+ echo 'Standby1 node need to be stopped as well.'
Standby1 node need to be stopped as well.
+ pg_ctl -w -D standby1 stop -m fast
waiting for server to shut down....2024-08-07 17:15:02.617 +05 [19671] DEBUG: logger shutting down
 done
server stopped
+ echo 'Do rewind using standby1 pgdata as source'
Do rewind using standby1 pgdata as source
+ pg_rewind --progress --debug '--source-pgdata=standby1' '--target-pgdata=standby2'
pg_rewind: Source timeline history:
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: servers diverged at WAL location 0/10392E8 on timeline 1
pg_rewind: no rewind required
+ echo 'Parameters for standby2 postgresql.conf'
Parameters for standby2 postgresql.conf
+ cat
+ echo 'Place standby2 signal file'
Place standby2 signal file
+ touch standby2/standby.signal
+ echo 'Start standby1'
Start standby1
+ pg_ctl -w -D standby1 start -o '-p 5001'
waiting for server to start....2024-08-07 17:15:02.723 +05 [19702] DEBUG: registering background worker "logical replication launcher"
2024-08-07 17:15:02.723 +05 [19702] DEBUG: mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Out of memory
2024-08-07 17:15:02.726 +05 [19702] DEBUG: dynamic shared memory system will support 674 segments
2024-08-07 17:15:02.726 +05 [19702] DEBUG: created dynamic shared memory control segment 3627107606 (26976 bytes)
2024-08-07 17:15:02.726 +05 [19702] DEBUG: max_safe_fds = 986, usable_fds = 1000, already_open = 4
2024-08-07 17:15:02.726 +05 [19702] LOG: redirecting log output to logging collector process
2024-08-07 17:15:02.726 +05 [19702] HINT: Future log output will appear in directory "log".
 done
server started
+ echo 'Start standby2'
Start standby2
+ pg_ctl -w -D standby2 start -o '-p 5002'
waiting for server to start....2024-08-07 17:15:02.830 +05 [19712] DEBUG: registering background worker "logical replication launcher"
2024-08-07 17:15:02.830 +05 [19712] DEBUG: mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Out of memory
2024-08-07 17:15:02.833 +05 [19712] DEBUG: dynamic shared memory system will support 674 segments
2024-08-07 17:15:02.833 +05 [19712] DEBUG: created dynamic shared memory control segment 2291233354 (26976 bytes)
2024-08-07 17:15:02.833 +05 [19712] DEBUG: max_safe_fds = 986, usable_fds = 1000, already_open = 4
2024-08-07 17:15:02.833 +05 [19712] LOG: redirecting log output to logging collector process
2024-08-07 17:15:02.833 +05 [19712] HINT: Future log output will appear in directory "log".
 done
server started
+ echo 'Same query'
Same query
+ psql -d postgres -a --no-psqlrc -p 5001
select count(*) from a;
 count
--------
 100000
(1 row)

+ echo 'Different results'
Different results
+ psql -d postgres -a --no-psqlrc -p 5002
select count(*) from a;
 count
--------
 100340
(1 row)

+ echo 'Stop standby1'
Stop standby1
+ pg_ctl -w -D standby1 stop -m fast
waiting for server to shut down....2024-08-07 17:15:03.077 +05 [19703] DEBUG: logger shutting down
 done
server stopped
+ echo 'Stop standby2'
Stop standby2
+ pg_ctl -w -D standby2 stop -m fast
waiting for server to shut down....2024-08-07 17:15:03.117 +05 [19713] DEBUG: logger shutting down
 done
server stopped
```
Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
From: Heikki Linnakangas
On 07/08/2024 15:19, PG Bug reporting form wrote:
> Sometimes pg_rewind mistakenly assumes that nothing needs to be done, which
> results in the replica having data that is not on the master.
>
> ...
> + pg_rewind --progress --debug '--source-pgdata=standby1' '--target-pgdata=standby2'
> pg_rewind: Source timeline history:
> pg_rewind: Target timeline history:
> pg_rewind: 1: 0/0 - 0/0
> pg_rewind: servers diverged at WAL location 0/10392E8 on timeline 1
> pg_rewind: no rewind required
>
> ...
> Same query
> + psql -d postgres -a --no-psqlrc -p 5001
> select count(*) from a;
>  count
> --------
>  100000
> (1 row)
>
> + echo 'Different results'
> Different results
> + psql -d postgres -a --no-psqlrc -p 5002
> select count(*) from a;
>  count
> --------
>  100340
> (1 row)
>
> + echo 'Stop standby1'
> Stop standby1
> + pg_ctl -w -D standby1 stop -m fast
> waiting for server to shut down....2024-08-07 17:15:03.077 +05 [19703] DEBUG: logger shutting down
> done
> server stopped
> + echo 'Stop standby2'
> Stop standby2
> + pg_ctl -w -D standby2 stop -m fast
> waiting for server to shut down....2024-08-07 17:15:03.117 +05 [19713] DEBUG: logger shutting down
> done
> server stopped
> ```

I cannot reproduce this. On my laptop, the reported "servers diverged at WAL location" is always higher, and it performs the rewind as expected:

pg_rewind: Source timeline history:
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: servers diverged at WAL location 0/1138F00 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/1138E88 on timeline 1

However, while looking at the code, I noticed that the debug output of the source timeline history is broken. See the attached patch to fix it.
With that fix, I get:

pg_rewind: Source timeline history:
pg_rewind: 1: 0/0 - 0/1138F00
pg_rewind: 2: 0/1138F00 - 0/0
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: 0 a: 1 0/0-0/1138F00 - b: 1 0/0-0/0
pg_rewind: servers diverged at WAL location 0/1138F00 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/1138E88 on timeline 1

That just fixes the debugging output, though; it doesn't change what pg_rewind actually does. Can you try the attached fix and show the output you get with it, please? Or alternatively, show the contents of the standby1/pg_wal/00000002.history file.

--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment
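[Editor's note: a timeline history file such as the 00000002.history requested above is plain text, one line per timeline switch: parent TLI, switchpoint LSN, and a reason. A small sketch of pulling out the switchpoint; the helper name is mine and the sample contents are illustrative, using the LSN from the fixed debug output above.]

```shell
# Hypothetical helper: print the switchpoint LSN recorded in a timeline
# history file (lines look like: "<parent TLI>\t<LSN>\t<reason>").
history_switchpoint() {
    # The last numbered line holds the most recent switch; field 2 is the LSN.
    awk '/^[0-9]/ { lsn = $2 } END { print lsn }' "$1"
}

# Example history file with the contents the thread's debug output implies:
cat > 00000002.history.sample <<'EOF'
1	0/1138F00	no recovery target specified
EOF

history_switchpoint 00000002.history.sample   # prints 0/1138F00
```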
Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
From: Heikki Linnakangas
On 08/08/2024 10:57, Georgy Shelkovy wrote:
> Unfortunately, the playback is not very stable, but sometimes it shoots.
> I added some commands to show last WAL rows

Thanks. I still haven't been able to reproduce it, but here's a theory:

When determining whether the target needs rewinding, pg_rewind looks at the target's last checkpoint record, or, if it's a standby, its minRecoveryPoint. It's possible that standby2's minRecoveryPoint is indeed before the point of divergence. That means it has replayed the 340 insert records, but all the changes are still only sitting in the shared buffer cache. When you shut it down, those 340 inserts are gone on standby2. When you restart it, they will be applied again from the WAL.

In that case, pg_rewind's conclusion that no rewind is needed is correct. standby2 is strictly behind standby1, and could catch up to it directly. However, when you restart standby2, it will first replay the WAL it had streamed from the master.

Can you show the full output of pg_controldata on all the servers, please? In your latest snippet, you showed just the checkpoint locations, but if you just remove the "grep checkpoint | grep location" filters, it will print the whole thing. I'm particularly interested in the minRecoveryPoint on standby2, in the cases when it works and when it doesn't.

I'm not sure what the right behavior would be, if that's the issue. Perhaps pg_rewind should truncate the WAL in standby2/pg_wal/ in that case, so that when you start it up again, it would not replay the local WAL but would connect to standby1 directly. Also, perhaps a fast shutdown of a standby server should update minRecoveryPoint before exiting.

--
Heikki Linnakangas
Neon (https://neon.tech)
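[Editor's note: the minRecoveryPoint in question appears in pg_controldata output as "Minimum recovery ending location". A hypothetical sketch of extracting it; the field name is real, but the sample values below are illustrative, reusing LSNs that appear in the thread.]

```shell
# Hypothetical helper: extract minRecoveryPoint from saved pg_controldata output.
min_recovery_point() {
    sed -n 's/^Minimum recovery ending location: *//p' "$1"
}

# Sample of the relevant pg_controldata lines (values illustrative):
cat > controldata.sample <<'EOF'
Latest checkpoint location:           0/1138E88
Minimum recovery ending location:     0/10392E8
Min recovery ending loc's timeline:   1
EOF

min_recovery_point controldata.sample   # prints 0/10392E8
```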
Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
From: Heikki Linnakangas
On 08/08/2024 12:38, Georgy Shelkovy wrote:
> On second run I got bug

Can you show the full pg_controldata output please? See attached script.

--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment
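[Editor's note: the attached script is not preserved in the archive; something along these lines would gather the requested output. A sketch, assuming the master/standby1/standby2 data directories from the repro script; it skips directories that don't exist so it can run anywhere.]

```shell
# Sketch: collect full pg_controldata output for each cluster directory.
report=""
for dir in master standby1 standby2; do
    report="$report=== $dir ===
"
    if [ -d "$dir" ] && command -v pg_controldata >/dev/null 2>&1; then
        # Whole output, no "grep checkpoint | grep location" filter.
        report="$report$(pg_controldata "$dir")
"
    else
        report="$report(skipped: no data directory or pg_controldata not found)
"
    fi
done
printf '%s' "$report"
```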
Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
From: Heikki Linnakangas
On 08/08/2024 14:58, Georgy Shelkovy wrote:
> this is good log of previous run for comparison

Thanks! That confirms my theory that the minRecoveryPoint in the "bad" case is at the LSN where the histories diverged, while in the "good" case, it's later. So pg_rewind is not wrong when it says that no rewind is required. It's still confusing, though. Here's a visualization of the scenarios:

Legend:
TLI 1: the WAL produced on the master
TLI 2: WAL produced on standby1
*: point of divergence
^: minRecoveryPoint on standby2

Good:

            ----------- TLI 2
           /
----------*------------ TLI 1
                ^

Bad:

            ----------- TLI 2
           /
----------*------------ TLI 1
          ^

There's a third possibility, which actually produces an assertion failure. I was able to reproduce this case by adding some sleeps in the script and in the walreceiver startup code:

            ----------- TLI 2
           /
----------*------------ TLI 1
       ^

pg_rewind: Source timeline history:
pg_rewind: 1: 0/0 - 0/1138F00
pg_rewind: 2: 0/1138F00 - 0/0
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: servers diverged at WAL location 0/1138F00 on timeline 1
pg_rewind: ../src/bin/pg_rewind/pg_rewind.c:443: int main(int, char **): Assertion `target_wal_endrec == divergerec' failed.

Except for the assertion failure, I think that's essentially the same as the "Bad" case. On a non-assertion build, pg_rewind would report "no rewind required", which seems correct.

So it seems true that rewind is not required in those cases. However, if the WAL is already written on the standby's disk, just not replayed yet, then when you restart the server, it will replay the WAL from timeline 1. That does seem surprising. Perhaps pg_rewind should just update the minRecoveryPoint and minRecoveryTLI in the control file in that case, to force WAL recovery to follow the timeline switch to TLI 2.

I will try to write a TAP test for the "Bad" and the assertion failure case, fix the assertion failure, and test how updating the minRecoveryPoint would behave.
--
Heikki Linnakangas
Neon (https://neon.tech)
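[Editor's note: the three scenarios boil down to comparing two LSNs: the target's end of replayed WAL ("^") against the divergence point ("*"). A simplified model of that check in shell, parsing the "hi/lo" hex form into an integer; the helper names are mine, and this only sketches the "is the target past the divergence" part of pg_rewind's logic.]

```shell
# Hypothetical helpers modeling pg_rewind's comparison.
# PostgreSQL LSNs are written "hi/lo" in hex; 'hi' counts 4GB units.
lsn_to_int() {
    printf '%d\n' "$(( 0x${1%/*} * 4294967296 + 0x${1#*/} ))"
}

needs_rewind() {   # needs_rewind <target end-of-WAL LSN> <divergence LSN>
    # A rewind is needed only if the target replayed past the divergence.
    [ "$(lsn_to_int "$1")" -gt "$(lsn_to_int "$2")" ] && echo yes || echo no
}

needs_rewind 0/1138F00 0/10392E8   # target past divergence -> yes
needs_rewind 0/10392E8 0/10392E8   # the "bad" case above -> no
```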
Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
From: Heikki Linnakangas
On 08/08/2024 16:07, Heikki Linnakangas wrote:
> So it seems true that rewind is not required in those cases. However, if
> the WAL is already written on the standby's disk, just not replayed yet,
> then when you restart the server, it will replay the WAL from timeline
> 1. That does seem surprising. Perhaps pg_rewind should just update the
> minRecoveryPoint and minRecoveryTLI in control file in that case, to
> force WAL recovery to follow the timeline switch to TLI 2.
>
> I will try to write a TAP test for the "Bad" and the assertion failure
> case, fix the assertion failure, and test how updating the
> minRecoveryPoint would behave.

I've divided this into three separate issues:

1. I think the assertion in pg_rewind is simply bogus, and we should remove it. The attached 0002-Remove-bogus-assertion-in-pg_rewind.patch does that, and adds a test case to cover it.

2. Independently of pg_rewind: when you start PostgreSQL, it will first try to recover all the WAL it has locally in pg_wal. That goes wrong if you have set a recovery target TLI. For example, imagine this situation:

- The recovery target TLI is 2, set explicitly in postgresql.conf.
- The switchpoint from TLI 1 to 2 happened at WAL position 0/1510198 (the switchpoint is found in 00000002.history).
- There is a WAL file 000000010000000000000001 under pg_wal, which contains valid WAL up to 0/1590000.

When you start the server, it will first recover all the WAL from 000000010000000000000001, up to 0/1590000. Then it will connect to the primary to fetch more WAL, but it will fail to make any progress, because it already blew past the switchpoint. It's obviously wrong to replay the WAL from timeline 1 beyond the 1->2 switchpoint when the recovery target is TLI 2. The attached 0003-Don-t-read-past-current-TLI-during-archive-recovery.patch fixes that. However, the logic to find the right WAL segment file and read the WAL is extremely complicated, and I don't feel comfortable that I got all the cases right.
Review would be highly appreciated. The patch includes a test case to demonstrate the problem, with no pg_rewind involved. It does include one "manual" step to copy a timeline history file into pg_wal, marked with XXX, however, so I'm not sure how possible this scenario is in production setups.

3. When pg_rewind has nothing to do, the target server is left unmodified, in a state such that when you restart it, it will replay all the WAL it has locally in pg_wal first, before connecting to the primary. Even though the target is a direct ancestor of the source, and hence *can* follow the WAL to the source's position without rewinding, that doesn't mean it *will* actually do so. The attached patch changes pg_rewind so that it updates the control file in that case, setting minRecoveryPoint and minRecoveryPointTLI to point to the source's current WAL position. That way, when you start the target up, it will follow the timeline history to reach that point. (This requires fixing issue 2, because otherwise it still won't follow the history correctly to reach the minRecoveryPointTLI.)

To make the test pass, it would actually be sufficient to copy the timeline history file into pg_wal (which the patch also does). But updating minRecoveryPoint seems good to ensure that it follows the right timeline. Otherwise it relies on the logic at startup to find the latest timeline, and on the latest timeline being the one you tried to rewind to. I think it would go wrong if there was another higher-numbered history file present in pg_wal for some reason.

All in all, I don't feel very confident about all this. The assertion fix seems straightforward, so barring objections I'll commit and backpatch that. The timeline-following at startup (issue 2) seems pretty clearly wrong, but I'm not sure it's worth the risk to backpatch. Similarly, issue 3 might not be worth the risk to backpatch, especially if we don't also backpatch 2. I would love to hear comments on those.
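[Editor's note: the example in issue 2 relies on standard WAL file naming, where a 24-character segment name encodes the TLI and the segment number. A sketch of that mapping; the helper is hypothetical, but with the default 16MB segment size it reproduces the file name used in the example, and with the 1MB wal-segsize from the repro script it gives the corresponding name there.]

```shell
# Hypothetical helper: derive the WAL segment file name holding a given LSN.
wal_file_name() {   # wal_file_name <tli> <lsn hi/lo> <segment size in bytes>
    tli=$1; lsn=$2; segsize=$3
    hi=$(( 0x${lsn%/*} ))
    lo=$(( 0x${lsn#*/} ))
    segno=$(( (hi * 4294967296 + lo) / segsize ))
    # Each "xlogid" (middle field) covers 4GB of WAL.
    per_xlogid=$(( 4294967296 / segsize ))
    printf '%08X%08X%08X\n' "$tli" $(( segno / per_xlogid )) $(( segno % per_xlogid ))
}

wal_file_name 1 0/1590000 16777216   # default 16MB segments -> 000000010000000000000001
wal_file_name 1 0/1510198 1048576    # repro script's 1MB segments -> 000000010000000000000015
```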
Georgy, if you have the possibility to test these patches with your repro script, that would be highly appreciated.

--
Heikki Linnakangas
Neon (https://neon.tech)