Thread: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.

BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      18575
Logged by:          Georgy Shelkovy
Email address:      g.shelkovy@arenadata.io
PostgreSQL version: 16.3
Operating system:   docker alpine
Description:

Sometimes pg_rewind mistakenly assumes that nothing needs to be done, which
results in the replica having data that is not on the master.

Steps to reproduce
Script:
```bash
#!/bin/sh -eux

echo "Goto script's directory"
cd "$(dirname "$0")"
echo "Stop master if exists"
test -f master/postmaster.pid && pg_ctl -w -D master stop -m fast || echo $?
echo "Remove master directory"
rm -rf master
echo "Create master directory"
mkdir -p master
echo "Initialize master, data checksums are mandatory"
pg_ctl initdb -D master -o "--data-checksums -N -A trust --wal-segsize 1"
echo "Debug to see recycling WAL, keep only 5 WAL files (because segsize
1MB)"
cat >>master/postgresql.conf <<EOF
hot_standby = on
logging_collector = on
log_min_messages = debug
wal_keep_size = 5MB
EOF
echo "Accept replication connections on master"
cat >> master/pg_hba.conf <<EOF
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
EOF
echo "Start master"
pg_ctl -w -D master start -o "-p 5000"
echo "Stop standby1 if exists"
test -f standby1/postmaster.pid && pg_ctl -w -D standby1 stop -m fast || echo $?
echo "Remove standby1 directory"
rm -rf standby1
echo "Base backup is taken with xlog files included"
pg_basebackup -D standby1 -p 5000 -X fetch --verbose --write-recovery-conf
echo "Start standby1"
pg_ctl -w -D standby1 start -o "-p 5001"
echo "Stop standby2 if exists"
test -f standby2/postmaster.pid && pg_ctl -w -D standby2 stop -m fast || echo $?
echo "Remove standby2 directory"
rm -rf standby2
echo "Base backup is taken with xlog files included"
pg_basebackup -D standby2 -p 5000 -X fetch --verbose --write-recovery-conf
echo "Start standby2"
pg_ctl -w -D standby2 start -o "-p 5002"
echo "Create table and insert lot of rows"
psql -d postgres -a --no-psqlrc -p 5000 <<EOF
create table a (a int, b int);
insert into a select i, i from generate_series(1, 100000) i;
EOF
echo "Stop master to recycle WAL files"
pg_ctl -w -D master stop -m fast
echo "Stop standby1 to recycle WAL files"
pg_ctl -w -D standby1 stop -m fast
echo "Stop standby2 to recycle WAL files"
pg_ctl -w -D standby2 stop -m fast
echo "Start master"
pg_ctl -w -D master start -o "-p 5000"
echo "Start standby1"
pg_ctl -w -D standby1 start -o "-p 5001"
echo "Start standby2"
pg_ctl -w -D standby2 start -o "-p 5002"
echo "Now promote standby1"
pg_ctl -w -D standby1 promote
echo "Wait until standby1 is promoted"
while ! pg_isready -p 5001; do sleep 1; done
echo "Insert some rows"
psql -d postgres -a --no-psqlrc -p 5000 <<EOF
insert into a select i, i from generate_series(1, 340) i;
EOF
echo "Standby2 node need to be stopped as well."
pg_ctl -w -D standby2 stop -m fast
echo "Stop the master after standby promotion"
pg_ctl -w -D master stop -m fast
echo "Standby1 node need to be stopped as well."
pg_ctl -w -D standby1 stop -m fast
echo "Do rewind using standby1 pgdata as source"
pg_rewind --progress --debug --source-pgdata=standby1 --target-pgdata=standby2
echo "Parameters for standby2 postgresql.conf"
cat >> standby2/postgresql.conf <<EOF
primary_conninfo = 'port=5001'
EOF
echo "Place standby2 signal file"
touch standby2/standby.signal
echo "Start standby1"
pg_ctl -w -D standby1 start -o "-p 5001"
echo "Start standby2"
pg_ctl -w -D standby2 start -o "-p 5002"
echo "Same query"
psql -d postgres -a --no-psqlrc -p 5001 <<EOF
select count(*) from a;
EOF
echo "Different results"
psql -d postgres -a --no-psqlrc -p 5002 <<EOF
select count(*) from a;
EOF
echo "Stop standby1"
pg_ctl -w -D standby1 stop -m fast
echo "Stop standby2"
pg_ctl -w -D standby2 stop -m fast
```
Output:
```bash
+ echo 'Goto script'"'"'s directory'
Goto script's directory
+ dirname /var/lib/postgresql/r.sh
+ cd /var/lib/postgresql
+ echo 'Stop master if exists'
Stop master if exists
+ test -f master/postmaster.pid
+ echo 1
1
+ echo 'Remove master directory'
Remove master directory
+ rm -rf master
+ echo 'Create master directory'
Create master directory
+ mkdir -p master
+ echo 'Initialize master, data checksums are mandatory'
Initialize master, data checksums are mandatory
+ pg_ctl initdb -D master -o '--data-checksums -N -A trust --wal-segsize 1'
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "ru_RU.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "russian".

Data page checksums are enabled.

fixing permissions on existing directory master ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Asia/Yekaterinburg
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok

Sync to disk skipped.
The data directory might become corrupt if the operating system crashes.

Success. You can now start the database server using:

    /usr/local/bin/pg_ctl -D master -l logfile start

+ echo 'Debug to see recycling WAL, keep only 5 WAL files (because segsize 1MB)'
Debug to see recycling WAL, keep only 5 WAL files (because segsize 1MB)
+ cat
+ echo 'Accept replication connections on master'
Accept replication connections on master
+ cat
+ echo 'Start master'
Start master
+ pg_ctl -w -D master start -o '-p 5000'
waiting for server to start....2024-08-07 17:14:59.431 +05 [19618] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:14:59.431 +05 [19618] DEBUG:  mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:14:59.434 +05 [19618] DEBUG:  dynamic shared memory system
will support 674 segments
2024-08-07 17:14:59.434 +05 [19618] DEBUG:  created dynamic shared memory
control segment 1581037720 (26976 bytes)
2024-08-07 17:14:59.434 +05 [19618] DEBUG:  max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:14:59.434 +05 [19618] LOG:  redirecting log output to logging
collector process
2024-08-07 17:14:59.434 +05 [19618] HINT:  Future log output will appear in
directory "log".
 done
server started
+ echo 'Stop standby1 if exists'
Stop standby1 if exists
+ test -f standby1/postmaster.pid
+ echo 1
1
+ echo 'Remove standby1 directory'
Remove standby1 directory
+ rm -rf standby1
+ echo 'Base backup is taken with xlog files included'
Base backup is taken with xlog files included
+ pg_basebackup -D standby1 -p 5000 -X fetch --verbose --write-recovery-conf
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/800028 on timeline 1
pg_basebackup: write-ahead log end point: 0/800100
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
+ echo 'Start standby1'
Start standby1
+ pg_ctl -w -D standby1 start -o '-p 5001'
waiting for server to start....2024-08-07 17:15:00.282 +05 [19631] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:00.282 +05 [19631] DEBUG:  mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:00.286 +05 [19631] DEBUG:  dynamic shared memory system
will support 674 segments
2024-08-07 17:15:00.286 +05 [19631] DEBUG:  created dynamic shared memory
control segment 1807324640 (26976 bytes)
2024-08-07 17:15:00.286 +05 [19631] DEBUG:  max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:00.286 +05 [19631] LOG:  redirecting log output to logging
collector process
2024-08-07 17:15:00.286 +05 [19631] HINT:  Future log output will appear in
directory "log".
 done
server started
+ echo 'Stop standby2 if exists'
Stop standby2 if exists
+ test -f standby2/postmaster.pid
+ echo 1
1
+ echo 'Remove standby2 directory'
Remove standby2 directory
+ rm -rf standby2
+ echo 'Base backup is taken with xlog files included'
Base backup is taken with xlog files included
+ pg_basebackup -D standby2 -p 5000 -X fetch --verbose --write-recovery-conf
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/900028 on timeline 1
pg_basebackup: write-ahead log end point: 0/900100
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
+ echo 'Start standby2'
Start standby2
+ pg_ctl -w -D standby2 start -o '-p 5002'
waiting for server to start....2024-08-07 17:15:01.123 +05 [19643] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:01.123 +05 [19643] DEBUG:  mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:01.126 +05 [19643] DEBUG:  dynamic shared memory system
will support 674 segments
2024-08-07 17:15:01.126 +05 [19643] DEBUG:  created dynamic shared memory
control segment 2259555868 (26976 bytes)
2024-08-07 17:15:01.126 +05 [19643] DEBUG:  max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:01.126 +05 [19643] LOG:  redirecting log output to logging
collector process
2024-08-07 17:15:01.126 +05 [19643] HINT:  Future log output will appear in
directory "log".
 done
server started
+ echo 'Create table and insert a lot of rows'
Create table and insert a lot of rows
+ psql -d postgres -a --no-psqlrc -p 5000
create table a (a int, b int);
CREATE TABLE
insert into a select i, i from generate_series(1, 100000) i;
INSERT 0 100000
+ echo 'Stop master to recycle WAL files'
Stop master to recycle WAL files
+ pg_ctl -w -D master stop -m fast
waiting for server to shut down....2024-08-07 17:15:01.650 +05 [19619]
DEBUG:  logger shutting down
 done
server stopped
+ echo 'Stop standby1 to recycle WAL files'
Stop standby1 to recycle WAL files
+ pg_ctl -w -D standby1 stop -m fast
waiting for server to shut down....2024-08-07 17:15:01.731 +05 [19632]
DEBUG:  logger shutting down
 done
server stopped
+ echo 'Stop standby2 to recycle WAL files'
Stop standby2 to recycle WAL files
+ pg_ctl -w -D standby2 stop -m fast
waiting for server to shut down....2024-08-07 17:15:01.831 +05 [19644]
DEBUG:  logger shutting down
 done
server stopped
+ echo 'Start master'
Start master
+ pg_ctl -w -D master start -o '-p 5000'
waiting for server to start....2024-08-07 17:15:01.872 +05 [19660] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:01.872 +05 [19660] DEBUG:  mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:01.876 +05 [19660] DEBUG:  dynamic shared memory system
will support 674 segments
2024-08-07 17:15:01.876 +05 [19660] DEBUG:  created dynamic shared memory
control segment 2766949016 (26976 bytes)
2024-08-07 17:15:01.876 +05 [19660] DEBUG:  max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:01.876 +05 [19660] LOG:  redirecting log output to logging
collector process
2024-08-07 17:15:01.876 +05 [19660] HINT:  Future log output will appear in
directory "log".
 done
server started
+ echo 'Start standby1'
Start standby1
+ pg_ctl -w -D standby1 start -o '-p 5001'
waiting for server to start....2024-08-07 17:15:01.979 +05 [19670] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:01.979 +05 [19670] DEBUG:  mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:01.982 +05 [19670] DEBUG:  dynamic shared memory system
will support 674 segments
2024-08-07 17:15:01.982 +05 [19670] DEBUG:  created dynamic shared memory
control segment 1083876086 (26976 bytes)
2024-08-07 17:15:01.982 +05 [19670] DEBUG:  max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:01.983 +05 [19670] LOG:  redirecting log output to logging
collector process
2024-08-07 17:15:01.983 +05 [19670] HINT:  Future log output will appear in
directory "log".
 done
server started
+ echo 'Start standby2'
Start standby2
+ pg_ctl -w -D standby2 start -o '-p 5002'
waiting for server to start....2024-08-07 17:15:02.085 +05 [19679] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:02.085 +05 [19679] DEBUG:  mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:02.088 +05 [19679] DEBUG:  dynamic shared memory system
will support 674 segments
2024-08-07 17:15:02.088 +05 [19679] DEBUG:  created dynamic shared memory
control segment 4052049392 (26976 bytes)
2024-08-07 17:15:02.088 +05 [19679] DEBUG:  max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:02.089 +05 [19679] LOG:  redirecting log output to logging
collector process
2024-08-07 17:15:02.089 +05 [19679] HINT:  Future log output will appear in
directory "log".
 done
server started
+ echo 'Now promote standby1'
Now promote standby1
+ pg_ctl -w -D standby1 promote
waiting for server to promote.... done
server promoted
+ echo 'Wait until standby1 is promoted'
Wait until standby1 is promoted
+ pg_isready -p 5001
/tmp:5001 - accepting connections
+ echo 'Insert some rows'
Insert some rows
+ psql -d postgres -a --no-psqlrc -p 5000
insert into a select i, i from generate_series(1, 340) i;
INSERT 0 340
+ echo 'Standby2 node needs to be stopped as well.'
Standby2 node needs to be stopped as well.
+ pg_ctl -w -D standby2 stop -m fast
waiting for server to shut down....2024-08-07 17:15:02.411 +05 [19680]
DEBUG:  logger shutting down
 done
server stopped
+ echo 'Stop the master after standby promotion'
Stop the master after standby promotion
+ pg_ctl -w -D master stop -m fast
waiting for server to shut down....2024-08-07 17:15:02.517 +05 [19661]
DEBUG:  logger shutting down
 done
server stopped
+ echo 'Standby1 node needs to be stopped as well.'
Standby1 node needs to be stopped as well.
+ pg_ctl -w -D standby1 stop -m fast
waiting for server to shut down....2024-08-07 17:15:02.617 +05 [19671]
DEBUG:  logger shutting down
 done
server stopped
+ echo 'Do rewind using standby1 pgdata as source'
Do rewind using standby1 pgdata as source
+ pg_rewind --progress --debug '--source-pgdata=standby1' '--target-pgdata=standby2'
pg_rewind: Source timeline history:
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: servers diverged at WAL location 0/10392E8 on timeline 1
pg_rewind: no rewind required
+ echo 'Parameters for standby2 postgresql.conf'
Parameters for standby2 postgresql.conf
+ cat
+ echo 'Place standby2 signal file'
Place standby2 signal file
+ touch standby2/standby.signal
+ echo 'Start standby1'
Start standby1
+ pg_ctl -w -D standby1 start -o '-p 5001'
waiting for server to start....2024-08-07 17:15:02.723 +05 [19702] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:02.723 +05 [19702] DEBUG:  mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:02.726 +05 [19702] DEBUG:  dynamic shared memory system
will support 674 segments
2024-08-07 17:15:02.726 +05 [19702] DEBUG:  created dynamic shared memory
control segment 3627107606 (26976 bytes)
2024-08-07 17:15:02.726 +05 [19702] DEBUG:  max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:02.726 +05 [19702] LOG:  redirecting log output to logging
collector process
2024-08-07 17:15:02.726 +05 [19702] HINT:  Future log output will appear in
directory "log".
 done
server started
+ echo 'Start standby2'
Start standby2
+ pg_ctl -w -D standby2 start -o '-p 5002'
waiting for server to start....2024-08-07 17:15:02.830 +05 [19712] DEBUG:
registering background worker "logical replication launcher"
2024-08-07 17:15:02.830 +05 [19712] DEBUG:  mmap(146800640) with MAP_HUGETLB
failed, huge pages disabled: Out of memory
2024-08-07 17:15:02.833 +05 [19712] DEBUG:  dynamic shared memory system
will support 674 segments
2024-08-07 17:15:02.833 +05 [19712] DEBUG:  created dynamic shared memory
control segment 2291233354 (26976 bytes)
2024-08-07 17:15:02.833 +05 [19712] DEBUG:  max_safe_fds = 986, usable_fds =
1000, already_open = 4
2024-08-07 17:15:02.833 +05 [19712] LOG:  redirecting log output to logging
collector process
2024-08-07 17:15:02.833 +05 [19712] HINT:  Future log output will appear in
directory "log".
 done
server started
+ echo 'Same query'
Same query
+ psql -d postgres -a --no-psqlrc -p 5001
select count(*) from a;
 count  
--------
 100000
(1 row)

+ echo 'Different results'
Different results
+ psql -d postgres -a --no-psqlrc -p 5002
select count(*) from a;
 count  
--------
 100340
(1 row)

+ echo 'Stop standby1'
Stop standby1
+ pg_ctl -w -D standby1 stop -m fast
waiting for server to shut down....2024-08-07 17:15:03.077 +05 [19703]
DEBUG:  logger shutting down
 done
server stopped
+ echo 'Stop standby2'
Stop standby2
+ pg_ctl -w -D standby2 stop -m fast
waiting for server to shut down....2024-08-07 17:15:03.117 +05 [19713]
DEBUG:  logger shutting down
 done
server stopped
```


On 07/08/2024 15:19, PG Bug reporting form wrote:
> Sometimes pg_rewind mistakenly assumes that nothing needs to be done, which
> results in the replica having data that is not on the master.
> 
> ...
> + pg_rewind --progress --debug '--source-pgdata=standby1' '--target-pgdata=standby2'
> pg_rewind: Source timeline history:
> pg_rewind: Target timeline history:
> pg_rewind: 1: 0/0 - 0/0
> pg_rewind: servers diverged at WAL location 0/10392E8 on timeline 1
> pg_rewind: no rewind required
>
> ...
> Same query
> + psql -d postgres -a --no-psqlrc -p 5001
> select count(*) from a;
>   count
> --------
>   100000
> (1 row)
> 
> + echo 'Different results'
> Different results
> + psql -d postgres -a --no-psqlrc -p 5002
> select count(*) from a;
>   count
> --------
>   100340
> (1 row)
> 
> + echo 'Stop standby1'
> Stop standby1
> + pg_ctl -w -D standby1 stop -m fast
> waiting for server to shut down....2024-08-07 17:15:03.077 +05 [19703]
> DEBUG:  logger shutting down
>   done
> server stopped
> + echo 'Stop standby2'
> Stop standby2
> + pg_ctl -w -D standby2 stop -m fast
> waiting for server to shut down....2024-08-07 17:15:03.117 +05 [19713]
> DEBUG:  logger shutting down
>   done
> server stopped
> ```

I cannot reproduce this. On my laptop, the reported "servers diverged at 
WAL location" is always higher, and it performs rewind as expected:

pg_rewind: Source timeline history:
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: servers diverged at WAL location 0/1138F00 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/1138E88 on timeline 1

However, while looking at the code, I noticed that the debug output of 
the source timeline history is broken. See the attached patch to fix it. 
With that fix, I get:

pg_rewind: Source timeline history:
pg_rewind: 1: 0/0 - 0/1138F00
pg_rewind: 2: 0/1138F00 - 0/0
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: 0 a: 1 0/0-0/1138F00 - b: 1 0/0-0/0
pg_rewind: servers diverged at WAL location 0/1138F00 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/1138E88 on timeline 1

That just fixes the debugging output, though; it doesn't change what 
pg_rewind actually does.

Can you try the attached fix and show the output you get with that, 
please? Or alternatively, show the contents of the 
standby1/pg_wal/00000002.history file.
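
For reference, the timeline history file is plain text; here is a 
minimal sketch of how to inspect it, assuming the directory layout from 
the repro script (the LSN in the example output is illustrative, not 
from an actual run):

```bash
# Show the history file written when standby1 was promoted to TLI 2.
# Each line holds: parent TLI, switchpoint LSN, and a reason string.
cat standby1/pg_wal/00000002.history
# Illustrative output:
# 1	0/1138F00	no recovery target specified
```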

-- 
Heikki Linnakangas
Neon (https://neon.tech)

On 08/08/2024 10:57, Georgy Shelkovy wrote:
> Unfortunately, the reproduction is not very stable, but it sometimes 
> fires. I added some commands to show the last WAL rows.

Thanks. I still haven't been able to reproduce it, but here's a theory:

When determining whether the target needs rewinding, pg_rewind looks at 
the target's last checkpoint record, or if it's a standby, its 
minRecoveryPoint. It's possible that standby2's minRecoveryPoint is 
indeed before the point of divergence. That means it has replayed the 
340 insert records, but all the changes are still only sitting in the 
shared buffer cache. When you shut it down, those 340 inserts are gone 
on standby2. When you restart it, they will be applied again from the WAL.

In that case, pg_rewind's conclusion that no rewind is needed is 
correct. standby2 is strictly behind standby1, and could catch up 
directly to it. However, when you restart standby2, it will first replay 
the WAL it had streamed from the master.
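
One way to see this directly is to compare standby2's control file with 
the divergence LSN that pg_rewind reported; a minimal sketch, assuming 
the directory layout from the repro script (the field names are real 
pg_controldata output labels):

```bash
# Compare standby2's recovery state with the divergence LSN that
# pg_rewind reported (0/10392E8 in the run above).
pg_controldata standby2 |
    grep -E 'Minimum recovery ending location|Latest checkpoint location'
# If "Minimum recovery ending location" is at or before the divergence
# LSN, pg_rewind concludes that no rewind is required.
```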

Can you show the full output of pg_controldata on all the servers, 
please? In your latest snippet, you showed just the checkpoint 
locations, but if you just remove the "grep checkpoint | grep location" 
filters, it would print the whole thing. I'm particularly interested in 
the minRecoveryPoint on standby2, in the cases when it works and when it 
doesn't.

I'm not sure what the right behavior would be if that's the issue. 
Perhaps pg_rewind should truncate the WAL in standby2/pg_wal/ in that 
case, so that when you start it up again, it would not replay the local 
WAL but would connect to standby1 directly. Also, perhaps a fast 
shutdown of a standby server should update minRecoveryPoint before exiting.
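
For what it's worth, the WAL that standby2 streamed but has not yet 
replayed can be inspected on disk; a sketch only, and the segment name 
is an assumption based on the 1MB-segment setup in the repro (use the 
ls step to find the real one):

```bash
# List standby2's on-disk WAL, then dump the tail of a timeline-1
# segment to see the records it would replay again after a restart.
ls standby2/pg_wal/
pg_waldump standby2/pg_wal/000000010000000000000001 | tail -n 5
```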

-- 
Heikki Linnakangas
Neon (https://neon.tech)

On 08/08/2024 12:38, Georgy Shelkovy wrote:
> On second run I got bug

Can you show the full pg_controldata output please? See attached script.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

On 08/08/2024 14:58, Georgy Shelkovy wrote:
> this is a good log of a previous run, for comparison
Thanks! That confirms my theory that the minRecoveryPoint in the "bad" 
case is at the LSN where the histories diverged, while in the "good" 
case, it's later. So pg_rewind is not wrong when it says that no rewind 
is required. It's still confusing though. Here's a visualization of the 
scenarios:

Legend:

TLI 1: this is the WAL produced on the master
TLI 2: WAL produced on standby1
*: point of divergence
^: minRecoveryPoint on standby2

Good:

             -----------  TLI 2
            /
----------*------------  TLI 1
                    ^
Bad:

             -----------  TLI 2
            /
----------*------------ TLI 1
           ^

There's a third possibility, which actually produces an assertion 
failure. I was able to reproduce this case by adding some sleeps in the 
script and in the walreceiver startup code:


             -----------  TLI 2
            /
----------*------------ TLI 1
     ^

pg_rewind: Source timeline history:
pg_rewind: 1: 0/0 - 0/1138F00
pg_rewind: 2: 0/1138F00 - 0/0
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: servers diverged at WAL location 0/1138F00 on timeline 1
pg_rewind: ../src/bin/pg_rewind/pg_rewind.c:443: int main(int, char **): Assertion `target_wal_endrec == divergerec' failed.

Except for the assertion failure, I think that's essentially the same as 
the "Bad" case. On a non-assertion build, pg_rewind would report "no 
rewind required" which seems correct.


So it seems true that rewind is not required in those cases. However, if 
the WAL is already written on the standby's disk, just not replayed yet, 
then when you restart the server, it will replay the WAL from timeline 
1. That does seem surprising. Perhaps pg_rewind should just update the 
minRecoveryPoint and minRecoveryPointTLI in the control file in that 
case, to force WAL recovery to follow the timeline switch to TLI 2.
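
If pg_rewind were changed to do that, the effect would be visible in the 
control file; a sketch of how one could verify it (both labels are real 
pg_controldata output lines):

```bash
# After the rewind, check that startup recovery will be forced to follow
# the new timeline rather than stopping on TLI 1.
pg_controldata standby2 |
    grep -E "Minimum recovery ending location|Min recovery ending loc's timeline"
```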

I will try to write a TAP test for the "Bad" and the assertion failure 
case, fix the assertion failure, and test how updating the 
minRecoveryPoint would behave.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

On 08/08/2024 16:07, Heikki Linnakangas wrote:
> So it seems true that rewind is not required in those cases. However, if
> the WAL is already written on the standby's disk, just not replayed yet,
> then when you restart the server, it will replay the WAL from timeline
> 1. That does seem surprising. Perhaps pg_rewind should just update the
> minRecoveryPoint and minRecoveryTLI in control file in that case, to
> force WAL recovery to follow the timeline switch to TLI 2.
> 
> I will try to write a TAP test for the "Bad" and the assertion failure
> case, fix the assertion failure, and test how updating the
> minRecoveryPoint would behave.

I've divided this into three separate issues:

1. I think the assertion in pg_rewind is simply bogus, and we should 
remove it. The attached 0002-Remove-bogus-assertion-in-pg_rewind.patch 
does that and adds a test case to cover it.

2. Independently of pg_rewind: When you start PostgreSQL, it will first 
try to recover all the WAL it has locally in pg_wal. That goes wrong if 
you have set a recovery target TLI. For example, imagine this situation:

- Recovery target TLI is 2, set explicitly in postgresql.conf
- The switchpoint from TLI 1 to 2 happened at WAL position 0/1510198 
(the switchpoint is found in 00000002.history)
- There is a WAL file 000000010000000000000001 under pg_wal, which 
contains valid WAL up to 0/1590000

When you start the server, it will first recover all the WAL from 
000000010000000000000001, up to 0/1590000. Then it will connect to the 
primary to fetch more WAL, but it will fail to make any progress because 
it already blew past the switch point.
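
A minimal sketch of the configuration that produces this situation; the 
LSN comes from the example above, and $PGDATA stands in for the data 
directory of the server being started:

```bash
# Pin recovery to timeline 2 explicitly, as in the scenario above.
cat >> "$PGDATA/postgresql.conf" <<EOF
recovery_target_timeline = '2'
EOF
# The 1->2 switchpoint is recorded in the history file, e.g.:
#   1	0/1510198	no recovery target specified
cat "$PGDATA/pg_wal/00000002.history"
```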

It's obviously wrong to replay the WAL from timeline 1 beyond the 1->2 
switchpoint, when the recovery target is TLI 2. The attached 
0003-Don-t-read-past-current-TLI-during-archive-recovery.patch fixes 
that. However, the logic to find the right WAL segment file and read the 
WAL is extremely complicated, and I don't feel comfortable that I got 
all the cases right. Review would be highly appreciated.

The patch includes a test case to demonstrate the scenario, with no 
pg_rewind involved. It does include one "manual" step to copy a timeline 
history file into pg_wal, marked with XXX, however, so I'm not sure how 
plausible this scenario is in production setups.

3. When pg_rewind has nothing to do, the target server is left 
unmodified, in a state such that when you restart it, it will replay all 
the WAL it has locally in pg_wal first, before connecting to the 
primary. Even though the target is a direct ancestor of the source, and 
hence *can* follow the WAL to the source's position without rewinding, 
that doesn't mean it *will* actually do so.

The attached patch changes it so that it updates the control file in 
that case, setting minRecoveryPoint and minRecoveryPointTLI to point to 
the source's current WAL position. That way, when you start it up, it 
will follow the timeline history to reach that point. (This requires 
fixing issue 2, because otherwise it still won't follow the history 
correctly to reach the minRecoveryPointTLI.)

To make the test pass, it actually would be sufficient to copy the 
timeline history file into pg_wal (which the patch also does). But 
updating the minRecoveryPoint seems good to ensure that it follows the 
right timeline. Otherwise it relies on the startup logic to find the 
latest timeline, and on that timeline being the one you tried to rewind 
to; I think it would go wrong if there were another higher-numbered 
history file present in pg_wal for some reason.
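
As a manual approximation, in terms of the repro script's directories (a 
sketch only; the patch performs this step as part of the rewind itself):

```bash
# Make the rewind target aware of the 1->2 timeline switch, so that
# startup recovery follows TLI 2 instead of replaying its local TLI 1
# WAL to the end. The patch automates this.
cp standby1/pg_wal/00000002.history standby2/pg_wal/
```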


All in all, I don't feel very confident about all this. The assertion 
fix seems straightforward, so barring objections I'll commit and 
backpatch that. The timeline-following at startup (issue 2) seems pretty 
clearly wrong, but I'm not sure it's worth the risk to backpatch. 
Similarly, issue 3 might not be worth the risk to backpatch, especially 
if we don't also backpatch 2. I would love to hear comments on those.

Georgy, if you have the possibility to test these patches with your 
repro script, that would be highly appreciated.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
