Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done. - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
Date
Msg-id 092dd515-b7b4-4fd0-8407-ceca2f02f6ec@iki.fi
Whole thread Raw
In response to BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
On 07/08/2024 15:19, PG Bug reporting form wrote:
> Sometimes pg_rewind mistakenly assumes that nothing needs to be done, which
> results in the replica having data that is not on the master.
> 
> ...
> + pg_rewind --progress --debug '--source-pgdata=standby1'
> '--target-pgdata=standby2'
> pg_rewind: Source timeline history:
> pg_rewind: Target timeline history:
> pg_rewind: 1: 0/0 - 0/0
> pg_rewind: servers diverged at WAL location 0/10392E8 on timeline 1
> pg_rewind: no rewind required
>
> ...
> Same query
> + psql -d postgres -a --no-psqlrc -p 5001
> select count(*) from a;
>   count
> --------
>   100000
> (1 row)
> 
> + echo 'Different results'
> Different results
> + psql -d postgres -a --no-psqlrc -p 5002
> select count(*) from a;
>   count
> --------
>   100340
> (1 row)
> 
> + echo 'Stop standby1'
> Stop standby1
> + pg_ctl -w -D standby1 stop -m fast
> waiting for server to shut down....2024-08-07 17:15:03.077 +05 [19703]
> DEBUG:  logger shutting down
>   done
> server stopped
> + echo 'Stop standby2'
> Stop standby2
> + pg_ctl -w -D standby2 stop -m fast
> waiting for server to shut down....2024-08-07 17:15:03.117 +05 [19713]
> DEBUG:  logger shutting down
>   done
> server stopped
> ```

I cannot reproduce this. On my laptop, the reported "servers diverged at 
WAL location" is always higher, and it performs rewind as expected:

g_rewind: Source timeline history:
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: servers diverged at WAL location 0/1138F00 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/1138E88 on timeline 1

However while looking at the code, I noticed that the debug-output of 
the Source timeline history is broken. See attached patch to fix it. 
With that fix, I get:

pg_rewind: Source timeline history:
pg_rewind: 1: 0/0 - 0/1138F00
pg_rewind: 2: 0/1138F00 - 0/0
pg_rewind: Target timeline history:
pg_rewind: 1: 0/0 - 0/0
pg_rewind: 0 a: 1 0/0-0/1138F00 - b: 1 0/0-0/0
pg_rewind: servers diverged at WAL location 0/1138F00 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/1138E88 on timeline 1

That just fixes the debugging output, though, it doesn't change what it 
actually does.

Can you try the attached fix show the output you get with that please? 
Or alternatively, show the contents of the 
standby1/pg_wal/00000002.history file.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #18577: Miss out part invalItems in saved plansource of one CallStmt
Next
From: jian he
Date:
Subject: Re: BUG #18568: BUG: Result wrong when do group by on partition table!