Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done. - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
Date
Msg-id fe7af87d-b9bf-481a-902c-089c49aa911d@iki.fi
Whole thread Raw
In response to Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-bugs
On 08/08/2024 16:07, Heikki Linnakangas wrote:
> So it seems true that rewind is not required in those cases. However, if
> the WAL is already written on the standby's disk, just not replayed yet,
> then when you restart the server, it will replay the WAL from timeline
> 1. That does seem surprising. Perhaps pg_rewind should just update the
> minRecoveryPoint and minRecoveryTLI in control file in that case, to
> force WAL recovery to follow the timeline switch to TLI 2.
> 
> I will try to write a TAP test for the "Bad" and the assertion failure
> case, fix the assertion failure, and test how updating the
> minRecoveryPoint would behave.

I've divided this into three separate issues:

1. I think the assertion in pg_rewind is simply bogus, and we should 
remove it. Attached 0002-Remove-bogus-assertion-in-pg_rewind.patch does 
that, and adds a test case to cover it.

2. Independently of pg_rewind: When you start PostgreSQL, it will first 
try to recover all the WAL it has locally in pg_wal. That goes wrong if 
you have set a recovery target TLI. For example, imagine this situation:

- Recovery target TLI is 2, set explicitly in postgresql.conf
- The switchpoint from TLI 1 to 2 happened at WAL position 0/1510198 
(the switchpoint is found in 00000002.history)
- There is a WAL file 000000010000000000000001 under pg_wal, which 
contains valid WAL up to 0/1590000

When you start the server, it will first recover all the WAL from 
000000010000000000000001, up to 0/1590000. Then it will connect to the 
primary to fetch mor WAL, but it will fail to make any progress because 
it already blew past the switch point.

It's obviously wrong to replay the WAL from timeline 1 beyond the 1->2 
switchpoint, when the recovery target is TLI 2. The attached 
0003-Don-t-read-past-current-TLI-during-archive-recovery.patch fixes 
that. However, the logic to find the right WAL segment file and read the 
WAL is extremely complicated, and I don't feel comfortable that I got 
all the cases right. Review would be highly appreciated.

The patch includes a test case to demonstrate the case, with no 
pg_rewind. It does include one "manual" step to copy a timeline history 
file into pg_wal, marked with XXX, however. So I'm not sure how possible 
this scenario is in production setups .

3. When pg_rewind has nothing to do, the target server is left 
unmodified, in a state such that when you restart it, it will replay all 
the WAL it has locally in pg_wal first, before connecting to the 
primary. Even though the target is a direct ancestor of the source and 
hence it *can* follow the WAL to the source's position without 
rewinding, it doesn't mean that it *will* actually do so.

The attached  changes it so that it updates the control file in that 
case, setting minRecoveryPoint and minRecoveryPointTLI to point to the 
source's current WAL position. That way, when you start it up, it will 
follow the timeline history to reach that point. (This requires fixing 
issue 2, because otherwise it still won't follow the history correctly 
to reach the minRecoveryPointTLI)

To make the test pass, it actually would be sufficient to copy the 
timeline history file into pg_wal (which the patch also does). But 
updating the minRecoveryPoint seems good to ensure that it follows the 
right timeline. Otherwise it relies on the logic at startup to find the 
latest timeline, and that the latest timeline is the one you tried to 
rewind to. I think it would go wrong if there was another 
higher-numbered history file present in pg_wal for some reason.


All in all, I don't feel very confident about all this. The assertion 
seems straightforward, so barring objections I'll commit and backpatch 
that. The timeline-following at startup (issue 2) seems pretty clearly 
wrong, but I'm not sure it's worth the risk to backpatch. Similarly 
issue 3 might not be worth the risk to backpatch, especially if we don't 
also backpatch 2. I would love to hear comments on those.

Georgy, if you have the possibility to test this patches with your repro 
script, that would be highly appreciated.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #18576: Using EXPLAIN (VERBOSE) in information_schema.element_types returns ERROR: failed to find plan for
Next
From: Aleš Zelený
Date:
Subject: Re: BUG #18573: Analyze command consumes several GB of memory - more than analyzed table size