On Thu, Mar 06, 2025 at 01:44:30PM -0600, Nathan Bossart wrote:
> On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote:
>> 1. Make v14 and v13 skip WAL recycling and preallocation during archive
>> recovery, like newer branches do. I think that means back-patching the six
>> commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.
>>
>> 2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
>> Avoid multiple hard links by making durable_rename_excl() first rename
>> oldfile to a temporary name. (I haven't thought this through in detail.
>> It may not suffice.)
>>
>> I'm leaning toward (1) at least enough to see how messy the back-patch would
>> be, since I don't like risks of designing old-branch-specific solutions when
>> v15/v16/v17 have a proven solution. What else should we be thinking about
>> before deciding?
>
> I agree with first attempting a back-patch. All of this stuff is from
> 2021-2022, so it's had time to bake and is IMHO lower risk.
Sorry for the late reply here (life things and the timing of Noah's
report).
Discarding the option where we would design a new solution specific to
v14 and v13, I don't really see a different third option here.
Choice 2 means that we'd just leave v13 and v14 exposed to
corruptions, and that the message to our user base is that they should
upgrade to v15 to get that fixed, even if we still have roughly two
more 2 years of official community support, as they would still be
exposed to the risk of having multiple links pointing to a single WAL
segment, creating corruption. As reported upthread, that can be
reached.
Choice 1 would be the logical path, even if it may not be the
practical one in terms of backpatch risks vs benefits. It would be
much better than recommending people that they have to upgrade to 15
at least to make sure that their version of Postgres is safer to use.
Saying that, I've looked at the whole for a good part of the day. In
15~, one thing that may make a backpatch messier is related to
recovery and the fact that Robert has improved/refactored the area
quite a bit, tweaking the code so as we rely on slightly different
assumptions in 13/14 compared to 15~. There is a piece around
ThisTimeLineID, for example, which is something that
InstallXLogFileSegment() uses and also something that's touched by
1f95181b44c8. The impact of these pieces of refactorings specific to
v15 is surprisingly lower than I thought it would.
Some notes about the six commits you are mentioning, looking at them
one by one on the 13 and 14 stable branches.
First, for v14:
cc2c7d6~4: some bumps with the removal of use_lock with one flag.
Looks OK otherwise.
cc2c7d6~3: Looks OK. Based on what this addresses, this is low-risk.
cc2c7d6~2: XLogWalRcvWrite() has a conflict as an effect of
XLogWalRcvWrite() that can be given a "tli". It still seems correct
to me to use "recvFileTLI = ThisTimeLineID".
cc2c7d6~1: 8ad6c5dbbe5a needs to be reverted first. Commit
8ad6c5dbbe5a has been introduced as a workaround for what we would
backpatch here. One conflict in walreceiver.c.
cc2c7d6: Does not conflict, which was a bit surprising, TBH.
043_no_contrecord_switch.pl gets unhappy here, on an assertion failure
in WaitForWALToBecomeAvailable() about InstallXLogFileSegmentActive.
e36cbef: Previous failure fixed by this one, with the same symptoms
as reported in the original thread that led to this commit.
Then, for v13:
cc2c7d6~4: One conflict with InstallXLogFileSegment() in xlog.c,
nothing huge.
cc2c7d6~3: Looks OK.
cc2c7d6~2: Looks OK.
cc2c7d6~1: With 8ad6c5dbbe5a reverted, same as the v14 part.
cc2c7d6: Does not conflict. 043_no_contrecord_switch.pl gets unhappy
here, same as above.
e36cbef: Same as above.
Noah, all these improvements are something you have directly worked
on in 15~. I'm hoping not to have missed something, even if I've
looked at the paths manipulating the WAL segments changed here on v13
and v14. Could it be possible for you to double-check and also run
more tests if your enviroments help? Perhaps this needs specific
prerequisites for recovery? We need more confidence in the solution
here, even if it's proving to work for v15, we may still have gaps
specific to v14 and/or v13.
I am attaching a full set of patches for v14 and v13 that can be used
for these tests. WDYT?
--
Michael