On Wed, Jul 10, 2024 at 5:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Here is a draft patch that attempts to fix this problem. I'm not
> certain that it's completely correct, but it does seem to fix the
> reported issue.

I tried to write a test case for this and discovered that there are
actually two separate problems in this area. First, as shown by the
assertion failure reported by Fujii Masao, the WAL summarizer thinks
that it should never need to back up to an earlier LSN, and the test
case he provided shows that this is incorrect. Second, the WAL
summarizer can end up in a bad state after the startup process renames
the last WAL file on the old timeline to a .partial file. If this
happens before the file has been summarized, then the WAL summarizer
can't access it any more and errors out. Promotion also removes WAL
files from the old timeline completely, but only if they're after the
switch point, and summarization doesn't care about those anyway. So
the partial file seems to be the only problem case.
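
To make the first problem concrete, here is a toy model in Python. It is not the patch logic, just a sketch under stated assumptions: LSNs are plain integers, and SummarizerPosition and resume_position are hypothetical names that do not exist in PostgreSQL. It shows why the resume point can be an earlier LSN: WAL read on the old timeline past the switch point is not part of the new timeline's history.

```python
from dataclasses import dataclass


@dataclass
class SummarizerPosition:
    """Hypothetical model of where the summarizer will read next."""
    tli: int  # timeline being summarized
    lsn: int  # next LSN to summarize (a plain integer in this sketch)


def resume_position(pos: SummarizerPosition, new_tli: int,
                    switch_lsn: int) -> SummarizerPosition:
    """Return where reading should resume once a timeline switch is known.

    Assumption: switch_lsn is the point at which new_tli branched off
    from pos.tli.
    """
    if pos.lsn < switch_lsn:
        # Still short of the switch point: keep reading the old timeline.
        return pos
    # At or beyond the switch point: continue on the new timeline from
    # the switch point, which may mean moving to an EARLIER LSN.
    return SummarizerPosition(tli=new_tli, lsn=switch_lsn)
```

For example, a summarizer that has read up to 0x3000 on timeline 1 when timeline 2 branched off at 0x2000 resumes at 0x2000 on timeline 2, which is the backward move that the old assertion ruled out.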
In theory, the problem with the partial file could be handled in a
variety of ways: we could teach summarization to read the partial
file, perhaps, or postpone adding the .partial suffix until after
summarization has happened. But in practice, given where we are in the
release cycle, the only reasonable approach that I can see is to have
promotion wait for summarization to catch up, so that's what I did in
0003.
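
The ordering that "promotion waits for summarization to catch up" implies can be sketched as follows. This is an illustrative Python model only, not the code in 0003: SummarizerState, promote, and rename_to_partial are hypothetical names, LSNs are plain integers, and threading.Condition stands in for whatever wait mechanism the server actually uses.

```python
import threading


class SummarizerState:
    """Hypothetical shared state: how far summarization has progressed."""

    def __init__(self):
        self._cond = threading.Condition()
        self.summarized_upto = 0  # highest LSN summarized so far

    def advance(self, lsn):
        """Called by the summarizer after it finishes work up to lsn."""
        with self._cond:
            self.summarized_upto = max(self.summarized_upto, lsn)
            self._cond.notify_all()

    def wait_for(self, target_lsn, timeout=None):
        """Block until summarized_upto >= target_lsn; False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self.summarized_upto >= target_lsn, timeout=timeout)


def promote(state, switch_lsn, rename_to_partial, timeout=None):
    """Rename the last old-timeline file only after summarization caught up."""
    if not state.wait_for(switch_lsn, timeout):
        raise TimeoutError("summarizer did not reach the switch point")
    # Safe now: the summarizer no longer needs the original file name.
    rename_to_partial()
```

The point of the sketch is the ordering alone: the rename callback cannot run until the summarizer has reported progress at or beyond the switch point.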
0002 is the same as what I posted previously as 0001, and teaches the
summarizer about backing up when we switch timelines. 0001 adds a
missing call to ConditionVariableCancelSleep; AFAIK, that omission has
no important consequences, but it still seems like it should be fixed.

--
Robert Haas
EDB: http://www.enterprisedb.com