Hi,
On 2024-12-18 10:38:19 -0600, Nathan Bossart wrote:
> On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote:
> > We recently observed a few cases where Postgres running on Linux
> > encountered an issue with WAL segment files. Specifically, two WAL
> > segments were linked to the same physical file after Postgres ran out
> > of memory and the OOM killer terminated one of its processes. This
> > resulted in the WAL segments overwriting each other and Postgres
> > failing a later recovery.
>
> Yikes!
Indeed. As chance would have it, I was asked for input on a corrupted server
*today*. Eventually we found that recovery stopped early, after encountering a
segment with a *newer* pageaddr than we expected. That made me think of this
issue, and indeed, the file recovery stopped at had two links. Before that
the server had been crashing on a regular basis for unrelated reasons, which
presumably increased the chances enough to eventually hit this problem.
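(For anyone wanting to check a cluster for this: something along the lines of
the sketch below, stat()ing every file in pg_wal and flagging link counts
above 1, should find affected segments. Just a sketch, and the directory path
is a placeholder - adjust for the installation at hand.)

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Sketch: list WAL segments that still share an inode with another
 * name, i.e. where the unlink() after the link() never happened.
 * The pg_wal path is a placeholder, adjust as needed. */
int
main(void)
{
	const char *waldir = "/var/lib/postgresql/15/main/pg_wal";
	DIR		   *dir = opendir(waldir);
	struct dirent *de;

	if (dir == NULL)
	{
		perror(waldir);
		return 1;
	}

	while ((de = readdir(dir)) != NULL)
	{
		char		path[4096];
		struct stat st;

		snprintf(path, sizeof(path), "%s/%s", waldir, de->d_name);
		if (stat(path, &st) == 0 && S_ISREG(st.st_mode) &&
			st.st_nlink > 1)
			printf("%s: %ld links, inode %ld\n",
				   de->d_name, (long) st.st_nlink, (long) st.st_ino);
	}
	closedir(dir);
	return 0;
}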
It's a normal thing to discover the end of the WAL by finding a segment that
has an older pageaddr than its name suggests, since a recycled segment keeps
its old contents until it is overwritten. But in this case we saw a *newer*
pageaddr. I wonder if we should treat that differently...
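Roughly the distinction I'm thinking of, as a sketch only (not the actual
xlogreader.c logic, names invented):

#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Sketch only, names invented.  'expected' is the page address implied
 * by the segment's file name plus the page's offset in the segment.
 */
typedef enum
{
	WAL_PAGE_VALID,			/* pageaddr matches, keep replaying */
	WAL_PAGE_END_OF_WAL,	/* older pageaddr: a recycled segment's
							 * leftover contents, the benign case */
	WAL_PAGE_SUSPICIOUS		/* newer pageaddr: recycling can't produce
							 * this, so e.g. a stray hard link; could
							 * warrant a louder error than "end of WAL" */
} WalPageVerdict;

static WalPageVerdict
classify_pageaddr(XLogRecPtr pageaddr, XLogRecPtr expected)
{
	if (pageaddr == expected)
		return WAL_PAGE_VALID;
	if (pageaddr < expected)
		return WAL_PAGE_END_OF_WAL;
	return WAL_PAGE_SUSPICIOUS;
}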
> > We found this fix [1] that has been applied to Postgres 16, but the
> > cases we observed were running Postgres 15. Given that older major
> > versions will be supported for a good number of years, and the
> > potential for irrecoverability exists (even if rare), we would like to
> > discuss the possibility of back-patching this fix.
>
> IMHO this is a good time to reevaluate. It looks like we originally didn't
> back-patch out of an abundance of caution, but now that this one has had
> time to bake, I think it's worth seriously considering, especially now that
> we have a report from the field.
Strongly agreed.
I don't think the issue is actually quite as unlikely to be hit as reasoned in
the commit message. The crash does indeed have to happen between the link() and
the unlink() - but at the end of a checkpoint we do that pair of operations
hundreds of times in a row on a busy server. And that's right after potentially
doing lots of write IO during the checkpoint, filling up drive write caches /
eating up IOPS/bandwidth disk quotas.
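To spell the window out, the pre-16 install path amounted to a link()/unlink()
pair, roughly like this simplified sketch (fsyncs and error details omitted):

#include <unistd.h>

/*
 * Simplified sketch of what durable_rename_excl() amounted to before
 * the fix (fsyncs omitted).  A crash - e.g. an OOM kill - after the
 * link() but before the unlink() leaves both names pointing at the
 * same inode: two WAL segment names, one physical file.
 */
static int
install_wal_segment(const char *tmppath, const char *path)
{
	if (link(tmppath, path) < 0)	/* from here on, both names exist */
		return -1;
	if (unlink(tmppath) < 0)		/* crash before this => two links */
		return -1;
	return 0;
}

The fixed code path instead goes through durable_rename(), i.e. an atomic
rename(), which can't leave two names behind no matter where the crash lands.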
Greetings,
Andres Freund