Re: Corruption during WAL replay - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Corruption during WAL replay
Date
Msg-id 3192026.1648185780@sss.pgh.pa.us
Whole thread Raw
In response to Re: Corruption during WAL replay  (Andres Freund <andres@anarazel.de>)
Responses Re: Corruption during WAL replay  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> I do see that the LSN that ends up on the page is the same across a few runs
> of the test on serinus. Which presumably differs between different
> animals. Surprised that it's this predictable - but I guess the run is short
> enough that there's no variation due to autovacuum, checkpoints etc.

Uh-huh.  I'm not surprised that it's repeatable on a given animal.
What remains to be explained:

1. Why'd it start failing now?  I'm guessing that ce95c5437 *was* the
culprit after all, by slightly changing the amount of catalog data
written during initdb, and thus moving the initial LSN.

2. Why just these two animals?  If initial LSN is the critical thing,
then the results of "locale -a" would affect it, so platform
dependence is hardly surprising ... but I'd have thought that all
the animals on that host would use the same initial set of
collations.  OTOH, I see petalura and pogona just fell over too.
Do you have some of those animals --with-icu and others not?

> 16bit checksums for the win.

Yay :-(

As for a fix, would damaging more of the page help?  I guess
it'd just move around the one-in-64K chance of failure.
Maybe we have to intentionally corrupt (e.g. invert) the
checksum field specifically.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: shared-memory based stats collector - v66
Next
From: "wangw.fnst@fujitsu.com"
Date:
Subject: RE: Logical replication timeout problem