On Wed, Sep 27, 2023 at 11:06:37AM +1300, Thomas Munro wrote:
> I don't have an opinion yet on your other thread about making this
> stuff configurable for replicas, but for the simple crash recovery
> case shown here, hard failure makes sense to me.
Also, if we conclude that we're OK with just failing hard all the time
for crash recovery and archive recovery on OOM, the other patch is not
really required. That would be disruptive for standbys in some cases,
though perhaps still OK in the long term. I am wondering whether people
have actually lost data because of this problem on production systems. It
would not be possible to know that it happened until you see a page on
disk that has a somewhat valid LSN, but still an LSN older than the
position currently being inserted, and that could show up in various
forms. Even that could get hidden quickly if WAL is written at a fast
pace after crash recovery. Promoting a standby at an older LSN would
be unlikely, as monitoring solutions discard standbys that lag behind
by more than N bytes.
> *A more detailed analysis would talk about sectors (page header is
> atomic), and consider whether we're only trying to defend ourselves
> against recycled pages written by PostgreSQL (yes), arbitrary random
> data (no, but it's probably still pretty good) or someone trying to
> trick us (no, and we don't stand a chance).
WAL would not be the only part of the system that would get borked if
arbitrary bytes can be inserted into what's read from disk, random or
not.
--
Michael