Re: FSM Corruption (was: Could not read block at end of the relation) - Mailing list pgsql-bugs

From Noah Misch
Subject Re: FSM Corruption (was: Could not read block at end of the relation)
Msg-id 20240304190312.b6.nmisch@google.com
In response to Re: FSM Corruption (was: Could not read block at end of the relation)  (Ronan Dunklau <ronan.dunklau@aiven.io>)
Responses Re: FSM Corruption (was: Could not read block at end of the relation)  (Ronan Dunklau <ronan.dunklau@aiven.io>)
List pgsql-bugs
On Mon, Mar 04, 2024 at 02:10:39PM +0100, Ronan Dunklau wrote:
> On Monday, March 4, 2024, 00:47:15 CET, Noah Misch wrote:
> > On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> > > - happens during heavy system load
> > > - lots of concurrent writes happening on a table
> > > - often (but haven't been able to confirm it is necessary), a vacuum is
> > > running on the table at the same time the error is triggered

> Looking at when the corruption was WAL-logged, this particular case is quite 
> easy to trace. We have a few MULTI-INSERTS+INIT initially loading the table 
> (probably a pg_restore), then, 2GB of WAL later, what looks like a VACUUM 
> running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE, VISIBLE xlog 
> records for each block of the relation's main fork, followed by a lonely FPI for the 
> leaf page of its FSM:

You're using data_checksums, right?  Thanks for the wal dump excerpts; I agree
with this summary thereof.

> There are no traces of relation truncation happening in the WAL.

That is notable.

> This case only shows a single invalid entry in the FSM, but I've noticed as 
> many as 62 blocks present in the FSM while they do not exist on disk, all 
> tagged with MaxFSMRequestSize so I suppose something is wrong with the bulk 
> extension mechanism.

Is this happening after an OS crash, a replica promote, or a PITR restore?  If
so, I think I see the problem.  We have an undocumented rule that FSM shall
not contain references to pages past the end of the relation.  To facilitate
that, relation truncation WAL-logs FSM truncate.  However, there's no similar
protection for relation extension, which is not WAL-logged.  We break the rule
whenever we write FSM for block X before some WAL record initializes block X.
data_checksums makes the trouble easier to hit, since it creates FPI_FOR_HINT
records for FSM changes.  A replica promote or PITR ending just after the FSM
FPI_FOR_HINT would yield this broken state.  While v16 RelationAddBlocks()
made this easier to hit, I suspect it's reproducible in all supported
branches.  For example, lazy_scan_new_or_empty() and multiple index AMs break
the rule via RecordPageWithFreeSpace() on a PageIsNew() page.
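
To make the ordering problem concrete, here is a toy model of it, not
PostgreSQL code.  Every name in it (toy_wal_rec, toy_replay, toy_rule_broken)
is hypothetical and exists only for illustration.  It assumes a WAL stream in
which an FSM full-page image mentioning block X appears before the record
that initializes block X in the main fork, and it shows that ending replay
between the two leaves the FSM pointing past the end of the relation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in PostgreSQL. */
enum toy_rec_kind { TOY_FSM_FPI, TOY_MAIN_INIT };

struct toy_wal_rec {
    enum toy_rec_kind kind;
    unsigned block;             /* main-fork block the record refers to */
};

struct toy_state {
    unsigned nblocks;           /* main-fork length after replay */
    unsigned fsm_max_block;     /* highest block the FSM advertises */
    bool fsm_has_entry;         /* false until some FSM record is replayed */
};

/* Replay the first n records, as a promote or PITR stopping there would. */
static struct toy_state
toy_replay(const struct toy_wal_rec *recs, size_t n, unsigned start_nblocks)
{
    struct toy_state st = { start_nblocks, 0, false };

    assert(recs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (recs[i].kind == TOY_MAIN_INIT) {
            /* Initializing block X extends the main fork to cover X. */
            if (recs[i].block >= st.nblocks)
                st.nblocks = recs[i].block + 1;
        } else {
            /* The FSM image advertises free space in block X. */
            if (!st.fsm_has_entry || recs[i].block > st.fsm_max_block)
                st.fsm_max_block = recs[i].block;
            st.fsm_has_entry = true;
        }
    }
    return st;
}

/* True when the FSM references a block past the end of the relation. */
static bool
toy_rule_broken(struct toy_state st)
{
    return st.fsm_has_entry && st.fsm_max_block >= st.nblocks;
}
```

With a 10-block relation and the stream { FSM image for block 10, init of
block 10 }, replaying only the first record reports the rule as broken, while
replaying both does not.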

I think the fix is one of:

- Revoke the undocumented rule.  Make FSM consumers resilient to the FSM
  returning a now-too-large block number.

- Enforce a new "main-fork WAL before FSM" rule for logged rels.  For example,
  in each PageIsNew() case, either don't update FSM or WAL-log an init (like
  lazy_scan_new_or_empty() does when PageIsEmpty()).
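
As a rough illustration of the first option, the sketch below shows what
"resilient to a now-too-large block number" could mean for a consumer.  It is
a toy model under stated assumptions, not a proposed patch: the flat array
standing in for the FSM and the names toy_fsm, toy_fsm_search and
toy_get_page_with_free_space are all hypothetical.  The consumer compares
each candidate against the current relation length, clears any entry that
points past it, and retries.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_INVALID_BLOCK ((unsigned) -1)

/* Toy FSM: avail[b] is the free-space category recorded for block b. */
struct toy_fsm {
    unsigned char *avail;
    size_t nentries;
};

/* Return the first block advertising at least `need`, or TOY_INVALID_BLOCK. */
static unsigned
toy_fsm_search(const struct toy_fsm *fsm, unsigned char need)
{
    for (size_t b = 0; b < fsm->nentries; b++)
        if (fsm->avail[b] >= need)
            return (unsigned) b;
    return TOY_INVALID_BLOCK;
}

/*
 * Resilient consumer: never hand back a block at or past nblocks.  A stale
 * entry is zeroed so later searches skip it.  The loop terminates because
 * each retry clears one entry whose value was >= need > 0.
 */
static unsigned
toy_get_page_with_free_space(struct toy_fsm *fsm, unsigned nblocks,
                             unsigned char need)
{
    assert(need > 0);
    for (;;) {
        unsigned blk = toy_fsm_search(fsm, need);

        if (blk == TOY_INVALID_BLOCK || blk < nblocks)
            return blk;
        fsm->avail[blk] = 0;    /* entry points past the end of the relation */
    }
}
```

For a 3-block relation whose toy FSM carries a stale maximum-size entry for
block 3, a request that only block 3 could satisfy comes back as
TOY_INVALID_BLOCK (so the caller would extend the relation instead), and the
stale entry is gone afterwards.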


