Re: FSM Corruption (was: Could not read block at end of the relation) - Mailing list pgsql-bugs

From: Noah Misch
Subject: Re: FSM Corruption (was: Could not read block at end of the relation)
Date:
Msg-id: 20240304230503.a4.nmisch@google.com
In response to: Re: FSM Corruption (was: Could not read block at end of the relation) (Ronan Dunklau <ronan.dunklau@aiven.io>)
Responses: Re: FSM Corruption (was: Could not read block at end of the relation) (Ronan Dunklau <ronan.dunklau@aiven.io>)
List: pgsql-bugs
On Mon, Mar 04, 2024 at 11:21:07PM +0100, Ronan Dunklau wrote:
> On Monday, March 4, 2024 at 20:03:12 CET, Noah Misch wrote:
> > Is this happening after an OS crash, a replica promote, or a PITR restore? 
> 
> I need to double check all occurrences, but I wouldn't be surprised if they
> all went through a replica promotion.
> 
> > If so, I think I see the problem.  We have an undocumented rule that FSM
> > shall not contain references to pages past the end of the relation.  To
> > facilitate that, relation truncation WAL-logs FSM truncate.  However,
> > there's no similar protection for relation extension, which is not
> > WAL-logged.  We break the rule whenever we write FSM for block X before
> > some WAL record initializes block X. data_checksums makes the trouble
> > easier to hit, since it creates FPI_FOR_HINT records for FSM changes.  A
> > replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield
> > this broken state.  While v16 RelationAddBlocks() made this easier to hit,
> > I suspect it's reproducible in all supported branches.  For example,
> > lazy_scan_new_or_empty() and multiple index AMs break the rule via
> > RecordPageWithFreeSpace() on a PageIsNew() page.
> 
> Very interesting. I understand the reasoning, and am now able to craft a very 
> crude test case, which I will refine into something more actionable: 
> 
>  - start a session 1, which will just vacuum our table continuously
>  - start a session 2, which will issue checkpoints continuously
>  - in a session 3, COPY enough data so that we preallocate too many pages. In
> my example, I'm copying 33150 single-integer rows, which fit on a bit more
> than 162 pages. The idea behind that is to make sure we don't land on a round
> number of pages, and that several iterations of the extend mechanism run,
> issuing a full-page write of the FSM.
> 
> So to trigger the bug, it seems to me one needs to run in parallel:
>  - COPY
>  - VACUUM
>  - CHECKPOINT
> 
> Which would explain why a surprisingly large number of occurrences I've seen
> seemed to involve a pg_restore invocation.

Does that procedure alone reproduce the ERROR, without any replica promote,
postmaster restart, etc.?  That, I do not have reason to expect.
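
For concreteness, that three-session procedure could be scripted roughly as
below (a libpq sketch; the table name fsm_test and the empty conninfo are
placeholders I made up).  Per the above, I'd expect to still need a promote
or crash recovery afterward before the ERROR can appear:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <libpq-fe.h>

static PGconn *
connect_or_die(const char *conninfo)
{
    PGconn *conn = PQconnectdb(conninfo);

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        exit(1);
    }
    return conn;
}

/* Child process: run one SQL command in a tight loop. */
static void
loop_command(const char *conninfo, const char *sql)
{
    PGconn *conn = connect_or_die(conninfo);

    for (;;)
        PQclear(PQexec(conn, sql));
}

int
main(void)
{
    const char *conninfo = "";  /* placeholder: default connection */
    PGconn     *conn = connect_or_die(conninfo);
    PGresult   *res;
    pid_t       vacuumer, checkpointer;

    PQclear(PQexec(conn, "DROP TABLE IF EXISTS fsm_test"));
    PQclear(PQexec(conn, "CREATE TABLE fsm_test (i int)"));

    /* session 1: vacuum the table continuously */
    if ((vacuumer = fork()) == 0)
        loop_command(conninfo, "VACUUM fsm_test");
    /* session 2: issue checkpoints continuously */
    if ((checkpointer = fork()) == 0)
        loop_command(conninfo, "CHECKPOINT");

    /* session 3: COPY 33150 single-integer rows, a bit over 162 pages */
    res = PQexec(conn, "COPY fsm_test FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        exit(1);
    }
    PQclear(res);
    for (int i = 0; i < 33150; i++)
    {
        char row[32];

        snprintf(row, sizeof(row), "%d\n", i);
        PQputCopyData(conn, row, strlen(row));
    }
    PQputCopyEnd(conn, NULL);
    PQclear(PQgetResult(conn));

    kill(vacuumer, SIGTERM);
    kill(checkpointer, SIGTERM);
    PQfinish(conn);
    return 0;
}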

> > I think the fix is one of:
> > 
> > - Revoke the undocumented rule.  Make FSM consumers resilient to the FSM
> >   returning a now-too-large block number.
> 
> Performance-wise, that seems to be the better answer

I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple().  Still,
it's likely fine.

> but won't this kind of 
> built-in resilience encourage subtle bugs creeping in?

Since the !wal_log_hints case avoids WAL for FSM, we've already accepted
subtle bugs.  I bet an unlucky torn FSM page could reach the ERROR you
reached, even if we introduced a "main-fork WAL before FSM" rule.  We'll have
plenty of need for resilience.  Hence, I'd lean toward this approach.
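
To make that concrete, the consumer-side check could look something like the
sketch below; the wrapper name is invented, and a real patch would likely
fold the check into the RelationGetBufferForTuple() logic instead:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "utils/rel.h"

/*
 * Sketch only: return a block with enough free space, treating FSM
 * entries past the end of the relation as bogus leftovers (e.g. from
 * replaying an FSM FPI_FOR_HINT without the relation extension).
 */
static BlockNumber
GetPageWithFreeSpaceChecked(Relation rel, Size spaceNeeded)
{
    BlockNumber targetBlock = GetPageWithFreeSpace(rel, spaceNeeded);

    if (targetBlock != InvalidBlockNumber &&
        targetBlock >= RelationGetNumberOfBlocks(rel))
    {
        /* Forget the bogus entry; the caller falls back to extending. */
        RecordPageWithFreeSpace(rel, targetBlock, 0);
        targetBlock = InvalidBlockNumber;
    }
    return targetBlock;
}

The RelationGetNumberOfBlocks() call, an lseek in the common case, is the
hot-path cost mentioned above.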

> > - Enforce a new "main-fork WAL before FSM" rule for logged rels.  For
> > example, in each PageIsNew() case, either don't update FSM or WAL-log an
> > init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
> 
> The FSM is not updated by the caller: in the bulk insert case, we
> intentionally don't add the new pages to the FSM. So having VACUUM WAL-log
> the page in lazy_scan_new_or_empty in the PageIsNew() case, as you propose,
> seems a very good idea.
> 
> Performance-wise, that would keep the WAL logging outside of the backends
> doing the inserts, with VACUUM performing the preventive work for itself and
> others, and there should not be so many such pages that it gives VACUUM too
> much work.

A drawback is that this approach touches several access methods, so
out-of-tree access methods also likely need changes.  Still, this is a good
backup if the first option measurably slows INSERT throughput.
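
For the heap, the change might look roughly like this fragment (untested;
assumes rel, buf and blkno are in scope and buf is exclusively locked),
mirroring what the PageIsEmpty() branch of lazy_scan_new_or_empty() already
does:

    Page        page = BufferGetPage(buf);

    if (PageIsNew(page))
    {
        Size        freespace;

        /*
         * Initialize and WAL-log the page before advertising it in the
         * FSM, so replay can never leave the FSM pointing past the end
         * of the main fork.
         */
        START_CRIT_SECTION();
        PageInit(page, BufferGetPageSize(buf), 0);
        MarkBufferDirty(buf);
        if (RelationNeedsWAL(rel))
            log_newpage_buffer(buf, true);
        END_CRIT_SECTION();

        freespace = PageGetHeapFreeSpace(page);
        UnlockReleaseBuffer(buf);
        RecordPageWithFreeSpace(rel, blkno, freespace);
    }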

> I can try my hand at such a patch if it looks like a good idea.

Please do.

> Thank you very much for your insights.

Thanks for the detailed report that made it possible.


