Re: FSM Corruption (was: Could not read block at end of the relation) - Mailing list pgsql-bugs

From Ronan Dunklau
Subject Re: FSM Corruption (was: Could not read block at end of the relation)
Date
Msg-id 3549385.iIbC2pHGDl@aivenlaptop
In response to Re: FSM Corruption (was: Could not read block at end of the relation)  (Noah Misch <noah@leadboat.com>)
Responses Re: FSM Corruption (was: Could not read block at end of the relation)  (Noah Misch <noah@leadboat.com>)
List pgsql-bugs
On Monday, 4 March 2024, 20:03:12 CET, Noah Misch wrote:
> On Mon, Mar 04, 2024 at 02:10:39PM +0100, Ronan Dunklau wrote:
> > On Monday, 4 March 2024, 00:47:15 CET, Noah Misch wrote:
> > > On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> You're using data_checksums, right?  Thanks for the wal dump excerpts; I
> agree with this summary thereof.

Yes, data checksums are enabled, so wal_log_hints is implied.

>
> > There are no traces of relation truncation happening in the WAL.
>
> That is notable.
>
> > This case only shows a single invalid entry in the FSM, but I've noticed as
> > much as 62 blocks present in the FSM while they do not exist on disk, all
> > tagged with MaxFSMRequestSize so I suppose something is wrong with the bulk
> > extension mechanism.
>
> Is this happening after an OS crash, a replica promote, or a PITR restore?

I need to double-check all occurrences, but I wouldn't be surprised if they all
went through a replica promotion.

> If so, I think I see the problem.  We have an undocumented rule that FSM
> shall not contain references to pages past the end of the relation.  To
> facilitate that, relation truncation WAL-logs FSM truncate.  However,
> there's no similar protection for relation extension, which is not
> WAL-logged.  We break the rule whenever we write FSM for block X before
> some WAL record initializes block X. data_checksums makes the trouble
> easier to hit, since it creates FPI_FOR_HINT records for FSM changes.  A
> replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield
> this broken state.  While v16 RelationAddBlocks() made this easier to hit,
> I suspect it's reproducible in all supported branches.  For example,
> lazy_scan_new_or_empty() and multiple index AMs break the rule via
> RecordPageWithFreeSpace() on a PageIsNew() page.

Very interesting. I understand the reasoning, and am now able to craft a very
crude test case, which I will refine into something more actionable:

 - start a session 1, which just vacuums our table continuously
 - start a session 2, which issues checkpoints continuously
 - in a session 3, COPY enough data so that we preallocate too many pages. In
my example, I'm copying 33150 single-integer rows, which fit on a bit more
than 162 pages. The idea behind that is to make sure we don't land on a round
number of pages, and that we leave several iterations of the extend mechanism
running so that a full page write of the FSM gets issued.

So to trigger the bug, it seems to me one needs to run in parallel:
 - COPY
 - VACUUM
 - CHECKPOINT

Which would explain why a surprisingly large number of the occurrences I've
seen seemed to involve a pg_restore invocation.
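
To make the suspected sequence concrete, here is a rough toy model in Python.
It is only a sketch of the reasoning above, not PostgreSQL code: the record
names HEAP_INIT and FSM_FPI, the Replica class and the block numbers are all
made up for illustration. It assumes the FSM change reaches the WAL as a full
page image while the relation extension itself is not WAL-logged, so a
promotion that stops replay in between leaves an FSM entry past the end of the
relation.

```python
# Toy model only: record names mirror the discussion, not PostgreSQL's WAL format.
from dataclasses import dataclass, field


@dataclass
class Replica:
    nblocks: int = 0                          # main-fork length after replay
    fsm: dict = field(default_factory=dict)   # block number -> advertised free space

    def replay(self, record):
        kind, block = record
        if kind == "HEAP_INIT":               # record that initializes a main-fork block
            self.nblocks = max(self.nblocks, block + 1)
        elif kind == "FSM_FPI":               # full page image carrying an FSM entry
            self.fsm[block] = "MaxFSMRequestSize"

    def dangling(self):
        """FSM entries that point past the end of the relation."""
        return sorted(b for b in self.fsm if b >= self.nblocks)


def promote_after(wal, n):
    """Replay only the first n records, as a promotion or PITR stopping there would."""
    replica = Replica()
    for record in wal[:n]:
        replica.replay(record)
    return replica


# Hypothetical stream: the FSM entry for block 162 is logged before any
# record initializes block 162 in the main fork.
wal = [("HEAP_INIT", 160), ("HEAP_INIT", 161), ("FSM_FPI", 162), ("HEAP_INIT", 162)]

print(promote_after(wal, 3).dangling())  # [162]
print(promote_after(wal, 4).dangling())  # []
```

Stopping replay right after the third record reproduces the broken state in
the model; replaying one more record hides it, which would fit with the
problem only showing up on some promotions.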

>
> I think the fix is one of:
>
> - Revoke the undocumented rule.  Make FSM consumers resilient to the FSM
>   returning a now-too-large block number.

Performance-wise, that seems to be the better answer, but won't this kind of
built-in resilience encourage subtle bugs to creep in?
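
For discussion purposes, this is roughly what I understand "resilient
consumer" to mean, as a Python sketch. The function get_usable_block and the
dict-based FSM are hypothetical stand-ins, not the real freespace API: the
idea is simply that an entry at or past the current relation length is dropped
instead of trusted.

```python
def get_usable_block(fsm, nblocks, needed):
    """Return a block with at least `needed` free bytes, or None to ask for an extension.

    fsm maps block number -> advertised free bytes; nblocks is the current
    main-fork length. Entries at or past nblocks are removed, not trusted.
    """
    for block in sorted(fsm):       # sorted() copies the keys, so deleting below is safe
        if fsm[block] < needed:
            continue
        if block >= nblocks:
            del fsm[block]          # stale entry past the end of the relation
            continue
        return block
    return None


fsm = {3: 100, 162: 8000, 163: 8000}
print(get_usable_block(fsm, 162, 500))  # None
print(fsm)                              # {3: 100}
```

The concern above still applies to such a scheme: the stale entries are
silently repaired, so whatever produced them would go unnoticed.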

>
> - Enforce a new "main-fork WAL before FSM" rule for logged rels.  For
> example, in each PageIsNew() case, either don't update FSM or WAL-log an
> init (like lazy_scan_new_or_empty() does when PageIsEmpty()).

The FSM is not updated by the caller: in the bulk insert case, we
intentionally don't add the new pages to the FSM. So having VACUUM WAL-log the
page in lazy_scan_new_or_empty() in the PageIsNew() case, as you propose, seems
a very good idea.

Performance-wise, that would keep the WAL logging out of the backend
performing the preventive work for itself or others, and there should not be
so many such pages that VACUUM gets too much work.
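
As a sanity check of the "main-fork WAL before FSM" ordering, here is a small
Python sketch. It is a toy, with made-up record names (HEAP_INIT, FSM_UPDATE)
and a hypothetical helper first_bad_cut; it only checks whether some prefix of
a record stream would leave the FSM pointing at a block that nothing has
initialized yet.

```python
def first_bad_cut(wal):
    """Return the smallest prefix length after which the FSM references a block
    that no earlier record initialized, or None if every cut point is safe."""
    initialized = set()
    for n, (kind, block) in enumerate(wal, start=1):
        if kind == "HEAP_INIT":
            initialized.add(block)
        elif kind == "FSM_UPDATE" and block not in initialized:
            return n
    return None


# Hypothetical orderings for a single new page, block 162.
fsm_first = [("FSM_UPDATE", 162), ("HEAP_INIT", 162)]
init_first = [("HEAP_INIT", 162), ("FSM_UPDATE", 162)]

print(first_bad_cut(fsm_first))   # 1
print(first_bad_cut(init_first))  # None
```

With the init record logged first, no stopping point in this model leaves a
dangling FSM entry, which is the property the proposed rule is meant to give.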

I can try my hand at such a patch if it looks like a good idea.

Thank you very much for your insights.

Best regards,

--
Ronan Dunklau






