Re: FSM Corruption (was: Could not read block at end of the relation) - Mailing list pgsql-bugs

From Noah Misch
Subject Re: FSM Corruption (was: Could not read block at end of the relation)
Date
Msg-id 20240306184228.a7.nmisch@google.com
Whole thread Raw
In response to Re: FSM Corruption (was: Could not read block at end of the relation)  (Ronan Dunklau <ronan.dunklau@aiven.io>)
Responses Re: FSM Corruption (was: Could not read block at end of the relation)
List pgsql-bugs
On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
> Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> > I would guess this one is more risky from a performance perspective, since
> > we'd be adding to a hotter path under RelationGetBufferForTuple().  Still,
> > it's likely fine.
> 
> I ended up implementing this in the attached patch. The idea is that we detect 
> if the FSM returns a page past the end of the relation, and ignore it.
> In that case we will fallback through the extension mechanism.
> 
> For the corrupted-FSM case it is not great performance wise, as we will extend 
> the relation in small steps every time we find a non existing block in the FSM, 
> until the actual relation size matches what is recorded in the FSM. But since 
> those seldom happen, I figured it was better to keep the code really simple for 
> a bugfix.

That could be fine.  Before your message, I assumed we would treat this like
the case of a full page.  That is, we'd call RecordAndGetPageWithFreeSpace(),
which would remove the FSM entry and possibly return another.  That could be
problematic if another backend is extending the relation; we don't want to
erase the extender's new FSM entry.  I'm parking this topic for the moment.

> I wanted to test the impact in terms of performance, and I thought about the 
> worst possible case for this.
> 
[test details]
> 
> I noticed no difference running with or without the patch, but maybe someone 
> else can try to run that or find another adversarial case ? 

The patch currently adds an lseek() whenever the FSM finds a block.  I see
relevant-looking posts from mailing list searches with subsets of these terms:
RelationGetNumberOfBlocks lseek benchmark overhead.  I bet we should reduce
this cost add.  Some ideas:

- No need for a system call if the FSM-returned value is low enough with
  respect to any prior RelationGetNumberOfBlocks() call.

- We'd be happy and efficient with a ReadBufferMode such that
  ReadBufferExtended() returns NULL after a 0-byte read, with all other errors
  handled normally.  That sounds like the specification of RBM_NORMAL_NO_LOG.
  Today's implementation of RBM_NORMAL_NO_LOG isn't that, but perhaps one RBM
  could satisfy both sets of needs.

On Wed, Mar 06, 2024 at 12:08:38PM +0100, Ronan Dunklau wrote:
> Subject: [PATCH 2/2] Detect invalid FSM when finding a block.
> 
> Whenever the FSM points to a block past the end of the relation, detect
> it and fallback to the relation extension mechanism.
> 
> This may arise when a FPI for the FSM has been triggered, and we end up
> WAL-logging a page of the FSM pointing to newly extended blocks in the
> relation which have never been WAL-logged.
> ---
>  src/backend/access/heap/hio.c | 20 +++++++++++++++++++-

Each GetPageWithFreeSpace() caller would need this, not just heap.  Please
evaluate whether this logic belongs inside freespace.c vs. in each caller.
freespace.c comments and/or freespace/README may need updates.  Please
evaluate header comments of freespace.c functions whose callers you are
changing, and evaluate comments about truncation.



pgsql-bugs by date:

Previous
From: Stephen Frost
Date:
Subject: Re: BUG #18379: LDAP bind password exposed
Next
From: Tom Lane
Date:
Subject: Re: BUG #18379: LDAP bind password exposed