Re: FSM Corruption (was: Could not read block at end of the relation) - Mailing list pgsql-bugs

From: Ronan Dunklau
Subject: Re: FSM Corruption (was: Could not read block at end of the relation)
Date:
Msg-id: 3225659.5fSG56mABF@aivenlaptop
In response to: Re: FSM Corruption (was: Could not read block at end of the relation)  (Noah Misch <noah@leadboat.com>)
Responses: Re: FSM Corruption (was: Could not read block at end of the relation)
List: pgsql-bugs
On Tuesday, 12 March 2024 at 23:08:07 CET, Noah Misch wrote:
> > I would argue this is fine, as corruptions don't happen often, and FSM is
> > not an exact thing anyway. If we record a block as full, it will just be
> > unused until the next vacuum adds it back to the FSM with a correct
> > value.
> If this happened only under FSM corruption, it would be fine.  I was
> thinking about normal extension, with an event sequence like this:
>
> - RelationExtensionLockWaiterCount() returns 10.
> - Backend B1 extends by 10 pages.
> - B1 records 9 pages in the FSM.
> - B2, B3, ... B11 wake up and fetch from the FSM.  In each of those
> backends, fsm_does_block_exist() returns false, because the cached relation
> size is stale.  Those pages remain unused until VACUUM re-records them in
> the FSM.
>
> PostgreSQL added the multiple-block extension logic (commit 719c84c) from
> hard-won experience bottlenecking there.  If fixing $SUBJECT will lose 1% of
> that change's benefit, that's probably okay.  If it loses 20%, we should
> work to do better.  While we could rerun the 2016 benchmark to see how the
> patch regresses it, there might be a simple fix.  fsm_does_block_exist()
> could skip the cache when blknumber>=cached, like this:
>
>     return ((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
>              blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
>             blknumber < RelationGetNumberOfBlocks(rel));
>
> How do you see it?  That still leaves us with an avoidable lseek in B2, B3,
> ... B11, but that feels tolerable.  I don't know how to do better without
> switching to the ReadBufferMode approach.

Hi Noah and thanks for your feedback.

That makes sense. I'm a bit worried about triggering additional lseek() calls
when the map in fact contains other free pages that would pass the test. I'm
not sure how relevant that is, though, given the way we search the FSM with
fp_next_slot...

To address that, I gave some thought to enabling / disabling the auto-repair
behaviour with a flag in GetPageWithFreeSpace, to distinguish the cases where
we know we have a somewhat up-to-date relation size from the cases where we
don't (as in, for heap: try without repair, then get an up-to-date value to
try the last block, and if we need to look at the FSM again, ask for it to be
repaired). But that brings far too much complexity and would need careful
thought for each AM.
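
To illustrate the shape of that rejected idea, here is a purely hypothetical
sketch: neither the extra "repair" parameter nor this helper exists in the
attached patch, and the real heap-side caller is RelationGetBufferForTuple()
in hio.c.

/*
 * Hypothetical only: a flag on GetPageWithFreeSpace() would let callers
 * that know their relation-size information is fresh decide whether
 * stale FSM slots may be repaired.
 */
static BlockNumber
heap_pick_target_block(Relation relation, Size targetFreeSpace)
{
    BlockNumber targetBlock;

    /* First pass: our cached size may be stale, so don't repair yet. */
    targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace,
                                       false /* repair */);
    if (targetBlock != InvalidBlockNumber)
        return targetBlock;

    /* Refresh the size, then retry the FSM and allow auto-repair. */
    (void) RelationGetNumberOfBlocks(relation);
    return GetPageWithFreeSpace(relation, targetFreeSpace,
                                true /* repair */);
}

Every access method would need the same kind of analysis to know which flag
value is safe to pass, which is where the complexity comes from.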

So please find attached a patch with the change you propose.

Do you have a link to the benchmark you mentioned, though, so I can evaluate
the patch against it?
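
For reference, the check itself boils down to the expression you spelled out
above. As a standalone sketch of how I read it (the attached patch is
authoritative; the surrounding function body and its use of RelationGetSmgr()
are my reconstruction, not a verbatim copy):

/*
 * Trust an FSM-returned block when it is below the cached relation size;
 * only when the cache cannot vouch for it do we pay for the lseek() behind
 * RelationGetNumberOfBlocks().
 */
static bool
fsm_does_block_exist(Relation rel, BlockNumber blknumber)
{
    SMgrRelation smgr = RelationGetSmgr(rel);

    return ((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
             blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
            blknumber < RelationGetNumberOfBlocks(rel));
}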

>
> > > The patch currently adds an lseek() whenever the FSM finds a block.  I
> > > see
> > > relevant-looking posts from mailing list searches with subsets of these
> > > terms: RelationGetNumberOfBlocks lseek benchmark overhead.  I bet we
> > > should
> > > reduce this cost add.  Some ideas:
> > >
> > > - No need for a system call if the FSM-returned value is low enough with
> > >   respect to any prior RelationGetNumberOfBlocks() call.
> >
> > I'm not sure what you mean by "low enough".  What I chose to implement instead
> I just meant "less than".  Specifically, if "fsm_returned_value < MAX(all
> RelationGetNumberOfBlocks() calls done on this relation, since last
> postmaster start)" then the FSM-returned value won't have the problem seen
> in $SUBJECT.
> > > - We'd be happy and efficient with a ReadBufferMode such that
> > >   ReadBufferExtended() returns NULL after a 0-byte read, with all other
> > >   errors handled normally.  That sounds like the specification of
> > >   RBM_NORMAL_NO_LOG.  Today's implementation of RBM_NORMAL_NO_LOG isn't
> > >   that, but perhaps one RBM could satisfy both sets of needs.
> >
> > I'm not sure about this one, this seems to have far more reaching
> > consequences. This would mean we don't have any cooperation from the FSM,
> > and would need to implement error recovery scenarios in each caller. But
> > I admit I haven't looked too much into it, and focused instead in seeing
> > if the current approach can get us to an acceptable state.
>
> I, too, am not sure.  I agree that RBM approach wouldn't let you isolate the
> changes like the last patch does.  Each GetFreeIndexPage() caller would
> need code for the return-NULL case.  Avoiding that is nice.  (If we ever do
> the RBM change in the future, the fsm_readbuf() and vm_readbuf() !extend
> cases should also use it.)
>
> > --- a/src/backend/storage/freespace/freespace.c
> > +++ b/src/backend/storage/freespace/freespace.c
> >
> > +            if (addr.level == FSM_BOTTOM_LEVEL) {
> > +                BlockNumber blkno = fsm_get_heap_blk(addr, slot);
> > +                Page page;
> > +                /*
> > +                 * If the buffer is past the end of the relation,
> > +                 * update the page and restarts from the root
> > +                 */
> > +                if (fsm_does_block_exist(rel, blkno))
> > +                {
> > +                    ReleaseBuffer(buf);
> > +                    return blkno;
> > +                }
> > +                page = BufferGetPage(buf);
> > +                LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> > +                fsm_set_avail(page, slot, 0);
> > +                MarkBufferDirtyHint(buf, false);
>
> I wondered if we should zero all slots on the same FSM page that correspond
> to main fork blocks past the end of the main fork.  The usual "advancenext"
> behavior is poor for this case, since every later slot is invalid in same
> way. My current thinking: no, it's not worth the code bulk to do better.  A
> single extension caps at 64 blocks, so the worst case is roughly locking
> the buffer 64 times and restarting from FSM root 64 times.  For something
> this infrequent, that's not bad.

That was my reasoning as well. Paying a small performance penalty in the
unlikely case that we hit a corruption, and being done with it, seems correct.


>
> Thanks,
> nm


Attachment
