Re: FSM Corruption (was: Could not read block at end of the relation) - Mailing list pgsql-bugs
From | Noah Misch |
---|---|
Subject | Re: FSM Corruption (was: Could not read block at end of the relation) |
Date | |
Msg-id | 20240306184228.a7.nmisch@google.com Whole thread Raw |
In response to | Re: FSM Corruption (was: Could not read block at end of the relation) (Ronan Dunklau <ronan.dunklau@aiven.io>) |
Responses |
Re: FSM Corruption (was: Could not read block at end of the relation)
|
List | pgsql-bugs |
On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote: > Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit : > > I would guess this one is more risky from a performance perspective, since > > we'd be adding to a hotter path under RelationGetBufferForTuple(). Still, > > it's likely fine. > > I ended up implementing this in the attached patch. The idea is that we detect > if the FSM returns a page past the end of the relation, and ignore it. > In that case we will fallback through the extension mechanism. > > For the corrupted-FSM case it is not great performance wise, as we will extend > the relation in small steps every time we find a non existing block in the FSM, > until the actual relation size matches what is recorded in the FSM. But since > those seldom happen, I figured it was better to keep the code really simple for > a bugfix. That could be fine. Before your message, I assumed we would treat this like the case of a full page. That is, we'd call RecordAndGetPageWithFreeSpace(), which would remove the FSM entry and possibly return another. That could be problematic if another backend is extending the relation; we don't want to erase the extender's new FSM entry. I'm parking this topic for the moment. > I wanted to test the impact in terms of performance, and I thought about the > worst possible case for this. > [test details] > > I noticed no difference running with or without the patch, but maybe someone > else can try to run that or find another adversarial case ? The patch currently adds an lseek() whenever the FSM finds a block. I see relevant-looking posts from mailing list searches with subsets of these terms: RelationGetNumberOfBlocks lseek benchmark overhead. I bet we should reduce this cost add. Some ideas: - No need for a system call if the FSM-returned value is low enough with respect to any prior RelationGetNumberOfBlocks() call. - We'd be happy and efficient with a ReadBufferMode such that ReadBufferExtended() returns NULL after a 0-byte read, with all other errors handled normally. That sounds like the specification of RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't that, but perhaps one RBM could satisfy both sets of needs. On Wed, Mar 06, 2024 at 12:08:38PM +0100, Ronan Dunklau wrote: > Subject: [PATCH 2/2] Detect invalid FSM when finding a block. > > Whenever the FSM points to a block past the end of the relation, detect > it and fallback to the relation extension mechanism. > > This may arise when a FPI for the FSM has been triggered, and we end up > WAL-logging a page of the FSM pointing to newly extended blocks in the > relation which have never been WAL-logged. > --- > src/backend/access/heap/hio.c | 20 +++++++++++++++++++- Each GetPageWithFreeSpace() caller would need this, not just heap. Please evaluate whether this logic belongs inside freespace.c vs. in each caller. freespace.c comments and/or freespace/README may need updates. Please evaluate header comments of freespace.c functions whose callers you are changing, and evaluate comments about truncation.
pgsql-bugs by date: