Re: FSM Corruption (was: Could not read block at end of the relation) - Mailing list pgsql-bugs
From | Noah Misch |
---|---|
Subject | Re: FSM Corruption (was: Could not read block at end of the relation) |
Date | |
Msg-id | 20240316030339.03.nmisch@google.com Whole thread Raw |
In response to | Re: FSM Corruption (was: Could not read block at end of the relation) (Ronan Dunklau <ronan.dunklau@aiven.io>) |
Responses |
Re: FSM Corruption (was: Could not read block at end of the relation)
(Noah Misch <noah@leadboat.com>)
|
List | pgsql-bugs |
On Fri, Mar 15, 2024 at 07:23:46AM +0100, Ronan Dunklau wrote: > Le mercredi 13 mars 2024, 17:55:23 CET Noah Misch a écrit : > > On Wed, Mar 13, 2024 at 10:37:21AM +0100, Ronan Dunklau wrote: > > > I'm a bit worried about triggering additional lseeks > > > when we in fact have other free pages in the map, that would pass the > > > test. I'm not sure how relevant it is given the way we search the FSM > > > with fp_next_slot though... > > > > That's a reasonable thing to worry about. We could do wrong by trying too > > hard to use an FSM slot, and we could do wrong by not trying hard enough. > > > > > To address that, I've given a bit of thought about enabling / disabling > > > the > > > auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish > > > the cases where we know we have a somewhat up-to-date value compared to > > > the case where we don't (as in, for heap, try without repair, then get an > > > uptodate value to try the last block, and if we need to look at the FSM > > > again then ask for it to be repaired) but it brings way too much > > > complexity and would need careful thought for each AM. > > > > > > So please find attached a patch with the change you propose. > > > > > > Do you have a link to the benchmark you mention though to evaluate the > > > patch against it ? > > > > Here is one from the thread that created commit 719c84c: > > https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEom > > gHfuVdzNJnwWVoE_c60g%40mail.gmail.com > > > > There may be other benchmarks earlier in that thread. > > Thank you. I tried to run that benchmark, with 4 clients all copying 10M > pgbench_accounts row concurrently. With and without the patch, I ran the > benchmark 5 times, and took the average. Does "that benchmark" refer to "unlogged tables, 4 parallel copies", "logged tables, 4 parallel copies", or something else? > master: 15.05s > patched: 15.24s (+1.25%) > > If I remove the best and worst run for each of those, the difference falls at > +0.7%. To get some additional perspective on the benchmark, how hard would it be to run one or both of the following? Feel free to decline if difficult. - Make GetPageWithFreeSpace() just "return InvalidBlockNumber", then rerun the benchmark. This should be a lot slower. If not, the bottleneck is somewhere unexpected, and we'd need a different benchmark. - Get profiles with both master and patched. (lseek or freespace.c functions rising by 0.1%-1% would fit what we know.) Distinguishing a 1% change from a 0% change would need different techniques still. If we're genuinely slowing bulk loads by ~1% to account for the possibly of a flawed post-recovery FSM, that's sad, but I'm inclined to accept the loss. A persistent error in an innocent INSERT is unacceptable, and the alternatives we discussed upthread have their own substantial trouble. Other opinions? Thanks, nm
pgsql-bugs by date: