Thread: Could not read block at end of the relation

Could not read block at end of the relation

From
Ronan Dunklau
Date:
Hello,

I'm sorry as this will be a very poor bug report. On PG16, I'm am experiencing 
random errors which share the same characteristics: 

- happens during heavy system load
- lots of concurrent writes happening on a table
- often (but haven't been able to confirm it is necessary), a vacuum is running 
on the table at the same time the error is triggered

Then, several backends get the same error at once "ERROR:  could not read 
block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with different 
block numbers. The relation is always a table (regular or toast). The blocks 
are past the end of the relation, and the different backends are all trying to 
read a different block. The offending queries are either an INSERT / UPDATE / 
COPY. 

I've seen that several bugs have been fixed in 16.1 and 16.2 regarding the new 
relation extension infrastructure, involving partitioned tables in one case 
and temp tables in the other one so I suspect maybe some other corner cases 
are uncovered in there. 

I suspected the FSM could be corrupted in some way but taking a look at it 
just after the errors have been triggered, the offending (non existing)blocks 
are just not present in the FSM either.

I'm desperately trying to reproduce the issue in a test environment, without 
any luck so far... I suspected a race condition with VACUUM trying to reclaim 
the space at the end of the relation, but running a custom build trying to 
reproduce that (by always trying to truncate the relation during VACUUM 
regardless of the amount of possibly-freeable-space) hasn't led me anywhere.

It seems that for some reason, a backend is extending the relation for the 
other waiting ones, and the newly allocated blocks don't end up being pinned 
in shared_buffers. They could then be evicted, and the waiting backend is now 
trying to read from a block which has to be read from disk but has never been 
marked dirty and never persisted. I don't have anything to back that 
hypothesis though...

Once again I'm sorry that this report is too vague, I'll update if I manage to 
reproduce the issue or gather some more information, but in the meantime has 
anybody witnessed something similar ?  And more importantly, do you have any 
pointers on how to investigate to try to trigger the issue manually ?

Best regards,

--
Ronan Dunklau





FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le mardi 27 février 2024, 11:34:14 CET Ronan Dunklau a écrit :
> I suspected the FSM could be corrupted in some way but taking a look at it
> just after the errors have been triggered, the offending (non
> existing)blocks are just not present in the FSM either.

I think I may have missed something on my first look. On other affected
clusters, the FSM is definitely corrupted.  So it looks like we have an FSM
corruption bug on our hands.

The occurence of this bug happening makes it hard to reproduce,  but it's
definitely frequent enough we witnessed it on a dozen PostgreSQL clusters.

In our case, we need to repair the FSM. The instructions on the wiki do work,
but maybe we should add something like the attached patch (modeled after the
same feature in pg_visibility) to make it possible to repair the FSM
corruption online. What do you think about it ?

The investigation of the corruption is still ongoing.


Best regards,

--
Ronan Dunklau


Attachment

Re: FSM Corruption (was: Could not read block at end of the relation)

From
Michael Paquier
Date:
On Fri, Mar 01, 2024 at 09:56:51AM +0100, Ronan Dunklau wrote:
> In our case, we need to repair the FSM. The instructions on the wiki do work,
> but maybe we should add something like the attached patch (modeled after the
> same feature in pg_visibility) to make it possible to repair the FSM
> corruption online. What do you think about it ?

I vaguely recall that something like that has been suggested around
917dc7d2393c in the past.

> The investigation of the corruption is still ongoing.

Thanks!
--
Michael

Attachment

Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> - happens during heavy system load
> - lots of concurrent writes happening on a table
> - often (but haven't been able to confirm it is necessary), a vacuum is running 
> on the table at the same time the error is triggered
> 
> Then, several backends get the same error at once "ERROR:  could not read 
> block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with different 

What are some of the specific block numbers reported?

> has anybody witnessed something similar ?

https://postgr.es/m/flat/CA%2BhUKGK%2B5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8D%2BZg%40mail.gmail.com
reminded me of this.  Did you upgrade your OS recently?

On Fri, Mar 01, 2024 at 09:56:51AM +0100, Ronan Dunklau wrote:
> I think I may have missed something on my first look. On other affected 
> clusters, the FSM is definitely corrupted.  So it looks like we have an FSM 
> corruption bug on our hands.

What corruption signs did you observe in the FSM?  Since FSM is intentionally
not WAL-logged, corruption is normal, but corruption causing errors is not
normal.  That said, if any crash leaves a state that the freespace/README
"self-correcting measures" don't detect, errors may happen.  Did the clusters
crash recently?

> The occurence of this bug happening makes it hard to reproduce,  but it's 
> definitely frequent enough we witnessed it on a dozen PostgreSQL clusters. 

You could do "ALTER TABLE x SET (vacuum_truncate = off);" and see if the
problem stops happening.  That would corroborate the VACUUM theory.

Can you use backtrace_functions to get a stack track?

> In our case, we need to repair the FSM. The instructions on the wiki do work, 
> but maybe we should add something like the attached patch (modeled after the 
> same feature in pg_visibility) to make it possible to repair the FSM 
> corruption online. What do you think about it ?

That's reasonable in concept.



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le lundi 4 mars 2024, 00:47:15 CET Noah Misch a écrit :
> On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> > - happens during heavy system load
> > - lots of concurrent writes happening on a table
> > - often (but haven't been able to confirm it is necessary), a vacuum is
> > running on the table at the same time the error is triggered
> >
> > Then, several backends get the same error at once "ERROR:  could not read
> > block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with
> > different
> What are some of the specific block numbers reported?

Nothing specific, except they are past the end of the table, by an offset
starting at one. I've seen it happens on regular tables restored by pg_restore
quite a lot, but also on bigger tables, regular or TOAST. We've seen this on a
dozen different clusters, but nothing reproducible for now.

>
> > has anybody witnessed something similar ?
>
> https://postgr.es/m/flat/CA%2BhUKGK%2B5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8
> D%2BZg%40mail.gmail.com reminded me of this.  Did you upgrade your OS
> recently?

No recent upgrades I'm aware of. Also, it only touches PG16 clusters, not
earlier versions. But it looks like the same kind of errors.


> What corruption signs did you observe in the FSM?  Since FSM is
> intentionally not WAL-logged, corruption is normal, but corruption causing
> errors is not normal.  That said, if any crash leaves a state that the
> freespace/README "self-correcting measures" don't detect, errors may
> happen.  Did the clusters crash recently?

Well we still have full page images for the FSM.
I'll take as an example the cluster I'm currently investigating on.
The corruption I'm seeing is that I have an additional entry in the FSM for a
block that does not exist in the relation: IIUC, slot 4129 corresponds to
block 34, which is the last one I have in the relation. However, I have an
entry (slot 4130) for block number 35, with MaxFSMRequestSize

 fsm_page_contents
-------------------
 0: 255           +
 1: 255           +
 3: 255           +
...
 4129: 114        +
 4130: 255        +


This then causes attempts to insert into that block to fail.

Looking at when the corruption was WAL-logged, this particular case is quite
easy to trace. We have a few MULTI-INSERTS+INIT intiially loading the table
(probably a pg_restore), then, 2GB of WAL later, what looks like a VACUUM
running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE, VISIBLE xlog
records for each of the relation main fork, followed by a lonely FPI for the
leaf page of it's FSM:

rmgr: Heap2       len (rec/tot):   5698/  5698, tx:       1637, lsn:
1/810763B8, prev 1/81076388, desc: MULTI_INSERT+INIT ntuples: 124, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 0
rmgr: XLOG        len (rec/tot):     49/  8241, tx:       1637, lsn:
1/81077A00, prev 1/810763B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/17343/17959 fork fsm blk 2 FPW
rmgr: Heap2       len (rec/tot):   5762/  5762, tx:       1637, lsn:
1/81079A50, prev 1/81077A00, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 1
rmgr: Heap2       len (rec/tot):   5746/  5746, tx:       1637, lsn:
1/8107B0F0, prev 1/81079A50, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 2
rmgr: Heap2       len (rec/tot):   5778/  5778, tx:       1637, lsn:
1/8107C780, prev 1/8107B0F0, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 3

(up until blk 34 is  wal logged)

At some point pages 0 and 1 get a FPI, the leaf page (blk2) does not during
the transaction.

Later on, we have this for every main fork block:

rmgr: XLOG        len (rec/tot):     49/  8201, tx:          0, lsn:
2/0E1ECEC8, prev 2/0E1ECE10, desc: FPI_FOR_HINT , blkref #0: rel
1663/17343/17959 blk 0 FPW
rmgr: Heap2       len (rec/tot):    325/   325, tx:          0, lsn:
2/0E1EEEF0, prev 2/0E1ECEC8, desc: FREEZE_PAGE snapshotConflictHorizon: 1637,
nplans: 2, plans: [{ xmax: 0, infomask: 2816, infomask2: 5, ntuples: 86,
offsets: [1, 2, 3, 5, 7, 10, 14, 16, 21, 22, 28, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 45, 46, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60,
64, 65, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 85, 86,
87, 90, 92, 94, 95, 96, 98, 100, 102, 103, 104, 105, 106, 107, 108, 109, 110,
111, 112, 113, 114, 115, 117, 118, 119, 120, 121, 122, 123, 124] }, { xmax: 0,
infomask: 2817, infomask2: 5, ntuples: 38, offsets: [4, 6, 8, 9, 11, 12, 13,
15, 17, 18, 19, 20, 23, 24, 25, 26, 27, 29, 43, 44, 47, 48, 59, 61, 62, 63,
66, 70, 80, 84, 88, 89, 91, 93, 97, 99, 101, 116] }], blkref #0: rel
1663/17343/17959 blk 0
rmgr: Heap2       len (rec/tot):     64/  8256, tx:          0, lsn:
2/0E1EF038, prev 2/0E1EEEF0, desc: VISIBLE snapshotConflictHorizon: 0, flags:
0x03, blkref #0: rel 1663/17343/17959 fork vm blk 0 FPW, blkref #1: rel
1663/17343/17959 blk 0


And then the single FSM FPI, which contains the invalid record:

rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn:
2/0E28F640, prev 2/0E28F600, desc: FPI_FOR_HINT , blkref #0: rel
1663/17343/17959 fork fsm blk 2 FPW

I've dumped the associated FPIs, and took a look at them. First FPI for leaf
block (#2 in the fsm) happens in the beginning of the multi insert
transaction, and only contains one leaf node for block 0 (slot 4095).

Later FPI (during what I think is a VACUUM) persists the corruption in the
WAL.
So presumably something happened during the COPY operation which caused it to
add an invalid entry for inexisting block 35 in the FSM.

There are no traces of relation truncation happening in the WAL.

This case only shows a single invalid entry in the FSM, but I've noticed as
much as 62 blocks present in the FSM while they do not exist on disk, all
tagged with MaxFSMRequestSize so I suppose something is wrong with the bulk
extension mechanism.


>
> > The occurence of this bug happening makes it hard to reproduce,  but it's
> > definitely frequent enough we witnessed it on a dozen PostgreSQL clusters.
>
> You could do "ALTER TABLE x SET (vacuum_truncate = off);" and see if the
> problem stops happening.  That would corroborate the VACUUM theory.
>
> Can you use backtrace_functions to get a stack track?

I got a stack trace from the error after the corruption has occured (even got
to attach gdb to it as I was able to restore a corrupted cluster), but I don't
think it's of any use (this is from a different occurence than the one I wrote
in details above, so the block numbers don't match but it's the exact same
thing).

#0  errstart (elevel=21, domain=0x0) at utils/error/elog.c:348
#1  0x000055c81ea83a0a in errstart_cold (domain=0x0, elevel=21) at utils/
error/elog.c:333
#2  mdread (reln=0x55c8210bbe00, forknum=MAIN_FORKNUM, blocknum=1203,
buffer=0x7f51b4c32000) at storage/smgr/md.c:796
#3  0x000055c81ee396cf in smgrread (buffer=0x7f51b4c32000, blocknum=1203,
forknum=MAIN_FORKNUM, reln=0x55c8210bbe00) at storage/smgr/smgr.c:565
#4  ReadBuffer_common (smgr=0x55c8210bbe00,
relpersistence=relpersistence@entry=112 'p',
forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=1203,
    mode=mode@entry=RBM_NORMAL, strategy=<optimized out>, hit=0x7fff6e22f607)
at storage/buffer/bufmgr.c:1124
#5  0x000055c81ee39d0b in ReadBufferExtended (reln=0x7f51b294f0d0,
forkNum=MAIN_FORKNUM, blockNum=1203, mode=RBM_NORMAL, strategy=<optimized
out>)
    at access/brin/../../../../src/include/utils/rel.h:576
#6  0x000055c81eb1525c in ReadBufferBI (mode=RBM_NORMAL, bistate=0x0,
targetBlock=1203, relation=<optimized out>) at access/heap/hio.c:96
#7  RelationGetBufferForTuple (relation=<optimized out>, len=<optimized out>,
otherBuffer=0, options=<optimized out>, bistate=0x0, vmbuffer=<optimized out>,
    vmbuffer_other=<optimized out>, num_pages=<optimized out>) at access/heap/
hio.c:620
#8  0x000055c81eafcace in heap_insert (relation=0x7f51b294f0d0,
tup=0x55c821042580, cid=<optimized out>, options=0, bistate=0x0) at access/
heap/heapam.c:1862
#9  0x000055c81eb08bab in heapam_tuple_insert (relation=0x7f51b294f0d0,
slot=0x55c821042478, cid=0, options=0, bistate=0x0) at access/heap/
heapam_handler.c:252
#10 0x000055c81ecce2f1 in table_tuple_insert (bistate=0x0, options=0,
cid=<optimized out>, slot=0x55c821042478, rel=<optimized out>)
    at executor/../../../src/include/access/tableam.h:1400
#11 ExecInsert (context=context@entry=0x7fff6e22f9b0,
resultRelInfo=resultRelInfo@entry=0x55c8210410c0,
slot=slot@entry=0x55c821042478, canSetTag=<optimized out>,
    inserted_tuple=inserted_tuple@entry=0x0,
insert_destrel=insert_destrel@entry=0x0) at executor/nodeModifyTable.c:1133
#12 0x000055c81eccf1eb in ExecModifyTable (pstate=<optimized out>) at
executor/nodeModifyTable.c:3810

Just a regular insertion failing to read the block it's trying to insert into.

> > In our case, we need to repair the FSM. The instructions on the wiki do
> > work, but maybe we should add something like the attached patch (modeled
> > after the same feature in pg_visibility) to make it possible to repair
> > the FSM corruption online. What do you think about it ?
>
> That's reasonable in concept.

Ok, I'll start a separate thread on -hackers for that.

Thank you !










Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Mon, Mar 04, 2024 at 02:10:39PM +0100, Ronan Dunklau wrote:
> Le lundi 4 mars 2024, 00:47:15 CET Noah Misch a écrit :
> > On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> > > - happens during heavy system load
> > > - lots of concurrent writes happening on a table
> > > - often (but haven't been able to confirm it is necessary), a vacuum is
> > > running on the table at the same time the error is triggered

> Looking at when the corruption was WAL-logged, this particular case is quite 
> easy to trace. We have a few MULTI-INSERTS+INIT intiially loading the table 
> (probably a pg_restore), then, 2GB of WAL later, what looks like a VACUUM 
> running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE, VISIBLE xlog 
> records for each of the relation main fork, followed by a lonely FPI for the 
> leaf page of it's FSM:

You're using data_checksums, right?  Thanks for the wal dump excerpts; I agree
with this summary thereof.

> There are no traces of relation truncation happening in the WAL.

That is notable.

> This case only shows a single invalid entry in the FSM, but I've noticed as 
> much as 62 blocks present in the FSM while they do not exist on disk, all 
> tagged with MaxFSMRequestSize so I suppose something is wrong with the bulk 
> extension mechanism.

Is this happening after an OS crash, a replica promote, or a PITR restore?  If
so, I think I see the problem.  We have an undocumented rule that FSM shall
not contain references to pages past the end of the relation.  To facilitate
that, relation truncation WAL-logs FSM truncate.  However, there's no similar
protection for relation extension, which is not WAL-logged.  We break the rule
whenever we write FSM for block X before some WAL record initializes block X.
data_checksums makes the trouble easier to hit, since it creates FPI_FOR_HINT
records for FSM changes.  A replica promote or PITR ending just after the FSM
FPI_FOR_HINT would yield this broken state.  While v16 RelationAddBlocks()
made this easier to hit, I suspect it's reproducible in all supported
branches.  For example, lazy_scan_new_or_empty() and multiple index AMs break
the rule via RecordPageWithFreeSpace() on a PageIsNew() page.

I think the fix is one of:

- Revoke the undocumented rule.  Make FSM consumers resilient to the FSM
  returning a now-too-large block number.

- Enforce a new "main-fork WAL before FSM" rule for logged rels.  For example,
  in each PageIsNew() case, either don't update FSM or WAL-log an init (like
  lazy_scan_new_or_empty() does when PageIsEmpty()).



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le lundi 4 mars 2024, 20:03:12 CET Noah Misch a écrit :
> On Mon, Mar 04, 2024 at 02:10:39PM +0100, Ronan Dunklau wrote:
> > Le lundi 4 mars 2024, 00:47:15 CET Noah Misch a écrit :
> > > On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> You're using data_checksums, right?  Thanks for the wal dump excerpts; I
> agree with this summary thereof.

Yes, data checksums so wal_log_hints is implied.

>
> > There are no traces of relation truncation happening in the WAL.
>
> That is notable.
>
> > This case only shows a single invalid entry in the FSM, but I've noticed
> > as
> > much as 62 blocks present in the FSM while they do not exist on disk, all
> > tagged with MaxFSMRequestSize so I suppose something is wrong with the
> > bulk
> > extension mechanism.
>
> Is this happening after an OS crash, a replica promote, or a PITR restore?

I need to double check all occurences, but I wouldn't be surprised if they all
went trough a replica promotion.

> If so, I think I see the problem.  We have an undocumented rule that FSM
> shall not contain references to pages past the end of the relation.  To
> facilitate that, relation truncation WAL-logs FSM truncate.  However,
> there's no similar protection for relation extension, which is not
> WAL-logged.  We break the rule whenever we write FSM for block X before
> some WAL record initializes block X. data_checksums makes the trouble
> easier to hit, since it creates FPI_FOR_HINT records for FSM changes.  A
> replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield
> this broken state.  While v16 RelationAddBlocks() made this easier to hit,
> I suspect it's reproducible in all supported branches.  For example,
> lazy_scan_new_or_empty() and multiple index AMs break the rule via
> RecordPageWithFreeSpace() on a PageIsNew() page.

Very interesting. I understand the reasoning, and am now able to craft a very
crude test case, which I will refine into something more actionable:

 - start a session 1, which will just vacuum our table continuously
 - start a ession 2, which will issue checkpoints continuously
 - in a session 3, COPY enough data so that we prealllocate too many pages. In
my example, I 'm copying 33150 single integer rows, which fit on a bit more
than 162 pages. The idea behind that is to make sure we don't fall on a round
number of pages, and we leave several iterations of the extend mechanism
running to issue a full page write of the fsm.

So to trigger the bug, it seems to me one needs to run in parallel:
 - COPY
 - VACUUM
 - CHECKPOINT

Which would explain why a surprisingly large number of occurences I've seen
seemed to involve a pg_restore invocation.

>
> I think the fix is one of:
>
> - Revoke the undocumented rule.  Make FSM consumers resilient to the FSM
>   returning a now-too-large block number.

Performance wise, that seems to be the better answer but won't this kind of
built-in resilience encourage subtle bugs creeping in ?

>
> - Enforce a new "main-fork WAL before FSM" rule for logged rels.  For
> example, in each PageIsNew() case, either don't update FSM or WAL-log an
> init (like lazy_scan_new_or_empty() does when PageIsEmpty()).

The FSM is not updated by the caller: in the bulk insert case, we
intentionally don't add them to the FSM. So having VACUUM WAL-log the page in
lazy_scan_new_or_empty in the New case as you propose seems a very good idea.

Performance wise that would keep the WAL logging outside of the backend
performing the preventive work for itself or others, and it should not be that
many pages that it gives VACUUM too much work.

I can try my hand at such a patch it if looks like a good idea.

Thank you very much for your insights.

Best regards,

--
Ronan Dunklau







Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Mon, Mar 04, 2024 at 11:21:07PM +0100, Ronan Dunklau wrote:
> Le lundi 4 mars 2024, 20:03:12 CET Noah Misch a écrit :
> > Is this happening after an OS crash, a replica promote, or a PITR restore? 
> 
> I need to double check all occurences, but I wouldn't be surprised if they all 
> went trough a replica promotion.
> 
> > If so, I think I see the problem.  We have an undocumented rule that FSM
> > shall not contain references to pages past the end of the relation.  To
> > facilitate that, relation truncation WAL-logs FSM truncate.  However,
> > there's no similar protection for relation extension, which is not
> > WAL-logged.  We break the rule whenever we write FSM for block X before
> > some WAL record initializes block X. data_checksums makes the trouble
> > easier to hit, since it creates FPI_FOR_HINT records for FSM changes.  A
> > replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield
> > this broken state.  While v16 RelationAddBlocks() made this easier to hit,
> > I suspect it's reproducible in all supported branches.  For example,
> > lazy_scan_new_or_empty() and multiple index AMs break the rule via
> > RecordPageWithFreeSpace() on a PageIsNew() page.
> 
> Very interesting. I understand the reasoning, and am now able to craft a very 
> crude test case, which I will refine into something more actionable: 
> 
>  - start a session 1, which will just vacuum our table continuously
>  - start a ession 2, which will issue checkpoints continuously
>  - in a session 3, COPY enough data so that we prealllocate too many pages. In 
> my example, I 'm copying 33150 single integer rows, which fit on a bit more 
> than 162 pages. The idea behind that is to make sure we don't fall on a round 
> number of pages, and we leave several iterations of the extend mechanism 
> running to issue a full page write of the fsm. 
> 
> So to trigger the bug, it seems to me one needs to run in parallel:
>  - COPY
>  - VACUUM
>  - CHECKPOINT
> 
> Which would explain why a surprisingly large number of occurences I've seen 
> seemed to involve a pg_restore invocation.

Does that procedure alone reproduce the ERROR, without any replica promote,
postmaster restart, etc.?  That, I do not have reason to expect.

> > I think the fix is one of:
> > 
> > - Revoke the undocumented rule.  Make FSM consumers resilient to the FSM
> >   returning a now-too-large block number.
> 
> Performance wise, that seems to be the better answer

I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple().  Still,
it's likely fine.

> but won't this kind of 
> built-in resilience encourage subtle bugs creeping in ?

Since the !wal_log_hints case avoids WAL for FSM, we've already accepted
subtle bugs.  I bet an unlucky torn FSM page could reach the ERROR you
reached, even if we introduced a "main-fork WAL before FSM" rule.  We'll have
plenty of need for resilience.  Hence, I'd lean toward this approach.

> > - Enforce a new "main-fork WAL before FSM" rule for logged rels.  For
> > example, in each PageIsNew() case, either don't update FSM or WAL-log an
> > init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
> 
> The FSM is not updated by the caller: in the bulk insert case, we 
> intentionally don't add them to the FSM. So having VACUUM WAL-log the page in 
> lazy_scan_new_or_empty in the New case as you propose seems a very good idea.
> 
> Performance wise that would keep the WAL logging outside of the backend 
> performing the preventive work for itself or others, and it should not be that 
> many pages that it gives VACUUM too much work. 

A drawback is that this approach touches several access methods, so
out-of-tree access methods also likely need changes.  Still, this is a good
backup if the first option measurably slows INSERT throughput.

> I can try my hand at such a patch it if looks like a good idea. 

Please do.

> Thank you very much for your insights.

Thanks for the detailed report that made it possible.



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> Does that procedure alone reproduce the ERROR, without any replica promote,
> postmaster restart, etc.?  That, I do not have reason to expect.

Oh no sorrry for not being clear: it needs to go through recovery either
through a backup restoration, a replica or anything else. This is why I've
been struggling to reproduce the issue, it didn't occur to me that the fsm was
only sane on the primary, as the extensions were not wal logged.

>
> > > I think the fix is one of:
> > >
> > > - Revoke the undocumented rule.  Make FSM consumers resilient to the FSM
> > >
> > >   returning a now-too-large block number.
> >
> > Performance wise, that seems to be the better answer
>
> I would guess this one is more risky from a performance perspective, since
> we'd be adding to a hotter path under RelationGetBufferForTuple().  Still,
> it's likely fine.

Makes sense.

>
> > but won't this kind of
> > built-in resilience encourage subtle bugs creeping in ?
>
> Since the !wal_log_hints case avoids WAL for FSM, we've already accepted
> subtle bugs.  I bet an unlucky torn FSM page could reach the ERROR you
> reached, even if we introduced a "main-fork WAL before FSM" rule.  We'll
> have plenty of need for resilience.  Hence, I'd lean toward this approach.


> > > - Enforce a new "main-fork WAL before FSM" rule for logged rels.  For
> > > example, in each PageIsNew() case, either don't update FSM or WAL-log an
> > > init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
> >
> > The FSM is not updated by the caller: in the bulk insert case, we
> > intentionally don't add them to the FSM. So having VACUUM WAL-log the page
> > in lazy_scan_new_or_empty in the New case as you propose seems a very
> > good idea.
> >
> > Performance wise that would keep the WAL logging outside of the backend
> > performing the preventive work for itself or others, and it should not be
> > that many pages that it gives VACUUM too much work.
>
> A drawback is that this approach touches several access methods, so
> out-of-tree access methods also likely need changes.  Still, this is a good
> backup if the first option measurably slows INSERT throughput.

Would it actually ? At least for the currently discussed VACUUM case, we will
fall back to a generic FPI for the new page, which should work on any AM. But
the requirement to WAL log the page before recording it in the FSM should be
publicized, and the common infrastructure should adhere to it. I'd have to
look into the other cases you mentioned where that can happen.

>
> > I can try my hand at such a patch it if looks like a good idea.
>
> Please do.
>
> > Thank you very much for your insights.
>
> Thanks for the detailed report that made it possible.







Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Tue, Mar 05, 2024 at 12:33:13AM +0100, Ronan Dunklau wrote:
> Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> > > > - Enforce a new "main-fork WAL before FSM" rule for logged rels.  For
> > > > example, in each PageIsNew() case, either don't update FSM or WAL-log an
> > > > init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
> > > 
> > > The FSM is not updated by the caller: in the bulk insert case, we
> > > intentionally don't add them to the FSM. So having VACUUM WAL-log the page
> > > in lazy_scan_new_or_empty in the New case as you propose seems a very
> > > good idea.
> > > 
> > > Performance wise that would keep the WAL logging outside of the backend
> > > performing the preventive work for itself or others, and it should not be
> > > that many pages that it gives VACUUM too much work.
> > 
> > A drawback is that this approach touches several access methods, so
> > out-of-tree access methods also likely need changes.  Still, this is a good
> > backup if the first option measurably slows INSERT throughput.
> 
> Would it actually ? At least for the currently discussed VACUUM case, we will 
> fall back to a generic FPI for the new page, which should work on any AM. But 
> the requirement to WAL log the page before recording it in the FSM should be 
> publicized, and the common infrastructure should adhere to it.

I agree each AM can use generic FPI code.  I'm looking at code like this as
needing change under this candidate design:

static void
spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
{
...
        if (PageIsNew(page) || PageIsEmpty(page))
        {
            RecordFreeIndexPage(index, blkno);

The change there would be fairly simple, but non-core access methods may
require similar simple changes.  That's fine if this approach has other
reasons to win, but it is a drawback.  A grep of pgxn code shows "rum" as the
only module containing a RecordFreeIndexPage() call.  No module contains a
RecordPageWithFreeSpace() call.  So this drawback is all but negligible.



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Michael Paquier
Date:
On Mon, Mar 04, 2024 at 04:13:46PM -0800, Noah Misch wrote:
> I agree each AM can use generic FPI code.  I'm looking at code like this as
> needing change under this candidate design:
>
> static void
> spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
> {
> ...
>         if (PageIsNew(page) || PageIsEmpty(page))
>         {
>             RecordFreeIndexPage(index, blkno);
>
> The change there would be fairly simple, but non-core access methods may
> require similar simple changes.  That's fine if this approach has other
> reasons to win, but it is a drawback.  A grep of pgxn code shows "rum" as the
> only module containing a RecordFreeIndexPage() call.  No module contains a
> RecordPageWithFreeSpace() call.  So this drawback is all but negligible.

This design does not sound that bad to be, FWIW, if what you are
discussing impacts data inserts enough so be noticeable in the default
cases.  There are not that many out-of-core index AMs out there, and
"rum" is as far as I know not supported but any of the major cloud
vendors.  So contacting the authors of such AMs and raising awareness
would be enough.  So I would worry much more about performance.  My
2c.

(You should be able to implement a cheap test using two injection
points, from what I can read, if you need coordination between three
actions.)
--
Michael

Attachment

Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le mardi 5 mars 2024, 01:56:12 CET Michael Paquier a écrit :
> (You should be able to implement a cheap test using two injection
> points, from what I can read, if you need coordination between three
> actions.)

I'm sorry, but I don't understand this sentence at all. Would you mind to
elaborate ?

Thanks !

> --
> Michael







Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> I would guess this one is more risky from a performance perspective, since
> we'd be adding to a hotter path under RelationGetBufferForTuple().  Still,
> it's likely fine.

I ended up implementing this in the attached patch. The idea is that we detect
if the FSM returns a page past the end of the relation, and ignore it.
In that case we will fallback through the extension mechanism.

For the corrupted-FSM case it is not great performance wise, as we will extend
the relation in small steps every time we find a non existing block in the FSM,
until the actual relation size matches what is recorded in the FSM. But since
those seldom happen, I figured it was better to keep the code really simple for
a bugfix.

I wanted to test the impact in terms of performance, and I thought about the
worst possible case for this.

Then, run a pgbench doing insertions in the table. With the attached patch the
worst case I could come up with is:
  - remember which page we last inserted into
  - notice we don't have enough space
  - ask the FSM for a block
  - now have to compare that to the actual relation size

So I came up with the following initialization steps:

 - create a table with vacuum_truncate = off, with a tuple size big enough that
it's impossible to fit two tuples on the same page
 - insert lots of tuple in it until it reaches a decent size
 - delete them all
 - vacuum
 - all of this fitting in shared_buffers

As in:

CREATE TABLE test_perf (c1 char(5000));
ALTER TABLE test_perf ALTER c1 SET STORAGE PLAIN;
ALTER TABLE test_perf SET (VACUUM_TRUNCATE = off);
INSERT INTO test_perf (c1) SELECT 'c' FROM generate_series(1, 1000000);
DELETE FROM test_perf;
VACUUM test_perf;

Then I ran pgbench with a single client, with a script only inserting the same
value over and over again, for 1000000 transactions (initial table size).

I noticed no difference running with or without the patch, but maybe someone
else can try to run that or find another adversarial case ?

Best regards,

--
Ronan Dunklau
Attachment

Re: FSM Corruption (was: Could not read block at end of the relation)

From
Patrick Stählin
Date:
On 3/6/24 10:31, Ronan Dunklau wrote:
> Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
>> I would guess this one is more risky from a performance perspective, since
>> we'd be adding to a hotter path under RelationGetBufferForTuple().  Still,
>> it's likely fine.
> 
> I ended up implementing this in the attached patch. The idea is that we detect
> if the FSM returns a page past the end of the relation, and ignore it.
> In that case we will fallback through the extension mechanism.

@@ -582,7 +583,17 @@ RelationGetBufferForTuple(Relation relation, Size len,
           * We have no cached target page, so ask the FSM for an initial
           * target.
           */
+        nblocks = RelationGetNumberOfBlocks(relation);
          targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);

I'd move the fetching of nblocks to after getting the targetBlock. This 
avoids extending the relation in the unlikely event of a FSM/relation 
extension in-between those two calls.

Patrick



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le mercredi 6 mars 2024, 11:59:43 CET Patrick Stählin a écrit :
> On 3/6/24 10:31, Ronan Dunklau wrote:
> > Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> >> I would guess this one is more risky from a performance perspective,
> >> since
> >> we'd be adding to a hotter path under RelationGetBufferForTuple().
> >> Still,
> >> it's likely fine.
> >
> > I ended up implementing this in the attached patch. The idea is that we
> > detect if the FSM returns a page past the end of the relation, and ignore
> > it. In that case we will fallback through the extension mechanism.
>
> @@ -582,7 +583,17 @@ RelationGetBufferForTuple(Relation relation, Size len,
>            * We have no cached target page, so ask the FSM for an
initial
>            * target.
>            */
> +        nblocks = RelationGetNumberOfBlocks(relation);
>           targetBlock = GetPageWithFreeSpace(relation,
targetFreeSpace);
>
> I'd move the fetching of nblocks to after getting the targetBlock. This
> avoids extending the relation in the unlikely event of a FSM/relation
> extension in-between those two calls.

Good catch. Please find attached v2 of the patch.

--
Ronan Dunklau
Attachment

Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
> Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> > I would guess this one is more risky from a performance perspective, since
> > we'd be adding to a hotter path under RelationGetBufferForTuple().  Still,
> > it's likely fine.
> 
> I ended up implementing this in the attached patch. The idea is that we detect 
> if the FSM returns a page past the end of the relation, and ignore it.
> In that case we will fallback through the extension mechanism.
> 
> For the corrupted-FSM case it is not great performance wise, as we will extend 
> the relation in small steps every time we find a non existing block in the FSM, 
> until the actual relation size matches what is recorded in the FSM. But since 
> those seldom happen, I figured it was better to keep the code really simple for 
> a bugfix.

That could be fine.  Before your message, I assumed we would treat this like
the case of a full page.  That is, we'd call RecordAndGetPageWithFreeSpace(),
which would remove the FSM entry and possibly return another.  That could be
problematic if another backend is extending the relation; we don't want to
erase the extender's new FSM entry.  I'm parking this topic for the moment.

> I wanted to test the impact in terms of performance, and I thought about the 
> worst possible case for this.
> 
[test details]
> 
> I noticed no difference running with or without the patch, but maybe someone 
> else can try to run that or find another adversarial case ? 

The patch currently adds an lseek() whenever the FSM finds a block.  I see
relevant-looking posts from mailing list searches with subsets of these terms:
RelationGetNumberOfBlocks lseek benchmark overhead.  I bet we should reduce
this cost add.  Some ideas:

- No need for a system call if the FSM-returned value is low enough with
  respect to any prior RelationGetNumberOfBlocks() call.

- We'd be happy and efficient with a ReadBufferMode such that
  ReadBufferExtended() returns NULL after a 0-byte read, with all other errors
  handled normally.  That sounds like the specification of RBM_NORMAL_NO_LOG.
  Today's implementation of RBM_NORMAL_NO_LOG isn't that, but perhaps one RBM
  could satisfy both sets of needs.

On Wed, Mar 06, 2024 at 12:08:38PM +0100, Ronan Dunklau wrote:
> Subject: [PATCH 2/2] Detect invalid FSM when finding a block.
> 
> Whenever the FSM points to a block past the end of the relation, detect
> it and fallback to the relation extension mechanism.
> 
> This may arise when a FPI for the FSM has been triggered, and we end up
> WAL-logging a page of the FSM pointing to newly extended blocks in the
> relation which have never been WAL-logged.
> ---
>  src/backend/access/heap/hio.c | 20 +++++++++++++++++++-

Each GetPageWithFreeSpace() caller would need this, not just heap.  Please
evaluate whether this logic belongs inside freespace.c vs. in each caller.
freespace.c comments and/or freespace/README may need updates.  Please
evaluate header comments of freespace.c functions whose callers you are
changing, and evaluate comments about truncation.



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le mercredi 6 mars 2024 19:42:28 CET, vous avez écrit :
> On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
> > Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> > > I would guess this one is more risky from a performance perspective,
> > > since
> > > we'd be adding to a hotter path under RelationGetBufferForTuple().
> > > Still,
> > > it's likely fine.
> >
> > I ended up implementing this in the attached patch. The idea is that we
> > detect if the FSM returns a page past the end of the relation, and ignore
> > it. In that case we will fallback through the extension mechanism.
> >
> > For the corrupted-FSM case it is not great performance wise, as we will
> > extend the relation in small steps every time we find a non existing
> > block in the FSM, until the actual relation size matches what is recorded
> > in the FSM. But since those seldom happen, I figured it was better to
> > keep the code really simple for a bugfix.
>
> That could be fine.  Before your message, I assumed we would treat this like
> the case of a full page.  That is, we'd call
> RecordAndGetPageWithFreeSpace(), which would remove the FSM entry and
> possibly return another.  That could be problematic if another backend is
> extending the relation; we don't want to erase the extender's new FSM
> entry.  I'm parking this topic for the moment.

I would argue this is fine, as corruptions don't happen often, and FSM is not
an exact thing anyway. If we record a block as full, it will just be unused
until the next vacuum adds it back to the FSM with a correct value.

> > I wanted to test the impact in terms of performance, and I thought about
> > the worst possible case for this.
>
> [test details]
>
> > I noticed no difference running with or without the patch, but maybe
> > someone else can try to run that or find another adversarial case ?
>
> The patch currently adds an lseek() whenever the FSM finds a block.  I see
> relevant-looking posts from mailing list searches with subsets of these
> terms: RelationGetNumberOfBlocks lseek benchmark overhead.  I bet we should
> reduce this cost add.  Some ideas:
>
> - No need for a system call if the FSM-returned value is low enough with
>   >   respect to any prior RelationGetNumberOfBlocks() call.

I'm not sure what you mean by "low enough".  What I chose to implement instead
is to rely on the cached value even when not in recovery for that particular
case.  The reasoning being, as above, that if someone extended the relation
since the last cached value, worst case scenario is that we do not insert into
that old block. Whenever we detect a case like this, we just erase the FSM
entry to mark the block as full, so that it's not picked up again next time.

>
> - We'd be happy and efficient with a ReadBufferMode such that
>   ReadBufferExtended() returns NULL after a 0-byte read, with all other
> errors handled normally.  That sounds like the specification of
> RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't that,
> but perhaps one RBM could satisfy both sets of needs.

I'm not sure about this one, this seems to have far more reaching
consequences. This would mean we don't have any cooperation from the FSM, and
would need to implement error recovery scenarios in each caller. But I admit I
haven't looked too much into it, and focused instead in seeing if the current
approach can get us to an acceptable state.

>
> On Wed, Mar 06, 2024 at 12:08:38PM +0100, Ronan Dunklau wrote:
> > Subject: [PATCH 2/2] Detect invalid FSM when finding a block.
> >
> > Whenever the FSM points to a block past the end of the relation, detect
> > it and fallback to the relation extension mechanism.
> >
> > This may arise when a FPI for the FSM has been triggered, and we end up
> > WAL-logging a page of the FSM pointing to newly extended blocks in the
> > relation which have never been WAL-logged.
> > ---
> >
> >  src/backend/access/heap/hio.c | 20 +++++++++++++++++++-
>
> Each GetPageWithFreeSpace() caller would need this, not just heap.  Please
> evaluate whether this logic belongs inside freespace.c vs. in each caller.
> freespace.c comments and/or freespace/README may need updates.  Please
> evaluate header comments of freespace.c functions whose callers you are
> changing, and evaluate comments about truncation.

I implemented the new behaviour in fsm_search and
RecordAndGetPageWithFreeSpace.

Regards,

--
Ronan
Attachment

Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Thu, Mar 07, 2024 at 12:21:24PM +0100, Ronan Dunklau wrote:
> Le mercredi 6 mars 2024 19:42:28 CET, vous avez écrit :
> > On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
> > > Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> > > > I would guess this one is more risky from a performance perspective,
> > > > since
> > > > we'd be adding to a hotter path under RelationGetBufferForTuple(). 
> > > > Still,
> > > > it's likely fine.
> > > 
> > > I ended up implementing this in the attached patch. The idea is that we
> > > detect if the FSM returns a page past the end of the relation, and ignore
> > > it. In that case we will fallback through the extension mechanism.
> > > 
> > > For the corrupted-FSM case it is not great performance wise, as we will
> > > extend the relation in small steps every time we find a non existing
> > > block in the FSM, until the actual relation size matches what is recorded
> > > in the FSM. But since those seldom happen, I figured it was better to
> > > keep the code really simple for a bugfix.
> > 
> > That could be fine.  Before your message, I assumed we would treat this like
> > the case of a full page.  That is, we'd call
> > RecordAndGetPageWithFreeSpace(), which would remove the FSM entry and
> > possibly return another.  That could be problematic if another backend is
> > extending the relation; we don't want to erase the extender's new FSM
> > entry.  I'm parking this topic for the moment.
> 
> I would argue this is fine, as corruptions don't happen often, and FSM is not 
> an exact thing anyway. If we record a block as full, it will just be unused 
> until the next vacuum adds it back to the FSM with a correct value. 

If this happened only under FSM corruption, it would be fine.  I was thinking
about normal extension, with an event sequence like this:

- RelationExtensionLockWaiterCount() returns 10.
- Backend B1 extends by 10 pages.
- B1 records 9 pages in the FSM.
- B2, B3, ... B11 wake up and fetch from the FSM.  In each of those backends,
  fsm_does_block_exist() returns false, because the cached relation size is
  stale.  Those pages remain unused until VACUUM re-records them in the FSM.

PostgreSQL added the multiple-block extension logic (commit 719c84c) from
hard-won experience bottlenecking there.  If fixing $SUBJECT will lose 1% of
that change's benefit, that's probably okay.  If it loses 20%, we should work
to do better.  While we could rerun the 2016 benchmark to see how the patch
regresses it, there might be a simple fix.  fsm_does_block_exist() could skip
the cache when blknumber>=cached, like this:

    return((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
            blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
           blknumber < RelationGetNumberOfBlocks(rel));

How do you see it?  That still leaves us with an avoidable lseek in B2, B3,
... B11, but that feels tolerable.  I don't know how to do better without
switching to the ReadBufferMode approach.

> > The patch currently adds an lseek() whenever the FSM finds a block.  I see
> > relevant-looking posts from mailing list searches with subsets of these
> > terms: RelationGetNumberOfBlocks lseek benchmark overhead.  I bet we should
> > reduce this cost add.  Some ideas:
> > 
> > - No need for a system call if the FSM-returned value is low enough with
> >   >   respect to any prior RelationGetNumberOfBlocks() call.
> 
> I'm not sure what you mean by "low enough".  What I chose to implement instead 

I just meant "less than".  Specifically, if "fsm_returned_value < MAX(all
RelationGetNumberOfBlocks() calls done on this relation, since last postmaster
start)" then the FSM-returned value won't have the problem seen in $SUBJECT.

> > - We'd be happy and efficient with a ReadBufferMode such that
> >   ReadBufferExtended() returns NULL after a 0-byte read, with all other
> > errors handled normally.  That sounds like the specification of
> > RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't that,
> > but perhaps one RBM could satisfy both sets of needs.
> 
> I'm not sure about this one, this seems to have far more reaching 
> consequences. This would mean we don't have any cooperation from the FSM, and 
> would need to implement error recovery scenarios in each caller. But I admit I 
> haven't looked too much into it, and focused instead in seeing if the current 
> approach can get us to an acceptable state. 

I, too, am not sure.  I agree that RBM approach wouldn't let you isolate the
changes like the last patch does.  Each GetFreeIndexPage() caller would need
code for the return-NULL case.  Avoiding that is nice.  (If we ever do the RBM
change in the future, the fsm_readbuf() and vm_readbuf() !extend cases should
also use it.)

> --- a/src/backend/storage/freespace/freespace.c
> +++ b/src/backend/storage/freespace/freespace.c

> +            if (addr.level == FSM_BOTTOM_LEVEL) {
> +                BlockNumber blkno = fsm_get_heap_blk(addr, slot);
> +                Page page;
> +                /*
> +                 * If the buffer is past the end of the relation,
> +                 * update the page and restarts from the root
> +                 */
> +                if (fsm_does_block_exist(rel, blkno))
> +                {
> +                    ReleaseBuffer(buf);
> +                    return blkno;
> +                }
> +                page = BufferGetPage(buf);
> +                LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +                fsm_set_avail(page, slot, 0);
> +                MarkBufferDirtyHint(buf, false);

I wondered if we should zero all slots on the same FSM page that correspond to
main fork blocks past the end of the main fork.  The usual "advancenext"
behavior is poor for this case, since every later slot is invalid in same way.
My current thinking: no, it's not worth the code bulk to do better.  A single
extension caps at 64 blocks, so the worst case is roughly locking the buffer
64 times and restarting from FSM root 64 times.  For something this
infrequent, that's not bad.

Thanks,
nm



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le mardi 12 mars 2024, 23:08:07 CET Noah Misch a écrit :
> > I would argue this is fine, as corruptions don't happen often, and FSM is
> > not an exact thing anyway. If we record a block as full, it will just be
> > unused until the next vacuum adds it back to the FSM with a correct
> > value.
> If this happened only under FSM corruption, it would be fine.  I was
> thinking about normal extension, with an event sequence like this:
>
> - RelationExtensionLockWaiterCount() returns 10.
> - Backend B1 extends by 10 pages.
> - B1 records 9 pages in the FSM.
> - B2, B3, ... B11 wake up and fetch from the FSM.  In each of those
> backends, fsm_does_block_exist() returns false, because the cached relation
> size is stale.  Those pages remain unused until VACUUM re-records them in
> the FSM.
>
> PostgreSQL added the multiple-block extension logic (commit 719c84c) from
> hard-won experience bottlenecking there.  If fixing $SUBJECT will lose 1% of
> that change's benefit, that's probably okay.  If it loses 20%, we should
> work to do better.  While we could rerun the 2016 benchmark to see how the
> patch regresses it, there might be a simple fix.  fsm_does_block_exist()
> could skip the cache when blknumber>=cached, like this:
>
>     return((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM])
&&
>             blknumber < smgr-
>smgr_cached_nblocks[MAIN_FORKNUM]) ||
>            blknumber < RelationGetNumberOfBlocks(rel));
>
> How do you see it?  That still leaves us with an avoidable lseek in B2, B3,
> ... B11, but that feels tolerable.  I don't know how to do better without
> switching to the ReadBufferMode approach.

Hi Noah and thanks for your feedback.

That makes sense. I'm a bit worried about triggering additional lseeks when we
in fact have other free pages in the map, that would pass the test. I'm not
sure how relevant it is given the way we search the FSM with fp_next_slot
though...

To address that, I've given a bit of thought about enabling / disabling the
auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish the
cases where we know we have a somewhat up-to-date value compared to the case
where we don't (as in, for heap, try without repair, then get an uptodate
value to try the last block, and if we need to look at the FSM again then ask
for it to be repaired) but it brings way too much complexity and would need
careful thought for each AM.

So please find attached a patch with the change you propose.

Do you have a link to the benchmark you mention though to evaluate the patch
against it ?

>
> > > The patch currently adds an lseek() whenever the FSM finds a block.  I
> > > see
> > > relevant-looking posts from mailing list searches with subsets of these
> > > terms: RelationGetNumberOfBlocks lseek benchmark overhead.  I bet we
> > > should
> > > reduce this cost add.  Some ideas:
> > >
> > > - No need for a system call if the FSM-returned value is low enough with
> > >
> > >   >   respect to any prior RelationGetNumberOfBlocks() call.
> >
> > I'm not sure what you mean by "low enough".  What I chose to implement
> > instead
> I just meant "less than".  Specifically, if "fsm_returned_value < MAX(all
> RelationGetNumberOfBlocks() calls done on this relation, since last
> postmaster start)" then the FSM-returned value won't have the problem seen
> in $SUBJECT.
> > > - We'd be happy and efficient with a ReadBufferMode such that
> > >
> > >   ReadBufferExtended() returns NULL after a 0-byte read, with all other
> > >
> > > errors handled normally.  That sounds like the specification of
> > > RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't
> > > that,
> > > but perhaps one RBM could satisfy both sets of needs.
> >
> > I'm not sure about this one, this seems to have far more reaching
> > consequences. This would mean we don't have any cooperation from the FSM,
> > and would need to implement error recovery scenarios in each caller. But
> > I admit I haven't looked too much into it, and focused instead in seeing
> > if the current approach can get us to an acceptable state.
>
> I, too, am not sure.  I agree that RBM approach wouldn't let you isolate the
> changes like the last patch does.  Each GetFreeIndexPage() caller would
> need code for the return-NULL case.  Avoiding that is nice.  (If we ever do
> the RBM change in the future, the fsm_readbuf() and vm_readbuf() !extend
> cases should also use it.)
>
> > --- a/src/backend/storage/freespace/freespace.c
> > +++ b/src/backend/storage/freespace/freespace.c
> >
> > +            if (addr.level == FSM_BOTTOM_LEVEL) {
> > +                BlockNumber blkno =
fsm_get_heap_blk(addr, slot);
> > +                Page page;
> > +                /*
> > +                 * If the buffer is past the end
of the relation,
> > +                 * update the page and restarts
from the root
> > +                 */
> > +                if (fsm_does_block_exist(rel,
blkno))
> > +                {
> > +                    ReleaseBuffer(buf);
> > +                    return blkno;
> > +                }
> > +                page = BufferGetPage(buf);
> > +                LockBuffer(buf,
BUFFER_LOCK_EXCLUSIVE);
> > +                fsm_set_avail(page, slot, 0);
> > +                MarkBufferDirtyHint(buf, false);
>
> I wondered if we should zero all slots on the same FSM page that correspond
> to main fork blocks past the end of the main fork.  The usual "advancenext"
> behavior is poor for this case, since every later slot is invalid in same
> way. My current thinking: no, it's not worth the code bulk to do better.  A
> single extension caps at 64 blocks, so the worst case is roughly locking
> the buffer 64 times and restarting from FSM root 64 times.  For something
> this infrequent, that's not bad.

That was my reasoning as well. Paying a small performance penalty in the
unlikely case we hit a corruption and get done with it seems correct.


>
> Thanks,
> nm


Attachment

Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Wed, Mar 13, 2024 at 10:37:21AM +0100, Ronan Dunklau wrote:
> Le mardi 12 mars 2024, 23:08:07 CET Noah Misch a écrit :
> > > I would argue this is fine, as corruptions don't happen often, and FSM is
> > > not an exact thing anyway. If we record a block as full, it will just be
> > > unused until the next vacuum adds it back to the FSM with a correct
> > > value.
> > If this happened only under FSM corruption, it would be fine.  I was
> > thinking about normal extension, with an event sequence like this:
> > 
> > - RelationExtensionLockWaiterCount() returns 10.
> > - Backend B1 extends by 10 pages.
> > - B1 records 9 pages in the FSM.
> > - B2, B3, ... B11 wake up and fetch from the FSM.  In each of those
> > backends, fsm_does_block_exist() returns false, because the cached relation
> > size is stale.  Those pages remain unused until VACUUM re-records them in
> > the FSM.
> > 
> > PostgreSQL added the multiple-block extension logic (commit 719c84c) from
> > hard-won experience bottlenecking there.  If fixing $SUBJECT will lose 1% of
> > that change's benefit, that's probably okay.  If it loses 20%, we should
> > work to do better.  While we could rerun the 2016 benchmark to see how the
> > patch regresses it, there might be a simple fix.  fsm_does_block_exist()
> > could skip the cache when blknumber>=cached, like this:
> > 
> >     return((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) 
> &&
> >             blknumber < smgr-
> >smgr_cached_nblocks[MAIN_FORKNUM]) ||
> >            blknumber < RelationGetNumberOfBlocks(rel));
> > 
> > How do you see it?  That still leaves us with an avoidable lseek in B2, B3,
> > ... B11, but that feels tolerable.  I don't know how to do better without
> > switching to the ReadBufferMode approach.

> That makes sense. I'm a bit worried about triggering additional lseeks when we 
> in fact have other free pages in the map, that would pass the test. I'm not 
> sure how relevant it is given the way we search the FSM with fp_next_slot 
> though...

That's a reasonable thing to worry about.  We could do wrong by trying too
hard to use an FSM slot, and we could do wrong by not trying hard enough.

> To address that, I've given a bit of thought about enabling / disabling the 
> auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish the 
> cases where we know we have a somewhat up-to-date value compared to the case 
> where we don't (as in, for heap, try without repair, then get an uptodate 
> value to try the last block, and if we need to look at the FSM again then ask 
> for it to be repaired) but it brings way too much complexity and would need 
> careful thought for each AM. 
> 
> So please find attached a patch with the change you propose.
> 
> Do you have a link to the benchmark you mention though to evaluate the patch 
> against it ?

Here is one from the thread that created commit 719c84c:
https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEomgHfuVdzNJnwWVoE_c60g%40mail.gmail.com

There may be other benchmarks earlier in that thread.



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le mercredi 13 mars 2024, 17:55:23 CET Noah Misch a écrit :
> On Wed, Mar 13, 2024 at 10:37:21AM +0100, Ronan Dunklau wrote:
> > Le mardi 12 mars 2024, 23:08:07 CET Noah Misch a écrit :
> > > > I would argue this is fine, as corruptions don't happen often, and FSM
> > > > is
> > > > not an exact thing anyway. If we record a block as full, it will just
> > > > be
> > > > unused until the next vacuum adds it back to the FSM with a correct
> > > > value.
> > >
> > > If this happened only under FSM corruption, it would be fine.  I was
> > > thinking about normal extension, with an event sequence like this:
> > >
> > > - RelationExtensionLockWaiterCount() returns 10.
> > > - Backend B1 extends by 10 pages.
> > > - B1 records 9 pages in the FSM.
> > > - B2, B3, ... B11 wake up and fetch from the FSM.  In each of those
> > > backends, fsm_does_block_exist() returns false, because the cached
> > > relation
> > > size is stale.  Those pages remain unused until VACUUM re-records them
> > > in
> > > the FSM.
> > >
> > > PostgreSQL added the multiple-block extension logic (commit 719c84c)
> > > from
> > > hard-won experience bottlenecking there.  If fixing $SUBJECT will lose
> > > 1% of that change's benefit, that's probably okay.  If it loses 20%, we
> > > should work to do better.  While we could rerun the 2016 benchmark to
> > > see how the patch regresses it, there might be a simple fix.
> > > fsm_does_block_exist()> >
> > > could skip the cache when blknumber>=cached, like this:
> > >     return((BlockNumberIsValid(smgr-
>smgr_cached_nblocks[MAIN_FORKNUM])
> >
> > &&
> >
> > >             blknumber < smgr-
> > >
> > >smgr_cached_nblocks[MAIN_FORKNUM]) ||
> > >
> > >            blknumber < RelationGetNumberOfBlocks(rel));
> > >
> > > How do you see it?  That still leaves us with an avoidable lseek in B2,
> > > B3,
> > > ... B11, but that feels tolerable.  I don't know how to do better
> > > without
> > > switching to the ReadBufferMode approach.
> >
> > That makes sense. I'm a bit worried about triggering additional lseeks
> > when we in fact have other free pages in the map, that would pass the
> > test. I'm not sure how relevant it is given the way we search the FSM
> > with fp_next_slot though...
>
> That's a reasonable thing to worry about.  We could do wrong by trying too
> hard to use an FSM slot, and we could do wrong by not trying hard enough.
>
> > To address that, I've given a bit of thought about enabling / disabling
> > the
> > auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish
> > the cases where we know we have a somewhat up-to-date value compared to
> > the case where we don't (as in, for heap, try without repair, then get an
> > uptodate value to try the last block, and if we need to look at the FSM
> > again then ask for it to be repaired) but it brings way too much
> > complexity and would need careful thought for each AM.
> >
> > So please find attached a patch with the change you propose.
> >
> > Do you have a link to the benchmark you mention though to evaluate the
> > patch against it ?
>
> Here is one from the thread that created commit 719c84c:
> https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEom
> gHfuVdzNJnwWVoE_c60g%40mail.gmail.com
>
> There may be other benchmarks earlier in that thread.

Thank you. I tried to run that benchmark, with 4 clients all copying 10M
pgbench_accounts row concurrently. With and without the patch, I ran the
benchmark 5 times, and took the average.

master: 15.05s
patched: 15.24s (+1.25%)

If I remove the best and worst run for each of those, the difference falls at
+0.7%.





Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Fri, Mar 15, 2024 at 07:23:46AM +0100, Ronan Dunklau wrote:
> Le mercredi 13 mars 2024, 17:55:23 CET Noah Misch a écrit :
> > On Wed, Mar 13, 2024 at 10:37:21AM +0100, Ronan Dunklau wrote:
> > > I'm a bit worried about triggering additional lseeks
> > > when we in fact have other free pages in the map, that would pass the
> > > test. I'm not sure how relevant it is given the way we search the FSM
> > > with fp_next_slot though...
> > 
> > That's a reasonable thing to worry about.  We could do wrong by trying too
> > hard to use an FSM slot, and we could do wrong by not trying hard enough.
> > 
> > > To address that, I've given a bit of thought about enabling / disabling
> > > the
> > > auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish
> > > the cases where we know we have a somewhat up-to-date value compared to
> > > the case where we don't (as in, for heap, try without repair, then get an
> > > uptodate value to try the last block, and if we need to look at the FSM
> > > again then ask for it to be repaired) but it brings way too much
> > > complexity and would need careful thought for each AM.
> > > 
> > > So please find attached a patch with the change you propose.
> > > 
> > > Do you have a link to the benchmark you mention though to evaluate the
> > > patch against it ?
> > 
> > Here is one from the thread that created commit 719c84c:
> > https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEom
> > gHfuVdzNJnwWVoE_c60g%40mail.gmail.com
> > 
> > There may be other benchmarks earlier in that thread.
> 
> Thank you. I tried to run that benchmark, with 4 clients all copying 10M 
> pgbench_accounts row concurrently. With and without the patch, I ran the 
> benchmark 5 times, and took the average. 

Does "that benchmark" refer to "unlogged tables, 4 parallel copies", "logged
tables, 4 parallel copies", or something else?

> master: 15.05s
> patched: 15.24s (+1.25%)
> 
> If I remove the best and worst run for each of those, the difference falls at 
> +0.7%.

To get some additional perspective on the benchmark, how hard would it be to
run one or both of the following?  Feel free to decline if difficult.

- Make GetPageWithFreeSpace() just "return InvalidBlockNumber", then rerun the
  benchmark.  This should be a lot slower.  If not, the bottleneck is
  somewhere unexpected, and we'd need a different benchmark.

- Get profiles with both master and patched.  (lseek or freespace.c functions
  rising by 0.1%-1% would fit what we know.)

Distinguishing a 1% change from a 0% change would need different techniques
still.  If we're genuinely slowing bulk loads by ~1% to account for the
possibly of a flawed post-recovery FSM, that's sad, but I'm inclined to accept
the loss.  A persistent error in an innocent INSERT is unacceptable, and the
alternatives we discussed upthread have their own substantial trouble.  Other
opinions?

Thanks,
nm



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Fri, Mar 15, 2024 at 08:03:39PM -0700, Noah Misch wrote:
> On Fri, Mar 15, 2024 at 07:23:46AM +0100, Ronan Dunklau wrote:
> > Le mercredi 13 mars 2024, 17:55:23 CET Noah Misch a écrit :
> > > On Wed, Mar 13, 2024 at 10:37:21AM +0100, Ronan Dunklau wrote:
> > > > I'm a bit worried about triggering additional lseeks
> > > > when we in fact have other free pages in the map, that would pass the
> > > > test. I'm not sure how relevant it is given the way we search the FSM
> > > > with fp_next_slot though...
> > > 
> > > That's a reasonable thing to worry about.  We could do wrong by trying too
> > > hard to use an FSM slot, and we could do wrong by not trying hard enough.
> > > 
> > > > To address that, I've given a bit of thought about enabling / disabling
> > > > the
> > > > auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish
> > > > the cases where we know we have a somewhat up-to-date value compared to
> > > > the case where we don't (as in, for heap, try without repair, then get an
> > > > uptodate value to try the last block, and if we need to look at the FSM
> > > > again then ask for it to be repaired) but it brings way too much
> > > > complexity and would need careful thought for each AM.
> > > > 
> > > > So please find attached a patch with the change you propose.
> > > > 
> > > > Do you have a link to the benchmark you mention though to evaluate the
> > > > patch against it ?
> > > 
> > > Here is one from the thread that created commit 719c84c:
> > > https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEom
> > > gHfuVdzNJnwWVoE_c60g%40mail.gmail.com
> > > 
> > > There may be other benchmarks earlier in that thread.
> > 
> > Thank you. I tried to run that benchmark, with 4 clients all copying 10M 
> > pgbench_accounts row concurrently. With and without the patch, I ran the 
> > benchmark 5 times, and took the average. 
> 
> Does "that benchmark" refer to "unlogged tables, 4 parallel copies", "logged
> tables, 4 parallel copies", or something else?
> 
> > master: 15.05s
> > patched: 15.24s (+1.25%)
> > 
> > If I remove the best and worst run for each of those, the difference falls at 
> > +0.7%.
> 
> To get some additional perspective on the benchmark, how hard would it be to
> run one or both of the following?  Feel free to decline if difficult.
> 
> - Make GetPageWithFreeSpace() just "return InvalidBlockNumber", then rerun the
>   benchmark.  This should be a lot slower.  If not, the bottleneck is
>   somewhere unexpected, and we'd need a different benchmark.
> 
> - Get profiles with both master and patched.  (lseek or freespace.c functions
>   rising by 0.1%-1% would fit what we know.)

Forgot one more:

- Your earlier version of the patch, with fewer lseek() but more disuse of FSM
  entries.

> Distinguishing a 1% change from a 0% change would need different techniques
> still.  If we're genuinely slowing bulk loads by ~1% to account for the
> possibly of a flawed post-recovery FSM, that's sad, but I'm inclined to accept
> the loss.  A persistent error in an innocent INSERT is unacceptable, and the
> alternatives we discussed upthread have their own substantial trouble.  Other
> opinions?
> 
> Thanks,
> nm



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le samedi 16 mars 2024, 05:58:34 CET Noah Misch a écrit :
> > Does "that benchmark" refer to "unlogged tables, 4 parallel copies",
> > "logged tables, 4 parallel copies", or something else?

Sorry I was quite busy, and will not find time to run more benchmarks until
next week but I'll get back to it.

It was 4 parallel copies, logged tables. Results for unlogged tables are
similar.

> >
> > > master: 15.05s
> > > patched: 15.24s (+1.25%)
> > >
> > > If I remove the best and worst run for each of those, the difference
> > > falls at +0.7%.
> >
> > To get some additional perspective on the benchmark, how hard would it be
> > to run one or both of the following?  Feel free to decline if difficult.
> >
> > - Make GetPageWithFreeSpace() just "return InvalidBlockNumber", then rerun
> > the>
> >   benchmark.  This should be a lot slower.  If not, the bottleneck is
> >   somewhere unexpected, and we'd need a different benchmark.

Hum, it doesn't change much so that benchmark does not indeed exercise that
code too aggressively. I was thinking my initial benchmark for the very first
naive version of the patch could work, but I'm afraid it won't show much as
the table size will be cached soon enough.

> >
> > - Get profiles with both master and patched.  (lseek or freespace.c
> > functions>
> >   rising by 0.1%-1% would fit what we know.)
>

I'll try to run that asap, probably next week unfortunately.

> Forgot one more:
>
> - Your earlier version of the patch, with fewer lseek() but more disuse of
> FSM entries.

Noted.

--
Ronan Dunklau






Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le jeudi 21 mars 2024, 14:51:25 CET Ronan Dunklau a écrit :
> Le samedi 16 mars 2024, 05:58:34 CET Noah Misch a écrit :
> > > - Get profiles with both master and patched.  (lseek or freespace.c
> > > functions>
> > >
> > >   rising by 0.1%-1% would fit what we know.)

Running perf during the same benchmark, I get the following samples for lseek:

With the patch:

   0.02%  postgres  libc.so.6          [.] llseek@GLIBC_2.2.5

Without the patch:

   0.01%  postgres  libc.so.6          [.] llseek@GLIBC_2.2.5


So nothing too dramatic.
I haven't been able to come up with  a better benchmark.

Regards,

--
Ronan





Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
Your v3 has the right functionality.  As further confirmation of the fix, I
tried reverting the non-test parts of commit 917dc7d "Fix WAL-logging of FSM
and VM truncation".  That commit's 008_fsm_truncation.pl fails with 917dc7d
reverted from master, and adding this patch makes it pass again.  I ran
pgindent and edited comments.  I think the attached version is ready to go.

While updating comments in FreeSpaceMapPrepareTruncateRel(), I entered a
rabbit hole about the comments 917dc7d left about torn pages.  I'm sharing
these findings just in case it helps a reader of the $SUBJECT patch avoid the
same rabbit hole.  Both fsm and vm read with RBM_ZERO_ON_ERROR, so I think
they're fine with torn pages.  Per the README sentences I'm adding, FSM could
stop writing WAL.  I'm not proposing that, but I do bet it's the right thing.
visibilitymap_prepare_truncate() has mirrored fsm truncate since 917dc7d.  The
case for removing WAL there is clearer still, because parallel function
visibilitymap_clear() does not write WAL.  I'm attaching a WIP patch to remove
visibilitymap_prepare_truncate() WAL.  I'll abandon that or pursue it for v18,
in a different thread.

On Fri, Mar 29, 2024 at 09:47:52AM +0100, Ronan Dunklau wrote:
> Le jeudi 21 mars 2024, 14:51:25 CET Ronan Dunklau a écrit :
> > Le samedi 16 mars 2024, 05:58:34 CET Noah Misch a écrit :
> > > > - Get profiles with both master and patched.  (lseek or freespace.c
> > > > functions>
> > > > 
> > > >   rising by 0.1%-1% would fit what we know.)
> 
> Running perf during the same benchmark, I get the following samples for lseek:
> 
> With the patch:
> 
>    0.02%  postgres  libc.so.6          [.] llseek@GLIBC_2.2.5 
> 
> Without the patch:
> 
>    0.01%  postgres  libc.so.6          [.] llseek@GLIBC_2.2.5
> 
> 
> So nothing too dramatic.
> I haven't been able to come up with  a better benchmark.

I tried, without success, to reproduce the benefit of commit 719c84c "Extend
relations multiple blocks at a time to improve scalability" like
https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEomgHfuVdzNJnwWVoE_c60g%40mail.gmail.com
did.  My conditions must be lacking whatever factor caused the consistent >20%
improvement in the linked message.  Environment was ext4 on magnetic disk,
Linux 3.10.0-1160.99.1.el7.x86_64, 8 threads.  I'm attaching the script I used
to test 719c84c^, 719c84c, and [719c84c with GetPageWithFreeSpace() made to
always find nothing].  I'm also attaching the script's output after filtration
through "grep ^TIME | sort -n".  The range of timings within each test
scenario dwarfs any difference across scenarios.  I casually tried some other
variations without obvious change:

- 24MB shared_buffers
- 128MB shared_buffers
- 4 threads
- 32 threads
- bgwriter_flush_after & backend_flush_after at their defaults, instead of 0
- 2024-04 code (e2a2357) unmodified, then with GetPageWithFreeSpace() made to always find nothing
- logged tables
- INSERT ... SELECT instead of COPY

If I were continuing the benchmark study, I would try SSD, a newer kernel,
and/or shared_buffers=48GB.  Instead, since your perf results show only +0.01%
CPU from new lseek() calls, I'm going to stop there and say it's worth taking
the remaining risk that some realistic scenario gets a material regression
from those new lseek() calls.

Attachment

Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le dimanche 7 avril 2024, 00:30:37 CEST Noah Misch a écrit :
> Your v3 has the right functionality.  As further confirmation of the fix, I
> tried reverting the non-test parts of commit 917dc7d "Fix WAL-logging of FSM
> and VM truncation".  That commit's 008_fsm_truncation.pl fails with 917dc7d
> reverted from master, and adding this patch makes it pass again.  I ran
> pgindent and edited comments.  I think the attached version is ready to go.
>

Thank you Noah, the updated comments are much better. I think it should be
backported at least to 16 since the chances of tripping on that behaviour are
quite high here, but what about previous versions ?


> While updating comments in FreeSpaceMapPrepareTruncateRel(), I entered a
> rabbit hole about the comments 917dc7d left about torn pages.  I'm sharing
> these findings just in case it helps a reader of the $SUBJECT patch avoid
> the same rabbit hole.  Both fsm and vm read with RBM_ZERO_ON_ERROR, so I
> think they're fine with torn pages.  Per the README sentences I'm adding,
> FSM could stop writing WAL.  I'm not proposing that, but I do bet it's the
> right thing. visibilitymap_prepare_truncate() has mirrored fsm truncate
> since 917dc7d.  The case for removing WAL there is clearer still, because
> parallel function visibilitymap_clear() does not write WAL.  I'm attaching
> a WIP patch to remove visibilitymap_prepare_truncate() WAL.  I'll abandon
> that or pursue it for v18, in a different thread.

That's an interesting finding.

> If I were continuing the benchmark study, I would try SSD, a newer kernel,
> and/or shared_buffers=48GB.  Instead, since your perf results show only
> +0.01% CPU from new lseek() calls, I'm going to stop there and say it's
> worth taking the remaining risk that some realistic scenario gets a
> material regression from those new lseek() calls.

Agree with you here.

Many thanks,

--
Ronan Dunklau






Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Thu, Apr 11, 2024 at 09:36:50AM +0200, Ronan Dunklau wrote:
> Le dimanche 7 avril 2024, 00:30:37 CEST Noah Misch a écrit :
> > Your v3 has the right functionality.  As further confirmation of the fix, I
> > tried reverting the non-test parts of commit 917dc7d "Fix WAL-logging of FSM
> > and VM truncation".  That commit's 008_fsm_truncation.pl fails with 917dc7d
> > reverted from master, and adding this patch makes it pass again.  I ran
> > pgindent and edited comments.  I think the attached version is ready to go.
> 
> Thank you Noah, the updated comments are much better. I think it should be 
> backported at least to 16 since the chances of tripping on that behaviour are 
> quite high here, but what about previous versions ?

It should be reachable in all branches, just needing concurrent extension lock
waiters to reach before v16.  Hence, my plan is to back-patch it all the way.
It applies with negligible conflicts back to v12.



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Peter Geoghegan
Date:
On Wed, Mar 13, 2024 at 12:55 PM Noah Misch <noah@leadboat.com> wrote:
> That's a reasonable thing to worry about.  We could do wrong by trying too
> hard to use an FSM slot, and we could do wrong by not trying hard enough.

Although it's not related to the problem you're working on, it seems
like a good opportunity to bring up a concern about the FSM that I
don't believe was discussed at any point in the past few years: I
wonder if the way that fsm_search_avail() sometimes updates
fsmpage->fp_next_slot with only a shared lock on the page could cause
problems. At the very least, it's weird that we allow it.

--
Peter Geoghegan



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Thu, Apr 11, 2024 at 12:01:09PM -0400, Peter Geoghegan wrote:
> On Wed, Mar 13, 2024 at 12:55 PM Noah Misch <noah@leadboat.com> wrote:
> > That's a reasonable thing to worry about.  We could do wrong by trying too
> > hard to use an FSM slot, and we could do wrong by not trying hard enough.
> 
> Although it's not related to the problem you're working on, it seems
> like a good opportunity to bring up a concern about the FSM that I
> don't believe was discussed at any point in the past few years: I
> wonder if the way that fsm_search_avail() sometimes updates
> fsmpage->fp_next_slot with only a shared lock on the page could cause
> problems. At the very least, it's weird that we allow it.

fsm_search_avail() treats an out-of-range fp_next_slot like zero, so I'm not
seeing a correctness issue.  I bet changing it under an exclusive lock
wouldn't deliver better-optimized searches to an extent that pays for the
synchronization overhead, but it might.



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Thomas Munro
Date:
On Fri, Apr 12, 2024 at 4:01 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Although it's not related to the problem you're working on, it seems
> like a good opportunity to bring up a concern about the FSM that I
> don't believe was discussed at any point in the past few years: I
> wonder if the way that fsm_search_avail() sometimes updates
> fsmpage->fp_next_slot with only a shared lock on the page could cause
> problems. At the very least, it's weird that we allow it.

Aha.  Good to know.  So that is another place where direct I/O on a
file system with checksums might get very upset, if it takes no
measures of its own to prevent the data from changing underneath it
during a pwrite() call.  The only known system like that so far is
btrfs (phenemon #1 in [1], see reproducer).  The symptom is that the
next read fails with EIO.

[1] https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Noah Misch
Date:
On Thu, Apr 11, 2024 at 08:38:43AM -0700, Noah Misch wrote:
> On Thu, Apr 11, 2024 at 09:36:50AM +0200, Ronan Dunklau wrote:
> > Le dimanche 7 avril 2024, 00:30:37 CEST Noah Misch a écrit :
> > > Your v3 has the right functionality.  As further confirmation of the fix, I
> > > tried reverting the non-test parts of commit 917dc7d "Fix WAL-logging of FSM
> > > and VM truncation".  That commit's 008_fsm_truncation.pl fails with 917dc7d
> > > reverted from master, and adding this patch makes it pass again.  I ran
> > > pgindent and edited comments.  I think the attached version is ready to go.
> > 
> > Thank you Noah, the updated comments are much better. I think it should be 
> > backported at least to 16 since the chances of tripping on that behaviour are 
> > quite high here, but what about previous versions ?
> 
> It should be reachable in all branches, just needing concurrent extension lock
> waiters to reach before v16.  Hence, my plan is to back-patch it all the way.
> It applies with negligible conflicts back to v12.

While it applied, it doesn't build in v12 or v13, due to smgr_cached_nblocks
first appearing in c5315f4.  Options:

1. Back-patch the addition of smgr_cached_nblocks or equivalent.
2. Stop the back-patch of $SUBJECT at v14.
3. Incur more lseek() in v13 and v12.

Given the lack of reports before v16, (3) seems too likely to be a cure worse
than the disease.  I'm picking (2) for today.  We could do (1) tomorrow, but I
lean toward (2) until someone reports the problem on v13 or v12.  The
problem's impact is limited to DML giving ERROR when it should have succeeded,
and I expect VACUUM FULL is a workaround.  Without those mitigating factors, I
would choose (1).

Pushed that way, as 9358297.



Re: FSM Corruption (was: Could not read block at end of the relation)

From
Ronan Dunklau
Date:
Le samedi 13 avril 2024, 19:15:28 CEST Noah Misch a écrit :
> On Thu, Apr 11, 2024 at 08:38:43AM -0700, Noah Misch wrote:
> > On Thu, Apr 11, 2024 at 09:36:50AM +0200, Ronan Dunklau wrote:
> > > Le dimanche 7 avril 2024, 00:30:37 CEST Noah Misch a écrit :
> > > > Your v3 has the right functionality.  As further confirmation of the
> > > > fix, I
> > > > tried reverting the non-test parts of commit 917dc7d "Fix WAL-logging
> > > > of FSM and VM truncation".  That commit's 008_fsm_truncation.pl fails
> > > > with 917dc7d reverted from master, and adding this patch makes it
> > > > pass again.  I ran pgindent and edited comments.  I think the
> > > > attached version is ready to go.> >
> > > Thank you Noah, the updated comments are much better. I think it should
> > > be
> > > backported at least to 16 since the chances of tripping on that
> > > behaviour are quite high here, but what about previous versions ?
> >
> > It should be reachable in all branches, just needing concurrent extension
> > lock waiters to reach before v16.  Hence, my plan is to back-patch it all
> > the way. It applies with negligible conflicts back to v12.
>
> While it applied, it doesn't build in v12 or v13, due to smgr_cached_nblocks
> first appearing in c5315f4.  Options:
>
> 1. Back-patch the addition of smgr_cached_nblocks or equivalent.
> 2. Stop the back-patch of $SUBJECT at v14.
> 3. Incur more lseek() in v13 and v12.
>
> Given the lack of reports before v16, (3) seems too likely to be a cure
> worse than the disease.  I'm picking (2) for today.  We could do (1)
> tomorrow, but I lean toward (2) until someone reports the problem on v13 or
> v12.  The problem's impact is limited to DML giving ERROR when it should
> have succeeded, and I expect VACUUM FULL is a workaround.  Without those
> mitigating factors, I would choose (1).
>
> Pushed that way, as 9358297.

I agree with you that option 2 seems to be the safest course of action. For
the record, other available options when that happens is to stop PG and
manually remove the FSM from disk (see https://wiki.postgresql.org/wiki/
Free_Space_Map_Problems) or adapt the patch submitted here https://
www.postgresql.org/message-id/flat/5446938.Sb9uPGUboI%40aivenlaptop to do it
online.

Thank you for all your work with refining, testing and finally comitting the fix.

Best regards,

--
Ronan Dunklau