Thread: Could not read block at end of the relation
Hello, I'm sorry as this will be a very poor bug report. On PG16, I'm experiencing random errors which share the same characteristics:
- happens during heavy system load
- lots of concurrent writes happening on a table
- often (but I haven't been able to confirm it is necessary), a vacuum is running on the table at the same time the error is triggered

Then, several backends get the same error at once, "ERROR: could not read block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with different block numbers. The relation is always a table (regular or TOAST). The blocks are past the end of the relation, and the different backends are all trying to read a different block. The offending queries are INSERTs, UPDATEs, or COPYs.

I've seen that several bugs have been fixed in 16.1 and 16.2 regarding the new relation extension infrastructure, involving partitioned tables in one case and temp tables in the other, so I suspect some other corner cases remain uncovered in there.

I suspected the FSM could be corrupted in some way, but taking a look at it just after the errors have been triggered, the offending (non-existing) blocks are just not present in the FSM either.

I'm desperately trying to reproduce the issue in a test environment, without any luck so far... I suspected a race condition with VACUUM trying to reclaim the space at the end of the relation, but running a custom build trying to reproduce that (by always trying to truncate the relation during VACUUM, regardless of the amount of possibly-freeable space) hasn't led me anywhere.

It seems that for some reason, a backend is extending the relation for the other waiting ones, and the newly allocated blocks don't end up being pinned in shared_buffers. They could then be evicted, and a waiting backend would then try to read a block which has to be fetched from disk but has never been marked dirty and never persisted. I don't have anything to back that hypothesis though...
Once again I'm sorry that this report is so vague. I'll update if I manage to reproduce the issue or gather more information, but in the meantime, has anybody witnessed something similar? And more importantly, do you have any pointers on how to investigate, to try to trigger the issue manually?

Best regards, -- Ronan Dunklau
Le mardi 27 février 2024, 11:34:14 CET Ronan Dunklau a écrit :
> I suspected the FSM could be corrupted in some way but taking a look at it
> just after the errors have been triggered, the offending (non
> existing) blocks are just not present in the FSM either.

I think I may have missed something on my first look. On other affected clusters, the FSM is definitely corrupted. So it looks like we have an FSM corruption bug on our hands.

This bug occurs rarely enough to be hard to reproduce, but frequently enough that we witnessed it on a dozen PostgreSQL clusters.

In our case, we need to repair the FSM. The instructions on the wiki do work, but maybe we should add something like the attached patch (modeled after the same feature in pg_visibility) to make it possible to repair the FSM corruption online. What do you think about it?

The investigation of the corruption is still ongoing.

Best regards, -- Ronan Dunklau
Attachment
On Fri, Mar 01, 2024 at 09:56:51AM +0100, Ronan Dunklau wrote:
> In our case, we need to repair the FSM. The instructions on the wiki do work,
> but maybe we should add something like the attached patch (modeled after the
> same feature in pg_visibility) to make it possible to repair the FSM
> corruption online. What do you think about it ?

I vaguely recall that something like that has been suggested around 917dc7d2393c in the past.

> The investigation of the corruption is still ongoing.

Thanks! -- Michael
Attachment
On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> - happens during heavy system load
> - lots of concurrent writes happening on a table
> - often (but haven't been able to confirm it is necessary), a vacuum is running
> on the table at the same time the error is triggered
>
> Then, several backends get the same error at once "ERROR: could not read
> block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with different

What are some of the specific block numbers reported?

> has anybody witnessed something similar ?

https://postgr.es/m/flat/CA%2BhUKGK%2B5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8D%2BZg%40mail.gmail.com reminded me of this. Did you upgrade your OS recently?

On Fri, Mar 01, 2024 at 09:56:51AM +0100, Ronan Dunklau wrote:
> I think I may have missed something on my first look. On other affected
> clusters, the FSM is definitely corrupted. So it looks like we have an FSM
> corruption bug on our hands.

What corruption signs did you observe in the FSM? Since the FSM is intentionally not WAL-logged, corruption is normal, but corruption causing errors is not normal. That said, if any crash leaves a state that the freespace/README "self-correcting measures" don't detect, errors may happen. Did the clusters crash recently?

> The occurence of this bug happening makes it hard to reproduce, but it's
> definitely frequent enough we witnessed it on a dozen PostgreSQL clusters.

You could do "ALTER TABLE x SET (vacuum_truncate = off);" and see if the problem stops happening. That would corroborate the VACUUM theory.

Can you use backtrace_functions to get a stack trace?

> In our case, we need to repair the FSM. The instructions on the wiki do work,
> but maybe we should add something like the attached patch (modeled after the
> same feature in pg_visibility) to make it possible to repair the FSM
> corruption online. What do you think about it ?

That's reasonable in concept.
Le lundi 4 mars 2024, 00:47:15 CET Noah Misch a écrit :
> On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> > - happens during heavy system load
> > - lots of concurrent writes happening on a table
> > - often (but haven't been able to confirm it is necessary), a vacuum is
> > running on the table at the same time the error is triggered
> >
> > Then, several backends get the same error at once "ERROR: could not read
> > block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with
> > different
>
> What are some of the specific block numbers reported?

Nothing specific, except that they are past the end of the table, by an offset starting at one. I've seen it happen quite a lot on regular tables restored by pg_restore, but also on bigger tables, regular or TOAST. We've seen this on a dozen different clusters, but nothing reproducible for now.

> > has anybody witnessed something similar ?
>
> https://postgr.es/m/flat/CA%2BhUKGK%2B5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8
> D%2BZg%40mail.gmail.com reminded me of this. Did you upgrade your OS
> recently?

No recent upgrades I'm aware of. Also, it only touches PG16 clusters, not earlier versions. But it looks like the same kind of errors.

> What corruption signs did you observe in the FSM? Since FSM is
> intentionally not WAL-logged, corruption is normal, but corruption causing
> errors is not normal. That said, if any crash leaves a state that the
> freespace/README "self-correcting measures" don't detect, errors may
> happen. Did the clusters crash recently?

Well, we still have full page images for the FSM. I'll take as an example the cluster I'm currently investigating. The corruption I'm seeing is that I have an additional entry in the FSM for a block that does not exist in the relation: IIUC, slot 4129 corresponds to block 34, which is the last one I have in the relation.
However, I have an entry (slot 4130) for block number 35, with MaxFSMRequestSize:

fsm_page_contents
-------------------
 0: 255    +
 1: 255    +
 3: 255    +
 ...
 4129: 114 +
 4130: 255 +

This then causes attempts to insert into that block to fail.

Looking at when the corruption was WAL-logged, this particular case is quite easy to trace. We have a few MULTI_INSERT+INIT records initially loading the table (probably a pg_restore), then, 2GB of WAL later, what looks like a VACUUM running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE, VISIBLE xlog records for each block of the relation's main fork, followed by a lonely FPI for the leaf page of its FSM:

rmgr: Heap2 len (rec/tot): 5698/ 5698, tx: 1637, lsn: 1/810763B8, prev 1/81076388, desc: MULTI_INSERT+INIT ntuples: 124, flags: 0x08, blkref #0: rel 1663/17343/17959 blk 0
rmgr: XLOG  len (rec/tot):   49/ 8241, tx: 1637, lsn: 1/81077A00, prev 1/810763B8, desc: FPI_FOR_HINT, blkref #0: rel 1663/17343/17959 fork fsm blk 2 FPW
rmgr: Heap2 len (rec/tot): 5762/ 5762, tx: 1637, lsn: 1/81079A50, prev 1/81077A00, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08, blkref #0: rel 1663/17343/17959 blk 1
rmgr: Heap2 len (rec/tot): 5746/ 5746, tx: 1637, lsn: 1/8107B0F0, prev 1/81079A50, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08, blkref #0: rel 1663/17343/17959 blk 2
rmgr: Heap2 len (rec/tot): 5778/ 5778, tx: 1637, lsn: 1/8107C780, prev 1/8107B0F0, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08, blkref #0: rel 1663/17343/17959 blk 3
(up until blk 34 is WAL-logged)

At some point FSM pages 0 and 1 get an FPI; the leaf page (blk 2) does not during this transaction.
Later on, we have this for every main fork block:

rmgr: XLOG  len (rec/tot):   49/ 8201, tx: 0, lsn: 2/0E1ECEC8, prev 2/0E1ECE10, desc: FPI_FOR_HINT, blkref #0: rel 1663/17343/17959 blk 0 FPW
rmgr: Heap2 len (rec/tot):  325/  325, tx: 0, lsn: 2/0E1EEEF0, prev 2/0E1ECEC8, desc: FREEZE_PAGE snapshotConflictHorizon: 1637, nplans: 2, plans: [{ xmax: 0, infomask: 2816, infomask2: 5, ntuples: 86, offsets: [1, 2, 3, 5, 7, 10, 14, 16, 21, 22, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 45, 46, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 64, 65, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 85, 86, 87, 90, 92, 94, 95, 96, 98, 100, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 117, 118, 119, 120, 121, 122, 123, 124] }, { xmax: 0, infomask: 2817, infomask2: 5, ntuples: 38, offsets: [4, 6, 8, 9, 11, 12, 13, 15, 17, 18, 19, 20, 23, 24, 25, 26, 27, 29, 43, 44, 47, 48, 59, 61, 62, 63, 66, 70, 80, 84, 88, 89, 91, 93, 97, 99, 101, 116] }], blkref #0: rel 1663/17343/17959 blk 0
rmgr: Heap2 len (rec/tot):   64/ 8256, tx: 0, lsn: 2/0E1EF038, prev 2/0E1EEEF0, desc: VISIBLE snapshotConflictHorizon: 0, flags: 0x03, blkref #0: rel 1663/17343/17959 fork vm blk 0 FPW, blkref #1: rel 1663/17343/17959 blk 0

And then the single FSM FPI, which contains the invalid record:

rmgr: XLOG  len (rec/tot):   49/ 8241, tx: 0, lsn: 2/0E28F640, prev 2/0E28F600, desc: FPI_FOR_HINT, blkref #0: rel 1663/17343/17959 fork fsm blk 2 FPW

I've dumped the associated FPIs and took a look at them. The first FPI for the leaf block (#2 in the FSM) happens at the beginning of the multi-insert transaction, and only contains one leaf node, for block 0 (slot 4095). The later FPI (during what I think is a VACUUM) persists the corruption in the WAL. So presumably something happened during the COPY operation which caused it to add an invalid entry for the nonexistent block 35 in the FSM. There are no traces of relation truncation happening in the WAL.
This case only shows a single invalid entry in the FSM, but I've noticed as many as 62 blocks present in the FSM while they do not exist on disk, all tagged with MaxFSMRequestSize, so I suppose something is wrong with the bulk extension mechanism.

> > The occurence of this bug happening makes it hard to reproduce, but it's
> > definitely frequent enough we witnessed it on a dozen PostgreSQL clusters.
>
> You could do "ALTER TABLE x SET (vacuum_truncate = off);" and see if the
> problem stops happening. That would corroborate the VACUUM theory.
>
> Can you use backtrace_functions to get a stack trace?

I got a stack trace from the error after the corruption had occurred (I even got to attach gdb to it, as I was able to restore a corrupted cluster), but I don't think it's of any use (this is from a different occurrence than the one I described in detail above, so the block numbers don't match, but it's the exact same thing).

#0  errstart (elevel=21, domain=0x0) at utils/error/elog.c:348
#1  0x000055c81ea83a0a in errstart_cold (domain=0x0, elevel=21) at utils/error/elog.c:333
#2  mdread (reln=0x55c8210bbe00, forknum=MAIN_FORKNUM, blocknum=1203, buffer=0x7f51b4c32000) at storage/smgr/md.c:796
#3  0x000055c81ee396cf in smgrread (buffer=0x7f51b4c32000, blocknum=1203, forknum=MAIN_FORKNUM, reln=0x55c8210bbe00) at storage/smgr/smgr.c:565
#4  ReadBuffer_common (smgr=0x55c8210bbe00, relpersistence=relpersistence@entry=112 'p', forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=1203, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>, hit=0x7fff6e22f607) at storage/buffer/bufmgr.c:1124
#5  0x000055c81ee39d0b in ReadBufferExtended (reln=0x7f51b294f0d0, forkNum=MAIN_FORKNUM, blockNum=1203, mode=RBM_NORMAL, strategy=<optimized out>) at access/brin/../../../../src/include/utils/rel.h:576
#6  0x000055c81eb1525c in ReadBufferBI (mode=RBM_NORMAL, bistate=0x0, targetBlock=1203, relation=<optimized out>) at access/heap/hio.c:96
#7  RelationGetBufferForTuple (relation=<optimized out>, len=<optimized out>, otherBuffer=0, options=<optimized out>, bistate=0x0, vmbuffer=<optimized out>, vmbuffer_other=<optimized out>, num_pages=<optimized out>) at access/heap/hio.c:620
#8  0x000055c81eafcace in heap_insert (relation=0x7f51b294f0d0, tup=0x55c821042580, cid=<optimized out>, options=0, bistate=0x0) at access/heap/heapam.c:1862
#9  0x000055c81eb08bab in heapam_tuple_insert (relation=0x7f51b294f0d0, slot=0x55c821042478, cid=0, options=0, bistate=0x0) at access/heap/heapam_handler.c:252
#10 0x000055c81ecce2f1 in table_tuple_insert (bistate=0x0, options=0, cid=<optimized out>, slot=0x55c821042478, rel=<optimized out>) at executor/../../../src/include/access/tableam.h:1400
#11 ExecInsert (context=context@entry=0x7fff6e22f9b0, resultRelInfo=resultRelInfo@entry=0x55c8210410c0, slot=slot@entry=0x55c821042478, canSetTag=<optimized out>, inserted_tuple=inserted_tuple@entry=0x0, insert_destrel=insert_destrel@entry=0x0) at executor/nodeModifyTable.c:1133
#12 0x000055c81eccf1eb in ExecModifyTable (pstate=<optimized out>) at executor/nodeModifyTable.c:3810

Just a regular insertion failing to read the block it's trying to insert into.

> > In our case, we need to repair the FSM. The instructions on the wiki do
> > work, but maybe we should add something like the attached patch (modeled
> > after the same feature in pg_visibility) to make it possible to repair
> > the FSM corruption online. What do you think about it ?
>
> That's reasonable in concept.

Ok, I'll start a separate thread on -hackers for that. Thank you!
On Mon, Mar 04, 2024 at 02:10:39PM +0100, Ronan Dunklau wrote:
> Le lundi 4 mars 2024, 00:47:15 CET Noah Misch a écrit :
> > On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> > > - happens during heavy system load
> > > - lots of concurrent writes happening on a table
> > > - often (but haven't been able to confirm it is necessary), a vacuum is
> > > running on the table at the same time the error is triggered
>
> Looking at when the corruption was WAL-logged, this particular case is quite
> easy to trace. We have a few MULTI_INSERT+INIT records initially loading the
> table (probably a pg_restore), then, 2GB of WAL later, what looks like a
> VACUUM running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE,
> VISIBLE xlog records for each block of the relation's main fork, followed by
> a lonely FPI for the leaf page of its FSM:

You're using data_checksums, right? Thanks for the WAL dump excerpts; I agree with this summary thereof.

> There are no traces of relation truncation happening in the WAL.

That is notable.

> This case only shows a single invalid entry in the FSM, but I've noticed as
> much as 62 blocks present in the FSM while they do not exist on disk, all
> tagged with MaxFSMRequestSize so I suppose something is wrong with the bulk
> extension mechanism.

Is this happening after an OS crash, a replica promote, or a PITR restore? If so, I think I see the problem. We have an undocumented rule that the FSM shall not contain references to pages past the end of the relation. To facilitate that, relation truncation WAL-logs the FSM truncate. However, there's no similar protection for relation extension, which is not WAL-logged. We break the rule whenever we write FSM for block X before some WAL record initializes block X. data_checksums makes the trouble easier to hit, since it creates FPI_FOR_HINT records for FSM changes. A replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield this broken state.
While v16 RelationAddBlocks() made this easier to hit, I suspect it's reproducible in all supported branches. For example, lazy_scan_new_or_empty() and multiple index AMs break the rule via RecordPageWithFreeSpace() on a PageIsNew() page.

I think the fix is one of:

- Revoke the undocumented rule. Make FSM consumers resilient to the FSM returning a now-too-large block number.

- Enforce a new "main-fork WAL before FSM" rule for logged rels. For example, in each PageIsNew() case, either don't update FSM or WAL-log an init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
Le lundi 4 mars 2024, 20:03:12 CET Noah Misch a écrit :
> You're using data_checksums, right? Thanks for the wal dump excerpts; I
> agree with this summary thereof.

Yes, data checksums, so wal_log_hints is implied.

> > There are no traces of relation truncation happening in the WAL.
>
> That is notable.

> Is this happening after an OS crash, a replica promote, or a PITR restore?

I need to double check all occurrences, but I wouldn't be surprised if they all went through a replica promotion.

> If so, I think I see the problem. We have an undocumented rule that FSM
> shall not contain references to pages past the end of the relation. To
> facilitate that, relation truncation WAL-logs FSM truncate. However,
> there's no similar protection for relation extension, which is not
> WAL-logged. We break the rule whenever we write FSM for block X before
> some WAL record initializes block X. data_checksums makes the trouble
> easier to hit, since it creates FPI_FOR_HINT records for FSM changes. A
> replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield
> this broken state. While v16 RelationAddBlocks() made this easier to hit,
> I suspect it's reproducible in all supported branches. For example,
> lazy_scan_new_or_empty() and multiple index AMs break the rule via
> RecordPageWithFreeSpace() on a PageIsNew() page.

Very interesting.
I understand the reasoning, and am now able to craft a very crude test case, which I will refine into something more actionable:
- start a session 1, which will just vacuum our table continuously
- start a session 2, which will issue checkpoints continuously
- in a session 3, COPY enough data so that we preallocate too many pages. In my example, I'm copying 33150 single-integer rows, which fit on a bit more than 162 pages. The idea behind that is to make sure we don't fall on a round number of pages, and that we leave several iterations of the extend mechanism running, to issue a full-page write of the FSM.

So to trigger the bug, it seems to me one needs to run in parallel:
- COPY
- VACUUM
- CHECKPOINT

Which would explain why a surprisingly large number of occurrences I've seen seemed to involve a pg_restore invocation.

> I think the fix is one of:
>
> - Revoke the undocumented rule. Make FSM consumers resilient to the FSM
> returning a now-too-large block number.

Performance wise, that seems to be the better answer, but won't this kind of built-in resilience encourage subtle bugs creeping in?

> - Enforce a new "main-fork WAL before FSM" rule for logged rels. For
> example, in each PageIsNew() case, either don't update FSM or WAL-log an
> init (like lazy_scan_new_or_empty() does when PageIsEmpty()).

The FSM is not updated by the caller: in the bulk insert case, we intentionally don't add the pages to the FSM. So having VACUUM WAL-log the page in lazy_scan_new_or_empty(), in the PageIsNew() case, as you propose, seems a very good idea. Performance wise, that would keep the WAL logging outside of the backend performing the preventive work for itself or others, and there should not be so many such pages that it gives VACUUM too much work.

I can try my hand at such a patch if it looks like a good idea.

Thank you very much for your insights.

Best regards, -- Ronan Dunklau
On Mon, Mar 04, 2024 at 11:21:07PM +0100, Ronan Dunklau wrote:
> Le lundi 4 mars 2024, 20:03:12 CET Noah Misch a écrit :
> > Is this happening after an OS crash, a replica promote, or a PITR restore?
>
> I need to double check all occurrences, but I wouldn't be surprised if they all
> went through a replica promotion.
>
> > If so, I think I see the problem. We have an undocumented rule that FSM
> > shall not contain references to pages past the end of the relation. To
> > facilitate that, relation truncation WAL-logs FSM truncate. However,
> > there's no similar protection for relation extension, which is not
> > WAL-logged. We break the rule whenever we write FSM for block X before
> > some WAL record initializes block X. data_checksums makes the trouble
> > easier to hit, since it creates FPI_FOR_HINT records for FSM changes. A
> > replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield
> > this broken state. While v16 RelationAddBlocks() made this easier to hit,
> > I suspect it's reproducible in all supported branches. For example,
> > lazy_scan_new_or_empty() and multiple index AMs break the rule via
> > RecordPageWithFreeSpace() on a PageIsNew() page.
>
> Very interesting. I understand the reasoning, and am now able to craft a very
> crude test case, which I will refine into something more actionable:
>
> - start a session 1, which will just vacuum our table continuously
> - start a session 2, which will issue checkpoints continuously
> - in a session 3, COPY enough data so that we preallocate too many pages. In
> my example, I'm copying 33150 single-integer rows, which fit on a bit more
> than 162 pages. The idea behind that is to make sure we don't fall on a round
> number of pages, and we leave several iterations of the extend mechanism
> running to issue a full page write of the fsm.
> So to trigger the bug, it seems to me one needs to run in parallel:
> - COPY
> - VACUUM
> - CHECKPOINT
>
> Which would explain why a surprisingly large number of occurrences I've seen
> seemed to involve a pg_restore invocation.

Does that procedure alone reproduce the ERROR, without any replica promote, postmaster restart, etc.? That, I do not have reason to expect.

> > I think the fix is one of:
> >
> > - Revoke the undocumented rule. Make FSM consumers resilient to the FSM
> > returning a now-too-large block number.
>
> Performance wise, that seems to be the better answer

I would guess this one is more risky from a performance perspective, since we'd be adding to a hotter path under RelationGetBufferForTuple(). Still, it's likely fine.

> but won't this kind of
> built-in resilience encourage subtle bugs creeping in ?

Since the !wal_log_hints case avoids WAL for FSM, we've already accepted subtle bugs. I bet an unlucky torn FSM page could reach the ERROR you reached, even if we introduced a "main-fork WAL before FSM" rule. We'll have plenty of need for resilience. Hence, I'd lean toward this approach.

> > - Enforce a new "main-fork WAL before FSM" rule for logged rels. For
> > example, in each PageIsNew() case, either don't update FSM or WAL-log an
> > init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
>
> The FSM is not updated by the caller: in the bulk insert case, we
> intentionally don't add them to the FSM. So having VACUUM WAL-log the page in
> lazy_scan_new_or_empty in the New case as you propose seems a very good idea.
>
> Performance wise that would keep the WAL logging outside of the backend
> performing the preventive work for itself or others, and it should not be that
> many pages that it gives VACUUM too much work.

A drawback is that this approach touches several access methods, so out-of-tree access methods also likely need changes. Still, this is a good backup if the first option measurably slows INSERT throughput.
> I can try my hand at such a patch if it looks like a good idea.

Please do.

> Thank you very much for your insights.

Thanks for the detailed report that made it possible.
Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> Does that procedure alone reproduce the ERROR, without any replica promote,
> postmaster restart, etc.? That, I do not have reason to expect.

Oh no, sorry for not being clear: it needs to go through recovery, either through a backup restoration, a replica, or anything else. This is why I've been struggling to reproduce the issue; it didn't occur to me that the FSM was only sane on the primary, as the extensions were not WAL-logged.

> > > I think the fix is one of:
> > >
> > > - Revoke the undocumented rule. Make FSM consumers resilient to the FSM
> > > returning a now-too-large block number.
> >
> > Performance wise, that seems to be the better answer
>
> I would guess this one is more risky from a performance perspective, since
> we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
> it's likely fine.

Makes sense.

> > but won't this kind of
> > built-in resilience encourage subtle bugs creeping in ?
>
> Since the !wal_log_hints case avoids WAL for FSM, we've already accepted
> subtle bugs. I bet an unlucky torn FSM page could reach the ERROR you
> reached, even if we introduced a "main-fork WAL before FSM" rule. We'll
> have plenty of need for resilience. Hence, I'd lean toward this approach.

> > > - Enforce a new "main-fork WAL before FSM" rule for logged rels. For
> > > example, in each PageIsNew() case, either don't update FSM or WAL-log an
> > > init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
> >
> > The FSM is not updated by the caller: in the bulk insert case, we
> > intentionally don't add them to the FSM. So having VACUUM WAL-log the page
> > in lazy_scan_new_or_empty in the New case as you propose seems a very
> > good idea.
> >
> > Performance wise that would keep the WAL logging outside of the backend
> > performing the preventive work for itself or others, and it should not be
> > that many pages that it gives VACUUM too much work.
> A drawback is that this approach touches several access methods, so
> out-of-tree access methods also likely need changes. Still, this is a good
> backup if the first option measurably slows INSERT throughput.

Would it actually? At least for the currently discussed VACUUM case, we would fall back to a generic FPI for the new page, which should work on any AM. But the requirement to WAL-log the page before recording it in the FSM should be publicized, and the common infrastructure should adhere to it. I'd have to look into the other cases you mentioned where that can happen.

> > I can try my hand at such a patch if it looks like a good idea.
>
> Please do.
>
> > Thank you very much for your insights.
>
> Thanks for the detailed report that made it possible.
On Tue, Mar 05, 2024 at 12:33:13AM +0100, Ronan Dunklau wrote:
> Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> > > > - Enforce a new "main-fork WAL before FSM" rule for logged rels. For
> > > > example, in each PageIsNew() case, either don't update FSM or WAL-log an
> > > > init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
>
> Would it actually ? At least for the currently discussed VACUUM case, we will
> fall back to a generic FPI for the new page, which should work on any AM. But
> the requirement to WAL log the page before recording it in the FSM should be
> publicized, and the common infrastructure should adhere to it.

I agree each AM can use generic FPI code. I'm looking at code like this as needing change under this candidate design:

static void
spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
{
	...
	if (PageIsNew(page) || PageIsEmpty(page))
	{
		RecordFreeIndexPage(index, blkno);

The change there would be fairly simple, but non-core access methods may require similar simple changes. That's fine if this approach has other reasons to win, but it is a drawback. A grep of pgxn code shows "rum" as the only module containing a RecordFreeIndexPage() call. No module contains a RecordPageWithFreeSpace() call. So this drawback is all but negligible.
On Mon, Mar 04, 2024 at 04:13:46PM -0800, Noah Misch wrote:
> I agree each AM can use generic FPI code. I'm looking at code like this as
> needing change under this candidate design:
>
> static void
> spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
> {
> 	...
> 	if (PageIsNew(page) || PageIsEmpty(page))
> 	{
> 		RecordFreeIndexPage(index, blkno);
>
> The change there would be fairly simple, but non-core access methods may
> require similar simple changes. That's fine if this approach has other
> reasons to win, but it is a drawback. A grep of pgxn code shows "rum" as the
> only module containing a RecordFreeIndexPage() call. No module contains a
> RecordPageWithFreeSpace() call. So this drawback is all but negligible.

This design does not sound that bad to me, FWIW, if what you are discussing impacts data inserts enough to be noticeable in the default cases. There are not that many out-of-core index AMs out there, and "rum" is, as far as I know, not supported by any of the major cloud vendors. So contacting the authors of such AMs and raising awareness would be enough. So I would worry much more about performance. My 2c.

(You should be able to implement a cheap test using two injection points, from what I can read, if you need coordination between three actions.) -- Michael
Attachment
Le mardi 5 mars 2024, 01:56:12 CET Michael Paquier a écrit :
> (You should be able to implement a cheap test using two injection
> points, from what I can read, if you need coordination between three
> actions.)

I'm sorry, but I don't understand this sentence at all. Would you mind elaborating? Thanks!

> -- 
> Michael
Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> I would guess this one is more risky from a performance perspective, since
> we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
> it's likely fine.

I ended up implementing this in the attached patch. The idea is that we detect if the FSM returns a page past the end of the relation, and ignore it. In that case we will fall back to the extension mechanism.

For the corrupted-FSM case it is not great performance wise, as we will extend the relation in small steps every time we find a non-existing block in the FSM, until the actual relation size matches what is recorded in the FSM. But since those cases seldom happen, I figured it was better to keep the code really simple for a bugfix.

I wanted to test the impact in terms of performance, so I thought about the worst possible case for this, then ran a pgbench doing insertions into the table. With the attached patch, the worst case I could come up with is:
- remember which page we last inserted into
- notice we don't have enough space
- ask the FSM for a block
- now have to compare that to the actual relation size

So I came up with the following initialization steps:
- create a table with vacuum_truncate = off, with a tuple size big enough that it's impossible to fit two tuples on the same page
- insert lots of tuples into it until it reaches a decent size
- delete them all
- vacuum
- all of this fitting in shared_buffers

As in:

CREATE TABLE test_perf (c1 char(5000));
ALTER TABLE test_perf ALTER c1 SET STORAGE PLAIN;
ALTER TABLE test_perf SET (VACUUM_TRUNCATE = off);
INSERT INTO test_perf (c1) SELECT 'c' FROM generate_series(1, 1000000);
DELETE FROM test_perf;
VACUUM test_perf;

Then I ran pgbench with a single client, with a script only inserting the same value over and over again, for 1000000 transactions (the initial table size).
I noticed no difference running with or without the patch, but maybe someone else can try to run that or find another adversarial case?

Best regards,

--
Ronan Dunklau
Attachment
On 3/6/24 10:31, Ronan Dunklau wrote:
> On Tuesday, March 5, 2024, 00:05:03 CET, Noah Misch wrote:
>> I would guess this one is more risky from a performance perspective, since
>> we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
>> it's likely fine.
>
> I ended up implementing this in the attached patch. The idea is that we
> detect if the FSM returns a page past the end of the relation, and ignore
> it. In that case we will fall back to the extension mechanism.

@@ -582,7 +583,17 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 * We have no cached target page, so ask the FSM for an initial
 	 * target.
 	 */
+	nblocks = RelationGetNumberOfBlocks(relation);
 	targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);

I'd move the fetching of nblocks to after getting the targetBlock. This avoids extending the relation in the unlikely event of an FSM/relation extension in between those two calls.

Patrick
On Wednesday, March 6, 2024, 11:59:43 CET, Patrick Stählin wrote:
> On 3/6/24 10:31, Ronan Dunklau wrote:
> > On Tuesday, March 5, 2024, 00:05:03 CET, Noah Misch wrote:
> >> I would guess this one is more risky from a performance perspective,
> >> since we'd be adding to a hotter path under RelationGetBufferForTuple().
> >> Still, it's likely fine.
> >
> > I ended up implementing this in the attached patch. The idea is that we
> > detect if the FSM returns a page past the end of the relation, and ignore
> > it. In that case we will fall back to the extension mechanism.
>
> @@ -582,7 +583,17 @@ RelationGetBufferForTuple(Relation relation, Size len,
>  	 * We have no cached target page, so ask the FSM for an initial
>  	 * target.
>  	 */
> +	nblocks = RelationGetNumberOfBlocks(relation);
>  	targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
>
> I'd move the fetching of nblocks to after getting the targetBlock. This
> avoids extending the relation in the unlikely event of an FSM/relation
> extension in between those two calls.

Good catch. Please find attached v2 of the patch.

--
Ronan Dunklau
Attachment
On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
> On Tuesday, March 5, 2024, 00:05:03 CET, Noah Misch wrote:
> > I would guess this one is more risky from a performance perspective, since
> > we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
> > it's likely fine.
>
> I ended up implementing this in the attached patch. The idea is that we
> detect if the FSM returns a page past the end of the relation, and ignore
> it. In that case we will fall back to the extension mechanism.
>
> For the corrupted-FSM case it is not great performance-wise, as we will
> extend the relation in small steps every time we find a non-existing block
> in the FSM, until the actual relation size matches what is recorded in the
> FSM. But since those seldom happen, I figured it was better to keep the
> code really simple for a bugfix.

That could be fine. Before your message, I assumed we would treat this like the case of a full page. That is, we'd call RecordAndGetPageWithFreeSpace(), which would remove the FSM entry and possibly return another. That could be problematic if another backend is extending the relation; we don't want to erase the extender's new FSM entry. I'm parking this topic for the moment.

> I wanted to test the impact in terms of performance, and I thought about
> the worst possible case for this.
> [test details]
>
> I noticed no difference running with or without the patch, but maybe
> someone else can try to run that or find another adversarial case?

The patch currently adds an lseek() whenever the FSM finds a block. I see relevant-looking posts from mailing list searches with subsets of these terms: RelationGetNumberOfBlocks lseek benchmark overhead. I bet we should reduce this added cost. Some ideas:

- No need for a system call if the FSM-returned value is low enough with respect to any prior RelationGetNumberOfBlocks() call.
- We'd be happy and efficient with a ReadBufferMode such that ReadBufferExtended() returns NULL after a 0-byte read, with all other errors handled normally. That sounds like the specification of RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't that, but perhaps one RBM could satisfy both sets of needs.

On Wed, Mar 06, 2024 at 12:08:38PM +0100, Ronan Dunklau wrote:
> Subject: [PATCH 2/2] Detect invalid FSM when finding a block.
>
> Whenever the FSM points to a block past the end of the relation, detect
> it and fall back to the relation extension mechanism.
>
> This may arise when an FPI for the FSM has been triggered, and we end up
> WAL-logging a page of the FSM pointing to newly extended blocks in the
> relation which have never been WAL-logged.
> ---
>  src/backend/access/heap/hio.c | 20 +++++++++++++++++++-

Each GetPageWithFreeSpace() caller would need this, not just heap. Please evaluate whether this logic belongs inside freespace.c vs. in each caller. freespace.c comments and/or freespace/README may need updates. Please evaluate header comments of freespace.c functions whose callers you are changing, and evaluate comments about truncation.
On Wednesday, March 6, 2024, 19:42:28 CET, you wrote:
> On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
> > [quoted: the attached patch detecting FSM pages past the end of the
> > relation, and the small-steps extension cost in the corrupted-FSM case]
>
> That could be fine. Before your message, I assumed we would treat this like
> the case of a full page. That is, we'd call
> RecordAndGetPageWithFreeSpace(), which would remove the FSM entry and
> possibly return another. That could be problematic if another backend is
> extending the relation; we don't want to erase the extender's new FSM
> entry. I'm parking this topic for the moment.

I would argue this is fine, as corruptions don't happen often, and the FSM is not an exact thing anyway. If we record a block as full, it will just be unused until the next vacuum adds it back to the FSM with a correct value.

> > [quoted: performance test details and results]
>
> The patch currently adds an lseek() whenever the FSM finds a block.
> I see relevant-looking posts from mailing list searches with subsets of
> these terms: RelationGetNumberOfBlocks lseek benchmark overhead. I bet we
> should reduce this added cost. Some ideas:
>
> - No need for a system call if the FSM-returned value is low enough with
>   respect to any prior RelationGetNumberOfBlocks() call.

I'm not sure what you mean by "low enough". What I chose to implement instead is to rely on the cached value, even when not in recovery, for that particular case. The reasoning being, as above, that if someone extended the relation since the last cached value, the worst-case scenario is that we do not insert into that old block. Whenever we detect a case like this, we just erase the FSM entry to mark the block as full, so that it's not picked up again next time.

> - We'd be happy and efficient with a ReadBufferMode such that
>   ReadBufferExtended() returns NULL after a 0-byte read, with all other
>   errors handled normally. That sounds like the specification of
>   RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't
>   that, but perhaps one RBM could satisfy both sets of needs.

I'm not sure about this one; it seems to have much further-reaching consequences. This would mean we don't have any cooperation from the FSM, and we would need to implement error recovery scenarios in each caller. But I admit I haven't looked too much into it, and focused instead on seeing if the current approach can get us to an acceptable state.

> On Wed, Mar 06, 2024 at 12:08:38PM +0100, Ronan Dunklau wrote:
> > Subject: [PATCH 2/2] Detect invalid FSM when finding a block.
> >
> > Whenever the FSM points to a block past the end of the relation, detect
> > it and fallback to the relation extension mechanism.
> >
> > This may arise when a FPI for the FSM has been triggered, and we end up
> > WAL-logging a page of the FSM pointing to newly extended blocks in the
> > relation which have never been WAL-logged.
> > ---
> >  src/backend/access/heap/hio.c | 20 +++++++++++++++++++-
>
> Each GetPageWithFreeSpace() caller would need this, not just heap. Please
> evaluate whether this logic belongs inside freespace.c vs. in each caller.
> freespace.c comments and/or freespace/README may need updates. Please
> evaluate header comments of freespace.c functions whose callers you are
> changing, and evaluate comments about truncation.

I implemented the new behaviour in fsm_search and RecordAndGetPageWithFreeSpace.

Regards,

--
Ronan
Attachment
On Thu, Mar 07, 2024 at 12:21:24PM +0100, Ronan Dunklau wrote:
> On Wednesday, March 6, 2024, 19:42:28 CET, you wrote:
> > [quoted: treating this like a full page via RecordAndGetPageWithFreeSpace(),
> > and the risk of erasing a concurrent extender's new FSM entry]
>
> I would argue this is fine, as corruptions don't happen often, and FSM is
> not an exact thing anyway. If we record a block as full, it will just be
> unused until the next vacuum adds it back to the FSM with a correct value.

If this happened only under FSM corruption, it would be fine. I was thinking about normal extension, with an event sequence like this:

- RelationExtensionLockWaiterCount() returns 10.
- Backend B1 extends by 10 pages.
- B1 records 9 pages in the FSM.
- B2, B3, ... B11 wake up and fetch from the FSM.
In each of those backends, fsm_does_block_exist() returns false, because the cached relation size is stale. Those pages remain unused until VACUUM re-records them in the FSM.

PostgreSQL added the multiple-block extension logic (commit 719c84c) from hard-won experience bottlenecking there. If fixing $SUBJECT will lose 1% of that change's benefit, that's probably okay. If it loses 20%, we should work to do better. While we could rerun the 2016 benchmark to see how the patch regresses it, there might be a simple fix. fsm_does_block_exist() could skip the cache when blknumber >= cached, like this:

    return ((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
             blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
            blknumber < RelationGetNumberOfBlocks(rel));

How do you see it? That still leaves us with an avoidable lseek in B2, B3, ... B11, but that feels tolerable. I don't know how to do better without switching to the ReadBufferMode approach.

> > The patch currently adds an lseek() whenever the FSM finds a block. I see
> > relevant-looking posts from mailing list searches with subsets of these
> > terms: RelationGetNumberOfBlocks lseek benchmark overhead. I bet we
> > should reduce this added cost. Some ideas:
> >
> > - No need for a system call if the FSM-returned value is low enough with
> >   respect to any prior RelationGetNumberOfBlocks() call.
>
> I'm not sure what you mean by "low enough". What I chose to implement instead

I just meant "less than". Specifically, if "fsm_returned_value < MAX(all RelationGetNumberOfBlocks() calls done on this relation, since last postmaster start)" then the FSM-returned value won't have the problem seen in $SUBJECT.

> > - We'd be happy and efficient with a ReadBufferMode such that
> >   ReadBufferExtended() returns NULL after a 0-byte read, with all other
> >   errors handled normally. That sounds like the specification of
> >   RBM_NORMAL_NO_LOG.
> > Today's implementation of RBM_NORMAL_NO_LOG isn't that,
> > but perhaps one RBM could satisfy both sets of needs.
>
> I'm not sure about this one, this seems to have far more reaching
> consequences. This would mean we don't have any cooperation from the FSM,
> and would need to implement error recovery scenarios in each caller. But I
> admit I haven't looked too much into it, and focused instead in seeing if
> the current approach can get us to an acceptable state.

I, too, am not sure. I agree that the RBM approach wouldn't let you isolate the changes like the last patch does. Each GetFreeIndexPage() caller would need code for the return-NULL case. Avoiding that is nice. (If we ever do the RBM change in the future, the fsm_readbuf() and vm_readbuf() !extend cases should also use it.)

> --- a/src/backend/storage/freespace/freespace.c
> +++ b/src/backend/storage/freespace/freespace.c
> +	if (addr.level == FSM_BOTTOM_LEVEL)
> +	{
> +		BlockNumber blkno = fsm_get_heap_blk(addr, slot);
> +		Page		page;
> +
> +		/*
> +		 * If the block is past the end of the relation,
> +		 * update the page and restart from the root
> +		 */
> +		if (fsm_does_block_exist(rel, blkno))
> +		{
> +			ReleaseBuffer(buf);
> +			return blkno;
> +		}
> +		page = BufferGetPage(buf);
> +		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +		fsm_set_avail(page, slot, 0);
> +		MarkBufferDirtyHint(buf, false);

I wondered if we should zero all slots on the same FSM page that correspond to main fork blocks past the end of the main fork. The usual "advancenext" behavior is poor for this case, since every later slot is invalid in the same way. My current thinking: no, it's not worth the code bulk to do better. A single extension caps at 64 blocks, so the worst case is roughly locking the buffer 64 times and restarting from the FSM root 64 times. For something this infrequent, that's not bad.

Thanks,
nm
On Tuesday, March 12, 2024, 23:08:07 CET, Noah Misch wrote:
> > I would argue this is fine, as corruptions don't happen often, and FSM is
> > not an exact thing anyway. If we record a block as full, it will just be
> > unused until the next vacuum adds it back to the FSM with a correct
> > value.
>
> If this happened only under FSM corruption, it would be fine. I was
> thinking about normal extension, with an event sequence like this:
>
> - RelationExtensionLockWaiterCount() returns 10.
> - Backend B1 extends by 10 pages.
> - B1 records 9 pages in the FSM.
> - B2, B3, ... B11 wake up and fetch from the FSM. In each of those
>   backends, fsm_does_block_exist() returns false, because the cached
>   relation size is stale. Those pages remain unused until VACUUM
>   re-records them in the FSM.
>
> PostgreSQL added the multiple-block extension logic (commit 719c84c) from
> hard-won experience bottlenecking there. If fixing $SUBJECT will lose 1% of
> that change's benefit, that's probably okay. If it loses 20%, we should
> work to do better. While we could rerun the 2016 benchmark to see how the
> patch regresses it, there might be a simple fix. fsm_does_block_exist()
> could skip the cache when blknumber >= cached, like this:
>
> return ((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
>          blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
>         blknumber < RelationGetNumberOfBlocks(rel));
>
> How do you see it? That still leaves us with an avoidable lseek in B2, B3,
> ... B11, but that feels tolerable. I don't know how to do better without
> switching to the ReadBufferMode approach.

Hi Noah, and thanks for your feedback.

That makes sense. I'm a bit worried about triggering additional lseeks when we in fact have other free pages in the map that would pass the test. I'm not sure how relevant it is given the way we search the FSM with fp_next_slot though...
To address that, I've given a bit of thought to enabling/disabling the auto-repair behaviour with a flag in GetPageWithFreeSpace, to distinguish the cases where we know we have a somewhat up-to-date value from the cases where we don't (as in, for heap: try without repair, then get an up-to-date value to try the last block, and if we need to look at the FSM again then ask for it to be repaired), but it brings way too much complexity and would need careful thought for each AM.

So please find attached a patch with the change you propose.

Do you have a link to the benchmark you mention, though, to evaluate the patch against it?

> > > The patch currently adds an lseek() whenever the FSM finds a block. I
> > > see relevant-looking posts from mailing list searches with subsets of
> > > these terms: RelationGetNumberOfBlocks lseek benchmark overhead. I bet
> > > we should reduce this added cost. Some ideas:
> > >
> > > - No need for a system call if the FSM-returned value is low enough
> > >   with respect to any prior RelationGetNumberOfBlocks() call.
> >
> > I'm not sure what you mean by "low enough". What I chose to implement
> > instead
>
> I just meant "less than". Specifically, if "fsm_returned_value < MAX(all
> RelationGetNumberOfBlocks() calls done on this relation, since last
> postmaster start)" then the FSM-returned value won't have the problem seen
> in $SUBJECT.
>
> > > - We'd be happy and efficient with a ReadBufferMode such that
> > >   ReadBufferExtended() returns NULL after a 0-byte read, with all other
> > >   errors handled normally. That sounds like the specification of
> > >   RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't
> > >   that, but perhaps one RBM could satisfy both sets of needs.
> >
> > I'm not sure about this one, this seems to have far more reaching
> > consequences.
> > This would mean we don't have any cooperation from the FSM,
> > and would need to implement error recovery scenarios in each caller. But
> > I admit I haven't looked too much into it, and focused instead on seeing
> > if the current approach can get us to an acceptable state.
>
> I, too, am not sure. I agree that the RBM approach wouldn't let you
> isolate the changes like the last patch does. Each GetFreeIndexPage()
> caller would need code for the return-NULL case. Avoiding that is nice.
> (If we ever do the RBM change in the future, the fsm_readbuf() and
> vm_readbuf() !extend cases should also use it.)
>
> > --- a/src/backend/storage/freespace/freespace.c
> > +++ b/src/backend/storage/freespace/freespace.c
> > +	if (addr.level == FSM_BOTTOM_LEVEL)
> > +	{
> > +		BlockNumber blkno = fsm_get_heap_blk(addr, slot);
> > +		Page		page;
> > +
> > +		/*
> > +		 * If the block is past the end of the relation,
> > +		 * update the page and restart from the root
> > +		 */
> > +		if (fsm_does_block_exist(rel, blkno))
> > +		{
> > +			ReleaseBuffer(buf);
> > +			return blkno;
> > +		}
> > +		page = BufferGetPage(buf);
> > +		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> > +		fsm_set_avail(page, slot, 0);
> > +		MarkBufferDirtyHint(buf, false);
>
> I wondered if we should zero all slots on the same FSM page that
> correspond to main fork blocks past the end of the main fork. The usual
> "advancenext" behavior is poor for this case, since every later slot is
> invalid in the same way. My current thinking: no, it's not worth the code
> bulk to do better. A single extension caps at 64 blocks, so the worst case
> is roughly locking the buffer 64 times and restarting from the FSM root 64
> times. For something this infrequent, that's not bad.

That was my reasoning as well. Paying a small performance penalty in the unlikely case we hit a corruption, and being done with it, seems correct.

> Thanks,
> nm
Attachment
On Wed, Mar 13, 2024 at 10:37:21AM +0100, Ronan Dunklau wrote:
> On Tuesday, March 12, 2024, 23:08:07 CET, Noah Misch wrote:
> > [quoted: the B1/B2...B11 extension sequence, commit 719c84c, and the
> > proposed fsm_does_block_exist() cache-skip check]
>
> That makes sense. I'm a bit worried about triggering additional lseeks
> when we in fact have other free pages in the map that would pass the test.
> I'm not sure how relevant it is given the way we search the FSM with
> fp_next_slot though...
That's a reasonable thing to worry about. We could do wrong by trying too hard to use an FSM slot, and we could do wrong by not trying hard enough. > To address that, I've given a bit of thought about enabling / disabling the > auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish the > cases where we know we have a somewhat up-to-date value compared to the case > where we don't (as in, for heap, try without repair, then get an uptodate > value to try the last block, and if we need to look at the FSM again then ask > for it to be repaired) but it brings way too much complexity and would need > careful thought for each AM. > > So please find attached a patch with the change you propose. > > Do you have a link to the benchmark you mention though to evaluate the patch > against it ? Here is one from the thread that created commit 719c84c: https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEomgHfuVdzNJnwWVoE_c60g%40mail.gmail.com There may be other benchmarks earlier in that thread.
On Wednesday, March 13, 2024, 17:55:23 CET, Noah Misch wrote:
> On Wed, Mar 13, 2024 at 10:37:21AM +0100, Ronan Dunklau wrote:
> > [quoted: the cache-skip check for fsm_does_block_exist(), my worry about
> > additional lseeks, and the GetPageWithFreeSpace flag idea]
>
> That's a reasonable thing to worry about. We could do wrong by trying too
> hard to use an FSM slot, and we could do wrong by not trying hard enough.
>
> > So please find attached a patch with the change you propose.
> >
> > Do you have a link to the benchmark you mention, though, to evaluate the
> > patch against it?
>
> Here is one from the thread that created commit 719c84c:
> https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEomgHfuVdzNJnwWVoE_c60g%40mail.gmail.com
>
> There may be other benchmarks earlier in that thread.

Thank you. I tried to run that benchmark, with 4 clients all copying 10M pgbench_accounts rows concurrently. With and without the patch, I ran the benchmark 5 times and took the average:

master:  15.05s
patched: 15.24s (+1.25%)

If I remove the best and worst run for each of those, the difference falls to +0.7%.
On Fri, Mar 15, 2024 at 07:23:46AM +0100, Ronan Dunklau wrote:
> On Wednesday, March 13, 2024, 17:55:23 CET, Noah Misch wrote:
> > [quoted: lseek concerns, the GetPageWithFreeSpace flag idea, and the
> > benchmark link for commit 719c84c]
>
> Thank you. I tried to run that benchmark, with 4 clients all copying 10M
> pgbench_accounts rows concurrently. With and without the patch, I ran the
> benchmark 5 times and took the average.
Does "that benchmark" refer to "unlogged tables, 4 parallel copies", "logged tables, 4 parallel copies", or something else?

> master: 15.05s
> patched: 15.24s (+1.25%)
>
> If I remove the best and worst run for each of those, the difference falls
> to +0.7%.

To get some additional perspective on the benchmark, how hard would it be to run one or both of the following? Feel free to decline if difficult.

- Make GetPageWithFreeSpace() just "return InvalidBlockNumber", then rerun the benchmark. This should be a lot slower. If not, the bottleneck is somewhere unexpected, and we'd need a different benchmark.

- Get profiles with both master and patched. (lseek or freespace.c functions rising by 0.1%-1% would fit what we know.)

Distinguishing a 1% change from a 0% change would need different techniques still. If we're genuinely slowing bulk loads by ~1% to account for the possibility of a flawed post-recovery FSM, that's sad, but I'm inclined to accept the loss. A persistent error in an innocent INSERT is unacceptable, and the alternatives we discussed upthread have their own substantial trouble. Other opinions?

Thanks,
nm
On Fri, Mar 15, 2024 at 08:03:39PM -0700, Noah Misch wrote:
> On Fri, Mar 15, 2024 at 07:23:46AM +0100, Ronan Dunklau wrote:
> > [quoted: benchmark setup and results, as above]
>
> Does "that benchmark" refer to "unlogged tables, 4 parallel copies",
> "logged tables, 4 parallel copies", or something else?
>
> > master: 15.05s
> > patched: 15.24s (+1.25%)
> >
> > If I remove the best and worst run for each of those, the difference
> > falls to +0.7%.
>
> To get some additional perspective on the benchmark, how hard would it be
> to run one or both of the following? Feel free to decline if difficult.
>
> - Make GetPageWithFreeSpace() just "return InvalidBlockNumber", then rerun
>   the benchmark. This should be a lot slower. If not, the bottleneck is
>   somewhere unexpected, and we'd need a different benchmark.
>
> - Get profiles with both master and patched. (lseek or freespace.c
>   functions rising by 0.1%-1% would fit what we know.)

Forgot one more:

- Your earlier version of the patch, with fewer lseek() but more disuse of FSM entries.

> Distinguishing a 1% change from a 0% change would need different techniques
> still. If we're genuinely slowing bulk loads by ~1% to account for the
> possibility of a flawed post-recovery FSM, that's sad, but I'm inclined to
> accept the loss. A persistent error in an innocent INSERT is unacceptable,
> and the alternatives we discussed upthread have their own substantial
> trouble. Other opinions?
>
> Thanks,
> nm
Le samedi 16 mars 2024, 05:58:34 CET Noah Misch a écrit :
> Does "that benchmark" refer to "unlogged tables, 4 parallel copies",
> "logged tables, 4 parallel copies", or something else?

Sorry, I was quite busy and will not find time to run more benchmarks until
next week, but I'll get back to it.  It was 4 parallel copies, logged tables.
Results for unlogged tables are similar.

> > master: 15.05s
> > patched: 15.24s (+1.25%)
> >
> > If I remove the best and worst run for each of those, the difference
> > falls at +0.7%.
>
> To get some additional perspective on the benchmark, how hard would it be
> to run one or both of the following?  Feel free to decline if difficult.
>
> - Make GetPageWithFreeSpace() just "return InvalidBlockNumber", then rerun
>   the benchmark.  This should be a lot slower.  If not, the bottleneck is
>   somewhere unexpected, and we'd need a different benchmark.

Hum, it doesn't change much, so that benchmark indeed does not exercise that
code too aggressively.  I was thinking my initial benchmark for the very
first naive version of the patch could work, but I'm afraid it won't show
much, as the table size will be cached soon enough.

> - Get profiles with both master and patched.  (lseek or freespace.c
>   functions rising by 0.1%-1% would fit what we know.)

I'll try to run that asap, probably next week unfortunately.

> Forgot one more:
>
> - Your earlier version of the patch, with fewer lseek() but more disuse of
>   FSM entries.

Noted.

--
Ronan Dunklau
Le jeudi 21 mars 2024, 14:51:25 CET Ronan Dunklau a écrit :
> Le samedi 16 mars 2024, 05:58:34 CET Noah Misch a écrit :
> > - Get profiles with both master and patched.  (lseek or freespace.c
> >   functions rising by 0.1%-1% would fit what we know.)

Running perf during the same benchmark, I get the following samples for
lseek:

With the patch:
  0.02%  postgres  libc.so.6  [.] llseek@GLIBC_2.2.5

Without the patch:
  0.01%  postgres  libc.so.6  [.] llseek@GLIBC_2.2.5

So nothing too dramatic.  I haven't been able to come up with a better
benchmark.

Regards,

--
Ronan
Your v3 has the right functionality.  As further confirmation of the fix, I
tried reverting the non-test parts of commit 917dc7d "Fix WAL-logging of FSM
and VM truncation".  That commit's 008_fsm_truncation.pl fails with 917dc7d
reverted from master, and adding this patch makes it pass again.  I ran
pgindent and edited comments.  I think the attached version is ready to go.

While updating comments in FreeSpaceMapPrepareTruncateRel(), I entered a
rabbit hole about the comments 917dc7d left about torn pages.  I'm sharing
these findings just in case it helps a reader of the $SUBJECT patch avoid
the same rabbit hole.  Both fsm and vm read with RBM_ZERO_ON_ERROR, so I
think they're fine with torn pages.  Per the README sentences I'm adding,
FSM could stop writing WAL.  I'm not proposing that, but I do bet it's the
right thing.  visibilitymap_prepare_truncate() has mirrored fsm truncate
since 917dc7d.  The case for removing WAL there is clearer still, because
parallel function visibilitymap_clear() does not write WAL.  I'm attaching a
WIP patch to remove visibilitymap_prepare_truncate() WAL.  I'll abandon that
or pursue it for v18, in a different thread.

On Fri, Mar 29, 2024 at 09:47:52AM +0100, Ronan Dunklau wrote:
> Le jeudi 21 mars 2024, 14:51:25 CET Ronan Dunklau a écrit :
> > Le samedi 16 mars 2024, 05:58:34 CET Noah Misch a écrit :
> > > - Get profiles with both master and patched.  (lseek or freespace.c
> > >   functions rising by 0.1%-1% would fit what we know.)
>
> Running perf during the same benchmark, I get the following samples for
> lseek:
>
> With the patch:
>   0.02%  postgres  libc.so.6  [.] llseek@GLIBC_2.2.5
>
> Without the patch:
>   0.01%  postgres  libc.so.6  [.] llseek@GLIBC_2.2.5
>
> So nothing too dramatic.
> I haven't been able to come up with a better benchmark.
I tried, without success, to reproduce the benefit of commit 719c84c "Extend
relations multiple blocks at a time to improve scalability" like
https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEomgHfuVdzNJnwWVoE_c60g%40mail.gmail.com
did.  My conditions must be lacking whatever factor caused the consistent
>20% improvement in the linked message.  Environment was ext4 on magnetic
disk, Linux 3.10.0-1160.99.1.el7.x86_64, 8 threads.  I'm attaching the
script I used to test 719c84c^, 719c84c, and [719c84c with
GetPageWithFreeSpace() made to always find nothing].  I'm also attaching the
script's output after filtration through "grep ^TIME | sort -n".  The range
of timings within each test scenario dwarfs any difference across scenarios.
I casually tried some other variations without obvious change:

- 24MB shared_buffers
- 128MB shared_buffers
- 4 threads
- 32 threads
- bgwriter_flush_after & backend_flush_after at their defaults, instead of 0
- 2024-04 code (e2a2357) unmodified, then with GetPageWithFreeSpace() made
  to always find nothing
- logged tables
- INSERT ... SELECT instead of COPY

If I were continuing the benchmark study, I would try SSD, a newer kernel,
and/or shared_buffers=48GB.  Instead, since your perf results show only
+0.01% CPU from new lseek() calls, I'm going to stop there and say it's
worth taking the remaining risk that some realistic scenario gets a material
regression from those new lseek() calls.
Le dimanche 7 avril 2024, 00:30:37 CEST Noah Misch a écrit :
> Your v3 has the right functionality.  As further confirmation of the fix,
> I tried reverting the non-test parts of commit 917dc7d "Fix WAL-logging of
> FSM and VM truncation".  That commit's 008_fsm_truncation.pl fails with
> 917dc7d reverted from master, and adding this patch makes it pass again.
> I ran pgindent and edited comments.  I think the attached version is ready
> to go.

Thank you Noah, the updated comments are much better.  I think it should be
backported at least to 16, since the chances of tripping on that behaviour
are quite high here, but what about previous versions?

> While updating comments in FreeSpaceMapPrepareTruncateRel(), I entered a
> rabbit hole about the comments 917dc7d left about torn pages.  I'm sharing
> these findings just in case it helps a reader of the $SUBJECT patch avoid
> the same rabbit hole.  Both fsm and vm read with RBM_ZERO_ON_ERROR, so I
> think they're fine with torn pages.  Per the README sentences I'm adding,
> FSM could stop writing WAL.  I'm not proposing that, but I do bet it's the
> right thing.  visibilitymap_prepare_truncate() has mirrored fsm truncate
> since 917dc7d.  The case for removing WAL there is clearer still, because
> parallel function visibilitymap_clear() does not write WAL.  I'm attaching
> a WIP patch to remove visibilitymap_prepare_truncate() WAL.  I'll abandon
> that or pursue it for v18, in a different thread.

That's an interesting finding.

> If I were continuing the benchmark study, I would try SSD, a newer kernel,
> and/or shared_buffers=48GB.  Instead, since your perf results show only
> +0.01% CPU from new lseek() calls, I'm going to stop there and say it's
> worth taking the remaining risk that some realistic scenario gets a
> material regression from those new lseek() calls.

Agree with you here.

Many thanks,

--
Ronan Dunklau
On Thu, Apr 11, 2024 at 09:36:50AM +0200, Ronan Dunklau wrote:
> Le dimanche 7 avril 2024, 00:30:37 CEST Noah Misch a écrit :
> > Your v3 has the right functionality.  As further confirmation of the
> > fix, I tried reverting the non-test parts of commit 917dc7d "Fix
> > WAL-logging of FSM and VM truncation".  That commit's
> > 008_fsm_truncation.pl fails with 917dc7d reverted from master, and
> > adding this patch makes it pass again.  I ran pgindent and edited
> > comments.  I think the attached version is ready to go.
>
> Thank you Noah, the updated comments are much better.  I think it should
> be backported at least to 16, since the chances of tripping on that
> behaviour are quite high here, but what about previous versions?

It should be reachable in all branches, just needing concurrent extension
lock waiters to reach it before v16.  Hence, my plan is to back-patch it all
the way.  It applies with negligible conflicts back to v12.
On Wed, Mar 13, 2024 at 12:55 PM Noah Misch <noah@leadboat.com> wrote:
> That's a reasonable thing to worry about.  We could do wrong by trying too
> hard to use an FSM slot, and we could do wrong by not trying hard enough.

Although it's not related to the problem you're working on, it seems like a
good opportunity to bring up a concern about the FSM that I don't believe
was discussed at any point in the past few years: I wonder if the way that
fsm_search_avail() sometimes updates fsmpage->fp_next_slot with only a
shared lock on the page could cause problems.  At the very least, it's weird
that we allow it.

--
Peter Geoghegan
On Thu, Apr 11, 2024 at 12:01:09PM -0400, Peter Geoghegan wrote:
> On Wed, Mar 13, 2024 at 12:55 PM Noah Misch <noah@leadboat.com> wrote:
> > That's a reasonable thing to worry about.  We could do wrong by trying
> > too hard to use an FSM slot, and we could do wrong by not trying hard
> > enough.
>
> Although it's not related to the problem you're working on, it seems like
> a good opportunity to bring up a concern about the FSM that I don't
> believe was discussed at any point in the past few years: I wonder if the
> way that fsm_search_avail() sometimes updates fsmpage->fp_next_slot with
> only a shared lock on the page could cause problems.  At the very least,
> it's weird that we allow it.

fsm_search_avail() treats an out-of-range fp_next_slot like zero, so I'm not
seeing a correctness issue.  I bet changing it under an exclusive lock
wouldn't deliver better-optimized searches to an extent that pays for the
synchronization overhead, but it might.
On Fri, Apr 12, 2024 at 4:01 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Although it's not related to the problem you're working on, it seems like
> a good opportunity to bring up a concern about the FSM that I don't
> believe was discussed at any point in the past few years: I wonder if the
> way that fsm_search_avail() sometimes updates fsmpage->fp_next_slot with
> only a shared lock on the page could cause problems.  At the very least,
> it's weird that we allow it.

Aha.  Good to know.  So that is another place where direct I/O on a file
system with checksums might get very upset, if it takes no measures of its
own to prevent the data from changing underneath it during a pwrite() call.
The only known system like that so far is btrfs (phenomenon #1 in [1], see
reproducer).  The symptom is that the next read fails with EIO.

[1] https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com
On Thu, Apr 11, 2024 at 08:38:43AM -0700, Noah Misch wrote:
> On Thu, Apr 11, 2024 at 09:36:50AM +0200, Ronan Dunklau wrote:
> > Le dimanche 7 avril 2024, 00:30:37 CEST Noah Misch a écrit :
> > > Your v3 has the right functionality.  As further confirmation of the
> > > fix, I tried reverting the non-test parts of commit 917dc7d "Fix
> > > WAL-logging of FSM and VM truncation".  That commit's
> > > 008_fsm_truncation.pl fails with 917dc7d reverted from master, and
> > > adding this patch makes it pass again.  I ran pgindent and edited
> > > comments.  I think the attached version is ready to go.
> >
> > Thank you Noah, the updated comments are much better.  I think it should
> > be backported at least to 16, since the chances of tripping on that
> > behaviour are quite high here, but what about previous versions?
>
> It should be reachable in all branches, just needing concurrent extension
> lock waiters to reach it before v16.  Hence, my plan is to back-patch it
> all the way.  It applies with negligible conflicts back to v12.

While it applied, it doesn't build in v12 or v13, due to smgr_cached_nblocks
first appearing in c5315f4.  Options:

1. Back-patch the addition of smgr_cached_nblocks or equivalent.
2. Stop the back-patch of $SUBJECT at v14.
3. Incur more lseek() in v13 and v12.

Given the lack of reports before v16, (3) seems too likely to be a cure
worse than the disease.  I'm picking (2) for today.  We could do (1)
tomorrow, but I lean toward (2) until someone reports the problem on v13 or
v12.  The problem's impact is limited to DML giving ERROR when it should
have succeeded, and I expect VACUUM FULL is a workaround.  Without those
mitigating factors, I would choose (1).

Pushed that way, as 9358297.
Le samedi 13 avril 2024, 19:15:28 CEST Noah Misch a écrit :
> On Thu, Apr 11, 2024 at 08:38:43AM -0700, Noah Misch wrote:
> > On Thu, Apr 11, 2024 at 09:36:50AM +0200, Ronan Dunklau wrote:
> > > Le dimanche 7 avril 2024, 00:30:37 CEST Noah Misch a écrit :
> > > > Your v3 has the right functionality.  As further confirmation of the
> > > > fix, I tried reverting the non-test parts of commit 917dc7d "Fix
> > > > WAL-logging of FSM and VM truncation".  That commit's
> > > > 008_fsm_truncation.pl fails with 917dc7d reverted from master, and
> > > > adding this patch makes it pass again.  I ran pgindent and edited
> > > > comments.  I think the attached version is ready to go.
> > >
> > > Thank you Noah, the updated comments are much better.  I think it
> > > should be backported at least to 16, since the chances of tripping on
> > > that behaviour are quite high here, but what about previous versions?
> >
> > It should be reachable in all branches, just needing concurrent
> > extension lock waiters to reach it before v16.  Hence, my plan is to
> > back-patch it all the way.  It applies with negligible conflicts back to
> > v12.
>
> While it applied, it doesn't build in v12 or v13, due to
> smgr_cached_nblocks first appearing in c5315f4.  Options:
>
> 1. Back-patch the addition of smgr_cached_nblocks or equivalent.
> 2. Stop the back-patch of $SUBJECT at v14.
> 3. Incur more lseek() in v13 and v12.
>
> Given the lack of reports before v16, (3) seems too likely to be a cure
> worse than the disease.  I'm picking (2) for today.  We could do (1)
> tomorrow, but I lean toward (2) until someone reports the problem on v13
> or v12.  The problem's impact is limited to DML giving ERROR when it
> should have succeeded, and I expect VACUUM FULL is a workaround.  Without
> those mitigating factors, I would choose (1).
>
> Pushed that way, as 9358297.

I agree with you that option 2 seems to be the safest course of action.
For the record, other available options when that happens are to stop PG and
manually remove the FSM from disk (see
https://wiki.postgresql.org/wiki/Free_Space_Map_Problems) or to adapt the
patch submitted here,
https://www.postgresql.org/message-id/flat/5446938.Sb9uPGUboI%40aivenlaptop,
to do it online.

Thank you for all your work with refining, testing and finally committing
the fix.

Best regards,

--
Ronan Dunklau