Re: FSM Corruption (was: Could not read block at end of the relation) - Mailing list pgsql-bugs
From: Ronan Dunklau
Subject: Re: FSM Corruption (was: Could not read block at end of the relation)
Date:
Msg-id: 1925490.taCxCBeP46@aivenlaptop
In response to: Re: FSM Corruption (was: Could not read block at end of the relation) (Noah Misch <noah@leadboat.com>)
Responses: Re: FSM Corruption (was: Could not read block at end of the relation)
List: pgsql-bugs
On Monday, 4 March 2024 at 00:47:15 CET, Noah Misch wrote:
> On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> > - happens during heavy system load
> > - lots of concurrent writes happening on a table
> > - often (but I haven't been able to confirm it is necessary), a vacuum
> >   is running on the table at the same time the error is triggered
> >
> > Then, several backends get the same error at once "ERROR: could not read
> > block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with
> > different
>
> What are some of the specific block numbers reported?

Nothing specific, except that they are past the end of the table, by an
offset starting at one. I've seen it happen quite a lot on regular tables
restored by pg_restore, but also on bigger tables, regular or TOAST. We've
seen this on a dozen different clusters, but have nothing reproducible for
now.

> > has anybody witnessed something similar ?
>
> https://postgr.es/m/flat/CA%2BhUKGK%2B5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8D%2BZg%40mail.gmail.com
> reminded me of this. Did you upgrade your OS recently?

No recent upgrades I'm aware of. Also, it only affects PG16 clusters, not
earlier versions. But it does look like the same kind of errors.

> What corruption signs did you observe in the FSM? Since the FSM is
> intentionally not WAL-logged, corruption is normal, but corruption causing
> errors is not normal. That said, if any crash leaves a state that the
> freespace/README "self-correcting measures" don't detect, errors may
> happen. Did the clusters crash recently?

Well, we still have full page images for the FSM. I'll take as an example
the cluster I'm currently investigating. The corruption I'm seeing is an
additional entry in the FSM for a block that does not exist in the
relation: IIUC, slot 4129 corresponds to block 34, which is the last one in
the relation. However, I have an entry (slot 4130) for block number 35,
tagged with MaxFSMRequestSize:

 fsm_page_contents
-------------------
 0: 255           +
 1: 255           +
 3: 255           +
 ...
 4129: 114        +
 4130: 255        +

This then causes attempts to insert into that block to fail.
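(For reference, the dump above is in the format produced by pageinspect's
fsm_page_contents(); something along these lines reproduces it, with a
hypothetical relation name:

CREATE EXTENSION IF NOT EXISTS pageinspect;
-- fsm block 2 is the first leaf page of the FSM tree for a small
-- relation; blocks 0 and 1 are the two upper levels
SELECT fsm_page_contents(get_raw_page('mytable', 'fsm', 2));

Within a leaf page, node 4095 is the first leaf and holds heap block 0, so
node 4095 + N maps to heap block N: slot 4129 -> block 34, and slot 4130 ->
the phantom block 35.)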
Looking at when the corruption was WAL-logged, this particular case is
quite easy to trace. We have a few MULTI_INSERT+INIT records initially
loading the table (probably a pg_restore), then, 2GB of WAL later, what
looks like a VACUUM running on the table: a succession of FPI_FOR_HINT,
FREEZE_PAGE and VISIBLE xlog records for each block of the relation's main
fork, followed by a lonely FPI for the leaf page of its FSM:

rmgr: Heap2 len (rec/tot): 5698/5698, tx: 1637, lsn: 1/810763B8, prev 1/81076388, desc: MULTI_INSERT+INIT ntuples: 124, flags: 0x08, blkref #0: rel 1663/17343/17959 blk 0
rmgr: XLOG  len (rec/tot): 49/8241, tx: 1637, lsn: 1/81077A00, prev 1/810763B8, desc: FPI_FOR_HINT, blkref #0: rel 1663/17343/17959 fork fsm blk 2 FPW
rmgr: Heap2 len (rec/tot): 5762/5762, tx: 1637, lsn: 1/81079A50, prev 1/81077A00, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08, blkref #0: rel 1663/17343/17959 blk 1
rmgr: Heap2 len (rec/tot): 5746/5746, tx: 1637, lsn: 1/8107B0F0, prev 1/81079A50, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08, blkref #0: rel 1663/17343/17959 blk 2
rmgr: Heap2 len (rec/tot): 5778/5778, tx: 1637, lsn: 1/8107C780, prev 1/8107B0F0, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08, blkref #0: rel 1663/17343/17959 blk 3
(and so on, up until blk 34 is WAL-logged)

At some point pages 0 and 1 of the FSM get an FPI, but the leaf page
(blk 2) does not for the rest of the transaction.

Later on, we have this for every main fork block:

rmgr: XLOG  len (rec/tot): 49/8201, tx: 0, lsn: 2/0E1ECEC8, prev 2/0E1ECE10, desc: FPI_FOR_HINT, blkref #0: rel 1663/17343/17959 blk 0 FPW
rmgr: Heap2 len (rec/tot): 325/325, tx: 0, lsn: 2/0E1EEEF0, prev 2/0E1ECEC8, desc: FREEZE_PAGE snapshotConflictHorizon: 1637, nplans: 2, plans: [{ xmax: 0, infomask: 2816, infomask2: 5, ntuples: 86, offsets: [1, 2, 3, 5, 7, 10, 14, 16, 21, 22, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 45, 46, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 64, 65, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 85, 86, 87, 90, 92, 94, 95, 96, 98, 100, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 117, 118, 119, 120, 121, 122, 123, 124] }, { xmax: 0, infomask: 2817, infomask2: 5, ntuples: 38, offsets: [4, 6, 8, 9, 11, 12, 13, 15, 17, 18, 19, 20, 23, 24, 25, 26, 27, 29, 43, 44, 47, 48, 59, 61, 62, 63, 66, 70, 80, 84, 88, 89, 91, 93, 97, 99, 101, 116] }], blkref #0: rel 1663/17343/17959 blk 0
rmgr: Heap2 len (rec/tot): 64/8256, tx: 0, lsn: 2/0E1EF038, prev 2/0E1EEEF0, desc: VISIBLE snapshotConflictHorizon: 0, flags: 0x03, blkref #0: rel 1663/17343/17959 fork vm blk 0 FPW, blkref #1: rel 1663/17343/17959 blk 0

And then the single FSM FPI, which contains the invalid record:

rmgr: XLOG  len (rec/tot): 49/8241, tx: 0, lsn: 2/0E28F640, prev 2/0E28F600, desc: FPI_FOR_HINT, blkref #0: rel 1663/17343/17959 fork fsm blk 2 FPW

I've dumped the associated FPIs and taken a look at them. The first FPI for
the leaf block (#2 in the FSM) happens at the beginning of the multi-insert
transaction, and only contains one leaf node, for block 0 (slot 4095). The
later FPI (during what I think is a VACUUM) persists the corruption in the
WAL.

So presumably something happened during the COPY operation which caused it
to add an invalid entry for the nonexistent block 35 in the FSM. There are
no traces of relation truncation happening in the WAL. This case only shows
a single invalid entry in the FSM, but I've seen as many as 62 blocks
present in the FSM that do not exist on disk, all tagged with
MaxFSMRequestSize, so I suppose something is wrong with the bulk extension
mechanism.

> > The occurrence of this bug makes it hard to reproduce, but it's
> > definitely frequent enough that we witnessed it on a dozen PostgreSQL
> > clusters.
>
> You could do "ALTER TABLE x SET (vacuum_truncate = off);" and see if the
> problem stops happening. That would corroborate the VACUUM theory.
>
> Can you use backtrace_functions to get a stack trace?
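(Side note, for anyone else hitting this: since the ERROR is raised in
mdread(), something along these lines should make affected backends log a
backtrace on their own, without having to attach gdb.)

-- a sketch; backtrace_functions takes a list of C function names, and an
-- ERROR raised inside one of them writes a backtrace to the server log
ALTER SYSTEM SET backtrace_functions = 'mdread';
SELECT pg_reload_conf();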
I got a stack trace from the error after the corruption had occurred (I
even got to attach gdb to it, as I was able to restore a corrupted
cluster), but I don't think it's of much use. This is from a different
occurrence than the one detailed above, so the block numbers don't match,
but it's the exact same thing:

#0  errstart (elevel=21, domain=0x0) at utils/error/elog.c:348
#1  0x000055c81ea83a0a in errstart_cold (domain=0x0, elevel=21) at utils/error/elog.c:333
#2  mdread (reln=0x55c8210bbe00, forknum=MAIN_FORKNUM, blocknum=1203, buffer=0x7f51b4c32000) at storage/smgr/md.c:796
#3  0x000055c81ee396cf in smgrread (buffer=0x7f51b4c32000, blocknum=1203, forknum=MAIN_FORKNUM, reln=0x55c8210bbe00) at storage/smgr/smgr.c:565
#4  ReadBuffer_common (smgr=0x55c8210bbe00, relpersistence=relpersistence@entry=112 'p', forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=1203, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>, hit=0x7fff6e22f607) at storage/buffer/bufmgr.c:1124
#5  0x000055c81ee39d0b in ReadBufferExtended (reln=0x7f51b294f0d0, forkNum=MAIN_FORKNUM, blockNum=1203, mode=RBM_NORMAL, strategy=<optimized out>) at access/brin/../../../../src/include/utils/rel.h:576
#6  0x000055c81eb1525c in ReadBufferBI (mode=RBM_NORMAL, bistate=0x0, targetBlock=1203, relation=<optimized out>) at access/heap/hio.c:96
#7  RelationGetBufferForTuple (relation=<optimized out>, len=<optimized out>, otherBuffer=0, options=<optimized out>, bistate=0x0, vmbuffer=<optimized out>, vmbuffer_other=<optimized out>, num_pages=<optimized out>) at access/heap/hio.c:620
#8  0x000055c81eafcace in heap_insert (relation=0x7f51b294f0d0, tup=0x55c821042580, cid=<optimized out>, options=0, bistate=0x0) at access/heap/heapam.c:1862
#9  0x000055c81eb08bab in heapam_tuple_insert (relation=0x7f51b294f0d0, slot=0x55c821042478, cid=0, options=0, bistate=0x0) at access/heap/heapam_handler.c:252
#10 0x000055c81ecce2f1 in table_tuple_insert (bistate=0x0, options=0, cid=<optimized out>, slot=0x55c821042478, rel=<optimized out>) at executor/../../../src/include/access/tableam.h:1400
#11 ExecInsert (context=context@entry=0x7fff6e22f9b0, resultRelInfo=resultRelInfo@entry=0x55c8210410c0, slot=slot@entry=0x55c821042478, canSetTag=<optimized out>, inserted_tuple=inserted_tuple@entry=0x0, insert_destrel=insert_destrel@entry=0x0) at executor/nodeModifyTable.c:1133
#12 0x000055c81eccf1eb in ExecModifyTable (pstate=<optimized out>) at executor/nodeModifyTable.c:3810

Just a regular insertion failing to read the block it's trying to insert
into.

> > In our case, we need to repair the FSM. The instructions on the wiki do
> > work, but maybe we should add something like the attached patch (modeled
> > after the same feature in pg_visibility) to make it possible to repair
> > the FSM corruption online. What do you think about it ?
>
> That's reasonable in concept.

Ok, I'll start a separate thread on -hackers for that. Thank you!
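PS: for anyone needing to check other relations for the same corruption in
the meantime, here is a hedged sketch with pg_freespacemap (relation name
and block number are hypothetical). The two-argument form of pg_freespace()
reads the FSM directly, without a bounds check against the relation size,
so probing just past the last real block exposes a stale entry:

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
-- number of real blocks; in the case above: 35 (blocks 0..34)
SELECT pg_relation_size('mytable') / current_setting('block_size')::bigint;
-- probe the first block past the end; a non-zero result means the FSM
-- still advertises free space in a block that does not exist
SELECT pg_freespace('mytable', 35);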