Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6 - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
Msg-id: 20190501013638.7u7fwr4zktbyyng5@alap3.anarazel.de
In response to: Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6 (Andres Freund <andres@anarazel.de>)
Responses: Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
           Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
List: pgsql-hackers
On 2019-04-30 15:53:07 -0700, Andres Freund wrote:
> Hi,
>
> On 2019-04-30 18:42:36 -0400, Tom Lane wrote:
> > markhor just reported in with results showing that we have worse
> > problems than deadlock-prone tests in the back branches: 9.4
> > for example looks like
> >
> >   -- whole tables
> >   REINDEX TABLE pg_class; -- mapped, non-shared, critical
> > + ERROR: could not read block 0 in file "base/16384/27769": read only 0 of 8192 bytes
>
> Ugh. Also failed on 9.6.

I see the bug. It turns out we need to find another way to solve the
assertion triggered by doing catalog updates within
RelationSetNewRelfilenode() - we can't just move the
SetReindexProcessing() before it. When CCA (CLOBBER_CACHE_ALWAYS) is
enabled, the CommandCounterIncrement() near the tail of
RelationSetNewRelfilenode() triggers a rebuild of the catalog entries -
but without the SetReindexProcessing() those scans will try to use the
index currently being rebuilt, which then predictably fails:

#0  mdread (reln=0x5600aea36498, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7f71037db800 "") at /home/andres/src/postgresql/src/backend/storage/smgr/md.c:633
#1  0x00005600ae3f656f in smgrread (reln=0x5600aea36498, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7f71037db800 "") at /home/andres/src/postgresql/src/backend/storage/smgr/smgr.c:590
#2  0x00005600ae3b4c13 in ReadBuffer_common (smgr=0x5600aea36498, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff5bb11cab) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:896
#3  0x00005600ae3b44ab in ReadBufferExtended (reln=0x7f7107972540, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:664
#4  0x00005600ae3b437f in ReadBuffer (reln=0x7f7107972540, blockNum=0) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:596
#5  0x00005600ae00e0b3 in _bt_getbuf (rel=0x7f7107972540, blkno=0, access=1) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtpage.c:805
#6  0x00005600ae00dd2a in _bt_heapkeyspace (rel=0x7f7107972540) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtpage.c:694
#7  0x00005600ae01679c in _bt_first (scan=0x5600aea44440, dir=ForwardScanDirection) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtsearch.c:1237
#8  0x00005600ae012617 in btgettuple (scan=0x5600aea44440, dir=ForwardScanDirection) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtree.c:247
#9  0x00005600ae005572 in index_getnext_tid (scan=0x5600aea44440, direction=ForwardScanDirection) at /home/andres/src/postgresql/src/backend/access/index/indexam.c:550
#10 0x00005600ae00571e in index_getnext_slot (scan=0x5600aea44440, direction=ForwardScanDirection, slot=0x5600ae9c6ed0) at /home/andres/src/postgresql/src/backend/access/index/indexam.c:642
#11 0x00005600ae003e54 in systable_getnext (sysscan=0x5600aea44080) at /home/andres/src/postgresql/src/backend/access/index/genam.c:450
#12 0x00005600ae564292 in ScanPgRelation (targetRelId=1259, indexOK=true, force_non_historic=false) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:365
#13 0x00005600ae568203 in RelationReloadNailed (relation=0x5600aea0c4d0) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2292
#14 0x00005600ae568621 in RelationClearRelation (relation=0x5600aea0c4d0, rebuild=true) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2425
#15 0x00005600ae569081 in RelationCacheInvalidate () at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2858
#16 0x00005600ae55b32b in InvalidateSystemCaches () at /home/andres/src/postgresql/src/backend/utils/cache/inval.c:649
#17 0x00005600ae55b408 in AcceptInvalidationMessages () at /home/andres/src/postgresql/src/backend/utils/cache/inval.c:708
#18 0x00005600ae3d7b22 in LockRelationOid (relid=1259, lockmode=1) at /home/andres/src/postgresql/src/backend/storage/lmgr/lmgr.c:136
#19 0x00005600adf85ad2 in relation_open (relationId=1259, lockmode=1) at /home/andres/src/postgresql/src/backend/access/common/relation.c:56
#20 0x00005600ae040337 in table_open (relationId=1259, lockmode=1) at /home/andres/src/postgresql/src/backend/access/table/table.c:43
#21 0x00005600ae564215 in ScanPgRelation (targetRelId=2662, indexOK=false, force_non_historic=false) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:348
#22 0x00005600ae567ecf in RelationReloadIndexInfo (relation=0x7f7107972540) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2170
#23 0x00005600ae5681d3 in RelationReloadNailed (relation=0x7f7107972540) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2270
#24 0x00005600ae568621 in RelationClearRelation (relation=0x7f7107972540, rebuild=true) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2425
#25 0x00005600ae568d19 in RelationFlushRelation (relation=0x7f7107972540) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2686
#26 0x00005600ae568e32 in RelationCacheInvalidateEntry (relationId=2662) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2738
#27 0x00005600ae55b1af in LocalExecuteInvalidationMessage (msg=0x5600aea262a8) at /home/andres/src/postgresql/src/backend/utils/cache/inval.c:589
#28 0x00005600ae55af06 in ProcessInvalidationMessages (hdr=0x5600aea26250, func=0x5600ae55b0a8 <LocalExecuteInvalidationMessage>)

I can think of several ways to properly fix this:

1) Remove the CommandCounterIncrement() from RelationSetNewRelfilenode()
   and move it to the callers. That would allow for, I think, proper
   sequencing in reindex_index():

       /*
        * Create a new relfilenode - note that this doesn't make the new
        * relfilenode visible yet; we'd otherwise run into danger of that
        * index (which is empty at this point) being used while processing
        * cache invalidations.
        */
       RelationSetNewRelfilenode(iRel, persistence);

       /*
        * Before making the new relfilenode visible, prevent use of the
        * to-be-reindexed index while rebuilding it.
        */
       SetReindexProcessing(heapId, indexId);
       CommandCounterIncrement();

2) Separate the state for the assertion triggered by
   SetReindexProcessing() from the prohibition on using the index for
   searches (a rough sketch follows at the end of this mail).

3) Turn on REINDEX_REL_SUPPRESS_INDEX_USE mode when reindexing pg_class
   (also sketched below). But that seems like a bigger hammer than
   necessary?

Sidenote: it'd be pretty helpful to have an option for the buildfarm
etc. to turn md.c-type errors like this into PANICs (a sketch of that is
below as well).

> > Given this, I'm rethinking my position that we can dispense with these
> > test cases. Let's try putting them in a standalone test script, and
> > see whether that leads to failures or not. Even if it does, we'd
> > better keep them until we've got a fully clean bill of health from
> > the buildfarm.
>
> Yea. Seems likely this indicates a proper, distinct, bug :/
>
> I'll move the test into a new "reindex_catalog" test, with a comment
> explaining that the failure cases necessitating that are somewhere
> between bugs, ugly warts, and hard-to-fix edge cases.

Just pushed that.
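To restate the problematic ordering described above (before the
backtrace) in code form - a paraphrase, not a verbatim excerpt of
reindex_index():

    /*
     * Paraphrased sketch of the current, problematic sequence:
     * RelationSetNewRelfilenode() assigns the new (still empty)
     * relfilenode and does a CommandCounterIncrement() internally.
     * Under CLOBBER_CACHE_ALWAYS that CCI processes relcache
     * invalidations, and ScanPgRelation() then scans pg_class through
     * the index being rebuilt - which is empty, hence the
     * "read only 0 of 8192 bytes" error.
     */
    RelationSetNewRelfilenode(iRel, persistence);   /* CCI happens in here */

    /* only now is the index marked as being rebuilt - too late under CCA */
    SetReindexProcessing(heapId, indexId);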
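For illustration, option 2 could look roughly like this - the names are
made up for the sketch (except ReindexIsProcessingIndex()), not the
actual index.c API:

    /*
     * Illustrative sketch only: keep the state consulted by the
     * catalog-update assertions separate from the state that forbids
     * using the index for scans.
     */
    static Oid  pendingReindexedIndexOid = InvalidOid;    /* assertion-only state */
    static Oid  currentlyReindexedIndexOid = InvalidOid;  /* forbids use for scans */

    /* consulted by the Assert()s guarding catalog updates during REINDEX */
    bool
    ReindexIsPendingIndex(Oid indexOid)     /* invented name */
    {
        return indexOid == pendingReindexedIndexOid;
    }

    /* consulted by systable_beginscan() when choosing index vs. heap scan */
    bool
    ReindexIsProcessingIndex(Oid indexOid)
    {
        return indexOid == currentlyReindexedIndexOid;
    }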
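Option 3 would amount to something like the following where REINDEX
TABLE calls reindex_relation() - the flag names are the existing ones,
the pg_class special case is the illustrative part:

    /*
     * Sketch: treat pg_class the way CLUSTER/VACUUM FULL already do and
     * forbid use of its own indexes for the duration of the rebuild.
     */
    int         flags = REINDEX_REL_PROCESS_TOAST | REINDEX_REL_CHECK_CONSTRAINTS;

    if (heapOid == RelationRelationId)      /* reindexing pg_class itself */
        flags |= REINDEX_REL_SUPPRESS_INDEX_USE;

    reindex_relation(heapOid, flags, options);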
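And for the sidenote, a minimal sketch of what such a knob could look
like in mdread() - the GUC name is invented, the ereport() is
essentially the error md.c already raises:

    /* hypothetical developer GUC, e.g. enabled on buildfarm animals */
    bool        panic_on_short_read = false;

    ...

    if (nbytes != BLCKSZ)
        ereport(panic_on_short_read ? PANIC : ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
                        blocknum,
                        relpath(reln->smgr_rnode, forknum),
                        nbytes, BLCKSZ)));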
Greetings, Andres Freund