Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
Date
Msg-id 20190501013638.7u7fwr4zktbyyng5@alap3.anarazel.de
In response to Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6  (Andres Freund <andres@anarazel.de>)
Responses Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
List pgsql-hackers
On 2019-04-30 15:53:07 -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-04-30 18:42:36 -0400, Tom Lane wrote:
> > markhor just reported in with results showing that we have worse
> > problems than deadlock-prone tests in the back branches: 9.4
> > for example looks like
> 
> >   -- whole tables
> >   REINDEX TABLE pg_class; -- mapped, non-shared, critical
> > + ERROR:  could not read block 0 in file "base/16384/27769": read only 0 of 8192 bytes
> 
> Ugh.  Also failed on 9.6.

I see the bug. Turns out we need to figure out another way to solve the
assertion triggered by doing catalog updates within
RelationSetNewRelfilenode() - we can't just move the
SetReindexProcessing() to after it.  When CCA (CLOBBER_CACHE_ALWAYS) is
enabled, the CommandCounterIncrement() near the tail of
RelationSetNewRelfilenode() triggers a rebuild of the catalog entries -
but with SetReindexProcessing() not yet in effect, those scans will try
to use the index currently being rebuilt. Which then predictably fails:

#0  mdread (reln=0x5600aea36498, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7f71037db800 "") at /home/andres/src/postgresql/src/backend/storage/smgr/md.c:633
#1  0x00005600ae3f656f in smgrread (reln=0x5600aea36498, forknum=MAIN_FORKNUM, blocknum=0, buffer=0x7f71037db800 "") at /home/andres/src/postgresql/src/backend/storage/smgr/smgr.c:590
#2  0x00005600ae3b4c13 in ReadBuffer_common (smgr=0x5600aea36498, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0, hit=0x7fff5bb11cab) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:896
#3  0x00005600ae3b44ab in ReadBufferExtended (reln=0x7f7107972540, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:664
#4  0x00005600ae3b437f in ReadBuffer (reln=0x7f7107972540, blockNum=0) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:596
#5  0x00005600ae00e0b3 in _bt_getbuf (rel=0x7f7107972540, blkno=0, access=1) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtpage.c:805
#6  0x00005600ae00dd2a in _bt_heapkeyspace (rel=0x7f7107972540) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtpage.c:694
#7  0x00005600ae01679c in _bt_first (scan=0x5600aea44440, dir=ForwardScanDirection) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtsearch.c:1237
#8  0x00005600ae012617 in btgettuple (scan=0x5600aea44440, dir=ForwardScanDirection) at /home/andres/src/postgresql/src/backend/access/nbtree/nbtree.c:247
#9  0x00005600ae005572 in index_getnext_tid (scan=0x5600aea44440, direction=ForwardScanDirection) at /home/andres/src/postgresql/src/backend/access/index/indexam.c:550
#10 0x00005600ae00571e in index_getnext_slot (scan=0x5600aea44440, direction=ForwardScanDirection, slot=0x5600ae9c6ed0) at /home/andres/src/postgresql/src/backend/access/index/indexam.c:642
#11 0x00005600ae003e54 in systable_getnext (sysscan=0x5600aea44080) at /home/andres/src/postgresql/src/backend/access/index/genam.c:450
#12 0x00005600ae564292 in ScanPgRelation (targetRelId=1259, indexOK=true, force_non_historic=false) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:365
#13 0x00005600ae568203 in RelationReloadNailed (relation=0x5600aea0c4d0) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2292
#14 0x00005600ae568621 in RelationClearRelation (relation=0x5600aea0c4d0, rebuild=true) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2425
#15 0x00005600ae569081 in RelationCacheInvalidate () at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2858
#16 0x00005600ae55b32b in InvalidateSystemCaches () at /home/andres/src/postgresql/src/backend/utils/cache/inval.c:649
#17 0x00005600ae55b408 in AcceptInvalidationMessages () at /home/andres/src/postgresql/src/backend/utils/cache/inval.c:708
#18 0x00005600ae3d7b22 in LockRelationOid (relid=1259, lockmode=1) at /home/andres/src/postgresql/src/backend/storage/lmgr/lmgr.c:136
#19 0x00005600adf85ad2 in relation_open (relationId=1259, lockmode=1) at /home/andres/src/postgresql/src/backend/access/common/relation.c:56
#20 0x00005600ae040337 in table_open (relationId=1259, lockmode=1) at /home/andres/src/postgresql/src/backend/access/table/table.c:43
#21 0x00005600ae564215 in ScanPgRelation (targetRelId=2662, indexOK=false, force_non_historic=false) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:348
#22 0x00005600ae567ecf in RelationReloadIndexInfo (relation=0x7f7107972540) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2170
#23 0x00005600ae5681d3 in RelationReloadNailed (relation=0x7f7107972540) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2270
#24 0x00005600ae568621 in RelationClearRelation (relation=0x7f7107972540, rebuild=true) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2425
#25 0x00005600ae568d19 in RelationFlushRelation (relation=0x7f7107972540) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2686
#26 0x00005600ae568e32 in RelationCacheInvalidateEntry (relationId=2662) at /home/andres/src/postgresql/src/backend/utils/cache/relcache.c:2738
#27 0x00005600ae55b1af in LocalExecuteInvalidationMessage (msg=0x5600aea262a8) at /home/andres/src/postgresql/src/backend/utils/cache/inval.c:589
#28 0x00005600ae55af06 in ProcessInvalidationMessages (hdr=0x5600aea26250, func=0x5600ae55b0a8 <LocalExecuteInvalidationMessage>)

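Spelled out in code, the ordering that produces this backtrace is roughly
the following - a simplified sketch of the swapped calls in
reindex_index(), not the literal source:

   /* pg_class gets updated, and the tail of this ends in a CCI */
   RelationSetNewRelfilenode(iRel, persistence);

   /*
    * Under CLOBBER_CACHE_ALWAYS that CommandCounterIncrement() processes
    * invalidations and rescans the catalogs - still using the index,
    * whose brand-new relfilenode is empty at this point, because nothing
    * has told this backend to avoid it yet.
    */

   /* only here is the index marked as being rebuilt - too late */
   SetReindexProcessing(heapId, indexId);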

I can think of several ways to properly fix this:

1) Remove the CommandCounterIncrement() from
   RelationSetNewRelfilenode(), move it to the callers. That would allow
   for, I think, proper sequencing in reindex_index():

   /*
    * Create a new relfilenode - note that this doesn't make the new
    * relfilenode visible yet, we'd otherwise run into danger of that
    * index (which is empty at this point) being used while processing
    * cache invalidations.
    */
   RelationSetNewRelfilenode(iRel, persistence);

   /*
    * Before making the new relfilenode visible, prevent its use of the
    * to-be-reindexed index while building it.
    */
   SetReindexProcessing(heapId, indexId);

   CommandCounterIncrement();


2) Separate the state used for the assertion triggered by
   SetReindexProcessing() from the state that prohibits use of the index
   for searches.

3) Turn on REINDEX_REL_SUPPRESS_INDEX_USE mode when reindexing
   pg_class (see the sketch after this list). But that seems like a
   bigger hammer than necessary?



Sidenote: It'd be pretty helpful to have an option for the buildfarm etc.
to turn md.c-type errors like this into PANICs.
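
Purely as a sketch of what that could look like (the GUC name is
invented, and the ereport() is modeled on md.c's existing short-read
error, ignoring the zero_damaged_pages special case):

   /* hypothetical developer GUC the buildfarm could set */
   bool		panic_on_smgr_read_failure = false;

   /* in mdread(), where the short read is reported today */
   if (nbytes != BLCKSZ)
   	ereport(panic_on_smgr_read_failure ? PANIC : ERROR,
   			(errcode(ERRCODE_DATA_CORRUPTED),
   			 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
   					blocknum, relpath(reln->smgr_rnode, forknum),
   					nbytes, BLCKSZ)));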


> > Given this, I'm rethinking my position that we can dispense with these
> > test cases.  Let's try putting them in a standalone test script, and
> > see whether that leads to failures or not.  Even if it does, we'd
> > better keep them until we've got a fully clean bill of health from
> > the buildfarm.
> 
> Yea. Seems likely this indicates a proper, distinct, bug :/
> 
> I'll move the test into a new "reindex_catalog" test, with a comment
> explaining that the failure cases necessitating that are somewhere
> between bugs, ugly warts, and hard-to-fix edge cases.

Just pushed that.

Greetings,

Andres Freund


