Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
Msg-id 22317.1556206341@sss.pgh.pa.us
In response to Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6 (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Apr 23, 2019 at 08:03:37PM -0400, Tom Lane wrote:
>> Oh!  One gets you ten it "works" as long as the pg_class update is a
>> HOT update, so that we don't actually end up touching the indexes.

> I have been able to spend a bit more time testing and looking at the
> root of the problem, and I have found two things:
> 1) The problem is reproducible with REL9_5_STABLE.

Actually, as far as I can tell, this has been broken since day 1.
I can reproduce the assertion failure back to 9.1, and I think the
only reason it doesn't happen in older branches is that they lack
the ReindexIsProcessingIndex() check in RELATION_CHECKS :-(.

What you have to do to get it to crash is to ensure that
RelationSetNewRelfilenode's update of pg_class will be a non-HOT
update.  You can try to set that up with "vacuum full pg_class"
but it turns out that that tends to leave the pg_class entries
for pg_class's indexes in the last page of the relation, which
is usually not totally full, so that a HOT update works and the
bug doesn't manifest.
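
To make the page-fullness condition concrete, here is a toy model in Python. It is only a sketch, not PostgreSQL code: the names Page, can_hot_update and fill_last_page are hypothetical, the tuple size of 200 bytes is a made-up number, and page headers and line pointers are ignored. The one rule it encodes is that a HOT update needs unchanged indexed columns and enough free space on the same page for the new tuple version.

```python
# Toy model only: not PostgreSQL code. All sizes are illustrative assumptions.
from dataclasses import dataclass

PAGE_SIZE = 8192  # default PostgreSQL block size in bytes


@dataclass
class Page:
    used: int = 0  # bytes already occupied (headers and line pointers ignored)

    def free(self) -> int:
        return PAGE_SIZE - self.used

    def add(self, nbytes: int) -> bool:
        """Store a tuple of nbytes on this page if it fits."""
        if nbytes > self.free():
            return False
        self.used += nbytes
        return True


def can_hot_update(page: Page, new_tuple_bytes: int,
                   indexed_cols_changed: bool) -> bool:
    """HOT is possible only if no indexed column changes and the new
    tuple version fits on the same page as the old one."""
    return (not indexed_cols_changed) and new_tuple_bytes <= page.free()


def fill_last_page(page: Page, row_bytes: int, max_rows: int) -> int:
    """Add up to max_rows rows (think: one per CREATE TABLE) and return
    how many of them fit on this page."""
    added = 0
    for _ in range(max_rows):
        if not page.add(row_bytes):
            break
        added += 1
    return added


if __name__ == "__main__":
    last_page = Page(used=4096)                    # half-full last page
    print(can_hot_update(last_page, 200, False))   # room left: HOT possible
    fill_last_page(last_page, 200, 51)             # dummy rows use up the space
    print(can_hot_update(last_page, 200, False))   # page full: non-HOT update
```

In this model the half-full last page allows the HOT update, and once the dummy rows have consumed the free space the same update has to go to another page and therefore touches the indexes.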

A recipe like the following breaks every branch, by ensuring that
the page containing pg_class_relname_nsp_index's entry is full:

regression=# vacuum full pg_class;
VACUUM
regression=# do $$ begin
for i in 100 .. 150 loop
execute 'create table dummy'||i||'(f1 int)';
end loop;
end $$;
DO
regression=# reindex index pg_class_relname_nsp_index;
psql: server closed the connection unexpectedly


As for an actual fix, I tried just moving reindex_index's
SetReindexProcessing call from where it is down to after
RelationSetNewRelfilenode, but that isn't sufficient:

regression=# reindex index pg_class_relname_nsp_index;
psql: ERROR:  could not read block 3 in file "base/16384/41119": read only 0 of 8192 bytes

#0  errfinish (dummy=0) at elog.c:411
#1  0x00000000007a9453 in mdread (reln=<value optimized out>,
    forknum=<value optimized out>, blocknum=<value optimized out>,
    buffer=0x7f608e6a7d00 "") at md.c:633
#2  0x000000000077a9af in ReadBuffer_common (smgr=<value optimized out>,
    relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=3, mode=RBM_NORMAL,
    strategy=0x0, hit=0x7fff6a7452ef) at bufmgr.c:896
#3  0x000000000077b67e in ReadBufferExtended (reln=0x7f608db5d670,
    forkNum=MAIN_FORKNUM, blockNum=3, mode=<value optimized out>,
    strategy=<value optimized out>) at bufmgr.c:664
#4  0x00000000004ea95a in _bt_getbuf (rel=0x7f608db5d670,
    blkno=<value optimized out>, access=1) at nbtpage.c:805
#5  0x00000000004eb67a in _bt_getroot (rel=0x7f608db5d670, access=2)
    at nbtpage.c:323
#6  0x00000000004f2237 in _bt_search (rel=0x7f608db5d670, key=0x1d5a0c0,
    bufP=0x7fff6a7456a8, access=2, snapshot=0x0) at nbtsearch.c:99
#7  0x00000000004e8caf in _bt_doinsert (rel=0x7f608db5d670, itup=0x1c85e58,
    checkUnique=UNIQUE_CHECK_YES, heapRel=0x1ccb8d0) at nbtinsert.c:219
#8  0x00000000004efc17 in btinsert (rel=0x7f608db5d670,
    values=<value optimized out>, isnull=<value optimized out>,
    ht_ctid=0x1d12dc4, heapRel=0x1ccb8d0, checkUnique=UNIQUE_CHECK_YES,
    indexInfo=0x1c857f8) at nbtree.c:205
#9  0x000000000054c320 in CatalogIndexInsert (indstate=<value optimized out>,
    heapTuple=0x1d12dc0) at indexing.c:140
#10 0x000000000054c502 in CatalogTupleUpdate (heapRel=0x1ccb8d0,
    otid=0x1d12dc4, tup=0x1d12dc0) at indexing.c:215
#11 0x00000000008bcba7 in RelationSetNewRelfilenode (relation=0x7f608db5d670,
    persistence=112 'p') at relcache.c:3531
#12 0x0000000000548b16 in reindex_index (indexId=2663,
    skip_constraint_checks=false, persistence=112 'p', options=0)
    at index.c:3336
#13 0x00000000005ed129 in ReindexIndex (indexRelation=<value optimized out>,
    options=0, concurrent=false) at indexcmds.c:2304
#14 0x00000000007b5a45 in standard_ProcessUtility (pstmt=0x1c66d70,
    queryString=0x1c65f68 "reindex index pg_class_relname_nsp_index;",
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x1c66e68, completionTag=0x7fff6a745e40 "") at utility.c:787

The problem here is that RelationSetNewRelfilenode is aggressively
changing the index's relcache entry before it's written out the
updated tuple, so that the tuple update tries to make an index
entry in the new storage which isn't filled yet.  I think we can
fix it by *not* doing that, but leaving it to the relcache inval
during the CommandCounterIncrement call to update the relcache
entry.  However, it looks like that will take some API refactoring,
because the storage-creation functions expect to get the new
relfilenode out of the relcache entry, and they'll have to be
changed to not do it that way.
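
The ordering hazard can be sketched with another toy model in Python. Again this is not PostgreSQL code and not the eventual patch: ToyCluster, its methods, and the filenode numbers 100 and 200 are all hypothetical, and "storage" is reduced to a block count per file. It only contrasts switching the cached filenode before the catalog update with leaving the switch to a later invalidation step.

```python
# Toy model only: not PostgreSQL code. Names and numbers are assumptions.
class ReadError(Exception):
    pass


class ToyCluster:
    def __init__(self):
        self.files = {100: 4}         # filenode -> number of blocks (100 = old index)
        self.relcache_filenode = 100  # file the cached index entry points at
        self.pending_filenode = None  # new file not yet visible in the cache
        self.root_block = 3           # block an index insert has to read

    def index_insert(self) -> int:
        """Insert an entry into whatever file the cache currently points at."""
        nblocks = self.files[self.relcache_filenode]
        if self.root_block >= nblocks:
            raise ReadError(
                f"could not read block {self.root_block} "
                f"in file {self.relcache_filenode}")
        return self.relcache_filenode  # file that received the entry

    def command_counter_increment(self) -> None:
        """Stand-in for the invalidation that refreshes the cache entry."""
        if self.pending_filenode is not None:
            self.relcache_filenode = self.pending_filenode
            self.pending_filenode = None

    def set_new_filenode_eager(self, new: int) -> int:
        """Switch the cache first, then do the (non-HOT) catalog update."""
        self.files[new] = 0            # new storage exists but is empty
        self.relcache_filenode = new   # switched too early
        return self.index_insert()     # tries to read the empty file

    def set_new_filenode_deferred(self, new: int) -> int:
        """Do the catalog update against the old storage, switch afterwards."""
        self.files[new] = 0
        self.pending_filenode = new
        target = self.index_insert()   # still goes to the old file
        self.command_counter_increment()
        return target
```

Calling set_new_filenode_eager(200) on a fresh ToyCluster raises ReadError, mirroring the "could not read block" failure above, while set_new_filenode_deferred(200) puts the entry into file 100 and only then repoints the cache at file 200.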

I'll work on a patch ...

            regards, tom lane


