Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY - Mailing list pgsql-hackers

From: Pavan Deolasee
Subject: Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Msg-id: CABOikdMPEe6UazAmyCwLQskA-+BbmGtAhbJZePxS64rQ-KeMOw@mail.gmail.com
In response to: Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers


On Sat, Feb 4, 2017 at 12:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:


> If we do above, then I think primary key attrs won't be returned
> because for those we are using relation copy rather than an original
> working copy of attrs. See code below:
>
> switch (attrKind)
> {
>     ..
>     case INDEX_ATTR_BITMAP_PRIMARY_KEY:
>         return bms_copy(relation->rd_pkattr);


I don't think that would be a problem: if relation->rd_indexattr is NULL, the primary key attrs will be recomputed and returned.
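For reference, the overall shape of RelationGetIndexAttrBitmap() is roughly the following (paraphrased from memory and abbreviated, not an exact copy of relcache.c). The switch quoted above is only reached when rd_indexattr is already set; the recomputation path rebuilds all four bitmaps together:

/* Roughly the structure of RelationGetIndexAttrBitmap(), abbreviated */
if (relation->rd_indexattr != NULL)
{
    /* Quick exit: bitmaps are already cached, hand back a copy */
    switch (attrKind)
    {
        case INDEX_ATTR_BITMAP_PRIMARY_KEY:
            return bms_copy(relation->rd_pkattr);
        /* ... likewise for rd_indexattr, rd_keyattr, rd_idattr ... */
    }
}

/*
 * Slow path: rd_indexattr was NULL, so walk the index list, rebuild
 * rd_indexattr, rd_keyattr, rd_pkattr and rd_idattr together, cache them,
 * and only then return the one that was asked for.
 */

So clearing rd_indexattr is enough to force the primary key attrs to be rebuilt as well.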
 

> Apart from above, I think after the proposed patch, it will be quite
> possible that in heap_update(), hot_attrs, key_attrs and id_attrs are
> computed using different index lists (consider relcache flush happens
> after computation of hot_attrs or key_attrs) which was previously not
> possible.  I think in the later code we assume that all the three
> should be computed using the same indexlist.  For ex. see the below
> code:

This seems like a real problem to me. I don't know what the concrete consequences are, but having the various attribute lists reflect different views of the set of indexes definitely doesn't seem right.
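(For context, the heap_update() pattern being referred to looks roughly like this; paraphrasing from memory rather than quoting exactly:)

/* heap_update(), roughly: three separate lookups of the cached bitmaps */
hot_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_ALL);
key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
/*
 * With the proposed patch, a relcache flush processed around here could
 * make the next call recompute from a different index list than the one
 * used for the previous two.
 */
id_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_IDENTITY_KEY);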

The problem we're trying to fix is very clear by now. A relcache flush clears rd_indexvalid and rd_indexattr together, but because of the race, rd_indexattr is recomputed from the old index information and gets cached again, while rd_indexvalid (and rd_indexlist) remains unset. That leads to the index corruption in CIC, and may cause other issues that we're not aware of right now. We want the cached attribute lists to become invalid whenever the index list is invalidated, and that's not happening right now.
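To make the ordering concrete, here is a deliberately simplified model of the race. The names (RelCacheEntry, flush(), get_index_list(), compute_bitmaps()) are made up for illustration; this is not the actual relcache.c code:

/* Simplified, hypothetical model of the race -- not relcache.c code */
typedef struct RelCacheEntry
{
    bool        indexvalid;     /* stands in for rd_indexvalid */
    List       *indexlist;      /* stands in for rd_indexlist */
    Bitmapset  *indexattr;      /* stands in for rd_indexattr */
} RelCacheEntry;

/* A relcache flush (e.g. from CIC adding the pg_index row) clears both */
static void
flush(RelCacheEntry *rel)
{
    rel->indexvalid = false;
    bms_free(rel->indexattr);
    rel->indexattr = NULL;
}

static Bitmapset *
get_attr_bitmap(RelCacheEntry *rel)
{
    List       *indexoids;
    Bitmapset  *attrs;

    if (rel->indexattr != NULL)
        return bms_copy(rel->indexattr);    /* cached copy */

    indexoids = get_index_list(rel);        /* reads the old index set */

    /*
     * RACE WINDOW: opening the indexes processes pending invalidation
     * messages, so flush() can run right here and reset indexvalid ...
     */
    attrs = compute_bitmaps(rel, indexoids);

    /*
     * ... but we cache the bitmap computed from the old index set anyway,
     * and nothing will clear it until the *next* flush arrives.
     */
    rel->indexattr = attrs;
    return bms_copy(attrs);
}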

So we have tried three approaches so far (rough sketches of each follow the list):

1. Simply reset rd_indexattr whenever we detect that rd_indexvalid has been reset. This was the first patch. It's a very simple patch, has worked for me with CLOBBER_CACHE_ALWAYS, and even fixes the CIC bug. The only downside seems to be that we're invalidating the cached attribute lists outside the normal relcache invalidation path. It also works on the premise that RelationGetIndexList() will be called before RelationGetIndexAttrBitmap() every time; otherwise we could still end up using stale cached copies. I wonder if cached plans could be a problem here.

2. In the second patch, we tried to recompute the attribute lists if a relcache flush happened in between and the index list was invalidated. We've seen problems with that, in particular it getting into an infinite loop with CLOBBER_CACHE_ALWAYS. It's not clear why that happens, but apparently relcache flushes keep arriving.

3. In the third approach, we don't cache the attribute lists at all if we know the index set has changed, and hope that the next caller gets the correct info. But Amit rightly identified problems with that approach too.
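In terms of the simplified model above, the three approaches look roughly like this (again only sketches, not the actual patch text):

/*
 * Approach 1: whenever the index list has to be rebuilt because indexvalid
 * was reset, drop the cached bitmaps too, so they can never outlive the
 * index list they were computed from.  Relies on get_index_list() being
 * called before get_attr_bitmap().
 */
static List *
get_index_list(RelCacheEntry *rel)
{
    if (!rel->indexvalid)
    {
        bms_free(rel->indexattr);
        rel->indexattr = NULL;          /* forces recomputation later */
        rel->indexlist = rebuild_index_list(rel);
        rel->indexvalid = true;
    }
    return list_copy(rel->indexlist);
}

/*
 * Approach 2: if a flush arrived while we were computing, loop and
 * recompute.  Under CLOBBER_CACHE_ALWAYS every iteration is itself
 * interrupted, which matches the looping we observed.
 */
static Bitmapset *
get_attr_bitmap_retry(RelCacheEntry *rel)
{
    Bitmapset  *attrs;

    for (;;)
    {
        List       *indexoids = get_index_list(rel);

        attrs = compute_bitmaps(rel, indexoids);
        if (rel->indexvalid)
            break;                      /* no flush in between; cache it */
        bms_free(attrs);                /* stale; try again */
    }
    rel->indexattr = attrs;
    return bms_copy(attrs);
}

/*
 * Approach 3: compute once, but skip caching when a flush intervened,
 * hoping that the next caller starts from fresh data.
 */
static Bitmapset *
get_attr_bitmap_nocache(RelCacheEntry *rel)
{
    Bitmapset  *attrs = compute_bitmaps(rel, get_index_list(rel));

    if (rel->indexvalid)
        rel->indexattr = bms_copy(attrs);   /* safe to cache */
    return attrs;
}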

So we either need to find a fourth approach, or stay with the first patch unless we see a problem there too (cached plans, anything else). Or maybe fix CREATE INDEX CONCURRENTLY so that at least the first phase, which adds the index entry to the catalog, happens under a strong lock. That would momentarily block write queries, but would protect us from the race. I'm assuming CIC is the only code that can cause the problem; I haven't checked other potential hazards.

Ideas/suggestions?

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
