On Mon, Aug 6, 2018 at 10:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'll work to isolate and diagnose the problem today. It likely has
> something to do with corrupting the state needed by a catalog parallel
> index build in the context of the VACUUM FULL. pg_attribute grows to
> several tens of megabytes here, which is enough to get a parallel
> index build.
This repro can be further simplified, by just doing a VACUUM FULL on
pg_attribute alone. There is no index corruption prior to that point.
After that point, there is -- both pg_attribute_relid_attnam_index and
pg_attribute_relid_attnum_index seem to become corrupt. All other
symptoms probably stem from this initial corruption, so I'm focusing
on it.
What I see if I look at the corrupt pg_attribute_relid_attnum_index
structure is that the index does actually have an entry for a heap
tuple that amcheck complains about lacking an entry for -- at least,
it has a key match. The problem that amcheck noticed was that the heap
item pointer was not as it should be (i.e. the index tuple points to
the wrong heap tuple). I also noticed that nearby index tuples had
duplicate entries, the first pointing to approximately the same place
in the heap that the corrupt-to-amcheck tuple points to, and the
second pointing to approximately the same place in the heap that
amcheck expected to find it at (amcheck was complaining about an
adjacent entry, so it's only approximately the same place in the
heap).
I suspect that the problem is that parallel workers have a different
idea about which relfilenode they need to scan, or something along
those lines. Maybe cluster_rel() needs to be taught about parallel
CREATE INDEX. I must have missed some detail within cluster.c prior to
parallel CREATE INDEX going in.
--
Peter Geoghegan