Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From Peter Geoghegan
Subject Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
Msg-id CAH2-Wzn9eJMQYnxBmc4=VsGcK3tLk6Z1xO2s9nXhBRMBqHTJ3Q@mail.gmail.com
In response to Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
List pgsql-bugs
On Mon, Aug 6, 2018 at 10:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'll work to isolate and diagnose the problem today. It likely has
> something to do with corrupting the state needed by a catalog parallel
> index build in the context of the VACUUM FULL. pg_attribute grows to
> several tens of megabytes here, which is enough to get a parallel
> index build.

The repro can be simplified further: running VACUUM FULL on
pg_attribute alone is enough. There is no index corruption before that
point. Afterwards there is: both pg_attribute_relid_attnam_index and
pg_attribute_relid_attnum_index appear to be corrupt. All other
symptoms probably stem from this initial corruption, so I'm focusing
on it.

Looking at the corrupt pg_attribute_relid_attnum_index structure, I
see that the index does actually have an entry for the heap tuple that
amcheck reports as lacking one -- at least, it has a key match. What
amcheck actually noticed is that the heap item pointer is wrong (i.e.
the index tuple points to the wrong heap tuple). I also noticed that
nearby index tuples have duplicate entries: the first points to
approximately the same place in the heap as the tuple amcheck flagged
as corrupt, and the second points to approximately where amcheck
expected to find it. (amcheck was complaining about an adjacent entry,
so the heap locations only approximately match.)

I suspect that the problem is that parallel workers have a different
idea about which relfilenode they need to scan, or something along
those lines. Maybe cluster_rel() needs to be taught about parallel
CREATE INDEX. I must have missed some detail within cluster.c prior to
parallel CREATE INDEX going in.
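
To make that suspicion concrete, here is a toy model in Python. It is
only a sketch: plain lists stand in for heap files, list positions
stand in for heap item pointers, and none of the function names
correspond to anything in PostgreSQL. It assumes the index build scans
the old heap file while the table itself has been rewritten into a new
one, which is the hypothesis above, not a confirmed diagnosis.

```python
# Toy model only: lists stand in for heap files, positions for TIDs.
# All names here are hypothetical; this is not PostgreSQL code.

def rewrite_heap(old_heap):
    """Mimic a table rewrite: copy live tuples to a new file, drop dead ones."""
    return [key for key, live in old_heap if live]


def build_index(heap_keys):
    """Index a heap file as sorted (key, tid) pairs; tid is the position scanned."""
    return sorted((key, tid) for tid, key in enumerate(heap_keys))


def heap_tuples_without_entry(heap_keys, index):
    """Return heap tuples (key, tid) with no exactly matching index entry."""
    entries = set(index)
    return [(key, tid) for tid, key in enumerate(heap_keys)
            if (key, tid) not in entries]


# Old file: (key, is_live). Dead tuples disappear in the rewrite.
old_heap = [("a", True), ("b", False), ("c", True), ("d", False), ("e", True)]
new_heap = rewrite_heap(old_heap)

# Build that scans the new file: every heap tuple has a matching entry.
good_index = build_index(new_heap)

# Assumed bug: the build scans the OLD file, so TIDs are old positions.
stale_index = sorted((key, tid)
                     for tid, (key, live) in enumerate(old_heap) if live)

missing = heap_tuples_without_entry(new_heap, stale_index)
indexed_keys = {key for key, _ in stale_index}
```

In this model every key in `missing` is still present in
`indexed_keys`, so the check fails on the heap pointer rather than on
the key, which resembles the "key match, wrong heap tuple" symptom
described above.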

-- 
Peter Geoghegan
