Thread: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676when max_parallel_maintenance_workers > 0

BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676when max_parallel_maintenance_workers > 0

From

PG Bug reporting form

Date:

03 August 2018, 14:45:16

The following bug has been logged on the website:

Bug reference:      15309
Logged by:          death lock
Email address:      deathlock13@gmail.com
PostgreSQL version: 11beta2
Operating system:   Debian10 64b
Description:

11beta2 and git REL_11_STABLE, gcc 8.2, pg.conf : autovacuum=off
DB restored from plain-format backup, first 'vacuum full' ends with ERR :
"catalog is missing 1 attribute(s) for relid ..." - seem to point to only
PrimaryKey, sometimes the same PK

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0

From

Tom Lane

Date:

03 August 2018, 16:53:09

=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
> 11beta2 and git REL_11_STABLE, gcc 8.2, pg.conf : autovacuum=off
> DB restored from plain-format backup, first 'vacuum full' ends with ERR :
> "catalog is missing 1 attribute(s) for relid ..." - seem to point to only
> PrimaryKey, sometimes the same PK

If you could provide a self-contained example, this would be very
interesting, but there's nothing we can do with just the information
you've provided here.

Perhaps you can strip and/or anonymize your backup dump file down
to a postable example that reproduces this problem?

            regards, tom lane

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

03 August 2018, 19:51:01

On Fri, Aug 3, 2018 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Perhaps you can strip and/or anonymize your backup dump file down
> to a postable example that reproduces this problem?

It would also be interesting to see the result of this query, which
relies on the v11 amcheck extension (so "CREATE EXTENSION amcheck"
first):

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true),
               c.relname,
               c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

If that doesn't raise an error, you could try the same query, but
remove "AND n.nspname = 'pg_catalog'". That will take considerably
longer, but probably won't be intolerable.

Thanks
-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0

From

Andres Freund

Date:

03 August 2018, 20:11:40


On August 3, 2018 9:51:01 AM PDT, Peter Geoghegan <pg@bowt.ie> wrote:
>On Fri, Aug 3, 2018 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps you can strip and/or anonymize your backup dump file down
>> to a postable example that reproduces this problem?
>
>It would also be interesting to see the result of this query, which
>relies on the v11 amcheck extension (so "CREATE EXTENSION amcheck"
>first):

Maybe I'm missing something, but doesn't that error likely point to a bug in the new parallel index creation code,
ratherthan the created indexes? 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

03 August 2018, 20:29:54

On Fri, Aug 3, 2018 at 10:11 AM, Andres Freund <andres@anarazel.de> wrote:
> Maybe I'm missing something, but doesn't that error likely point to a bug in the new parallel index creation code,
ratherthan the created indexes?
 

I've certainly seen error messages like that with cases of
catalog corruption. I recall repairing pg_attribute by hand when this
happened, and then having to reindex the system catalogs.

I don't really know what's wrong here, but I wouldn't be surprised if
amcheck detected a problem. Let's see.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0

From

Tom Lane

Date:

03 August 2018, 20:58:46

Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Aug 3, 2018 at 10:11 AM, Andres Freund <andres@anarazel.de> wrote:
>> Maybe I'm missing something, but doesn't that error likely point to a bug in the new parallel index creation code,
ratherthan the created indexes? 

> I've certainly seen error messages like that with cases of
> catalog corruption. I recall repairing pg_attribute by hand when this
> happened, and then having to reindex the system catalogs.

> I don't really know what's wrong here, but I wouldn't be surprised if
> amcheck detected a problem. Let's see.

I think the OP's time would be more usefully spent on creating a
submittable test case.  We will want that whether or not amcheck
complains, because the mere fact of the complaint is not likely to be
enough to find the problem.  (I do agree that this smells like a problem
in parallel index build.)

            regards, tom lane

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

03 August 2018, 21:03:51

On Fri, Aug 3, 2018 at 10:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the OP's time would be more usefully spent on creating a
> submittable test case.  We will want that whether or not amcheck
> complains, because the mere fact of the complaint is not likely to be
> enough to find the problem.  (I do agree that this smells like a problem
> in parallel index build.)

A test case would be ideal, but it's unlikely to take very long to run
amcheck. If verification is limited to the system catalogs, then it
will probably take just a few seconds.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

06 August 2018, 20:43:15

On Fri, Aug 3, 2018 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> =?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
>> 11beta2 and git REL_11_STABLE, gcc 8.2, pg.conf : autovacuum=off
>> DB restored from plain-format backup, first 'vacuum full' ends with ERR :
>> "catalog is missing 1 attribute(s) for relid ..." - seem to point to only
>> PrimaryKey, sometimes the same PK
>
> If you could provide a self-contained example, this would be very
> interesting, but there's nothing we can do with just the information
> you've provided here.

I got an off-list testcase from the OP, who writes:

"""
I've attached my 'SHOW ALL' and a backup of example db from PG wiki,
just kept renaming schema/tables until got a large enough number of
indexes - seem to be key to reproducing error(s).

2018-08-06 11:31:44 CEST|TEST|postgres|[local]| ERROR:  duplicate key
value violates unique constraint "pg_class_relname_nsp_index"
2018-08-06 11:31:44 CEST|TEST|postgres|[local]| DETAIL:  Key (relname,
relnamespace)=(pg_class_tblspc_relfilenode_index, 11) already exists.
2018-08-06 11:31:44 CEST|TEST|postgres|[local]| STATEMENT:  vacuum
full analyze ;

or 2018-08-06 11:20:13 CEST|PUMA|postgres|[local]| DETAIL:  Key
(relname, relnamespace)=(idx_last_name, 16498) already exists.

Got this after runing your query with amcheck  before 'VACUUM FULL'
2018-08-06 11:51:10 CEST|||| PANIC:  could not open critical system index 2676
2018-08-06 11:51:10 CEST|TEST|postgres|[local]| ERROR:  could not open
critical system index 2676
2018-08-06 11:51:10 CEST|TEST|postgres|[local]| CONTEXT:  parallel worker
2018-08-06 11:51:10 CEST|TEST|postgres|[local]| STATEMENT:  vacuum full;
2018-08-06 11:51:10 CEST|||| LOG:  background worker "parallel worker"
(PID 16618) was terminated by signal 6: Aborted
2018-08-06 11:51:10 CEST|||| LOG:  terminating any other active server processes

2018-08-06 12:11:24 CEST|||| ERROR:  could not open relation with OID 2696
2018-08-06 12:11:24 CEST|TEST|postgres|[local]| ERROR:  could not open
relation with OID 2696
2018-08-06 12:11:24 CEST|TEST|postgres|[local]| CONTEXT:  parallel worker
2018-08-06 12:11:24 CEST|TEST|postgres|[local]| STATEMENT:  vacuum full;
2018-08-06 12:11:24 CEST|||| LOG:  background worker "parallel worker"
(PID 16930) exited with exit code 1
"""

I can reproduce this with the backup provided. This seems to boil down
to the following:

1. Create a database with tens of thousands of relations, each of
which contain no data.

2. Run an unqualified VACUUM FULL.

This is what I see after the restore:

pg@foodb[22648]=# vacuum FULL ;
ERROR:  duplicate key value violates unique constraint
"pg_class_relname_nsp_index"
DETAIL:  Key (relname, relnamespace)=(customer_pkey, 16445) already exists.
pg@foodb[22648]=# :amcheck
ERROR:  heap tuple (358,1) from table "pg_attribute" lacks matching
index tuple within index "pg_attribute_relid_attnam_index"

I'll work to isolate and diagnose the problem today. It likely has
something to do with corrupting the state needed by a catalog parallel
index build in the context of the VACUUM FULL. pg_attribute grows to
several tens of megabytes here, which is enough to get a parallel
index build.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

06 August 2018, 23:31:16

On Mon, Aug 6, 2018 at 10:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'll work to isolate and diagnose the problem today. It likely has
> something to do with corrupting the state needed by a catalog parallel
> index build in the context of the VACUUM FULL. pg_attribute grows to
> several tens of megabytes here, which is enough to get a parallel
> index build.

This repro can be further simplified, by just doing a VACUUM FULL on
pg_attribute alone. There is no index corruption prior to that point.
After that point, there is -- both pg_attribute_relid_attnam_index and
pg_attribute_relid_attnum_index seem to become corrupt. All other
symptoms probably stem from this initial corruption, so I'm focusing
on it.

What I see if I look at the corrupt pg_attribute_relid_attnum_index
structure is that the index does actually have an entry for a heap
tuple that amcheck complains about lacking an entry for -- at least,
it has a key match. The problem that amcheck noticed was that the heap
item pointer was not as it should be (i.e. the index tuple points to
the wrong heap tuple). I also noticed that nearby index tuples had
duplicate entries, the first pointing to approximately the same place
in the heap that the corrupt-to-amcheck tuple points to, and the
second pointing to approximately the same place in the heap that
amcheck expected to find it at (amcheck was complaining about an
adjacent entry, so it's only approximately the same place in the
heap).

I suspect that the problem is that parallel workers have a different
idea about which relfilenode they need to scan, or something along
those lines. Maybe cluster_rel() needs to be taught about parallel
CREATE INDEX. I must have missed some detail within cluster.c prior to
parallel CREATE INDEX going in.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

06 August 2018, 23:37:22

On Mon, Aug 6, 2018 at 1:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I suspect that the problem is that parallel workers have a different
> idea about which relfilenode they need to scan, or something along
> those lines. Maybe cluster_rel() needs to be taught about parallel
> CREATE INDEX. I must have missed some detail within cluster.c prior to
> parallel CREATE INDEX going in.

To be clear, I mean that the leader process's worker state has the
right relfilenode (the leader process always participates as a
worker), but all worker processes have the stale relfilenode.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

07 August 2018, 00:21:33

On Mon, Aug 6, 2018 at 1:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> To be clear, I mean that the leader process's worker state has the
> right relfilenode (the leader process always participates as a
> worker), but all worker processes have the stale relfilenode.

Sure enough, that's what the bug is - a few debugging calls to
RelationMapFilenodeToOid() within nbtsort.c proves it. Several
approaches to fixing the bug occur to me:

* Ban parallel CREATE INDEX for all catalogs. This was how things were
up until several weeks before the original patch was committed.

* Ban parallel CREATE INDEX for mapped catalogs only.

* Find a way to propagate the state necessary to have parallel workers
agree with the leader on the correct relfilenode.

We could probably propagate backend-local state like
active_local_updates without too much difficulty, which looks like it
would fix the problem. Note that we did something very similar with
reindex-pending-indexes lists in commit 29d58fd3. That commit
similarly involved propagating more backend-local state so that
parallel index builds (or at least REINDEX) on catalogs could be
enabled/work reliably. Maybe we should continue down the road of
making parallel builds work on catalogs, on general principle.

Thoughts?

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0

From

Tom Lane

Date:

07 August 2018, 00:29:49

Peter Geoghegan <pg@bowt.ie> writes:
> Sure enough, that's what the bug is - a few debugging calls to
> RelationMapFilenodeToOid() within nbtsort.c proves it. Several
> approaches to fixing the bug occur to me:

> * Ban parallel CREATE INDEX for all catalogs. This was how things were
> up until several weeks before the original patch was committed.

> * Ban parallel CREATE INDEX for mapped catalogs only.

> * Find a way to propagate the state necessary to have parallel workers
> agree with the leader on the correct relfilenode.

> We could probably propagate backend-local state like
> active_local_updates without too much difficulty, which looks like it
> would fix the problem. Note that we did something very similar with
> reindex-pending-indexes lists in commit 29d58fd3. That commit
> similarly involved propagating more backend-local state so that
> parallel index builds (or at least REINDEX) on catalogs could be
> enabled/work reliably. Maybe we should continue down the road of
> making parallel builds work on catalogs, on general principle.

Hm.  Post-beta3, I think I'd vote for a conservative fix in v11,
which seems to be "ban for mapped catalogs".  Feel free to make
it work in HEAD, though.

            regards, tom lane

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

07 August 2018, 01:06:56

On Mon, Aug 6, 2018 at 2:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm.  Post-beta3, I think I'd vote for a conservative fix in v11,
> which seems to be "ban for mapped catalogs".  Feel free to make
> it work in HEAD, though.

Makes sense. I'm not sure if it's worth pursuing parallel mapped
catalog index builds much further, though. I doubt that ordinary users
care about whether or not this is supported, so this is a matter of
principle. I don't feel strongly on whether or not I should make
mapped builds work on HEAD, so I defer to you, and anyone else that
might have an interest. Does it matter, do you think?

It might be worth teaching heap_beginscan_parallel() to
cross-check each worker's heap relation's rd_smgr.smgr_node to a version
of the same field from the leader, stored in shared memory (in the
ParallelHeapScanDesc). That way, any future recurrence of a similar
bug will be far easier to detect. A "can't happen" error along these
lines seems like it would be worthwhile.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Alvaro Herrera

Date:

07 August 2018, 01:10:23

On 2018-Aug-06, Peter Geoghegan wrote:

> On Mon, Aug 6, 2018 at 1:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> > To be clear, I mean that the leader process's worker state has the
> > right relfilenode (the leader process always participates as a
> > worker), but all worker processes have the stale relfilenode.
> 
> Sure enough, that's what the bug is - a few debugging calls to
> RelationMapFilenodeToOid() within nbtsort.c proves it.

Uh, that's weird, isn't it?  I mean, why is the relfilenode changing
underneath?  Why isn't it blocked by the CREATE INDEX?  Or is CREATE
INDEX inflicting that upon itself?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0

From

Tom Lane

Date:

07 August 2018, 01:14:39

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2018-Aug-06, Peter Geoghegan wrote:
>> Sure enough, that's what the bug is - a few debugging calls to
>> RelationMapFilenodeToOid() within nbtsort.c proves it.

> Uh, that's weird, isn't it?  I mean, why is the relfilenode changing
> underneath?

Because we're building a brand new index in a new file.

> Why isn't it blocked by the CREATE INDEX?  Or is CREATE
> INDEX inflicting that upon itself?

The problem is (or so I assume) failure to propagate process-local
state of relmapper.c into the worker processes.  So the leader knows
that we've assigned a new relfilenode to some mapped index, but the
workers don't.

            regards, tom lane

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

07 August 2018, 01:16:49

On Mon, Aug 6, 2018 at 3:10 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Uh, that's weird, isn't it?  I mean, why is the relfilenode changing
> underneath?  Why isn't it blocked by the CREATE INDEX?  Or is CREATE
> INDEX inflicting that upon itself?

CREATE INDEX/nbtsort.c does not take any special interest in
relfilenode, or anything like that.

As Tom said, this is a failure to propagate the process-local state of
relmapper.c into the worker processes. Or a failure to account for the
fact that that doesn't happen, perhaps.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0

From

Tom Lane

Date:

07 August 2018, 01:18:05

Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Aug 6, 2018 at 2:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm.  Post-beta3, I think I'd vote for a conservative fix in v11,
>> which seems to be "ban for mapped catalogs".  Feel free to make
>> it work in HEAD, though.

> Makes sense. I'm not sure if it's worth pursuing parallel mapped
> catalog index builds much further, though. I doubt that ordinary users
> care about whether or not this is supported, so this is a matter of
> principle. I don't feel strongly on whether or not I should make
> mapped builds work on HEAD, so I defer to you, and anyone else that
> might have an interest. Does it matter, do you think?

Apparently there are people out there with catalogs big enough
to justify parallel reindex.  I don't mind if the first version of
the feature doesn't handle that, but probably we should make it work
eventually.

> It might be worth teaching heap_beginscan_parallel() to
> cross-check each worker's heap relation's rd_smgr.smgr_node to a version
> of the same field from the leader, stored in shared memory (in the
> ParallelHeapScanDesc). That way, any future recurrence of a similar
> bug will be far easier to detect. A "can't happen" error along these
> lines seems like it would be worthwhile.

Well, maybe, but why is that field particularly vulnerable?  I'd think you
should crosscheck the index's relfilenode too, at least, if you're going
to worry about that.

            regards, tom lane

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

07 August 2018, 01:21:49

On Mon, Aug 6, 2018 at 3:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Apparently there are people out there with catalogs big enough
> to justify parallel reindex.  I don't mind if the first version of
> the feature doesn't handle that, but probably we should make it work
> eventually.

Okay. I'll pursue this in HEAD.

> Well, maybe, but why is that field particularly vulnerable?  I'd think you
> should crosscheck the index's relfilenode too, at least, if you're going
> to worry about that.

I agree.

I'd like to hear more opinions on the general idea. I'm particularly
interested in what Robert thinks.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

07 August 2018, 05:32:42

On Mon, Aug 6, 2018 at 3:21 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Aug 6, 2018 at 3:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Apparently there are people out there with catalogs big enough
>> to justify parallel reindex.  I don't mind if the first version of
>> the feature doesn't handle that, but probably we should make it work
>> eventually.
>
> Okay. I'll pursue this in HEAD.

I wrote this in a little over an hour, and it works fine. I'll post it
shortly, once I polish it up some more. I'll also post an alternative
"no mapped relation parallel builds" fix for v11, which will be
trivial.

>> Well, maybe, but why is that field particularly vulnerable?  I'd think you
>> should crosscheck the index's relfilenode too, at least, if you're going
>> to worry about that.
>
> I agree.
>
> I'd like to hear more opinions on the general idea. I'm particularly
> interested in what Robert thinks.

On second thought, the crosscheck of the index's relfilenode seems
like independent work. I won't be working on it as part of the fix for
this bug.

The question of index relfilenode becoming stale or inconsistent
doesn't come up with parallel CREATE INDEX, since currently the
workers never write to the new index relfilenode. If we were going to
do a similar crosscheck for parallel index scan, the new code would be
need to be spread across places like btbeginscan(), and
RelationGetIndexScan(). It's not something that can just be done in
passing.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

08 August 2018, 04:06:59

On Mon, Aug 6, 2018 at 7:32 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I wrote this in a little over an hour, and it works fine. I'll post it
> shortly, once I polish it up some more. I'll also post an alternative
> "no mapped relation parallel builds" fix for v11, which will be
> trivial.

Attached are two patches that each fix the issue -- a conservative
patch for v11, as well as a patch that actually propagates relmapper.c
state, for the master branch. It would be good to get a +1 on both
before pushing.

I'll pursue the parallel heap scan relfilenode cross-check thing
separately. Would still like to hear something from Robert on that
before proceeding.

-- 
Peter Geoghegan

Attachment

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

09 August 2018, 19:33:29

On Tue, Aug 7, 2018 at 6:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Aug 6, 2018 at 7:32 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I wrote this in a little over an hour, and it works fine. I'll post it
>> shortly, once I polish it up some more. I'll also post an alternative
>> "no mapped relation parallel builds" fix for v11, which will be
>> trivial.
>
> Attached are two patches that each fix the issue -- a conservative
> patch for v11, as well as a patch that actually propagates relmapper.c
> state, for the master branch. It would be good to get a +1 on both
> before pushing.

I'm planning to push these two in about 24 - 30 hours, since I have
buy-in from Tom on the high level approach. If anyone wants to review
the patches in detail, now would be a good time. There isn't much to
them.


-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0

From

Tom Lane

Date:

09 August 2018, 20:34:21

Peter Geoghegan <pg@bowt.ie> writes:
> Attached are two patches that each fix the issue -- a conservative
> patch for v11, as well as a patch that actually propagates relmapper.c
> state, for the master branch. It would be good to get a +1 on both
> before pushing.

I took a quick look at these.  The v11 patch seems OK as far as it goes,
but I wonder if you shouldn't also include the RelationMapUpdateMap
hunk from the master patch, ie adding rejection of map changes in parallel
mode.

I don't have any objection to the master patch, but it'd be good to get
a +1 from someone who's spent more time with the parallelism
infrastructure than I have.

            regards, tom lane

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

09 August 2018, 20:58:55

On Thu, Aug 9, 2018 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I took a quick look at these.  The v11 patch seems OK as far as it goes,
> but I wonder if you shouldn't also include the RelationMapUpdateMap
> hunk from the master patch, ie adding rejection of map changes in parallel
> mode.

Good idea. I'll do that.

> I don't have any objection to the master patch, but it'd be good to get
> a +1 from someone who's spent more time with the parallelism
> infrastructure than I have.

That would be my preference too, but commit 29d58fd3 is very analogous
to the proposed master branch fix, so I feel that it's reasonable to
proceed without final approval from someone like Robert or Amit. I'll
leave pushing until late in the day tomorrow, pacific time.

-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Amit Kapila

Date:

10 August 2018, 16:45:48

On Thu, Aug 9, 2018 at 11:28 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Aug 9, 2018 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I took a quick look at these.  The v11 patch seems OK as far as it goes,
>> but I wonder if you shouldn't also include the RelationMapUpdateMap
>> hunk from the master patch, ie adding rejection of map changes in parallel
>> mode.
>
> Good idea. I'll do that.
>
>> I don't have any objection to the master patch, but it'd be good to get
>> a +1 from someone who's spent more time with the parallelism
>> infrastructure than I have.
>
> That would be my preference too, but commit 29d58fd3 is very analogous
> to the proposed master branch fix, so I feel that it's reasonable to
> proceed without final approval from someone like Robert or Amit.
>

I haven't studied the complete problem, but the way you are
propagating the information to parallel workers looks correct to me.
Few minor comments:

1.
+void
+RestoreRelationMap(char *startAddress)
+{
+ SerializedActiveRelMaps   *relmaps;
+
+ if (active_shared_updates.num_mappings != 0 ||
+ active_local_updates.num_mappings != 0 ||
+ pending_shared_updates.num_mappings != 0 ||
+ pending_local_updates.num_mappings != 0)
+ elog(ERROR, "parallel worker has existing mappings");
..

Shouldn't above be Assert?

2.
+void
+SerializeRelationMap(Size maxSize, char *startAddress)
+{
+ SerializedActiveRelMaps   *relmaps;
+
+ relmaps = (SerializedActiveRelMaps *) startAddress;
+ relmaps->active_shared_updates = active_shared_updates;
+ relmaps->active_local_updates = active_local_updates;
..
}

Some of the other serialize functions use maxSize for Asserts.  See
SerializeComboCIDState.  I think we can do without that as well, but
it makes code consistent.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

10 August 2018, 19:36:53

On Fri, Aug 10, 2018 at 6:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I haven't studied the complete problem, but the way you are
> propagating the information to parallel workers looks correct to me.
> Few minor comments:
>
> 1.
> +void
> +RestoreRelationMap(char *startAddress)
> +{
> + SerializedActiveRelMaps   *relmaps;
> +
> + if (active_shared_updates.num_mappings != 0 ||
> + active_local_updates.num_mappings != 0 ||
> + pending_shared_updates.num_mappings != 0 ||
> + pending_local_updates.num_mappings != 0)
> + elog(ERROR, "parallel worker has existing mappings");
> ..
>
> Shouldn't above be Assert?

This was based on AtPrepare_RelationMap().

> 2.
> +void
> +SerializeRelationMap(Size maxSize, char *startAddress)
> +{
> + SerializedActiveRelMaps   *relmaps;
> +
> + relmaps = (SerializedActiveRelMaps *) startAddress;
> + relmaps->active_shared_updates = active_shared_updates;
> + relmaps->active_local_updates = active_local_updates;
> ..
> }
>
> Some of the other serialize functions use maxSize for Asserts.  See
> SerializeComboCIDState.  I think we can do without that as well, but
> it makes code consistent.

I'll put an assert in there.

Thanks
-- 
Peter Geoghegan

Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0

From

Peter Geoghegan

Date:

10 August 2018, 23:15:47

On Fri, Aug 3, 2018 at 4:45 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
> 11beta2 and git REL_11_STABLE, gcc 8.2, pg.conf : autovacuum=off
> DB restored from plain-format backup, first 'vacuum full' ends with ERR :
> "catalog is missing 1 attribute(s) for relid ..." - seem to point to only
> PrimaryKey, sometimes the same PK

A fix for this bug has been pushed.

Thanks for the report!
-- 
Peter Geoghegan