Thread: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676when max_parallel_maintenance_workers > 0
BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676when max_parallel_maintenance_workers > 0
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 15309 Logged by: death lock Email address: deathlock13@gmail.com PostgreSQL version: 11beta2 Operating system: Debian10 64b Description: 11beta2 and git REL_11_STABLE, gcc 8.2, pg.conf : autovacuum=off DB restored from plain-format backup, first 'vacuum full' ends with ERR : "catalog is missing 1 attribute(s) for relid ..." - seem to point to only PrimaryKey, sometimes the same PK
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0
From
Tom Lane
Date:
=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes: > 11beta2 and git REL_11_STABLE, gcc 8.2, pg.conf : autovacuum=off > DB restored from plain-format backup, first 'vacuum full' ends with ERR : > "catalog is missing 1 attribute(s) for relid ..." - seem to point to only > PrimaryKey, sometimes the same PK If you could provide a self-contained example, this would be very interesting, but there's nothing we can do with just the information you've provided here. Perhaps you can strip and/or anonymize your backup dump file down to a postable example that reproduces this problem? regards, tom lane
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Fri, Aug 3, 2018 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Perhaps you can strip and/or anonymize your backup dump file down > to a postable example that reproduces this problem? It would also be interesting to see the result of this query, which relies on the v11 amcheck extension (so "CREATE EXTENSION amcheck" first): SELECT bt_index_parent_check(index => c.oid, heapallindexed => true), c.relname, c.relpages FROM pg_index i JOIN pg_opclass op ON i.indclass[0] = op.oid JOIN pg_am am ON op.opcmethod = am.oid JOIN pg_class c ON i.indexrelid = c.oid JOIN pg_namespace n ON c.relnamespace = n.oid WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog' -- Don't check temp tables, which may be from another session: AND c.relpersistence != 't' -- Function may throw an error when this is omitted: AND c.relkind = 'i' AND i.indisready AND i.indisvalid ORDER BY c.relpages DESC; If that doesn't raise an error, you could try the same query, but remove "AND n.nspname = 'pg_catalog'". That will take considerably longer, but probably won't be intolerable. Thanks -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0
From
Andres Freund
Date:
On August 3, 2018 9:51:01 AM PDT, Peter Geoghegan <pg@bowt.ie> wrote: >On Fri, Aug 3, 2018 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Perhaps you can strip and/or anonymize your backup dump file down >> to a postable example that reproduces this problem? > >It would also be interesting to see the result of this query, which >relies on the v11 amcheck extension (so "CREATE EXTENSION amcheck" >first): Maybe I'm missing something, but doesn't that error likely point to a bug in the new parallel index creation code, ratherthan the created indexes? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Fri, Aug 3, 2018 at 10:11 AM, Andres Freund <andres@anarazel.de> wrote: > Maybe I'm missing something, but doesn't that error likely point to a bug in the new parallel index creation code, ratherthan the created indexes? I've certainly seen error messages like that with cases of catalog corruption. I recall repairing pg_attribute by hand when this happened, and then having to reindex the system catalogs. I don't really know what's wrong here, but I wouldn't be surprised if amcheck detected a problem. Let's see. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes: > On Fri, Aug 3, 2018 at 10:11 AM, Andres Freund <andres@anarazel.de> wrote: >> Maybe I'm missing something, but doesn't that error likely point to a bug in the new parallel index creation code, ratherthan the created indexes? > I've certainly seen error messages like that with cases of > catalog corruption. I recall repairing pg_attribute by hand when this > happened, and then having to reindex the system catalogs. > I don't really know what's wrong here, but I wouldn't be surprised if > amcheck detected a problem. Let's see. I think the OP's time would be more usefully spent on creating a submittable test case. We will want that whether or not amcheck complains, because the mere fact of the complaint is not likely to be enough to find the problem. (I do agree that this smells like a problem in parallel index build.) regards, tom lane
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Fri, Aug 3, 2018 at 10:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think the OP's time would be more usefully spent on creating a > submittable test case. We will want that whether or not amcheck > complains, because the mere fact of the complaint is not likely to be > enough to find the problem. (I do agree that this smells like a problem > in parallel index build.) A test case would be ideal, but it's unlikely to take very long to run amcheck. If verification is limited to the system catalogs, then it will probably take just a few seconds. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Fri, Aug 3, 2018 at 6:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > =?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes: >> 11beta2 and git REL_11_STABLE, gcc 8.2, pg.conf : autovacuum=off >> DB restored from plain-format backup, first 'vacuum full' ends with ERR : >> "catalog is missing 1 attribute(s) for relid ..." - seem to point to only >> PrimaryKey, sometimes the same PK > > If you could provide a self-contained example, this would be very > interesting, but there's nothing we can do with just the information > you've provided here. I got an off-list testcase from the OP, who writes: """ I've attached my 'SHOW ALL' and a backup of example db from PG wiki, just kept renaming schema/tables until got a large enough number of indexes - seem to be key to reproducing error(s). 2018-08-06 11:31:44 CEST|TEST|postgres|[local]| ERROR: duplicate key value violates unique constraint "pg_class_relname_nsp_index" 2018-08-06 11:31:44 CEST|TEST|postgres|[local]| DETAIL: Key (relname, relnamespace)=(pg_class_tblspc_relfilenode_index, 11) already exists. 2018-08-06 11:31:44 CEST|TEST|postgres|[local]| STATEMENT: vacuum full analyze ; or 2018-08-06 11:20:13 CEST|PUMA|postgres|[local]| DETAIL: Key (relname, relnamespace)=(idx_last_name, 16498) already exists. Got this after runing your query with amcheck before 'VACUUM FULL' 2018-08-06 11:51:10 CEST|||| PANIC: could not open critical system index 2676 2018-08-06 11:51:10 CEST|TEST|postgres|[local]| ERROR: could not open critical system index 2676 2018-08-06 11:51:10 CEST|TEST|postgres|[local]| CONTEXT: parallel worker 2018-08-06 11:51:10 CEST|TEST|postgres|[local]| STATEMENT: vacuum full; 2018-08-06 11:51:10 CEST|||| LOG: background worker "parallel worker" (PID 16618) was terminated by signal 6: Aborted 2018-08-06 11:51:10 CEST|||| LOG: terminating any other active server processes 2018-08-06 12:11:24 CEST|||| ERROR: could not open relation with OID 2696 2018-08-06 12:11:24 CEST|TEST|postgres|[local]| ERROR: could not open relation with OID 2696 2018-08-06 12:11:24 CEST|TEST|postgres|[local]| CONTEXT: parallel worker 2018-08-06 12:11:24 CEST|TEST|postgres|[local]| STATEMENT: vacuum full; 2018-08-06 12:11:24 CEST|||| LOG: background worker "parallel worker" (PID 16930) exited with exit code 1 """ I can reproduce this with the backup provided. This seems to boil down to the following: 1. Create a database with tens of thousands of relations, each of which contain no data. 2. Run an unqualified VACUUM FULL. This is what I see after the restore: pg@foodb[22648]=# vacuum FULL ; ERROR: duplicate key value violates unique constraint "pg_class_relname_nsp_index" DETAIL: Key (relname, relnamespace)=(customer_pkey, 16445) already exists. pg@foodb[22648]=# :amcheck ERROR: heap tuple (358,1) from table "pg_attribute" lacks matching index tuple within index "pg_attribute_relid_attnam_index" I'll work to isolate and diagnose the problem today. It likely has something to do with corrupting the state needed by a catalog parallel index build in the context of the VACUUM FULL. pg_attribute grows to several tens of megabytes here, which is enough to get a parallel index build. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Mon, Aug 6, 2018 at 10:43 AM, Peter Geoghegan <pg@bowt.ie> wrote: > I'll work to isolate and diagnose the problem today. It likely has > something to do with corrupting the state needed by a catalog parallel > index build in the context of the VACUUM FULL. pg_attribute grows to > several tens of megabytes here, which is enough to get a parallel > index build. This repro can be further simplified, by just doing a VACUUM FULL on pg_attribute alone. There is no index corruption prior to that point. After that point, there is -- both pg_attribute_relid_attnam_index and pg_attribute_relid_attnum_index seem to become corrupt. All other symptoms probably stem from this initial corruption, so I'm focusing on it. What I see if I look at the corrupt pg_attribute_relid_attnum_index structure is that the index does actually have an entry for a heap tuple that amcheck complains about lacking an entry for -- at least, it has a key match. The problem that amcheck noticed was that the heap item pointer was not as it should be (i.e. the index tuple points to the wrong heap tuple). I also noticed that nearby index tuples had duplicate entries, the first pointing to approximately the same place in the heap that the corrupt-to-amcheck tuple points to, and the second pointing to approximately the same place in the heap that amcheck expected to find it at (amcheck was complaining about an adjacent entry, so it's only approximately the same place in the heap). I suspect that the problem is that parallel workers have a different idea about which relfilenode they need to scan, or something along those lines. Maybe cluster_rel() needs to be taught about parallel CREATE INDEX. I must have missed some detail within cluster.c prior to parallel CREATE INDEX going in. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Mon, Aug 6, 2018 at 1:31 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I suspect that the problem is that parallel workers have a different > idea about which relfilenode they need to scan, or something along > those lines. Maybe cluster_rel() needs to be taught about parallel > CREATE INDEX. I must have missed some detail within cluster.c prior to > parallel CREATE INDEX going in. To be clear, I mean that the leader process's worker state has the right relfilenode (the leader process always participates as a worker), but all worker processes have the stale relfilenode. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Mon, Aug 6, 2018 at 1:37 PM, Peter Geoghegan <pg@bowt.ie> wrote: > To be clear, I mean that the leader process's worker state has the > right relfilenode (the leader process always participates as a > worker), but all worker processes have the stale relfilenode. Sure enough, that's what the bug is - a few debugging calls to RelationMapFilenodeToOid() within nbtsort.c proves it. Several approaches to fixing the bug occur to me: * Ban parallel CREATE INDEX for all catalogs. This was how things were up until several weeks before the original patch was committed. * Ban parallel CREATE INDEX for mapped catalogs only. * Find a way to propagate the state necessary to have parallel workers agree with the leader on the correct relfilenode. We could probably propagate backend-local state like active_local_updates without too much difficulty, which looks like it would fix the problem. Note that we did something very similar with reindex-pending-indexes lists in commit 29d58fd3. That commit similarly involved propagating more backend-local state so that parallel index builds (or at least REINDEX) on catalogs could be enabled/work reliably. Maybe we should continue down the road of making parallel builds work on catalogs, on general principle. Thoughts? -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes: > Sure enough, that's what the bug is - a few debugging calls to > RelationMapFilenodeToOid() within nbtsort.c proves it. Several > approaches to fixing the bug occur to me: > * Ban parallel CREATE INDEX for all catalogs. This was how things were > up until several weeks before the original patch was committed. > * Ban parallel CREATE INDEX for mapped catalogs only. > * Find a way to propagate the state necessary to have parallel workers > agree with the leader on the correct relfilenode. > We could probably propagate backend-local state like > active_local_updates without too much difficulty, which looks like it > would fix the problem. Note that we did something very similar with > reindex-pending-indexes lists in commit 29d58fd3. That commit > similarly involved propagating more backend-local state so that > parallel index builds (or at least REINDEX) on catalogs could be > enabled/work reliably. Maybe we should continue down the road of > making parallel builds work on catalogs, on general principle. Hm. Post-beta3, I think I'd vote for a conservative fix in v11, which seems to be "ban for mapped catalogs". Feel free to make it work in HEAD, though. regards, tom lane
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Mon, Aug 6, 2018 at 2:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hm. Post-beta3, I think I'd vote for a conservative fix in v11, > which seems to be "ban for mapped catalogs". Feel free to make > it work in HEAD, though. Makes sense. I'm not sure if it's worth pursuing parallel mapped catalog index builds much further, though. I doubt that ordinary users care about whether or not this is supported, so this is a matter of principle. I don't feel strongly on whether or not I should make mapped builds work on HEAD, so I defer to you, and anyone else that might have an interest. Does it matter, do you think? It might be worth teaching heap_beginscan_parallel() to cross-check each worker's heap relation's rd_smgr.smgr_node to a version of the same field from the leader, stored in shared memory (in the ParallelHeapScanDesc). That way, any future recurrence of a similar bug will be far easier to detect. A "can't happen" error along these lines seems like it would be worthwhile. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Alvaro Herrera
Date:
On 2018-Aug-06, Peter Geoghegan wrote: > On Mon, Aug 6, 2018 at 1:37 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > To be clear, I mean that the leader process's worker state has the > > right relfilenode (the leader process always participates as a > > worker), but all worker processes have the stale relfilenode. > > Sure enough, that's what the bug is - a few debugging calls to > RelationMapFilenodeToOid() within nbtsort.c proves it. Uh, that's weird, isn't it? I mean, why is the relfilenode changing underneath? Why isn't it blocked by the CREATE INDEX? Or is CREATE INDEX inflicting that upon itself? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0
From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2018-Aug-06, Peter Geoghegan wrote: >> Sure enough, that's what the bug is - a few debugging calls to >> RelationMapFilenodeToOid() within nbtsort.c proves it. > Uh, that's weird, isn't it? I mean, why is the relfilenode changing > underneath? Because we're building a brand new index in a new file. > Why isn't it blocked by the CREATE INDEX? Or is CREATE > INDEX inflicting that upon itself? The problem is (or so I assume) failure to propagate process-local state of relmapper.c into the worker processes. So the leader knows that we've assigned a new relfilenode to some mapped index, but the workers don't. regards, tom lane
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Mon, Aug 6, 2018 at 3:10 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Uh, that's weird, isn't it? I mean, why is the relfilenode changing > underneath? Why isn't it blocked by the CREATE INDEX? Or is CREATE > INDEX inflicting that upon itself? CREATE INDEX/nbtsort.c does not take any special interest in relfilenode, or anything like that. As Tom said, this is a failure to propagate the process-local state of relmapper.c into the worker processes. Or a failure to account for the fact that that doesn't happen, perhaps. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Aug 6, 2018 at 2:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hm. Post-beta3, I think I'd vote for a conservative fix in v11, >> which seems to be "ban for mapped catalogs". Feel free to make >> it work in HEAD, though. > Makes sense. I'm not sure if it's worth pursuing parallel mapped > catalog index builds much further, though. I doubt that ordinary users > care about whether or not this is supported, so this is a matter of > principle. I don't feel strongly on whether or not I should make > mapped builds work on HEAD, so I defer to you, and anyone else that > might have an interest. Does it matter, do you think? Apparently there are people out there with catalogs big enough to justify parallel reindex. I don't mind if the first version of the feature doesn't handle that, but probably we should make it work eventually. > It might be worth teaching heap_beginscan_parallel() to > cross-check each worker's heap relation's rd_smgr.smgr_node to a version > of the same field from the leader, stored in shared memory (in the > ParallelHeapScanDesc). That way, any future recurrence of a similar > bug will be far easier to detect. A "can't happen" error along these > lines seems like it would be worthwhile. Well, maybe, but why is that field particularly vulnerable? I'd think you should crosscheck the index's relfilenode too, at least, if you're going to worry about that. regards, tom lane
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Mon, Aug 6, 2018 at 3:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Apparently there are people out there with catalogs big enough > to justify parallel reindex. I don't mind if the first version of > the feature doesn't handle that, but probably we should make it work > eventually. Okay. I'll pursue this in HEAD. > Well, maybe, but why is that field particularly vulnerable? I'd think you > should crosscheck the index's relfilenode too, at least, if you're going > to worry about that. I agree. I'd like to hear more opinions on the general idea. I'm particularly interested in what Robert thinks. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Mon, Aug 6, 2018 at 3:21 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Aug 6, 2018 at 3:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Apparently there are people out there with catalogs big enough >> to justify parallel reindex. I don't mind if the first version of >> the feature doesn't handle that, but probably we should make it work >> eventually. > > Okay. I'll pursue this in HEAD. I wrote this in a little over an hour, and it works fine. I'll post it shortly, once I polish it up some more. I'll also post an alternative "no mapped relation parallel builds" fix for v11, which will be trivial. >> Well, maybe, but why is that field particularly vulnerable? I'd think you >> should crosscheck the index's relfilenode too, at least, if you're going >> to worry about that. > > I agree. > > I'd like to hear more opinions on the general idea. I'm particularly > interested in what Robert thinks. On second thought, the crosscheck of the index's relfilenode seems like independent work. I won't be working on it as part of the fix for this bug. The question of index relfilenode becoming stale or inconsistent doesn't come up with parallel CREATE INDEX, since currently the workers never write to the new index relfilenode. If we were going to do a similar crosscheck for parallel index scan, the new code would be need to be spread across places like btbeginscan(), and RelationGetIndexScan(). It's not something that can just be done in passing. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Mon, Aug 6, 2018 at 7:32 PM, Peter Geoghegan <pg@bowt.ie> wrote: > I wrote this in a little over an hour, and it works fine. I'll post it > shortly, once I polish it up some more. I'll also post an alternative > "no mapped relation parallel builds" fix for v11, which will be > trivial. Attached are two patches that each fix the issue -- a conservative patch for v11, as well as a patch that actually propagates relmapper.c state, for the master branch. It would be good to get a +1 on both before pushing. I'll pursue the parallel heap scan relfilenode cross-check thing separately. Would still like to hear something from Robert on that before proceeding. -- Peter Geoghegan
Attachment
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Tue, Aug 7, 2018 at 6:06 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Aug 6, 2018 at 7:32 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> I wrote this in a little over an hour, and it works fine. I'll post it >> shortly, once I polish it up some more. I'll also post an alternative >> "no mapped relation parallel builds" fix for v11, which will be >> trivial. > > Attached are two patches that each fix the issue -- a conservative > patch for v11, as well as a patch that actually propagates relmapper.c > state, for the master branch. It would be good to get a +1 on both > before pushing. I'm planning to push these two in about 24 - 30 hours, since I have buy-in from Tom on the high level approach. If anyone wants to review the patches in detail, now would be a good time. There isn't much to them. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid 760676 when max_parallel_maintenance_workers > 0
From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes: > Attached are two patches that each fix the issue -- a conservative > patch for v11, as well as a patch that actually propagates relmapper.c > state, for the master branch. It would be good to get a +1 on both > before pushing. I took a quick look at these. The v11 patch seems OK as far as it goes, but I wonder if you shouldn't also include the RelationMapUpdateMap hunk from the master patch, ie adding rejection of map changes in parallel mode. I don't have any objection to the master patch, but it'd be good to get a +1 from someone who's spent more time with the parallelism infrastructure than I have. regards, tom lane
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Thu, Aug 9, 2018 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I took a quick look at these. The v11 patch seems OK as far as it goes, > but I wonder if you shouldn't also include the RelationMapUpdateMap > hunk from the master patch, ie adding rejection of map changes in parallel > mode. Good idea. I'll do that. > I don't have any objection to the master patch, but it'd be good to get > a +1 from someone who's spent more time with the parallelism > infrastructure than I have. That would be my preference too, but commit 29d58fd3 is very analogous to the proposed master branch fix, so I feel that it's reasonable to proceed without final approval from someone like Robert or Amit. I'll leave pushing until late in the day tomorrow, pacific time. -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Amit Kapila
Date:
On Thu, Aug 9, 2018 at 11:28 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Aug 9, 2018 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I took a quick look at these. The v11 patch seems OK as far as it goes, >> but I wonder if you shouldn't also include the RelationMapUpdateMap >> hunk from the master patch, ie adding rejection of map changes in parallel >> mode. > > Good idea. I'll do that. > >> I don't have any objection to the master patch, but it'd be good to get >> a +1 from someone who's spent more time with the parallelism >> infrastructure than I have. > > That would be my preference too, but commit 29d58fd3 is very analogous > to the proposed master branch fix, so I feel that it's reasonable to > proceed without final approval from someone like Robert or Amit. > I haven't studied the complete problem, but the way you are propagating the information to parallel workers looks correct to me. Few minor comments: 1. +void +RestoreRelationMap(char *startAddress) +{ + SerializedActiveRelMaps *relmaps; + + if (active_shared_updates.num_mappings != 0 || + active_local_updates.num_mappings != 0 || + pending_shared_updates.num_mappings != 0 || + pending_local_updates.num_mappings != 0) + elog(ERROR, "parallel worker has existing mappings"); .. Shouldn't above be Assert? 2. +void +SerializeRelationMap(Size maxSize, char *startAddress) +{ + SerializedActiveRelMaps *relmaps; + + relmaps = (SerializedActiveRelMaps *) startAddress; + relmaps->active_shared_updates = active_shared_updates; + relmaps->active_local_updates = active_local_updates; .. } Some of the other serialize functions use maxSize for Asserts. See SerializeComboCIDState. I think we can do without that as well, but it makes code consistent. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Fri, Aug 10, 2018 at 6:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I haven't studied the complete problem, but the way you are > propagating the information to parallel workers looks correct to me. > Few minor comments: > > 1. > +void > +RestoreRelationMap(char *startAddress) > +{ > + SerializedActiveRelMaps *relmaps; > + > + if (active_shared_updates.num_mappings != 0 || > + active_local_updates.num_mappings != 0 || > + pending_shared_updates.num_mappings != 0 || > + pending_local_updates.num_mappings != 0) > + elog(ERROR, "parallel worker has existing mappings"); > .. > > Shouldn't above be Assert? This was based on AtPrepare_RelationMap(). > 2. > +void > +SerializeRelationMap(Size maxSize, char *startAddress) > +{ > + SerializedActiveRelMaps *relmaps; > + > + relmaps = (SerializedActiveRelMaps *) startAddress; > + relmaps->active_shared_updates = active_shared_updates; > + relmaps->active_local_updates = active_local_updates; > .. > } > > Some of the other serialize functions use maxSize for Asserts. See > SerializeComboCIDState. I think we can do without that as well, but > it makes code consistent. I'll put an assert in there. Thanks -- Peter Geoghegan
Re: BUG #15309: ERROR: catalog is missing 1 attribute(s) for relid760676 when max_parallel_maintenance_workers > 0
From
Peter Geoghegan
Date:
On Fri, Aug 3, 2018 at 4:45 AM, PG Bug reporting form <noreply@postgresql.org> wrote: > 11beta2 and git REL_11_STABLE, gcc 8.2, pg.conf : autovacuum=off > DB restored from plain-format backup, first 'vacuum full' ends with ERR : > "catalog is missing 1 attribute(s) for relid ..." - seem to point to only > PrimaryKey, sometimes the same PK A fix for this bug has been pushed. Thanks for the report! -- Peter Geoghegan