Thread: [HACKERS] Transactional sequence stuff breaks pg_upgrade
I believe I've identified the reason why skink and some other buildfarm members have been failing the pg_upgrade test recently. It is that recent changes in sequence support have caused binary-upgrade restore runs to do some sequence OID/relfilenode assignments without any heed to the OIDs that pg_upgrade tried to impose on those sequences. Once those sequences have relfilenodes other than the intended ones, they are land mines for all subsequent pg_upgrade-controlled table OID assignments. I am not very sure why it's so hard to duplicate the misbehavior; perhaps, in order to make the failure happen with the current regression tests, it's necessary for a background auto-analyze to happen and consume some OIDs (for pg_statistic TOAST entries) at just the wrong time. However, I can definitely demonstrate that there are uncontrolled relfilenode assignments happening during pg_upgrade's restore run. I stuck an elog() call into GetNewObjectId(), along with generation of a stack trace using backtrace(), and here is one example: [593daad3.4863:2243] LOG: generated OID 16735 [593daad3.4863:2244] STATEMENT: -- For binary upgrade, must preserve pg_class oidsSELECT pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('46851'::pg_catalog.oid);--For binary upgrade, must preserve pg_typeoidSELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('46852'::pg_catalog.oid);ALTER TABLE "itest10" ALTER COLUMN"a" ADD GENERATED BY DEFAULT AS IDENTITY ( SEQUENCE NAME "itest10_a_seq" START WITH 1 INCREMENT BY 1 NOMINVALUE NO MAXVALUE CACHE 1); postgres: postgres regression [local] ALTER TABLE(GetNewObjectId+0xda) [0x50397a] postgres: postgres regression [local] ALTER TABLE(GetNewRelFileNode+0xec) [0x52430c] postgres: postgres regression [local] ALTER TABLE(RelationSetNewRelfilenode+0x79) [0x851d59] postgres: postgres regression [local] ALTER TABLE(AlterSequence+0x1cd) [0x5d976d] postgres: postgres regression [local] ALTER TABLE() [0x75d279] postgres: postgres regression [local] ALTER TABLE(standard_ProcessUtility+0xb7) [0x75dec7] postgres: postgres regression [local] ALTER TABLE() [0x75cb1d] postgres: postgres regression [local] ALTER TABLE(standard_ProcessUtility+0xb7) [0x75dec7] postgres: postgres regression [local] ALTER TABLE() [0x759f0b] postgres: postgres regression [local] ALTER TABLE() [0x75ae91] postgres: postgres regression [local] ALTER TABLE(PortalRun+0x250) [0x75b740] postgres: postgres regression [local] ALTER TABLE() [0x757be7] postgres: postgres regression [local] ALTER TABLE(PostgresMain+0xe08) [0x759968] postgres: postgres regression [local] ALTER TABLE(PostmasterMain+0x1a99) [0x6e21a9] postgres: postgres regression [local] ALTER TABLE(main+0x6b8) [0x65b958] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3f3bc1ed1d] postgres: postgres regression [local] ALTER TABLE() [0x473899] Judging by when we started to see buildfarm failures, I think that commit 3d79013b9 probably broke it, but the problem seems latent in the whole concept of transactional sequence information. Not sure what we want to do about it. One idea is to make ALTER SEQUENCE not so transactional when in binary-upgrade mode. (I'm also tempted to make GetNewRelFileNode complain if IsBinaryUpgrade is true, but that's a separate matter.) In any case, this is a "must fix" problem IMO, so I'll go add it to the open items list. regards, tom lane
I wrote: > I believe I've identified the reason why skink and some other buildfarm > members have been failing the pg_upgrade test recently. > ... > Not sure what we want to do about it. One idea is to make > ALTER SEQUENCE not so transactional when in binary-upgrade mode. On closer inspection, the only thing that pg_upgrade needs is to be able to do ALTER SEQUENCE OWNED BY without a relfilenode bump. PFA two versions of a patch that fixes this problem, at least to the extent that it gets through check-world without triggering the Assert I added to GetNewRelFileNode (which HEAD doesn't). The first one is a minimally-invasive hack; the second one puts the responsibility for deciding if a sequence rewrite is needed into init_params. That's bulkier but might be useful if we ever grow additional ALTER SEQUENCE options that don't need a rewrite. OTOH, I'm not very clear on what such options might look like, so maybe the extra flexibility is useless. Thoughts? regards, tom lane diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c index 11ee536..43ef4cd 100644 *** a/src/backend/catalog/catalog.c --- b/src/backend/catalog/catalog.c *************** GetNewRelFileNode(Oid reltablespace, Rel *** 391,396 **** --- 391,403 ---- bool collides; BackendId backend; + /* + * If we ever get here during pg_upgrade, there's something wrong; all + * relfilenode assignments during a binary-upgrade run should be + * determined by commands in the dump script. + */ + Assert(!IsBinaryUpgrade); + switch (relpersistence) { case RELPERSISTENCE_TEMP: diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c index 7e85b69..188429c 100644 *** a/src/backend/commands/sequence.c --- b/src/backend/commands/sequence.c *************** AlterSequence(ParseState *pstate, AlterS *** 473,490 **** GetTopTransactionId(); /* ! * Create a new storage file for the sequence, making the state changes ! * transactional. We want to keep the sequence's relfrozenxid at 0, since ! * it won't contain any unfrozen XIDs. Same with relminmxid, since a ! * sequence will never contain multixacts. */ ! RelationSetNewRelfilenode(seqrel, seqrel->rd_rel->relpersistence, ! InvalidTransactionId, InvalidMultiXactId); ! /* ! * Insert the modified tuple into the new storage file. ! */ ! fill_seq_with_data(seqrel, newdatatuple); /* process OWNED BY if given */ if (owned_by) --- 473,499 ---- GetTopTransactionId(); /* ! * If we are *only* doing OWNED BY, there is no need to rewrite the ! * sequence file nor the pg_sequence tuple; and we mustn't do so because ! * it breaks pg_upgrade by causing unwanted changes in the sequence's ! * relfilenode. */ ! if (!(owned_by && list_length(stmt->options) == 1)) ! { ! /* ! * Create a new storage file for the sequence, making the state ! * changes transactional. We want to keep the sequence's relfrozenxid ! * at 0, since it won't contain any unfrozen XIDs. Same with ! * relminmxid, since a sequence will never contain multixacts. ! */ ! RelationSetNewRelfilenode(seqrel, seqrel->rd_rel->relpersistence, ! InvalidTransactionId, InvalidMultiXactId); ! /* ! * Insert the modified tuple into the new storage file. ! */ ! fill_seq_with_data(seqrel, newdatatuple); ! } /* process OWNED BY if given */ if (owned_by) diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c index 11ee536..43ef4cd 100644 *** a/src/backend/catalog/catalog.c --- b/src/backend/catalog/catalog.c *************** GetNewRelFileNode(Oid reltablespace, Rel *** 391,396 **** --- 391,403 ---- bool collides; BackendId backend; + /* + * If we ever get here during pg_upgrade, there's something wrong; all + * relfilenode assignments during a binary-upgrade run should be + * determined by commands in the dump script. + */ + Assert(!IsBinaryUpgrade); + switch (relpersistence) { case RELPERSISTENCE_TEMP: diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c index 7e85b69..34b8aa2 100644 *** a/src/backend/commands/sequence.c --- b/src/backend/commands/sequence.c *************** static Form_pg_sequence_data read_seq_tu *** 101,107 **** static void init_params(ParseState *pstate, List *options, bool for_identity, bool isInit, Form_pg_sequence seqform, ! Form_pg_sequence_data seqdataform, List **owned_by); static void do_setval(Oid relid, int64 next, bool iscalled); static void process_owned_by(Relation seqrel, List *owned_by, bool for_identity); --- 101,109 ---- static void init_params(ParseState *pstate, List *options, bool for_identity, bool isInit, Form_pg_sequence seqform, ! Form_pg_sequence_data seqdataform, ! bool *need_newrelfilenode, ! List **owned_by); static void do_setval(Oid relid, int64 next, bool iscalled); static void process_owned_by(Relation seqrel, List *owned_by, bool for_identity); *************** DefineSequence(ParseState *pstate, Creat *** 115,120 **** --- 117,123 ---- { FormData_pg_sequence seqform; FormData_pg_sequence_data seqdataform; + bool need_newrelfilenode; List *owned_by; CreateStmt *stmt = makeNode(CreateStmt); Oid seqoid; *************** DefineSequence(ParseState *pstate, Creat *** 153,159 **** } /* Check and set all option values */ ! init_params(pstate, seq->options, seq->for_identity, true, &seqform, &seqdataform, &owned_by); /* * Create relation (and fill value[] and null[] for the tuple) --- 156,164 ---- } /* Check and set all option values */ ! init_params(pstate, seq->options, seq->for_identity, true, ! &seqform, &seqdataform, ! &need_newrelfilenode, &owned_by); /* * Create relation (and fill value[] and null[] for the tuple) *************** AlterSequence(ParseState *pstate, AlterS *** 417,422 **** --- 422,428 ---- HeapTupleData datatuple; Form_pg_sequence seqform; Form_pg_sequence_data newdataform; + bool need_newrelfilenode; List *owned_by; ObjectAddress address; Relation rel; *************** AlterSequence(ParseState *pstate, AlterS *** 461,468 **** UnlockReleaseBuffer(buf); /* Check and set new values */ ! init_params(pstate, stmt->options, stmt->for_identity, false, seqform, ! newdataform, &owned_by); /* Clear local cache so that we don't think we have cached numbers */ /* Note that we do not change the currval() state */ --- 467,475 ---- UnlockReleaseBuffer(buf); /* Check and set new values */ ! init_params(pstate, stmt->options, stmt->for_identity, false, ! seqform, newdataform, ! &need_newrelfilenode, &owned_by); /* Clear local cache so that we don't think we have cached numbers */ /* Note that we do not change the currval() state */ *************** AlterSequence(ParseState *pstate, AlterS *** 473,490 **** GetTopTransactionId(); /* ! * Create a new storage file for the sequence, making the state changes ! * transactional. We want to keep the sequence's relfrozenxid at 0, since ! * it won't contain any unfrozen XIDs. Same with relminmxid, since a ! * sequence will never contain multixacts. */ ! RelationSetNewRelfilenode(seqrel, seqrel->rd_rel->relpersistence, ! InvalidTransactionId, InvalidMultiXactId); ! /* ! * Insert the modified tuple into the new storage file. ! */ ! fill_seq_with_data(seqrel, newdatatuple); /* process OWNED BY if given */ if (owned_by) --- 480,500 ---- GetTopTransactionId(); /* ! * If needed, create a new storage file for the sequence, making the state ! * changes transactional. We want to keep the sequence's relfrozenxid at ! * 0, since it won't contain any unfrozen XIDs. Same with relminmxid, ! * since a sequence will never contain multixacts. */ ! if (need_newrelfilenode) ! { ! RelationSetNewRelfilenode(seqrel, seqrel->rd_rel->relpersistence, ! InvalidTransactionId, InvalidMultiXactId); ! /* ! * Insert the modified tuple into the new storage file. ! */ ! fill_seq_with_data(seqrel, newdatatuple); ! } /* process OWNED BY if given */ if (owned_by) *************** read_seq_tuple(Relation rel, Buffer *buf *** 1205,1213 **** * init_params: process the options list of CREATE or ALTER SEQUENCE, and * store the values into appropriate fields of seqform, for changes that go * into the pg_sequence catalog, and seqdataform for changes to the sequence ! * relation itself. Set *changed_seqform to true if seqform was changed ! * (interesting for ALTER SEQUENCE). Also set *owned_by to any OWNED BY ! * option, or to NIL if there is none. * * If isInit is true, fill any unspecified options with default values; * otherwise, do not change existing options that aren't explicitly overridden. --- 1215,1224 ---- * init_params: process the options list of CREATE or ALTER SEQUENCE, and * store the values into appropriate fields of seqform, for changes that go * into the pg_sequence catalog, and seqdataform for changes to the sequence ! * relation itself. Set *need_newrelfilenode to true if we changed any ! * parameters that require assigning a new sequence relfilenode (interesting ! * for ALTER SEQUENCE). Also set *owned_by to any OWNED BY option, or to NIL ! * if there is none. * * If isInit is true, fill any unspecified options with default values; * otherwise, do not change existing options that aren't explicitly overridden. *************** init_params(ParseState *pstate, List *op *** 1217,1222 **** --- 1228,1234 ---- bool isInit, Form_pg_sequence seqform, Form_pg_sequence_data seqdataform, + bool *need_newrelfilenode, List **owned_by) { DefElem *as_type = NULL; *************** init_params(ParseState *pstate, List *op *** 1231,1236 **** --- 1243,1249 ---- bool reset_max_value = false; bool reset_min_value = false; + *need_newrelfilenode = false; *owned_by = NIL; foreach(option, options) *************** init_params(ParseState *pstate, List *op *** 1245,1250 **** --- 1258,1264 ---- errmsg("conflicting or redundant options"), parser_errposition(pstate, defel->location))); as_type = defel; + *need_newrelfilenode = true; } else if (strcmp(defel->defname, "increment") == 0) { *************** init_params(ParseState *pstate, List *op *** 1254,1259 **** --- 1268,1274 ---- errmsg("conflicting or redundant options"), parser_errposition(pstate, defel->location))); increment_by = defel; + *need_newrelfilenode = true; } else if (strcmp(defel->defname, "start") == 0) { *************** init_params(ParseState *pstate, List *op *** 1263,1268 **** --- 1278,1284 ---- errmsg("conflicting or redundant options"), parser_errposition(pstate, defel->location))); start_value = defel; + *need_newrelfilenode = true; } else if (strcmp(defel->defname, "restart") == 0) { *************** init_params(ParseState *pstate, List *op *** 1272,1277 **** --- 1288,1294 ---- errmsg("conflicting or redundant options"), parser_errposition(pstate, defel->location))); restart_value = defel; + *need_newrelfilenode = true; } else if (strcmp(defel->defname, "maxvalue") == 0) { *************** init_params(ParseState *pstate, List *op *** 1281,1286 **** --- 1298,1304 ---- errmsg("conflicting or redundant options"), parser_errposition(pstate, defel->location))); max_value = defel; + *need_newrelfilenode = true; } else if (strcmp(defel->defname, "minvalue") == 0) { *************** init_params(ParseState *pstate, List *op *** 1290,1295 **** --- 1308,1314 ---- errmsg("conflicting or redundant options"), parser_errposition(pstate, defel->location))); min_value = defel; + *need_newrelfilenode = true; } else if (strcmp(defel->defname, "cache") == 0) { *************** init_params(ParseState *pstate, List *op *** 1299,1304 **** --- 1318,1324 ---- errmsg("conflicting or redundant options"), parser_errposition(pstate, defel->location))); cache_value = defel; + *need_newrelfilenode = true; } else if (strcmp(defel->defname, "cycle") == 0) { *************** init_params(ParseState *pstate, List *op *** 1308,1313 **** --- 1328,1334 ---- errmsg("conflicting or redundant options"), parser_errposition(pstate, defel->location))); is_cycled = defel; + *need_newrelfilenode = true; } else if (strcmp(defel->defname, "owned_by") == 0) { -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Tom Lane wrote: > On closer inspection, the only thing that pg_upgrade needs is to be > able to do ALTER SEQUENCE OWNED BY without a relfilenode bump. PFA > two versions of a patch that fixes this problem, at least to the > extent that it gets through check-world without triggering the Assert > I added to GetNewRelFileNode (which HEAD doesn't). The first one > is a minimally-invasive hack; the second one puts the responsibility > for deciding if a sequence rewrite is needed into init_params. That's > bulkier but might be useful if we ever grow additional ALTER SEQUENCE > options that don't need a rewrite. OTOH, I'm not very clear on what > such options might look like, so maybe the extra flexibility is useless. > Thoughts? I vote for the second patch. I don't have a clear idea either, but I'm pretty sure the logical-replication people is going to be hacking on sequences some more, yet, and this is likely to come in handy. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2017-06-11 17:17:07 -0400, Tom Lane wrote: > I am not very sure why it's so hard to duplicate the misbehavior; perhaps, > in order to make the failure happen with the current regression tests, > it's necessary for a background auto-analyze to happen and consume some > OIDs (for pg_statistic TOAST entries) at just the wrong time. However, > I can definitely demonstrate that there are uncontrolled relfilenode > assignments happening during pg_upgrade's restore run. I stuck an > elog() call into GetNewObjectId(), along with generation of a stack > trace using backtrace(), and here is one example: Yea, that's not all that surprising :/ > Judging by when we started to see buildfarm failures, I think that > commit 3d79013b9 probably broke it, but the problem seems latent > in the whole concept of transactional sequence information. Hm. > Not sure what we want to do about it. One idea is to make > ALTER SEQUENCE not so transactional when in binary-upgrade mode. I think what you proposed downthread makes sense - I'd ripped out fairly similar code in my recent changes, because it was only introduced to address the concurrency problems the new sequence code had, and it didn't seem to buy sufficiently much... - Andres
Hi, On 2017-06-11 20:03:28 -0400, Tom Lane wrote: > I wrote: > > I believe I've identified the reason why skink and some other buildfarm > > members have been failing the pg_upgrade test recently. > > ... > > Not sure what we want to do about it. One idea is to make > > ALTER SEQUENCE not so transactional when in binary-upgrade mode. > > On closer inspection, the only thing that pg_upgrade needs is to be > able to do ALTER SEQUENCE OWNED BY without a relfilenode bump. That indeed seems appropriate. More on the reliability of that below, though. > PFA two versions of a patch that fixes this problem, at least to the > extent that it gets through check-world without triggering the Assert > I added to GetNewRelFileNode (which HEAD doesn't). The first one > is a minimally-invasive hack; the second one puts the responsibility > for deciding if a sequence rewrite is needed into init_params. That's > bulkier but might be useful if we ever grow additional ALTER SEQUENCE > options that don't need a rewrite. OTOH, I'm not very clear on what > such options might look like, so maybe the extra flexibility is useless. > Thoughts? I think the flexibility isn't a bad idea, there's certainly other options (cycle, cache, and with more work additional ones) that could be changed without a rewrite. > diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c > index 11ee536..43ef4cd 100644 > --- a/src/backend/catalog/catalog.c > +++ b/src/backend/catalog/catalog.c > @@ -391,6 +391,13 @@ GetNewRelFileNode(Oid reltablespace, Rel > bool collides; > BackendId backend; > > + /* > + * If we ever get here during pg_upgrade, there's something wrong; all > + * relfilenode assignments during a binary-upgrade run should be > + * determined by commands in the dump script. > + */ > + Assert(!IsBinaryUpgrade); > + I'm very doubtful that a) this doesn't get hit in practice, and b) that we can rely on it going forward. At least until we change toasting to not use the global oid counter. A number of catalog tables have toastable columns, and the decision when to toast will every now and then change. Even without changing the compression algorithm, added columns will push things across boundaries. I'm not yet sure what the best fix for that will be, but I don't think we should bake in the assumption that the oid counter won't be touched. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-06-11 20:03:28 -0400, Tom Lane wrote: >> @@ -391,6 +391,13 @@ GetNewRelFileNode(Oid reltablespace, Rel >> bool collides; >> BackendId backend; >> >> + /* >> + * If we ever get here during pg_upgrade, there's something wrong; all >> + * relfilenode assignments during a binary-upgrade run should be >> + * determined by commands in the dump script. >> + */ >> + Assert(!IsBinaryUpgrade); >> + > I'm very doubtful that a) this doesn't get hit in practice, and b) that > we can rely on it going forward. At least until we change toasting to > not use the global oid counter. This is not about assignments from the global OID counter; the function it's touching is GetNewRelFileNode() not GetNewObjectId(). GetNewObjectId() definitely does get hit during a binary-upgrade restore, for all object types that pg_upgrade doesn't try to control the OIDs of --- which is most. We don't care, for the most part. But we *do* care about relfilenode assignments, for precisely the reason seen in this bug. *All* assignments of relfilenodes have to be shortcircuited by pg_upgrade override calls during a binary-restore run, or we risk filename collisions. So if this assert ever gets hit, we have something to fix. I intend to not only commit this, but back-patch it. There's enough changes in relevant code paths that logic that is fine in HEAD might not be fine in back branches. regards, tom lane
On 2017-06-12 17:13:34 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-06-11 20:03:28 -0400, Tom Lane wrote: > >> @@ -391,6 +391,13 @@ GetNewRelFileNode(Oid reltablespace, Rel > >> bool collides; > >> BackendId backend; > >> > >> + /* > >> + * If we ever get here during pg_upgrade, there's something wrong; all > >> + * relfilenode assignments during a binary-upgrade run should be > >> + * determined by commands in the dump script. > >> + */ > >> + Assert(!IsBinaryUpgrade); > >> + > > > I'm very doubtful that a) this doesn't get hit in practice, and b) that > > we can rely on it going forward. At least until we change toasting to > > not use the global oid counter. > > This is not about assignments from the global OID counter; the function > it's touching is GetNewRelFileNode() not GetNewObjectId(). Ah, that makes more sense. You'd put the backtrace() into GetNewObjectId() your original message, that's probably why I thought about it. > We don't care, for the most part. But we *do* care about relfilenode > assignments, for precisely the reason seen in this bug. Even there I don't think that's a sane assumption *for the future*. We just need a slight change in the rules about when a toast table is needed - and that stuff seriously need overhauling - and it doesn't work anymore. In my opinion the problem of: > assignments of relfilenodes have to be shortcircuited by pg_upgrade > override calls during a binary-restore run, or we risk filename > collisions. should instead be solved by simply not even trying to preserve relfilenodes. We can "just" copy/link files to the the new relfilenodes, there's no need to preserve them, in contrast to pg_class.oid etc. But that's obviously something for the future. > I intend to not only commit this, but back-patch it. There's enough > changes in relevant code paths that logic that is fine in HEAD might > not be fine in back branches. Hm. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > In my opinion the problem of: > On 2017-06-12 17:13:34 -0400, Tom Lane wrote: >> assignments of relfilenodes have to be shortcircuited by pg_upgrade >> override calls during a binary-restore run, or we risk filename >> collisions. > should instead be solved by simply not even trying to preserve > relfilenodes. We can "just" copy/link files to the the new > relfilenodes, there's no need to preserve them, in contrast to > pg_class.oid etc. But that's obviously something for the future. Right. But until somebody gets around to fixing that, it's a problem. Also, even if we fix this, we still have the issue of type OIDs residing on-disk in places like array headers; that results in pg_upgrade having to control type OID assignments as well. I thought about trying to insert an assert in GetNewOidWithIndex to ensure that no type OIDs are selected automatically during upgrade, but it seemed a bit too complicated. Might be a good idea to figure it out though, unless you have an idea for removing that requirement. >> I intend to not only commit this, but back-patch it. There's enough >> changes in relevant code paths that logic that is fine in HEAD might >> not be fine in back branches. > Hm. Just found out that 9.4 (and probably older) fail the assertion because of the temp table created in get_rel_infos(). That's *probably* all right because the table is *probably* gone from disk by the time we start the actual restore run. But we start the new postmaster only once, with -b, so the assertion applies to the preparatory steps as well as the restore proper. Later versions avoid that by using a CTE instead of a temp table. I'm not excited enough about this to back-port the patch that changed it, so I'm thinking of just adding the assert back to 9.5. regards, tom lane
On 2017-06-12 17:35:37 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > In my opinion the problem of: > > > On 2017-06-12 17:13:34 -0400, Tom Lane wrote: > >> assignments of relfilenodes have to be shortcircuited by pg_upgrade > >> override calls during a binary-restore run, or we risk filename > >> collisions. > > > should instead be solved by simply not even trying to preserve > > relfilenodes. We can "just" copy/link files to the the new > > relfilenodes, there's no need to preserve them, in contrast to > > pg_class.oid etc. But that's obviously something for the future. > > Right. But until somebody gets around to fixing that, it's a problem. > > Also, even if we fix this, we still have the issue of type OIDs residing > on-disk in places like array headers; that results in pg_upgrade having > to control type OID assignments as well. Yes, but those should be largely controlled and controllable. With the exception of the type oids of toast tables, more on that below. > I thought about trying to insert an assert in GetNewOidWithIndex to > ensure that no type OIDs are selected automatically during upgrade, > but it seemed a bit too complicated. Might be a good idea to figure > it out though, unless you have an idea for removing that requirement. I think the only type oids assigned during pg_upgrade are the oids for toast type types, right? I can think of two decent ways to deal with those: a) Do not create a corresponding composite type for toast tables. Not super pretty, but I doubt it'd be a huge issue. b) Use *one* composite type for all of the toast tables. That'd not be entirely trivial because of pg_type.typrelid. Both of these would have the advantage of removing some quite redundant content from the catalogs. I think with such a change, we'd have no issue with *additional* toast tables being created during the run? We already take care of toast tables not being created, by forcing the creation in binary upgrade mode if a toast oid is set. > Later versions avoid that by using a CTE instead of a temp table. > I'm not excited enough about this to back-port the patch that > changed it, so I'm thinking of just adding the assert back to 9.5. I'm still a bit doubtful that we know enough to consider whether the assert is actually safe in practice, to backpatch it. On the other hand, it'd only affect people building with asserts... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-06-12 17:35:37 -0400, Tom Lane wrote: >> I thought about trying to insert an assert in GetNewOidWithIndex to >> ensure that no type OIDs are selected automatically during upgrade, >> but it seemed a bit too complicated. Might be a good idea to figure >> it out though, unless you have an idea for removing that requirement. > I think the only type oids assigned during pg_upgrade are the oids for > toast type types, right? Perhaps that was a problem once, but it isn't now; see binary_upgrade_set_next_toast_pg_type_oid(). In fact, we get through the pg_upgrade regression test with the attached patch, and I really think we ought to commit it. Having said that, I wouldn't mind trying to reduce the catalog overhead for toast tables in v11 or beyond. > a) Do not create a corresponding composite type for toast tables. Not > super pretty, but I doubt it'd be a huge issue. I suspect there are places that assume all tables have type OIDs. > b) Use *one* composite type for all of the toast tables. That'd not be > entirely trivial because of pg_type.typrelid. That might work, with some klugery. Peter might have some insight about this --- I'm not sure whether the CREATE TABLE OF TYPE syntax shares a type OID across all the tables. regards, tom lane diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c index 11ee536..92d943c 100644 *** a/src/backend/catalog/catalog.c --- b/src/backend/catalog/catalog.c *************** *** 38,43 **** --- 38,44 ---- #include "catalog/pg_shseclabel.h" #include "catalog/pg_subscription.h" #include "catalog/pg_tablespace.h" + #include "catalog/pg_type.h" #include "catalog/toasting.h" #include "miscadmin.h" #include "storage/fd.h" *************** GetNewOidWithIndex(Relation relation, Oi *** 340,345 **** --- 341,354 ---- ScanKeyData key; bool collides; + /* + * We should never be asked to generate a new pg_type OID during + * pg_upgrade; doing so would risk collisions with the OIDs it wants to + * assign. Hitting this assert means there's some path where we failed to + * ensure that a type OID is determined by commands in the dump script. + */ + Assert(!IsBinaryUpgrade || RelationGetRelid(relation) != TypeRelationId); + InitDirtySnapshot(SnapshotDirty); /* Generate new OIDs until we find one not in the table */ *************** GetNewRelFileNode(Oid reltablespace, Rel *** 391,396 **** --- 400,412 ---- bool collides; BackendId backend; + /* + * If we ever get here during pg_upgrade, there's something wrong; all + * relfilenode assignments during a binary-upgrade run should be + * determined by commands in the dump script. + */ + Assert(!IsBinaryUpgrade); + switch (relpersistence) { case RELPERSISTENCE_TEMP: -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Jun 12, 2017 at 5:25 PM, Andres Freund <andres@anarazel.de> wrote: > Even there I don't think that's a sane assumption *for the future*. We > just need a slight change in the rules about when a toast table is needed > - and that stuff seriously need overhauling - and it doesn't work > anymore. The problem is that if a relfilenode ever gets assigned by GetNewRelFileNode() during a binary-upgrade restore, that OID may turn out to be used by some other object later in the dump. And then you're dead, because the dump restore will fail later on complaining about, well, I forget the error message wording exactly, but, basically, an OID collision. So if we change the rules in such a way that objects which currently lack TOAST tables acquire them, we need to first restore all of the objects *without* adding any new TOAST tables, and then at the end create any new TOAST tables once we have a complete list of the relfilenodes that are actually used. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > The problem is that if a relfilenode ever gets assigned by > GetNewRelFileNode() during a binary-upgrade restore, that OID may turn > out to be used by some other object later in the dump. And then > you're dead, because the dump restore will fail later on complaining > about, well, I forget the error message wording exactly, but, > basically, an OID collision. Right. > So if we change the rules in such a way > that objects which currently lack TOAST tables acquire them, we need > to first restore all of the objects *without* adding any new TOAST > tables, and then at the end create any new TOAST tables once we have a > complete list of the relfilenodes that are actually used. toasting.c explains why this currently doesn't seem necessary: /* * In binary-upgrade mode, create a TOAST table if and only if * pg_upgrade told us to (ie, a TOASTtable OID has been provided). * * This indicates that the old cluster had a TOAST table for the * current table. We must create a TOAST table to receive the old * TOAST file, even if the table seems not to needone. * * Contrariwise, if the old cluster did not have a TOAST table, we * should be able to getalong without one even if the new version's * needs_toast_table rules suggest we should have one. There is a lot * of daylight between where we will create a TOAST table and where * one is really necessary to avoid failures,so small cross-version * differences in the when-to-create heuristic shouldn't be a problem. * Ifwe tried to create a TOAST table anyway, we would have the * problem that it might take up an OID that will conflictwith some * old-cluster table we haven't seen yet. */ I don't really see any near-term reason why we would need to change this. In the long run, it would certainly be cleaner if pg_upgrade dropped the force-the-relfilenode-assignment approach and instead remapped relfilenodes from old cluster to new. But I think it's just for cleanliness rather to fix any live or foreseeable bug. regards, tom lane
On Tue, Jun 13, 2017 at 9:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > In the long run, it would certainly be cleaner if pg_upgrade dropped > the force-the-relfilenode-assignment approach and instead remapped > relfilenodes from old cluster to new. But I think it's just for > cleanliness rather to fix any live or foreseeable bug. Honestly, I think that would probably be *less* robust overall. I think we ought to be driving in the direction of having more and more things common between the old and new clusters, rather than trying to cope with them being different. It's pretty easy for users to have hidden dependencies on values that we think are only for internal use - e.g. somebody's got a table column of type regclass, and it breaks if you changed the table OIDs. A table with relfilenode values in it is perhaps less likely, but it is not beyond imagining that someone (who wrote a replication system?) has a use for it. Now maybe you're going to say "that's not a good idea", and maybe you're right, but users don't enjoy being told that something they've been doing for years works all the time *except* when you use pg_upgrade. Also, I think that if we did it that way, it would be significantly harder to debug. Right now, if something goes boom, you can look at the old and new clusters and figure out what doesn't match, but if pg_upgrade renumbered everything, you would no longer be able to do that, or at least not easily. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 13, 2017 at 11:14:02AM -0400, Robert Haas wrote: > Also, I think that if we did it that way, it would be significantly > harder to debug. Right now, if something goes boom, you can look at > the old and new clusters and figure out what doesn't match, but if > pg_upgrade renumbered everything, you would no longer be able to do > that, or at least not easily. FYI, pg_upgrade is designed to go boom if something doesn't look right because it can't anticipate what changes might be made to Postgres in the future. boom == feature! -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +