Thread: [HACKERS] Transactional sequence stuff breaks pg_upgrade

[HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Tom Lane

Date:

12 June 2017, 00:17:07

I believe I've identified the reason why skink and some other buildfarm
members have been failing the pg_upgrade test recently.  It is that
recent changes in sequence support have caused binary-upgrade restore
runs to do some sequence OID/relfilenode assignments without any heed
to the OIDs that pg_upgrade tried to impose on those sequences.  Once
those sequences have relfilenodes other than the intended ones, they
are land mines for all subsequent pg_upgrade-controlled table OID
assignments.

I am not very sure why it's so hard to duplicate the misbehavior; perhaps,
in order to make the failure happen with the current regression tests,
it's necessary for a background auto-analyze to happen and consume some
OIDs (for pg_statistic TOAST entries) at just the wrong time.  However,
I can definitely demonstrate that there are uncontrolled relfilenode
assignments happening during pg_upgrade's restore run.  I stuck an
elog() call into GetNewObjectId(), along with generation of a stack
trace using backtrace(), and here is one example:

[593daad3.4863:2243] LOG:  generated OID 16735
[593daad3.4863:2244] STATEMENT:  -- For binary upgrade, must preserve pg_class oidsSELECT
pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('46851'::pg_catalog.oid);--For binary upgrade, must preserve
pg_typeoidSELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('46852'::pg_catalog.oid);ALTER TABLE "itest10" ALTER
COLUMN"a" ADD GENERATED BY DEFAULT AS IDENTITY (    SEQUENCE NAME "itest10_a_seq"    START WITH 1    INCREMENT BY 1
NOMINVALUE    NO MAXVALUE    CACHE 1); 

postgres: postgres regression [local] ALTER TABLE(GetNewObjectId+0xda) [0x50397a]
postgres: postgres regression [local] ALTER TABLE(GetNewRelFileNode+0xec) [0x52430c]
postgres: postgres regression [local] ALTER TABLE(RelationSetNewRelfilenode+0x79) [0x851d59]
postgres: postgres regression [local] ALTER TABLE(AlterSequence+0x1cd) [0x5d976d]
postgres: postgres regression [local] ALTER TABLE() [0x75d279]
postgres: postgres regression [local] ALTER TABLE(standard_ProcessUtility+0xb7) [0x75dec7]
postgres: postgres regression [local] ALTER TABLE() [0x75cb1d]
postgres: postgres regression [local] ALTER TABLE(standard_ProcessUtility+0xb7) [0x75dec7]
postgres: postgres regression [local] ALTER TABLE() [0x759f0b]
postgres: postgres regression [local] ALTER TABLE() [0x75ae91]
postgres: postgres regression [local] ALTER TABLE(PortalRun+0x250) [0x75b740]
postgres: postgres regression [local] ALTER TABLE() [0x757be7]
postgres: postgres regression [local] ALTER TABLE(PostgresMain+0xe08) [0x759968]
postgres: postgres regression [local] ALTER TABLE(PostmasterMain+0x1a99) [0x6e21a9]
postgres: postgres regression [local] ALTER TABLE(main+0x6b8) [0x65b958]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x3f3bc1ed1d]
postgres: postgres regression [local] ALTER TABLE() [0x473899]


Judging by when we started to see buildfarm failures, I think that
commit 3d79013b9 probably broke it, but the problem seems latent
in the whole concept of transactional sequence information.

Not sure what we want to do about it.  One idea is to make
ALTER SEQUENCE not so transactional when in binary-upgrade mode.

(I'm also tempted to make GetNewRelFileNode complain if IsBinaryUpgrade
is true, but that's a separate matter.)

In any case, this is a "must fix" problem IMO, so I'll go add it to the
open items list.
        regards, tom lane

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Tom Lane

Date:

12 June 2017, 03:03:28

I wrote:
> I believe I've identified the reason why skink and some other buildfarm
> members have been failing the pg_upgrade test recently.
> ...
> Not sure what we want to do about it.  One idea is to make
> ALTER SEQUENCE not so transactional when in binary-upgrade mode.

On closer inspection, the only thing that pg_upgrade needs is to be
able to do ALTER SEQUENCE OWNED BY without a relfilenode bump.  PFA
two versions of a patch that fixes this problem, at least to the
extent that it gets through check-world without triggering the Assert
I added to GetNewRelFileNode (which HEAD doesn't).  The first one
is a minimally-invasive hack; the second one puts the responsibility
for deciding if a sequence rewrite is needed into init_params.  That's
bulkier but might be useful if we ever grow additional ALTER SEQUENCE
options that don't need a rewrite.  OTOH, I'm not very clear on what
such options might look like, so maybe the extra flexibility is useless.
Thoughts?

            regards, tom lane

diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 11ee536..43ef4cd 100644
*** a/src/backend/catalog/catalog.c
--- b/src/backend/catalog/catalog.c
*************** GetNewRelFileNode(Oid reltablespace, Rel
*** 391,396 ****
--- 391,403 ----
      bool        collides;
      BackendId    backend;

+     /*
+      * If we ever get here during pg_upgrade, there's something wrong; all
+      * relfilenode assignments during a binary-upgrade run should be
+      * determined by commands in the dump script.
+      */
+     Assert(!IsBinaryUpgrade);
+
      switch (relpersistence)
      {
          case RELPERSISTENCE_TEMP:
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 7e85b69..188429c 100644
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
*************** AlterSequence(ParseState *pstate, AlterS
*** 473,490 ****
          GetTopTransactionId();

      /*
!      * Create a new storage file for the sequence, making the state changes
!      * transactional.  We want to keep the sequence's relfrozenxid at 0, since
!      * it won't contain any unfrozen XIDs.  Same with relminmxid, since a
!      * sequence will never contain multixacts.
       */
!     RelationSetNewRelfilenode(seqrel, seqrel->rd_rel->relpersistence,
!                               InvalidTransactionId, InvalidMultiXactId);

!     /*
!      * Insert the modified tuple into the new storage file.
!      */
!     fill_seq_with_data(seqrel, newdatatuple);

      /* process OWNED BY if given */
      if (owned_by)
--- 473,499 ----
          GetTopTransactionId();

      /*
!      * If we are *only* doing OWNED BY, there is no need to rewrite the
!      * sequence file nor the pg_sequence tuple; and we mustn't do so because
!      * it breaks pg_upgrade by causing unwanted changes in the sequence's
!      * relfilenode.
       */
!     if (!(owned_by && list_length(stmt->options) == 1))
!     {
!         /*
!          * Create a new storage file for the sequence, making the state
!          * changes transactional.  We want to keep the sequence's relfrozenxid
!          * at 0, since it won't contain any unfrozen XIDs.  Same with
!          * relminmxid, since a sequence will never contain multixacts.
!          */
!         RelationSetNewRelfilenode(seqrel, seqrel->rd_rel->relpersistence,
!                                   InvalidTransactionId, InvalidMultiXactId);

!         /*
!          * Insert the modified tuple into the new storage file.
!          */
!         fill_seq_with_data(seqrel, newdatatuple);
!     }

      /* process OWNED BY if given */
      if (owned_by)
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 11ee536..43ef4cd 100644
*** a/src/backend/catalog/catalog.c
--- b/src/backend/catalog/catalog.c
*************** GetNewRelFileNode(Oid reltablespace, Rel
*** 391,396 ****
--- 391,403 ----
      bool        collides;
      BackendId    backend;

+     /*
+      * If we ever get here during pg_upgrade, there's something wrong; all
+      * relfilenode assignments during a binary-upgrade run should be
+      * determined by commands in the dump script.
+      */
+     Assert(!IsBinaryUpgrade);
+
      switch (relpersistence)
      {
          case RELPERSISTENCE_TEMP:
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 7e85b69..34b8aa2 100644
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
*************** static Form_pg_sequence_data read_seq_tu
*** 101,107 ****
  static void init_params(ParseState *pstate, List *options, bool for_identity,
              bool isInit,
              Form_pg_sequence seqform,
!             Form_pg_sequence_data seqdataform, List **owned_by);
  static void do_setval(Oid relid, int64 next, bool iscalled);
  static void process_owned_by(Relation seqrel, List *owned_by, bool for_identity);

--- 101,109 ----
  static void init_params(ParseState *pstate, List *options, bool for_identity,
              bool isInit,
              Form_pg_sequence seqform,
!             Form_pg_sequence_data seqdataform,
!             bool *need_newrelfilenode,
!             List **owned_by);
  static void do_setval(Oid relid, int64 next, bool iscalled);
  static void process_owned_by(Relation seqrel, List *owned_by, bool for_identity);

*************** DefineSequence(ParseState *pstate, Creat
*** 115,120 ****
--- 117,123 ----
  {
      FormData_pg_sequence seqform;
      FormData_pg_sequence_data seqdataform;
+     bool        need_newrelfilenode;
      List       *owned_by;
      CreateStmt *stmt = makeNode(CreateStmt);
      Oid            seqoid;
*************** DefineSequence(ParseState *pstate, Creat
*** 153,159 ****
      }

      /* Check and set all option values */
!     init_params(pstate, seq->options, seq->for_identity, true, &seqform, &seqdataform, &owned_by);

      /*
       * Create relation (and fill value[] and null[] for the tuple)
--- 156,164 ----
      }

      /* Check and set all option values */
!     init_params(pstate, seq->options, seq->for_identity, true,
!                 &seqform, &seqdataform,
!                 &need_newrelfilenode, &owned_by);

      /*
       * Create relation (and fill value[] and null[] for the tuple)
*************** AlterSequence(ParseState *pstate, AlterS
*** 417,422 ****
--- 422,428 ----
      HeapTupleData datatuple;
      Form_pg_sequence seqform;
      Form_pg_sequence_data newdataform;
+     bool        need_newrelfilenode;
      List       *owned_by;
      ObjectAddress address;
      Relation    rel;
*************** AlterSequence(ParseState *pstate, AlterS
*** 461,468 ****
      UnlockReleaseBuffer(buf);

      /* Check and set new values */
!     init_params(pstate, stmt->options, stmt->for_identity, false, seqform,
!                 newdataform, &owned_by);

      /* Clear local cache so that we don't think we have cached numbers */
      /* Note that we do not change the currval() state */
--- 467,475 ----
      UnlockReleaseBuffer(buf);

      /* Check and set new values */
!     init_params(pstate, stmt->options, stmt->for_identity, false,
!                 seqform, newdataform,
!                 &need_newrelfilenode, &owned_by);

      /* Clear local cache so that we don't think we have cached numbers */
      /* Note that we do not change the currval() state */
*************** AlterSequence(ParseState *pstate, AlterS
*** 473,490 ****
          GetTopTransactionId();

      /*
!      * Create a new storage file for the sequence, making the state changes
!      * transactional.  We want to keep the sequence's relfrozenxid at 0, since
!      * it won't contain any unfrozen XIDs.  Same with relminmxid, since a
!      * sequence will never contain multixacts.
       */
!     RelationSetNewRelfilenode(seqrel, seqrel->rd_rel->relpersistence,
!                               InvalidTransactionId, InvalidMultiXactId);

!     /*
!      * Insert the modified tuple into the new storage file.
!      */
!     fill_seq_with_data(seqrel, newdatatuple);

      /* process OWNED BY if given */
      if (owned_by)
--- 480,500 ----
          GetTopTransactionId();

      /*
!      * If needed, create a new storage file for the sequence, making the state
!      * changes transactional.  We want to keep the sequence's relfrozenxid at
!      * 0, since it won't contain any unfrozen XIDs.  Same with relminmxid,
!      * since a sequence will never contain multixacts.
       */
!     if (need_newrelfilenode)
!     {
!         RelationSetNewRelfilenode(seqrel, seqrel->rd_rel->relpersistence,
!                                   InvalidTransactionId, InvalidMultiXactId);

!         /*
!          * Insert the modified tuple into the new storage file.
!          */
!         fill_seq_with_data(seqrel, newdatatuple);
!     }

      /* process OWNED BY if given */
      if (owned_by)
*************** read_seq_tuple(Relation rel, Buffer *buf
*** 1205,1213 ****
   * init_params: process the options list of CREATE or ALTER SEQUENCE, and
   * store the values into appropriate fields of seqform, for changes that go
   * into the pg_sequence catalog, and seqdataform for changes to the sequence
!  * relation itself.  Set *changed_seqform to true if seqform was changed
!  * (interesting for ALTER SEQUENCE).  Also set *owned_by to any OWNED BY
!  * option, or to NIL if there is none.
   *
   * If isInit is true, fill any unspecified options with default values;
   * otherwise, do not change existing options that aren't explicitly overridden.
--- 1215,1224 ----
   * init_params: process the options list of CREATE or ALTER SEQUENCE, and
   * store the values into appropriate fields of seqform, for changes that go
   * into the pg_sequence catalog, and seqdataform for changes to the sequence
!  * relation itself.  Set *need_newrelfilenode to true if we changed any
!  * parameters that require assigning a new sequence relfilenode (interesting
!  * for ALTER SEQUENCE).  Also set *owned_by to any OWNED BY option, or to NIL
!  * if there is none.
   *
   * If isInit is true, fill any unspecified options with default values;
   * otherwise, do not change existing options that aren't explicitly overridden.
*************** init_params(ParseState *pstate, List *op
*** 1217,1222 ****
--- 1228,1234 ----
              bool isInit,
              Form_pg_sequence seqform,
              Form_pg_sequence_data seqdataform,
+             bool *need_newrelfilenode,
              List **owned_by)
  {
      DefElem    *as_type = NULL;
*************** init_params(ParseState *pstate, List *op
*** 1231,1236 ****
--- 1243,1249 ----
      bool        reset_max_value = false;
      bool        reset_min_value = false;

+     *need_newrelfilenode = false;
      *owned_by = NIL;

      foreach(option, options)
*************** init_params(ParseState *pstate, List *op
*** 1245,1250 ****
--- 1258,1264 ----
                           errmsg("conflicting or redundant options"),
                           parser_errposition(pstate, defel->location)));
              as_type = defel;
+             *need_newrelfilenode = true;
          }
          else if (strcmp(defel->defname, "increment") == 0)
          {
*************** init_params(ParseState *pstate, List *op
*** 1254,1259 ****
--- 1268,1274 ----
                           errmsg("conflicting or redundant options"),
                           parser_errposition(pstate, defel->location)));
              increment_by = defel;
+             *need_newrelfilenode = true;
          }
          else if (strcmp(defel->defname, "start") == 0)
          {
*************** init_params(ParseState *pstate, List *op
*** 1263,1268 ****
--- 1278,1284 ----
                           errmsg("conflicting or redundant options"),
                           parser_errposition(pstate, defel->location)));
              start_value = defel;
+             *need_newrelfilenode = true;
          }
          else if (strcmp(defel->defname, "restart") == 0)
          {
*************** init_params(ParseState *pstate, List *op
*** 1272,1277 ****
--- 1288,1294 ----
                           errmsg("conflicting or redundant options"),
                           parser_errposition(pstate, defel->location)));
              restart_value = defel;
+             *need_newrelfilenode = true;
          }
          else if (strcmp(defel->defname, "maxvalue") == 0)
          {
*************** init_params(ParseState *pstate, List *op
*** 1281,1286 ****
--- 1298,1304 ----
                           errmsg("conflicting or redundant options"),
                           parser_errposition(pstate, defel->location)));
              max_value = defel;
+             *need_newrelfilenode = true;
          }
          else if (strcmp(defel->defname, "minvalue") == 0)
          {
*************** init_params(ParseState *pstate, List *op
*** 1290,1295 ****
--- 1308,1314 ----
                           errmsg("conflicting or redundant options"),
                           parser_errposition(pstate, defel->location)));
              min_value = defel;
+             *need_newrelfilenode = true;
          }
          else if (strcmp(defel->defname, "cache") == 0)
          {
*************** init_params(ParseState *pstate, List *op
*** 1299,1304 ****
--- 1318,1324 ----
                           errmsg("conflicting or redundant options"),
                           parser_errposition(pstate, defel->location)));
              cache_value = defel;
+             *need_newrelfilenode = true;
          }
          else if (strcmp(defel->defname, "cycle") == 0)
          {
*************** init_params(ParseState *pstate, List *op
*** 1308,1313 ****
--- 1328,1334 ----
                           errmsg("conflicting or redundant options"),
                           parser_errposition(pstate, defel->location)));
              is_cycled = defel;
+             *need_newrelfilenode = true;
          }
          else if (strcmp(defel->defname, "owned_by") == 0)
          {

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Alvaro Herrera

Date:

12 June 2017, 20:29:49

Tom Lane wrote:

> On closer inspection, the only thing that pg_upgrade needs is to be
> able to do ALTER SEQUENCE OWNED BY without a relfilenode bump.  PFA
> two versions of a patch that fixes this problem, at least to the
> extent that it gets through check-world without triggering the Assert
> I added to GetNewRelFileNode (which HEAD doesn't).  The first one
> is a minimally-invasive hack; the second one puts the responsibility
> for deciding if a sequence rewrite is needed into init_params.  That's
> bulkier but might be useful if we ever grow additional ALTER SEQUENCE
> options that don't need a rewrite.  OTOH, I'm not very clear on what
> such options might look like, so maybe the extra flexibility is useless.
> Thoughts?

I vote for the second patch.  I don't have a clear idea either, but I'm
pretty sure the logical-replication people is going to be hacking on
sequences some more, yet, and this is likely to come in handy.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Andres Freund

Date:

12 June 2017, 22:42:55

Hi,

On 2017-06-11 17:17:07 -0400, Tom Lane wrote:
> I am not very sure why it's so hard to duplicate the misbehavior; perhaps,
> in order to make the failure happen with the current regression tests,
> it's necessary for a background auto-analyze to happen and consume some
> OIDs (for pg_statistic TOAST entries) at just the wrong time.  However,
> I can definitely demonstrate that there are uncontrolled relfilenode
> assignments happening during pg_upgrade's restore run.  I stuck an
> elog() call into GetNewObjectId(), along with generation of a stack
> trace using backtrace(), and here is one example:

Yea, that's not all that surprising :/

> Judging by when we started to see buildfarm failures, I think that
> commit 3d79013b9 probably broke it, but the problem seems latent
> in the whole concept of transactional sequence information.

Hm.

> Not sure what we want to do about it.  One idea is to make
> ALTER SEQUENCE not so transactional when in binary-upgrade mode.

I think what you proposed downthread makes sense - I'd ripped out fairly
similar code in my recent changes, because it was only introduced to
address the concurrency problems the new sequence code had, and it
didn't seem to buy sufficiently much...

- Andres

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Andres Freund

Date:

13 June 2017, 00:02:21

Hi,

On 2017-06-11 20:03:28 -0400, Tom Lane wrote:
> I wrote:
> > I believe I've identified the reason why skink and some other buildfarm
> > members have been failing the pg_upgrade test recently.
> > ...
> > Not sure what we want to do about it.  One idea is to make
> > ALTER SEQUENCE not so transactional when in binary-upgrade mode.
> 
> On closer inspection, the only thing that pg_upgrade needs is to be
> able to do ALTER SEQUENCE OWNED BY without a relfilenode bump.

That indeed seems appropriate.  More on the reliability of that below,
though.

> PFA two versions of a patch that fixes this problem, at least to the
> extent that it gets through check-world without triggering the Assert
> I added to GetNewRelFileNode (which HEAD doesn't).  The first one
> is a minimally-invasive hack; the second one puts the responsibility
> for deciding if a sequence rewrite is needed into init_params.  That's
> bulkier but might be useful if we ever grow additional ALTER SEQUENCE
> options that don't need a rewrite.  OTOH, I'm not very clear on what
> such options might look like, so maybe the extra flexibility is useless.
> Thoughts?

I think the flexibility isn't a bad idea, there's certainly other
options (cycle, cache, and with more work additional ones) that could be
changed without a rewrite.

> diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
> index 11ee536..43ef4cd 100644
> --- a/src/backend/catalog/catalog.c
> +++ b/src/backend/catalog/catalog.c
> @@ -391,6 +391,13 @@ GetNewRelFileNode(Oid reltablespace, Rel
>      bool        collides;
>      BackendId    backend;
>  
> +    /*
> +     * If we ever get here during pg_upgrade, there's something wrong; all
> +     * relfilenode assignments during a binary-upgrade run should be
> +     * determined by commands in the dump script.
> +     */
> +    Assert(!IsBinaryUpgrade);
> +

I'm very doubtful that a) this doesn't get hit in practice, and b) that
we can rely on it going forward.  At least until we change toasting to
not use the global oid counter.  A number of catalog tables have
toastable columns, and the decision when to toast will every now and
then change. Even without changing the compression algorithm, added
columns will push things across boundaries.

I'm not yet sure what the best fix for that will be, but I don't think
we should bake in the assumption that the oid counter won't be touched.

- Andres

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Tom Lane

Date:

13 June 2017, 00:13:34

Andres Freund <andres@anarazel.de> writes:
> On 2017-06-11 20:03:28 -0400, Tom Lane wrote:
>> @@ -391,6 +391,13 @@ GetNewRelFileNode(Oid reltablespace, Rel
>> bool        collides;
>> BackendId    backend;
>> 
>> +    /*
>> +     * If we ever get here during pg_upgrade, there's something wrong; all
>> +     * relfilenode assignments during a binary-upgrade run should be
>> +     * determined by commands in the dump script.
>> +     */
>> +    Assert(!IsBinaryUpgrade);
>> +

> I'm very doubtful that a) this doesn't get hit in practice, and b) that
> we can rely on it going forward.  At least until we change toasting to
> not use the global oid counter.

This is not about assignments from the global OID counter; the function
it's touching is GetNewRelFileNode() not GetNewObjectId().

GetNewObjectId() definitely does get hit during a binary-upgrade restore,
for all object types that pg_upgrade doesn't try to control the OIDs of
--- which is most.  We don't care, for the most part.  But we *do* care
about relfilenode assignments, for precisely the reason seen in this bug.
*All* assignments of relfilenodes have to be shortcircuited by pg_upgrade
override calls during a binary-restore run, or we risk filename
collisions.  So if this assert ever gets hit, we have something to fix.

I intend to not only commit this, but back-patch it.  There's enough
changes in relevant code paths that logic that is fine in HEAD might
not be fine in back branches.
        regards, tom lane

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Andres Freund

Date:

13 June 2017, 00:25:17

On 2017-06-12 17:13:34 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-06-11 20:03:28 -0400, Tom Lane wrote:
> >> @@ -391,6 +391,13 @@ GetNewRelFileNode(Oid reltablespace, Rel
> >> bool        collides;
> >> BackendId    backend;
> >> 
> >> +    /*
> >> +     * If we ever get here during pg_upgrade, there's something wrong; all
> >> +     * relfilenode assignments during a binary-upgrade run should be
> >> +     * determined by commands in the dump script.
> >> +     */
> >> +    Assert(!IsBinaryUpgrade);
> >> +
> 
> > I'm very doubtful that a) this doesn't get hit in practice, and b) that
> > we can rely on it going forward.  At least until we change toasting to
> > not use the global oid counter.
> 
> This is not about assignments from the global OID counter; the function
> it's touching is GetNewRelFileNode() not GetNewObjectId().

Ah, that makes more sense. You'd put the backtrace() into
GetNewObjectId() your original message, that's probably why I thought
about it.


> We don't care, for the most part.  But we *do* care about relfilenode
> assignments, for precisely the reason seen in this bug.

Even there I don't think that's a sane assumption *for the future*. We
just need a slight change in the rules about when a toast table is needed
- and that stuff seriously need overhauling - and it doesn't work
anymore.

In my opinion the problem of:
> assignments of relfilenodes have to be shortcircuited by pg_upgrade
> override calls during a binary-restore run, or we risk filename
> collisions.

should instead be solved by simply not even trying to preserve
relfilenodes.  We can "just" copy/link files to the the new
relfilenodes, there's no need to preserve them, in contrast to
pg_class.oid etc.  But that's obviously something for the future.


> I intend to not only commit this, but back-patch it.  There's enough
> changes in relevant code paths that logic that is fine in HEAD might
> not be fine in back branches.

Hm.

Greetings,

Andres Freund

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Tom Lane

Date:

13 June 2017, 00:35:37

Andres Freund <andres@anarazel.de> writes:
> In my opinion the problem of:

> On 2017-06-12 17:13:34 -0400, Tom Lane wrote:
>> assignments of relfilenodes have to be shortcircuited by pg_upgrade
>> override calls during a binary-restore run, or we risk filename
>> collisions.

> should instead be solved by simply not even trying to preserve
> relfilenodes.  We can "just" copy/link files to the the new
> relfilenodes, there's no need to preserve them, in contrast to
> pg_class.oid etc.  But that's obviously something for the future.

Right.  But until somebody gets around to fixing that, it's a problem.

Also, even if we fix this, we still have the issue of type OIDs residing
on-disk in places like array headers; that results in pg_upgrade having
to control type OID assignments as well.  I thought about trying to insert
an assert in GetNewOidWithIndex to ensure that no type OIDs are selected
automatically during upgrade, but it seemed a bit too complicated.  Might
be a good idea to figure it out though, unless you have an idea for
removing that requirement.

>> I intend to not only commit this, but back-patch it.  There's enough
>> changes in relevant code paths that logic that is fine in HEAD might
>> not be fine in back branches.

> Hm.

Just found out that 9.4 (and probably older) fail the assertion because
of the temp table created in get_rel_infos().  That's *probably* all
right because the table is *probably* gone from disk by the time we
start the actual restore run.  But we start the new postmaster only
once, with -b, so the assertion applies to the preparatory steps as
well as the restore proper.

Later versions avoid that by using a CTE instead of a temp table.
I'm not excited enough about this to back-port the patch that
changed it, so I'm thinking of just adding the assert back to 9.5.
        regards, tom lane

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Andres Freund

Date:

13 June 2017, 01:11:18

On 2017-06-12 17:35:37 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > In my opinion the problem of:
> 
> > On 2017-06-12 17:13:34 -0400, Tom Lane wrote:
> >> assignments of relfilenodes have to be shortcircuited by pg_upgrade
> >> override calls during a binary-restore run, or we risk filename
> >> collisions.
> 
> > should instead be solved by simply not even trying to preserve
> > relfilenodes.  We can "just" copy/link files to the the new
> > relfilenodes, there's no need to preserve them, in contrast to
> > pg_class.oid etc.  But that's obviously something for the future.
> 
> Right.  But until somebody gets around to fixing that, it's a problem.
> 
> Also, even if we fix this, we still have the issue of type OIDs residing
> on-disk in places like array headers; that results in pg_upgrade having
> to control type OID assignments as well.

Yes, but those should be largely controlled and controllable.  With the
exception of the type oids of toast tables, more on that below.

> I thought about trying to insert an assert in GetNewOidWithIndex to
> ensure that no type OIDs are selected automatically during upgrade,
> but it seemed a bit too complicated.  Might be a good idea to figure
> it out though, unless you have an idea for removing that requirement.

I think the only type oids assigned during pg_upgrade are the oids for
toast type types, right?  I can think of two decent ways to deal with
those:

a) Do not create a corresponding composite type for toast tables. Not  super pretty, but I doubt it'd be a huge issue.
b) Use *one* composite type for all of the toast tables.  That'd not be  entirely trivial because of pg_type.typrelid.

Both of these would have the advantage of removing some quite redundant
content from the catalogs.

I think with such a change, we'd have no issue with *additional* toast
tables being created during the run?  We already take care of toast
tables not being created, by forcing the creation in binary upgrade mode
if a toast oid is set.

> Later versions avoid that by using a CTE instead of a temp table.
> I'm not excited enough about this to back-port the patch that
> changed it, so I'm thinking of just adding the assert back to 9.5.

I'm still a bit doubtful that we know enough to consider whether the
assert is actually safe in practice, to backpatch it.  On the other
hand, it'd only affect people building with asserts...

Greetings,

Andres Freund

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Tom Lane

Date:

13 June 2017, 01:31:04

Andres Freund <andres@anarazel.de> writes:
> On 2017-06-12 17:35:37 -0400, Tom Lane wrote:
>> I thought about trying to insert an assert in GetNewOidWithIndex to
>> ensure that no type OIDs are selected automatically during upgrade,
>> but it seemed a bit too complicated.  Might be a good idea to figure
>> it out though, unless you have an idea for removing that requirement.

> I think the only type oids assigned during pg_upgrade are the oids for
> toast type types, right?

Perhaps that was a problem once, but it isn't now; see 
binary_upgrade_set_next_toast_pg_type_oid().

In fact, we get through the pg_upgrade regression test with the attached
patch, and I really think we ought to commit it.

Having said that, I wouldn't mind trying to reduce the catalog overhead
for toast tables in v11 or beyond.

> a) Do not create a corresponding composite type for toast tables. Not
>    super pretty, but I doubt it'd be a huge issue.

I suspect there are places that assume all tables have type OIDs.

> b) Use *one* composite type for all of the toast tables.  That'd not be
>    entirely trivial because of pg_type.typrelid.

That might work, with some klugery.  Peter might have some insight about
this --- I'm not sure whether the CREATE TABLE OF TYPE syntax shares
a type OID across all the tables.

            regards, tom lane

diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 11ee536..92d943c 100644
*** a/src/backend/catalog/catalog.c
--- b/src/backend/catalog/catalog.c
***************
*** 38,43 ****
--- 38,44 ----
  #include "catalog/pg_shseclabel.h"
  #include "catalog/pg_subscription.h"
  #include "catalog/pg_tablespace.h"
+ #include "catalog/pg_type.h"
  #include "catalog/toasting.h"
  #include "miscadmin.h"
  #include "storage/fd.h"
*************** GetNewOidWithIndex(Relation relation, Oi
*** 340,345 ****
--- 341,354 ----
      ScanKeyData key;
      bool        collides;

+     /*
+      * We should never be asked to generate a new pg_type OID during
+      * pg_upgrade; doing so would risk collisions with the OIDs it wants to
+      * assign.  Hitting this assert means there's some path where we failed to
+      * ensure that a type OID is determined by commands in the dump script.
+      */
+     Assert(!IsBinaryUpgrade || RelationGetRelid(relation) != TypeRelationId);
+
      InitDirtySnapshot(SnapshotDirty);

      /* Generate new OIDs until we find one not in the table */
*************** GetNewRelFileNode(Oid reltablespace, Rel
*** 391,396 ****
--- 400,412 ----
      bool        collides;
      BackendId    backend;

+     /*
+      * If we ever get here during pg_upgrade, there's something wrong; all
+      * relfilenode assignments during a binary-upgrade run should be
+      * determined by commands in the dump script.
+      */
+     Assert(!IsBinaryUpgrade);
+
      switch (relpersistence)
      {
          case RELPERSISTENCE_TEMP:

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Robert Haas

Date:

13 June 2017, 14:55:38

On Mon, Jun 12, 2017 at 5:25 PM, Andres Freund <andres@anarazel.de> wrote:
> Even there I don't think that's a sane assumption *for the future*. We
> just need a slight change in the rules about when a toast table is needed
> - and that stuff seriously need overhauling - and it doesn't work
> anymore.

The problem is that if a relfilenode ever gets assigned by
GetNewRelFileNode() during a binary-upgrade restore, that OID may turn
out to be used by some other object later in the dump.  And then
you're dead, because the dump restore will fail later on complaining
about, well, I forget the error message wording exactly, but,
basically, an OID collision.  So if we change the rules in such a way
that objects which currently lack TOAST tables acquire them, we need
to first restore all of the objects *without* adding any new TOAST
tables, and then at the end create any new TOAST tables once we have a
complete list of the relfilenodes that are actually used.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Tom Lane

Date:

13 June 2017, 16:37:59

Robert Haas <robertmhaas@gmail.com> writes:
> The problem is that if a relfilenode ever gets assigned by
> GetNewRelFileNode() during a binary-upgrade restore, that OID may turn
> out to be used by some other object later in the dump.  And then
> you're dead, because the dump restore will fail later on complaining
> about, well, I forget the error message wording exactly, but,
> basically, an OID collision.

Right.

> So if we change the rules in such a way
> that objects which currently lack TOAST tables acquire them, we need
> to first restore all of the objects *without* adding any new TOAST
> tables, and then at the end create any new TOAST tables once we have a
> complete list of the relfilenodes that are actually used.

toasting.c explains why this currently doesn't seem necessary:
       /*        * In binary-upgrade mode, create a TOAST table if and only if        * pg_upgrade told us to (ie, a
TOASTtable OID has been provided).        *        * This indicates that the old cluster had a TOAST table for the
 * current table.  We must create a TOAST table to receive the old        * TOAST file, even if the table seems not to
needone.        *        * Contrariwise, if the old cluster did not have a TOAST table, we        * should be able to
getalong without one even if the new version's        * needs_toast_table rules suggest we should have one.  There is a
lot       * of daylight between where we will create a TOAST table and where        * one is really necessary to avoid
failures,so small cross-version        * differences in the when-to-create heuristic shouldn't be a problem.        *
Ifwe tried to create a TOAST table anyway, we would have the        * problem that it might take up an OID that will
conflictwith some        * old-cluster table we haven't seen yet.        */
 

I don't really see any near-term reason why we would need to change this.

In the long run, it would certainly be cleaner if pg_upgrade dropped
the force-the-relfilenode-assignment approach and instead remapped
relfilenodes from old cluster to new.  But I think it's just for
cleanliness rather to fix any live or foreseeable bug.
        regards, tom lane

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Robert Haas

Date:

13 June 2017, 18:14:02

On Tue, Jun 13, 2017 at 9:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In the long run, it would certainly be cleaner if pg_upgrade dropped
> the force-the-relfilenode-assignment approach and instead remapped
> relfilenodes from old cluster to new.  But I think it's just for
> cleanliness rather to fix any live or foreseeable bug.

Honestly, I think that would probably be *less* robust overall.  I
think we ought to be driving in the direction of having more and more
things common between the old and new clusters, rather than trying to
cope with them being different.  It's pretty easy for users to have
hidden dependencies on values that we think are only for internal use
- e.g. somebody's got a table column of type regclass, and it breaks
if you changed the table OIDs.  A table with relfilenode values in it
is perhaps less likely, but it is not beyond imagining that someone
(who wrote a replication system?) has a use for it.  Now maybe you're
going to say "that's not a good idea", and maybe you're right, but
users don't enjoy being told that something they've been doing for
years works all the time *except* when you use pg_upgrade.

Also, I think that if we did it that way, it would be significantly
harder to debug.  Right now, if something goes boom, you can look at
the old and new clusters and figure out what doesn't match, but if
pg_upgrade renumbered everything, you would no longer be able to do
that, or at least not easily.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Transactional sequence stuff breaks pg_upgrade

From

Bruce Momjian

Date:

13 June 2017, 20:07:59

On Tue, Jun 13, 2017 at 11:14:02AM -0400, Robert Haas wrote:
> Also, I think that if we did it that way, it would be significantly
> harder to debug.  Right now, if something goes boom, you can look at
> the old and new clusters and figure out what doesn't match, but if
> pg_upgrade renumbered everything, you would no longer be able to do
> that, or at least not easily.

FYI, pg_upgrade is designed to go boom if something doesn't look right
because it can't anticipate what changes might be made to Postgres in
the future.

boom == feature!

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +