Thread: pg_upgrade: fail early if a tablespace dir already exists for new cluster version

pg_upgrade: fail early if a tablespace dir already exists for new cluster version

From
Justin Pryzby
Date:
This should be caught during --check, or earlier in the upgrade, rather than
only after dumping the schema.

Typically, the tablespace dir would be left behind by a previous failed upgrade
attempt, causing a cascade of failured upgrades.  I guess I may not be the only
one with a 3 year old wrapper which has a hack to check for this.

|rm -fr pgsql12.dat pgsql13.dat
|/usr/pgsql-12/bin/initdb -D pgsql12.dat --no-sync
|/usr/pgsql-13/bin/initdb -D pgsql13.dat --no-sync
|/usr/pgsql-12/bin/postgres -D pgsql12.dat -c port=5678 -k /tmp
|mkdir tblspc tblspc/PG_13_202007203
|psql -h /tmp -p 5678 postgres -c "CREATE TABLESPACE one LOCATION '/home/pryzbyj/tblspc'"
|/usr/pgsql-13/bin/pg_upgrade -D pgsql13.dat -d pgsql12.dat -b /usr/pgsql-12/bin
|pg_upgrade_utility.log:
|CREATE TABLESPACE "one" OWNER "pryzbyj" LOCATION '/home/pryzbyj/tblspc';
|psql:pg_upgrade_dump_globals.sql:27: ERROR:  directory "/home/pryzbyj/tblspc/PG_13_202007201" already in use as a
tablespace

Attachment
On Thu, Sep 24, 2020 at 07:55:31PM -0500, Justin Pryzby wrote:
> This should be caught during --check, or earlier in the upgrade, rather than
> only after dumping the schema.
> 
> Typically, the tablespace dir would be left behind by a previous failed upgrade
> attempt, causing a cascade of failured upgrades.  I guess I may not be the only
> one with a 3 year old wrapper which has a hack to check for this.

This is an interesting failure case I never considered --- running
pg_upgrade, having it fail, deleting and recreating the _new_ cluster
directory, but not removing the new cluster's tablepace directories.  I
was able to recreate the failure, and verify that your patch properly
throws an error during 'check' for this case.

Modified patch attached.  I plan to apply this to all supported Postgres
versions.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment
On Fri, Oct 09, 2020 at 02:53:25PM -0400, Bruce Momjian wrote:
> On Thu, Sep 24, 2020 at 07:55:31PM -0500, Justin Pryzby wrote:
> > This should be caught during --check, or earlier in the upgrade, rather than
> > only after dumping the schema.
> > 
> > Typically, the tablespace dir would be left behind by a previous failed upgrade
> > attempt, causing a cascade of failured upgrades.  I guess I may not be the only
> > one with a 3 year old wrapper which has a hack to check for this.
> 
> This is an interesting failure case I never considered --- running
> pg_upgrade, having it fail, deleting and recreating the _new_ cluster
> directory, but not removing the new cluster's tablepace directories.  I
> was able to recreate the failure, and verify that your patch properly
> throws an error during 'check' for this case.
> 
> Modified patch attached.  I plan to apply this to all supported Postgres
> versions.

Thanks for looking.  It hits me a bunch of times every year.

> +  * Check that tablespace directories do not already exist for new cluster.
> +  * If they do, it would cause an error while restoring global objects.
> +  * This allows the failure to be detected at check time, rather than
> +  * during schema restore.
> +  *
> +  * Note, v8.4 has no tablespace_suffix, which is fine as long as the new
> +  * cluster version has a suffix.

In my local branch, I had revised this comment to say:

+ * Note, v8.4 has no tablespace_suffix, which is fine so long as the version we
+ * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade from
+ * a version to the same version if tablespaces are in use.

Cheers,
-- 
Justin



On Fri, Oct  9, 2020 at 02:23:10PM -0500, Justin Pryzby wrote:
> In my local branch, I had revised this comment to say:
> 
> + * Note, v8.4 has no tablespace_suffix, which is fine so long as the version we
> + * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade from
> + * a version to the same version if tablespaces are in use.

OK, updated patch attached.  Also, from your original patch, I didn't
need to call canonicalize_path() since we are not comparing paths, and I
didn't need to include common/relpath.h.  I also renamed a variable.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee


Attachment
On Fri, Oct 09, 2020 at 07:42:51PM -0400, Bruce Momjian wrote:
> On Fri, Oct  9, 2020 at 02:23:10PM -0500, Justin Pryzby wrote:
> > In my local branch, I had revised this comment to say:
> > 
> > + * Note, v8.4 has no tablespace_suffix, which is fine so long as the version we
> > + * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade from
> > + * a version to the same version if tablespaces are in use.
> 
> OK, updated patch attached.  Also, from your original patch, I didn't
> need to call canonicalize_path() since we are not comparing paths, and I
> didn't need to include common/relpath.h.  I also renamed a variable.

Since I just hit it again, I'll take the opportunity to give an example of how
this can happen.

Here, I've pg_upgraded from a pg12 to pg13, but the pg12 cluster has
postgis-3.0, and pg13 has postgis-3.1.  Maybe that's not guaranteed/intended to
work, but that's what's been working well so far this year, except that there's
two gis functions which no longer exist.  So we fail while "Restoring database
schemas in the new cluster", leaving behind tablespace dirs, which then cause
future pg_upgrades to fail.

pg_restore: from TOC entry 13543; 1255 17106 FUNCTION pgis_geometry_union_transfn("internal", "public"."geometry")
postgres
pg_restore: error: could not execute query: ERROR:  could not find function "pgis_geometry_union_transfn" in file
"/usr/pgsql-13/lib/postgis-3.so"
Command was: CREATE FUNCTION "public"."pgis_geometry_union_transfn"("internal", "public"."geometry") RETURNS
"internal"
    LANGUAGE "c" PARALLEL SAFE
    AS '$libdir/postgis-3', 'pgis_geometry_union_transfn';

I imagine in previous years I hit that some other way, like maybe I installed
postgres RC1 on a customer server, imported their schema (and maybe data),
which then caused upgrade failure 1-2 months later when trying to upgrade their
production instance.

-- 
Justin



On Fri, Oct  9, 2020 at 07:42:51PM -0400, Bruce Momjian wrote:
> On Fri, Oct  9, 2020 at 02:23:10PM -0500, Justin Pryzby wrote:
> > In my local branch, I had revised this comment to say:
> > 
> > + * Note, v8.4 has no tablespace_suffix, which is fine so long as the version we
> > + * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade from
> > + * a version to the same version if tablespaces are in use.
> 
> OK, updated patch attached.  Also, from your original patch, I didn't
> need to call canonicalize_path() since we are not comparing paths, and I
> didn't need to include common/relpath.h.  I also renamed a variable.

Patch applied to all supported versions.  Thanks for the report, and the
patch.  While I listed you as reporter, I forgot to list you as the
patch author --- Tom, can you remember that for the minor release notes
please?  Thanks

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Hi,

On 2020-10-15 19:35:30 -0400, Bruce Momjian wrote:
> On Fri, Oct  9, 2020 at 07:42:51PM -0400, Bruce Momjian wrote:
> > On Fri, Oct  9, 2020 at 02:23:10PM -0500, Justin Pryzby wrote:
> > > In my local branch, I had revised this comment to say:
> > > 
> > > + * Note, v8.4 has no tablespace_suffix, which is fine so long as the version we
> > > + * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade from
> > > + * a version to the same version if tablespaces are in use.
> > 
> > OK, updated patch attached.  Also, from your original patch, I didn't
> > need to call canonicalize_path() since we are not comparing paths, and I
> > didn't need to include common/relpath.h.  I also renamed a variable.
> 
> Patch applied to all supported versions.  Thanks for the report, and the
> patch.  While I listed you as reporter, I forgot to list you as the
> patch author --- Tom, can you remember that for the minor release notes
> please?  Thanks

This fails on some animals <= 11, because we didn't yet rely on C99.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2020-10-15%2023%3A52%3A04
check.c:503:2: error: \342\200\230for\342\200\231 loop initial declarations are only allowed in C99 mode
  for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)

Greetings,

Andres Freund



On Thu, Oct 15, 2020 at 05:19:59PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2020-10-15 19:35:30 -0400, Bruce Momjian wrote:
> > On Fri, Oct  9, 2020 at 07:42:51PM -0400, Bruce Momjian wrote:
> > > On Fri, Oct  9, 2020 at 02:23:10PM -0500, Justin Pryzby wrote:
> > > > In my local branch, I had revised this comment to say:
> > > > 
> > > > + * Note, v8.4 has no tablespace_suffix, which is fine so long as the version we
> > > > + * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade from
> > > > + * a version to the same version if tablespaces are in use.
> > > 
> > > OK, updated patch attached.  Also, from your original patch, I didn't
> > > need to call canonicalize_path() since we are not comparing paths, and I
> > > didn't need to include common/relpath.h.  I also renamed a variable.
> > 
> > Patch applied to all supported versions.  Thanks for the report, and the
> > patch.  While I listed you as reporter, I forgot to list you as the
> > patch author --- Tom, can you remember that for the minor release notes
> > please?  Thanks
> 
> This fails on some animals <= 11, because we didn't yet rely on C99.
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2020-10-15%2023%3A52%3A04
> check.c:503:2: error: \342\200\230for\342\200\231 loop initial declarations are only allowed in C99 mode
>   for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)

Thanks, fixed.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




On Thu, Oct 15, 2020 at 07:35:30PM -0400, Bruce Momjian wrote:
> On Fri, Oct  9, 2020 at 07:42:51PM -0400, Bruce Momjian wrote:
> > On Fri, Oct  9, 2020 at 02:23:10PM -0500, Justin Pryzby wrote:
> > > In my local branch, I had revised this comment to say:
> > > 
> > > + * Note, v8.4 has no tablespace_suffix, which is fine so long as the version we
> > > + * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade from
> > > + * a version to the same version if tablespaces are in use.
> > 
> > OK, updated patch attached.  Also, from your original patch, I didn't
> > need to call canonicalize_path() since we are not comparing paths, and I
> > didn't need to include common/relpath.h.  I also renamed a variable.
> 
> Patch applied to all supported versions.  Thanks for the report, and the

I wonder if pg_upgrade should try to rmdir() the tablespace dirs before
restoring global objects, allowing it to succeed, rather than just "failing
early".  That can avoid the need to re-create the new cluster, which I imagine
in some scenarios might be bad enough to require aborting the upgrade.
I propose only for master branch.

This doesn't avoid the need to re-create the new cluster if anything has been
restored into it (eg. if pg_tablespace is populated), it just cleans out any
empty dirs left behind by a previous pg_upgrade which failed after "restoring
global objects" but before copying/linking data into that tablespace.

|rm -fr pgsql12.dat pgsql14.dat tblspc;
|/usr/lib/postgresql/12/bin/initdb -D pgsql12.dat
|./tmp_install/usr/local/pgsql/bin/initdb -D pgsql14.dat
|mkdir tblspc tblspc/PG_14_202010201
|echo "CREATE TABLESPACE one LOCATION '`pwd`/tblspc'" |/usr/lib/postgresql/12/bin/postgres --single postgres -D
pgsql12.dat-p 5678 -k /tmp
 
|./tmp_install/usr/local/pgsql/bin/pg_upgrade -D pgsql14.dat/ -d pgsql12.dat -b /usr/lib/postgresql/12/bin

-- 
Justin

Attachment
Justin Pryzby <pryzby@telsasoft.com> writes:
> I wonder if pg_upgrade should try to rmdir() the tablespace dirs before
> restoring global objects, allowing it to succeed, rather than just "failing
> early".

I'm a little confused about that.  If the directories aren't empty,
that will fail, but if they are, shouldn't the upgrade just work?
initdb is not normally unhappy about the target directory existing
if it's empty.

The reason why rmdir-and-recreate is not a great substitute for
"it just works" is that (a) you may lose the intended ownership or
permissions of those dirs, (b) you may lack write permission on their
parent dirs.  So cases where the DBA has pre-created the directories
are important to support.

            regards, tom lane



On Tue, Oct 20, 2020 at 09:17:22PM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > I wonder if pg_upgrade should try to rmdir() the tablespace dirs before
> > restoring global objects, allowing it to succeed, rather than just "failing
> > early".
> 
> I'm a little confused about that.  If the directories aren't empty,
> that will fail,

You mean rmdir() will fail, returning -1, which my patch will ignore, and the
pg_upgrade will fail, same as it would have before.  The goal of the patch is
to improve the "empty" case, only.

> but if they are, shouldn't the upgrade just work?

It fails in "Restoring global objects", which runs "CREATE TABLESPACE".
| errmsg("directory \"%s\" already in use as a tablespace",

I considered the possibility of changing that, but it seems like this is
specific to pg_upgrade.  I wouldn't want to change the core server just for
that, and it has a good reason for failing in that case:
| * The creation of the version directory prevents more than one tablespace
| * in a single location.

-- 
Justin



On Wed, Oct 21, 2020 at 06:54:17AM -0500, Justin Pryzby wrote:
> On Tue, Oct 20, 2020 at 09:17:22PM -0400, Tom Lane wrote:
> > Justin Pryzby <pryzby@telsasoft.com> writes:
> > > I wonder if pg_upgrade should try to rmdir() the tablespace dirs before
> > > restoring global objects, allowing it to succeed, rather than just "failing
> > > early".
> > 
> > I'm a little confused about that.  If the directories aren't empty,
> > that will fail,
> 
> You mean rmdir() will fail, returning -1, which my patch will ignore, and the
> pg_upgrade will fail, same as it would have before.  The goal of the patch is
> to improve the "empty" case, only.
> 
> > but if they are, shouldn't the upgrade just work?
> 
> It fails in "Restoring global objects", which runs "CREATE TABLESPACE".
> | errmsg("directory \"%s\" already in use as a tablespace",
> 
> I considered the possibility of changing that, but it seems like this is
> specific to pg_upgrade.  I wouldn't want to change the core server just for
> that, and it has a good reason for failing in that case:
> | * The creation of the version directory prevents more than one tablespace
> | * in a single location.

I think it is best to report an error --- pg_upgrade rarely tries to fix
things because that adds complexity, and perhaps bugs.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee