Thread: 9.3beta2: Failure to pg_upgrade

9.3beta2: Failure to pg_upgrade

From

Jesse Denardo

Date:

30 July 2013, 14:38:55

Name: Jesse Denardo
Release: 9.2.2 -> 9.3beta2
Test Type: Install/Upgrade Test
Test Detail: pg_upgrade in a fresh install of 9.3beta2
Platform: Debian Linux 6.0.5
Installation Method: From source
Platform Detail: Debian Linux 6.0.5, 2.6.32.45-grsec-2.2.2-r3, x86_64
Test Procedure:

I made a byte for byte copy of our existing 9.2.2 Postgres directory (which
includes the data directory), changed the port, and started it up. I
pointed our dev application at the new port, and everything worked as
expected. I then followed the procedure outlined here:
http://www.postgresql.org/docs/9.3/static/pgupgrade.html

I installed 9.3beta2 into a new directory from source, installed Postgis
2.1 and our required contrib modules, ran initdb, and ran pg_upgrade
pointing at the right spots. The exact commands I used were:

$ whoami
postgres

$ pwd
/home/postgres

$ 9.3beta2/bin/initdb 9.3beta2/data
The files belonging to this database system will be owned by user
"postgres".
This user must also own the server process.

The database cluster will be initialized with locales
  COLLATE:  C
  CTYPE:    en_US.UTF-8
  MESSAGES: en_US.UTF-8
  MONETARY: en_US.UTF-8
  NUMERIC:  en_US.UTF-8
  TIME:     en_US.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory 9.3beta2/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
creating configuration files ... ok
creating template1 database in 9.3beta2/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    9.3beta2/bin/postgres -D 9.3beta2/data
or
    9.3beta2/bin/pg_ctl -D 9.3beta2/data -l logfile start

(At this point, starting the new database succeeds.)

$ 9.3beta2/bin/pg_upgrade --old-bindir=/home/postgres/9.2_dev/bin \
    --new-bindir=/home/postgres/9.3beta2/bin \
    --old-datadir=/home/postgres/9.2_dev/data \
    --new-datadir=/home/postgres/9.3beta2/data -u postgres

Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is a superuser                       ok
Checking for prepared transactions                          ok
Checking for reg* system OID user data types                ok
Checking for contrib/isn with bigint-passing mismatch       ok
Creating dump of global objects                             ok
Creating dump of database schemas
                                                            ok
Checking for presence of required libraries                 ok
Checking database user is a superuser                       ok
Checking for prepared transactions                          ok

If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows on the new cluster                        ok
Deleting files from new pg_clog                             ok
Copying old pg_clog to new server                           ok
Setting next transaction ID for new cluster                 ok
Setting oldest multixact ID on new cluster                  ok
Resetting WAL archives                                      ok

*failure*
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or
directory
 Is the server running locally and accepting
connections on Unix domain socket "/home/postgres/.s.PGSQL.50432"?


could not connect to new postmaster started with the command:
"/home/postgres/9.3beta2/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D
"/home/postgres/9.3beta2/data" -o "-p 50432 -b -c synchronous_commit=off -c
fsync=off -c full_page_writes=off  -c listen_addresses='' -c
unix_socket_permissions=0700 -c unix_socket_directories='/home/postgres'"
start
Failure, exiting

Failure: Error, possible compatibility issue
Results: See above. The pg_upgrade_server.log contains at the end:

command: "/home/postgres/9.3beta2/bin/pg_ctl" -w -l "pg_upgrade_server.log"
-D "/home/postgres/9.3beta2/data" -o "-p 50432 -b -c synchronous_commit=off
-c fsync=off -c full_page_writes=off  -c listen_addresses='' -c
unix_socket_permissions=0700 -c unix_socket_directories='/home/postgres'"
start >> "pg_upgrade_server.log" 2>&1
waiting for server to start....LOG:  database system was shut down at
2013-07-30 09:57:58 EDT
FATAL:  could not access status of transaction 2983
DETAIL:  Could not read from file "pg_multixact/offsets/0000" at offset
8192: Success.
LOG:  startup process (PID 5239) exited with exit code 1
LOG:  aborting startup due to startup process failure
.... stopped waiting
pg_ctl: could not start server
Examine the log output.

At this point, attempting to start the server fails with the same error
until I delete the data directory and re-initdb.

--
Jesse Denardo

Re: 9.3beta2: Failure to pg_upgrade

From

Bruce Momjian

Date:

31 July 2013, 16:03:13

On Tue, Jul 30, 2013 at 10:17:52AM -0400, Jesse Denardo wrote:
> Name: Jesse Denardo
> Release: 9.2.2 -> 9.3beta2
> Test Type: Install/Upgrade Test
> Test Detail: pg_upgrade in a fresh install of 9.3beta2
> Platform: Debian Linux 6.0.5
> Installation Method: From source
> Platform Detail: Debian Linux 6.0.5, 2.6.32.45-grsec-2.2.2-r3, x86_64
> Test Procedure:
>
> I made a byte for byte copy of our existing 9.2.2 Postgres directory (which
> includes the data directory), changed the port, and started it up. I pointed

I assume you did this while the server was down.

> command: "/home/postgres/9.3beta2/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D
> "/home/postgres/9.3beta2/data" -o "-p 50432 -b -c synchronous_commit=off -c
> fsync=off -c full_page_writes=off  -c listen_addresses='' -c
> unix_socket_permissions=0700 -c unix_socket_directories='/home/postgres'" start
> >> "pg_upgrade_server.log" 2>&1
> waiting for server to start....LOG:  database system was shut down at
> 2013-07-30 09:57:58 EDT
> FATAL:  could not access status of transaction 2983
> DETAIL:  Could not read from file "pg_multixact/offsets/0000" at offset 8192:
> Success.

OK, I actually have an idea on this.  Here is the pg_upgrade code:

    /*
     * If the old server is before the MULTIXACT_FORMATCHANGE_CAT_VER change
     * (see pg_upgrade.h) and the new server is after, then we don't copy
     * pg_multixact files, but we need to reset pg_control so that the new
     * server doesn't attempt to read multis older than the cutoff value.
     */
    if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
        new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
    {
        copy_subdir_files("pg_multixact/offsets");
        copy_subdir_files("pg_multixact/members");
        prep_status("Setting next multixact ID and offset for new cluster");

        /*
         * we preserve all files and contents, so we must preserve both "next"
         * counters here and the oldest multi present on system.
         */
        exec_prog(UTILITY_LOG_FILE, NULL, true,
                  "\"%s/pg_resetxlog\" -O %u -m %u,%u \"%s\"",
                  new_cluster.bindir,
                  old_cluster.controldata.chkpnt_nxtmxoff,
                  old_cluster.controldata.chkpnt_nxtmulti,
                  old_cluster.controldata.chkpnt_oldstMulti,
                  new_cluster.pgdata);
        check_ok();
    }

and the C comment on that constant (in pg_upgrade.h) is:

    /*
     * pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
     * ("Improve concurrency of foreign key locking") which also updated catalog
     * version to this value.  pg_upgrade behavior depends on whether old and new
     * server versions are both newer than this, or only the new one is.
     */
    #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231
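
As a rough illustration of how that catalog-version test sorts clusters
into cases, here is a minimal standalone sketch. It is not pg_upgrade
code: the enum and the function classify_multixact_upgrade() are
hypothetical names, and the "nothing to do" outcome for a target that
predates the format change is an assumption about what falls through.
Only the constant and the ">=" comparisons come from the excerpts above.

```c
#include <stdint.h>

/* Value quoted from pg_upgrade.h above. */
#define MULTIXACT_FORMATCHANGE_CAT_VER 201301231

/* Hypothetical labels for the outcomes; not names used by pg_upgrade. */
typedef enum
{
    MXACT_COPY_FILES,   /* both clusters use the new on-disk format */
    MXACT_RESET_ONLY,   /* only the new one does: no copy, adjust pg_control */
    MXACT_NOTHING       /* assumed fall-through: new cluster predates change */
} MultiXactUpgradeAction;

MultiXactUpgradeAction
classify_multixact_upgrade(uint32_t old_cat_ver, uint32_t new_cat_ver)
{
    if (old_cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
        new_cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
        return MXACT_COPY_FILES;

    if (new_cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
        return MXACT_RESET_ONLY;

    return MXACT_NOTHING;
}
```

Any old catalog version below 201301231, paired with a new one at or
above it, lands in MXACT_RESET_ONLY, which is the path where the
pg_multixact files are not copied.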

So, first, this is new in 9.3, and second, it seems the comment "we need
to reset pg_control so that the new server doesn't attempt to read
multis older than the cutoff value" is not working.  Alvaro, can you
comment on this?  I think you added this code with this commit:

    commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
    Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
    Date:   Wed Jan 23 12:04:59 2013 -0300

    ...

    pg_upgrade also needs to be careful to copy pg_multixact files over from
    the old server to the new, or at least part of multixact.c state,
    depending on the versions of the old and new servers.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: 9.3beta2: Failure to pg_upgrade

From

Alvaro Herrera

Date:

31 July 2013, 16:56:42

Bruce Momjian wrote:
> On Tue, Jul 30, 2013 at 10:17:52AM -0400, Jesse Denardo wrote:

> So, first, this is new in 9.3, and second, it seems the comment "we need
> to reset pg_control so that the new server doesn't attempt to read
> multis older than the cutoff value" is not working.  Alvaro, can you
> comment on this?  I think you added this code with this commit:

So it seems.  I will have a look.

Jesse, can you please supply pg_controldata output for the PGDATA you're
upgrading?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: 9.3beta2: Failure to pg_upgrade

From

Bruce Momjian

Date:

31 July 2013, 17:06:42

On Wed, Jul 31, 2013 at 12:56:33PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Tue, Jul 30, 2013 at 10:17:52AM -0400, Jesse Denardo wrote:
>
> > So, first, this is new in 9.3, and second, it seems the comment "we need
> > to reset pg_control so that the new server doesn't attempt to read
> > multis older than the cutoff value" is not working.  Alvaro, can you
> > comment on this?  I think you added this code with this commit:
>
> So it seems.  I will have a look.

Well, the good news is that this is new 9.3 code, this bug was caught
during beta, and pg_upgrade failed visibly, rather than silently.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: 9.3beta2: Failure to pg_upgrade

From

Jesse Denardo

Date:

31 July 2013, 18:38:05

$ 9.2_dev/bin/pg_controldata data
pg_control version number:            922
Catalog version number:               201204301
Database system identifier:           5789770930315980286
Database cluster state:               shut down
pg_control last modified:             Tue 30 Jul 2013 09:57:54 AM EDT
Latest checkpoint location:           6C/F8EE7500
Prior checkpoint location:            6C/F8E90C18
Latest checkpoint's REDO location:    6C/F8EE7500
Latest checkpoint's TimeLineID:       1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/15960984
Latest checkpoint's NextOID:          4747737
Latest checkpoint's NextMultiXactId:  2982
Latest checkpoint's NextMultiOffset:  6479
Latest checkpoint's oldestXID:        1761
Latest checkpoint's oldestXID's DB:   12843
Latest checkpoint's oldestActiveXID:  0
Time of latest checkpoint:            Tue 30 Jul 2013 09:57:53 AM EDT
Minimum recovery ending location:     0/0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            minimal
Current max_connections setting:      100
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   256
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value


--
Jesse Denardo



Re: 9.3beta2: Failure to pg_upgrade

From

Alvaro Herrera

Date:

31 July 2013, 20:55:42

Jesse Denardo wrote:

> $ 9.2_dev/bin/pg_controldata data

> Latest checkpoint's NextMultiXactId:  2982
> Latest checkpoint's NextMultiOffset:  6479

So what's happening here is that the MultiXact 2982 lives in a SLRU page
that doesn't exist.  pg_upgrade didn't copy the pg_multixact files from
the old cluster, because they are not compatible; instead it just sets
the values in pg_control.  As soon as a new multixact is to be created,
things fail because the code is not prepared to deal with the
possibility that the underlying SLRU files have not been extended during
normal operation.
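
The arithmetic behind the failing read can be sketched as follows. The
constants are assumptions about the SLRU layout rather than something
stated in this thread: 8192-byte pages (matching "Database block size"
in the pg_controldata output), one 4-byte offset entry per multixact,
and 32 pages per segment file. The function names are made up for
illustration.

```c
#include <stdint.h>

#define SLRU_PAGE_SIZE      8192u   /* "Database block size" in pg_controldata */
#define OFFSET_ENTRY_SIZE   4u      /* assumed: one 32-bit offset per multixact */
#define PAGES_PER_SEGMENT   32u     /* assumed SLRU segment size, in pages */
#define ENTRIES_PER_PAGE    (SLRU_PAGE_SIZE / OFFSET_ENTRY_SIZE)   /* 2048 */

/* Page of pg_multixact/offsets that holds the entry for this multixact. */
uint32_t
offsets_page_of(uint32_t multi)
{
    return multi / ENTRIES_PER_PAGE;
}

/* Segment file number (file names are hex, e.g. "0000"). */
uint32_t
offsets_segment_of(uint32_t multi)
{
    return offsets_page_of(multi) / PAGES_PER_SEGMENT;
}

/* Byte position of that page within its segment file. */
uint32_t
offsets_byte_in_segment(uint32_t multi)
{
    return (offsets_page_of(multi) % PAGES_PER_SEGMENT) * SLRU_PAGE_SIZE;
}
```

Under these assumptions, multixact 2982 falls on page 1, that is,
segment 0000 at byte 8192, which lines up with the "Could not read from
file "pg_multixact/offsets/0000" at offset 8192" message in the log.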

I see two ways to deal with this:

1. On each multixact creation, verify whether the pages we're trying to
modify do in fact exist.  If they don't, create them.

2. At startup, verify the "next" multixact values, and extend the files
if necessary.

I think (1) is not a very good idea because it will cause too large an
impact at runtime, when it is not really necessary.  I lean more towards
(2).  On IM, Bruce suggested instead:

2a. Same as (2), but only do it in pg_upgrade's usage of postgres'
binary-upgrade mode (postgres -b).  Thus this will be done once during
the upgrade process and not every time the system starts up.
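
To make option (2) a bit more concrete, here is a minimal sketch of
"check whether the page exists, extend the file if not". It is not the
slru.c code and not the patch discussed in this thread:
ensure_page_exists() is a hypothetical helper that works on a plain
file through POSIX calls, assumes 8192-byte pages, and relies on
ftruncate() zero-filling the extension instead of writing pages through
the SLRU buffers.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 8192   /* assumed page size */

/*
 * Hypothetical helper: make sure the segment file at "path" is long
 * enough to contain page "pageno_in_segment".  If it is too short, grow
 * it; the added bytes read back as zeros.  Returns 0 on success, -1 on
 * error.
 */
int
ensure_page_exists(const char *path, uint32_t pageno_in_segment)
{
    off_t       needed = ((off_t) pageno_in_segment + 1) * SKETCH_PAGE_SIZE;
    struct stat st;
    int         rc = -1;
    int         fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;

    if (fstat(fd, &st) == 0)
    {
        if (st.st_size >= needed)
            rc = 0;             /* page is already on disk */
        else if (ftruncate(fd, needed) == 0)
            rc = 0;             /* file grown up to the end of that page */
    }

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

In variant (2a) a check like this would run only once, while the server
is started in binary-upgrade mode, rather than on every startup.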


As it turns out, I have a patched slru.c that adds a new function to
verify whether a page exists on disk.  I created this for the commit
timestamp module, for the BDR branch, but I think it's what we need
here.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: 9.3beta2: Failure to pg_upgrade

From

Alvaro Herrera

Date:

02 August 2013, 22:18:00

Alvaro Herrera wrote:

> As it turns out, I have a patched slru.c that adds a new function to
> verify whether a page exists on disk.  I created this for the commit
> timestamp module, for the BDR branch, but I think it's what we need
> here.

Here's a patch that should fix the problem.  Jesse, if you're able to
test it, please give it a run and let me know if it works for you.  I
was able to upgrade an installation containing a problem that should
reproduce yours.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

extend-slru.patch

Re: 9.3beta2: Failure to pg_upgrade

From

Andres Freund

Date:

03 August 2013, 01:14:20

On 2013-08-02 18:17:43 -0400, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> 
> > As it turns out, I have a patched slru.c that adds a new function to
> > verify whether a page exists on disk.  I created this for the commit
> > timestamp module, for the BDR branch, but I think it's what we need
> > here.
> 
> Here's a patch that should fix the problem.  Jesse, if you're able to
> test it, please give it a run and let me know if it works for you.  I
> was able to upgrade an installation containing a problem that should
> reproduce yours.

Wouldn't it be easier to make pg_upgrade fudge pg_control to have a safe
NextMultiXactId/Offset using pg_resetxlog?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: 9.3beta2: Failure to pg_upgrade

From

Alvaro Herrera

Date:

03 August 2013, 02:25:44

Andres Freund wrote:
> On 2013-08-02 18:17:43 -0400, Alvaro Herrera wrote:
> > Alvaro Herrera wrote:
> > 
> > > As it turns out, I have a patched slru.c that adds a new function to
> > > verify whether a page exists on disk.  I created this for the commit
> > > timestamp module, for the BDR branch, but I think it's what we need
> > > here.
> > 
> > Here's a patch that should fix the problem.  Jesse, if you're able to
> > test it, please give it a run and let me know if it works for you.  I
> > was able to upgrade an installation containing a problem that should
> > reproduce yours.
> 
> Wouldn't it be easier to make pg_upgrade fudge pg_control to have a safe
> NextMultiXactId/Offset using pg_resetxlog?

I don't understand.  pg_upgrade already fudges pg_control to have a safe
next multi, namely the same value used by the old cluster.  The reason
to preserve this value is that we must ensure no older value is
consulted in pg_multixact: those might be present in tuples that were
locked in the old cluster.  (To be precise, this is the value to set as
oldest multi, not next multi.  But of course, the next multi must be
greater than that one.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: 9.3beta2: Failure to pg_upgrade

From

Jesse Denardo

Date:

03 August 2013, 03:34:37

Alvaro,

I applied the patch and tried upgrading again, and everything seemed to work as expected. We are now up and running the beta!

--
Jesse Denardo


Re: 9.3beta2: Failure to pg_upgrade

From

Bruce Momjian

Date:

03 August 2013, 03:47:09

On Fri, Aug  2, 2013 at 11:20:37PM -0400, Jesse Denardo wrote:
> Alvaro,
> 
> I applied the patch and tried upgrading again, and everything seemed to work as
> expected. We are now up and running the beta!

Yeah, great, thanks everyone!

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: [HACKERS] 9.3beta2: Failure to pg_upgrade

From

Andres Freund

Date:

03 August 2013, 04:08:42

On 2013-08-02 22:25:36 -0400, Alvaro Herrera wrote:
> Andres Freund wrote:
> > On 2013-08-02 18:17:43 -0400, Alvaro Herrera wrote:
> > > Alvaro Herrera wrote:
> > > 
> > > > As it turns out, I have a patched slru.c that adds a new function to
> > > > verify whether a page exists on disk.  I created this for the commit
> > > > timestamp module, for the BDR branch, but I think it's what we need
> > > > here.
> > > 
> > > Here's a patch that should fix the problem.  Jesse, if you're able to
> > > test it, please give it a run and let me know if it works for you.  I
> > > was able to upgrade an installation containing a problem that should
> > > reproduce yours.
> > 
> > Wouldn't it be easier to make pg_upgrade fudge pg_control to have a safe
> > NextMultiXactId/Offset using pg_resetxlog?
> 
> I don't understand.  pg_upgrade already fudges pg_control to have a safe
> next multi, namely the same value used by the old cluster.  The reason
> to preserve this value is that we must ensure no older value is
> consulted in pg_multixact: those might be present in tuples that were
> locked in the old cluster.  (To be precise, this is the value to set as
> oldest multi, not next multi.  But of course, the next multi must be
> greater than that one.)

I am suggesting setting them to greater values than in the old cluster,
computed so that they are guaranteed to fall on proper page boundaries.
Then the situation described upthread shouldn't occur anymore, right?
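
The rounding itself could look like the sketch below. It assumes 2048
offset entries per page (8192-byte pages, 4 bytes per entry), uses a
hypothetical function name, covers only the multixact ID and not the
member offset, and ignores wraparound near the top of the 32-bit range.

```c
#include <stdint.h>

#define OFFSETS_PER_PAGE 2048u  /* assumed: 8192-byte page / 4-byte entry */

/*
 * Hypothetical helper: smallest multixact ID >= multi that is the first
 * entry of a pg_multixact/offsets page.  Wraparound is not handled.
 */
uint32_t
round_up_to_offsets_page(uint32_t multi)
{
    uint32_t    rem = multi % OFFSETS_PER_PAGE;

    return rem == 0 ? multi : multi + (OFFSETS_PER_PAGE - rem);
}
```

For the NextMultiXactId of 2982 reported by pg_controldata this gives
4096. Whether starting exactly on a page boundary is enough for the
server to create the page on its own is the open question here; the
sketch only shows the arithmetic.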

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: 9.3beta2: Failure to pg_upgrade

From

Alvaro Herrera

Date:

19 August 2013, 16:57:52

Jesse Denardo wrote:
> Alvaro,
> 
> I applied the patch and tried upgrading again, and everything seemed to
> work as expected. We are now up and running the beta!

Pushed, thanks.


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services