Thread: 9.3beta2: Failure to pg_upgrade
Name: Jesse Denardo
Release: 9.2.2 -> 9.3beta2
Test Type: Install/Upgrade Test
Test Detail: pg_upgrade in a fresh install of 9.3beta2
Platform: Debian Linux 6.0.5
Installation Method: From source
Platform Detail: Debian Linux 6.0.5, 2.6.32.45-grsec-2.2.2-r3, x86_64
Test Procedure:

I made a byte-for-byte copy of our existing 9.2.2 Postgres directory (which
includes the data directory), changed the port, and started it up. I pointed
our dev application at the new port, and everything worked as expected.

I then followed the procedure outlined here:
http://www.postgresql.org/docs/9.3/static/pgupgrade.html

I installed 9.3beta2 into a new directory from source, installed PostGIS 2.1
and our required contrib modules, ran initdb, and ran pg_upgrade pointing at
the right spots. The exact commands I used were:

$ whoami
postgres
$ pwd
/home/postgres
$ 9.3beta2/bin/initdb 9.3beta2/data
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locales
  COLLATE: C
  CTYPE: en_US.UTF-8
  MESSAGES: en_US.UTF-8
  MONETARY: en_US.UTF-8
  NUMERIC: en_US.UTF-8
  TIME: en_US.UTF-8
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory 9.3beta2/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
creating configuration files ... ok
creating template1 database in 9.3beta2/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    9.3beta2/bin/postgres -D 9.3beta2/data
or
    9.3beta2/bin/pg_ctl -D 9.3beta2/data -l logfile start

(At this point, starting the new database succeeds.)

$ 9.3beta2/bin/pg_upgrade --old-bindir=/home/postgres/9.2_dev/bin --new-bindir=/home/postgres/9.3beta2/bin --old-datadir=/home/postgres/9.2_dev/data --new-datadir=/home/postgres/9.3beta2/data -u postgres
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is a superuser                       ok
Checking for prepared transactions                          ok
Checking for reg* system OID user data types                ok
Checking for contrib/isn with bigint-passing mismatch       ok
Creating dump of global objects                             ok
Creating dump of database schemas                           ok
Checking for presence of required libraries                 ok
Checking database user is a superuser                       ok
Checking for prepared transactions                          ok

If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows on the new cluster                        ok
Deleting files from new pg_clog                             ok
Copying old pg_clog to new server                           ok
Setting next transaction ID for new cluster                 ok
Setting oldest multixact ID on new cluster                  ok
Resetting WAL archives                                      ok

*failure*
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/home/postgres/.s.PGSQL.50432"?

could not connect to new postmaster started with the command:
"/home/postgres/9.3beta2/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/home/postgres/9.3beta2/data" -o "-p 50432 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/home/postgres'" start
Failure, exiting

Failure: Error, possible compatibility issue
Results:

See above. The pg_upgrade_server.log contains at the end:

command: "/home/postgres/9.3beta2/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/home/postgres/9.3beta2/data" -o "-p 50432 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/home/postgres'" start >> "pg_upgrade_server.log" 2>&1
waiting for server to start....LOG: database system was shut down at 2013-07-30 09:57:58 EDT
FATAL: could not access status of transaction 2983
DETAIL: Could not read from file "pg_multixact/offsets/0000" at offset 8192: Success.
LOG: startup process (PID 5239) exited with exit code 1
LOG: aborting startup due to startup process failure
.... stopped waiting
pg_ctl: could not start server
Examine the log output.

At this point, attempting to start the server fails with the same error until
I delete the data directory and re-initdb.

--
Jesse Denardo
On Tue, Jul 30, 2013 at 10:17:52AM -0400, Jesse Denardo wrote:
> Name: Jesse Denardo
> Release: 9.2.2 -> 9.3beta2
> Test Type: Install/Upgrade Test
> Test Detail: pg_upgrade in a fresh install of 9.3beta2
> Platform: Debian Linux 6.0.5
> Installation Method: From source
> Platform Detail: Debian Linux 6.0.5, 2.6.32.45-grsec-2.2.2-r3, x86_64
> Test Procedure:
>
> I made a byte-for-byte copy of our existing 9.2.2 Postgres directory (which
> includes the data directory), changed the port, and started it up. I pointed

I assume you did this while the server was down.

> command: "/home/postgres/9.3beta2/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D
> "/home/postgres/9.3beta2/data" -o "-p 50432 -b -c synchronous_commit=off -c
> fsync=off -c full_page_writes=off -c listen_addresses='' -c
> unix_socket_permissions=0700 -c unix_socket_directories='/home/postgres'" start
> >> "pg_upgrade_server.log" 2>&1
> waiting for server to start....LOG: database system was shut down at
> 2013-07-30 09:57:58 EDT
> FATAL: could not access status of transaction 2983
> DETAIL: Could not read from file "pg_multixact/offsets/0000" at offset 8192:
> Success.

OK, I actually have an idea on this. Here is the pg_upgrade code:

    /*
     * If the old server is before the MULTIXACT_FORMATCHANGE_CAT_VER change
     * (see pg_upgrade.h) and the new server is after, then we don't copy
     * pg_multixact files, but we need to reset pg_control so that the new
     * server doesn't attempt to read multis older than the cutoff value.
     */
    if (old_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
        new_cluster.controldata.cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER)
    {
        copy_subdir_files("pg_multixact/offsets");
        copy_subdir_files("pg_multixact/members");
        prep_status("Setting next multixact ID and offset for new cluster");

        /*
         * we preserve all files and contents, so we must preserve both "next"
         * counters here and the oldest multi present on system.
         */
        exec_prog(UTILITY_LOG_FILE, NULL, true,
                  "\"%s/pg_resetxlog\" -O %u -m %u,%u \"%s\"",
                  new_cluster.bindir,
                  old_cluster.controldata.chkpnt_nxtmxoff,
                  old_cluster.controldata.chkpnt_nxtmulti,
                  old_cluster.controldata.chkpnt_oldstMulti,
                  new_cluster.pgdata);
        check_ok();
    }

and the C comment is:

    /*
     * pg_multixact format changed in 9.3 commit 0ac5ad5134f2769ccbaefec73844f85,
     * ("Improve concurrency of foreign key locking") which also updated catalog
     * version to this value. pg_upgrade behavior depends on whether old and new
     * server versions are both newer than this, or only the new one is.
     */
    #define MULTIXACT_FORMATCHANGE_CAT_VER 201301231

So, first, this is new in 9.3, and second, it seems the comment "we need
to reset pg_control so that the new server doesn't attempt to read
multis older than the cutoff value" is not working. Alvaro, can you
comment on this? I think you added this code with this commit:

    commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
    Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
    Date: Wed Jan 23 12:04:59 2013 -0300
    ...
    pg_upgrade also needs to be careful to copy pg_multixact files over from
    the old server to the new, or at least part of multixact.c state,
    depending on the versions of the old and new servers.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Bruce Momjian wrote:
> On Tue, Jul 30, 2013 at 10:17:52AM -0400, Jesse Denardo wrote:

> So, first, this is new in 9.3, and second, it seems the comment "we need
> to reset pg_control so that the new server doesn't attempt to read
> multis older than the cutoff value" is not working. Alvaro, can you
> comment on this? I think you added this code with this commit:

So it seems. I will have a look.

Jesse, can you please supply pg_controldata output for the PGDATA you're
upgrading?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 31, 2013 at 12:56:33PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Tue, Jul 30, 2013 at 10:17:52AM -0400, Jesse Denardo wrote:
>
> > So, first, this is new in 9.3, and second, it seems the comment "we need
> > to reset pg_control so that the new server doesn't attempt to read
> > multis older than the cutoff value" is not working. Alvaro, can you
> > comment on this? I think you added this code with this commit:
>
> So it seems. I will have a look.

Well, the good news is that this is new 9.3 code, this bug was caught
during beta, and pg_upgrade failed visibly, rather than silently.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
$ 9.2_dev/bin/pg_controldata data
pg_control version number: 922
Catalog version number: 201204301
Database system identifier: 5789770930315980286
Database cluster state: shut down
pg_control last modified: Tue 30 Jul 2013 09:57:54 AM EDT
Latest checkpoint location: 6C/F8EE7500
Prior checkpoint location: 6C/F8E90C18
Latest checkpoint's REDO location: 6C/F8EE7500
Latest checkpoint's TimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/15960984
Latest checkpoint's NextOID: 4747737
Latest checkpoint's NextMultiXactId: 2982
Latest checkpoint's NextMultiOffset: 6479
Latest checkpoint's oldestXID: 1761
Latest checkpoint's oldestXID's DB: 12843
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Tue 30 Jul 2013 09:57:53 AM EDT
Minimum recovery ending location: 0/0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: minimal
Current max_connections setting: 100
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 256
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value

--
Jesse Denardo

On Wed, Jul 31, 2013 at 5:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Jul 31, 2013 at 12:56:33PM -0400, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> > > On Tue, Jul 30, 2013 at 10:17:52AM -0400, Jesse Denardo wrote:
> >
> > > So, first, this is new in 9.3, and second, it seems the comment "we need
> > > to reset pg_control so that the new server doesn't attempt to read
> > > multis older than the cutoff value" is not working. Alvaro, can you
> > > comment on this? I think you added this code with this commit:
> >
> > So it seems. I will have a look.
>
> Well, the good news is that this is new 9.3 code, this bug was caught
> during beta, and pg_upgrade failed visibly, rather than silently.
>
> --
> Bruce Momjian <bruce@momjian.us> http://momjian.us
> EnterpriseDB http://enterprisedb.com
>
> + It's impossible for everything to be true. +
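
For reference, the catalog version reported above (201204301) is lower than the MULTIXACT_FORMATCHANGE_CAT_VER value quoted earlier in the thread (201301231), so the copy branch in the quoted pg_upgrade code would presumably not be taken for this upgrade. The sketch below only restates that comparison; the helper name would_copy_multixact_files is hypothetical and not part of pg_upgrade, and the new cluster's catalog version is not given in the thread, so any value at or above the cutoff is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value quoted from pg_upgrade.h earlier in this thread. */
#define MULTIXACT_FORMATCHANGE_CAT_VER 201301231

/*
 * Hypothetical helper, not part of pg_upgrade: returns true only when both
 * clusters are at or past the pg_multixact format change, which is the
 * condition under which the quoted branch copies the pg_multixact files.
 */
static bool
would_copy_multixact_files(uint32_t old_cat_ver, uint32_t new_cat_ver)
{
    return old_cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER &&
           new_cat_ver >= MULTIXACT_FORMATCHANGE_CAT_VER;
}
```

With old_cat_ver = 201204301 the function returns false whatever the new cluster's version is, which is consistent with the files not being copied in this report.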
Jesse Denardo wrote:

> $ 9.2_dev/bin/pg_controldata data
> Latest checkpoint's NextMultiXactId: 2982
> Latest checkpoint's NextMultiOffset: 6479

So what's happening here is that MultiXact 2982 lives in an SLRU page
that doesn't exist. pg_upgrade didn't copy the pg_multixact files from
the old cluster, because they are not compatible; instead it just sets
the values in pg_control. As soon as a new multixact is to be created,
things fail because the code is not prepared to deal with the
possibility that the underlying SLRU files have not been extended during
normal operation.

I see two ways to deal with this:

1. On each multixact creation, verify whether the pages we're trying to
   modify do in fact exist. If they don't, create them.

2. At startup, verify the "next" multixact values, and extend the files
   if necessary.

I think (1) is not a very good idea because it will cause too large an
impact at runtime, when it is not really necessary. I lean more towards
(2). On IM, Bruce suggested instead:

2a. Same as (2), but only do it in pg_upgrade's usage of postgres'
    binary-upgrade mode (postgres -b). Thus this will be done once
    during the upgrade process and not every time the system starts up.

As it turns out, I have a patched slru.c that adds a new function to
verify whether a page exists on disk. I created this for the commit
timestamp module, for the BDR branch, but I think it's what we need
here.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
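
To make the "page that doesn't exist" point concrete, here is a rough sketch of how a multixact ID maps to a location in pg_multixact/offsets. The constants are assumptions rather than values taken from the thread, apart from the 8192-byte block size shown in the pg_controldata output: 4-byte offset entries (so 2048 entries per page) and 32 pages per SLRU segment file. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout constants; only the block size appears in the thread. */
#define BLCKSZ            8192
#define OFFSET_ENTRY_SIZE 4
#define OFFSETS_PER_PAGE  (BLCKSZ / OFFSET_ENTRY_SIZE)  /* 2048 entries */
#define PAGES_PER_SEGMENT 32

/* Hypothetical result type: where an offsets entry would live on disk. */
typedef struct
{
    uint32_t page;         /* SLRU page number */
    uint32_t segment;      /* segment file number (0 for file "0000") */
    uint32_t file_offset;  /* byte offset of that page inside the segment */
} OffsetsLocation;

/* Map a multixact ID to its page, segment file and byte offset. */
static OffsetsLocation
locate_multixact_offset(uint32_t multi)
{
    OffsetsLocation loc;

    loc.page = multi / OFFSETS_PER_PAGE;
    loc.segment = loc.page / PAGES_PER_SEGMENT;
    loc.file_offset = (loc.page % PAGES_PER_SEGMENT) * BLCKSZ;
    return loc;
}
```

Under these assumptions, multixact 2982 (and 2983 from the FATAL message) falls on page 1 of segment 0 at byte offset 8192, which lines up with the reported failure to read "pg_multixact/offsets/0000" at offset 8192.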
Alvaro Herrera wrote:

> As it turns out, I have a patched slru.c that adds a new function to
> verify whether a page exists on disk. I created this for the commit
> timestamp module, for the BDR branch, but I think it's what we need
> here.

Here's a patch that should fix the problem. Jesse, if you're able to
test it, please give it a run and let me know if it works for you. I
was able to upgrade an installation containing a problem that should
reproduce yours.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
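
The patch itself is not reproduced in this thread. As a loose illustration of what "verify whether a page exists on disk" could mean, the sketch below checks whether a segment file is long enough to contain a given page. It is not the slru.c function from the patch: the name segment_contains_page is hypothetical, it uses plain POSIX stat() instead of PostgreSQL's file APIs, and the 8192-byte page size and 32 pages per segment are assumptions.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Assumed layout constants, see the lead-in above. */
#define BLCKSZ            8192
#define PAGES_PER_SEGMENT 32

/*
 * Hypothetical helper: returns true if the segment file at segment_path is
 * large enough to hold page number pageno, false if the file is missing or
 * too short.
 */
static bool
segment_contains_page(const char *segment_path, unsigned int pageno)
{
    struct stat st;
    /* Bytes the file needs in order to include the whole requested page. */
    off_t needed = (off_t) ((pageno % PAGES_PER_SEGMENT) + 1) * BLCKSZ;

    if (stat(segment_path, &st) != 0)
        return false;           /* no segment file, so no page */

    return st.st_size >= needed;
}
```

A startup-time check along the lines of option (2) could call something like this for the page holding the "next" multixact and extend the file with a zeroed page when it returns false.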
On 2013-08-02 18:17:43 -0400, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>
> > As it turns out, I have a patched slru.c that adds a new function to
> > verify whether a page exists on disk. I created this for the commit
> > timestamp module, for the BDR branch, but I think it's what we need
> > here.
>
> Here's a patch that should fix the problem. Jesse, if you're able to
> test it, please give it a run and let me know if it works for you. I
> was able to upgrade an installation containing a problem that should
> reproduce yours.

Wouldn't it be easier to make pg_upgrade fudge pg_control to have a safe
NextMultiXactId/Offset using pg_resetxlog?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> On 2013-08-02 18:17:43 -0400, Alvaro Herrera wrote:
> > Alvaro Herrera wrote:
> >
> > > As it turns out, I have a patched slru.c that adds a new function to
> > > verify whether a page exists on disk. I created this for the commit
> > > timestamp module, for the BDR branch, but I think it's what we need
> > > here.
> >
> > Here's a patch that should fix the problem. Jesse, if you're able to
> > test it, please give it a run and let me know if it works for you. I
> > was able to upgrade an installation containing a problem that should
> > reproduce yours.
>
> Wouldn't it be easier to make pg_upgrade fudge pg_control to have a safe
> NextMultiXactId/Offset using pg_resetxlog?

I don't understand. pg_upgrade already fudges pg_control to have a safe
next multi, namely the same value used by the old cluster. The reason
to preserve this value is that we must ensure no older value is
consulted in pg_multixact: those might be present in tuples that were
locked in the old cluster. (To be precise, this is the value to set as
oldest multi, not next multi. But of course, the next multi must be
greater than that one.)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro,
I applied the patch and tried upgrading again, and everything seemed to work as expected. We are now up and running the beta!
--
Jesse Denardo
On Fri, Aug 2, 2013 at 10:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Andres Freund wrote:
> On 2013-08-02 18:17:43 -0400, Alvaro Herrera wrote:
> > Alvaro Herrera wrote:
> >
> > > As it turns out, I have a patched slru.c that adds a new function to
> > > verify whether a page exists on disk. I created this for the commit
> > > timestamp module, for the BDR branch, but I think it's what we need
> > > here.
> >
> > Here's a patch that should fix the problem. Jesse, if you're able to
> > test it, please give it a run and let me know if it works for you. I
> > was able to upgrade an installation containing a problem that should
> > reproduce yours.
>
> Wouldn't it be easier to make pg_upgrade fudge pg_control to have a safe
> NextMultiXactId/Offset using pg_resetxlog?

I don't understand. pg_upgrade already fudges pg_control to have a safe
next multi, namely the same value used by the old cluster. The reason
to preserve this value is that we must ensure no older value is
consulted in pg_multixact: those might be present in tuples that were
locked in the old cluster. (To be precise, this is the value to set as
oldest multi, not next multi. But of course, the next multi must be
greater than that one.)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 2, 2013 at 11:20:37PM -0400, Jesse Denardo wrote:
> Alvaro,
>
> I applied the patch and tried upgrading again, and everything seemed to work as
> expected. We are now up and running the beta!

Yeah, great, thanks everyone!

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
On 2013-08-02 22:25:36 -0400, Alvaro Herrera wrote:
> Andres Freund wrote:
> > On 2013-08-02 18:17:43 -0400, Alvaro Herrera wrote:
> > > Alvaro Herrera wrote:
> > >
> > > > As it turns out, I have a patched slru.c that adds a new function to
> > > > verify whether a page exists on disk. I created this for the commit
> > > > timestamp module, for the BDR branch, but I think it's what we need
> > > > here.
> > >
> > > Here's a patch that should fix the problem. Jesse, if you're able to
> > > test it, please give it a run and let me know if it works for you. I
> > > was able to upgrade an installation containing a problem that should
> > > reproduce yours.
> >
> > Wouldn't it be easier to make pg_upgrade fudge pg_control to have a safe
> > NextMultiXactId/Offset using pg_resetxlog?
>
> I don't understand. pg_upgrade already fudges pg_control to have a safe
> next multi, namely the same value used by the old cluster. The reason
> to preserve this value is that we must ensure no older value is
> consulted in pg_multixact: those might be present in tuples that were
> locked in the old cluster. (To be precise, this is the value to set as
> oldest multi, not next multi. But of course, the next multi must be
> greater than that one.)

I am suggesting setting them to a greater value than in the old cluster,
computed so that they are guaranteed to fall on proper page boundaries.
Then the situation described upthread shouldn't occur anymore, right?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
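
One possible reading of this suggestion, sketched for the next multixact ID only: round the old cluster's value up to the first ID of the following offsets page, presumably so that the server starts on a fresh page instead of in the middle of one that was never written. The function name is hypothetical, the 2048 entries per page figure is an assumption (8192-byte pages, 4-byte entries), wraparound near the top of the 32-bit range is ignored, and the members offset would need its own analogous treatment, which is not shown.

```c
#include <stdint.h>

/* Assumed number of offset entries per SLRU page (8192 / 4). */
#define OFFSETS_PER_PAGE 2048

/*
 * Hypothetical helper: round a multixact ID up to the first ID on the next
 * offsets page, leaving it unchanged if it already sits on a page boundary.
 * Overflow near UINT32_MAX is deliberately not handled in this sketch.
 */
static uint32_t
round_up_to_offsets_page(uint32_t multi)
{
    uint32_t rem = multi % OFFSETS_PER_PAGE;

    return rem == 0 ? multi : multi + (OFFSETS_PER_PAGE - rem);
}
```

For the NextMultiXactId of 2982 reported earlier, this would give 4096, which is greater than the old value and sits at the start of page 2.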
Jesse Denardo wrote:
> Alvaro,
>
> I applied the patch and tried upgrading again, and everything seemed to
> work as expected. We are now up and running the beta!

Pushed, thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services