Thread: Bogus error handling in pg_upgrade

Bogus error handling in pg_upgrade

From

Tom Lane

Date:

29 December 2013, 08:25:13

A credulous person might suppose that this chunk of code is designed
to abort if pg_resetxlog fails:
   prep_status("Setting next transaction ID for new cluster");   exec_prog(UTILITY_LOG_FILE, NULL, true,
"\"%s/pg_resetxlog\"-f -x %u \"%s\"",             new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
new_cluster.pgdata);   check_ok();
 

In point of fact, it does no such thing, but blithely continues
(even though pg_resetxlog has corrupted things horribly before failing).

check_ok() is particularly badly named, since it contains not one iota
of error checking.  misleadingly_claim_ok() would be a better name.

If this isn't broken-by-design, I'd like an explanation why not.

In case you're wondering, I'm investigating the problem mentioned
at <1387636762.30013.13.camel@vanquo.pezone.net>.  I see this output:

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows on the new cluster                        ok
Deleting files from new pg_clog                             ok
Copying old pg_clog to new server                           ok
Setting next transaction ID for new cluster                 
*failure*

Consult the last few lines of "pg_upgrade_utility.log" for
the probable cause of the failure.
ok
Deleting files from new pg_multixact/offsets                ok
Copying old pg_multixact/offsets to new server              ok
Deleting files from new pg_multixact/members                ok
Copying old pg_multixact/members to new server              ok
Setting next multixact ID and offset for new cluster        ok
Resetting WAL archives                                      
*failure*

Consult the last few lines of "pg_upgrade_utility.log" for
the probable cause of the failure.
ok


*failure*
Consult the last few lines of "pg_upgrade_server_start.log" or "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory       Is the server running
locallyand accepting       connections on Unix domain socket "/Users/tgl/pgsql/contrib/pg_upgrade/.s.PGSQL.57632"?
 


could not connect to new postmaster started with the command:
"/Users/tgl/pgsql/contrib/pg_upgrade/tmp_check/install//Users/tgl/testversion/bin/pg_ctl" -w -l "pg_upgrade_server.log"
-D"/Users/tgl/pgsql/contrib/pg_upgrade/tmp_check/data" -o "-p 57632 -b -c synchronous_commit=off -c fsync=off -c
full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c
unix_socket_directories='/Users/tgl/pgsql/contrib/pg_upgrade'"start
 
Failure, exiting
make: *** [check] Error 1

I think the actual problem is that pg_resetxlog rewrites pg_control, zaps
everything in pg_xlog/, and then fails before writing a new initial xlog
segment.  However, pg_upgrade isn't making this any easier to investigate
by failing to stop at the first sign of trouble.
        regards, tom lane

Re: Bogus error handling in pg_upgrade

From

Robert Haas

Date:

02 January 2014, 05:37:31

On Sun, Dec 29, 2013 at 12:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> check_ok() is particularly badly named, since it contains not one iota
> of error checking.  misleadingly_claim_ok() would be a better name.

That's pretty hilarious, actually.  I think it probably started as a
copy of initdb.c's check_ok(), and then at some point along the line
it got its heart ripped out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Bogus error handling in pg_upgrade

From

Bruce Momjian

Date:

10 January 2014, 07:07:58

On Sun, Dec 29, 2013 at 12:25:04AM -0500, Tom Lane wrote:
> A credulous person might suppose that this chunk of code is designed
> to abort if pg_resetxlog fails:
> 
>     prep_status("Setting next transaction ID for new cluster");
>     exec_prog(UTILITY_LOG_FILE, NULL, true,
>               "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
>               new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
>               new_cluster.pgdata);
>     check_ok();
> 
> In point of fact, it does no such thing, but blithely continues
> (even though pg_resetxlog has corrupted things horribly before failing).

Well, exec_prog() does this:
   result = system(cmd);
   if (result != 0)

So, is pg_resetxlog returning a zero value?  I am guessing it is.

> check_ok() is particularly badly named, since it contains not one iota
> of error checking.  misleadingly_claim_ok() would be a better name.
> 
> If this isn't broken-by-design, I'd like an explanation why not.

It is probably because it came from initdb.c, but I always read check_ok
as report_ok.  Should I rename it?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +