BUG #8043: 9.2.4 doesn't open WAL files from archive, only looks in pg_xlog - Mailing list pgsql-bugs

From bohmer@visionlink.org
Subject BUG #8043: 9.2.4 doesn't open WAL files from archive, only looks in pg_xlog
Date
Msg-id E1UOCIL-0004ll-Qt@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #8043: 9.2.4 doesn't open WAL files from archive, only looks in pg_xlog  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      8043
Logged by:          Jeff Bohmer
Email address:      bohmer@visionlink.org
PostgreSQL version: 9.2.4
Operating system:   CentOS 5.9 x86_64 kernel 2.6.18-348.3.1.el5
Description:        =


Hi pgsql-bugs,

PG version: 9.2.4, from yum.postgresql.org
OS: CentOS 5.9 x86_64 kernel 2.6.18-348.3.1.el5

Upgrading from 9.2.3 to 9.2.4 has broken archive recovery for me. Probably
related to this 9.2.4 change:

    Ensure we do crash recovery before entering archive recovery,
    if the database was not stopped cleanly and a recovery.conf
    file is present (Heikki Linnakangas, Kyotaro Horiguchi,
    Mitsumasa Kondo)

When starting the PostgreSQL 9.2.4 service on a base backup, I get this:

2013-04-05 12:49:04 MDT [10302]: [1-1] user=3D,db=3D LOG:  database system =
was
interrupted; last known up at 2013-04-05 10:02:01 MDT
2013-04-05 12:49:04 MDT [10302]: [2-1] user=3D,db=3D LOG:  starting archive
recovery
2013-04-05 12:49:04 MDT [10302]: [3-1] user=3D,db=3D LOG:  could not open f=
ile
"pg_xlog/000000010000002F0000002D" (log file 47, segment 45): No such file
or directory
2013-04-05 12:49:04 MDT [10302]: [4-1] user=3D,db=3D LOG:  invalid primary
checkpoint record
2013-04-05 12:49:04 MDT [10302]: [5-1] user=3D,db=3D LOG:  could not open f=
ile
"pg_xlog/000000010000002F0000002C" (log file 47, segment 44): No such file
or directory
2013-04-05 12:49:04 MDT [10302]: [6-1] user=3D,db=3D LOG:  invalid secondary
checkpoint record
2013-04-05 12:49:04 MDT [10302]: [7-1] user=3D,db=3D PANIC:  could not loca=
te a
valid checkpoint record
2013-04-05 12:49:04 MDT [10297]: [2-1] user=3D,db=3D LOG:  startup process =
(PID
10302) was terminated by signal 6: Aborted
2013-04-05 12:49:04 MDT [10297]: [3-1] user=3D,db=3D LOG:  aborting startup=
 due
to startup process failure

The WAL file 000000010000002F0000002D does exist in my WAL archive, but not
in pg_xlog. I exclude pg_xlog files when taking the base backup, per the
instructions from
<http://www.postgresql.org/docs/9.2/static/continuous-archiving.html>:

    You can, however, omit from the backup dump the files within
    the cluster's pg_xlog/ subdirectory. This slight adjustment
    is worthwhile because it reduces the risk of mistakes when
    restoring.

I use a custom base backup script to call pg_start/stop_backup() and make
the backup with rsync.

The restore_command in recovery.conf is never called by PG 9.2.4 during
startup. I confirmed this by adding a "touch /tmp/restore_command.`date
+%H:%M:%S`" line at the beginning of the shell script I use for my
restore_command. No such files are created when starting PG 9.2.4.

After downgrading back to 9.2.3, archive recovery works using the very same
base backup, recovery.conf file, and restore_command. The log indicates that
PG 9.2.3 begins recovery by pulling WAL files from the archive instead of
pg_xlog:

2013-04-05 13:01:14 MDT [16981]: [1-1] user=3D,db=3D LOG:  database system =
was
interrupted; last known up at 2013-04-05 10:02:01 MDT
2013-04-05 13:01:14 MDT [16981]: [2-1] user=3D,db=3D LOG:  starting archive
recovery
2013-04-05 13:01:14 MDT [16981]: [3-1] user=3D,db=3D LOG:  restored log file
"000000010000002F0000002D" from archive
2013-04-05 13:01:14 MDT [16981]: [4-1] user=3D,db=3D LOG:  consistent recov=
ery
state reached at 2F/2D000080
2013-04-05 13:01:14 MDT [16981]: [5-1] user=3D,db=3D LOG:  redo starts at
2F/2D000080
2013-04-05 13:01:15 MDT [16981]: [6-1] user=3D,db=3D LOG:  restored log file
"000000010000002F0000002E" from archive
2013-04-05 13:01:15 MDT [16981]: [7-1] user=3D,db=3D LOG:  restored log file
"000000010000002F0000002F" from archive
<snip>
2013-04-05 13:01:17 MDT [16981]: [25-1] user=3D,db=3D LOG:  redo done at
2F/3F07B4D0
2013-04-05 13:01:17 MDT [16981]: [26-1] user=3D,db=3D LOG:  last completed
transaction was at log time 2013-04-05 12:53:01.069826-06
2013-04-05 13:01:17 MDT [16981]: [27-1] user=3D,db=3D LOG:  restored log fi=
le
"000000010000002F0000003F" from archive
2013-04-05 13:01:17 MDT [16981]: [28-1] user=3D,db=3D LOG:  selected new
timeline ID: 2
2013-04-05 13:01:17 MDT [16981]: [29-1] user=3D,db=3D LOG:  archive recovery
complete
2013-04-05 13:01:17 MDT [16991]: [1-1] user=3D,db=3D LOG:  checkpoint start=
ing:
end-of-recovery immediate wait
2013-04-05 13:01:17 MDT [16991]: [2-1] user=3D,db=3D LOG:  checkpoint compl=
ete:
wrote 327 buffers (0.1%); 0 transaction log file(s) ad
ded, 0 removed, 0 recycled; write=3D0.015 s, sync=3D0.000 s, total=3D0.035 =
s; sync
files=3D0, longest=3D0.000 s, average=3D0.000 s
2013-04-05 13:01:17 MDT [16978]: [2-1] user=3D,db=3D LOG:  database system =
is
ready to accept connections

So, the behavior has definitely changed between 9.2.3 and 9.2.4. Is this a
bug in 9.2.4?

Or, must I now include pg_xlog files when taking base backups with 9.2.4,
contrary to the documentation?

Or, is there a way to make PG 9.2.4 use restore_command for recovery, as
9.2.3 does?

Thank you,
- Jeff

pgsql-bugs by date:

Previous
From: Sergey Konoplev
Date:
Subject: Re: Completely broken replica after PANIC: WAL contains references to invalid pages
Next
From: Jeff Janes
Date:
Subject: Re: BUG #8043: 9.2.4 doesn't open WAL files from archive, only looks in pg_xlog