Re: Streaming replication slave crash - Mailing list pgsql-general

From Mahlon E. Smith
Subject Re: Streaming replication slave crash
Date
Msg-id 20130909200423.GG56045@martini.nu
Whole thread Raw
In response to Streaming replication slave crash  (Quentin Hartman <qhartman@direwolfdigital.com>)
Responses Re: Streaming replication slave crash  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-general
[piggybackin' on older (seeming very similar) thread...]

On Fri, Mar 29, 2013, Quentin Hartman wrote:
> Yesterday morning, one of my streaming replication slaves running 9.2.3
> crashed with the following in the log file:
>
> 2013-03-28 12:49:30 GMT WARNING:  page 1441792 of relation base/63229/63370
> does not exist
> 2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index
> 1663/63229/109956; iblk 303, heap 1663/63229/63370;
> 2013-03-28 12:49:30 GMT PANIC:  WAL contains references to invalid pages
> 2013-03-28 12:49:30 GMT CONTEXT:  xlog redo delete: index


I've got a pair of 9.2.4 DBs here in a streaming + log shipping setup.
Everything was moving like clockwork until a couple of weeks ago, when
my slave died with the 'WAL contains references to invalid pages'
error.

After some wild googlin' "research", I saw the index visibility map fix
for 9.2.1.  We did pg_upgrade in between versions, but just to be sure I
wasn't somehow carrying corrupt data across versions (?), I went ahead
and VACUUMed everything with vacuum_freeze_table_age set to 0, and
went on with my life, hoping I had removed whatever demons were running
around in there.
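
For reference, the forced full-table VACUUM was roughly the following
(a sketch -- connection settings and environment are assumptions, adjust
for your setup):

```shell
# vacuum_freeze_table_age=0 forces VACUUM to do a full-table scan
# instead of skipping pages per the visibility map.  PGOPTIONS passes
# the setting to the server for vacuumdb's session via libpq.
PGOPTIONS="-c vacuum_freeze_table_age=0" vacuumdb --all --verbose
```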

Slave died again this weekend:

    [2013-09-08 02:29:14.712 GMT] 55340: <> : WARNING:  page 2424 of relation base/16411/2529375 is uninitialized
    [2013-09-08 02:29:14.712 GMT] 55340: <> : CONTEXT:  xlog redo vacuum: rel 1663/16411/2529375; blk 26937, lastBlockVacuumed 0
    [2013-09-08 02:29:14.712 GMT] 55340: <> : PANIC:  WAL contains references to invalid pages
    [2013-09-08 02:29:14.712 GMT] 55340: <> : CONTEXT:  xlog redo vacuum: rel 1663/16411/2529375; blk 26937, lastBlockVacuumed 0
    [2013-09-08 02:29:14.761 GMT] 55338: <> : LOG:  startup process (PID 55340) was terminated by signal 6: Abort trap
    [2013-09-08 02:29:14.761 GMT] 55338: <> : LOG:  terminating any other active server processes


Unlike Quentin's original message, simply restarting the slave didn't
bring it back to life.  I had to pg_start_backup/rsync again from the
master, at which point:

    cp: /usr/local/pgsql/wal/00000003.history: No such file or directory
    [2013-09-09 03:58:29.235 GMT] 96710: <> : LOG:  entering standby mode
    [2013-09-09 03:58:29.324 GMT] 96710: <> : LOG:  restored log file "0000000300000E1E0000004B" from archive
    [...]
    [2013-09-09 03:59:05.151 GMT] 96710: <> : LOG:  consistent recovery state reached at E1E/89BE3DD8
    [2013-09-09 03:59:05.152 GMT] 96708: <> : LOG:  database system is ready to accept read only connections
    cp: /usr/local/pgsql/wal/0000000300000E1E0000008A: No such file or directory
    [2013-09-09 03:59:05.154 GMT] 96710: <> : LOG:  unexpected pageaddr E1D/FB000000 in log file 3614, segment 138, offset 0
    cp: /usr/local/pgsql/wal/0000000300000E1E0000008A: No such file or directory
    [2013-09-09 03:59:05.209 GMT] 96786: <> : LOG:  streaming replication successfully connected to primary
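
The re-seed itself was the usual base backup dance, roughly like this
(a sketch only -- the hostname "master" and the paths are placeholders
for my setup):

```shell
# On the master: enter backup mode ('true' requests an immediate
# checkpoint), copy the data directory, then leave backup mode.
psql -h master -c "SELECT pg_start_backup('reseed', true)"
rsync -a --delete --exclude=pg_xlog --exclude=postmaster.pid \
    master:/usr/local/pgsql/data/ /usr/local/pgsql/data/
psql -h master -c "SELECT pg_stop_backup()"

# On the slave: with recovery.conf in place, start it back up; it
# replays archived WAL to a consistent point, then connects to stream.
pg_ctl -D /usr/local/pgsql/data start
```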


Seemingly okay again for now, though I'm anticipating similar failures
in the future.  The original relation it had a problem with is an index
on a child table of a partitioned set:


    [pgsql ~]$ oid2name -d db01 -o 2529375
    From database "db01":
      Filenode                Table Name
    ------------------------------------
       2529375  syslog_p_2013_32_seq_key

Since we partition on year/week_of_year, that table definitely shouldn't
have been receiving any writes at the time of the failure.  Also
interesting: both failures happened on a Saturday, around the time a
base backup is performed (though in both cases the failure came
afterwards, so this may be coincidence.)

Is there any other information I can gather that might help shed light
on what is going on?  Any recommendations for where to go from here?  My
master is fine through all of this, but I'm concerned I've got some
hidden corruption that will bite me when I least expect it.
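
In the meantime, one cheap sanity check I know of is a dump to nowhere,
which at least forces a full sequential read of every table:

```shell
# Reads all heap data via COPY; any page that can't be read will
# error out loudly.  Note this does NOT exercise the indexes, since
# pg_dump doesn't read through them.
pg_dump db01 > /dev/null
```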


pgsql=# select version();
                                                   version
-------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.4 on amd64-portbld-freebsd9.0, compiled by cc (GCC) 4.2.1 20070831 patched [FreeBSD], 64-bit


--
Mahlon E. Smith
http://www.martini.nu/contact.html

