Thread: Streaming replication slave crash
Yesterday morning, one of my streaming replication slaves running 9.2.3 crashed with the following in the log file:
2013-03-28 12:49:30 GMT WARNING: page 1441792 of relation base/63229/63370 does not exist
2013-03-28 12:49:30 GMT CONTEXT: xlog redo delete: index 1663/63229/109956; iblk 303, heap 1663/63229/63370;
2013-03-28 12:49:30 GMT PANIC: WAL contains references to invalid pages
2013-03-28 12:49:30 GMT CONTEXT: xlog redo delete: index 1663/63229/109956; iblk 303, heap 1663/63229/63370;
2013-03-28 12:49:31 GMT LOG: startup process (PID 22941) was terminated by signal 6: Aborted
2013-03-28 12:49:31 GMT LOG: terminating any other active server processes
2013-03-28 12:49:31 GMT WARNING: terminating connection because of crash of another server process
2013-03-28 12:49:31 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2013-03-28 12:49:31 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command.
2013-03-28 12:57:44 GMT LOG: database system was interrupted while in recovery at log time 2013-03-28 12:37:42 GMT
2013-03-28 12:57:44 GMT HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2013-03-28 12:57:44 GMT LOG: entering standby mode
2013-03-28 12:57:44 GMT LOG: redo starts at 19/2367CE30
2013-03-28 12:57:44 GMT LOG: incomplete startup packet
2013-03-28 12:57:44 GMT LOG: consistent recovery state reached at 19/241835B0
2013-03-28 12:57:44 GMT LOG: database system is ready to accept read only connections
2013-03-28 12:57:44 GMT LOG: invalid record length at 19/2419EE38
2013-03-28 12:57:44 GMT LOG: streaming replication successfully connected to primary
As you can see, I was able to restart it and it picked up and synchronized right away, but this crash still concerns me.
The DB has about 75GB of data in it, and it is almost entirely write traffic. It's essentially a log aggregator. I believe it was doing a pg_dump backup at the time of the crash. It has hot_standby_feedback on to allow that process to complete.
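For reference, a quick way to confirm the standby's role and those settings from psql is something like this (a sketch; these are just stock status checks, nothing specific to my setup):

    -- run on the standby
    SELECT pg_is_in_recovery();       -- expect 't' on a standby
    SHOW hot_standby;                 -- expect 'on'
    SHOW hot_standby_feedback;        -- expect 'on'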
Any insights into this, or advice on figuring out the root of it would be appreciated. So far all the things I've found like this are bugs that should be fixed in this version, or the internet equivalent of a shrug.
Thanks!
QH
Looks like you've got some form of corruption:

    page 1441792 of relation base/63229/63370 does not exist

The question is whether it was corrupted on the master and then replicated to the slave, or whether it was corrupted on the slave. I'd guess that the pg_dump tried to read from that page and barfed. It would be interesting to re-run the pg_dump to see if this crash can be replicated. If so, does it also replicate if you run pg_dump against the master? If not, then the corruption is isolated to the slave, and you might have a hardware problem that is causing the data to get corrupted.

--
L. Friedman                                    netllama@gmail.com
LlamaLand                          https://netllama.linux-sxs.org
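A concrete way to run that comparison is something like this (a sketch; the host and database names are placeholders, and the output is discarded since the point is only whether pg_dump can read every page):

    # against the slave
    pg_dump -h standby-host mydb > /dev/null
    # against the master
    pg_dump -h master-host mydb > /dev/null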
On Fri, Mar 29, 2013 at 10:23 AM, Lonni J Friedman <netllama@gmail.com> wrote:
Thanks for the insight. I thought that might be it, but never having seen this before I'm glad to have some confirmation.
Yes, we've gotten several clean dumps from the slave since then w/o crashing. We're running these machines on EC2, so we sadly have no control over the hardware. With your confirmation, and an apparently clean state now, I'm inclined to chalk this up to an EC2 hiccup getting caught by Postgres and get on with life.
Thanks!
QH
Quentin Hartman <qhartman@direwolfdigital.com> writes:
> Yesterday morning, one of my streaming replication slaves running 9.2.3
> crashed with the following in the log file:

What process did you use for setting up the slave?

There's a fix awaiting release in 9.2.4 that might explain data corruption on a slave, depending on how it was set up:

* Ensure we do crash recovery before entering archive recovery, if the
  database was not stopped cleanly and a recovery.conf file is present
  (Heikki Linnakangas, Kyotaro Horiguchi, Mitsumasa Kondo)

  This is needed to ensure that the database is consistent in certain
  scenarios, such as initializing a standby server with a filesystem
  snapshot from a running server.

This theory would be more probable if it's a relatively new slave, since any corruption would have been there in the slave's initial state, just waiting for the replay to run into it.

			regards, tom lane
On Fri, Mar 29, 2013 at 10:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I used an rsync from the master while both were stopped.
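(Roughly along these lines; a sketch rather than the exact commands, with the data directory paths and the master host name as placeholders:)

    # stop both clusters cleanly
    pg_ctl -D /var/lib/pgsql/9.2/data stop -m fast

    # on the slave, copy the data directory from the master
    rsync -a --delete master:/var/lib/pgsql/9.2/data/ /var/lib/pgsql/9.2/data/

    # on the slave, add recovery.conf with standby_mode = 'on' and
    # primary_conninfo pointing at the master, then start postgres again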
It's newish. I upgraded this pair from 9.1.x to 9.2.3 a little over a week ago, so did a dump/reload and resynced then. It was running happily between then and now.
Thanks!
QH
Quentin Hartman <qhartman@direwolfdigital.com> writes:
> On Fri, Mar 29, 2013 at 10:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What process did you use for setting up the slave?

> I used an rsync from the master while both were stopped.

If the master was shut down cleanly (not -m immediate) then the bug fix I was thinking about wouldn't explain this. The fact that the panic didn't recur after restarting seems to void that theory as well. I'm not sure what to make of that angle.

Can you determine which table is being complained of in the failure message, i.e., what has relfilenode 63370 in database 63229? If so, it would be interesting to know what was being done to that table on the master.

			regards, tom lane
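For what it's worth, that lookup can be done with queries along these lines (a sketch; the second query has to be run while connected to the database whose OID is 63229):

    -- which database has OID 63229?
    SELECT datname FROM pg_database WHERE oid = 63229;

    -- connected to that database: which relation has filenode 63370?
    SELECT oid::regclass AS relation
    FROM pg_class
    WHERE relfilenode = 63370;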
On Fri, Mar 29, 2013 at 10:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yes, it was shut down cleanly. A good thought, but I don't think it's relevant in this case.
Good point! Looking deeper into that, it's actually one of our smaller tables, and it doesn't seem to have any corruption, on either server. I was able to select all the records from it and the content seems sane. The only thing that would have been happening on that table is an INSERT or UPDATE.
I think I'm going to run with the spurious EC2 hiccup explanation. I'm comfortable with that, given the extra due diligence I've done with your (and Lonni's) guidance.
Thanks!
QH
[piggybackin' on an older (seemingly very similar) thread...]

On Fri, Mar 29, 2013, Quentin Hartman wrote:
> Yesterday morning, one of my streaming replication slaves running 9.2.3
> crashed with the following in the log file:
>
> 2013-03-28 12:49:30 GMT WARNING: page 1441792 of relation base/63229/63370 does not exist
> 2013-03-28 12:49:30 GMT CONTEXT: xlog redo delete: index 1663/63229/109956; iblk 303, heap 1663/63229/63370;
> 2013-03-28 12:49:30 GMT PANIC: WAL contains references to invalid pages

I've got a pair of 9.2.4 DBs here in a streaming + log shipping setup. Everything was moving like clockwork until a couple of weeks ago, at which point my slave died with the 'WAL contains references to invalid pages' error.

After some wild googlin' "research", I saw the index visibility map fix for 9.2.1. We did pg_upgrade in between versions, but just to be sure I wasn't somehow carrying corrupt data across versions (?), I went ahead and VACUUMed everything with vacuum_freeze_table_age set to 0, and went on with my life, hoping I had removed whatever demons were running around in there.

The slave died again this weekend:

[2013-09-08 02:29:14.712 GMT] 55340: <> : WARNING: page 2424 of relation base/16411/2529375 is uninitialized
[2013-09-08 02:29:14.712 GMT] 55340: <> : CONTEXT: xlog redo vacuum: rel 1663/16411/2529375; blk 26937, lastBlockVacuumed 0
[2013-09-08 02:29:14.712 GMT] 55340: <> : PANIC: WAL contains references to invalid pages
[2013-09-08 02:29:14.712 GMT] 55340: <> : CONTEXT: xlog redo vacuum: rel 1663/16411/2529375; blk 26937, lastBlockVacuumed 0
[2013-09-08 02:29:14.761 GMT] 55338: <> : LOG: startup process (PID 55340) was terminated by signal 6: Abort trap
[2013-09-08 02:29:14.761 GMT] 55338: <> : LOG: terminating any other active server processes

Unlike Quentin's original message, simply restarting the slave didn't bring it back to life. I had to pg_start_backup/rsync again from the master, at which point:

cp: /usr/local/pgsql/wal/00000003.history: No such file or directory
[2013-09-09 03:58:29.235 GMT] 96710: <> : LOG: entering standby mode
[2013-09-09 03:58:29.324 GMT] 96710: <> : LOG: restored log file "0000000300000E1E0000004B" from archive
[...]
[2013-09-09 03:59:05.151 GMT] 96710: <> : LOG: consistent recovery state reached at E1E/89BE3DD8
[2013-09-09 03:59:05.152 GMT] 96708: <> : LOG: database system is ready to accept read only connections
cp: /usr/local/pgsql/wal/0000000300000E1E0000008A: No such file or directory
[2013-09-09 03:59:05.154 GMT] 96710: <> : LOG: unexpected pageaddr E1D/FB000000 in log file 3614, segment 138, offset 0
cp: /usr/local/pgsql/wal/0000000300000E1E0000008A: No such file or directory
[2013-09-09 03:59:05.209 GMT] 96786: <> : LOG: streaming replication successfully connected to primary

It seems okay again for now, though I'm anticipating similar future failures.

The original relation that it had a problem with is a child partitioned table's index:

[pgsql ~]$ oid2name -d db01 -o 2529375
From database "db01":
  Filenode                Table Name
------------------------------------
   2529375  syslog_p_2013_32_seq_key

Partitioning is on year/week_of_year, so it definitely shouldn't have been receiving any writes at the time of the failure. Also interesting: both failures were on a Saturday, around the time a base backup is performed (though it failed afterwards in both cases, so this may be coincidence).

Is there any other information I can gather that can help shed light on what is going on?
Any recommendations going from here? My master is fine during all of this, but I'm concerned I've got some hidden corruption that will bite me when I least expect it.

pgsql=# select version();
                                                   version
-------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.4 on amd64-portbld-freebsd9.0, compiled by cc (GCC) 4.2.1 20070831 patched [FreeBSD], 64-bit

--
Mahlon E. Smith
http://www.martini.nu/contact.html
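The whole-table VACUUM pass described above boils down to something like this (a sketch; the table name is a placeholder, and it can just as easily be run database-wide):

    -- with vacuum_freeze_table_age = 0, VACUUM scans every page instead of
    -- skipping pages the visibility map marks as all-visible
    SET vacuum_freeze_table_age = 0;
    VACUUM VERBOSE some_partition_table;   -- or simply: VACUUM VERBOSE;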
On Mon, 2013-09-09 at 13:04 -0700, Mahlon E. Smith wrote:
> After some wild googlin' "research", I saw the index visibility map fix
> for 9.2.1. We did pg_upgrade in between versions, but just to be sure I
> wasn't somehow carrying corrupt data across versions (?), I went ahead
> and VACUUMed everything with vacuum_freeze_table_age set to 0, and went
> on with my life, hoping I had removed whatever demons were running
> around in there.

You may have seen only partial information about that bug and the fix. See the first item in the release notes here:

http://www.postgresql.org/docs/current/static/release-9-2-1.html

And the actual fix here:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=beb850e1d873f8920a78b9b9ee27e9f87c95592f

To rid the daemons entirely, after upgrading to 9.2.1+ you must REINDEX all btree and GIN indexes that may have been touched by 9.2.0. Since the actual problem you see is on an index (as you say later in your report), I would suggest that you do that. Not 100% sure this is the root cause of your problem, of course, but the symptoms seem to line up.

> Unlike Quentin's original message, simply restarting the slave didn't
> bring it back to life. I had to pg_start_backup/rsync again from the
> master, at which point:

It looks OK to me, so I think you are in the clear. If you are particularly unlucky, your master server crashed (while still on 9.2.0) without writing the data and left your master copy of the index corrupt. If you are worried about that, you can do another re-sync after you finish the REINDEXing. This is not necessary unless you experienced at least one crash on 9.2.0.

Regards,
	Jeff Davis
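A sketch of what that amounts to, reusing the names from the earlier messages (reindexdb from the shell works just as well; REINDEX DATABASE has to be run while connected to that database):

    -- connected to db01 on the master: rebuild the index named in the failure...
    REINDEX INDEX syslog_p_2013_32_seq_key;
    -- ...or, more thoroughly, every index in the database
    REINDEX DATABASE db01;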
On Mon, Sep 09, 2013, Jeff Davis wrote:
> You may have seen only partial information about that bug and the fix.

Yep, I totally glazed over the REINDEX. Giving it a go -- thank you!

--
Mahlon E. Smith
http://www.martini.nu/contact.html
On Tue, Sep 10, 2013, Mahlon E. Smith wrote:
> On Mon, Sep 09, 2013, Jeff Davis wrote:
> >
> > You may have seen only partial information about that bug and the fix.
>
> Yep, I totally glazed over the REINDEX. Giving it a go -- thank you!

As a followup for anyone else landing on this thread, the REINDEX was (unsurprisingly) the solution. Been runnin' like a champ for the last 2.5 months.

--
Mahlon E. Smith
http://www.martini.nu/contact.html