bug, bad memory, or bad disk? - Mailing list pgsql-general

From Ben Chobot
Subject bug, bad memory, or bad disk?
Msg-id 77BBEB20-89D9-47C5-8F36-41DF5E05355A@silentmedia.com
Responses Re: bug, bad memory, or bad disk?  (Amit Kapila <amit.kapila@huawei.com>)
List pgsql-general
We have a Postgres server (PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu,
compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit) which does
streaming replication to some slaves, and has another set of slaves
reading the WAL archive for WAL-based replication. We had a bit of fun
yesterday when, suddenly, the master started spewing errors like:

2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1]  ERROR:  invalid memory alloc request size 1968078400
2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1]  ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1]  ERROR:  invalid memory alloc request size 1968078400
2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1]  ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1]  ERROR:  invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945
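
For a sense of scale, the numbers in these messages can be checked
against the limits of a stock build. The following is only a rough
sketch: it assumes the default 8 kB block size, 1 GB segment files, and
the usual 1 GB palloc limit, none of which is confirmed for this
particular server, and the function names are made up for illustration.

```python
# Assumed defaults for a stock PostgreSQL build; not confirmed for this server.
BLCKSZ = 8192                 # bytes per page
RELSEG_SIZE = 131072          # pages per 1 GB segment file
MAX_ALLOC_SIZE = 0x3FFFFFFF   # upper limit for a palloc request (1 GB - 1 byte)


def locate_block(block_no):
    """Return (segment_number, block_within_segment) for a block number."""
    return divmod(block_no, RELSEG_SIZE)


def implied_relation_bytes(block_no):
    """Smallest relation size, in bytes, that could contain this block."""
    return (block_no + 1) * BLCKSZ


def alloc_is_valid(size):
    """True if a request of `size` bytes is within the assumed palloc limit."""
    return 0 <= size <= MAX_ALLOC_SIZE


if __name__ == "__main__":
    seg, off = locate_block(3936767042)
    print("segment", seg, "block in segment", off)
    print(implied_relation_bytes(3936767042) / 2**40, "TiB")
    print("alloc ok:", alloc_is_valid(1968078400))
```

Under those assumptions, block 3936767042 would live in segment 30035
of a relation of roughly 29 TiB, while the open already failed at
segment .1, and a 1968078400-byte request is above the allocation
limit. Both look more like garbage block pointers and length words
than real locations.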

There didn't seem to be much of a pattern to which files were affected,
and this was a critical server, so once we realized a simple reindex
wasn't going to solve things, we shut it down and brought up a slave as
the new master db.

While that seemed to fix these issues, we soon noticed problems with
missing clog files. The missing clogs were outside the range of the
existing clogs, so we tried using dummy clog files. It didn't help, and
running pg_check we found that one block of one table was definitely
corrupt. Worse, that corruption had spread to all our replicas.
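
To make "outside the range of the existing clogs" concrete, here is a
small sketch of how a transaction ID maps to a clog segment file. It
assumes the constants of a default 9.1-era build (8 kB pages, two
status bits per transaction, 32 pages per segment); the function names
are hypothetical and nothing here is taken from the affected server.

```python
# Assumed constants from a default 9.1-era build; not confirmed for this server.
BLCKSZ = 8192
CLOG_XACTS_PER_BYTE = 4            # two status bits per transaction
SLRU_PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = BLCKSZ * CLOG_XACTS_PER_BYTE * SLRU_PAGES_PER_SEGMENT
SEGMENT_BYTES = BLCKSZ * SLRU_PAGES_PER_SEGMENT   # size of one full segment


def clog_segment_name(xid):
    """File name under pg_clog/ that would hold the status of `xid`."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)


def xid_range(segment_name):
    """Half-open XID range [first, last) covered by a segment file."""
    seg = int(segment_name, 16)
    return seg * XACTS_PER_SEGMENT, (seg + 1) * XACTS_PER_SEGMENT


if __name__ == "__main__":
    # A made-up XID, purely for illustration.
    print(clog_segment_name(123456789))
    print(xid_range("0001"))
    print(SEGMENT_BYTES)
```

With these assumptions each segment covers 1048576 transactions and a
full segment file is 262144 bytes. Comparing the name of a missing file
against the XID range the cluster has actually used is one way to tell
whether a tuple header is asking about a transaction that could ever
have existed.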

I know this is a little sparse on details, but my questions are:

1. What kind of fault should I be looking to fix? Because it spread to
all the replicas, both those that stream and those that replicate by
replaying WAL files from the WAL archive, I assume it's not a storage
issue. (My understanding is that streaming replicas stream their changes
from memory, not from WAL files.) So that leaves bad memory on the
master, or a bug in Postgres. Or a flawed assumption... :)

2. Is it possible that the corruption that was on the master got
replicated to the slaves when I tried to cleanly shut down the master
before bringing up a new slave as the new master and switching the other
slaves over to replicating from that?
