Thread: bug, bad memory, or bad disk?

bug, bad memory, or bad disk?

From
Ben Chobot
Date:
We have a Postgres server (PostgreSQL 9.1.6 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit) which does streaming replication to some slaves, and has another set of slaves reading the wal archive for wal-based replication. We had a bit of fun yesterday where, suddenly, the master started spewing errors like:

2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1]  ERROR:  invalid memory alloc request size 1968078400
2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1]  ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1]  ERROR:  invalid memory alloc request size 1968078400
2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1]  ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1]  ERROR:  invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945

There didn't seem to be much correlation to which files were affected, and this was a critical server, so once we realized a simple reindex wasn't going to solve things, we shut it down and brought up a slave as the new master db.

While that seemed to fix these issues, we soon noticed problems with missing clog files. The missing clogs were outside the range of the existing clogs, so we tried using dummy clog files. It didn't help, and running pg_check we found that one block of one table was definitely corrupt. Worse, that corruption had spread to all our replicas.
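
For anyone trying to work out which pg_clog segments a cluster still needs: a minimal sketch, assuming the default 8 kB block size (each clog segment then covers roughly one million transaction IDs). Anything younger than the oldest datfrozenxid must still be answerable from clog:

    -- Oldest transaction ID each database still references; clog for everything
    -- from the oldest datfrozenxid up to the current XID must still exist.
    SELECT datname, datfrozenxid, age(datfrozenxid) AS xids_behind_current
    FROM pg_database
    ORDER BY age(datfrozenxid) DESC;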

I know this is a little sparse on details, but my questions are:

1. What kind of fault should I be looking to fix? Because it spread to all the replicas, both those that stream and those that replicate by replaying wals in the wal archive, I assume it's not a storage issue. (My understanding is that streaming replicas stream their changes from memory, not from wals.) So that leaves bad memory on the master, or a bug in postgres. Or a flawed assumption... :)

2. Is it possible that the corruption that was on the master got replicated to the slaves when I tried to cleanly shut down the master before bringing up a new slave as the new master and switching the other slaves over to replicating from that?

Re: bug, bad memory, or bad disk?

From
Amit Kapila
Date:
On Friday, February 15, 2013 1:33 AM Ben Chobot wrote:

> 2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1]  ERROR:  invalid memory alloc request size 1968078400
> 2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1]  ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
> 2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1]  ERROR:  invalid memory alloc request size 1968078400
> 2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1]  ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
> 2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1]  ERROR:  invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945
>
> There didn't seem to be much correlation to which files were affected, and this was a critical server, so once we realized a simple reindex wasn't going to solve things, we shut it down and brought up a slave as the new master db.
>
> While that seemed to fix these issues, we soon noticed problems with missing clog files. The missing clogs were outside the range of the existing clogs, so we tried using dummy clog files. It didn't help, and running pg_check we found that one block of one table was definitely corrupt. Worse, that corruption had spread to all our replicas.

Can you check whether the corrupted block is from one of the relations mentioned in your errors? This is just to reconfirm.
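
One possible way to do that check, as a sketch: the numbers in the error paths are pg_tblspc/<tablespace oid>/PG_9.1_201105231/<database oid>/<relfilenode>, and they can be mapped back through the catalogs (assuming the relfilenodes have not since been changed by a rewrite such as VACUUM FULL, CLUSTER, or TRUNCATE):

    -- Which tablespace and database the paths point at:
    SELECT spcname FROM pg_tablespace WHERE oid = 16435;
    SELECT datname FROM pg_database   WHERE oid = 188417;

    -- Then, connected to that database, which relations own the files:
    SELECT n.nspname, c.relname, c.relkind
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relfilenode IN (56951641, 58206627, 60418945);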

> I know this is a little sparse on details, but my questions are:

> 1. What kind of fault should I be looking to fix? Because it spread to all the replicas, both those that stream and those that replicate by replaying wals in the wal archive, I assume it's not a storage issue. (My understanding is that streaming replicas stream their changes from memory, not from wals.)

  Streaming replication streams its changes from WAL.

> 2. Is it possible that the corruption that was on the master got replicated to the slaves when I tried to cleanly shut down the master before bringing up a new slave as the new master and switching the other slaves over to replicating from that?

At shutdown, the master will send all WAL (up to the shutdown checkpoint) to the connected standbys.
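
A rough way to confirm a standby really did receive and replay everything, using the 9.1 names for these functions and columns (a sketch; it has to be run before the old master goes away):

    -- On the master, while it is still up:
    SELECT application_name, state, sent_location, replay_location
    FROM pg_stat_replication;

    -- On each standby:
    SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();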

I think there are 2 issues in your mail.

1. Access to corrupted blocks - there are two things here: how the blocks got corrupted on the master, and why the corruption replicated to the other servers.

The corrupted blocks can be replicated through the WAL, because the WAL contains backup copies of blocks when full_page_writes=on, which is the default configuration. So I think the main question that remains is how the block(s) got corrupted on the master. For that I think some more information is needed, such as what kind of operations are run against the relation that has the corrupted block. If you drop and recreate that relation, does the problem remain? Is there any chance the block got corrupted due to a hardware problem?
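
If it helps while testing that, here is a hypothetical salvage sketch ("broken_table" is a placeholder name, not anything from this thread); note that zero_damaged_pages silently discards damaged pages, so it should only ever be run against a throwaway copy of the cluster, never as the primary recovery path:

    BEGIN;
    SET LOCAL zero_damaged_pages = on;   -- superuser only; zeroes bad pages on read
    CREATE TABLE broken_table_salvage AS
        SELECT * FROM broken_table;      -- copies whatever rows are still readable
    COMMIT;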

2. Missing clog files - how did you find the missing clog files? Did some operation fail, or was it just an observation? Do you see any problems in the system because of it?
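
One hedged way to compare, if you can run it as a superuser on each node, is to list the clog segments directly and diff the results between the master and the replicas:

    SELECT * FROM pg_ls_dir('pg_clog') AS t(clog_segment)
    ORDER BY 1;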

With Regards,
Amit Kapila.

Re: bug, bad memory, or bad disk?

From
Merlin Moncure
Date:
On Fri, Feb 15, 2013 at 8:08 AM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Friday, February 15, 2013 1:33 AM Ben Chobot wrote:
>
>> 2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1]  ERROR:  invalid memory alloc request size 1968078400
>> 2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1]  ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
>> 2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1]  ERROR:  invalid memory alloc request size 1968078400
>> 2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1]  ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
>> 2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1]  ERROR:  invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945
>
>> There didn't seem to be much correlation to which files were affected, and this was a critical server, so once we realized a simple reindex wasn't going to solve things, we shut it down and brought up a slave as the new master db.
>>
>> While that seemed to fix these issues, we soon noticed problems with missing clog files. The missing clogs were outside the range of the existing clogs, so we tried using dummy clog files. It didn't help, and running pg_check we found that one block of one table was definitely corrupt. Worse, that corruption had spread to all our replicas.
>
> Can you check whether the corrupted block is from one of the relations mentioned in your errors? This is just to reconfirm.
>
>> I know this is a little sparse on details, but my questions are:
>
>> 1. What kind of fault should I be looking to fix? Because it spread to all the replicas, both those that stream and those that replicate by replaying wals in the wal archive, I assume it's not a storage issue. (My understanding is that streaming replicas stream their changes from memory, not from wals.)
>
>   Streaming replication streams its changes from WAL.

Yeah.  This smells like disk corruption to me, but it really could be
anything.  Unfortunately it can spread to the replicas, especially if
you're not timely about taking the master down.  Page checksums (a
proposed feature) are a way of dealing with this problem.

The biggest issue is the missing clog files -- did you have more than
one replica? Were they missing on all of them?

merlin

Re: bug, bad memory, or bad disk?

From
Amit Kapila
Date:
> -----Original Message-----
> From: Merlin Moncure [mailto:mmoncure@gmail.com]
> Sent: Friday, February 15, 2013 7:56 PM
> To: Amit Kapila
> Cc: Ben Chobot; PostgreSQL General
> Subject: Re: [GENERAL] bug, bad memory, or bad disk?
>
> On Fri, Feb 15, 2013 at 8:08 AM, Amit Kapila <amit.kapila@huawei.com>
> wrote:
> > On Friday, February 15, 2013 1:33 AM Ben Chobot wrote:
> >
> >> 2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1]  ERROR:  invalid memory alloc request size 1968078400
> >> 2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1]  ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
> >> 2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> >> 2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> >> 2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> >> 2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1]  ERROR:  invalid memory alloc request size 1968078400
> >> 2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> >> 2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1]  ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
> >> 2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1]  ERROR:  invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945
> >
> >> There didn't seem to be much correlation to which files were affected, and this was a critical server, so once we realized a simple reindex wasn't going to solve things, we shut it down and brought up a slave as the new master db.
> >>
> >> While that seemed to fix these issues, we soon noticed problems with missing clog files. The missing clogs were outside the range of the existing clogs, so we tried using dummy clog files. It didn't help, and running pg_check we found that one block of one table was definitely corrupt. Worse, that corruption had spread to all our replicas.
> >
> > Can you check whether the corrupted block is from one of the relations mentioned in your errors? This is just to reconfirm.
> >
> >> I know this is a little sparse on details, but my questions are:
> >
> >> 1. What kind of fault should I be looking to fix? Because it spread to all the replicas, both those that stream and those that replicate by replaying wals in the wal archive, I assume it's not a storage issue. (My understanding is that streaming replicas stream their changes from memory, not from wals.)
> >
> >   Streaming replication streams its changes from WAL.
>
> Yeah.  This smells like disk corruption to me, but it really could be
> anything.  Unfortunately it can spread to the replicas, especially if
> you're not timely about taking the master down.  Page checksums (a
> proposed feature) are a way of dealing with this problem.

Yes, that can be one of the reasons.

One more thing I observed: the block it is trying to access in the file has a very large block number (for example, see the log message below). Is it really feasible to have such a large block number?

2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
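
As a rough sanity check (assuming the default 8 kB block size and 1 GB segment files), that block number would imply a relation of roughly 29 TB spread across about 30,000 segment files, while the error names segment .1, so it looks much more like a garbage value read from a corrupted page or index pointer than a real block number:

    SELECT pg_size_pretty(3936767042::bigint * 8192)    AS implied_relation_size,  -- 29 TB
           3936767042::bigint / (1073741824 / 8192)     AS implied_segment_number; -- 30035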

Also, about the first error message:
"invalid memory alloc request size 1968078400"
Why, and from which operation, would such a large memory request come?

I am not sure whether there is any chance of memory corruption.
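
For scale (a hedged note, based on how palloc limits single allocations): that error is raised when one allocation request exceeds MaxAllocSize, which is just under 1 GiB, and a request of 1968078400 bytes is well past that. A size like that most often turns out to be a garbage length word read from a damaged page rather than a real allocation:

    SELECT pg_size_pretty(1968078400::bigint)         AS requested_size,  -- ~1877 MB
           pg_size_pretty((1073741824 - 1)::bigint)   AS max_alloc_size;  -- 1024 MB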

> The biggest issue is the missing clog files -- did you have more than
> one replica? Were they missing on all of them?

Couldn't it be that there are fewer clog files on the replica? I am not sure whether that can happen in normal scenarios.

With Regards,
Amit Kapila.