Thread: ERROR: invalid page in block 1226710 of relation base/16750/27244

ERROR: invalid page in block 1226710 of relation base/16750/27244

From: bricklen
Hi,

We have run into some corruption in one of our production tables. We know the cause (a compute node was moved), but now we need to fix the data. We have backups, but at this point they are nearly a day old, so recovering from them is a last-resort and will incur significant downtime.
We are running PostgreSQL 9.3.9.

Following the steps at http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

I get the following output for ctid, id, other_id, tstamp:

 (690651,42) |    318698967 |   347978007 | 2015-10-20 01:55:41.757+00
 (690651,43) |    318698968 |   347978008 | 2015-10-20 01:55:41.663+00
 (690651,44) |    318698969 |   347978009 | 2015-10-20 01:55:42.005+00
ERROR:  invalid page in block 1226710 of relation base/16750/27244
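(For anyone following the linked post: the probe amounts to scanning the table while capturing ctids, until a read fails. "the_table" and the column names here are assumed from the output above.

    \o /tmp/ctid_scan.out
    SELECT ctid, id, other_id, tstamp FROM the_table;
    \o

The last ctid written to the file before the error marks the last readable tuple.)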

It appears block 690652 is what would be dd'd if that's the route we take. Is that accurate?


Because the message indicates the corruption is in the table's page itself, not just the page header, the docs suggest zero_damaged_pages probably won't work.

What are my options?

1) Enable zero_damaged_pages and execute VACUUM FREEZE (and hope), as sketched below.
2) dd the block(s) using the output of the ctid query above.
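(A minimal sketch of option 1, with "the_table" as a stand-in for the affected table; zero_damaged_pages is superuser-only, and any page that fails validation is zeroed on read, destroying whatever tuples it held:

    SET zero_damaged_pages = on;
    VACUUM FREEZE the_table;
    RESET zero_damaged_pages;
)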

It is a multi-gigabyte table that is extremely heavily used (hundreds to thousands of queries per second), so VACUUM FULL or CLUSTER are options we'd really like to avoid if possible. The database is about 250GB; not huge, but big enough that slaves and backups are time-consuming to redo or recover from.

Will attempting zero_damaged_pages cause any harm as the first step (other than the obvious destruction of any bad pages)?


Is this the correct command if option #2 is chosen? Can it be executed against a running cluster?
dd if=/dev/zero of=base/16750/27244 bs=8192 seek=690652 count=1 conv=notrunc



Thanks,

Bricklen

Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

From: Tom Lane
Date: Wed, Oct 21, 2015 at 11:46 AM
bricklen <bricklen@gmail.com> writes:
> We have run into some corruption in one of our production tables. We know
> the cause (a compute node was moved), but now we need to fix the data. We
> have backups, but at this point they are nearly a day old, so recovering
> from them is a last-resort and will incur significant downtime.
> We are running 9.3.9

> Following the steps at
> http://blog.endpoint.com/2010/06/tracking-down-database-corruption-with.html

> I get the following output for ctid, id, other_id, tstamp:

>  (690651,42) |    318698967 |   347978007 | 2015-10-20 01:55:41.757+00
>  (690651,43) |    318698968 |   347978008 | 2015-10-20 01:55:41.663+00
>  (690651,44) |    318698969 |   347978009 | 2015-10-20 01:55:42.005+00
> ERROR:  invalid page in block 1226710 of relation base/16750/27244

> It appears 690652 is what would be dd'd if that's the route we take. Is
> that accurate?

I'm confused by the block mentioned in the error message not having
anything to do with the TID sequence.  I wonder whether it refers to an
index not the table proper.  What query were you using to get this output,
exactly?  Have you confirmed which relation has relfilenode 27244?
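(For reference, one way to check:

    SELECT oid::regclass AS relation, relkind
      FROM pg_class
     WHERE relfilenode = 27244;

relkind 'r' means a plain table, 'i' an index.)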

> Because the message indicates the corruption is in the table's page, not
> the page header, according to the docs zero_damaged_pages probably won't
> work.

I should think that zero_damaged_pages would work fine, if the problem
is indeed in the base table.  But if it's in an index, a REINDEX would be
a better answer.
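(That is, something like:

    REINDEX INDEX the_damaged_index;    -- or: REINDEX TABLE the_table;

with placeholder names, since REINDEX rebuilds the index from the still-valid heap.)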

Another thing to keep in mind here, if you've got replication slaves,
is that I'm not sure whether the effects of zero_damaged_pages would
propagate to slaves.  Have you investigated the possibility that some
slave has an uncorrupt copy that you could dd into place in the master?

> Is this the correct command if option #2 is chosen? Can it be executed
> against a running cluster?
> dd if=/dev/zero of=base/16750/27244 bs=8192 seek=690652 count=1
> conv=notrunc

Uh, no, you're not accounting for the fact that such an offset wouldn't be
in the first segment file of the relation.
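(With the default 1GB segment size, each segment holds 131072 8kB blocks, so block 1226710 lives in segment file 27244.9 at block offset 47062. A sketch, with paths relative to the data directory and the usual caveat that the server should be shut down first so the zeroed page can't be overwritten from shared buffers:

    BLOCK=1226710
    SEG=$((BLOCK / 131072))     # = 9
    OFF=$((BLOCK % 131072))     # = 47062
    dd if=/dev/zero of=base/16750/27244.$SEG bs=8192 seek=$OFF count=1 conv=notrunc

The same arithmetic, with skip instead of seek on the reading side, applies when copying a known-good block in from a standby.)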

            regards, tom lane


Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

From: bricklen
Hi Tom,

On Wed, Oct 21, 2015 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
bricklen <bricklen@gmail.com> writes:
> I get the following output for ctid, id, other_id, tstamp:

>  (690651,42) |    318698967 |   347978007 | 2015-10-20 01:55:41.757+00
>  (690651,43) |    318698968 |   347978008 | 2015-10-20 01:55:41.663+00
>  (690651,44) |    318698969 |   347978009 | 2015-10-20 01:55:42.005+00
> ERROR:  invalid page in block 1226710 of relation base/16750/27244

> It appears 690652 is what would be dd'd if that's the route we take. Is
> that accurate?

I'm confused by the block mentioned in the error message not having
anything to do with the TID sequence.  I wonder whether it refers to an
index not the table proper.  What query were you using to get this output,
exactly?  Have you confirmed which relation has relfilenode 27244?

Yes, it is definitely a table. There was originally an index on that table which threw the original error (about sibling mismatch). I dropped the index and attempted to recreate it, which failed. Further investigation led to discovery of corruption in the table.
 

> Because the message indicates the corruption is in the table's page, not
> the page header, according to the docs zero_damaged_pages probably won't
> work.

I should think that zero_damaged_pages would work fine, if the problem
is indeed in the base table. 

I will make note of that.

 

Another thing to keep in mind here, if you've got replication slaves,
is that I'm not sure whether the effects of zero_damaged_pages would
propagate to slaves.  Have you investigated the possibility that some
slave has an uncorrupt copy that you could dd into place in the master?

We do have one uncorrupted slave, and one corrupted. I have a 4 hour delayed WAL-apply script that runs on the primary slaves in the disaster recovery data centres, and I stopped that process as soon as I saw the error about the sibling mismatch on the master. It is a viable candidate to fail over to, if we can swing a 20+ hour window of data loss. Right now that is an undesirable option.
 

> Is this the correct command if option #2 is chosen? Can it be executed
> against a running cluster?
> dd if=/dev/zero of=base/16750/27244 bs=8192 seek=690652 count=1
> conv=notrunc

Uh, no, you're not accounting for the fact that such an offset wouldn't be
in the first segment file of the relation.

Hmm, I wasn't sure about that. Thanks for confirming.

As it stands, my next step is going to be a pg_dump of one of the up-to-date slaves (with corruption) but I will exclude the bad table. Given that I know the PK id range, I can COPY out the table's contents before and after the affected data. This way we can at least recover from backup if things get entirely borked.
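(A sketch of that fallback; database and table names are placeholders, 318698969 is the last id known readable from the ctid probe, and 318700000 stands in for the first good id past the corruption, which would have to be found the same way:

    pg_dump -Fc --exclude-table=public.the_table proddb > proddb_rest.dump
    psql proddb -c "\copy (SELECT * FROM the_table WHERE id <= 318698969) TO 'good_before.copy'"
    psql proddb -c "\copy (SELECT * FROM the_table WHERE id >= 318700000) TO 'good_after.copy'"
)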

The next part of the plan is to create a temporary version of the table with all data other than the corrupted range, then do some transaction-fu to rename the tables.
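(Roughly, again with placeholder names:

    BEGIN;
    CREATE TABLE the_table_new (LIKE the_table INCLUDING ALL);
    -- repopulate the_table_new from the salvaged COPY files here
    ALTER TABLE the_table RENAME TO the_table_broken;
    ALTER TABLE the_table_new RENAME TO the_table;
    COMMIT;

Both renames take ACCESS EXCLUSIVE locks, so against hundreds of queries per second the swap transaction wants to be as short as possible.)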


Thank you for your response, and any other insights are gratefully received.


Cheers,

Bricklen

Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

From: Tom Lane
Date: Wed, Oct 21, 2015 at 12:52 PM
bricklen <bricklen@gmail.com> writes:
> On Wed, Oct 21, 2015 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm confused by the block mentioned in the error message not having
>> anything to do with the TID sequence.  I wonder whether it refers to an
>> index not the table proper.  What query were you using to get this output,
>> exactly?  Have you confirmed which relation has relfilenode 27244?

> Yes, it is definitely a table. There was originally an index on that table
> which threw the original error (about sibling mismatch). I dropped the
> index and attempted to recreate it, which failed. Further investigation led
> to discovery of corruption in the table.

Hm.  There's still something weird about this though.  Maybe there is no
data at all between pages 1226710 and 690651?  Might be worth doing some
poking around with contrib/pageinspect/.
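(For instance, as superuser, with the table name as a placeholder:

    CREATE EXTENSION pageinspect;
    SELECT * FROM page_header(get_raw_page('the_table', 690652));

stepping the block number through the suspect range to see which pages still parse.)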

> As it stands, my next step is going to be a pg_dump of one of the
> up-to-date slaves (with corruption) but I will exclude the bad table. Given
> that I know the PK id range, I can COPY out the table's contents before and
> after the affected data. This way we can at least recover from backup if
> things get entirely borked.

Agreed, if you're gonna mess with the table files directly, it's always
smart to have a fallback plan in case you make things worse.

            regards, tom lane


Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

From: bricklen

On Wed, Oct 21, 2015 at 12:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
bricklen <bricklen@gmail.com> writes:
> Yes, it is definitely a table. There was originally an index on that table
> which threw the original error (about sibling mismatch). I dropped the
> index and attempted to recreate it, which failed. Further investigation led
> to discovery of corruption in the table.

Hm.  There's still something weird about this though.  Maybe there is no
data at all between pages 1226710 and 690651?  Might be worth doing some
poking around with contrib/pageinspect/.


Ah, good idea.

Thanks again!


Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

From: bricklen
A follow-up question, if I may:


bricklen <bricklen@gmail.com> writes:
> Yes, it is definitely a table. There was originally an index on that table
> which threw the original error (about sibling mismatch). I dropped the
> index and attempted to recreate it, which failed. Further investigation led
> to discovery of corruption in the table.



There are several hot standby servers attached to the master, some streaming, and one in a different data centre that is using WAL shipping only.
The streaming slave IIRC got the corruption from the master (I can't check now; it was rebuilt).
What would have happened to the WAL-shipping-only standby if the WALs were all applied? Would it have balked at applying a WAL containing bad data from the master, or would it have applied the WAL and continued on? For the latter, would physical corruption on the master even transfer via WAL?

I didn't get a chance to answer those questions because we promoted the DR WAL-shipping standby before it got to the corrupted section.

Thanks,

Bricklen

Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

From: Tom Lane
Date: Thu, Oct 22, 2015 at 9:15 AM
bricklen <bricklen@gmail.com> writes:
> There are several hot standby servers attached to the master, some
> streaming, and one in a different data centre that is using WAL shipping
> only.
> The streaming slave IIRC got the corruption from the master (I can't check
> now, it was rebuilt).
> What would have happened to the WAL-shipping-only standby if the WALs were
> all applied? Would it have balked at applying a WAL containing bad data
> from the master, or would it have applied the WAL and continued on? For the
> latter, would physical corruption on the master even transfer via WAL?

Hard to tell.  I'd have guessed that corruption that made a page
unreadable would not transfer across WAL (streaming or otherwise), because
the master could not have read it in to apply an update to it.  However,
we don't know the exact sequence of events here; there may have been more
than one step on the way to disaster.

            regards, tom lane


Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

From: bricklen
Date: Thu, Oct 22, 2015 at 11:25 AM
On Thu, Oct 22, 2015 at 9:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
bricklen <bricklen@gmail.com> writes:
> What would have happened to the WAL-shipping-only standby if the WALs were
> all applied? Would it have balked at applying a WAL containing bad data
> from the master, or would it have applied the WAL and continued on? For the
> latter, would physical corruption on the master even transfer via WAL?

Hard to tell.  I'd have guessed that corruption that made a page
unreadable would not transfer across WAL (streaming or otherwise), because
the master could not have read it in to apply an update to it.  However,
we don't know the exact sequence of events here; there may have been more
than one step on the way to disaster.

                        regards, tom lane

I would have liked the chance to answer those questions myself, but alas, in the heat of the moment some of the data useful for forensics was lost.

Thanks again!

Re: ERROR: invalid page in block 1226710 of relation base/16750/27244

From: Jim Nasby
On 10/22/15 11:25 AM, bricklen wrote:
> I would have liked the chance to answer those questions myself, but alas,
> in the heat of the moment some of the data useful for forensics was lost.

You could always roll WAL forward from the previous base backup and see
what happens.
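(That is, restore the base backup to a scratch area and replay with a 9.3-style recovery.conf; the paths and target time below are placeholders, aimed just short of the incident:

    restore_command = 'cp /path/to/wal_archive/%f "%p"'
    recovery_target_time = '2015-10-20 01:50:00+00'
    pause_at_recovery_target = true
)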

FWIW, most times that I've experienced corruption it's percolated
through the WAL stream as well, presumably due to full_page_writes. It's
why I like making londiste part of the DR configuration for really
high-value data. Of course, if you're having corruption problems then
it's also very possible that your user data is getting crapped on too,
and logical replication won't help you terribly much there. I
investigated adding user-space row-level checksums but never actually
rolled that out. Of course now you'd just use page-level checksums.
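(On 9.3 those can only be chosen at initdb time, e.g.:

    initdb --data-checksums -D /path/to/new/cluster

there's no supported way to turn them on for an existing cluster of that vintage.)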
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com