Re: corrupted item pointer in streaming based replication - Mailing list pgsql-general

From Jigar Shah
Subject Re: corrupted item pointer in streaming based replication
Date
Msg-id 1E737D138B89104D8A7853F7DD23177DB6358A@SF1-EXMBX-2.ad.savagebeast.com
In response to Re: corrupted item pointer in streaming based replication  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: corrupted item pointer in streaming based replication
List pgsql-general
We had some disk issues on the primary, but RAID verification corrected
those blocks. That may have caused the primary to become corrupt.

I have identified the objects; they are both indexes:

        relname         | relfilenode | relkind
------------------------+-------------+---------
 feedback_packed_pkey   |      114846 | i
 feedback_packed_id_idx |      115085 | i
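
For the archives, a lookup along these lines produces that mapping; run it in the database whose OID is 16384 (connection details are whatever fits your setup):

```sql
-- Map the relfilenodes from the error messages back to relations.
SELECT relname, relfilenode, relkind
FROM pg_class
WHERE relfilenode IN (114846, 115085);
```

relkind 'i' confirms both are indexes rather than heap tables.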


We did start a pg_dump, but it then started throwing the error below. It
happened while querying a different table and has nothing to do with the
above indexes:

[d: u:customerservice p:28567 3] ERROR: could not access status of transaction 4074203375
[d: u:customerservice p:28567 4] DETAIL: Could not open file "pg_clog/0F2D": No such file or directory.
[d: u:customerservice p:28567 5] STATEMENT: select listener_id, station_id from station where date_created > '2013-03-26 00:23:17.249' and name != 'QuickMix';
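
If the immediate goal is just to get a pg_dump past the missing clog segment, one known (and lossy) workaround is to fabricate the segment as all zeros: each pg_clog segment is 256 kB, and zero status bits read as "in progress", so tuples stamped by those transactions are effectively treated as uncommitted and disappear from view. Do this only on a copy, or after a filesystem-level backup; the data directory path below is an assumption:

```shell
# DANGER: lossy recovery hack -- only on a copy / after a filesystem backup.
# Fabricate the missing 256 kB pg_clog segment as zeros so status reads
# stop failing with "No such file or directory".
PGDATA="${PGDATA:-./demo-pgdata}"   # point this at the real data directory
mkdir -p "$PGDATA/pg_clog"
dd if=/dev/zero of="$PGDATA/pg_clog/0F2D" bs=262144 count=1
```

Any rows that genuinely depended on those transactions will silently vanish from the dump, so treat the result as forensic output, not a clean backup.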

We do have a backup from last night, but it's several hours old.

The secondary is the most recent copy. If we could just tell the secondary
to skip past that corrupt block and get the database started, we could
then divert traffic to the secondary so our system can run read-only
until we isolate and fix our primary. But the secondary is stuck at
this point and won't start. Is there a way to make the secondary do that?
Is there a way to remove that block from the WAL file it's applying so it
can get past that point?

2013-03-27 11:00:47.281 PDT LOG:  recovery restart point at 161A/17108AA8
2013-03-27 11:00:47.281 PDT DETAIL:  last completed transaction was at log time 2013-03-27 11:00:47.241236-07
2013-03-27 11:00:47.520 PDT LOG:  restartpoint starting: xlog

2013-03-27 11:07:51.348 PDT FATAL:  corrupted item pointer: offset = 0, size = 0
2013-03-27 11:07:51.348 PDT CONTEXT:  xlog redo split_l: rel 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0, firstright 192
2013-03-27 11:07:51.716 PDT LOG:  startup process (PID 5959) exited with exit code 1
2013-03-27 11:07:51.716 PDT LOG:  terminating any other active server processes


Also, in case we are not able to fix our corrupt primary, we could promote
our secondary, as it's the most recent copy and would save us a lot of
time over restoring an old backup. We could then rebuild the corrupt
indexes on the secondary. All this is possible only if we can get the
secondary started, but it won't budge. Any suggestions?
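
For reference, on 9.1 promotion itself is straightforward once a standby can be started: either run `pg_ctl -D $PGDATA promote`, or touch the trigger_file named in the standby's recovery.conf. A sketch of the relevant recovery.conf entries (paths and hostnames are assumptions):

```
# recovery.conf on the standby (9.1-era syntax; values are assumptions)
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
trigger_file = '/tmp/postgresql.trigger.5432'
```

Touching /tmp/postgresql.trigger.5432 ends recovery and promotes. Note this does not by itself get past the corrupt split_l record: the startup process still has to replay the WAL it already has, so if redo keeps dying on that record, promotion alone won't help.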

Thanks
Jigar




On 4/3/13 1:18 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

>Jigar Shah <jshah@pandora.com> writes:
>> Postgres version = 9.1.2
>
>Um, you do realize this is over a year out of date right?
>(Fortunately, you will have an excellent opportunity to update tomorrow.)
>
>> A few days ago we had a situation where our primary started to throw
>>the error messages below indicating corruption in the database. It
>>crashed sometimes and showed a panic message in the logs
>
>> [d: u:radio p:31917 242] ERROR: could not open file
>>"base/16384/114846.39" (target block 360448000): No such file or
>>directory [d: u:radio p:31917 243]
>
>> 2013-03-27 11:07:51.348 PDT FATAL:  corrupted item pointer: offset = 0,
>>size = 0
>> 2013-03-27 11:07:51.348 PDT CONTEXT:  xlog redo split_l: rel
>>1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0,
>>firstright 192
>
>Look up relfilenodes 114846 and 115085 in pg_class of whichever database
>has OID 16384.  I'm guessing the latter is an index of the former.  If
>that's true, then both of these messages suggest corruption in the index
>--- the latter pretty obviously, and the former because it looks like
>it's an attempt to fetch from a silly block number, which could have
>come out of a corrupted index entry.  So if you're really lucky and
>nothing but that index is corrupted, a REINDEX will fix it.  Personally
>I'd be wondering about what's the underlying cause and whether there is
>corruption elsewhere, though.  Try looking for evidence of flaky RAM or
>flaky disk drives on your primary.  See if you can pg_dump (not just
>for forensic reasons, but so you've got some kind of backup if things
>go downhill from here).
>
>            regards, tom lane


