Thread: corrupted item pointer in streaming based replication

corrupted item pointer in streaming based replication

From
Jigar Shah
Date:
Hi,

Postgres version = 9.1.2
OS = debian(6.0.7)
fsync = on
full_page_writes = on
Setup = Primary and streaming replication based secondary

A few days ago we had a situation where our Primary started to throw the error messages below, indicating corruption in the database. It crashed intermittently and showed a panic message in the logs.

2013-03-25 07:30:39.545 PDT PANIC:  corrupted item pointer: offset = 0, size = 0
2013-03-25 07:30:39.704 PDT LOG:  server process (PID 8715) was terminated by signal 6: Aborted
2013-03-25 07:30:39.704 PDT LOG:  terminating any other active server processes

In the days before it started to crash, it showed the error messages below in the logs.

[d: u:postgres p:2498 7] ERROR: could not access status of transaction 837550133 
DETAIL: Could not open file "pg_clog/031E": No such file or directory. [u:postgres p:2498 9]

[d: u:radio p:31917 242] ERROR: could not open file "base/16384/114846.39" (target block 360448000): No such file or directory [d: u:radio p:31917 243]

On top of that, our secondaries have now crashed, will not start up, and show the error messages below in the Postgres logs.

2013-03-27 11:00:47.281 PDT LOG:  recovery restart point at 161A/17108AA8
2013-03-27 11:00:47.281 PDT DETAIL:  last completed transaction was at log time 2013-03-27 11:00:47.241236-07
2013-03-27 11:00:47.520 PDT LOG:  restartpoint starting: xlog

2013-03-27 11:07:51.348 PDT FATAL:  corrupted item pointer: offset = 0, size = 0
2013-03-27 11:07:51.348 PDT CONTEXT:  xlog redo split_l: rel 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0, firstright 192
2013-03-27 11:07:51.716 PDT LOG:  startup process (PID 5959) exited with exit code 1
2013-03-27 11:07:51.716 PDT LOG:  terminating any other active server processes

At this point we have a running but corrupt primary and a crashed secondary that won't start up.

I am wondering what our options are at this point. Can we do something to fix this? How can we recover from the corruption?

Thanks in advance for your help.

Regards
Jigar


Re: corrupted item pointer in streaming based replication

From
Lonni J Friedman
Date:
You should figure out what base/16384/114846.39 corresponds to inside
the database.  If you're super lucky it's something unimportant and/or
something that can be recreated easily (like an index).  If it's
something important, then your only option is to try to drop the
object and restore it from the last known good backup.
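
Something along these lines should tell you (16384 is the database OID
and 114846 the relfilenode from the error message):

    -- find which database has OID 16384
    SELECT datname FROM pg_database WHERE oid = 16384;

    -- then, connected to that database, find the object using that file
    SELECT relname, relkind FROM pg_class WHERE relfilenode = 114846;

A relkind of 'i' means it's an index.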




--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       https://netllama.linux-sxs.org


Re: corrupted item pointer in streaming based replication

From
Tom Lane
Date:
Jigar Shah <jshah@pandora.com> writes:
> Postgres version = 9.1.2

Um, you do realize this is over a year out of date, right?
(Fortunately, you will have an excellent opportunity to update tomorrow.)

> A few days ago we had a situation where our Primary started to throw the
> error messages below, indicating corruption in the database. It crashed
> intermittently and showed a panic message in the logs.

> [d: u:radio p:31917 242] ERROR: could not open file "base/16384/114846.39"
> (target block 360448000): No such file or directory [d: u:radio p:31917 243]

> 2013-03-27 11:07:51.348 PDT FATAL:  corrupted item pointer: offset = 0, size = 0
> 2013-03-27 11:07:51.348 PDT CONTEXT:  xlog redo split_l: rel 1663/16384/115085
> left 4256959, right 5861610, next 5044459, level 0, firstright 192

Look up relfilenodes 114846 and 115085 in pg_class of whichever database
has OID 16384.  I'm guessing the latter is an index of the former.  If
that's true, then both of these messages suggest corruption in the index
--- the latter pretty obviously, and the former because it looks like
it's an attempt to fetch from a silly block number, which could have
come out of a corrupted index entry.  So if you're really lucky and
nothing but that index is corrupted, a REINDEX will fix it.  Personally
I'd be wondering about what's the underlying cause and whether there is
corruption elsewhere, though.  Try looking for evidence of flaky RAM or
flaky disk drives on your primary.  See if you can pg_dump (not just
for forensic reasons, but so you've got some kind of backup if things
go downhill from here).
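
Even a rough custom-format dump to somewhere safe, e.g. something like

    pg_dump -Fc -f /somewhere/safe/yourdb.dump yourdb

(path and database name are placeholders), beats having no copy at all
if things keep degrading.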

            regards, tom lane


Re: corrupted item pointer in streaming based replication

From
Jigar Shah
Date:
We had some disk issues on the primary, but RAID verification corrected
those blocks. That may be what caused the corruption on the primary.

I have identified the objects; they are both indexes:

        relname         | relfilenode | relkind
------------------------+-------------+---------
 feedback_packed_pkey   |      114846 | i
 feedback_packed_id_idx |      115085 | i


We did start a pg_dump, but it then started throwing the error below. It
happened while querying a different table and has nothing to do with the
indexes above:

[d: u:customerservice p:28567 3] ERROR: could not access status of
transaction 4074203375 [d: u:customerservice p:28567 4]
DETAIL: Could not open file "pg_clog/0F2D": No such file or directory. [d:
u:customerservice p:28567 5]
STATEMENT: select listener_id, station_id from station where date_created
> '2013-03-26 00:23:17.249' and name != 'QuickMix';

We do have a backup from last night, but it's several hours old.

The secondary is the most recent copy. If we could just tell the secondary
to skip past that corrupt block and get the database started, we could
then divert traffic to the secondary so our system can run read-only
until we can isolate and fix our primary. But the secondary is stuck at
this point and won't start. Is there a way to make the secondary do that?
Is there a way to remove that block from the WAL file it's applying so it
can get past that point?

2013-03-27 11:00:47.281 PDT LOG:  recovery restart point at 161A/17108AA8
2013-03-27 11:00:47.281 PDT DETAIL:  last completed transaction was at log
time 2013-03-27 11:00:47.241236-07
2013-03-27 11:00:47.520 PDT LOG:  restartpoint starting: xlog

2013-03-27 11:07:51.348 PDT FATAL:  corrupted item pointer: offset = 0,
size = 0
2013-03-27 11:07:51.348 PDT CONTEXT:  xlog redo split_l: rel
1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0,
firstright 192
2013-03-27 11:07:51.716 PDT LOG:  startup process (PID 5959) exited with
exit code 1
2013-03-27 11:07:51.716 PDT LOG:  terminating any other active server
processes


Also, in case we are not able to fix our corrupt primary, we could promote
our secondary, as it is the most recent copy, and that would save us a lot
of time compared to restoring an old backup. We could then rebuild the
corrupt indexes on the secondary. All this is possible only if we can get
the secondary started, but it won't budge. Any suggestions?

Thanks
Jigar







Re: corrupted item pointer in streaming based replication

From
Tom Lane
Date:
Jigar Shah <jshah@pandora.com> writes:
> We had some disk issues on the primary, but RAID verification corrected
> those blocks. That may be what caused the corruption on the primary.

"corrected" for small values of "corrected", I'm guessing :-(

> I have identified the objects; they are both indexes:

>         relname         | relfilenode | relkind
> ------------------------+-------------+---------
>  feedback_packed_pkey   |      114846 | i
>  feedback_packed_id_idx |      115085 | i

Hm, well, the good news is that you could reindex both of those; the bad
news is that there are certainly more problems than this.
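
If you end up doing that, it would be along the lines of

    REINDEX INDEX feedback_packed_pkey;
    REINDEX INDEX feedback_packed_id_idx;

on whichever copy you decide you trust.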

> The secondary is the most recent copy. If we could just tell the secondary
> to skip past that corrupt block and get the database started, we could
> then divert traffic to the secondary so our system can run read-only
> until we can isolate and fix our primary. But the secondary is stuck at
> this point and won't start. Is there a way to make the secondary do that?
> Is there a way to remove that block from the WAL file it's applying so it
> can get past that point?

No.  You could probably make use of the PITR functionality to let the
secondary replay up to just short of the WAL record where corruption
becomes apparent, then stop and come up normally.  The problem here
is that it seems unlikely that the detected-inconsistent WAL record
is the first bit of corruption that's been passed to the secondary.
I don't have a lot of faith in the idea that your troubles would be over
if you could only fire up the secondary.

It's particularly worrisome that you seem to be looking for ways to
avoid a dump/restore.  That should be your zeroth-order priority at
this point.  What I would do if I were you is to take a filesystem
backup of the secondary's entire current state (WAL and data directory)
so that you can get back to this point if you have to.  Then try the
PITR stop-at-this-point trick.  Be prepared to restore from the
filesystem backup and recover to some other stopping point, possibly a
few times, to get to the latest point that doesn't have clear
corruption.
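
Concretely, recovery.conf on the copy you experiment with would look
roughly like this (the restore_command path and the target time are
placeholders; the WAL has to be available to the standby, either from
an archive or still sitting in pg_xlog):

    standby_mode = 'off'
    restore_command = 'cp /path/to/wal_archive/%f %p'
    # stop just short of the WAL record that trips the FATAL at 11:07:51
    recovery_target_time = '2013-03-27 11:07:00 PDT'
    recovery_target_inclusive = 'false'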

Meanwhile you could be trying to get the master into a better state.
It's not immediately obvious which path is going to lead to a better
outcome faster, but I wouldn't assume the secondary is in better shape
than the primary.  On the master, again it seems like a filesystem dump
ought to be the first priority, mainly so that you still have the data
if the disks continue the downward arc that it sounds like they're on.

In short: you're in for a long day, but your first priority ought to be
to make sure things can't get even worse.

            regards, tom lane