Re: Tracking down log segment corruption - Mailing list pgsql-general

From Tom Lane
Subject Re: Tracking down log segment corruption
Date
Msg-id 13174.1272819723@sss.pgh.pa.us
In response to Re: Tracking down log segment corruption  (Gordon Shannon <gordo169@gmail.com>)
Responses Re: Tracking down log segment corruption  (Gordon Shannon <gordo169@gmail.com>)
List pgsql-general
Gordon Shannon <gordo169@gmail.com> writes:
> I just ran into the same problem.  Both servers are running 8.4.3, and
> the standby server had been running for 2 days, processing many thousands of
> logs successfully.  Here's my error:

> 4158   2010-05-02 11:12:09 EDT [26445]LOG:  restored log file
> "0000000100003C77000000C3" from archive
> 4158   2010-05-02 11:12:09 EDT [26446]LOG:  restored log file
> "0000000100003C77000000C4" from archive
> 4158   2010-05-02 11:12:09 EDT [26447]WARNING:  specified item offset is too
> large
> 4158   2010-05-02 11:12:09 EDT [26448]CONTEXT:  xlog redo insert: rel
> 48777166/22362/48778276; tid 2/2
> 4158   2010-05-02 11:12:09 EDT [26449]PANIC:  btree_insert_redo: failed to
> add item
> 4158   2010-05-02 11:12:09 EDT [26450]CONTEXT:  xlog redo insert: rel
> 48777166/22362/48778276; tid 2/2
> 4151   2010-05-02 11:12:09 EDT [1]LOG:  startup process (PID 4158) was
> terminated by signal 6: Aborted
> 4151   2010-05-02 11:12:09 EDT [2]LOG:  terminating any other active server
> processes

Hmm ... AFAICS the only way to get that message when the incoming TID's
offsetNumber is only 2 is for the index page to be completely empty
(not zeroes, else PageAddItem's sanity check would have triggered,
but valid and empty).  What that smells like is a software bug, like
failing to emit a WAL record in a case where it was necessary.  Can you
identify which index this was?  (Look for relfilenode 48778276 in the
database with OID 22362.)  If so, can you give us any hints about
unusual things that might have been done with that index?
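
For reference, the lookup described above can be sketched as follows. This is
only a rough illustration, not a tested procedure: it assumes the three numbers
after "rel" in the CONTEXT line are tablespace OID / database OID / relfilenode,
and that "tid" is block number / item offset. The helper names
parse_redo_context and lookup_queries are made up for this example; the
generated queries use only the standard pg_database and pg_class catalogs.

```python
import re

# Target as printed in the redo CONTEXT line (assumed layout):
#   rel <tablespace OID>/<database OID>/<relfilenode>; tid <block>/<offset>
CONTEXT_RE = re.compile(r"rel (\d+)/(\d+)/(\d+); tid (\d+)/(\d+)")


def parse_redo_context(line):
    """Extract the relation and tuple identifiers from a CONTEXT line."""
    m = CONTEXT_RE.search(line)
    if m is None:
        raise ValueError("no 'rel .../.../...; tid .../...' target found")
    spc, db, relfilenode, block, offset = (int(g) for g in m.groups())
    return {
        "tablespace_oid": spc,
        "database_oid": db,
        "relfilenode": relfilenode,
        "block": block,
        "offset": offset,
    }


def lookup_queries(target):
    """Build catalog queries that identify the database and the relation."""
    return [
        # Run from any database: maps the database OID to its name.
        "SELECT datname FROM pg_database WHERE oid = %d;"
        % target["database_oid"],
        # Run while connected to that database: maps relfilenode to a name.
        "SELECT relname, relkind FROM pg_class WHERE relfilenode = %d;"
        % target["relfilenode"],
    ]


if __name__ == "__main__":
    # The CONTEXT line from the report, with the mail line wrap joined.
    line = "CONTEXT:  xlog redo insert: rel 48777166/22362/48778276; tid 2/2"
    for query in lookup_queries(parse_redo_context(line)):
        print(query)
```

The second query has to be run on the primary while connected to the database
that the first query names, since pg_class is per-database.  A relkind of 'i'
in the result would confirm that the relation is an index.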

> Any suggestions?

As far as recovering goes, there's probably not much you can do except
resync the standby from scratch.  But it would be nice to get to the
bottom of the problem, so that we can fix the bug.  Have you got an
archive of this xlog segment and the ones before it, and would you be
willing to let a developer look at them?

            regards, tom lane
