Thread: BUG #8347: PANIC: heap_insert_redo: failed to add tuple when applying WAL

BUG #8347: PANIC: heap_insert_redo: failed to add tuple when applying WAL

From
maciek@heroku.com
Date:
The following bug has been logged on the website:

Bug reference:      8347
Logged by:          Maciek Sakrejda
Email address:      maciek@heroku.com
PostgreSQL version: 9.2.4
Operating system:   Ubuntu 12.04.2 LTS 64-bit
Description:

Running into a recovery failure on a customer's replica:


Jul 31 00:11:55: LOG:  restored log file "00000001000000E200000067" from archive
Jul 31 00:11:55: WARNING:  will not overwrite a used ItemId
Jul 31 00:11:55: CONTEXT:  xlog redo insert: rel 1663/16385/16619; tid 25260/37
Jul 31 00:11:55: PANIC:  heap_insert_redo: failed to add tuple
Jul 31 00:11:55: CONTEXT:  xlog redo insert: rel 1663/16385/16619; tid 25260/37


I see a similar bug filed [1], but no replies. This happens repeatedly when
attempting to apply this segment.
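
For anyone who wants to inspect the page named in the CONTEXT lines, here is a minimal sketch of turning "rel 1663/16385/16619; tid 25260/37" into a file and byte offset. It assumes the default build settings (8 kB blocks, 1 GB relation segment files) and that 1663 is the OID of the default tablespace pg_default; the function name and the returned dictionary keys are made up for illustration.

```python
import re

# Assumptions: default compile-time settings. A custom build may differ.
BLCKSZ = 8192                 # bytes per heap page
RELSEG_SIZE = 131072          # pages per 1 GB relation segment file
DEFAULT_TABLESPACE_OID = 1663 # pg_default

# Matches "rel <tablespace>/<database>/<relfilenode>; tid <block>/<item>".
# \s+ tolerates the line wrap some mail clients insert after "tid".
CONTEXT_RE = re.compile(r"rel (\d+)/(\d+)/(\d+);\s*tid\s+(\d+)/(\d+)")


def locate_tuple(context_line):
    """Hypothetical helper: map a redo CONTEXT line to a file and offset."""
    m = CONTEXT_RE.search(context_line)
    if m is None:
        raise ValueError("no rel/tid found in: %r" % context_line)
    spc, db, relfilenode, block, item = (int(g) for g in m.groups())

    # Large relations are split into 1 GB files: <relfilenode>, <relfilenode>.1, ...
    seg_no, block_in_seg = divmod(block, RELSEG_SIZE)
    filename = str(relfilenode) if seg_no == 0 else "%d.%d" % (relfilenode, seg_no)

    # Only the default tablespace is resolved; other tablespaces live under
    # pg_tblspc/ and are left unresolved in this sketch.
    path = None
    if spc == DEFAULT_TABLESPACE_OID:
        path = "base/%d/%s" % (db, filename)

    return {
        "path": path,
        "block": block,
        "item": item,
        "byte_offset": block_in_seg * BLCKSZ,
    }


info = locate_tuple("xlog redo insert: rel 1663/16385/16619; tid 25260/37")
print(info)
```

The offset is where the 8 kB page starts within the file, relative to the data directory; item 37 is the line pointer slot that the redo routine found already in use.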


[1]:
http://www.postgresql.org/message-id/CANbzriT3h1kf2EaKEBcDqwu4AYwUjCuKcrDkjdxJ0CTjNeGnFQ@mail.gmail.com
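
Re: BUG #8347: PANIC: heap_insert_redo: failed to add tuple when applying WAL

From
Andres Freund
Date:
On 2013-07-31 01:27:39 +0000, maciek@heroku.com wrote: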
> The following bug has been logged on the website:
>
> Bug reference:      8347
> Logged by:          Maciek Sakrejda
> Email address:      maciek@heroku.com
> PostgreSQL version: 9.2.4
> Operating system:   Ubuntu 12.04.2 LTS 64-bit
> Description:
>
> Running into a recovery failure on a customer's replica:
>
>
> Jul 31 00:11:55: LOG:  restored log file "00000001000000E200000067" from
> archive
> Jul 31 00:11:55: WARNING:  will not overwrite a used ItemId
> Jul 31 00:11:55: CONTEXT:  xlog redo insert: rel 1663/16385/16619; tid
> 25260/37
> Jul 31 00:11:55: PANIC:  heap_insert_redo: failed to add tuple
> Jul 31 00:11:55: CONTEXT:  xlog redo insert: rel 1663/16385/16619; tid
> 25260/37
>
>
> I see a similar bug filed [1], but no replies. This happens repeatedly when
> attempting to apply this segment.

Any chance you could run xlogdump (https://github.com/snaga/xlogdump) on
that and the neighbouring segments? That might tell us whether we're
dealing with broken locking or possibly disk corruption (doesn't sound
too likely).
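
As a rough sketch of which files "that and the neighbouring segments" refers to: a WAL segment file name is 24 hex digits, 8 each for timeline, log id, and segment number. The helper names below are hypothetical. The default of 255 segments per log id is an assumption that matches releases before 9.3 (such as the 9.2.4 reported here), which skip segment FF; 9.3 and later use 256.

```python
def parse_wal_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log id, segment)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def format_wal_name(timeline, log_id, segment):
    return "%08X%08X%08X" % (timeline, log_id, segment)


def neighbouring_segments(name, segs_per_log_id=255):
    """Return the names of the segments just before and just after `name`.

    segs_per_log_id=255 is an assumption for pre-9.3 servers; pass 256
    for 9.3 and later.
    """
    timeline, log_id, segment = parse_wal_name(name)
    position = log_id * segs_per_log_id + segment
    names = []
    for neighbour in (position - 1, position + 1):
        if neighbour >= 0:  # nothing precedes the very first segment
            n_log, n_seg = divmod(neighbour, segs_per_log_id)
            names.append(format_wal_name(timeline, n_log, n_seg))
    return names


failing = "00000001000000E200000067"
print(neighbouring_segments(failing))
```

For the segment in the report, no log id rollover is involved, so the neighbours differ only in the last two digits.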

Just to be sure, you're not running with full_page_writes = off or
something?

Could you possibly run a patched postgres against that, to get more
info?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #8347: PANIC: heap_insert_redo: failed to add tuple when applying WAL

From
Maciek Sakrejda
Date:
On Tue, Jul 30, 2013 at 9:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> Any chance you could run xlogdump (https://github.com/snaga/xlogdump) on
> that and the neighbouring segments? That might tell us whether we're
> dealing with broken locking or possibly disk corruption (doesn't sound
> too likely).
>

Actually, we did find what looks like some pretty crazy disk corruption
after I reported this (heap tuple data in pg_clog files). I'm surprised
Postgres did not wig out more, actually. I can run xlogdump later this week
if it's still of interest, but I'm pretty satisfied that this was not
Postgres' fault.

Incidentally, the system performed admirably in the course of the recovery,
considering the severely compromised state of heap and clog data. I'm
really glad we're using Postgres.

Re: BUG #8347: PANIC: heap_insert_redo: failed to add tuple when applying WAL

From
Klaus Ita
Date:
Isn't it a funny coincidence that we also had corruption of the same
or a similar type?

I am quite confident that my disk was not tampered with. I am wondering:
does PG sign or checksum WAL files? Is the integrity of WAL files ensured
by any mechanism? Because if it is, then in our case the corruption was
caused by the Postgres master server itself. I can replay the WAL files
and reproduce the same error over and over.

lg,k


On Thu, Aug 1, 2013 at 11:13 PM, Maciek Sakrejda <maciek@heroku.com> wrote:

> On Tue, Jul 30, 2013 at 9:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
>> Any chance you could run xlogdump (https://github.com/snaga/xlogdump) on
>> that and the neighbouring segments? That might tell us whether we're
>> dealing with broken locking or possibly disk corruption (doesn't sound
>> too likely).
>>
>
> Actually, we did find what looks like some pretty crazy disk corruption
> after I reported this (heap tuple data in pg_clog files). I'm surprised
> Postgres did not wig out more, actually. I can run xlogdump later this week
> if it's still of interest, but I'm pretty satisfied that this was not
> Postgres' fault.
>
> Incidentally, the system performed admirably in the course of the
> recovery, considering the severely compromised state of heap and clog data.
> I'm really glad we're using Postgres.
>

Re: BUG #8347: PANIC: heap_insert_redo: failed to add tuple when applying WAL

From
Daniel Farina
Date:
On Fri, Aug 2, 2013 at 12:51 AM, Klaus Ita <klaus@worstofall.com> wrote:
> Isn't it a funny coincidence that we also had corruption of the same
> or a similar type?
>
> I am quite confident that my disk was not tampered with. I am wondering:
> does PG sign or checksum WAL files? Is the integrity of WAL files ensured
> by any mechanism? Because if it is, then in our case the corruption was
> caused by the Postgres master server itself. I can replay the WAL files
> and reproduce the same error over and over.

Corruption can hitch a ride on a WAL full page image without much
difficulty, as long as the page header looks legit (from what I've
seen so far, a bad page header will prevent the system from doing much
with it, so no FPIs will be generated).
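
To make the "hitch a ride" point concrete, here is a toy model, not PostgreSQL's actual WAL record format. WAL records do carry a CRC, but a checksum computed at logging time only proves that the record was not altered afterwards. If the page was already damaged when its image was logged, the checksum covers the damaged bytes and verifies cleanly on replay. The record layout and function names below are hypothetical, and zlib's CRC-32 stands in for the server's own CRC code.

```python
import struct
import zlib

PAGE_SIZE = 8192  # assumed default page size


def make_record(page_image):
    """Toy 'WAL record': a 4-byte CRC followed by the page image."""
    crc = zlib.crc32(page_image) & 0xFFFFFFFF
    return struct.pack("<I", crc) + page_image


def verify_record(record):
    """Return True if the stored CRC matches the payload."""
    (stored_crc,) = struct.unpack("<I", record[:4])
    return stored_crc == (zlib.crc32(record[4:]) & 0xFFFFFFFF)


clean_page = bytes(PAGE_SIZE)

# Case 1: the page is damaged in memory or on disk BEFORE it is logged.
damaged_page = bytearray(clean_page)
damaged_page[4000] ^= 0xFF
record = make_record(bytes(damaged_page))
print("damaged before logging, CRC ok:", verify_record(record))

# Case 2: the record itself is damaged AFTER it was written.
tampered = bytearray(record)
tampered[100] ^= 0x01
print("damaged after logging, CRC ok:", verify_record(bytes(tampered)))
```

Under these assumptions the first check passes and the second fails, so a standby that replays the same segments repeatedly would hit the same error each time without the WAL files themselves being damaged in transit or in the archive.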