Re: GIN data corruption bug(s) in 9.6devel - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: GIN data corruption bug(s) in 9.6devel
Date
Msg-id CAMkU=1w5x8rY5EvWieJsfZWB3eNGmFGHmOBf9r5VLWDTW72b2g@mail.gmail.com
In response to GIN data corruption bug(s) in 9.6devel  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: GIN data corruption bug(s) in 9.6devel  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: GIN data corruption bug(s) in 9.6devel  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Thu, Nov 5, 2015 at 2:18 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> while repeating some full-text benchmarks on master, I've discovered
> that there's a data corruption bug somewhere. What happens is that while
> loading data into a table with GIN indexes (using multiple parallel
> connections), I sometimes get this:
>
> TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >=
> (__builtin_offsetof (PageHeaderData, pd_linp)))", File: "ginfast.c",
> Line: 537)
> LOG:  server process (PID 22982) was terminated by signal 6: Aborted
> DETAIL:  Failed process was running: autovacuum: ANALYZE messages
>
> The details of the assert are always exactly the same - it's always
> autovacuum and it trips on exactly the same check. And the backtrace
> always looks like this (full backtrace attached):
>
> #0  0x00007f133b635045 in raise () from /lib64/libc.so.6
> #1  0x00007f133b6364ea in abort () from /lib64/libc.so.6
> #2  0x00000000007dc007 in ExceptionalCondition
> (conditionName=conditionName@entry=0x81a088 "!(((PageHeader)
> (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))",
>        errorType=errorType@entry=0x81998b "FailedAssertion",
> fileName=fileName@entry=0x83480a "ginfast.c",
> lineNumber=lineNumber@entry=537) at assert.c:54
> #3  0x00000000004894aa in shiftList (stats=0x0, fill_fsm=1 '\001',
> newHead=26357, metabuffer=130744, index=0x7f133c0f7518) at ginfast.c:537
> #4  ginInsertCleanup (ginstate=ginstate@entry=0x7ffd98ac9160,
> vac_delay=vac_delay@entry=1 '\001', fill_fsm=fill_fsm@entry=1 '\001',
> stats=stats@entry=0x0) at ginfast.c:908
> #5  0x00000000004874f7 in ginvacuumcleanup (fcinfo=<optimized out>) at
> ginvacuum.c:662
> ...
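
For readers who have not looked at the check that trips, here is a minimal sketch in C of what the assertion tests: the page header's pd_special offset must not point inside the fixed part of the header. SketchPageHeader, SKETCH_BLCKSZ, sketch_init_page and special_pointer_is_sane are hypothetical names made up for this illustration; the struct only approximates PostgreSQL's PageHeaderData and is not the real definition. One way such a check can fail is an all-zeros page (pd_special == 0); whether that is what happens in the report above is an assumption, not something established here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192   /* assumed page size for the sketch */

/* Simplified stand-in for a page header (illustrative layout only). */
typedef struct SketchPageHeader {
    uint64_t pd_lsn;
    uint16_t pd_checksum;
    uint16_t pd_flags;
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;          /* offset to start of special space */
    uint16_t pd_pagesize_version;
    uint32_t pd_prune_xid;
    uint32_t pd_linp[1];          /* first line pointer */
} SketchPageHeader;

/* Same shape as the failed assertion: pd_special must lie at or beyond
 * the end of the fixed header, i.e. at or beyond offsetof(..., pd_linp). */
static int special_pointer_is_sane(const unsigned char *page)
{
    SketchPageHeader hdr;

    /* Copy the fixed header out to avoid alignment/aliasing problems. */
    memcpy(&hdr, page, offsetof(SketchPageHeader, pd_linp));
    return (size_t) hdr.pd_special >= offsetof(SketchPageHeader, pd_linp);
}

/* Zero the page, then reserve special_size bytes at its end. */
static void sketch_init_page(unsigned char *page, uint16_t special_size)
{
    SketchPageHeader hdr;

    assert(special_size <= SKETCH_BLCKSZ - offsetof(SketchPageHeader, pd_linp));
    memset(page, 0, SKETCH_BLCKSZ);
    memset(&hdr, 0, sizeof(hdr));
    hdr.pd_lower = (uint16_t) offsetof(SketchPageHeader, pd_linp);
    hdr.pd_upper = (uint16_t) (SKETCH_BLCKSZ - special_size);
    hdr.pd_special = hdr.pd_upper;
    memcpy(page, &hdr, offsetof(SketchPageHeader, pd_linp));
}
```

In this sketch, a page set up with sketch_init_page() passes the check, while a buffer that has only been zero-filled does not.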

This looks like it is probably the same bug discussed here:

http://www.postgresql.org/message-id/CAMkU=1xALfLhUUohFP5v33RzedLVb5aknNUjcEuM9KNBKrB6-Q@mail.gmail.com

And here:

http://www.postgresql.org/message-id/56041B26.2040902@sigaev.ru

The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit
e95680832854cf300e64c) that free pages were recycled aggressively
enough for it to become likely to be hit.

There are some proposed patches in those threads, but discussion on
them seems to have stalled out.  Can you try one and see if it fixes
the problems you are seeing?

Cheers,

Jeff


