Thread: GIN data corruption bug(s) in 9.6devel

GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:
Hi,

while repeating some full-text benchmarks on master, I've discovered
that there's a data corruption bug somewhere. What happens is that while
loading data into a table with GIN indexes (using multiple parallel
connections), I sometimes get this:

TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >=
(__builtin_offsetof (PageHeaderData, pd_linp)))", File: "ginfast.c",
Line: 537)
LOG:  server process (PID 22982) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: autovacuum: ANALYZE messages

The details of the assert are always exactly the same - it's always
autovacuum and it trips on exactly the same check. And the backtrace
always looks like this (full backtrace attached):

#0  0x00007f133b635045 in raise () from /lib64/libc.so.6
#1  0x00007f133b6364ea in abort () from /lib64/libc.so.6
#2  0x00000000007dc007 in ExceptionalCondition
(conditionName=conditionName@entry=0x81a088 "!(((PageHeader)
(page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))",
        errorType=errorType@entry=0x81998b "FailedAssertion",
fileName=fileName@entry=0x83480a "ginfast.c",
lineNumber=lineNumber@entry=537) at assert.c:54
#3  0x00000000004894aa in shiftList (stats=0x0, fill_fsm=1 '\001',
newHead=26357, metabuffer=130744, index=0x7f133c0f7518) at ginfast.c:537
#4  ginInsertCleanup (ginstate=ginstate@entry=0x7ffd98ac9160,
vac_delay=vac_delay@entry=1 '\001', fill_fsm=fill_fsm@entry=1 '\001',
stats=stats@entry=0x0) at ginfast.c:908
#5  0x00000000004874f7 in ginvacuumcleanup (fcinfo=<optimized out>) at
ginvacuum.c:662
...

It's not perfectly deterministic - sometimes I had to repeat the whole
load multiple times (up to 5x, and each load takes ~30 minutes).

I'm pretty sure this is not an external issue, because I've reproduced it
on a different machine (different CPU / kernel / libraries / compiler).
It's however interesting that on the other machine I've also observed a
different kind of lockup, where the sessions get stuck on semop() in
gininsert (again, full backtrace attached):

#0  0x0000003f3d4eaf97 in semop () from /lib64/libc.so.6
#1  0x000000000067a41f in PGSemaphoreLock (sema=0x7f93290405d8) at
pg_sema.c:387
#2  0x00000000006df613 in LWLockAcquire (lock=0x7f92a4dce900,
mode=LW_EXCLUSIVE) at lwlock.c:1049
#3  0x00000000004878c6 in ginHeapTupleFastInsert
(ginstate=0x7ffd969c88f0, collector=0x7ffd969caeb0) at ginfast.c:250
#4  0x000000000047423a in gininsert (fcinfo=<value optimized out>) at
gininsert.c:531
...

I'm not sure whether this is a different manifestation of the same issue
or another bug. The systems are not exactly the same - one has a single
socket (i5) while the other one has 2 (Xeon), the compilers and kernels
are different and so on.

I've also seen cases when the load seemingly completed OK, but trying to
dump the table to disk using COPY resulted in

        ERROR:  compressed data is corrupted

which I find rather strange as there was no previous error, and also
COPY should only dump table data (while the asserts were in GIN code
handling index pages, unless I'm mistaken). Seems like a case of
insufficient locking where two backends scribble on the same page
somehow, and then autovacuum hits the assert. Or maybe not, not sure.
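
For the record, the dump was done with a plain COPY, something along these
lines (the table name is taken from the autovacuum DETAIL line above, the
target path is arbitrary):

        -- dump the table contents to a file; this is where the
        -- "compressed data is corrupted" error popped up
        COPY messages TO '/tmp/messages.copy';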

I've been unable to reproduce the issue on REL9_5_STABLE (despite
running the load ~20x on each machine), so that seems safe, and the
issue was introduced by some of the newer commits.

I've already spent too much CPU time on this, so perhaps someone with
better knowledge of the GIN code can take care of this. To reproduce it
you may use the same code I did - it's available here:

        https://bitbucket.org/tvondra/archie

it's a PoC database of the pgsql mailing lists with full-text search. It's
a bit messy, but using it is rather simple:

        1) clone the repo

           $ git clone https://bitbucket.org/tvondra/archie.git

        2) create a directory for downloading the mbox files

           $ mkdir archie-data

        3) download the mbox files (~4.5GB of data) using the download
           script (make sure archie/bin is on PATH)

           $ cd archie-data
           $ export PATH=../archie/bin:$PATH
           $ ../archie/download

        4) use "run" scipt (attached) to run the load n-times on a given
           commit

           $ run.sh master 10

           NOTICE: The run script is the messiest one, you'll have to
                   edit it to fix paths etc.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Attachment

Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Thu, Nov 5, 2015 at 2:18 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> while repeating some full-text benchmarks on master, I've discovered
> that there's a data corruption bug somewhere. What happens is that while
> loading data into a table with GIN indexes (using multiple parallel
> connections), I sometimes get this:
>
> TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >=
> (__builtin_offsetof (PageHeaderData, pd_linp)))", File: "ginfast.c",
> Line: 537)
> LOG:  server process (PID 22982) was terminated by signal 6: Aborted
> DETAIL:  Failed process was running: autovacuum: ANALYZE messages
>
> The details of the assert are always exactly the same - it's always
> autovacuum and it trips on exactly the same check. And the backtrace
> always looks like this (full backtrace attached):
>
> #0  0x00007f133b635045 in raise () from /lib64/libc.so.6
> #1  0x00007f133b6364ea in abort () from /lib64/libc.so.6
> #2  0x00000000007dc007 in ExceptionalCondition
> (conditionName=conditionName@entry=0x81a088 "!(((PageHeader)
> (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))",
>        errorType=errorType@entry=0x81998b "FailedAssertion",
> fileName=fileName@entry=0x83480a "ginfast.c",
> lineNumber=lineNumber@entry=537) at assert.c:54
> #3  0x00000000004894aa in shiftList (stats=0x0, fill_fsm=1 '\001',
> newHead=26357, metabuffer=130744, index=0x7f133c0f7518) at ginfast.c:537
> #4  ginInsertCleanup (ginstate=ginstate@entry=0x7ffd98ac9160,
> vac_delay=vac_delay@entry=1 '\001', fill_fsm=fill_fsm@entry=1 '\001',
> stats=stats@entry=0x0) at ginfast.c:908
> #5  0x00000000004874f7 in ginvacuumcleanup (fcinfo=<optimized out>) at
> ginvacuum.c:662
> ...

This looks like it is probably the same bug discussed here:

http://www.postgresql.org/message-id/CAMkU=1xALfLhUUohFP5v33RzedLVb5aknNUjcEuM9KNBKrB6-Q@mail.gmail.com

And here:

http://www.postgresql.org/message-id/56041B26.2040902@sigaev.ru

The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit
e95680832854cf300e64c) that free pages were recycled aggressively
enough that it actually becomes likely to be hit.

There are some proposed patches in those threads, but discussion on
them seems to have stalled out.  Can you try one and see if it fixes
the problems you are seeing?

Cheers,

Jeff



Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:

On 11/05/2015 11:44 PM, Jeff Janes wrote:
> This looks like it is probably the same bug discussed here:
>
> http://www.postgresql.org/message-id/CAMkU=1xALfLhUUohFP5v33RzedLVb5aknNUjcEuM9KNBKrB6-Q@mail.gmail.com
>
> And here:
>
> http://www.postgresql.org/message-id/56041B26.2040902@sigaev.ru
>
> The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit
> e95680832854cf300e64c) that free pages were recycled aggressively
> enough that it actually becomes likely to be hit.

I have only quickly skimmed the discussions, but my impression was that 
it's mostly about removing stuff that shouldn't be removed and such. But 
maybe there are race conditions that cause data corruption. I don't 
really want to dive too deeply into this, I've already spent too much 
time trying to reproduce it.

>
> There are some proposed patches in those threads, but discussion on
> them seems to have stalled out. Can you try one and see if it fixes
> the problems you are seeing?

I can do that - I see there are three patches in the two threads:

  1) gin_pending_lwlock.patch (Jeff Janes)
  2) gin_pending_pagelock.patch (Jeff Janes)
  3) gin_alone_cleanup-2.patch (Teodor Sigaev)

Should I test all of them? Or is (1) obsoleted by (2) for example?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Thu, Nov 5, 2015 at 3:50 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 11/05/2015 11:44 PM, Jeff Janes wrote:
>>
>>
>> This looks like it is probably the same bug discussed here:
>>
>>
>> http://www.postgresql.org/message-id/CAMkU=1xALfLhUUohFP5v33RzedLVb5aknNUjcEuM9KNBKrB6-Q@mail.gmail.com
>>
>> And here:
>>
>> http://www.postgresql.org/message-id/56041B26.2040902@sigaev.ru
>>
>> The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit
>> e95680832854cf300e64c) that free pages were recycled aggressively
>> enough that it actually becomes likely to be hit.
>
>
> I have only quickly skimmed the discussions, but my impression was that it's
> mostly about removing stuff that shouldn't be removed and such. But maybe
> there are race conditions that cause data corruption. I don't really want to
> dive too deeply into this, I've already spent too much time trying to
> reproduce it.
>
>>
>> There are some proposed patches in those threads, but discussion on
>> them seems to have stalled out. Can you try one and see if it fixes
>> the problems you are seeing?
>
>
> I can do that - I see there are three patches in the two threads:
>
>   1) gin_pending_lwlock.patch (Jeff Janes)
>   2) gin_pending_pagelock.patch (Jeff Janes)
>   3) gin_alone_cleanup-2.patch (Teodor Sigaev)
>
> Should I test all of them? Or is (1) obsoleted by (2) for example?

1 is obsolete.  Either 2 or 3 should fix the bug, provided this is the
bug you are seeing.  They have different performance side effects, but
as far as fixing the bug they should be equivalent.

Cheers,

Jeff



Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:
Hi,

On 11/06/2015 01:05 AM, Jeff Janes wrote:
> On Thu, Nov 5, 2015 at 3:50 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
...
>>
>> I can do that - I see there are three patches in the two threads:
>>
>>    1) gin_pending_lwlock.patch (Jeff Janes)
>>    2) gin_pending_pagelock.patch (Jeff Janes)
>>    3) gin_alone_cleanup-2.patch (Teodor Sigaev)
>>
>> Should I test all of them? Or is (1) obsoleted by (2) for example?
>
> 1 is obsolete.  Either 2 or 3 should fix the bug, provided this is the
> bug you are seeing.  They have different performance side effects, but
> as far as fixing the bug they should be equivalent.

OK, I'll do testing with those two patches then, and I'll also note the 
performance difference (the data load was very stable). Of course, it's 
just one particular workload.

I'll post an update after the weekend.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:
Hi,

On 11/06/2015 02:09 AM, Tomas Vondra wrote:
> Hi,
>
> On 11/06/2015 01:05 AM, Jeff Janes wrote:
>> On Thu, Nov 5, 2015 at 3:50 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
> ...
>>>
>>> I can do that - I see there are three patches in the two threads:
>>>
>>>    1) gin_pending_lwlock.patch (Jeff Janes)
>>>    2) gin_pending_pagelock.patch (Jeff Janes)
>>>    3) gin_alone_cleanup-2.patch (Teodor Sigaev)
>>>
>>> Should I test all of them? Or is (1) obsoleted by (2) for example?
>>
>> 1 is obsolete.  Either 2 or 3 should fix the bug, provided this is the
>> bug you are seeing.  They have different performance side effects, but
>> as far as fixing the bug they should be equivalent.
>
> OK, I'll do testing with those two patches then, and I'll also note the
> performance difference (the data load was very stable). Of course, it's
> just one particular workload.
>
> I'll post an update after the weekend.

I've finally managed to test the two patches. Sorry for the delay.

I've repeated the workload on 9.5, 9.6 and 9.6 with (1) or (2), looking 
for lockups or data corruption. I've also measured the duration of the
script, to see what the impact on performance is. The configuration
(shared_buffers, work_mem ...) was exactly the same in all cases.

9.5     : runtime ~1380 seconds
9.6     : runtime ~1380 seconds (but lockups and data corruption)
9.6+(1) : runtime ~1380 seconds
9.6+(2) : runtime ~1290 seconds

So both patches seem to do the trick, but (2) is faster. Not sure if 
this is expected. (BTW all the results are without asserts enabled).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On 11/06/2015 02:09 AM, Tomas Vondra wrote:
>>
>> Hi,
>>
>> On 11/06/2015 01:05 AM, Jeff Janes wrote:
>>>
>>> On Thu, Nov 5, 2015 at 3:50 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>
>> ...
>>>>
>>>>
>>>> I can do that - I see there are three patches in the two threads:
>>>>
>>>>    1) gin_pending_lwlock.patch (Jeff Janes)
>>>>    2) gin_pending_pagelock.patch (Jeff Janes)
>>>>    3) gin_alone_cleanup-2.patch (Teodor Sigaev)
>>>>
>>>> Should I test all of them? Or is (1) obsoleted by (2) for example?
>>>
>>>
>>> 1 is obsolete.  Either 2 or 3 should fix the bug, provided this is the
>>> bug you are seeing.  They have different performance side effects, but
>>> as far as fixing the bug they should be equivalent.
>>
>>
>> OK, I'll do testing with those two patches then, and I'll also note the
>> performance difference (the data load was very stable). Of course, it's
>> just one particular workload.
>>
>> I'll post an update after the weekend.
>
>
> I've finally managed to test the two patches. Sorry for the delay.
>
> I've repeated the workload on 9.5, 9.6 and 9.6 with (1) or (2), looking for
> lockups or data corruption. I've also measured duration of the script, to
> see what's the impact on performance. The configuration (shared_buffers,
> work_mem ...) was exactly the same in all cases.
>
> 9.5     : runtime ~1380 seconds
> 9.6     : runtime ~1380 seconds (but lockups and data corruption)
> 9.6+(1) : runtime ~1380 seconds
> 9.6+(2) : runtime ~1290 seconds
>
> So both patches seem to do the trick, but (2) is faster. Not sure if this is
> expected. (BTW all the results are without asserts enabled).

Do you know what the size of the pending list was at the end of each test?

I think the last one may be faster because it left a large mess behind
that someone needs to clean up later.

Also, do you have the final size of the indexes in each case?

Cheers,

Jeff



Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:

On 12/21/2015 07:41 PM, Jeff Janes wrote:
> On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:

...

>> So both patches seem to do the trick, but (2) is faster. Not sure
>> if this is expected. (BTW all the results are without asserts
>> enabled).
>
> Do you know what the size of the pending list was at the end of each
> test?
>
> I think last one may be faster because it left a large mess behind
> that someone needs to clean up later.

No. How do I measure it?

>
> Also, do you have the final size of the indexes in each case?

No, I hadn't realized the patches affect that, so I haven't measured it.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Mon, Dec 21, 2015 at 11:51 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 12/21/2015 07:41 PM, Jeff Janes wrote:
>>
>> On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>
>
> ...
>
>>> So both patches seem to do the trick, but (2) is faster. Not sure
>>> if this is expected. (BTW all the results are without asserts
>>> enabled).
>>
>>
>> Do you know what the size of the pending list was at the end of each
>> test?
>>
>> I think last one may be faster because it left a large mess behind
>> that someone needs to clean up later.
>
>
> No. How do I measure it?

pageinspect's gin_metapage_info, or pgstattuple's pgstatginindex.
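
For example, a minimal sketch of both approaches (the index name below is just
a placeholder; the GIN functions in pageinspect are only there in the 9.6-era
code discussed here):

        CREATE EXTENSION IF NOT EXISTS pageinspect;
        CREATE EXTENSION IF NOT EXISTS pgstattuple;

        -- pending-list counters read straight from the GIN metapage (block 0)
        SELECT n_pending_pages, n_pending_tuples
          FROM gin_metapage_info(get_raw_page('messages_body_idx', 0));

        -- or the summary from pgstattuple
        SELECT pending_pages, pending_tuples
          FROM pgstatginindex('messages_body_idx');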


>
>>
>> Also, do you have the final size of the indexes in each case?
>
>
> No, I haven't realized the patches do affect that, so I haven't measured it.

There shouldn't be a difference between the two approaches (although I
guess there could be if one left a larger pending list than the other,
as pending lists are very space inefficient), but since you included
9.5 in your test I thought it would be interesting to see how either
patched version under 9.6 compared to 9.5.

Cheers,

Jeff



Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:
Hi,

On 12/23/2015 09:33 PM, Jeff Janes wrote:
> On Mon, Dec 21, 2015 at 11:51 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> On 12/21/2015 07:41 PM, Jeff Janes wrote:
>>>
>>> On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> ...
>>
>>>> So both patches seem to do the trick, but (2) is faster. Not sure
>>>> if this is expected. (BTW all the results are without asserts
>>>> enabled).
>>>
>>>
>>> Do you know what the size of the pending list was at the end of each
>>> test?
>>>
>>> I think last one may be faster because it left a large mess behind
>>> that someone needs to clean up later.
>>
>>
>> No. How do I measure it?
>
> pageinspect's gin_metapage_info, or pgstattuple's pgstatginindex

Hmmm, so this turns out not to be very useful, because at the end the data I
get from gin_metapage_info is almost exactly the same for both patches 
(more details below).

>>
>>>
>>> Also, do you have the final size of the indexes in each case?
>>
>>
>> No, I haven't realized the patches do affect that, so I haven't measured it.
>
> There shouldn't be a difference between the two approaches (although I
> guess there could be if one left a larger pending list than the other,
> as pending lists is very space inefficient), but since you included
> 9.5 in your test I thought it would be interesting to see how either
> patched version under 9.6 compared to 9.5.

Well, turns out there's a quite significant difference, actually. The 
index sizes I get (quite stable after multiple runs):

   9.5                 : 2428 MB
   9.6 + alone cleanup : 730 MB
   9.6 + pending lock  : 488 MB

So that's quite a significant difference, I guess. The load duration for
each version looks like this:

   9.5                 : 1415 seconds
   9.6 + alone cleanup : 1310 seconds
   9.6 + pending lock  : 1380 seconds

I'd say I'm happy with sacrificing ~5% of time in exchange for ~35% 
reduction of index size.

The size of the index on 9.5 after VACUUM FULL (so pretty much the 
smallest index possible) is 440MB, which suggests the "pending lock" 
patch does a quite good job.
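
(For completeness, the sizes above can be pulled out with a plain catalog
query; a sketch, with nothing specific to the archie schema:)

        -- list every GIN index in the database with its on-disk size
        SELECT c.relname AS index_name,
               pg_size_pretty(pg_relation_size(c.oid)) AS size
          FROM pg_class c
          JOIN pg_am a ON a.oid = c.relam
         WHERE a.amname = 'gin'
         ORDER BY pg_relation_size(c.oid) DESC;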

The gin_metapage_info at the end of one of the runs (pretty much all the 
runs look exactly the same) looks like this:
                   pending lock   alone cleanup      9.5
--------------------------------------------------------
  pending_head                2               2   310460
  pending_tail              338             345   310806
  tail_free_size            812             812      812
  n_pending_pages           330             339      347
  n_pending_tuples         1003            1037     1059
  n_total_pages               2               2        2
  n_entry_pages               1               1        1
  n_data_pages                0               0        0
  n_entries                   0               0        0
  version                     2               2        2

So almost no difference, except for the pending_* attributes, and even 
in that case the values are only different for the 9.5 branch. Not sure what
conclusion to draw from this - maybe it's necessary to collect the 
function input while the load is running (but that'd be tricky to 
process, I guess).
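
If anyone wants to repeat the experiment, one cheap way to get that would be
to sample the metapage periodically while the load runs, e.g. (index name is
again just a placeholder):

        -- run repeatedly during the load, e.g. via psql's \watch, and log
        -- the output to see how the pending list evolves over time
        SELECT now() AS sampled_at,
               n_pending_pages, n_pending_tuples, pending_head, pending_tail
          FROM gin_metapage_info(get_raw_page('messages_body_idx', 0));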

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:
Hi,

On 01/05/2016 10:38 AM, Tomas Vondra wrote:
> Hi,
>
...
>>
>> There shouldn't be a difference between the two approaches (although I
>> guess there could be if one left a larger pending list than the other,
>> as pending lists is very space inefficient), but since you included
>> 9.5 in your test I thought it would be interesting to see how either
>> patched version under 9.6 compared to 9.5.
>
> Well, turns out there's a quite significant difference, actually. The
> index sizes I get (quite stable after multiple runs):
>
>     9.5 : 2428 MB
>     9.6 + alone cleanup : 730 MB
>     9.6 + pending lock : 488 MB
>
> So that's quite a significant difference, I guess. The load duration for
> each version look like this:
>
>     9.5                 : 1415 seconds
>     9.6 + alone cleanup : 1310 seconds
>     9.6 + pending lock  : 1380 seconds
>
> I'd say I'm happy with sacrificing ~5% of time in exchange for ~35%
> reduction of index size.
>
> The size of the index on 9.5 after VACUUM FULL (so pretty much the
> smallest index possible) is 440MB, which suggests the "pending lock"
> patch does a quite good job.
>
> The gin_metapage_info at the end of one of the runs (pretty much all the
> runs look exactly the same) looks like this:
>
>                    pending lock   alone cleanup      9.5
> --------------------------------------------------------
>   pending_head                2               2   310460
>   pending_tail              338             345   310806
>   tail_free_size            812             812      812
>   n_pending_pages           330             339      347
>   n_pending_tuples         1003            1037     1059
>   n_total_pages               2               2        2
>   n_entry_pages               1               1        1
>   n_data_pages                0               0        0
>   n_entries                   0               0        0
>   version                     2               2        2
>
> So almost no difference, except for the pending_* attributes, and even
> in that case the values are only different for 9.5 branch. Not sure what
> conclusion to draw from this - maybe it's necessary to collect the
> function input while the load is running (but that'd be tricky to
> process, I guess).

Are we going to do anything about this? While the bug is present in 9.5
(and possibly other versions), fixing it before 9.6 gets out seems 
important because reproducing it there is rather trivial (while I've 
been unable to reproduce it on 9.5).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Robert Haas
Date:
On Wed, Feb 24, 2016 at 9:17 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> Well, turns out there's a quite significant difference, actually. The
>> index sizes I get (quite stable after multiple runs):
>>
>>     9.5 : 2428 MB
>>     9.6 + alone cleanup : 730 MB
>>     9.6 + pending lock : 488 MB
>>
>> So that's quite a significant difference, I guess. The load duration for
>> each version look like this:
>>
>>     9.5                 : 1415 seconds
>>     9.6 + alone cleanup : 1310 seconds
>>     9.6 + pending lock  : 1380 seconds
>>
>> I'd say I'm happy with sacrificing ~5% of time in exchange for ~35%
>> reduction of index size.
>>
>> The size of the index on 9.5 after VACUUM FULL (so pretty much the
>> smallest index possible) is 440MB, which suggests the "pending lock"
>> patch does a quite good job.
>>
>> The gin_metapage_info at the end of one of the runs (pretty much all the
>> runs look exactly the same) looks like this:
>>
>>                    pending lock   alone cleanup      9.5
>> --------------------------------------------------------
>>   pending_head                2               2   310460
>>   pending_tail              338             345   310806
>>   tail_free_size            812             812      812
>>   n_pending_pages           330             339      347
>>   n_pending_tuples         1003            1037     1059
>>   n_total_pages               2               2        2
>>   n_entry_pages               1               1        1
>>   n_data_pages                0               0        0
>>   n_entries                   0               0        0
>>   version                     2               2        2
>>
>> So almost no difference, except for the pending_* attributes, and even
>> in that case the values are only different for 9.5 branch. Not sure what
>> conclusion to draw from this - maybe it's necessary to collect the
>> function input while the load is running (but that'd be tricky to
>> process, I guess).
>
> Are we going to anything about this? While the bug is present in 9.5 (and
> possibly other versions), fixing it before 9.6 gets out seems important
> because reproducing it there is rather trivial (while I've been unable to
> reproduce it on 9.5).

I'm not going to promise to commit anything here, because GIN is not
usually my area, but could you provide a link to the email that
contains the patch you think should be committed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:
On 02/24/2016 06:56 AM, Robert Haas wrote:
> On Wed, Feb 24, 2016 at 9:17 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
...
>> Are we going to anything about this? While the bug is present in 9.5 (and
>> possibly other versions), fixing it before 9.6 gets out seems important
>> because reproducing it there is rather trivial (while I've been unable to
>> reproduce it on 9.5).
>
> I'm not going to promise to commit anything here, because GIN is not
> usually my area, but could you provide a link to the email that
> contains the patch you think should be committed?

Sure. There are actually two candidate patches in two separate threads;
I'm not sure which one is better. Based on the testing both seem to fix
the issue and the "pending lock" patch produces much smaller indexes (at 
least in my benchmark):

[1] http://www.postgresql.org/message-id/56041B26.2040902@sigaev.ru
[2] 
http://www.postgresql.org/message-id/CAMkU=1w7Uu1GZ8N0bxMykRLgTh-uPH+GPHfhMNeryZPCv7fqdA@mail.gmail.com

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
Thank you for remembering this problem, at least for my sake.

>> Well, turns out there's a quite significant difference, actually. The
>> index sizes I get (quite stable after multiple runs):
>>
>>     9.5 : 2428 MB
>>     9.6 + alone cleanup : 730 MB
>>     9.6 + pending lock : 488 MB
Interesting - I don't see why alone_cleanup and pending_lock differ so much. I'd
like to understand that; does somebody have a good theory? The one sticking point
in the pending_lock patch is a suspicious exception in ProcSleep, and that
exception may cause problems in the future.

>>
>> So that's quite a significant difference, I guess. The load duration for
>> each version look like this:
>>
>>     9.5                 : 1415 seconds
>>     9.6 + alone cleanup : 1310 seconds
>>     9.6 + pending lock  : 1380 seconds
>>
>> I'd say I'm happy with sacrificing ~5% of time in exchange for ~35%
>> reduction of index size.
I think the alone_cleanup patch is faster because a regular insert can break off
its cleanup work if autovacuum is waiting to take over the cleanup, so the
inserting process can return from the pending-list cleanup earlier.

Attached is just a rebased v2 of the alone_cleanup patch.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
>> Well, turns out there's a quite significant difference, actually. The
>> index sizes I get (quite stable after multiple runs):
>>
>>     9.5 : 2428 MB
>>     9.6 + alone cleanup : 730 MB
>>     9.6 + pending lock : 488 MB

Attached is a modified alone_cleanup patch which, like the pending_lock patch,
doesn't break off the cleanup process, but unlike that patch it doesn't touch
the lock management.

Tomas, if you can, please repeat the test with this patch. If not, I will try
to do it myself, but later.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:
Hi,

On 02/25/2016 05:32 PM, Teodor Sigaev wrote:
>>> Well, turns out there's a quite significant difference, actually. The
>>> index sizes I get (quite stable after multiple runs):
>>>
>>>     9.5 : 2428 MB
>>>     9.6 + alone cleanup : 730 MB
>>>     9.6 + pending lock : 488 MB
>
> In attach modified alone_cleanup patch which doesn't break cleanup
> process as it does pending_lock patch but attached patch doesn't touch a
> lock management.
>
> Tomas. if you can, pls, repeat test with this patch. If not, I will try
> to do it, but later.

I'll do that once the system I used for that becomes available - right now
it's running other benchmarks.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Wed, Feb 24, 2016 at 8:51 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you for remembering this problem, at least for me.
>
>>> Well, turns out there's a quite significant difference, actually. The
>>> index sizes I get (quite stable after multiple runs):
>>>
>>>     9.5 : 2428 MB
>>>     9.6 + alone cleanup : 730 MB
>>>     9.6 + pending lock : 488 MB
>
> Interesting, I don't see why alone_cleanup and pending_lock are so differ.
> I'd like to understand that, but does somebody have an good theory?

Under my patch, anyone who wanted to do a clean up and detected
someone else was doing one would wait for the concurrent one to end.
(This is more consistent with the existing behavior, I just made it so
they don't do any damage while they wait.)

Under your patch, if a backend wants to do a clean up and detects
someone else is already doing one, it would just skip the clean up and
proceed on with whatever it was doing.  This allows one process
(hopefully a vacuum, but maybe a user backend) to get pinned down
indefinitely, as other processes keep putting stuff onto the end of
the pending_list with no throttle.

Since the freespace recycling only takes place once the list is
completely cleaned, allowing some processes to add to the end while
one poor process is trying to clean can lead to less effective
recycling.

That is my theory, anyway.

Cheers,

Jeff



Re: GIN data corruption bug(s) in 9.6devel

From
Noah Misch
Date:
On Thu, Feb 25, 2016 at 11:19:20AM -0800, Jeff Janes wrote:
> On Wed, Feb 24, 2016 at 8:51 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> > Thank you for remembering this problem, at least for me.
> >
> >>> Well, turns out there's a quite significant difference, actually. The
> >>> index sizes I get (quite stable after multiple runs):
> >>>
> >>>     9.5 : 2428 MB
> >>>     9.6 + alone cleanup : 730 MB
> >>>     9.6 + pending lock : 488 MB
> >
> > Interesting, I don't see why alone_cleanup and pending_lock are so differ.
> > I'd like to understand that, but does somebody have an good theory?
> 
> Under my patch, anyone who wanted to do a clean up and detected
> someone else was doing one would wait for the concurrent one to end.
> (This is more consistent with the existing behavior, I just made it so
> they don't do any damage while they wait.)
> 
> Under your patch, if a backend wants to do a clean up and detects
> someone else is already doing one, it would just skip the clean up and
> proceed on with whatever it was doing.  This allows one process
> (hopefully a vacuum, but maybe a user backend) to get pinned down
> indefinitely, as other processes keep putting stuff onto the end of
> the pending_list with no throttle.
> 
> Since the freespace recycling only takes place once the list is
> completely cleaned, allowing some processes to add to the end while
> one poor process is trying to clean can lead to less effective
> recycling.
> 
> That is my theory, anyway.

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Teodor,
since you committed the patch believed to have created it, you own this open
item.  If that responsibility lies elsewhere, please let us know whose
responsibility it is to fix this.  Since new open items may be discovered at
any time and I want to plan to have them all fixed well in advance of the ship
date, I will appreciate your efforts toward speedy resolution.  Please
present, within 72 hours, a plan to fix the defect within seven days of this
message.  Thanks.



Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
> The above-described topic is currently a PostgreSQL 9.6 open item.  Teodor,
> since you committed the patch believed to have created it, you own this open
> item.  If that responsibility lies elsewhere, please let us know whose
> responsibility it is to fix this.  Since new open items may be discovered at
> any time and I want to plan to have them all fixed well in advance of the ship
> date, I will appreciate your efforts toward speedy resolution.  Please
> present, within 72 hours, a plan to fix the defect within seven days of this
> message.  Thanks.

I'm waiting on Tomas's testing. I suppose that isn't possible right now, so I
will do it myself and publish the results afterwards.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:
On 04/04/2016 02:06 PM, Teodor Sigaev wrote:
>> The above-described topic is currently a PostgreSQL 9.6 open item.
>> Teodor,
>> since you committed the patch believed to have created it, you own
>> this open
>> item.  If that responsibility lies elsewhere, please let us know whose
>> responsibility it is to fix this.  Since new open items may be
>> discovered at
>> any time and I want to plan to have them all fixed well in advance of
>> the ship
>> date, I will appreciate your efforts toward speedy resolution.  Please
>> present, within 72 hours, a plan to fix the defect within seven days
>> of this
>> message.  Thanks.
>
> I'm waiting of Tomas testing. Suppose, it isn't possible now? so, will
> do myself, after that I will publish results.

Ah, damn. This completely slipped from my TODO list. I'll rerun the 
tests either today or tomorrow, and report the results here.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Tomas Vondra
Date:
Hi,

On 04/04/2016 02:25 PM, Tomas Vondra wrote:
> On 04/04/2016 02:06 PM, Teodor Sigaev wrote:
>>> The above-described topic is currently a PostgreSQL 9.6 open item.
>>> Teodor,
>>> since you committed the patch believed to have created it, you own
>>> this open
>>> item.  If that responsibility lies elsewhere, please let us know whose
>>> responsibility it is to fix this.  Since new open items may be
>>> discovered at
>>> any time and I want to plan to have them all fixed well in advance of
>>> the ship
>>> date, I will appreciate your efforts toward speedy resolution.  Please
>>> present, within 72 hours, a plan to fix the defect within seven days
>>> of this
>>> message.  Thanks.
>>
>> I'm waiting of Tomas testing. Suppose, it isn't possible now? so, will
>> do myself, after that I will publish results.
>
> Ah, damn. This completely slipped from my TODO list. I'll rerun the
> tests either today or tomorrow, and report the results here.

So, I've done some testing on the patch, and in short it seems fine to 
me. It fixes the bug (no data corruption observed), and the performance 
seems fine too.

I've tested v2, v3 and v3.1 of the patch, to see if there are any
differences. The v2 no longer applies, so I tested it on ee943004. The 
following table shows the total duration of the data load, and also 
sizes of the two GIN indexes.

            duration (sec)          subject          body
   ---------------------------------------------------------------
   v2          1290                 23 MB           684 MB
   v3          1360                 24 MB           487 MB
   v3.1        1360                 24 MB           488 MB

So, looks good to me.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
> I've tested the v2, v3 and v3.1 of the patch, to see if there are any
> differences. The v2 no longer applies, so I tested it on ee943004. The following
> table shows the total duration of the data load, and also sizes of the two GIN
> indexes.
>
>             duration (sec)          subject          body
>    ---------------------------------------------------------------
>    v2          1290                 23 MB           684 MB
>    v3          1360                 24 MB           487 MB
>    v3.1        1360                 24 MB           488 MB
Thank you very much.

Hmm. v3 is just a rebased version of v2; v3.1 drops the unlock/lock cycle during
cleanup, just to become similar to Jeff's pending lock patch. In theory, v2 and
v3 should be very close, and v3.1 should be close to pending_lock.

I'm inclined to push v3.1: it is one of the two winners by size/performance and,
unlike the pending lock patch, it doesn't change the internal logic of the lock
machinery.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Wed, Apr 6, 2016 at 9:52 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

> I'm inclining to push v3.1 as one of two winners by size/performance and,
> unlike to pending lock patch, it doesn't change an internal logic of lock
> machinery.

This restricts the memory used by ordinary backends when doing the
cleanup to be work_mem. Shouldn't we let them use
maintenance_work_mem? Only one backend can be doing this clean up of a
given index at any given time, so we don't need to worry about many
concurrent allocations of maintenance_work_mem.  This seems very
similar in spirit to index creation, where a backend is allowed to use
maintenance_work_mem.

Also, do we plan on backpatching this?  While there are no known
instances of anyone hitting this bug in any released code, still it is
a bug.  There are plenty of reports of corrupted indexes which never
get properly diagnosed, and for all we know some of them might be due
to this.  On the other hand, it is a substantial behavior change,
which is scary to backpatch.  (But all other proposed solutions also
have behavior changes of one kind or another.)

To summarize the bugs in the already-released code:

If a vacuum's dead list includes a tuple which is still in the pending
list, and the vacuum skips the pending list cleanup because it detects
someone else is already working on it, then the vacuum could remove
that tuple from the table, even though there is still a reference to
it in the index.

A page could be deleted from the pending list, put on the free space
map, then reused.  When the linked list of pending pages is followed
onto this reused page, it will fail to recognize it as being deleted
and recycled, and bad things will happen.  This is quite likely to
happen in 9.6 because of the new aggressive recycling of pages.  But
I've seen no argument as to why it is impossible to hit this bug in
released code, it seems like it would be possible to hit it but just
on a very rare basis.

To summarize the behavior change:

In the released code, an inserting backend that violates the pending
list limit will try to clean the list, even if it is already being
cleaned.  It won't accomplish anything useful, but will go through the
motions until eventually it runs into a page the lead cleaner has
deleted, at which point it realizes there is another cleaner and it
stops.  This acts as a natural throttle to how fast insertions can
take place into an over-sized pending list.

The proposed change removes that throttle, so that inserters will
immediately see there is already a cleaner and just go back about
their business.  Due to that, unthrottled backends could add to the
pending list faster than the cleaner can clean it, leading to
unbounded growth in the pending list and could cause a user backend to
become apparently unresponsive to the user, indefinitely.  That is
scary to backpatch.

Cheers,

Jeff



Re: GIN data corruption bug(s) in 9.6devel

From
Alvaro Herrera
Date:
Jeff Janes wrote:

> The proposed change removes that throttle, so that inserters will
> immediately see there is already a cleaner and just go back about
> their business.  Due to that, unthrottled backends could add to the
> pending list faster than the cleaner can clean it, leading to
> unbounded growth in the pending list and could cause a user backend to
> becoming apparently unresponsive to the user, indefinitely.  That is
> scary to backpatch.

Can we install a protection against that?  For instance, the cleaner
only cleans up the entries that were present when it started, and if
other items are inserted later into the pending list, they are left for
a future cleaner.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: GIN data corruption bug(s) in 9.6devel

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> To summarize the behavior change:

> In the released code, an inserting backend that violates the pending
> list limit will try to clean the list, even if it is already being
> cleaned.  It won't accomplish anything useful, but will go through the
> motions until eventually it runs into a page the lead cleaner has
> deleted, at which point it realizes there is another cleaner and it
> stops.  This acts as a natural throttle to how fast insertions can
> take place into an over-sized pending list.

Right.

> The proposed change removes that throttle, so that inserters will
> immediately see there is already a cleaner and just go back about
> their business.  Due to that, unthrottled backends could add to the
> pending list faster than the cleaner can clean it, leading to
> unbounded growth in the pending list and could cause a user backend to
> becoming apparently unresponsive to the user, indefinitely.  That is
> scary to backpatch.

It's scary to put into HEAD, either.  What if we simply don't take
that specific behavioral change?  It doesn't seem like this is an
essential part of fixing the bug as you described it.  (Though I've
not read the patch, so maybe I'm just missing the connection.)
        regards, tom lane



Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Thu, Apr 7, 2016 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> To summarize the behavior change:
>
>> In the released code, an inserting backend that violates the pending
>> list limit will try to clean the list, even if it is already being
>> cleaned.  It won't accomplish anything useful, but will go through the
>> motions until eventually it runs into a page the lead cleaner has
>> deleted, at which point it realizes there is another cleaner and it
>> stops.  This acts as a natural throttle to how fast insertions can
>> take place into an over-sized pending list.
>
> Right.
>
>> The proposed change removes that throttle, so that inserters will
>> immediately see there is already a cleaner and just go back about
>> their business.  Due to that, unthrottled backends could add to the
>> pending list faster than the cleaner can clean it, leading to
>> unbounded growth in the pending list and could cause a user backend to
>> becoming apparently unresponsive to the user, indefinitely.  That is
>> scary to backpatch.
>
> It's scary to put into HEAD, either.  What if we simply don't take
> that specific behavioral change?  It doesn't seem like this is an
> essential part of fixing the bug as you described it.  (Though I've
> not read the patch, so maybe I'm just missing the connection.)

There are only 3 fundamental options I see, the cleaner can wait,
"help", or move on.

"Helping" is what it does now and is dangerous.

Moving on gives the above-discussed unthrottling problem.

Waiting has two problems.  The act of waiting will cause autovacuums
to be canceled, unless ugly hacks are deployed to prevent that.   If
we deploy those ugly hacks, then we have the problem that a user
backend will end up waiting on an autovacuum to finish the cleaning,
and the autovacuum is taking its sweet time due to
autovacuum_vacuum_cost_delay.  (The "helping" model avoids this
problem because the user backend can just catch up with and pass the
io-throttled autovac process)

For completeness' sake, a fourth option would be to move on, but only
after inserting the tuple directly into the main index structure
(rather than the pending list) like would be done with fastupdate off,
once the pending list is already oversized.  This is my favorite, but
there is no chance of it going into 9.6, much less being backpatched.
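
(For reference, the fastupdate behavior mentioned above is already controllable
per index; a sketch, with a placeholder index name:)

        -- make future insertions bypass the pending list entirely
        ALTER INDEX messages_body_idx SET (fastupdate = off);

        -- entries already sitting in the pending list are not flushed by this;
        -- they are moved into the main structure by VACUUM, or in 9.6 by
        SELECT gin_clean_pending_list('messages_body_idx');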


Alvaro's recommendation, to let the cleaner off the hook once it
passes the page which was the tail page at the time it started, would
prevent any process from getting pinned down indefinitely, but would
not prevent the size of the list from increasing without bound.  I
think that would probably be good enough, because the current
throttling behavior is purely accidental and doesn't *guarantee* a
limit on the size of the pending list.

Cheers,

Jeff



Re: GIN data corruption bug(s) in 9.6devel

From
Noah Misch
Date:
On Thu, Apr 07, 2016 at 05:53:54PM -0700, Jeff Janes wrote:
> On Thu, Apr 7, 2016 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Janes <jeff.janes@gmail.com> writes:
> >> To summarize the behavior change:
> >
> >> In the released code, an inserting backend that violates the pending
> >> list limit will try to clean the list, even if it is already being
> >> cleaned.  It won't accomplish anything useful, but will go through the
> >> motions until eventually it runs into a page the lead cleaner has
> >> deleted, at which point it realizes there is another cleaner and it
> >> stops.  This acts as a natural throttle to how fast insertions can
> >> take place into an over-sized pending list.
> >
> > Right.
> >
> >> The proposed change removes that throttle, so that inserters will
> >> immediately see there is already a cleaner and just go back about
> >> their business.  Due to that, unthrottled backends could add to the
> >> pending list faster than the cleaner can clean it, leading to
> >> unbounded growth in the pending list and could cause a user backend to
> >> becoming apparently unresponsive to the user, indefinitely.  That is
> >> scary to backpatch.
> >
> > It's scary to put into HEAD, either.  What if we simply don't take
> > that specific behavioral change?  It doesn't seem like this is an
> > essential part of fixing the bug as you described it.  (Though I've
> > not read the patch, so maybe I'm just missing the connection.)
> 
> There are only 3 fundamental options I see, the cleaner can wait,
> "help", or move on.
> 
> "Helping" is what it does now and is dangerous.
> 
> Moving on gives the above-discussed unthrottling problem.
> 
> Waiting has two problems.  The act of waiting will cause autovacuums
> to be canceled, unless ugly hacks are deployed to prevent that.   If
> we deploy those ugly hacks, then we have the problem that a user
> backend will end up waiting on an autovacuum to finish the cleaning,
> and the autovacuum is taking its sweet time due to
> autovacuum_vacuum_cost_delay.

Teodor, this thread has been quiet for four days, and the deadline to fix this
open item expired 23 hours ago.  Do you have a new plan for fixing it?



Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
> This restricts the memory used by ordinary backends when doing the
> cleanup to be work_mem. Shouldn't we let them use
> maintenance_work_mem? Only one backend can be doing this clean up of a
> given index at any given time, so we don't need to worry about many
> concurrent allocations of maintenance_work_mem.  This seems very
> similar in spirit to index creation, where a backend is allowed to use
> maintenance_work_mem.
Because there could be several indexes in one pg instance, and each cleaner
could eat maintenance_work_mem.


>
> Also, do we plan on backpatching this?  While there are no known
Yes


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
> There are only 3 fundamental options I see, the cleaner can wait,
> "help", or move on.
>
> "Helping" is what it does now and is dangerous.
>
> Moving on gives the above-discussed unthrottling problem.
>
> Waiting has two problems.  The act of waiting will cause autovacuums
> to be canceled, unless ugly hacks are deployed to prevent that.   If
> we deploy those ugly hacks, then we have the problem that a user
> backend will end up waiting on an autovacuum to finish the cleaning,
> and the autovacuum is taking its sweet time due to
> autovacuum_vacuum_cost_delay.  (The "helping" model avoids this
> problem because the user backend can just catch up with and pass the
> io-throttled autovac process)
With the pending cleanup patch a backend will try to get the lock on the metapage
with ConditionalLockPage. Will that interrupt the autovacuum worker?


>
> For completeness sake, a fourth option would to move on, but only
> after inserting the tuple directly into the main index structure
> (rather then the pending list) like would be done with fastupdate off,
> once the pending list is already oversized.  This is my favorite, but
> there is no chance of it going into 9.6, much less being backpatched.
Agreed, I will reimplement it that way for 9.7.
>
>
> Alvaro's recommendation, to let the cleaner off the hook once it
> passes the page which was the tail page at the time it started, would
> prevent any process from getting pinned down indefinitely, but would
> not prevent the size of the list from increasing without bound.  I
> think that would probably be good enough, because the current
> throttling behavior is purely accidentally and doesn't *guarantee* a
> limit on the size of the pending list.
Added, see attached patch (based on v3.1)
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
>> Alvaro's recommendation, to let the cleaner off the hook once it
>> passes the page which was the tail page at the time it started, would
>> prevent any process from getting pinned down indefinitely, but would
> Added, see attached patch (based on v3.1)
If there are no objections I will apply it on Monday and backpatch it to all
supported versions.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Tue, Apr 12, 2016 at 9:53 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>
> With pending cleanup patch backend will try to get lock on metapage with
> ConditionalLockPage. Will it interrupt autovacum worker?


Correct, ConditionalLockPage should not interrupt the autovacuum worker.

>>
>> Alvaro's recommendation, to let the cleaner off the hook once it
>> passes the page which was the tail page at the time it started, would
>> prevent any process from getting pinned down indefinitely, but would
>> not prevent the size of the list from increasing without bound.  I
>> think that would probably be good enough, because the current
>> throttling behavior is purely accidentally and doesn't *guarantee* a
>> limit on the size of the pending list.
>
> Added, see attached patch (based on v3.1)

With this applied, I am getting a couple errors I have not seen before
after extensive crash recovery testing:

ERROR:  attempted to delete invisible tuple

ERROR:  unexpected chunk number 1 (expected 2) for toast value
100338365 in pg_toast_16425

I've restarted the test harness with intentional crashes turned off,
to see if the problems are related to crash recovery or are more
generic than that.

I've never seen these particular problems before, so don't have much
insight into what might be going on or how to debug it.

Cheers,

Jeff



Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
>> Added, see attached patch (based on v3.1)
>
> With this applied, I am getting a couple errors I have not seen before
> after extensive crash recovery testing:
> ERROR:  attempted to delete invisible tuple
> ERROR:  unexpected chunk number 1 (expected 2) for toast value
> 100338365 in pg_toast_16425
Huh, it seems it's not related to GIN at all... Indexes don't interact with the
toast machinery. The single place where this error can occur is heap_delete() -
deleting an already-deleted tuple.

>
> I've restarted the test harness with intentional crashes turned off,
> to see if the problems are related to crash recovery or are more
> generic than that.
>
> I've never seen these particular problems before, so don't have much
> insight into what might be going on or how to debug it.
Check my reasoning: in version 4 I added code to remember the tail of the pending
list in the blknoFinish variable. When we read the page which was the tail at the
start of the cleanup, we set the cleanupFinish variable, and after cleaning that
page we stop further cleanup. Any insert made during the cleanup will be placed
after blknoFinish (corner case: on that page), so vacuum should not miss tuples
marked as deleted.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: GIN data corruption bug(s) in 9.6devel

From
Noah Misch
Date:
On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
> >>Added, see attached patch (based on v3.1)
> >
> >With this applied, I am getting a couple errors I have not seen before
> >after extensive crash recovery testing:
> >ERROR:  attempted to delete invisible tuple
> >ERROR:  unexpected chunk number 1 (expected 2) for toast value
> >100338365 in pg_toast_16425
> Huh, seems, it's not related to GIN at all... Indexes don't play with toast
> machinery. The single place where this error can occur is a heap_delete() -
> deleting already deleted tuple.

Like you, I would not expect gin_alone_cleanup-4.patch to cause such an error.
I get the impression Jeff has a test case that he had run in many iterations
against the unpatched baseline.  I also get the impression that a similar or
smaller number of its iterations against gin_alone_cleanup-4.patch triggered
these two errors (once apiece, or multiple times?).  Jeff, is that right?  If
so, until we determine the cause, we should assume the cause arrived in
gin_alone_cleanup-4.patch.  An error in pointer arithmetic or locking might
corrupt an unrelated buffer, leading to this symptom.

> >I've restarted the test harness with intentional crashes turned off,
> >to see if the problems are related to crash recovery or are more
> >generic than that.
> >
> >I've never seen these particular problems before, so don't have much
> >insight into what might be going on or how to debug it.

Could you describe the test case in sufficient detail for Teodor to reproduce
your results?

> Check my reasoning: in version 4 I added code that remembers the tail of the
> pending list in the blknoFinish variable. When cleanup reaches the page that
> was the tail at cleanup start, it sets the cleanupFinish flag, and after
> cleaning that page it stops further cleanup. Any insert that happens during
> cleanup will be placed after blknoFinish (corner case: on that very page), so
> vacuum should not miss tuples marked as deleted.

Would any hacker volunteer to review Teodor's reasoning here?

Thanks,
nm



Re: GIN data corruption bug(s) in 9.6devel

From
Peter Geoghegan
Date:
On Thu, Nov 5, 2015 at 2:44 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit
> e95680832854cf300e64c) that free pages were recycled aggressively
> enough that it actually becomes likely to be hit.

In other words: The bug could be in 9.5, but that hasn't been
conclusively demonstrated. Fair?

I'm not an expert on GIN at all; I know far more about B-Tree. But,
commit e95680832854cf300e64c seems a bit odd to me. I don't see any
argument for why it's okay that the recycling of pages can happen
immediately for the pending list, rather than requiring it to happen
at some time later with a safe interlock (something like B-Tree's use of
RecentGlobalXmin). The GIN README discusses a few such issues, but it
wasn't updated by the commit I mentioned, which I would have expected.
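
To spell out what I mean by an interlock, here is a simplified, self-contained
illustration of the B-Tree idea (a paraphrase for this discussion, not the
actual nbtree code):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

/* Modulo-2^32 XID comparison, in the style PostgreSQL uses. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int) (a - b) < 0;
}

typedef struct DeletedPage
{
    TransactionId deleted_xid;      /* next XID at the time the page was deleted */
} DeletedPage;

/*
 * A deleted page may go to the free space map only once every transaction
 * that could still hold a link to it has finished, i.e. once the XID
 * stamped at deletion time precedes the oldest xmin anyone can still see.
 */
static bool
page_is_recyclable(const DeletedPage *page, TransactionId recent_global_xmin)
{
    return xid_precedes(page->deleted_xid, recent_global_xmin);
}

int
main(void)
{
    DeletedPage page = {.deleted_xid = 1000};

    printf("xmin  990: recyclable? %d\n", page_is_recyclable(&page, 990));   /* 0 */
    printf("xmin 1010: recyclable? %d\n", page_is_recyclable(&page, 1010));  /* 1 */
    return 0;
}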

OTOH, after all of 10 minutes I can't see what's special about
ginvacuumcleanup() that makes its long-established
RecordFreeIndexPage() call fundamentally safer; if nothing does, that
would be a surprisingly obvious defect to have gone unfixed for all
these years. This is what you yourself said about it, I think. I need
to look at it again with fresh eyes, but offhand, having no safe
interlock around the well-established RecordFreeIndexPage() call for
GIN seems like an implausibly obvious bug.

-- 
Peter Geoghegan



Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Mon, Apr 18, 2016 at 7:48 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>> Added, see attached patch (based on v3.1)
>>
>>
>> With this applied, I am getting a couple errors I have not seen before
>> after extensive crash recovery testing:
>> ERROR:  attempted to delete invisible tuple
>> ERROR:  unexpected chunk number 1 (expected 2) for toast value
>> 100338365 in pg_toast_16425
>
> Huh, it seems this is not related to GIN at all... Indexes don't interact with
> the TOAST machinery. The only place this error can occur is heap_delete() -
> deleting an already-deleted tuple.

Those are two independent errors.  The delete invisible tuple error
doesn't occur on toast tables.

The actual statement triggering the error is an update statement.
Since it is showing up in the delete path, I assume it must be an
update where the new tuple goes to a different page.  But, if the
soon-to-be-old tuple is not visible, why is the update trying to
update it in the first place?  It seems like different parts of the
code disagree on what is visible.

update foo set count=count+1,text_array=$1 where text_array @> $2

I agree it might not have anything to do with gin indexes, but I
didn't see it in testing anything else.  It might be a wrap-around
problem which for some reason only the gin test is efficient at
evoking. What I've done now is apply your v4 patch directly to
e95680832854cf300e64c1 and I am trying to see if it also has the
problem.  If that is clean, then it is probably an independently
introduced bug which is just getting exercised by the gin index stress
test.  If that is the case I'll try to git bisect forward, but that
could take weeks given the runtimes involved.  If that is dirty, then
maybe the FSM vacuuming patch introduced/uncovered more than one bug,
and should be reverted.


>> I've restarted the test harness with intentional crashes turned off,
>> to see if the problems are related to crash recovery or are more
>> generic than that.

I do not see the problem when there is no crash-recovery cycling involved.

I also do not see the problem when compiled under --enable-cassert,
but that could just be because compiling that way makes it too slow to
get in enough testing to hit the bug before I gave up.

>>
>> I've never seen these particular problems before, so don't have much
>> insight into what might be going on or how to debug it.
>
> Check my reasoning: in version 4 I added code that remembers the tail of the
> pending list in the blknoFinish variable. When cleanup reaches the page that
> was the tail at cleanup start, it sets the cleanupFinish flag, and after
> cleaning that page it stops further cleanup. Any insert that happens during
> cleanup will be placed after blknoFinish (corner case: on that very page), so
> vacuum should not miss tuples marked as deleted.

Yes, I agree with the correctness of v4.  But I do wonder if we should
use that early stopping for vacuum and gin_clean_pending_list, rather
than just using it for user backends.  While I think correctness
allows it to stop early, since these routines are explicitly about
cleaning things up it seems like they should volunteer to clean the
whole thing.

Cheers,

Jeff



Re: GIN data corruption bug(s) in 9.6devel

From
Robert Haas
Date:
On Fri, Apr 22, 2016 at 2:20 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Check my reasoning: in version 4 I added code that remembers the tail of the
>> pending list in the blknoFinish variable. When cleanup reaches the page that
>> was the tail at cleanup start, it sets the cleanupFinish flag, and after
>> cleaning that page it stops further cleanup. Any insert that happens during
>> cleanup will be placed after blknoFinish (corner case: on that very page), so
>> vacuum should not miss tuples marked as deleted.
>
> Yes, I agree with the correctness of v4.  But I do wonder if we should
> use that early stopping for vacuum and gin_clean_pending_list, rather
> than just using it for user backends.  While I think correctness
> allows it to stop early, since these routines are explicitly about
> cleaning things up it seems like they should volunteer to clean the
> whole thing.

Not to hold anyone's feet to the fire, but we are running out of time
before beta1, and this seems to be one of the more serious outstanding
issues.  I'd rather not release beta1 with known data corruption bugs,
but that means either fixing whatever is broken here or reverting the
commit that caused or revealed the problem pretty soon.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GIN data corruption bug(s) in 9.6devel

From
Jeff Janes
Date:
On Thu, Apr 21, 2016 at 11:00 PM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
>> >>Added, see attached patch (based on v3.1)
>> >
>> >With this applied, I am getting a couple errors I have not seen before
>> >after extensive crash recovery testing:
>> >ERROR:  attempted to delete invisible tuple
>> >ERROR:  unexpected chunk number 1 (expected 2) for toast value
>> >100338365 in pg_toast_16425
>> Huh, it seems this is not related to GIN at all... Indexes don't interact with
>> the TOAST machinery. The only place this error can occur is heap_delete() -
>> deleting an already-deleted tuple.
>
> Like you, I would not expect gin_alone_cleanup-4.patch to cause such an error.
> I get the impression Jeff has a test case that he had run in many iterations
> against the unpatched baseline.  I also get the impression that a similar or
> smaller number of its iterations against gin_alone_cleanup-4.patch triggered
> these two errors (once apiece, or multiple times?).  Jeff, is that right?

Because the unpatched baseline suffers from the bug which was the
original topic of this thread, I haven't been able to test against the
original baseline.  It would fail from that other bug before it ran
long enough to hit these ones.  Both errors start within a few minutes
of each other, but do not appear to be related other than that.  Once
they start happening, they occur repeatedly.


> Could you describe the test case in sufficient detail for Teodor to reproduce
> your results?

I spawn a swarm of processes to update a counter in a randomly chosen
row, selected via the gin index.  They do this as fast as possible
until the server intentionally crashes.  When it recovers, they
compare notes and see if the results are consistent.

But in this case, the problem is not inconsistent results, but rather
errors during the updating stage.

The patch introduces a mechanism to crash the server, a mechanism to
fast-forward the XID, and some additional logging that I sometimes
find useful (I haven't been using it in this case, but I don't want to
rip it out).

The perl script implements the core of the test harness.

The shell script sets up the server (using hard coded paths for the
data and the binaries, so will need to be changed), and then calls the
perl script in a loop.

Running on an 8 core system, I've never seen it have a problem in less
than 9 hours of run time.
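
Purely to show the shape of the per-worker workload (the harness itself is the
attached Perl and shell scripts; the database name and the key format below
are placeholders, while the table and UPDATE come from the statement shown
upthread), each worker does roughly this:

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=test");      /* placeholder connection string */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Hammer rows chosen through the GIN index until the server goes away. */
    for (;;)
    {
        char        key[64];
        const char *params[1];
        PGresult   *res;

        snprintf(key, sizeof(key), "{key%d}", rand() % 10000);   /* placeholder key */
        params[0] = key;

        res = PQexecParams(conn,
                           "UPDATE foo SET count = count + 1 "
                           "WHERE text_array @> $1::text[]",
                           1, NULL, params, NULL, NULL, 0);

        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "ERROR: %s", PQerrorMessage(conn));
        PQclear(res);

        /* After the intentional crash, the harness restarts and compares notes. */
        if (PQstatus(conn) == CONNECTION_BAD)
            break;
    }

    PQfinish(conn);
    return 0;
}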

This produces copious amounts of logging to stdout.  This is how I
look through the log files to find this particular problem:

sh do.sh >& do.err &

tail -f do.err | fgrep ERROR

The more I think about it, the more I think gin is just an innocent
bystander, for which I just happen to have a particularly demanding
test.  I think something about snapshots and wrap-around may be
broken.

Cheers,

Jeff

Attachment

Re: GIN data corruption bug(s) in 9.6devel

From
Noah Misch
Date:
On Fri, Apr 22, 2016 at 02:03:01PM -0700, Jeff Janes wrote:
> On Thu, Apr 21, 2016 at 11:00 PM, Noah Misch <noah@leadboat.com> wrote:
> > Could you describe the test case in sufficient detail for Teodor to reproduce
> > your results?
> 
> [detailed description and attachments]

Thanks.

> The more I think about it, the more I think gin is just an innocent
> bystander, for which I just happen to have a particularly demanding
> test.  I think something about snapshots and wrap-around may be
> broken.

Based on your review, I think the next step on $subject is for Teodor to
either commit gin_alone_cleanup-4.patch or to tell us what he'd like to see or
do before committing that patch.

While the failures during your crash recovery testing may not even be 9.6
regressions, they suggest the presence of serious bugs.  At a minimum, the
community should confirm they are not 9.6 regressions before releasing 9.6.  I
have added an open item to that effect.



Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
>> Check my reasoning: in version 4 I added code that remembers the tail of the
>> pending list in the blknoFinish variable. When cleanup reaches the page that
>> was the tail at cleanup start, it sets the cleanupFinish flag, and after
>> cleaning that page it stops further cleanup. Any insert that happens during
>> cleanup will be placed after blknoFinish (corner case: on that very page), so
>> vacuum should not miss tuples marked as deleted.
>
> Yes, I agree with the correctness of v4.  But I do wonder if we should
> use that early stopping for vacuum and gin_clean_pending_list, rather
Interesting, I had missed this possible option.

> than just using it for user backends.  While I think correctness
> allows it to stop early, since these routines are explicitly about
> cleaning things up it seems like they should volunteer to clean the
> whole thing.

I believe that autovacuum should not require a guaranteed full cleanup; only
VACUUM and gin_clean_pending_list() should do that. In all other cases cleanup
should stop early to prevent a potentially endless cleanup. Patch attached.
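
In other words, the intended policy is roughly this (a sketch only; the flag
name is mine, not necessarily what the attached patch uses):

/*
 * Callers that promise a complete cleanup (VACUUM, gin_clean_pending_list())
 * pass full_clean = true and ignore the early-stop point; autovacuum and
 * ordinary backends pass false and stop at the page that was the tail of
 * the pending list when cleanup started.
 */
#include <stdbool.h>

typedef unsigned int BlockNumber;

bool
cleanup_may_stop(BlockNumber blkno, BlockNumber blknoFinish, bool full_clean)
{
    return !full_clean && blkno == blknoFinish;
}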


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: GIN data corruption bug(s) in 9.6devel

From
Noah Misch
Date:
On Tue, Apr 26, 2016 at 08:22:03PM +0300, Teodor Sigaev wrote:
> >>Check my reasoning: in version 4 I added code that remembers the tail of the
> >>pending list in the blknoFinish variable. When cleanup reaches the page that
> >>was the tail at cleanup start, it sets the cleanupFinish flag, and after
> >>cleaning that page it stops further cleanup. Any insert that happens during
> >>cleanup will be placed after blknoFinish (corner case: on that very page), so
> >>vacuum should not miss tuples marked as deleted.
> >
> >Yes, I agree with the correctness of v4.  But I do wonder if we should
> >use that early stopping for vacuum and gin_clean_pending_list, rather
> Interesting, I had missed this possible option.
> 
> >than just using it for user backends.  While I think correctness
> >allows it to stop early, since these routines are explicitly about
> >cleaning things up it seems like they should volunteer to clean the
> >whole thing.
> 
> I believe that autovacuum should not require a guaranteed full cleanup; only
> VACUUM and gin_clean_pending_list() should do that. In all other cases cleanup
> should stop early to prevent a potentially endless cleanup. Patch attached.

Teodor, this item remains open after twenty-three days of you owning it.  At
this pace, 9.6beta1 will have an unreliable GIN implementation.  What things
must happen so you can close this item, and when will you do those things?



Re: GIN data corruption bug(s) in 9.6devel

From
Teodor Sigaev
Date:
> The more I think about it, the more I think gin is just an innocent
> bystander, for which I just happen to have a particularly demanding
> test.  I think something about snapshots and wrap-around may be
> broken.

After 10 hours of running I've got
1587  XX000 2016-04-28 05:57:09.964 MSK:ERROR:  unexpected chunk number 0 
(expected 1) for toast value 10192986 in pg_toast_16424
1587  XX000 2016-04-28 05:57:09.964 MSK:CONTEXT:  automatic analyze of table 
"teodor.public.foo"
1587  00000 2016-04-28 05:57:09.974 MSK:LOG:  JJ transaction ID wrap limit is 
2147484514, limited by database with OID 16384


I've pushed v5 of the gin-alone-cleanup patch; many thanks to all participants.
Will work on the backpatch.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/