Thread: GIN data corruption bug(s) in 9.6devel
Hi,

while repeating some full-text benchmarks on master, I've discovered that there's a data corruption bug somewhere. What happens is that while loading data into a table with GIN indexes (using multiple parallel connections), I sometimes get this:

TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", File: "ginfast.c", Line: 537)
LOG: server process (PID 22982) was terminated by signal 6: Aborted
DETAIL: Failed process was running: autovacuum: ANALYZE messages

The details of the assert are always exactly the same - it's always autovacuum and it trips on exactly the same check. And the backtrace always looks like this (full backtrace attached):

#0 0x00007f133b635045 in raise () from /lib64/libc.so.6
#1 0x00007f133b6364ea in abort () from /lib64/libc.so.6
#2 0x00000000007dc007 in ExceptionalCondition (conditionName=conditionName@entry=0x81a088 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=errorType@entry=0x81998b "FailedAssertion", fileName=fileName@entry=0x83480a "ginfast.c", lineNumber=lineNumber@entry=537) at assert.c:54
#3 0x00000000004894aa in shiftList (stats=0x0, fill_fsm=1 '\001', newHead=26357, metabuffer=130744, index=0x7f133c0f7518) at ginfast.c:537
#4 ginInsertCleanup (ginstate=ginstate@entry=0x7ffd98ac9160, vac_delay=vac_delay@entry=1 '\001', fill_fsm=fill_fsm@entry=1 '\001', stats=stats@entry=0x0) at ginfast.c:908
#5 0x00000000004874f7 in ginvacuumcleanup (fcinfo=<optimized out>) at ginvacuum.c:662
...

It's not perfectly deterministic - sometimes I had to repeat the whole load multiple times (up to 5x, and each load takes ~30 minutes). I'm pretty sure this is not an external issue, because I've reproduced it on a different machine (different CPU / kernel / libraries / compiler).

It's however interesting that on the other machine I've also observed a different kind of lockup, where the sessions get stuck on semop() in gininsert (again, full backtrace attached):

#0 0x0000003f3d4eaf97 in semop () from /lib64/libc.so.6
#1 0x000000000067a41f in PGSemaphoreLock (sema=0x7f93290405d8) at pg_sema.c:387
#2 0x00000000006df613 in LWLockAcquire (lock=0x7f92a4dce900, mode=LW_EXCLUSIVE) at lwlock.c:1049
#3 0x00000000004878c6 in ginHeapTupleFastInsert (ginstate=0x7ffd969c88f0, collector=0x7ffd969caeb0) at ginfast.c:250
#4 0x000000000047423a in gininsert (fcinfo=<value optimized out>) at gininsert.c:531
...

I'm not sure whether this is a different manifestation of the same issue or another bug. The systems are not exactly the same - one has a single socket (i5) while the other one has 2 (Xeon), the compilers and kernels are different, and so on.

I've also seen cases where the load seemingly completed OK, but trying to dump the table to disk using COPY resulted in

ERROR: compressed data is corrupted

which I find rather strange, as there was no previous error, and COPY should only dump table data (while the asserts were in GIN code handling index pages, unless I'm mistaken).

Seems like a case of insufficient locking where two backends scribble on the same page somehow, and then autovacuum hits the assert. Or maybe not, I'm not sure.

I've been unable to reproduce the issue on REL9_5_STABLE (despite running the load ~20x on each machine), so that seems safe, and the issue was introduced by some of the newer commits. I've already spent too much CPU time on this, so perhaps someone with better knowledge of the GIN code can take care of this.
To reproduce it you may use the same code I did - it's available here:

https://bitbucket.org/tvondra/archie

It's a PoC of a database with pgsql mailing lists with full-text search. It's a bit messy, but it's rather simple:

1) clone the repo

$ git clone https://bitbucket.org/tvondra/archie.git

2) create a directory for downloading the mbox files

$ mkdir archie-data

3) download the mbox files (~4.5GB of data) using the download script (make sure archie/bin is on PATH)

$ cd archie-data
$ export PATH=../archie/bin:$PATH
$ ../archie/download

4) use the "run" script (attached) to run the load n-times on a given commit

$ run.sh master 10

NOTICE: The run script is the messiest one, you'll have to edit it to fix paths etc.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
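For reference, the COPY-based consistency check mentioned above amounts to a single statement against the benchmark's messages table (server-side COPY to a file requires superuser); it forces every row to be read and every compressed/toasted value to be decompressed, which is what surfaced the "compressed data is corrupted" error:

COPY messages TO '/dev/null';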
On Thu, Nov 5, 2015 at 2:18 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> while repeating some full-text benchmarks on master, I've discovered that there's a data corruption bug somewhere. What happens is that while loading data into a table with GIN indexes (using multiple parallel connections), I sometimes get this:
>
> TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", File: "ginfast.c", Line: 537)
> LOG: server process (PID 22982) was terminated by signal 6: Aborted
> DETAIL: Failed process was running: autovacuum: ANALYZE messages
>
> The details of the assert are always exactly the same - it's always autovacuum and it trips on exactly the same check. And the backtrace always looks like this (full backtrace attached):
>
> #0 0x00007f133b635045 in raise () from /lib64/libc.so.6
> #1 0x00007f133b6364ea in abort () from /lib64/libc.so.6
> #2 0x00000000007dc007 in ExceptionalCondition (conditionName=conditionName@entry=0x81a088 "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=errorType@entry=0x81998b "FailedAssertion", fileName=fileName@entry=0x83480a "ginfast.c", lineNumber=lineNumber@entry=537) at assert.c:54
> #3 0x00000000004894aa in shiftList (stats=0x0, fill_fsm=1 '\001', newHead=26357, metabuffer=130744, index=0x7f133c0f7518) at ginfast.c:537
> #4 ginInsertCleanup (ginstate=ginstate@entry=0x7ffd98ac9160, vac_delay=vac_delay@entry=1 '\001', fill_fsm=fill_fsm@entry=1 '\001', stats=stats@entry=0x0) at ginfast.c:908
> #5 0x00000000004874f7 in ginvacuumcleanup (fcinfo=<optimized out>) at ginvacuum.c:662
> ...

This looks like it is probably the same bug discussed here:

http://www.postgresql.org/message-id/CAMkU=1xALfLhUUohFP5v33RzedLVb5aknNUjcEuM9KNBKrB6-Q@mail.gmail.com

And here:

http://www.postgresql.org/message-id/56041B26.2040902@sigaev.ru

The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit e95680832854cf300e64c) that free pages were recycled aggressively enough that it actually becomes likely to be hit.

There are some proposed patches in those threads, but discussion on them seems to have stalled out. Can you try one and see if it fixes the problems you are seeing?

Cheers,

Jeff
On 11/05/2015 11:44 PM, Jeff Janes wrote:
>
> This looks like it is probably the same bug discussed here:
>
> http://www.postgresql.org/message-id/CAMkU=1xALfLhUUohFP5v33RzedLVb5aknNUjcEuM9KNBKrB6-Q@mail.gmail.com
>
> And here:
>
> http://www.postgresql.org/message-id/56041B26.2040902@sigaev.ru
>
> The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit e95680832854cf300e64c) that free pages were recycled aggressively enough that it actually becomes likely to be hit.

I have only quickly skimmed the discussions, but my impression was that it's mostly about removing stuff that shouldn't be removed and such. But maybe there are race conditions that cause data corruption. I don't really want to dive too deeply into this, I've already spent too much time trying to reproduce it.

>
> There are some proposed patches in those threads, but discussion on them seems to have stalled out. Can you try one and see if it fixes the problems you are seeing?

I can do that - I see there are three patches in the two threads:

1) gin_pending_lwlock.patch (Jeff Janes)
2) gin_pending_pagelock.patch (Jeff Janes)
3) gin_alone_cleanup-2.patch (Teodor Sigaev)

Should I test all of them? Or is (1) obsoleted by (2), for example?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 5, 2015 at 3:50 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On 11/05/2015 11:44 PM, Jeff Janes wrote:
>>
>> This looks like it is probably the same bug discussed here:
>>
>> http://www.postgresql.org/message-id/CAMkU=1xALfLhUUohFP5v33RzedLVb5aknNUjcEuM9KNBKrB6-Q@mail.gmail.com
>>
>> And here:
>>
>> http://www.postgresql.org/message-id/56041B26.2040902@sigaev.ru
>>
>> The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit e95680832854cf300e64c) that free pages were recycled aggressively enough that it actually becomes likely to be hit.
>
> I have only quickly skimmed the discussions, but my impression was that it's mostly about removing stuff that shouldn't be removed and such. But maybe there are race conditions that cause data corruption. I don't really want to dive too deeply into this, I've already spent too much time trying to reproduce it.
>
>> There are some proposed patches in those threads, but discussion on them seems to have stalled out. Can you try one and see if it fixes the problems you are seeing?
>
> I can do that - I see there are three patches in the two threads:
>
> 1) gin_pending_lwlock.patch (Jeff Janes)
> 2) gin_pending_pagelock.patch (Jeff Janes)
> 3) gin_alone_cleanup-2.patch (Teodor Sigaev)
>
> Should I test all of them? Or is (1) obsoleted by (2) for example?

1 is obsolete. Either 2 or 3 should fix the bug, provided this is the bug you are seeing. They have different performance side effects, but as far as fixing the bug they should be equivalent.

Cheers,

Jeff
Hi,

On 11/06/2015 01:05 AM, Jeff Janes wrote:
> On Thu, Nov 5, 2015 at 3:50 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
...
>>
>> I can do that - I see there are three patches in the two threads:
>>
>> 1) gin_pending_lwlock.patch (Jeff Janes)
>> 2) gin_pending_pagelock.patch (Jeff Janes)
>> 3) gin_alone_cleanup-2.patch (Teodor Sigaev)
>>
>> Should I test all of them? Or is (1) obsoleted by (2) for example?
>
> 1 is obsolete. Either 2 or 3 should fix the bug, provided this is the bug you are seeing. They have different performance side effects, but as far as fixing the bug they should be equivalent.

OK, I'll do testing with those two patches then, and I'll also note the performance difference (the data load was very stable). Of course, it's just one particular workload.

I'll post an update after the weekend.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 11/06/2015 02:09 AM, Tomas Vondra wrote:
> Hi,
>
> On 11/06/2015 01:05 AM, Jeff Janes wrote:
>> On Thu, Nov 5, 2015 at 3:50 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> ...
>>>
>>> I can do that - I see there are three patches in the two threads:
>>>
>>> 1) gin_pending_lwlock.patch (Jeff Janes)
>>> 2) gin_pending_pagelock.patch (Jeff Janes)
>>> 3) gin_alone_cleanup-2.patch (Teodor Sigaev)
>>>
>>> Should I test all of them? Or is (1) obsoleted by (2) for example?
>>
>> 1 is obsolete. Either 2 or 3 should fix the bug, provided this is the bug you are seeing. They have different performance side effects, but as far as fixing the bug they should be equivalent.
>
> OK, I'll do testing with those two patches then, and I'll also note the performance difference (the data load was very stable). Of course, it's just one particular workload.
>
> I'll post an update after the weekend.

I've finally managed to test the two patches. Sorry for the delay.

I've repeated the workload on 9.5, 9.6 and 9.6 with (1) or (2), looking for lockups or data corruption. I've also measured duration of the script, to see what's the impact on performance. The configuration (shared_buffers, work_mem ...) was exactly the same in all cases.

9.5     : runtime ~1380 seconds
9.6     : runtime ~1380 seconds (but lockups and data corruption)
9.6+(1) : runtime ~1380 seconds
9.6+(2) : runtime ~1290 seconds

So both patches seem to do the trick, but (2) is faster. Not sure if this is expected. (BTW all the results are without asserts enabled).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On 11/06/2015 02:09 AM, Tomas Vondra wrote:
>>
>> Hi,
>>
>> On 11/06/2015 01:05 AM, Jeff Janes wrote:
>>>
>>> On Thu, Nov 5, 2015 at 3:50 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> ...
>>>>
>>>> I can do that - I see there are three patches in the two threads:
>>>>
>>>> 1) gin_pending_lwlock.patch (Jeff Janes)
>>>> 2) gin_pending_pagelock.patch (Jeff Janes)
>>>> 3) gin_alone_cleanup-2.patch (Teodor Sigaev)
>>>>
>>>> Should I test all of them? Or is (1) obsoleted by (2) for example?
>>>
>>> 1 is obsolete. Either 2 or 3 should fix the bug, provided this is the bug you are seeing. They have different performance side effects, but as far as fixing the bug they should be equivalent.
>>
>> OK, I'll do testing with those two patches then, and I'll also note the performance difference (the data load was very stable). Of course, it's just one particular workload.
>>
>> I'll post an update after the weekend.
>
> I've finally managed to test the two patches. Sorry for the delay.
>
> I've repeated the workload on 9.5, 9.6 and 9.6 with (1) or (2), looking for lockups or data corruption. I've also measured duration of the script, to see what's the impact on performance. The configuration (shared_buffers, work_mem ...) was exactly the same in all cases.
>
> 9.5     : runtime ~1380 seconds
> 9.6     : runtime ~1380 seconds (but lockups and data corruption)
> 9.6+(1) : runtime ~1380 seconds
> 9.6+(2) : runtime ~1290 seconds
>
> So both patches seem to do the trick, but (2) is faster. Not sure if this is expected. (BTW all the results are without asserts enabled).

Do you know what the size of the pending list was at the end of each test?

I think the last one may be faster because it left a large mess behind that someone needs to clean up later.

Also, do you have the final size of the indexes in each case?

Cheers,

Jeff
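A query along these lines captures the final index sizes; it assumes the benchmark's "messages" table and uses only the standard statistics view and size functions:

SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS size
  FROM pg_stat_user_indexes
 WHERE relname = 'messages';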
On 12/21/2015 07:41 PM, Jeff Janes wrote:
> On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
...
>> So both patches seem to do the trick, but (2) is faster. Not sure if this is expected. (BTW all the results are without asserts enabled).
>
> Do you know what the size of the pending list was at the end of each test?
>
> I think last one may be faster because it left a large mess behind that someone needs to clean up later.

No. How do I measure it?

>
> Also, do you have the final size of the indexes in each case?

No, I hadn't realized the patches affect that, so I haven't measured it.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Dec 21, 2015 at 11:51 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On 12/21/2015 07:41 PM, Jeff Janes wrote:
>>
>> On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> ...
>
>>> So both patches seem to do the trick, but (2) is faster. Not sure if this is expected. (BTW all the results are without asserts enabled).
>>
>> Do you know what the size of the pending list was at the end of each test?
>>
>> I think last one may be faster because it left a large mess behind that someone needs to clean up later.
>
> No. How do I measure it?

pageinspect's gin_metapage_info, or pgstattuple's pgstatginindex

>
>> Also, do you have the final size of the indexes in each case?
>
> No, I haven't realized the patches do affect that, so I haven't measured it.

There shouldn't be a difference between the two approaches (although I guess there could be if one left a larger pending list than the other, as the pending list is very space inefficient), but since you included 9.5 in your test I thought it would be interesting to see how either patched version under 9.6 compared to 9.5.

Cheers,

Jeff
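Concretely, either of those reports the pending-list state; the index name below is only a placeholder for one of the benchmark's GIN indexes:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatginindex('messages_body_idx');  -- version, pending_pages, pending_tuples

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM gin_metapage_info(get_raw_page('messages_body_idx', 0));  -- raw metapage, incl. pending_head/pending_tail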
Hi,

On 12/23/2015 09:33 PM, Jeff Janes wrote:
> On Mon, Dec 21, 2015 at 11:51 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On 12/21/2015 07:41 PM, Jeff Janes wrote:
>>>
>>> On Sat, Dec 19, 2015 at 3:19 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> ...
>>
>>>> So both patches seem to do the trick, but (2) is faster. Not sure if this is expected. (BTW all the results are without asserts enabled).
>>>
>>> Do you know what the size of the pending list was at the end of each test?
>>>
>>> I think last one may be faster because it left a large mess behind that someone needs to clean up later.
>>
>> No. How do I measure it?
>
> pageinspect's gin_metapage_info, or pgstattuple's pgstatginindex

Hmmm, so this turns out not to be very useful, because at the end the data I get from gin_metapage_info is almost exactly the same for both patches (more details below).

>>
>>> Also, do you have the final size of the indexes in each case?
>>
>> No, I haven't realized the patches do affect that, so I haven't measured it.
>
> There shouldn't be a difference between the two approaches (although I guess there could be if one left a larger pending list than the other, as pending lists is very space inefficient), but since you included 9.5 in your test I thought it would be interesting to see how either patched version under 9.6 compared to 9.5.

Well, turns out there's a quite significant difference, actually. The index sizes I get (quite stable after multiple runs):

9.5                 : 2428 MB
9.6 + alone cleanup :  730 MB
9.6 + pending lock  :  488 MB

So that's quite a significant difference, I guess. The load duration for each version looks like this:

9.5                 : 1415 seconds
9.6 + alone cleanup : 1310 seconds
9.6 + pending lock  : 1380 seconds

I'd say I'm happy with sacrificing ~5% of time in exchange for ~35% reduction of index size.

The size of the index on 9.5 after VACUUM FULL (so pretty much the smallest index possible) is 440MB, which suggests the "pending lock" patch does a quite good job.

The gin_metapage_info at the end of one of the runs (pretty much all the runs look exactly the same) looks like this:

                   pending lock   alone cleanup       9.5
 --------------------------------------------------------
 pending_head                 2               2    310460
 pending_tail               338             345    310806
 tail_free_size             812             812       812
 n_pending_pages            330             339       347
 n_pending_tuples          1003            1037      1059
 n_total_pages                2               2         2
 n_entry_pages                1               1         1
 n_data_pages                 0               0         0
 n_entries                    0               0         0
 version                      2               2         2

So almost no difference, except for the pending_* attributes, and even in that case the values are only different for the 9.5 branch. Not sure what conclusion to draw from this - maybe it's necessary to collect the function output while the load is running (but that'd be tricky to process, I guess).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 01/05/2016 10:38 AM, Tomas Vondra wrote:
> Hi,
...
>>
>> There shouldn't be a difference between the two approaches (although I guess there could be if one left a larger pending list than the other, as pending lists is very space inefficient), but since you included 9.5 in your test I thought it would be interesting to see how either patched version under 9.6 compared to 9.5.
>
> Well, turns out there's a quite significant difference, actually. The index sizes I get (quite stable after multiple runs):
>
> 9.5                 : 2428 MB
> 9.6 + alone cleanup :  730 MB
> 9.6 + pending lock  :  488 MB
>
> So that's quite a significant difference, I guess. The load duration for each version looks like this:
>
> 9.5                 : 1415 seconds
> 9.6 + alone cleanup : 1310 seconds
> 9.6 + pending lock  : 1380 seconds
>
> I'd say I'm happy with sacrificing ~5% of time in exchange for ~35% reduction of index size.
>
> The size of the index on 9.5 after VACUUM FULL (so pretty much the smallest index possible) is 440MB, which suggests the "pending lock" patch does a quite good job.
>
> The gin_metapage_info at the end of one of the runs (pretty much all the runs look exactly the same) looks like this:
>
>                    pending lock   alone cleanup       9.5
>  --------------------------------------------------------
>  pending_head                 2               2    310460
>  pending_tail               338             345    310806
>  tail_free_size             812             812       812
>  n_pending_pages            330             339       347
>  n_pending_tuples          1003            1037      1059
>  n_total_pages                2               2         2
>  n_entry_pages                1               1         1
>  n_data_pages                 0               0         0
>  n_entries                    0               0         0
>  version                      2               2         2
>
> So almost no difference, except for the pending_* attributes, and even in that case the values are only different for the 9.5 branch. Not sure what conclusion to draw from this - maybe it's necessary to collect the function output while the load is running (but that'd be tricky to process, I guess).

Are we going to do anything about this? While the bug is present in 9.5 (and possibly other versions), fixing it before 9.6 gets out seems important, because reproducing it there is rather trivial (while I've been unable to reproduce it on 9.5).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 24, 2016 at 9:17 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Well, turns out there's a quite significant difference, actually. The index sizes I get (quite stable after multiple runs):
>>
>> 9.5                 : 2428 MB
>> 9.6 + alone cleanup :  730 MB
>> 9.6 + pending lock  :  488 MB
>>
>> So that's quite a significant difference, I guess. The load duration for each version look like this:
>>
>> 9.5                 : 1415 seconds
>> 9.6 + alone cleanup : 1310 seconds
>> 9.6 + pending lock  : 1380 seconds
>>
>> I'd say I'm happy with sacrificing ~5% of time in exchange for ~35% reduction of index size.
>>
>> The size of the index on 9.5 after VACUUM FULL (so pretty much the smallest index possible) is 440MB, which suggests the "pending lock" patch does a quite good job.
>>
>> The gin_metapage_info at the end of one of the runs (pretty much all the runs look exactly the same) looks like this:
>>
>>                    pending lock   alone cleanup       9.5
>>  --------------------------------------------------------
>>  pending_head                 2               2    310460
>>  pending_tail               338             345    310806
>>  tail_free_size             812             812       812
>>  n_pending_pages            330             339       347
>>  n_pending_tuples          1003            1037      1059
>>  n_total_pages                2               2         2
>>  n_entry_pages                1               1         1
>>  n_data_pages                 0               0         0
>>  n_entries                    0               0         0
>>  version                      2               2         2
>>
>> So almost no difference, except for the pending_* attributes, and even in that case the values are only different for 9.5 branch. Not sure what conclusion to draw from this - maybe it's necessary to collect the function input while the load is running (but that'd be tricky to process, I guess).
>
> Are we going to anything about this? While the bug is present in 9.5 (and possibly other versions), fixing it before 9.6 gets out seems important because reproducing it there is rather trivial (while I've been unable to reproduce it on 9.5).

I'm not going to promise to commit anything here, because GIN is not usually my area, but could you provide a link to the email that contains the patch you think should be committed?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/24/2016 06:56 AM, Robert Haas wrote:
> On Wed, Feb 24, 2016 at 9:17 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
...
>> Are we going to anything about this? While the bug is present in 9.5 (and possibly other versions), fixing it before 9.6 gets out seems important because reproducing it there is rather trivial (while I've been unable to reproduce it on 9.5).
>
> I'm not going to promise to commit anything here, because GIN is not usually my area, but could you provide a link to the email that contains the patch you think should be committed?

Sure. There are actually two candidate patches in two separate threads, and I'm not sure which one is better. Based on the testing, both seem to fix the issue, and the "pending lock" patch produces much smaller indexes (at least in my benchmark):

[1] http://www.postgresql.org/message-id/56041B26.2040902@sigaev.ru
[2] http://www.postgresql.org/message-id/CAMkU=1w7Uu1GZ8N0bxMykRLgTh-uPH+GPHfhMNeryZPCv7fqdA@mail.gmail.com

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for remembering this problem, at least for me.

>> Well, turns out there's a quite significant difference, actually. The index sizes I get (quite stable after multiple runs):
>>
>> 9.5                 : 2428 MB
>> 9.6 + alone cleanup :  730 MB
>> 9.6 + pending lock  :  488 MB

Interesting, I don't see why alone_cleanup and pending_lock differ so much. I'd like to understand that, but does somebody have a good theory? The one sticking point in the pending_lock patch is a suspicious exception in ProcSleep; this exception may cause problems in the future.

>>
>> So that's quite a significant difference, I guess. The load duration for each version look like this:
>>
>> 9.5                 : 1415 seconds
>> 9.6 + alone cleanup : 1310 seconds
>> 9.6 + pending lock  : 1380 seconds
>>
>> I'd say I'm happy with sacrificing ~5% of time in exchange for ~35% reduction of index size.

I think the alone_cleanup patch is faster because a regular insert can break off its cleanup process if autovacuum is waiting to begin work on the cleanup. So the inserting process can return from the pending-list cleanup earlier.

In the attachment is just a rebased v2 of the alone_cleanup patch.

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
Attachment
>> Well, turns out there's a quite significant difference, actually. The index sizes I get (quite stable after multiple runs):
>>
>> 9.5                 : 2428 MB
>> 9.6 + alone cleanup :  730 MB
>> 9.6 + pending lock  :  488 MB

In the attachment is a modified alone_cleanup patch which doesn't break off the cleanup process, just as the pending_lock patch doesn't, but which doesn't touch the lock management.

Tomas, if you can, please repeat the test with this patch. If not, I will try to do it, but later.

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
Attachment
Hi,

On 02/25/2016 05:32 PM, Teodor Sigaev wrote:
>>> Well, turns out there's a quite significant difference, actually. The index sizes I get (quite stable after multiple runs):
>>>
>>> 9.5                 : 2428 MB
>>> 9.6 + alone cleanup :  730 MB
>>> 9.6 + pending lock  :  488 MB
>
> In attach modified alone_cleanup patch which doesn't break cleanup process as it does pending_lock patch but attached patch doesn't touch a lock management.
>
> Tomas. if you can, pls, repeat test with this patch. If not, I will try to do it, but later.

I'll do that once the system I used for that becomes available - right now it's running other benchmarks.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 24, 2016 at 8:51 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you for remembering this problem, at least for me.
>
>>> Well, turns out there's a quite significant difference, actually. The index sizes I get (quite stable after multiple runs):
>>>
>>> 9.5                 : 2428 MB
>>> 9.6 + alone cleanup :  730 MB
>>> 9.6 + pending lock  :  488 MB
>
> Interesting, I don't see why alone_cleanup and pending_lock are so differ. I'd like to understand that, but does somebody have an good theory?

Under my patch, anyone who wanted to do a clean up and detected someone else was doing one would wait for the concurrent one to end. (This is more consistent with the existing behavior, I just made it so they don't do any damage while they wait.)

Under your patch, if a backend wants to do a clean up and detects someone else is already doing one, it would just skip the clean up and proceed on with whatever it was doing. This allows one process (hopefully a vacuum, but maybe a user backend) to get pinned down indefinitely, as other processes keep putting stuff onto the end of the pending_list with no throttle.

Since the freespace recycling only takes place once the list is completely cleaned, allowing some processes to add to the end while one poor process is trying to clean can lead to less effective recycling.

That is my theory, anyway.

Cheers,

Jeff
On Thu, Feb 25, 2016 at 11:19:20AM -0800, Jeff Janes wrote:
> On Wed, Feb 24, 2016 at 8:51 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> > Thank you for remembering this problem, at least for me.
> >
> >>> Well, turns out there's a quite significant difference, actually. The index sizes I get (quite stable after multiple runs):
> >>>
> >>> 9.5                 : 2428 MB
> >>> 9.6 + alone cleanup :  730 MB
> >>> 9.6 + pending lock  :  488 MB
> >
> > Interesting, I don't see why alone_cleanup and pending_lock are so differ. I'd like to understand that, but does somebody have an good theory?
>
> Under my patch, anyone who wanted to do a clean up and detected someone else was doing one would wait for the concurrent one to end. (This is more consistent with the existing behavior, I just made it so they don't do any damage while they wait.)
>
> Under your patch, if a backend wants to do a clean up and detects someone else is already doing one, it would just skip the clean up and proceed on with whatever it was doing. This allows one process (hopefully a vacuum, but maybe a user backend) to get pinned down indefinitely, as other processes keep putting stuff onto the end of the pending_list with no throttle.
>
> Since the freespace recycling only takes place once the list is completely cleaned, allowing some processes to add to the end while one poor process is trying to clean can lead to less effective recycling.
>
> That is my theory, anyway.

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item. Teodor, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.
> The above-described topic is currently a PostgreSQL 9.6 open item. Teodor, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.

I'm waiting on Tomas's testing. I suppose it isn't possible now? If so, I will do it myself, and after that I will publish the results.

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
On 04/04/2016 02:06 PM, Teodor Sigaev wrote:
>> The above-described topic is currently a PostgreSQL 9.6 open item. Teodor, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.
>
> I'm waiting of Tomas testing. Suppose, it isn't possible now? so, will do myself, after that I will publish results.

Ah, damn. This completely slipped from my TODO list. I'll rerun the tests either today or tomorrow, and report the results here.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 04/04/2016 02:25 PM, Tomas Vondra wrote:
> On 04/04/2016 02:06 PM, Teodor Sigaev wrote:
>>> The above-described topic is currently a PostgreSQL 9.6 open item. Teodor, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.
>>
>> I'm waiting of Tomas testing. Suppose, it isn't possible now? so, will do myself, after that I will publish results.
>
> Ah, damn. This completely slipped from my TODO list. I'll rerun the tests either today or tomorrow, and report the results here.

So, I've done some testing on the patch, and in short it seems fine to me. It fixes the bug (no data corruption observed), and the performance seems fine too.

I've tested the v2, v3 and v3.1 of the patch, to see if there are any differences. The v2 no longer applies, so I tested it on ee943004. The following table shows the total duration of the data load, and also sizes of the two GIN indexes.

        duration (sec)   subject     body
---------------------------------------------------------------
v2                1290     23 MB   684 MB
v3                1360     24 MB   487 MB
v3.1              1360     24 MB   488 MB

So, looks good to me.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I've tested the v2, v3 and v3.1 of the patch, to see if there are any differences. The v2 no longer applies, so I tested it on ee943004. The following table shows the total duration of the data load, and also sizes of the two GIN indexes.
>
>         duration (sec)   subject     body
> ---------------------------------------------------------------
> v2                1290     23 MB   684 MB
> v3                1360     24 MB   487 MB
> v3.1              1360     24 MB   488 MB

Thank you very much.

Hmm. v3 is just a rebased version of v2; v3.1 doesn't have the unlock/lock cycle during cleanup, just to become similar to Jeff's pending lock patch. In theory, v2 and v3 should be very close, and v3.1 should be close to pending_lock.

I'm inclined to push v3.1 as one of the two winners by size/performance; unlike the pending lock patch, it doesn't change the internal logic of the lock machinery.

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
On Wed, Apr 6, 2016 at 9:52 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> I'm inclining to push v3.1 as one of two winners by size/performance and, unlike to pending lock patch, it doesn't change an internal logic of lock machinery.

This restricts the memory used by ordinary backends when doing the cleanup to be work_mem. Shouldn't we let them use maintenance_work_mem? Only one backend can be doing this clean up of a given index at any given time, so we don't need to worry about many concurrent allocations of maintenance_work_mem. This seems very similar in spirit to index creation, where a backend is allowed to use maintenance_work_mem.

Also, do we plan on backpatching this? While there are no known instances of anyone hitting this bug in any released code, still it is a bug. There are plenty of reports of corrupted indexes which never get properly diagnosed, and for all we know some of them might be due to this. On the other hand, it is a substantial behavior change, which is scary to backpatch. (But all other proposed solutions also have behavior changes of one kind or another.)

To summarize the bugs in the already-released code:

If a vacuum's dead list includes a tuple which is still in the pending list, and the vacuum skips the pending list cleanup because it detects someone else is already working on it, then the vacuum could remove that tuple from the table, even though there is still a reference to it in the index.

A page could be deleted from the pending list, put on the free space map, then reused. When the linked list of pending pages is followed onto this reused page, it will fail to recognize it as being deleted and recycled, and bad things will happen. This is quite likely to happen in 9.6 because of the new aggressive recycling of pages. But I've seen no argument as to why it is impossible to hit this bug in released code; it seems like it would be possible to hit it, but just on a very rare basis.

To summarize the behavior change:

In the released code, an inserting backend that violates the pending list limit will try to clean the list, even if it is already being cleaned. It won't accomplish anything useful, but will go through the motions until eventually it runs into a page the lead cleaner has deleted, at which point it realizes there is another cleaner and it stops. This acts as a natural throttle to how fast insertions can take place into an over-sized pending list.

The proposed change removes that throttle, so that inserters will immediately see there is already a cleaner and just go back about their business. Due to that, unthrottled backends could add to the pending list faster than the cleaner can clean it, leading to unbounded growth in the pending list, and could cause a user backend to become apparently unresponsive to the user, indefinitely. That is scary to backpatch.

Cheers,

Jeff
Jeff Janes wrote:
> The proposed change removes that throttle, so that inserters will immediately see there is already a cleaner and just go back about their business. Due to that, unthrottled backends could add to the pending list faster than the cleaner can clean it, leading to unbounded growth in the pending list and could cause a user backend to becoming apparently unresponsive to the user, indefinitely. That is scary to backpatch.

Can we install a protection against that? For instance, the cleaner only cleans up the entries that were present when it started, and if other items are inserted later into the pending list, they are left for a future cleaner.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Jeff Janes <jeff.janes@gmail.com> writes:
> To summarize the behavior change:

> In the released code, an inserting backend that violates the pending list limit will try to clean the list, even if it is already being cleaned. It won't accomplish anything useful, but will go through the motions until eventually it runs into a page the lead cleaner has deleted, at which point it realizes there is another cleaner and it stops. This acts as a natural throttle to how fast insertions can take place into an over-sized pending list.

Right.

> The proposed change removes that throttle, so that inserters will immediately see there is already a cleaner and just go back about their business. Due to that, unthrottled backends could add to the pending list faster than the cleaner can clean it, leading to unbounded growth in the pending list and could cause a user backend to becoming apparently unresponsive to the user, indefinitely. That is scary to backpatch.

It's scary to put into HEAD, either. What if we simply don't take that specific behavioral change? It doesn't seem like this is an essential part of fixing the bug as you described it. (Though I've not read the patch, so maybe I'm just missing the connection.)

			regards, tom lane
On Thu, Apr 7, 2016 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> To summarize the behavior change:
>
>> In the released code, an inserting backend that violates the pending list limit will try to clean the list, even if it is already being cleaned. It won't accomplish anything useful, but will go through the motions until eventually it runs into a page the lead cleaner has deleted, at which point it realizes there is another cleaner and it stops. This acts as a natural throttle to how fast insertions can take place into an over-sized pending list.
>
> Right.
>
>> The proposed change removes that throttle, so that inserters will immediately see there is already a cleaner and just go back about their business. Due to that, unthrottled backends could add to the pending list faster than the cleaner can clean it, leading to unbounded growth in the pending list and could cause a user backend to becoming apparently unresponsive to the user, indefinitely. That is scary to backpatch.
>
> It's scary to put into HEAD, either. What if we simply don't take that specific behavioral change? It doesn't seem like this is an essential part of fixing the bug as you described it. (Though I've not read the patch, so maybe I'm just missing the connection.)

There are only 3 fundamental options I see: the cleaner can wait, "help", or move on.

"Helping" is what it does now and is dangerous.

Moving on gives the above-discussed unthrottling problem.

Waiting has two problems. The act of waiting will cause autovacuums to be canceled, unless ugly hacks are deployed to prevent that. If we deploy those ugly hacks, then we have the problem that a user backend will end up waiting on an autovacuum to finish the cleaning, and the autovacuum is taking its sweet time due to autovacuum_vacuum_cost_delay. (The "helping" model avoids this problem because the user backend can just catch up with and pass the io-throttled autovac process.)

For completeness' sake, a fourth option would be to move on, but only after inserting the tuple directly into the main index structure (rather than the pending list), like would be done with fastupdate off, once the pending list is already oversized. This is my favorite, but there is no chance of it going into 9.6, much less being backpatched.

Alvaro's recommendation, to let the cleaner off the hook once it passes the page which was the tail page at the time it started, would prevent any process from getting pinned down indefinitely, but would not prevent the size of the list from increasing without bound. I think that would probably be good enough, because the current throttling behavior is purely accidental and doesn't *guarantee* a limit on the size of the pending list.

Cheers,

Jeff
On Thu, Apr 07, 2016 at 05:53:54PM -0700, Jeff Janes wrote:
> On Thu, Apr 7, 2016 at 4:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Janes <jeff.janes@gmail.com> writes:
> >> To summarize the behavior change:
> >
> >> In the released code, an inserting backend that violates the pending list limit will try to clean the list, even if it is already being cleaned. It won't accomplish anything useful, but will go through the motions until eventually it runs into a page the lead cleaner has deleted, at which point it realizes there is another cleaner and it stops. This acts as a natural throttle to how fast insertions can take place into an over-sized pending list.
> >
> > Right.
> >
> >> The proposed change removes that throttle, so that inserters will immediately see there is already a cleaner and just go back about their business. Due to that, unthrottled backends could add to the pending list faster than the cleaner can clean it, leading to unbounded growth in the pending list and could cause a user backend to becoming apparently unresponsive to the user, indefinitely. That is scary to backpatch.
> >
> > It's scary to put into HEAD, either. What if we simply don't take that specific behavioral change? It doesn't seem like this is an essential part of fixing the bug as you described it. (Though I've not read the patch, so maybe I'm just missing the connection.)
>
> There are only 3 fundamental options I see, the cleaner can wait, "help", or move on.
>
> "Helping" is what it does now and is dangerous.
>
> Moving on gives the above-discussed unthrottling problem.
>
> Waiting has two problems. The act of waiting will cause autovacuums to be canceled, unless ugly hacks are deployed to prevent that. If we deploy those ugly hacks, then we have the problem that a user backend will end up waiting on an autovacuum to finish the cleaning, and the autovacuum is taking its sweet time due to autovacuum_vacuum_cost_delay.

Teodor, this thread has been quiet for four days, and the deadline to fix this open item expired 23 hours ago. Do you have a new plan for fixing it?
> This restricts the memory used by ordinary backends when doing the cleanup to be work_mem. Shouldn't we let them use maintenance_work_mem? Only one backend can be doing this clean up of a given index at any given time, so we don't need to worry about many concurrent allocations of maintenance_work_mem. This seems very similar in spirit to index creation, where a backend is allowed to use maintenance_work_mem.

Because there could be several indexes in one pg instance, and each cleaner could eat maintenance_work_mem.

> Also, do we plan on backpatching this? While there are no known

Yes

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
> There are only 3 fundamental options I see, the cleaner can wait, "help", or move on.
>
> "Helping" is what it does now and is dangerous.
>
> Moving on gives the above-discussed unthrottling problem.
>
> Waiting has two problems. The act of waiting will cause autovacuums to be canceled, unless ugly hacks are deployed to prevent that. If we deploy those ugly hacks, then we have the problem that a user backend will end up waiting on an autovacuum to finish the cleaning, and the autovacuum is taking its sweet time due to autovacuum_vacuum_cost_delay. (The "helping" model avoids this problem because the user backend can just catch up with and pass the io-throttled autovac process)

With the pending cleanup patch, a backend will try to take the lock on the metapage with ConditionalLockPage. Will that interrupt an autovacuum worker?

> For completeness sake, a fourth option would to move on, but only after inserting the tuple directly into the main index structure (rather then the pending list) like would be done with fastupdate off, once the pending list is already oversized. This is my favorite, but there is no chance of it going into 9.6, much less being backpatched.

Agreed, I will reimplement that for 9.7.

> Alvaro's recommendation, to let the cleaner off the hook once it passes the page which was the tail page at the time it started, would prevent any process from getting pinned down indefinitely, but would not prevent the size of the list from increasing without bound. I think that would probably be good enough, because the current throttling behavior is purely accidentally and doesn't *guarantee* a limit on the size of the pending list.

Added, see attached patch (based on v3.1).

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
Attachment
>> Alvaro's recommendation, to let the cleaner off the hook once it passes the page which was the tail page at the time it started, would prevent any process from getting pinned down indefinitely, but would
> Added, see attached patch (based on v3.1)

If there are no objections I will apply it on Monday and backpatch to all supported versions.

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
On Tue, Apr 12, 2016 at 9:53 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>
> With pending cleanup patch backend will try to get lock on metapage with ConditionalLockPage. Will it interrupt autovacum worker?

Correct, ConditionalLockPage should not interrupt the autovacuum worker.

>>
>> Alvaro's recommendation, to let the cleaner off the hook once it passes the page which was the tail page at the time it started, would prevent any process from getting pinned down indefinitely, but would not prevent the size of the list from increasing without bound. I think that would probably be good enough, because the current throttling behavior is purely accidentally and doesn't *guarantee* a limit on the size of the pending list.
>
> Added, see attached patch (based on v3.1)

With this applied, I am getting a couple errors I have not seen before after extensive crash recovery testing:

ERROR: attempted to delete invisible tuple
ERROR: unexpected chunk number 1 (expected 2) for toast value 100338365 in pg_toast_16425

I've restarted the test harness with intentional crashes turned off, to see if the problems are related to crash recovery or are more generic than that.

I've never seen these particular problems before, so don't have much insight into what might be going on or how to debug it.

Cheers,

Jeff
>> Added, see attached patch (based on v3.1)
>
> With this applied, I am getting a couple errors I have not seen before after extensive crash recovery testing:
> ERROR: attempted to delete invisible tuple
> ERROR: unexpected chunk number 1 (expected 2) for toast value 100338365 in pg_toast_16425

Huh, it seems that's not related to GIN at all... Indexes don't play with the toast machinery. The single place where this error can occur is heap_delete() - deleting an already-deleted tuple.

>
> I've restarted the test harness with intentional crashes turned off, to see if the problems are related to crash recovery or are more generic than that.
>
> I've never seen these particular problems before, so don't have much insight into what might be going on or how to debug it.

Check my reasoning: in version 4 I added remembering of the tail of the pending list in the blknoFinish variable. When we read the page which was the tail at cleanup start, we set the cleanupFinish variable, and after cleaning that page we stop further cleanup. Any insert that happens during cleanup will be placed after blknoFinish (corner case: in that page), so vacuum should not miss tuples marked as deleted.

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
> >>Added, see attached patch (based on v3.1)
> >
> >With this applied, I am getting a couple errors I have not seen before after extensive crash recovery testing:
> >ERROR: attempted to delete invisible tuple
> >ERROR: unexpected chunk number 1 (expected 2) for toast value 100338365 in pg_toast_16425
>
> Huh, seems, it's not related to GIN at all... Indexes don't play with toast machinery. The single place where this error can occur is a heap_delete() - deleting already deleted tuple.

Like you, I would not expect gin_alone_cleanup-4.patch to cause such an error. I get the impression Jeff has a test case that he had run in many iterations against the unpatched baseline. I also get the impression that a similar or smaller number of its iterations against gin_alone_cleanup-4.patch triggered these two errors (once apiece, or multiple times?). Jeff, is that right? If so, until we determine the cause, we should assume the cause arrived in gin_alone_cleanup-4.patch. An error in pointer arithmetic or locking might corrupt an unrelated buffer, leading to this symptom.

> >I've restarted the test harness with intentional crashes turned off, to see if the problems are related to crash recovery or are more generic than that.
> >
> >I've never seen these particular problems before, so don't have much insight into what might be going on or how to debug it.

Could you describe the test case in sufficient detail for Teodor to reproduce your results?

> Check my reasoning: In version 4 I added a remebering of tail of pending list into blknoFinish variable. And when we read page which was a tail on cleanup start then we sets cleanupFinish variable and after cleaning that page we will stop further cleanup. Any insert caused during cleanup will be placed after blknoFinish (corner case: in that page), so, vacuum should not miss tuples marked as deleted.

Would any hacker volunteer to review Teodor's reasoning here?

Thanks,
nm
On Thu, Nov 5, 2015 at 2:44 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The bug theoretically exists in 9.5, but it wasn't until 9.6 (commit e95680832854cf300e64c) that free pages were recycled aggressively enough that it actually becomes likely to be hit.

In other words: The bug could be in 9.5, but that hasn't been conclusively demonstrated. Fair?

I'm not an expert on GIN at all; I know far more about B-Tree. But commit e95680832854cf300e64c seems a bit odd to me. I don't see any argument for why it's okay that the recycling of pages can happen immediately for the pending list, rather than requiring it to happen at some time later with a safe interlock (something like B-Tree's use of RecentGlobalXmin). The GIN README discusses a few such issues, but it wasn't updated by the commit I mentioned, which I would have expected.

OTOH, after all of 10 minutes I can't see what's special about ginvacuumcleanup() that makes its long-established RecordFreeIndexPage() call fundamentally safer, which if true would be a surprisingly obvious defect to go unfixed for all these years. This is what you yourself said about it, I think.

I need to look at it again with fresh eyes, but offhand having no safe interlock for the well-established RecordFreeIndexPage() call for GIN seems like an implausibly obvious bug.

--
Peter Geoghegan
On Mon, Apr 18, 2016 at 7:48 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>> Added, see attached patch (based on v3.1)
>>
>> With this applied, I am getting a couple errors I have not seen before after extensive crash recovery testing:
>> ERROR: attempted to delete invisible tuple
>> ERROR: unexpected chunk number 1 (expected 2) for toast value 100338365 in pg_toast_16425
>
> Huh, seems, it's not related to GIN at all... Indexes don't play with toast machinery. The single place where this error can occur is a heap_delete() - deleting already deleted tuple.

Those are two independent errors. The delete invisible tuple error doesn't occur on toast tables.

The actual statement triggering the error is an update statement. Since it is showing up in the delete path, I assume it must be an update where the new tuple goes to a different page. But if the soon-to-be-old tuple is not visible, why is the update trying to update it in the first place? It seems like the different parts of the code disagree on what is visible.

update foo set count=count+1,text_array=$1 where text_array @> $2

I agree it might not have anything to do with gin indexes, but I didn't see it in testing anything else. It might be a wrap-around problem which for some reason only the gin test is efficient at evoking.

What I've done now is apply your v4 patch directly to e95680832854cf300e64c1 and I am trying to see if it also has the problem.

If that is clean, then it is probably an independently introduced bug which is just getting exercised by the gin index stress test. If that is the case I'll try to git bisect forward, but that could take weeks given the runtimes involved.

If that is dirty, then maybe the FSM vacuuming patch introduced/uncovered more than one bug, and should be reverted.

>> I've restarted the test harness with intentional crashes turned off, to see if the problems are related to crash recovery or are more generic than that.

I do not see the problem when there is no crash-recovery cycling involved. I also do not see the problem when compiled under --enable-cassert, but that could just be because compiling that way makes it too slow to get in sufficient testing to hit the bug before I gave up.

>>
>> I've never seen these particular problems before, so don't have much insight into what might be going on or how to debug it.
>
> Check my reasoning: In version 4 I added a remebering of tail of pending list into blknoFinish variable. And when we read page which was a tail on cleanup start then we sets cleanupFinish variable and after cleaning that page we will stop further cleanup. Any insert caused during cleanup will be placed after blknoFinish (corner case: in that page), so, vacuum should not miss tuples marked as deleted.

Yes, I agree with the correctness of v4. But I do wonder if we should use that early stopping for vacuum and gin_clean_pending_list, rather than just using it for user backends. While I think correctness allows them to stop early, since these routines are explicitly about cleaning things up it seems like they should volunteer to clean the whole thing.

Cheers,

Jeff
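For reference, the update statement above implies roughly this kind of schema; the table and index definitions here are guesses (only the UPDATE itself is taken from the harness), the prepared-statement name is made up, and the final call is 9.6's function for flushing the pending list by hand:

CREATE TABLE foo (
    id         int PRIMARY KEY,       -- assumed key column
    count      int NOT NULL DEFAULT 0,
    text_array text[]
);
CREATE INDEX foo_text_array_idx ON foo USING gin (text_array);

PREPARE bump(text[], text[]) AS
    UPDATE foo SET count = count + 1, text_array = $1 WHERE text_array @> $2;
EXECUTE bump(ARRAY['alpha','beta'], ARRAY['alpha']);

SELECT gin_clean_pending_list('foo_text_array_idx');  -- manual pending-list cleanup (9.6+)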
On Fri, Apr 22, 2016 at 2:20 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Check my reasoning: In version 4 I added a remebering of tail of pending list into blknoFinish variable. And when we read page which was a tail on cleanup start then we sets cleanupFinish variable and after cleaning that page we will stop further cleanup. Any insert caused during cleanup will be placed after blknoFinish (corner case: in that page), so, vacuum should not miss tuples marked as deleted.
>
> Yes, I agree with the correctness of v4. But I do wonder if we should use that early stopping for vacuum and gin_clean_pending_list, rather than just using it for user backends. While I think correctness allows it to stop early, since these routines are explicitly about cleaning things up it seems like they should volunteer to clean the whole thing.

Not to hold anyone's feet to the fire, but we are running out of time before beta1, and this seems to be one of the more serious outstanding issues. I'd rather not release beta1 with known data corruption bugs, but that means either fixing whatever is broken here or reverting the commit that caused or revealed the problem pretty soon.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 21, 2016 at 11:00 PM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
>> >>Added, see attached patch (based on v3.1)
>> >
>> >With this applied, I am getting a couple of errors I have not seen before
>> >after extensive crash recovery testing:
>> >ERROR: attempted to delete invisible tuple
>> >ERROR: unexpected chunk number 1 (expected 2) for toast value
>> >100338365 in pg_toast_16425
>> Huh, it seems it's not related to GIN at all... Indexes don't play with the toast
>> machinery. The single place where this error can occur is heap_delete() -
>> deleting an already deleted tuple.
>
> Like you, I would not expect gin_alone_cleanup-4.patch to cause such an error.
> I get the impression Jeff has a test case that he had run in many iterations
> against the unpatched baseline. I also get the impression that a similar or
> smaller number of its iterations against gin_alone_cleanup-4.patch triggered
> these two errors (once apiece, or multiple times?). Jeff, is that right?

Because the unpatched baseline suffers from the bug which was the original topic of this thread, I haven't been able to test against the original baseline. It would fail from that other bug before it ran long enough to hit these ones.

Both errors start within a few minutes of each other, but do not appear to be related other than that. Once they start happening, they occur repeatedly.

> Could you describe the test case in sufficient detail for Teodor to reproduce
> your results?

I spawn a swarm of processes to update a counter in a randomly chosen row, selected via the gin index. They do this as fast as possible until the server intentionally crashes. When it recovers, they compare notes and see if the results are consistent. But in this case, the problem is not inconsistent results, but rather errors during the updating stage.

The patch introduces a mechanism to crash the server, a mechanism to fast-forward the XID, and some additional logging that I sometimes find useful (which I haven't been using in this case, but don't want to rip out).

The perl script implements the core of the test harness. The shell script sets up the server (using hard-coded paths for the data and the binaries, so it will need to be changed), and then calls the perl script in a loop.

Running on an 8-core system, I've never seen it have a problem in less than 9 hours of run time. This produces copious amounts of logging to stdout. This is how I look through the log files to find this particular problem:

sh do.sh >& do.err &
tail -f do.err | fgrep ERROR

The more I think about it, the more I think gin is just an innocent bystander, for which I just happen to have a particularly demanding test. I think something about snapshots and wrap-around may be broken.

Cheers,

Jeff
Attachment
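For readers without the attachments, the kind of client workload described above could be sketched in libpq roughly as below. This is only an illustrative stand-in for the Perl harness, not the real thing: the connection string, the array literals, and the assumption that text_array is a text[] column are guesses based on the update statement quoted earlier in the thread.

/*
 * Illustrative libpq client: repeatedly update a row chosen via the
 * GIN-indexed column, in the spirit of
 *   update foo set count=count+1, text_array=$1 where text_array @> $2
 * Run many copies in parallel to approximate the "swarm".
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Connection string is a placeholder; adjust for your setup. */
    PGconn *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Loop until the server goes away (the harness crashes it on purpose). */
    while (PQstatus(conn) == CONNECTION_OK)
    {
        char        newval[64];
        char        probe[64];
        const char *values[2];
        PGresult   *res;

        /* Made-up text[] literals standing in for the real document data. */
        snprintf(newval, sizeof(newval), "{word%d,word%d}",
                 rand() % 10000, rand() % 10000);
        snprintf(probe, sizeof(probe), "{word%d}", rand() % 10000);
        values[0] = newval;
        values[1] = probe;

        res = PQexecParams(conn,
                           "UPDATE foo SET count = count + 1, text_array = $1 "
                           "WHERE text_array @> $2",
                           2, NULL, values, NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "ERROR: %s", PQerrorMessage(conn));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}

Building it would look something like cc -o swarm swarm.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq, and a few dozen concurrent copies give a workload of roughly the shape described, though without the crash/fast-forward machinery of the actual patch.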
On Fri, Apr 22, 2016 at 02:03:01PM -0700, Jeff Janes wrote:
> On Thu, Apr 21, 2016 at 11:00 PM, Noah Misch <noah@leadboat.com> wrote:
> > Could you describe the test case in sufficient detail for Teodor to reproduce
> > your results?
>
> [detailed description and attachments]

Thanks.

> The more I think about it, the more I think gin is just an innocent
> bystander, for which I just happen to have a particularly demanding
> test. I think something about snapshots and wrap-around may be
> broken.

Based on your review, I think the next step on $subject is for Teodor to either commit gin_alone_cleanup-4.patch or to tell us what he'd like to see or do before committing that patch. While the failures during your crash recovery testing may not even be 9.6 regressions, they suggest the presence of serious bugs. At a minimum, the community should confirm they are not 9.6 regressions before releasing 9.6. I have added an open item to that effect.
>> Check my reasoning: in version 4 I added remembering of the tail of the pending
>> list in the blknoFinish variable. When we read the page which was the tail at
>> cleanup start, we set the cleanupFinish variable, and after cleaning that
>> page we stop further cleanup. Any insert caused during cleanup will be
>> placed after blknoFinish (corner case: in that page), so vacuum should not
>> miss tuples marked as deleted.
>
> Yes, I agree with the correctness of v4. But I do wonder if we should
> use that early stopping for vacuum and gin_clean_pending_list, rather

Interesting, I had missed this possible option.

> than just using it for user backends. While I think correctness
> allows them to stop early, since these routines are explicitly about
> cleaning things up it seems like they should volunteer to clean the
> whole thing.

I believe that autovacuum should not require a guaranteed full cleanup; only VACUUM and gin_clean_pending_list() should do that. In all other cases cleanup should stop early to prevent a possible infinite cleanup. Patch attached.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment
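A compact sketch of the caller policy described above, with illustrative names only (these are not the actual PostgreSQL identifiers, nor the contents of the attached patch):

/* Only explicit VACUUM and gin_clean_pending_list() insist on draining the
 * whole pending list; user backends and autovacuum stop at the remembered
 * tail so that concurrent inserts cannot keep them cleaning forever. */
#include <stdbool.h>
#include <stdio.h>

typedef enum
{
    CLEANUP_USER_BACKEND,               /* insert that overflowed the pending list */
    CLEANUP_AUTOVACUUM,                 /* autovacuum / autoanalyze */
    CLEANUP_VACUUM,                     /* explicit VACUUM */
    CLEANUP_GIN_CLEAN_PENDING_LIST      /* gin_clean_pending_list() SQL function */
} CleanupCaller;

static bool
wants_full_cleanup(CleanupCaller caller)
{
    return caller == CLEANUP_VACUUM ||
           caller == CLEANUP_GIN_CLEAN_PENDING_LIST;
}

int
main(void)
{
    printf("autovacuum drains whole list? %d\n", wants_full_cleanup(CLEANUP_AUTOVACUUM));
    printf("VACUUM drains whole list?     %d\n", wants_full_cleanup(CLEANUP_VACUUM));
    return 0;
}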
On Tue, Apr 26, 2016 at 08:22:03PM +0300, Teodor Sigaev wrote:
> >> Check my reasoning: in version 4 I added remembering of the tail of the pending
> >> list in the blknoFinish variable. When we read the page which was the tail at
> >> cleanup start, we set the cleanupFinish variable, and after cleaning that
> >> page we stop further cleanup. Any insert caused during cleanup will be
> >> placed after blknoFinish (corner case: in that page), so vacuum should not
> >> miss tuples marked as deleted.
> >
> > Yes, I agree with the correctness of v4. But I do wonder if we should
> > use that early stopping for vacuum and gin_clean_pending_list, rather
> Interesting, I had missed this possible option.
>
> > than just using it for user backends. While I think correctness
> > allows them to stop early, since these routines are explicitly about
> > cleaning things up it seems like they should volunteer to clean the
> > whole thing.
>
> I believe that autovacuum should not require a guaranteed full cleanup; only
> VACUUM and gin_clean_pending_list() should do that. In all other cases cleanup
> should stop early to prevent a possible infinite cleanup. Patch attached.

Teodor, this item remains open after twenty-three days of you owning it. At this pace, 9.6beta1 will have an unreliable GIN implementation. What things must happen so you can close this item, and when will you do those things?
> The more I think about it, the more I think gin is just an innocent
> bystander, for which I just happen to have a particularly demanding
> test. I think something about snapshots and wrap-around may be
> broken.

After 10 hours of running I've got:

1587 XX000 2016-04-28 05:57:09.964 MSK:ERROR: unexpected chunk number 0 (expected 1) for toast value 10192986 in pg_toast_16424
1587 XX000 2016-04-28 05:57:09.964 MSK:CONTEXT: automatic analyze of table "teodor.public.foo"
1587 00000 2016-04-28 05:57:09.974 MSK:LOG: JJ transaction ID wrap limit is 2147484514, limited by database with OID 16384

I've pushed v5 of the gin-alone-cleanup patch; many thanks to all participants. I will work on the backpatch.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/