Thread: issue with gininsert under very high load
A client of ours encountered this problem when upgrading (via slony) from 8.4 to 9.2, and then from 8.4 to 9.3. The application is a telephony app that inserts call records at pretty high volume in the busiest part of the day, which usually starts around 10.00 am to 11.00 am US eastern time. What happened was that the system load average shot up from, say, 6 to way over 100, and the database became largely unresponsive. This problem did not occur on 8.4. The platform is CentOS 6.4 / x86_64.

On investigation I found that a number of processes were locked waiting for one wedged process to end its transaction, which never happened (this transaction should normally take milliseconds). oprofile revealed that postgres was spending 87% of its time in s_lock(), and strace on the wedged process revealed that it was in a tight loop constantly calling select(). It did not respond to a SIGTERM. I attached a debugger to the wedged process and got this backtrace:

#0  0x000000386a2eaf37 in semop () from /lib64/libc.so.6
#1  0x00000000006021a7 in PGSemaphoreLock ()
#2  0x00000000006463a1 in LWLockAcquire ()
#3  0x0000000000631997 in ReadBuffer_common ()
#4  0x00000000006322ee in ReadBufferExtended ()
#5  0x0000000000467475 in ginPrepareFindLeafPage ()
#6  0x000000000046663f in ginInsertItemPointers ()
#7  0x0000000000462adc in ginEntryInsert ()
#8  0x000000000046cad0 in ginInsertCleanup ()
#9  0x000000000046d7c6 in ginHeapTupleFastInsert ()
#10 0x0000000000462cfa in gininsert ()
#11 0x0000000000722b33 in FunctionCall6Coll ()
#12 0x000000000048bdff in index_insert ()
#13 0x0000000000587595 in ExecInsertIndexTuples ()
#14 0x00000000005948e1 in ExecModifyTable ()
#15 0x000000000057e718 in ExecProcNode ()
#16 0x000000000057d512 in standard_ExecutorRun ()
[...]

Armed with this the client identified a single gin index (on an array of text) that could have been involved, and on removing the index the problem simply went away, and they have now been live on 9.3 for 36 hours without a mishap.

The client is attempting to create a self-contained reproducible test, but understandably is not going to re-add the index to their production system.

I'm not terribly familiar with the gin code, so I'm reporting this in the hope that other people who are more familiar with it than I am might know where to look for a potential race condition or other bug that might have caused this.

cheers

andrew
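For concreteness, the problem index was a GIN index on an array of text. The sketch below only illustrates the kind of index involved; the table and column names are made up and are not the client's actual schema.

    -- Hypothetical illustration of the index type involved; names are invented.
    CREATE TABLE call_records (
        id        bigserial PRIMARY KEY,
        placed_at timestamptz NOT NULL,
        tags      text[] NOT NULL
    );

    CREATE INDEX call_records_tags_gin ON call_records USING gin (tags);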
On 2014-02-12 14:39:37 -0500, Andrew Dunstan wrote:
> On investigation I found that a number of processes were locked waiting
> for one wedged process to end its transaction, which never happened (this
> transaction should normally take milliseconds). oprofile revealed that
> postgres was spending 87% of its time in s_lock(), and strace on the
> wedged process revealed that it was in a tight loop constantly calling
> select(). It did not respond to a SIGTERM. I attached a debugger to the
> wedged process and got this backtrace:
>
> #0  0x000000386a2eaf37 in semop () from /lib64/libc.so.6
> #1  0x00000000006021a7 in PGSemaphoreLock ()
> #2  0x00000000006463a1 in LWLockAcquire ()
> #3  0x0000000000631997 in ReadBuffer_common ()
> #4  0x00000000006322ee in ReadBufferExtended ()
> #5  0x0000000000467475 in ginPrepareFindLeafPage ()
> #6  0x000000000046663f in ginInsertItemPointers ()
> #7  0x0000000000462adc in ginEntryInsert ()
> #8  0x000000000046cad0 in ginInsertCleanup ()
> #9  0x000000000046d7c6 in ginHeapTupleFastInsert ()
> #10 0x0000000000462cfa in gininsert ()
> #11 0x0000000000722b33 in FunctionCall6Coll ()
> #12 0x000000000048bdff in index_insert ()
> #13 0x0000000000587595 in ExecInsertIndexTuples ()
> #14 0x00000000005948e1 in ExecModifyTable ()
> #15 0x000000000057e718 in ExecProcNode ()
> #16 0x000000000057d512 in standard_ExecutorRun ()
> [...]
>
> Armed with this the client identified a single gin index (on an array of
> text) that could have been involved, and on removing the index the
> problem simply went away, and they have now been live on 9.3 for 36 hours
> without a mishap.
>
> The client is attempting to create a self-contained reproducible test,
> but understandably is not going to re-add the index to their production
> system.
>
> I'm not terribly familiar with the gin code, so I'm reporting this in the
> hope that other people who are more familiar with it than I am might know
> where to look for a potential race condition or other bug that might have
> caused this.

That's a deficiency of the gin fastupdate cache: a) it bases its size on
work_mem, which usually makes it *far* too big; b) it doesn't perform the
cleanup in one go if it can get a suitable lock, but does independent
locking for each entry. That usually leads to absolutely horrific
performance under concurrency.

The fix is usually to just turn FASTUPDATE off for all indexes. If it's
done after the initial creation, the table should be vacuumed afterwards
to flush it.

There's previously been talk about changing the limits to something more
reasonable, but it got stalled in bikeshedding IIRC.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
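A minimal sketch of that workaround, with a placeholder table and index name rather than any real schema:

    -- Disable the fastupdate pending list on an existing GIN index, then
    -- vacuum the table so anything already queued in the pending list is
    -- moved into the index proper.
    ALTER INDEX call_records_tags_gin SET (fastupdate = off);
    VACUUM call_records;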
* Andres Freund (andres@2ndquadrant.com) wrote:
> There's previously been talk about changing the limits to something more
> reasonable, but it got stalled in bikeshedding IIRC.

As I recall, the argument was that we didn't really need a new GUC for this
(which was the proposal) but rather just needed to pick a reasonable (small)
value and hard-code it. Are there objections to doing so, or are there cases
where that would be a serious problem? How do people feel about 4MB?

Thanks,

Stephen
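In the meantime, the pending-list cleanup threshold in released versions follows work_mem, so one stopgap is to lower work_mem for the sessions or role doing the heavy inserting, bearing in mind that work_mem also governs sort and hash memory. A sketch, with a made-up role name:

    -- Cap the pending list indirectly by lowering work_mem for the
    -- inserting role (this also shrinks sort/hash memory for that role).
    ALTER ROLE telephony_writer SET work_mem = '4MB';

    -- Or just for the current session:
    SET work_mem = '4MB';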
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-02-12 14:39:37 -0500, Andrew Dunstan wrote:
>> On investigation I found that a number of processes were locked waiting
>> for one wedged process to end its transaction, which never happened (this
>> transaction should normally take milliseconds). oprofile revealed that
>> postgres was spending 87% of its time in s_lock(), and strace on the
>> wedged process revealed that it was in a tight loop constantly calling
>> select(). It did not respond to a SIGTERM.

> That's a deficiency of the gin fastupdate cache: a) it bases its size on
> work_mem, which usually makes it *far* too big; b) it doesn't perform the
> cleanup in one go if it can get a suitable lock, but does independent
> locking for each entry. That usually leads to absolutely horrific
> performance under concurrency.

I'm not sure that what Andrew is describing can fairly be called a
concurrent-performance problem. It sounds closer to a stuck lock.
Are you sure you've diagnosed it correctly?

			regards, tom lane
On 02/12/2014 02:50 PM, Andres Freund wrote:
> On 2014-02-12 14:39:37 -0500, Andrew Dunstan wrote:
>> On investigation I found that a number of processes were locked waiting
>> for one wedged process to end its transaction, which never happened (this
>> transaction should normally take milliseconds). oprofile revealed that
>> postgres was spending 87% of its time in s_lock(), and strace on the
>> wedged process revealed that it was in a tight loop constantly calling
>> select(). It did not respond to a SIGTERM. I attached a debugger to the
>> wedged process and got this backtrace:
>>
>> #0  0x000000386a2eaf37 in semop () from /lib64/libc.so.6
>> #1  0x00000000006021a7 in PGSemaphoreLock ()
>> #2  0x00000000006463a1 in LWLockAcquire ()
>> #3  0x0000000000631997 in ReadBuffer_common ()
>> #4  0x00000000006322ee in ReadBufferExtended ()
>> #5  0x0000000000467475 in ginPrepareFindLeafPage ()
>> #6  0x000000000046663f in ginInsertItemPointers ()
>> #7  0x0000000000462adc in ginEntryInsert ()
>> #8  0x000000000046cad0 in ginInsertCleanup ()
>> #9  0x000000000046d7c6 in ginHeapTupleFastInsert ()
>> #10 0x0000000000462cfa in gininsert ()
>> #11 0x0000000000722b33 in FunctionCall6Coll ()
>> #12 0x000000000048bdff in index_insert ()
>> #13 0x0000000000587595 in ExecInsertIndexTuples ()
>> #14 0x00000000005948e1 in ExecModifyTable ()
>> #15 0x000000000057e718 in ExecProcNode ()
>> #16 0x000000000057d512 in standard_ExecutorRun ()
>> [...]
>>
>> Armed with this the client identified a single gin index (on an array of
>> text) that could have been involved, and on removing the index the
>> problem simply went away, and they have now been live on 9.3 for 36 hours
>> without a mishap.
>>
>> The client is attempting to create a self-contained reproducible test,
>> but understandably is not going to re-add the index to their production
>> system.
>>
>> I'm not terribly familiar with the gin code, so I'm reporting this in the
>> hope that other people who are more familiar with it than I am might know
>> where to look for a potential race condition or other bug that might have
>> caused this.
> That's a deficiency of the gin fastupdate cache: a) it bases its size on
> work_mem, which usually makes it *far* too big; b) it doesn't perform the
> cleanup in one go if it can get a suitable lock, but does independent
> locking for each entry. That usually leads to absolutely horrific
> performance under concurrency.
>
> The fix is usually to just turn FASTUPDATE off for all indexes. If it's
> done after the initial creation, the table should be vacuumed afterwards
> to flush it.
>
> There's previously been talk about changing the limits to something more
> reasonable, but it got stalled in bikeshedding IIRC.

So this doesn't work just when it might be most useful?

cheers

andrew
On February 12, 2014 9:33:38 PM CET, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2014-02-12 14:39:37 -0500, Andrew Dunstan wrote:
>>> On investigation I found that a number of processes were locked waiting
>>> for one wedged process to end its transaction, which never happened
>>> (this transaction should normally take milliseconds). oprofile revealed
>>> that postgres was spending 87% of its time in s_lock(), and strace on
>>> the wedged process revealed that it was in a tight loop constantly
>>> calling select(). It did not respond to a SIGTERM.
>
>> That's a deficiency of the gin fastupdate cache: a) it bases its size
>> on work_mem, which usually makes it *far* too big; b) it doesn't perform
>> the cleanup in one go if it can get a suitable lock, but does independent
>> locking for each entry. That usually leads to absolutely horrific
>> performance under concurrency.
>
> I'm not sure that what Andrew is describing can fairly be called a
> concurrent-performance problem. It sounds closer to a stuck lock.
> Are you sure you've diagnosed it correctly?

No. But I've several times seen similar backtraces where it wasn't actually
stuck, just livelocked. I'm just on my mobile right now, but afair Andrew
described a loop involving lots of semaphores and spinlocks; that shouldn't
be the case if it were actually stuck.

If there are dozens of processes waiting on the same lock, cleaning up a
large number of items one by one, it's not surprising if it's dramatically
slow.

Andres

--
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/12/2014 10:50 PM, Andres Freund wrote:
> On February 12, 2014 9:33:38 PM CET, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2014-02-12 14:39:37 -0500, Andrew Dunstan wrote:
>>>> On investigation I found that a number of processes were locked waiting
>>>> for one wedged process to end its transaction, which never happened
>>>> (this transaction should normally take milliseconds). oprofile revealed
>>>> that postgres was spending 87% of its time in s_lock(), and strace on
>>>> the wedged process revealed that it was in a tight loop constantly
>>>> calling select(). It did not respond to a SIGTERM.
>>
>>> That's a deficiency of the gin fastupdate cache: a) it bases its size
>>> on work_mem, which usually makes it *far* too big; b) it doesn't perform
>>> the cleanup in one go if it can get a suitable lock, but does independent
>>> locking for each entry. That usually leads to absolutely horrific
>>> performance under concurrency.
>>
>> I'm not sure that what Andrew is describing can fairly be called a
>> concurrent-performance problem. It sounds closer to a stuck lock.
>> Are you sure you've diagnosed it correctly?
>
> No. But I've several times seen similar backtraces where it wasn't
> actually stuck, just livelocked. I'm just on my mobile right now, but
> afair Andrew described a loop involving lots of semaphores and spinlocks;
> that shouldn't be the case if it were actually stuck.
> If there are dozens of processes waiting on the same lock, cleaning up a
> large number of items one by one, it's not surprising if it's dramatically
> slow.

Perhaps we should use a lock to enforce that only one process tries to
clean up the pending list at a time.

- Heikki
On 02/12/2014 04:04 PM, Heikki Linnakangas wrote:
> On 02/12/2014 10:50 PM, Andres Freund wrote:
>> On February 12, 2014 9:33:38 PM CET, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Andres Freund <andres@2ndquadrant.com> writes:
>>>> On 2014-02-12 14:39:37 -0500, Andrew Dunstan wrote:
>>>>> On investigation I found that a number of processes were locked
>>>>> waiting for one wedged process to end its transaction, which never
>>>>> happened (this transaction should normally take milliseconds).
>>>>> oprofile revealed that postgres was spending 87% of its time in
>>>>> s_lock(), and strace on the wedged process revealed that it was in a
>>>>> tight loop constantly calling select(). It did not respond to a
>>>>> SIGTERM.
>>>>
>>>> That's a deficiency of the gin fastupdate cache: a) it bases its size
>>>> on work_mem, which usually makes it *far* too big; b) it doesn't
>>>> perform the cleanup in one go if it can get a suitable lock, but does
>>>> independent locking for each entry. That usually leads to absolutely
>>>> horrific performance under concurrency.
>>>
>>> I'm not sure that what Andrew is describing can fairly be called a
>>> concurrent-performance problem. It sounds closer to a stuck lock.
>>> Are you sure you've diagnosed it correctly?
>>
>> No. But I've several times seen similar backtraces where it wasn't
>> actually stuck, just livelocked. I'm just on my mobile right now, but
>> afair Andrew described a loop involving lots of semaphores and spinlocks;
>> that shouldn't be the case if it were actually stuck.
>> If there are dozens of processes waiting on the same lock, cleaning up a
>> large number of items one by one, it's not surprising if it's
>> dramatically slow.
>
> Perhaps we should use a lock to enforce that only one process tries to
> clean up the pending list at a time.

Is that going to serialize all these inserts?

cheers

andrew
On 02/13/2014 05:40 PM, Andrew Dunstan wrote:
> On 02/12/2014 04:04 PM, Heikki Linnakangas wrote:
>> On 02/12/2014 10:50 PM, Andres Freund wrote:
>>> On February 12, 2014 9:33:38 PM CET, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Andres Freund <andres@2ndquadrant.com> writes:
>>>>> On 2014-02-12 14:39:37 -0500, Andrew Dunstan wrote:
>>>>>> On investigation I found that a number of processes were locked
>>>>>> waiting for one wedged process to end its transaction, which never
>>>>>> happened (this transaction should normally take milliseconds).
>>>>>> oprofile revealed that postgres was spending 87% of its time in
>>>>>> s_lock(), and strace on the wedged process revealed that it was in a
>>>>>> tight loop constantly calling select(). It did not respond to a
>>>>>> SIGTERM.
>>>>>
>>>>> That's a deficiency of the gin fastupdate cache: a) it bases its size
>>>>> on work_mem, which usually makes it *far* too big; b) it doesn't
>>>>> perform the cleanup in one go if it can get a suitable lock, but does
>>>>> independent locking for each entry. That usually leads to absolutely
>>>>> horrific performance under concurrency.
>>>>
>>>> I'm not sure that what Andrew is describing can fairly be called a
>>>> concurrent-performance problem. It sounds closer to a stuck lock.
>>>> Are you sure you've diagnosed it correctly?
>>>
>>> No. But I've several times seen similar backtraces where it wasn't
>>> actually stuck, just livelocked. I'm just on my mobile right now, but
>>> afair Andrew described a loop involving lots of semaphores and
>>> spinlocks; that shouldn't be the case if it were actually stuck.
>>> If there are dozens of processes waiting on the same lock, cleaning up
>>> a large number of items one by one, it's not surprising if it's
>>> dramatically slow.
>>
>> Perhaps we should use a lock to enforce that only one process tries to
>> clean up the pending list at a time.
>
> Is that going to serialize all these inserts?

It will serialize the cleanup process, which moves entries from the pending
list to the tree proper. But that's better than the current situation.
Currently, when two processes attempt it, they will both try to insert into
the GIN tree, but one of them will notice that the other one already did
the cleanup, and bail out. So only one process contributes to progress,
while the others just waste their effort.

The processes should try to get the lock, and just give up if it's already
held rather than wait. If someone else is already doing the cleanup,
there's no need for the current process to do it.

- Heikki
On 02/13/2014 10:58 AM, Heikki Linnakangas wrote:
>>> Perhaps we should use a lock to enforce that only one process tries to
>>> clean up the pending list at a time.
>>
>> Is that going to serialize all these inserts?
>
> It will serialize the cleanup process, which moves entries from the
> pending list to the tree proper. But that's better than the current
> situation. Currently, when two processes attempt it, they will both
> try to insert into the GIN tree, but one of them will notice that the
> other one already did the cleanup, and bail out. So only one process
> contributes to progress, while the others just waste their effort.
>
> The processes should try to get the lock, and just give up if it's
> already held rather than wait. If someone else is already doing the
> cleanup, there's no need for the current process to do it.

Sounds good.

cheers

andrew
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> Perhaps we should use a lock to enforce that only one process tries to
> clean up the pending list at a time.

Something like the attached? Can somebody who's seen this problem confirm
this improves matters?

(ginInsertCleanup's header comment also needs to be rewritten, but for
testing purposes, this is fine.)

			regards, tom lane

diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 4a65046..38cda14 100644
*** a/src/backend/access/gin/ginfast.c
--- b/src/backend/access/gin/ginfast.c
***************
*** 21,26 ****
--- 21,27 ----
  #include "access/gin_private.h"
  #include "commands/vacuum.h"
  #include "miscadmin.h"
+ #include "storage/lmgr.h"
  #include "utils/memutils.h"
  #include "utils/rel.h"
*************** ginInsertCleanup(GinState *ginstate,
*** 739,744 ****
--- 740,755 ----
  	KeyArray	datums;
  	BlockNumber	blkno;

+ 	/*
+ 	 * We use a heavyweight lock on the metapage to ensure that only one
+ 	 * backend at a time tries to clean up the pending list. While it does
+ 	 * actually work for multiple backends to run this code concurrently, that
+ 	 * turns out to be a bad idea because there's lots of locking conflicts.
+ 	 * So if someone else is already running cleanup, we just do nothing.
+ 	 */
+ 	if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
+ 		return;
+
  	metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
  	LockBuffer(metabuffer, GIN_SHARE);
  	metapage = BufferGetPage(metabuffer);
*************** ginInsertCleanup(GinState *ginstate,
*** 748,753 ****
--- 759,765 ----
  	{
  		/* Nothing to do */
  		UnlockReleaseBuffer(metabuffer);
+ 		UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
  		return;
  	}
*************** ginInsertCleanup(GinState *ginstate,
*** 925,930 ****
--- 937,944 ----
  	ReleaseBuffer(metabuffer);

+ 	UnlockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);
+
  	/* Clean up temporary space */
  	MemoryContextSwitchTo(oldCtx);
  	MemoryContextDelete(opCtx);
On 2014-02-13 16:15:42 -0500, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> Perhaps we should use a lock to enforce that only one process tries to
>> clean up the pending list at a time.
>
> Something like the attached? Can somebody who's seen this problem confirm
> this improves matters?
>
> (ginInsertCleanup's header comment also needs to be rewritten, but for
> testing purposes, this is fine.)

Hm. Won't that possibly lead to the fast tuple list growing unboundedly?
I think we would at least need to stop using the fast tuple mechanism
during gininsert() if it's already too big and do plain inserts.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
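One way to watch for that kind of unbounded growth while testing is the pgstattuple contrib module, whose pgstatginindex() function reports the pending-list size of a GIN index; the index name below is a placeholder:

    -- Requires the pgstattuple contrib extension.
    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    -- pending_pages / pending_tuples show how much is currently queued in
    -- the fastupdate pending list for the given GIN index.
    SELECT version, pending_pages, pending_tuples
    FROM pgstatginindex('call_records_tags_gin');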
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-02-13 16:15:42 -0500, Tom Lane wrote:
>> Something like the attached? Can somebody who's seen this problem confirm
>> this improves matters?

> Hm. Won't that possibly lead to the fast tuple list growing unboundedly?
> I think we would at least need to stop using the fast tuple mechanism
> during gininsert() if it's already too big and do plain inserts.

No, because we've already got a process working on cleaning it out.

In any case, this needs some testing to see if it's an improvement or not.

			regards, tom lane
On 2014-02-13 18:49:21 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2014-02-13 16:15:42 -0500, Tom Lane wrote:
>>> Something like the attached? Can somebody who's seen this problem
>>> confirm this improves matters?
>
>> Hm. Won't that possibly lead to the fast tuple list growing unboundedly?
>> I think we would at least need to stop using the fast tuple mechanism
>> during gininsert() if it's already too big and do plain inserts.
>
> No, because we've already got a process working on cleaning it out.

Well, appending to the fast tuple list will normally be fully cached I/O;
cleaning it up, not so much.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 14/02/14 00:49, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2014-02-13 16:15:42 -0500, Tom Lane wrote:
>>> Something like the attached? Can somebody who's seen this problem
>>> confirm this improves matters?
>> Hm. Won't that possibly lead to the fast tuple list growing unboundedly?
>> I think we would at least need to stop using the fast tuple mechanism
>> during gininsert() if it's already too big and do plain inserts.
> No, because we've already got a process working on cleaning it out.
>
> In any case, this needs some testing to see if it's an improvement
> or not.

I have some real-world experience with the fastupdate mechanism: under
concurrent load it behaves really badly. Random processes waiting for
cleanup (or competing with cleanup) are going to see latency spikes because
they magically hit that corner, so reverting to plain inserts when a process
cannot add to the pending list will not remove this problem, but will make
it hit only the process actually doing the cleanup.

The built-in mechanism, where cleanup is a cost paid by the process that
happened to fill the pending list, is really hard to deal with in
production. More control would be appreciated, perhaps even an explicit
flush mechanism; I'd like to batch up inserts during one transaction only
and flush on commit.

--
Jesper - with fastupdate turned off due to above issues.
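For what it's worth, the closest approximations available today are creating the index with the pending list disabled in the first place, and using a manual VACUUM of the table as an explicit flush; a sketch with placeholder names:

    -- Create the GIN index with fastupdate off, so inserts go straight
    -- into the index proper instead of the pending list.
    CREATE INDEX call_records_tags_gin ON call_records
        USING gin (tags) WITH (fastupdate = off);

    -- For an index that still has fastupdate enabled, a manual VACUUM of
    -- the table flushes whatever is sitting in the pending list.
    VACUUM call_records;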
On 2014-02-14 08:06:40 +0100, Jesper Krogh wrote:
> On 14/02/14 00:49, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2014-02-13 16:15:42 -0500, Tom Lane wrote:
>>>> Something like the attached? Can somebody who's seen this problem
>>>> confirm this improves matters?
>>> Hm. Won't that possibly lead to the fast tuple list growing unboundedly?
>>> I think we would at least need to stop using the fast tuple mechanism
>>> during gininsert() if it's already too big and do plain inserts.
>> No, because we've already got a process working on cleaning it out.
>>
>> In any case, this needs some testing to see if it's an improvement
>> or not.
>
> I have some real-world experience with the fastupdate mechanism: under
> concurrent load it behaves really badly. Random processes waiting for
> cleanup (or competing with cleanup) are going to see latency spikes
> because they magically hit that corner, so reverting to plain inserts
> when a process cannot add to the pending list will not remove this
> problem, but will make it hit only the process actually doing the
> cleanup.

Yea, this is only a part of fixing fastupdate. Limiting the size of the
fastupdate list to something more reasonable is pretty important as well.
Not competing around cleanup will make cleanup much faster though, so I am
not that concerned about the latency spikes it causes once it's limited in
size and done non-concurrently.

> The built-in mechanism, where cleanup is a cost paid by the process that
> happened to fill the pending list, is really hard to deal with in
> production. More control would be appreciated, perhaps even an explicit
> flush mechanism; I'd like to batch up inserts during one transaction
> only and flush on commit.

That doesn't seem likely to work with a reasonable amount of effort. The
fastupdate list is shared across all processes, so one backend will always
pay the price for several others.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
> On 2014-02-14 08:06:40 +0100, Jesper Krogh wrote:
>> The built-in mechanism, where cleanup is a cost paid by the process that
>> happened to fill the pending list, is really hard to deal with in
>> production. More control would be appreciated, perhaps even an explicit
>> flush mechanism; I'd like to batch up inserts during one transaction
>> only and flush on commit.
>
> That doesn't seem likely to work with a reasonable amount of effort. The
> fastupdate list is shared across all processes, so one backend will
> always pay the price for several others.

Unless some other process does it, such as autovacuum.

--
 Álvaro Herrera                    http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services