issue with gininsert under very high load - Mailing list pgsql-hackers

From Andrew Dunstan
Subject issue with gininsert under very high load
Date
Msg-id 52FBCDF9.4040904@dunslane.net
Whole thread Raw
Responses Re: issue with gininsert under very high load  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
A client of ours encountered this problem when upgrading (via slony) 
from 8.4 to 9.2, and then from 8.4 to to 9.3. The application is a 
telephony app that inserts call records at pretty hight volume in the 
busiest part of the day, which usually starts around 10.00 am to 11.00 
am US eastern time. What happened was that the system load average shot 
up from, say, 6 to way over 100, and the database became largely 
unresponsive. This problem did not occur on 8.4. The platform is CentOS 
6.4 / x86_64.

On investigation I found that a number of processes were locked waiting 
for one wedged process to end its transaction, which never happened 
(this transaction should normally take milliseconds). oprofile revealed 
that postgres was spending 87% of its time in s_lock(), and strace on 
the wedged process revealed that it was in a tight loop constantly 
calling select(). It did not respond to a SIGTERM. I attached a debugger 
to the wedged process and got this backtrace:
   #0  0x000000386a2eaf37 in semop () from /lib64/libc.so.6   #1  0x00000000006021a7 in PGSemaphoreLock ()   #2
0x00000000006463a1in LWLockAcquire ()   #3  0x0000000000631997 in ReadBuffer_common ()   #4  0x00000000006322ee in
ReadBufferExtended()   #5  0x0000000000467475 in ginPrepareFindLeafPage ()   #6  0x000000000046663f in
ginInsertItemPointers()   #7  0x0000000000462adc in ginEntryInsert ()   #8  0x000000000046cad0 in ginInsertCleanup ()
#9 0x000000000046d7c6 in ginHeapTupleFastInsert ()   #10 0x0000000000462cfa in gininsert ()   #11 0x0000000000722b33 in
FunctionCall6Coll()   #12 0x000000000048bdff in index_insert ()   #13 0x0000000000587595 in ExecInsertIndexTuples ()
#140x00000000005948e1 in ExecModifyTable ()   #15 0x000000000057e718 in ExecProcNode ()   #16 0x000000000057d512 in
standard_ExecutorRun()   [...]
 



Armed with this the client identified a single gin index (on an array of 
text) that could have been involved, and on removing the index the 
problem simply went away, and they have now been live on 9.3 for 36 
hours without a mishap.

The client is attempting to create a self-contained reproducible test, 
but understandably are not going to re-add the index to their production 
system.

I'm not terribly familiar with the gin code, so I'm reporting this in 
the hope that other people who are more familiar with it than I am might 
know where to look for a potential race condition or other bug that 
might have caused this.

cheers

andrew





pgsql-hackers by date:

Previous
From: Greg Stark
Date:
Subject: Re: Recovery inconsistencies, standby much larger than primary
Next
From: Andres Freund
Date:
Subject: Re: issue with gininsert under very high load