A client of ours encountered this problem when upgrading (via slony)
from 8.4 to 9.2, and then from 8.4 to to 9.3. The application is a
telephony app that inserts call records at pretty hight volume in the
busiest part of the day, which usually starts around 10.00 am to 11.00
am US eastern time. What happened was that the system load average shot
up from, say, 6 to way over 100, and the database became largely
unresponsive. This problem did not occur on 8.4. The platform is CentOS
6.4 / x86_64.
On investigation I found that a number of processes were locked waiting
for one wedged process to end its transaction, which never happened
(this transaction should normally take milliseconds). oprofile revealed
that postgres was spending 87% of its time in s_lock(), and strace on
the wedged process revealed that it was in a tight loop constantly
calling select(). It did not respond to a SIGTERM. I attached a debugger
to the wedged process and got this backtrace:
#0 0x000000386a2eaf37 in semop () from /lib64/libc.so.6 #1 0x00000000006021a7 in PGSemaphoreLock () #2
0x00000000006463a1in LWLockAcquire () #3 0x0000000000631997 in ReadBuffer_common () #4 0x00000000006322ee in
ReadBufferExtended() #5 0x0000000000467475 in ginPrepareFindLeafPage () #6 0x000000000046663f in
ginInsertItemPointers() #7 0x0000000000462adc in ginEntryInsert () #8 0x000000000046cad0 in ginInsertCleanup ()
#9 0x000000000046d7c6 in ginHeapTupleFastInsert () #10 0x0000000000462cfa in gininsert () #11 0x0000000000722b33 in
FunctionCall6Coll() #12 0x000000000048bdff in index_insert () #13 0x0000000000587595 in ExecInsertIndexTuples ()
#140x00000000005948e1 in ExecModifyTable () #15 0x000000000057e718 in ExecProcNode () #16 0x000000000057d512 in
standard_ExecutorRun() [...]
Armed with this the client identified a single gin index (on an array of
text) that could have been involved, and on removing the index the
problem simply went away, and they have now been live on 9.3 for 36
hours without a mishap.
The client is attempting to create a self-contained reproducible test,
but understandably are not going to re-add the index to their production
system.
I'm not terribly familiar with the gin code, so I'm reporting this in
the hope that other people who are more familiar with it than I am might
know where to look for a potential race condition or other bug that
might have caused this.
cheers
andrew