On Tue, Sep 10, 2013 at 8:04 AM, David Whittaker <dave@iradix.com> wrote:
Hi All,
I've been seeing a strange issue with our Postgres install for about a year now, and I was hoping someone might be able to help point me at the cause. At what seem like fairly random intervals Postgres will become unresponsive to the 3 application nodes it services. These periods tend to last for 10 - 15 minutes before everything rights itself and the system goes back to normal.
During these periods the server will report a spike in the outbound bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in context switches / interrupts (normal peaks are around 2k/8k respectively, and during these periods they've gone to 15k/22k), and a load average of 100+.
I'm curious about the spike in outbound network usage. If the database is hung and no longer responding to queries, what is getting sent over the network? Can you snoop on that traffic?
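If you can, capturing the traffic during one of these windows would show what is actually on the wire. A rough sketch with tcpdump (the interface name and capture file are placeholders; 5432 is the default Postgres port):

    # capture traffic to/from the Postgres port during a hang
    # (replace eth0 with your actual interface, adjust the port if non-default)
    tcpdump -i eth0 -w pg_hang.pcap 'tcp port 5432'

    # afterwards, take a quick look at what was captured
    tcpdump -r pg_hang.pcap -nn | head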
CPU usage stays relatively low, but it's all system time reported; user time goes to zero. It doesn't seem to be disk related, since we're running with a shared_buffers setting of 24G, which will fit just about our entire database into memory, and the IO transactions reported by the server, as well as the disk reads reported by Postgres, stay consistently low.
There have been reports that using very large shared_buffers can cause a lot of contention issues in the kernel, at least with some kernel versions. The usual advice is not to set shared_buffers above 8GB; the operating system can use the rest of the memory to cache for you.
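If you want to try that, the change in postgresql.conf is just a sketch along these lines (the effective_cache_size value is a placeholder you'd size to your actual RAM):

    # postgresql.conf -- sketch of the suggested change; requires a restart
    shared_buffers = 8GB                  # was 24GB
    # optionally tell the planner how much the OS cache can still provide,
    # roughly RAM minus shared_buffers:
    # effective_cache_size = ...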
Also, using a connection pooler and lowering the number of connections to the real database has solved problems like this before.
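For what it's worth, a minimal pgbouncer setup that caps the number of real backend connections might look something like this (database name, addresses, and pool sizes are only placeholders):

    ; pgbouncer.ini -- minimal sketch
    [databases]
    mydb = host=127.0.0.1 port=5432 dbname=mydb

    [pgbouncer]
    listen_addr = 127.0.0.1
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    max_client_conn = 500        ; connections from the application nodes
    default_pool_size = 20       ; actual connections to Postgres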
We've recently started tracking how long statements take to execute, and we're seeing some really odd numbers. A simple delete by primary key, for example, from a table that contains about 280,000 rows, reportedly took 18h59m46.900s. An update by primary key in that same table was reported as 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those numbers don't seem reasonable at all.
How are you tracking those? Is it log_min_duration_statement or something else?
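If it is log-based, that usually means something like the following in postgresql.conf (the 250ms threshold is only an example):

    # postgresql.conf -- log statements slower than the threshold
    log_min_duration_statement = 250ms    # example threshold, not a recommendation
    log_line_prefix = '%t [%p]: '         # timestamp and pid, to correlate with the hangs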