Intermittent hangs with 9.2 - Mailing list pgsql-performance

From David Whittaker
Subject Intermittent hangs with 9.2
Date
Msg-id CABXnLXQsX0Kp6qz=RHt+_-JYnx9yCuFs95ZkS1boHhW2aBHV=g@mail.gmail.com
List pgsql-performance
Hi Jeff,

On Tue, Sep 10, 2013 at 1:44 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Sep 10, 2013 at 8:04 AM, David Whittaker <dave@iradix.com> wrote:

Hi All,

I've been seeing a strange issue with our Postgres install for about a year now, and I was hoping someone might be able to help point me at the cause. At what seem like fairly random intervals Postgres will become unresponsive to the 3 application nodes it services. These periods tend to last for 10 - 15 minutes before everything rights itself and the system goes back to normal.

During these periods the server will report a spike in outbound bandwidth (from about 1 Mb/s to about 5 Mb/s most recently), a huge spike in context switches / interrupts (normal peaks are around 2k/8k respectively, and during these periods they've gone to 15k/22k), and a load average of 100+.


I'm curious about the spike in outbound network usage.  If the database is hung and no longer responding to queries, what is getting sent over the network?  Can you snoop on that traffic?


It seems curious to me as well.  I don't know if one or a few of the pg connections are streaming out data and somehow blocking the others in the process, or whether the data is unrelated to pg.  If you can suggest a tool I could use to monitor the data transfer continuously and get some type of summary of what happened after the issue reoccurs, I'd appreciate it.  Otherwise, I'll try to get in and catch some specifics the next time it happens.
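For continuous capture, one option (a sketch, assuming root access, interface eth0, and the default PostgreSQL port 5432 -- adjust for your setup) is a rotating tcpdump capture, plus iftop for a live per-connection view during an incident:

```shell
# Capture traffic on the Postgres port into rotating 100 MB files,
# keeping the last 10, so a later spike can be inspected offline
# with wireshark/tshark.
tcpdump -i eth0 -w /var/tmp/pg-traffic -C 100 -W 10 'tcp port 5432'

# Live per-connection bandwidth summary while the hang is in progress:
iftop -i eth0 -f 'port 5432'
```

With the rotating capture running ahead of time, the files covering a hang window can be pulled and summarized after the fact instead of having to catch the event live.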
 
 

CPU usage stays relatively low, but it's all system time reported; user time goes to zero. It doesn't seem to be disk related, since we're running with a shared_buffers setting of 24G, which will fit just about our entire database into memory, and the IO transactions reported by the server, as well as the disk reads reported by Postgres, stay consistently low.

There have been reports that using very large shared_buffers can cause a lot of contention in the kernel, at least for some kernel versions. The usual advice is not to set shared_buffers above 8GB; the operating system can use the rest of the memory to cache for you.

Also, using a connection pooler and lowering the number of connections to the real database has solved problems like this before.

We're going to implement both of these changes tonight.  I was going to go with 12G for shared_buffers based on Andrew's suggestion, but maybe I'll go down to 8 if that seems to be the magic number.  We're also going to decrease the max connections from 500 to 100 and decrease the pooled connections per server.
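For reference, the changes described above might look something like the following. The values and the pgbouncer database name are illustrative assumptions, not tuned recommendations:

```ini
# postgresql.conf -- illustrative values only
shared_buffers = 8GB        # down from 24GB; let the OS page cache hold the rest
max_connections = 100       # down from 500

# pgbouncer.ini -- one possible pooler setup (names/paths are assumptions)
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
pool_mode = transaction
default_pool_size = 20
```

Transaction pooling keeps the number of real backend connections small even when the application opens many client connections, which is the point of the change.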
 
 

We've recently started tracking how long statements take to execute, and we're seeing some really odd numbers. A simple delete by primary key, for example, from a table that contains about 280,000 rows, reportedly took 18h59m46.900s. An update by primary key in that same table was reported as 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those numbers don't seem reasonable at all.

How are you tracking those?  Is it log_min_duration_statement or something else?

We're using log_min_duration_statement = 1000, sending the log messages to syslog, then analyzing with pg_badger.
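One thing worth noting about those huge durations: log_min_duration_statement reports total elapsed time, including any time the statement spent blocked waiting on locks, so a delete by primary key that sat behind a lock for days would be logged with that full duration. A snapshot taken while a hang is in progress can show whether backends are lock-waiting. An illustrative query against the 9.2 catalogs:

```sql
-- Run while the hang is in progress. In PostgreSQL 9.2,
-- pg_stat_activity has a boolean "waiting" column (replaced by
-- wait_event in later releases).
SELECT pid, state, waiting,
       now() - query_start AS runtime,
       left(query, 60)    AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST;
```

If the long-running statements show waiting = t, the durations in the pgBadger report reflect lock contention rather than slow execution.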
 

Cheers,

Jeff

