Re: Intermittent hangs with 9.2 - Mailing list pgsql-performance
| From | Andrew Dunstan |
|---|---|
| Subject | Re: Intermittent hangs with 9.2 |
| Date | |
| Msg-id | 522F3A26.20306@dunslane.net |
| In response to | Intermittent hangs with 9.2 (David Whittaker <dave@iradix.com>) |
| List | pgsql-performance |
On 09/10/2013 11:04 AM, David Whittaker wrote:
>
> Hi All,
>
> I've been seeing a strange issue with our Postgres install for about a
> year now, and I was hoping someone might be able to help point me at
> the cause. At what seem like fairly random intervals Postgres will
> become unresponsive to the 3 application nodes it services. These
> periods tend to last for 10 - 15 minutes before everything rights
> itself and the system goes back to normal.
>
> During these periods the server will report a spike in the outbound
> bandwidth (from about 1mbs to about 5mbs most recently), a huge spike
> in context switches / interrupts (normal peaks are around 2k/8k
> respectively, and during these periods they've gone to 15k/22k), and a
> load average of 100+. CPU usage stays relatively low, but it's all
> system time reported; user time goes to zero. It doesn't seem to be
> disk related since we're running with a shared_buffers setting of 24G,
> which will fit just about our entire database into memory, and the IO
> transactions reported by the server, as well as the disk reads
> reported by Postgres, stay consistently low.
>
> We've recently started tracking how long statements take to execute,
> and we're seeing some really odd numbers. A simple delete by primary
> key, for example, from a table that contains about 280,000 rows,
> reportedly took 18h59m46.900s. An update by primary key in that same
> table was reported as 7d 17h 58m 30.415s. That table is frequently
> accessed, but obviously those numbers don't seem reasonable at all.
>
> Some other changes we've made to postgresql.conf:
>
> synchronous_commit = off
> maintenance_work_mem = 1GB
> wal_level = hot_standby
> wal_buffers = 16MB
> max_wal_senders = 10
> wal_keep_segments = 5000
> checkpoint_segments = 128
> checkpoint_timeout = 30min
> checkpoint_completion_target = 0.9
> max_connections = 500
>
> The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB
> of RAM, running CentOS 6.3.
>
> So far we've tried disabling Transparent Huge Pages after I found a
> number of resources online that indicated similar interrupt/context
> switch issues, but it hasn't resolved the problem. I managed to catch
> it happening once and run a perf which showed:
>
> + 41.40%  48154  postmaster  0x347ba9  f  0x347ba9
> +  9.55%  10956  postmaster  0x2dc820  f  set_config_option
> +  8.64%   9946  postmaster  0x5a3d4   f  writeListPage
> +  5.75%   6609  postmaster  0x5a2b0   f  ginHeapTupleFastCollect
> +  2.68%   3084  postmaster  0x192483  f  build_implied_join_equality
> +  2.61%   2990  postmaster  0x187a55  f  build_paths_for_OR
> +  1.86%   2131  postmaster  0x794aa   f  get_collation_oid
> +  1.56%   1822  postmaster  0x5a67e   f  ginHeapTupleFastInsert
> +  1.53%   1766  postmaster  0x1929bc  f  distribute_qual_to_rels
> +  1.33%   1558  postmaster  0x249671  f  cmp_numerics
>
> I'm not sure what 0x347ba9 represents, or why it's an address rather
> than a method name.
>
> That's about the sum of it. Any help would be greatly appreciated, and
> if you want any more information about our setup, please feel free to ask.

I have seen cases like this with very high shared_buffers settings. 24Gb
for shared_buffers is quite high, especially on a 48Gb box. What happens
if you dial that back to, say, 12Gb?

cheers

andrew
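For reference, the change being suggested above is a single postgresql.conf edit followed by a restart. A minimal sketch, assuming the stock PGDG 9.2 packaging on CentOS 6 (the data directory and service name are assumptions; adjust them to the actual install):

```
# Sketch only: lower shared_buffers from 24GB to 12GB and restart.
vi /var/lib/pgsql/9.2/data/postgresql.conf   # set: shared_buffers = 12GB
service postgresql-9.2 restart               # shared_buffers only changes on restart
psql -c "SHOW shared_buffers;"               # confirm the new value afterwards
```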
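Separately, the unresolved 0x347ba9 frame in the quoted perf output usually just means perf could not map the sample back to a symbol (stripped binary or missing debuginfo). A hedged sketch of how it could be chased down, assuming the PGDG binary path and that a matching debuginfo package exists for this build:

```
# Install debug symbols so perf can resolve postgres addresses (CentOS 6 / PGDG;
# the package name below is an assumption).
debuginfo-install postgresql92-server

# Re-capture during a hang and report with symbols.
perf record -a -g -- sleep 30
perf report --sort symbol

# Or translate the existing address against the same binary that was profiled
# (only meaningful if the binary is unchanged and not position-independent).
addr2line -f -e /usr/pgsql-9.2/bin/postgres 0x347ba9
```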