Re: Intermittent hangs with 9.2 - Mailing list pgsql-performance

From Julien Cigar
Subject Re: Intermittent hangs with 9.2
Date
Msg-id 20130911083642.GF75834@mordor.lan
Whole thread Raw
In response to Intermittent hangs with 9.2  (David Whittaker <dave@iradix.com>)
List pgsql-performance
On Tue, Sep 10, 2013 at 02:04:57PM -0400, David Whittaker wrote:
> Hi Andrew,
>
>
> On Tue, Sep 10, 2013 at 11:26 AM, Andrew Dunstan <andrew@dunslane.net>wrote:
>
> >
> > On 09/10/2013 11:04 AM, David Whittaker wrote:
> >
> >>
> >> Hi All,
> >>
> >> I've been seeing a strange issue with our Postgres install for about a
> >> year now, and I was hoping someone might be able to help point me at the
> >> cause. At what seem like fairly random intervals Postgres will become
> >> unresponsive to the 3 application nodes it services. These periods tend to
> >> last for 10 - 15 minutes before everything rights itself and the system
> >> goes back to normal.
> >>
> >> During these periods the server will report a spike in the outbound
> >> bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
> >> context switches / interrupts (normal peaks are around 2k/8k respectively,
> >> and during these periods they‘ve gone to 15k/22k), and a load average of
> >> 100+. CPU usage stays relatively low, but it’s all system time reported,
> >> user time goes to zero. It doesn‘t seem to be disk related since we’re
> >> running with a shared_buffers setting of 24G, which will fit just about our
> >> entire database into memory, and the IO transactions reported by the
> >> server, as well as the disk reads reported by Postgres stay consistently
> >> low.
> >>
> >> We‘ve recently started tracking how long statements take to execute, and
> >> we’re seeing some really odd numbers. A simple delete by primary key, for
> >> example, from a table that contains about 280,000 rows, reportedly took
> >> 18h59m46.900s. An update by primary key in that same table was reported as
> >> 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
> >> numbers don't seem reasonable at all.
> >>
> >> Some other changes we've made to postgresql.conf:
> >>
> >> synchronous_commit = off
> >>
> >> maintenance_work_mem = 1GB
> >> wal_level = hot_standby
> >> wal_buffers = 16MB
> >>
> >> max_wal_senders = 10
> >>
> >> wal_keep_segments = 5000
> >>
> >> checkpoint_segments = 128
> >>
> >> checkpoint_timeout = 30min
> >>
> >> checkpoint_completion_target = 0.9
> >>
> >> max_connections = 500
> >>
> >> The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
> >> RAM, running Cent OS 6.3.
> >>
> >> So far we‘ve tried disabling Transparent Huge Pages after I found a
> >> number of resources online that indicated similar interrupt/context switch
> >> issues, but it hasn’t resolve the problem. I managed to catch it happening
> >> once and run a perf which showed:
> >>
> >> |
> >> +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
> >> +   9.55%       10956  postmaster  0x2dc820         f set_config_option
> >> +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
> >> +   5.75%        6609  postmaster  0x5a2b0          f
> >> ginHeapTupleFastCollect
> >> +   2.68%        3084  postmaster  0x192483         f
> >> build_implied_join_equality
> >> +   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
> >> +   1.86%        2131  postmaster  0x794aa          f get_collation_oid
> >> +   1.56%        1822  postmaster  0x5a67e          f
> >> ginHeapTupleFastInsert
> >> +   1.53%        1766  postmaster  0x1929bc         f
> >> distribute_qual_to_rels
> >> +   1.33%        1558  postmaster  0x249671         f cmp_numerics|
> >>
> >> I‘m not sure what 0x347ba9 represents, or why it’s an address rather than
> >> a method name.
> >>
> >> That's about the sum of it. Any help would be greatly appreciated and if
> >> you want any more information about our setup, please feel free to ask.
> >>
> >>
> >>
> > I have seen cases like this with very high shared_buffers settings.
> >
> > 24Gb for shared_buffers is quite high, especially on a 48Gb box. What
> > happens if you dial that back to, say, 12Gb?
> >
>
> I'd be willing to give it a try.  I'd really like to understand what's
> going on here though.  Can you elaborate on that?  Why would 24G of shared
> buffers be too high in this case?  The machine is devoted entirely to PG,
> so having PG use half of the available RAM to cache data doesn't feel
> unreasonable.

Some of the overhead of bgwriter and checkpoints is more or less linear
in the size of shared_buffers. If your shared_buffers is large a lot of
data could be dirty when a checkpoint starts, resulting in an I/O spike
... (although we've spread checkpoints in recent pg versions, so this
should be less a problem nowadays).
Another reason is that the OS cache is also being used for reads and
writes and with a large shared_buffers there is a risk of "doubly cached
data" (in the OS cache + in shared_buffers).
In an ideal world most frequently used blocks should be in
shared_buffers and less frequently used block in the OS cache ..

>
>
> >
> > cheers
> >
> > andrew
> >
> >

--
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.


pgsql-performance by date:

Previous
From: Torsten Förtsch
Date:
Subject: Re: Intermittent hangs with 9.2
Next
From: Maximilian Tyrtania
Date:
Subject: slow sort