Re: Intermittent hangs with 9.2 - Mailing list pgsql-performance

From Merlin Moncure
Subject Re: Intermittent hangs with 9.2
Date
Msg-id CAHyXU0w06DcbQnakx1bkryBN+w2J_fyxv+TWDHFcexqirkouSg@mail.gmail.com
Whole thread Raw
In response to Re: Intermittent hangs with 9.2  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-performance
On Wed, Sep 11, 2013 at 12:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-11 07:43:35 -0500, Merlin Moncure wrote:
>> > I've been seeing a strange issue with our Postgres install for about a year
>> > now, and I was hoping someone might be able to help point me at the cause.
>> > At what seem like fairly random intervals Postgres will become unresponsive
>> > to the 3 application nodes it services. These periods tend to last for 10 -
>> > 15 minutes before everything rights itself and the system goes back to
>> > normal.
>> >
>> > During these periods the server will report a spike in the outbound
>> > bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
>> > context switches / interrupts (normal peaks are around 2k/8k respectively,
>> > and during these periods they‘ve gone to 15k/22k), and a load average of
>> > 100+. CPU usage stays relatively low, but it’s all system time reported,
>> > user time goes to zero. It doesn‘t seem to be disk related since we’re
>> > running with a shared_buffers setting of 24G, which will fit just about our
>> > entire database into memory, and the IO transactions reported by the server,
>> > as well as the disk reads reported by Postgres stay consistently low.
>> >
>> > We‘ve recently started tracking how long statements take to execute, and
>> > we’re seeing some really odd numbers. A simple delete by primary key, for
>> > example, from a table that contains about 280,000 rows, reportedly took
>> > 18h59m46.900s. An update by primary key in that same table was reported as
>> > 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
>> > numbers don't seem reasonable at all.
>> >
>> > Some other changes we've made to postgresql.conf:
>> >
>> > synchronous_commit = off
>> >
>> > maintenance_work_mem = 1GB
>> > wal_level = hot_standby
>> > wal_buffers = 16MB
>> >
>> > max_wal_senders = 10
>> >
>> > wal_keep_segments = 5000
>> >
>> > checkpoint_segments = 128
>> >
>> > checkpoint_timeout = 30min
>> >
>> > checkpoint_completion_target = 0.9
>> >
>> > max_connections = 500
>> >
>> > The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
>> > RAM, running Cent OS 6.3.
>> >
>> > So far we‘ve tried disabling Transparent Huge Pages after I found a number
>> > of resources online that indicated similar interrupt/context switch issues,
>> > but it hasn’t resolve the problem. I managed to catch it happening once and
>> > run a perf which showed:
>> >
>> > +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
>> > +   9.55%       10956  postmaster  0x2dc820         f set_config_option
>> > +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
>> > +   5.75%        6609  postmaster  0x5a2b0          f
>> > ginHeapTupleFastCollect
>> > +   2.68%        3084  postmaster  0x192483         f
>> > build_implied_join_equality
>> > +   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
>> > +   1.86%        2131  postmaster  0x794aa          f get_collation_oid
>> > +   1.56%        1822  postmaster  0x5a67e          f ginHeapTupleFastInsert
>> > +   1.53%        1766  postmaster  0x1929bc         f
>> > distribute_qual_to_rels
>> > +   1.33%        1558  postmaster  0x249671         f cmp_numerics
>> >
>> > I‘m not sure what 0x347ba9 represents, or why it’s an address rather than a
>> > method name.
>
> Try converting it to something more meaningful with "addr2line", that
> often has more sucess.
>
>> > That's about the sum of it. Any help would be greatly appreciated and if you
>> > want any more information about our setup, please feel free to ask.
>
>> Reducing shared buffers to around 2gb will probably make the problem go away
>
> That profile doesn't really look like one of the problem you are
> referring to would look like.

yup -- I think you're right.

merlin


pgsql-performance by date:

Previous
From: Andres Freund
Date:
Subject: Re: Intermittent hangs with 9.2
Next
From: Josh Berkus
Date:
Subject: Re: Performance bug in prepared statement binding in 9.2?