Thread: Intermittent hangs with 9.2

Intermittent hangs with 9.2

From: David Whittaker
Date:
Hi Andrew,


On Tue, Sep 10, 2013 at 11:26 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

On 09/10/2013 11:04 AM, David Whittaker wrote:

Hi All,

I've been seeing a strange issue with our Postgres install for about a year now, and I was hoping someone might be able to help point me at the cause. At what seem like fairly random intervals Postgres will become unresponsive to the 3 application nodes it services. These periods tend to last for 10 - 15 minutes before everything rights itself and the system goes back to normal.

During these periods the server will report a spike in the outbound bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in context switches / interrupts (normal peaks are around 2k/8k respectively, and during these periods they’ve gone to 15k/22k), and a load average of 100+. CPU usage stays relatively low, but it’s all system time reported; user time goes to zero. It doesn’t seem to be disk related since we’re running with a shared_buffers setting of 24G, which will fit just about our entire database into memory, and the IO transactions reported by the server, as well as the disk reads reported by Postgres stay consistently low.

We’ve recently started tracking how long statements take to execute, and we’re seeing some really odd numbers. A simple delete by primary key, for example, from a table that contains about 280,000 rows, reportedly took 18h59m46.900s. An update by primary key in that same table was reported as 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those numbers don't seem reasonable at all.

Some other changes we've made to postgresql.conf:

synchronous_commit = off

maintenance_work_mem = 1GB
wal_level = hot_standby
wal_buffers = 16MB

max_wal_senders = 10

wal_keep_segments = 5000

checkpoint_segments = 128

checkpoint_timeout = 30min

checkpoint_completion_target = 0.9

max_connections = 500

The server is a Dell PowerEdge R900 with 4 Xeon E7430 processors, 48GB of RAM, running CentOS 6.3.

So far we’ve tried disabling Transparent Huge Pages after I found a number of resources online that indicated similar interrupt/context switch issues, but it hasn’t resolved the problem. I managed to catch it happening once and run a perf which showed:

+  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
+   9.55%       10956  postmaster  0x2dc820         f set_config_option
+   8.64%        9946  postmaster  0x5a3d4          f writeListPage
+   5.75%        6609  postmaster  0x5a2b0          f ginHeapTupleFastCollect
+   2.68%        3084  postmaster  0x192483         f build_implied_join_equality
+   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
+   1.86%        2131  postmaster  0x794aa          f get_collation_oid
+   1.56%        1822  postmaster  0x5a67e          f ginHeapTupleFastInsert
+   1.53%        1766  postmaster  0x1929bc         f distribute_qual_to_rels
+   1.33%        1558  postmaster  0x249671         f cmp_numerics

I’m not sure what 0x347ba9 represents, or why it’s an address rather than a method name.

That's about the sum of it. Any help would be greatly appreciated and if you want any more information about our setup, please feel free to ask.
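
A side note on that last point: an unresolved frame like 0x347ba9 usually just means perf could not map the address to a function name, most often because debug symbols for the postgres binary are not installed. A rough sketch of how that might be fixed on CentOS 6, and how to double-check the Transparent Huge Pages setting, follows; the package name assumes the PGDG 9.2 RPMs, and the sysfs path differs between kernel builds, so treat both as assumptions.

# THP is exposed under different paths depending on the kernel build;
# stock CentOS 6 kernels usually use the redhat_* path.
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled 2>/dev/null

# Install debug symbols so perf can resolve addresses to function names.
# The package name is a guess based on the PGDG 9.2 packaging.
yum install -y yum-utils
debuginfo-install postgresql92-server

# Capture a system-wide profile with call graphs during a hang.
perf record -a -g sleep 60
perf report --sort comm,symbol

With symbols available, that dominant 41% frame should resolve to a real function name, which would make the profile far easier to interpret.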



I have seen cases like this with very high shared_buffers settings.

24Gb for shared_buffers is quite high, especially on a 48Gb box. What happens if you dial that back to, say, 12Gb?

I'd be willing to give it a try.  I'd really like to understand what's going on here though.  Can you elaborate on that?  Why would 24G of shared buffers be too high in this case?  The machine is devoted entirely to PG, so having PG use half of the available RAM to cache data doesn't feel unreasonable. 
 

cheers

andrew



Re: Intermittent hangs with 9.2

From: Torsten Förtsch
Date:
On 10/09/13 20:04, David Whittaker wrote:
> On Tue, Sep 10, 2013 at 11:26 AM, Andrew Dunstan <andrew@dunslane.net
> <mailto:andrew@dunslane.net>> wrote:
>
>
>     On 09/10/2013 11:04 AM, David Whittaker wrote:
>
>
>         Hi All,
>
>         I've been seeing a strange issue with our Postgres install for
>         about a year now, and I was hoping someone might be able to help
>         point me at the cause. At what seem like fairly random intervals
>         Postgres will become unresponsive to the 3 application nodes it
>         services. These periods tend to last for 10 - 15 minutes before
>         everything rights itself and the system goes back to normal.
>
>         During these periods the server will report a spike in the
>         outbound bandwidth (from about 1mbs to about 5mbs most
>         recently), a huge spike in context switches / interrupts (normal
>         peaks are around 2k/8k respectively, and during these periods
>         they’ve gone to 15k/22k), and a load average of 100+. CPU usage
>         stays relatively low, but it’s all system time reported; user
>         time goes to zero. It doesn’t seem to be disk related since
>         we’re running with a shared_buffers setting of 24G, which will
>         fit just about our entire database into memory, and the IO
>         transactions reported by the server, as well as the disk reads
>         reported by Postgres stay consistently low.
>
>         We’ve recently started tracking how long statements take to
>         execute, and we’re seeing some really odd numbers. A simple
>         delete by primary key, for example, from a table that contains
>         about 280,000 rows, reportedly took 18h59m46.900s. An update by
>         primary key in that same table was reported as 7d 17h 58m
>         30.415s. That table is frequently accessed, but obviously those
>         numbers don't seem reasonable at all.
>
>         Some other changes we've made to postgresql.conf:
>
>         synchronous_commit = off
>
>         maintenance_work_mem = 1GB
>         wal_level = hot_standby
>         wal_buffers = 16MB
>
>         max_wal_senders = 10
>
>         wal_keep_segments = 5000
>
>         checkpoint_segments = 128
>
>         checkpoint_timeout = 30min
>
>         checkpoint_completion_target = 0.9
>
>         max_connections = 500
>
>         The server is a Dell PowerEdge R900 with 4 Xeon E7430
>         processors, 48GB of RAM, running CentOS 6.3.
>
>         So far we’ve tried disabling Transparent Huge Pages after I
>         found a number of resources online that indicated similar
>         interrupt/context switch issues, but it hasn’t resolved the
>         problem. I managed to catch it happening once and run a perf
>         which showed:
>
>         +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
>         +   9.55%       10956  postmaster  0x2dc820         f
>         set_config_option
>         +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
>         +   5.75%        6609  postmaster  0x5a2b0          f
>         ginHeapTupleFastCollect
>         +   2.68%        3084  postmaster  0x192483         f
>         build_implied_join_equality
>         +   2.61%        2990  postmaster  0x187a55         f
>         build_paths_for_OR
>         +   1.86%        2131  postmaster  0x794aa          f
>         get_collation_oid
>         +   1.56%        1822  postmaster  0x5a67e          f
>         ginHeapTupleFastInsert
>         +   1.53%        1766  postmaster  0x1929bc         f
>         distribute_qual_to_rels
>         +   1.33%        1558  postmaster  0x249671         f cmp_numerics
>
>         I’m not sure what 0x347ba9 represents, or why it’s an address
>         rather than a method name.
>
>         That's about the sum of it. Any help would be greatly
>         appreciated and if you want any more information about our
>         setup, please feel free to ask.
>
>
>
>     I have seen cases like this with very high shared_buffers settings.
>
>     24Gb for shared_buffers is quite high, especially on a 48Gb box.
>     What happens if you dial that back to, say, 12Gb?
>
>
> I'd be willing to give it a try.  I'd really like to understand what's
> going on here though.  Can you elaborate on that?  Why would 24G of
> shared buffers be too high in this case?  The machine is devoted
> entirely to PG, so having PG use half of the available RAM to cache data
> doesn't feel unreasonable.

Here is what I have recently learned.

The root cause is crash safety and checkpoints. This is certainly
something you want. When you write to the database, the operation first
happens in the buffer cache, and the particular buffer you write to is
marked dirty. The cache is organized in 8kB pages. Additionally, write
operations are also committed to the WAL.

A checkpoint iterates over all dirty buffers, writing them out to the
database. After that, all buffers are clean again.

Now, the first time you write to a clean buffer, the entire 8kB page is
written to the WAL (a full-page write, see full_page_writes). That means
that right after a checkpoint, when every buffer is still clean, every
write triggers a full 8kB write to the WAL. (Writes to already-dirty
buffers are logged only partially.)

And the more shared buffers you have, the more of them can be dirtied
immediately after a checkpoint, hence the spike.
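
If that is what is happening, the hangs should line up with checkpoints. A minimal way to check, assuming a stock 9.2 install, superuser access through psql, and with the data directory path below standing in for the real one:

# log_checkpoints prints one log line per checkpoint, with write/sync
# timings; it only needs a reload, not a restart.
psql -c "SHOW log_checkpoints;"
# after setting log_checkpoints = on in postgresql.conf:
pg_ctl reload -D /path/to/data

# How much writing is done by checkpoints vs. the bgwriter vs. backends.
psql -c "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint,
                buffers_clean, buffers_backend
         FROM pg_stat_bgwriter;"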

To mitigate that, lower shared_buffers to the 12GB Andrew mentioned, or
even lower (8GB), but also adjust effective_cache_size. That setting
should reflect the free space you have when the database is NOT running.
I expect in your case that would be something between 40GB and 46GB.
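
As a concrete illustration of that advice (the values are just the ones suggested above, not measured numbers), the relevant settings and a quick way to check the available memory might look like this:

# With PostgreSQL stopped, see how much memory the OS has free for caching.
free -g

# postgresql.conf (illustrative values; shared_buffers needs a restart):
#   shared_buffers       = 8GB     # or the 12GB Andrew suggested
#   effective_cache_size = 40GB    # roughly the free memory seen above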

Please correct me if I'm wrong!
Torsten


Re: Intermittent hangs with 9.2

From: Julien Cigar
Date:
On Tue, Sep 10, 2013 at 02:04:57PM -0400, David Whittaker wrote:
> Hi Andrew,
>
>
> On Tue, Sep 10, 2013 at 11:26 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
> >
> > On 09/10/2013 11:04 AM, David Whittaker wrote:
> >
> >>
> >> Hi All,
> >>
> >> I've been seeing a strange issue with our Postgres install for about a
> >> year now, and I was hoping someone might be able to help point me at the
> >> cause. At what seem like fairly random intervals Postgres will become
> >> unresponsive to the 3 application nodes it services. These periods tend to
> >> last for 10 - 15 minutes before everything rights itself and the system
> >> goes back to normal.
> >>
> >> During these periods the server will report a spike in the outbound
> >> bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
> >> context switches / interrupts (normal peaks are around 2k/8k respectively,
> >> and during these periods they’ve gone to 15k/22k), and a load average of
> >> 100+. CPU usage stays relatively low, but it’s all system time reported;
> >> user time goes to zero. It doesn’t seem to be disk related since we’re
> >> running with a shared_buffers setting of 24G, which will fit just about our
> >> entire database into memory, and the IO transactions reported by the
> >> server, as well as the disk reads reported by Postgres stay consistently
> >> low.
> >>
> >> We’ve recently started tracking how long statements take to execute, and
> >> we’re seeing some really odd numbers. A simple delete by primary key, for
> >> example, from a table that contains about 280,000 rows, reportedly took
> >> 18h59m46.900s. An update by primary key in that same table was reported as
> >> 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
> >> numbers don't seem reasonable at all.
> >>
> >> Some other changes we've made to postgresql.conf:
> >>
> >> synchronous_commit = off
> >>
> >> maintenance_work_mem = 1GB
> >> wal_level = hot_standby
> >> wal_buffers = 16MB
> >>
> >> max_wal_senders = 10
> >>
> >> wal_keep_segments = 5000
> >>
> >> checkpoint_segments = 128
> >>
> >> checkpoint_timeout = 30min
> >>
> >> checkpoint_completion_target = 0.9
> >>
> >> max_connections = 500
> >>
> >> The server is a Dell PowerEdge R900 with 4 Xeon E7430 processors, 48GB of
> >> RAM, running CentOS 6.3.
> >>
> >> So far we’ve tried disabling Transparent Huge Pages after I found a
> >> number of resources online that indicated similar interrupt/context switch
> >> issues, but it hasn’t resolved the problem. I managed to catch it happening
> >> once and run a perf which showed:
> >>
> >> +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
> >> +   9.55%       10956  postmaster  0x2dc820         f set_config_option
> >> +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
> >> +   5.75%        6609  postmaster  0x5a2b0          f
> >> ginHeapTupleFastCollect
> >> +   2.68%        3084  postmaster  0x192483         f
> >> build_implied_join_equality
> >> +   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
> >> +   1.86%        2131  postmaster  0x794aa          f get_collation_oid
> >> +   1.56%        1822  postmaster  0x5a67e          f
> >> ginHeapTupleFastInsert
> >> +   1.53%        1766  postmaster  0x1929bc         f
> >> distribute_qual_to_rels
> >> +   1.33%        1558  postmaster  0x249671         f cmp_numerics
> >>
> >> I’m not sure what 0x347ba9 represents, or why it’s an address rather than
> >> a method name.
> >>
> >> That's about the sum of it. Any help would be greatly appreciated and if
> >> you want any more information about our setup, please feel free to ask.
> >>
> >>
> >>
> > I have seen cases like this with very high shared_buffers settings.
> >
> > 24Gb for shared_buffers is quite high, especially on a 48Gb box. What
> > happens if you dial that back to, say, 12Gb?
> >
>
> I'd be willing to give it a try.  I'd really like to understand what's
> going on here though.  Can you elaborate on that?  Why would 24G of shared
> buffers be too high in this case?  The machine is devoted entirely to PG,
> so having PG use half of the available RAM to cache data doesn't feel
> unreasonable.

Some of the overhead of bgwriter and checkpoints is more or less linear
in the size of shared_buffers. If your shared_buffers is large, a lot of
data could be dirty when a checkpoint starts, resulting in an I/O spike
... (although we've spread checkpoints out in recent PG versions, so this
should be less of a problem nowadays).
Another reason is that the OS cache is also being used for reads and
writes, and with a large shared_buffers there is a risk of "doubly
cached" data (in the OS cache + in shared_buffers).
In an ideal world the most frequently used blocks would be in
shared_buffers and the less frequently used blocks in the OS cache.
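
On the double-caching point, the pg_buffercache contrib module (shipped with 9.2, though it has to be installed into the database and needs the contrib package, e.g. postgresql92-contrib) gives a direct view of what is actually held in shared_buffers; a quick sketch:

# Install the extension once in the database you want to inspect.
psql -c "CREATE EXTENSION pg_buffercache;"

# How many shared buffers are in use, and how many are dirty right now.
psql -c "SELECT count(*) AS buffers_used,
                sum(CASE WHEN isdirty THEN 1 ELSE 0 END) AS buffers_dirty
         FROM pg_buffercache
         WHERE relfilenode IS NOT NULL;"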

>
>
> >
> > cheers
> >
> > andrew
> >
> >

--
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.