Thread: Intermittent hangs with 9.2

Intermittent hangs with 9.2

From
David Whittaker
Date:

Hi All,

I've been seeing a strange issue with our Postgres install for about a year now, and I was hoping someone might be able to help point me at the cause. At what seem like fairly random intervals Postgres will become unresponsive to the 3 application nodes it services. These periods tend to last for 10 - 15 minutes before everything rights itself and the system goes back to normal.

During these periods the server will report a spike in the outbound bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in context switches / interrupts (normal peaks are around 2k/8k respectively, and during these periods they've gone to 15k/22k), and a load average of 100+. CPU usage stays relatively low, but it's all system time reported, user time goes to zero. It doesn't seem to be disk related since we're running with a shared_buffers setting of 24G, which will fit just about our entire database into memory, and the IO transactions reported by the server, as well as the disk reads reported by Postgres, stay consistently low.

We've recently started tracking how long statements take to execute, and we're seeing some really odd numbers. A simple delete by primary key, for example, from a table that contains about 280,000 rows, reportedly took 18h59m46.900s. An update by primary key in that same table was reported as 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those numbers don't seem reasonable at all.

Some other changes we've made to postgresql.conf:

synchronous_commit = off

maintenance_work_mem = 1GB
wal_level = hot_standby
wal_buffers = 16MB

max_wal_senders = 10

wal_keep_segments = 5000

checkpoint_segments = 128

checkpoint_timeout = 30min

checkpoint_completion_target = 0.9

max_connections = 500

The server is a Dell PowerEdge R900 with 4 Xeon E7430 processors, 48GB of RAM, running CentOS 6.3.

So far we've tried disabling Transparent Huge Pages after I found a number of resources online that indicated similar interrupt/context switch issues, but it hasn't resolved the problem. I managed to catch it happening once and run perf, which showed:


+  41.40%       48154  postmaster  0x347ba9         f 0x347ba9                                   
+   9.55%       10956  postmaster  0x2dc820         f set_config_option                          
+   8.64%        9946  postmaster  0x5a3d4          f writeListPage     
+   5.75%        6609  postmaster  0x5a2b0          f ginHeapTupleFastCollect                    
+   2.68%        3084  postmaster  0x192483         f build_implied_join_equality                
+   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR                         
+   1.86%        2131  postmaster  0x794aa          f get_collation_oid                          
+   1.56%        1822  postmaster  0x5a67e          f ginHeapTupleFastInsert                     
+   1.53%        1766  postmaster  0x1929bc         f distribute_qual_to_rels                    
+   1.33%        1558  postmaster  0x249671         f cmp_numerics

I'm not sure what 0x347ba9 represents, or why it's an address rather than a method name.

That's about the sum of it. Any help would be greatly appreciated and if you want any more information about our setup, please feel free to ask.

Thanks,
Dave

Re: Intermittent hangs with 9.2

From
Andrew Dunstan
Date:
On 09/10/2013 11:04 AM, David Whittaker wrote:
>
> Hi All,
>
> I've been seeing a strange issue with our Postgres install for about a
> year now, and I was hoping someone might be able to help point me at
> the cause. At what seem like fairly random intervals Postgres will
> become unresponsive to the 3 application nodes it services. These
> periods tend to last for 10 - 15 minutes before everything rights
> itself and the system goes back to normal.
>
> During these periods the server will report a spike in the outbound
> bandwidth (from about 1mbs to about 5mbs most recently), a huge spike
> in context switches / interrupts (normal peaks are around 2k/8k
> respectively, and during these periods they've gone to 15k/22k), and a
> load average of 100+. CPU usage stays relatively low, but it's all
> system time reported, user time goes to zero. It doesn't seem to be
> disk related since we're running with a shared_buffers setting of 24G,
> which will fit just about our entire database into memory, and the IO
> transactions reported by the server, as well as the disk reads
> reported by Postgres stay consistently low.
>
> We've recently started tracking how long statements take to execute,
> and we're seeing some really odd numbers. A simple delete by primary
> key, for example, from a table that contains about 280,000 rows,
> reportedly took 18h59m46.900s. An update by primary key in that same
> table was reported as 7d 17h 58m 30.415s. That table is frequently
> accessed, but obviously those numbers don't seem reasonable at all.
>
> Some other changes we've made to postgresql.conf:
>
> synchronous_commit = off
>
> maintenance_work_mem = 1GB
> wal_level = hot_standby
> wal_buffers = 16MB
>
> max_wal_senders = 10
>
> wal_keep_segments = 5000
>
> checkpoint_segments = 128
>
> checkpoint_timeout = 30min
>
> checkpoint_completion_target = 0.9
>
> max_connections = 500
>
> The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB
> of RAM, running Cent OS 6.3.
>
> So far we've tried disabling Transparent Huge Pages after I found a
> number of resources online that indicated similar interrupt/context
> switch issues, but it hasn't resolved the problem. I managed to catch
> it happening once and run perf, which showed:
>
> +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
> +   9.55%       10956  postmaster  0x2dc820         f set_config_option
> +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
> +   5.75%        6609  postmaster  0x5a2b0          f ginHeapTupleFastCollect
> +   2.68%        3084  postmaster  0x192483         f build_implied_join_equality
> +   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
> +   1.86%        2131  postmaster  0x794aa          f get_collation_oid
> +   1.56%        1822  postmaster  0x5a67e          f ginHeapTupleFastInsert
> +   1.53%        1766  postmaster  0x1929bc         f distribute_qual_to_rels
> +   1.33%        1558  postmaster  0x249671         f cmp_numerics
>
> I'm not sure what 0x347ba9 represents, or why it's an address rather
> than a method name.
>
> That's about the sum of it. Any help would be greatly appreciated and
> if you want any more information about our setup, please feel free to ask.
>
>

I have seen cases like this with very high shared_buffers settings.

24GB for shared_buffers is quite high, especially on a 48GB box. What
happens if you dial that back to, say, 12GB?

cheers

andrew



Re: Intermittent hangs with 9.2

From
"ktm@rice.edu"
Date:
On Tue, Sep 10, 2013 at 11:04:21AM -0400, David Whittaker wrote:
> Hi All,
>
> I've been seeing a strange issue with our Postgres install for about a year
> now, and I was hoping someone might be able to help point me at the cause.
> At what seem like fairly random intervals Postgres will become unresponsive
> to the 3 application nodes it services. These periods tend to last for 10 -
> 15 minutes before everything rights itself and the system goes back to
> normal.
>
> During these periods the server will report a spike in the outbound
> bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
> context switches / interrupts (normal peaks are around 2k/8k respectively,
> and during these periods they've gone to 15k/22k), and a load average of
> 100+. CPU usage stays relatively low, but it's all system time reported,
> user time goes to zero. It doesn't seem to be disk related since we're
> running with a shared_buffers setting of 24G, which will fit just about our
> entire database into memory, and the IO transactions reported by the
> server, as well as the disk reads reported by Postgres stay consistently
> low.
>
> We've recently started tracking how long statements take to execute, and
> we're seeing some really odd numbers. A simple delete by primary key, for
> example, from a table that contains about 280,000 rows, reportedly took
> 18h59m46.900s. An update by primary key in that same table was reported as
> 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
> numbers don't seem reasonable at all.
>
> Some other changes we've made to postgresql.conf:
>
> synchronous_commit = off
>
> maintenance_work_mem = 1GB
> wal_level = hot_standby
> wal_buffers = 16MB
>
> max_wal_senders = 10
>
> wal_keep_segments = 5000
>
> checkpoint_segments = 128
>
> checkpoint_timeout = 30min
>
> checkpoint_completion_target = 0.9
>
> max_connections = 500
>
> The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
> RAM, running Cent OS 6.3.
>
> So far we've tried disabling Transparent Huge Pages after I found a number
> of resources online that indicated similar interrupt/context switch issues,
> but it hasn't resolved the problem. I managed to catch it happening once and
> run perf, which showed:
>
>
> +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
> +   9.55%       10956  postmaster  0x2dc820         f set_config_option
> +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
> +   5.75%        6609  postmaster  0x5a2b0          f ginHeapTupleFastCollect
> +   2.68%        3084  postmaster  0x192483         f build_implied_join_equality
> +   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
> +   1.86%        2131  postmaster  0x794aa          f get_collation_oid
> +   1.56%        1822  postmaster  0x5a67e          f ginHeapTupleFastInsert
> +   1.53%        1766  postmaster  0x1929bc         f distribute_qual_to_rels
> +   1.33%        1558  postmaster  0x249671         f cmp_numerics
>
> I'm not sure what 0x347ba9 represents, or why it's an address rather than a
> method name.
>
> That's about the sum of it. Any help would be greatly appreciated and if
> you want any more information about our setup, please feel free to ask.
>
> Thanks,
> Dave

Hi Dave,

A load average of 100+ means that you have that many processes waiting to
run, yet you only have 16 CPUs. You really need to consider using a connection
pooler like pgbouncer to keep your connection count in the 16-32 range.
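
As a sketch, a transaction-pooling pgbouncer setup might look something like
this (the database name, paths and pool size below are just placeholders to
adapt to your environment):

; pgbouncer.ini (illustrative values only)
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling keeps the number of real server connections small
pool_mode = transaction
; actual server connections stay in the 16-32 range suggested above
default_pool_size = 24
; the app nodes can still open plenty of client connections
max_client_conn = 500

The application nodes would then connect to port 6432 instead of 5432.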

Regards,
Ken


Re: Intermittent hangs with 9.2

From
Jeff Janes
Date:
On Tue, Sep 10, 2013 at 8:04 AM, David Whittaker <dave@iradix.com> wrote:

> Hi All,
>
> I've been seeing a strange issue with our Postgres install for about a year now, and I was hoping someone might be able to help point me at the cause. At what seem like fairly random intervals Postgres will become unresponsive to the 3 application nodes it services. These periods tend to last for 10 - 15 minutes before everything rights itself and the system goes back to normal.
>
> During these periods the server will report a spike in the outbound bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in context switches / interrupts (normal peaks are around 2k/8k respectively, and during these periods they've gone to 15k/22k), and a load average of 100+.


I'm curious about the spike in outbound network usage.  If the database is hung and no longer responding to queries, what is getting sent over the network?  Can you snoop on that traffic?
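
If you want to try, something like this run during one of the episodes would grab a bounded sample to look at afterwards (the interface name and port are whatever applies on your box):

# capture a sample of Postgres traffic during a hang (assumes eth0 and port 5432)
tcpdump -i eth0 -s 0 -c 1000 -w pg_hang.pcap 'tcp port 5432'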
 

> CPU usage stays relatively low, but it's all system time reported, user time goes to zero. It doesn't seem to be disk related since we're running with a shared_buffers setting of 24G, which will fit just about our entire database into memory, and the IO transactions reported by the server, as well as the disk reads reported by Postgres stay consistently low.

There have been reports that using very large shared_buffers can cause a lot of contention issues in the kernel, for some kernels. The usual advice is not to set shared_buffers above 8GB.  The operating system can use the rest of the memory to cache for you.

Also, using a connection pooler and lowering the number of connections to the real database has solved problems like this before.
 

> We've recently started tracking how long statements take to execute, and we're seeing some really odd numbers. A simple delete by primary key, for example, from a table that contains about 280,000 rows, reportedly took 18h59m46.900s. An update by primary key in that same table was reported as 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those numbers don't seem reasonable at all.

How are you tracking those?  Is it log_min_duration_statement or something else?
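
If it isn't that already, log_min_duration_statement is the usual way to get those numbers straight from Postgres; a sketch of the relevant settings (the threshold is just an example):

# postgresql.conf (illustrative threshold)
log_min_duration_statement = 500    # log any statement that runs longer than 500 ms
log_line_prefix = '%t [%p]: user=%u db=%d '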

Cheers,

Jeff

Re: Intermittent hangs with 9.2

From
Merlin Moncure
Date:
On Tue, Sep 10, 2013 at 10:04 AM, David Whittaker <dave@iradix.com> wrote:
> Hi All,
>
> I've been seeing a strange issue with our Postgres install for about a year
> now, and I was hoping someone might be able to help point me at the cause.
> At what seem like fairly random intervals Postgres will become unresponsive
> to the 3 application nodes it services. These periods tend to last for 10 -
> 15 minutes before everything rights itself and the system goes back to
> normal.
>
> During these periods the server will report a spike in the outbound
> bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
> context switches / interrupts (normal peaks are around 2k/8k respectively,
> and during these periods they've gone to 15k/22k), and a load average of
> 100+. CPU usage stays relatively low, but it's all system time reported,
> user time goes to zero. It doesn't seem to be disk related since we're
> running with a shared_buffers setting of 24G, which will fit just about our
> entire database into memory, and the IO transactions reported by the server,
> as well as the disk reads reported by Postgres stay consistently low.
>
> We've recently started tracking how long statements take to execute, and
> we're seeing some really odd numbers. A simple delete by primary key, for
> example, from a table that contains about 280,000 rows, reportedly took
> 18h59m46.900s. An update by primary key in that same table was reported as
> 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
> numbers don't seem reasonable at all.
>
> Some other changes we've made to postgresql.conf:
>
> synchronous_commit = off
>
> maintenance_work_mem = 1GB
> wal_level = hot_standby
> wal_buffers = 16MB
>
> max_wal_senders = 10
>
> wal_keep_segments = 5000
>
> checkpoint_segments = 128
>
> checkpoint_timeout = 30min
>
> checkpoint_completion_target = 0.9
>
> max_connections = 500
>
> The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
> RAM, running Cent OS 6.3.
>
> So far we've tried disabling Transparent Huge Pages after I found a number
> of resources online that indicated similar interrupt/context switch issues,
> but it hasn't resolved the problem. I managed to catch it happening once and
> run perf, which showed:
>
> +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
> +   9.55%       10956  postmaster  0x2dc820         f set_config_option
> +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
> +   5.75%        6609  postmaster  0x5a2b0          f ginHeapTupleFastCollect
> +   2.68%        3084  postmaster  0x192483         f build_implied_join_equality
> +   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
> +   1.86%        2131  postmaster  0x794aa          f get_collation_oid
> +   1.56%        1822  postmaster  0x5a67e          f ginHeapTupleFastInsert
> +   1.53%        1766  postmaster  0x1929bc         f distribute_qual_to_rels
> +   1.33%        1558  postmaster  0x249671         f cmp_numerics
>
> I'm not sure what 0x347ba9 represents, or why it's an address rather than a
> method name.
>
> That's about the sum of it. Any help would be greatly appreciated and if you
> want any more information about our setup, please feel free to ask.


Reducing shared buffers to around 2GB will probably make the problem go away.

*) What's your ratio of reads to writes (approximately)?

*) How many connections are there when it happens?  Do connections pile up after that?

*) Are you willing to run custom patched postmaster to help
troubleshoot the problem?

merlin


Re: Intermittent hangs with 9.2

From
Andres Freund
Date:
On 2013-09-11 07:43:35 -0500, Merlin Moncure wrote:
> > I've been seeing a strange issue with our Postgres install for about a year
> > now, and I was hoping someone might be able to help point me at the cause.
> > At what seem like fairly random intervals Postgres will become unresponsive
> > to the 3 application nodes it services. These periods tend to last for 10 -
> > 15 minutes before everything rights itself and the system goes back to
> > normal.
> >
> > During these periods the server will report a spike in the outbound
> > bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
> > context switches / interrupts (normal peaks are around 2k/8k respectively,
> > and during these periods they've gone to 15k/22k), and a load average of
> > 100+. CPU usage stays relatively low, but it's all system time reported,
> > user time goes to zero. It doesn't seem to be disk related since we're
> > running with a shared_buffers setting of 24G, which will fit just about our
> > entire database into memory, and the IO transactions reported by the server,
> > as well as the disk reads reported by Postgres stay consistently low.
> >
> > We've recently started tracking how long statements take to execute, and
> > we're seeing some really odd numbers. A simple delete by primary key, for
> > example, from a table that contains about 280,000 rows, reportedly took
> > 18h59m46.900s. An update by primary key in that same table was reported as
> > 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
> > numbers don't seem reasonable at all.
> >
> > Some other changes we've made to postgresql.conf:
> >
> > synchronous_commit = off
> >
> > maintenance_work_mem = 1GB
> > wal_level = hot_standby
> > wal_buffers = 16MB
> >
> > max_wal_senders = 10
> >
> > wal_keep_segments = 5000
> >
> > checkpoint_segments = 128
> >
> > checkpoint_timeout = 30min
> >
> > checkpoint_completion_target = 0.9
> >
> > max_connections = 500
> >
> > The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
> > RAM, running Cent OS 6.3.
> >
> > So far we've tried disabling Transparent Huge Pages after I found a number
> > of resources online that indicated similar interrupt/context switch issues,
> > but it hasn't resolved the problem. I managed to catch it happening once and
> > run perf, which showed:
> >
> > +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
> > +   9.55%       10956  postmaster  0x2dc820         f set_config_option
> > +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
> > +   5.75%        6609  postmaster  0x5a2b0          f ginHeapTupleFastCollect
> > +   2.68%        3084  postmaster  0x192483         f build_implied_join_equality
> > +   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
> > +   1.86%        2131  postmaster  0x794aa          f get_collation_oid
> > +   1.56%        1822  postmaster  0x5a67e          f ginHeapTupleFastInsert
> > +   1.53%        1766  postmaster  0x1929bc         f distribute_qual_to_rels
> > +   1.33%        1558  postmaster  0x249671         f cmp_numerics
> >
> > I'm not sure what 0x347ba9 represents, or why it's an address rather than a
> > method name.

Try converting it to something more meaningful with "addr2line"; that
often has more success.
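
Roughly (the binary path is whatever your install uses; it needs to be the same build the profile was taken against, with symbols available):

# resolve the raw address from the profile against the postgres executable
addr2line -f -e /usr/pgsql-9.2/bin/postgres 0x347ba9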

> > That's about the sum of it. Any help would be greatly appreciated and if you
> > want any more information about our setup, please feel free to ask.

> Reducing shared buffers to around 2gb will probably make the problem go away

That profile doesn't really look like what the problem you are
referring to would look like.

Based on the profile I'd guess it's possible that you're seeing problems
with GIN's "fastupdate" mechanism.
Try ALTER INDEX whatever SET (FASTUPDATE = OFF); VACUUM
whatever's_table for all gin indexes.
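
Something like this would list them; the ALTER/VACUUM lines are the pattern to repeat with your real index and table names:

-- find all GIN indexes and the tables they belong to
SELECT ci.relname AS index_name, ct.relname AS table_name
  FROM pg_index i
  JOIN pg_class ci ON ci.oid = i.indexrelid
  JOIN pg_class ct ON ct.oid = i.indrelid
  JOIN pg_am am   ON am.oid = ci.relam
 WHERE am.amname = 'gin';

-- then, for each one:
ALTER INDEX some_gin_index SET (fastupdate = off);
VACUUM some_table;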

It's curious that set_config_option is so high in the profile... Any
chance you could recompile postgres with -fno-omit-frame-pointer in
CFLAGS? That would allow you to use perf -g. The performance price of
that usually is below 1% for postgres.
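
For a source build that would be roughly (paths and configure options here are just an example):

# rebuild with frame pointers so perf can walk the call stacks
./configure --prefix=/usr/local/pgsql CFLAGS="-O2 -fno-omit-frame-pointer"
make && make install

# then, while a hang is in progress, sample the whole machine with call graphs
perf record -a -g -- sleep 30
perf report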

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Intermittent hangs with 9.2

From
Merlin Moncure
Date:
On Wed, Sep 11, 2013 at 12:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-09-11 07:43:35 -0500, Merlin Moncure wrote:
>> > I've been seeing a strange issue with our Postgres install for about a year
>> > now, and I was hoping someone might be able to help point me at the cause.
>> > At what seem like fairly random intervals Postgres will become unresponsive
>> > to the 3 application nodes it services. These periods tend to last for 10 -
>> > 15 minutes before everything rights itself and the system goes back to
>> > normal.
>> >
>> > During these periods the server will report a spike in the outbound
>> > bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
>> > context switches / interrupts (normal peaks are around 2k/8k respectively,
>> > and during these periods they've gone to 15k/22k), and a load average of
>> > 100+. CPU usage stays relatively low, but it's all system time reported,
>> > user time goes to zero. It doesn't seem to be disk related since we're
>> > running with a shared_buffers setting of 24G, which will fit just about our
>> > entire database into memory, and the IO transactions reported by the server,
>> > as well as the disk reads reported by Postgres stay consistently low.
>> >
>> > We've recently started tracking how long statements take to execute, and
>> > we're seeing some really odd numbers. A simple delete by primary key, for
>> > example, from a table that contains about 280,000 rows, reportedly took
>> > 18h59m46.900s. An update by primary key in that same table was reported as
>> > 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
>> > numbers don't seem reasonable at all.
>> >
>> > Some other changes we've made to postgresql.conf:
>> >
>> > synchronous_commit = off
>> >
>> > maintenance_work_mem = 1GB
>> > wal_level = hot_standby
>> > wal_buffers = 16MB
>> >
>> > max_wal_senders = 10
>> >
>> > wal_keep_segments = 5000
>> >
>> > checkpoint_segments = 128
>> >
>> > checkpoint_timeout = 30min
>> >
>> > checkpoint_completion_target = 0.9
>> >
>> > max_connections = 500
>> >
>> > The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
>> > RAM, running Cent OS 6.3.
>> >
>> > So far we've tried disabling Transparent Huge Pages after I found a number
>> > of resources online that indicated similar interrupt/context switch issues,
>> > but it hasn't resolved the problem. I managed to catch it happening once and
>> > run perf, which showed:
>> >
>> > +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
>> > +   9.55%       10956  postmaster  0x2dc820         f set_config_option
>> > +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
>> > +   5.75%        6609  postmaster  0x5a2b0          f ginHeapTupleFastCollect
>> > +   2.68%        3084  postmaster  0x192483         f build_implied_join_equality
>> > +   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
>> > +   1.86%        2131  postmaster  0x794aa          f get_collation_oid
>> > +   1.56%        1822  postmaster  0x5a67e          f ginHeapTupleFastInsert
>> > +   1.53%        1766  postmaster  0x1929bc         f distribute_qual_to_rels
>> > +   1.33%        1558  postmaster  0x249671         f cmp_numerics
>> >
>> > I'm not sure what 0x347ba9 represents, or why it's an address rather than a
>> > method name.
>
> Try converting it to something more meaningful with "addr2line"; that
> often has more success.
>
>> > That's about the sum of it. Any help would be greatly appreciated and if you
>> > want any more information about our setup, please feel free to ask.
>
>> Reducing shared buffers to around 2gb will probably make the problem go away
>
> That profile doesn't really look like what the problem you are
> referring to would look like.

yup -- I think you're right.

merlin


Re: Intermittent hangs with 9.2

From
David Whittaker
Date:
Hi All,

We lowered shared_buffers to 8G and increased effective_cache_size accordingly.  So far, we haven't seen any issues since the adjustment.  The issues have come and gone in the past, so I'm not convinced it won't crop up again, but I think the best course is to wait a week or so and see how things work out before we make any other changes.
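
For reference, the change amounted to this in postgresql.conf (the effective_cache_size figure is just our rough estimate of what the OS has left to cache with on the 48GB box):

# postgresql.conf after the adjustment (effective_cache_size value is approximate)
shared_buffers = 8GB
effective_cache_size = 36GB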

Thank you all for your help, and if the problem does reoccur, we'll look into the other options suggested, like using a patched postmaster and compiling for perf -g.

Thanks again, I really appreciate the feedback from everyone.

-Dave


On Wed, Sep 11, 2013 at 1:17 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-09-11 07:43:35 -0500, Merlin Moncure wrote:
> > I've been seeing a strange issue with our Postgres install for about a year
> > now, and I was hoping someone might be able to help point me at the cause.
> > At what seem like fairly random intervals Postgres will become unresponsive
> > to the 3 application nodes it services. These periods tend to last for 10 -
> > 15 minutes before everything rights itself and the system goes back to
> > normal.
> >
> > During these periods the server will report a spike in the outbound
> > bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
> > context switches / interrupts (normal peaks are around 2k/8k respectively,
> > and during these periods they've gone to 15k/22k), and a load average of
> > 100+. CPU usage stays relatively low, but it's all system time reported,
> > user time goes to zero. It doesn't seem to be disk related since we're
> > running with a shared_buffers setting of 24G, which will fit just about our
> > entire database into memory, and the IO transactions reported by the server,
> > as well as the disk reads reported by Postgres stay consistently low.
> >
> > We've recently started tracking how long statements take to execute, and
> > we're seeing some really odd numbers. A simple delete by primary key, for
> > example, from a table that contains about 280,000 rows, reportedly took
> > 18h59m46.900s. An update by primary key in that same table was reported as
> > 7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
> > numbers don't seem reasonable at all.
> >
> > Some other changes we've made to postgresql.conf:
> >
> > synchronous_commit = off
> >
> > maintenance_work_mem = 1GB
> > wal_level = hot_standby
> > wal_buffers = 16MB
> >
> > max_wal_senders = 10
> >
> > wal_keep_segments = 5000
> >
> > checkpoint_segments = 128
> >
> > checkpoint_timeout = 30min
> >
> > checkpoint_completion_target = 0.9
> >
> > max_connections = 500
> >
> > The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
> > RAM, running Cent OS 6.3.
> >
> > So far we've tried disabling Transparent Huge Pages after I found a number
> > of resources online that indicated similar interrupt/context switch issues,
> > but it hasn't resolved the problem. I managed to catch it happening once and
> > run perf, which showed:
> >
> > +  41.40%       48154  postmaster  0x347ba9         f 0x347ba9
> > +   9.55%       10956  postmaster  0x2dc820         f set_config_option
> > +   8.64%        9946  postmaster  0x5a3d4          f writeListPage
> > +   5.75%        6609  postmaster  0x5a2b0          f ginHeapTupleFastCollect
> > +   2.68%        3084  postmaster  0x192483         f build_implied_join_equality
> > +   2.61%        2990  postmaster  0x187a55         f build_paths_for_OR
> > +   1.86%        2131  postmaster  0x794aa          f get_collation_oid
> > +   1.56%        1822  postmaster  0x5a67e          f ginHeapTupleFastInsert
> > +   1.53%        1766  postmaster  0x1929bc         f distribute_qual_to_rels
> > +   1.33%        1558  postmaster  0x249671         f cmp_numerics
> >
> > I'm not sure what 0x347ba9 represents, or why it's an address rather than a
> > method name.

Try converting it to something more meaningful with "addr2line"; that
often has more success.

> > That's about the sum of it. Any help would be greatly appreciated and if you
> > want any more information about our setup, please feel free to ask.

> Reducing shared buffers to around 2gb will probably make the problem go away

That profile doesn't really look like what the problem you are
referring to would look like.

Based on the profile I'd guess it's possible that you're seeing problems
with GIN's "fastupdate" mechanism.
Try ALTER INDEX whatever SET (FASTUPDATE = OFF); VACUUM
whatever's_table for all gin indexes.

It's curious that set_config_option is so high in the profile... Any
chance you could recompile postgres with -fno-omit-frame-pointer in
CFLAGS? That would allow you to use perf -g. The performance price of
that usually is below 1% for postgres.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Intermittent hangs with 9.2

From
Merlin Moncure
Date:
On Thu, Sep 12, 2013 at 3:06 PM, David Whittaker <dave@iradix.com> wrote:
> Hi All,
>
> We lowered shared_buffers to 8G and increased effective_cache_size
> accordingly.  So far, we haven't seen any issues since the adjustment.  The
> issues have come and gone in the past, so I'm not convinced it won't crop up
> again, but I think the best course is to wait a week or so and see how
> things work out before we make any other changes.
>
> Thank you all for your help, and if the problem does reoccur, we'll look
> into the other options suggested, like using a patched postmaster and
> compiling for perf -g.
>
> Thanks again, I really appreciate the feedback from everyone.

Interesting -- please respond with a follow up if/when you feel
satisfied the problem has gone away.  Andres was right; I initially
mis-diagnosed the problem (there is another issue I'm chasing that has
a similar performance presentation but originates from a different
area of the code).

That said, if reducing shared_buffers made *your* problem go away as
well, then this is more evidence that we have an underlying contention
mechanism that is somehow influenced by the setting.  Speaking frankly,
under certain workloads we seem to have contention issues in the
general area of the buffer system.  I'm thinking (guessing) that the
problem is that usage_count is getting incremented faster than the buffers
are getting cleared out, which then causes the sweeper to spend
more and more time examining hotly contended buffers.  This may make
no sense in the context of your issue; I haven't looked at the code
yet.  Also, I've been unable to cause this to happen in simulated
testing.  But I'm suspicious (and dollars to doughnuts '0x347ba9' is
spinlock related).

Anyways, thanks for the report and (hopefully) the follow up.

merlin


Re: Intermittent hangs with 9.2

From
David Whittaker
Date:



On Fri, Sep 13, 2013 at 10:52 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, Sep 12, 2013 at 3:06 PM, David Whittaker <dave@iradix.com> wrote:
> Hi All,
>
> We lowered shared_buffers to 8G and increased effective_cache_size
> accordingly.  So far, we haven't seen any issues since the adjustment.  The
> issues have come and gone in the past, so I'm not convinced it won't crop up
> again, but I think the best course is to wait a week or so and see how
> things work out before we make any other changes.
>
> Thank you all for your help, and if the problem does reoccur, we'll look
> into the other options suggested, like using a patched postmaster and
> compiling for perf -g.
>
> Thanks again, I really appreciate the feedback from everyone.

Interesting -- please respond with a follow up if/when you feel
satisfied the problem has gone away.  Andres was right; I initially
mis-diagnosed the problem (there is another issue I'm chasing that has
a similar performance presentation but originates from a different
area of the code).

That said, if reducing shared_buffers made *your* problem go away as
well, then this is more evidence that we have an underlying contention
mechanism that is somehow influenced by the setting.  Speaking frankly,
under certain workloads we seem to have contention issues in the
general area of the buffer system.  I'm thinking (guessing) that the
problem is that usage_count is getting incremented faster than the buffers
are getting cleared out, which then causes the sweeper to spend
more and more time examining hotly contended buffers.  This may make
no sense in the context of your issue; I haven't looked at the code
yet.  Also, I've been unable to cause this to happen in simulated
testing.  But I'm suspicious (and dollars to doughnuts '0x347ba9' is
spinlock related).

Anyways, thanks for the report and (hopefully) the follow up.

merlin

You guys have taken the time to help me through this; following up is the least I can do.  So far we're still looking good.

Re: Intermittent hangs with 9.2

From
David Whittaker
Date:
We haven't seen any issues since we decreased shared_buffers.  We also tuned some of the longer running / more frequently executed queries, so that may have had an effect as well, but my money would be on the shared_buffers change.  If the issue reappears I'll try to capture another perf profile and post back, but if you don't hear from me again you can assume the problem is solved.

Thank you all again for the help.

-Dave

On Fri, Sep 13, 2013 at 11:05 AM, David Whittaker <dave@iradix.com> wrote:



On Fri, Sep 13, 2013 at 10:52 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, Sep 12, 2013 at 3:06 PM, David Whittaker <dave@iradix.com> wrote:
> Hi All,
>
> We lowered shared_buffers to 8G and increased effective_cache_size
> accordingly.  So far, we haven't seen any issues since the adjustment.  The
> issues have come and gone in the past, so I'm not convinced it won't crop up
> again, but I think the best course is to wait a week or so and see how
> things work out before we make any other changes.
>
> Thank you all for your help, and if the problem does reoccur, we'll look
> into the other options suggested, like using a patched postmaster and
> compiling for perf -g.
>
> Thanks again, I really appreciate the feedback from everyone.

Interesting -- please respond with a follow up if/when you feel
satisfied the problem has gone away.  Andres was right; I initially
mis-diagnosed the problem (there is another issue I'm chasing that has
a similar performance presentation but originates from a different
area of the code).

That said, if reducing shared_buffers made *your* problem go away as
well, then this is more evidence that we have an underlying contention
mechanism that is somehow influenced by the setting.  Speaking frankly,
under certain workloads we seem to have contention issues in the
general area of the buffer system.  I'm thinking (guessing) that the
problem is that usage_count is getting incremented faster than the buffers
are getting cleared out, which then causes the sweeper to spend
more and more time examining hotly contended buffers.  This may make
no sense in the context of your issue; I haven't looked at the code
yet.  Also, I've been unable to cause this to happen in simulated
testing.  But I'm suspicious (and dollars to doughnuts '0x347ba9' is
spinlock related).

Anyways, thanks for the report and (hopefully) the follow up.

merlin

You guys have taken the time to help me through this; following up is the least I can do.  So far we're still looking good.