Thread: Poor performance after update from SLES11 SP1 to SP2

Poor performance after update from SLES11 SP1 to SP2

From: Mark Smith
Hardware: IBM X3650 M3 (2 x Xeon X5680 6C 3.33GHz), 96GB RAM. IBM X3524 with RAID 10 ext4 (noatime,nodiratime,data=writeback,barrier=0) volumes for pg_xlog / data / indexes.
 
Software: SLES 11 SP2 3.0.58-0.6.2-default x86_64, PostgreSQL 9.0.4.
max_connections = 1500
shared_buffers = 16GB
work_mem = 64MB
maintenance_work_mem = 256MB
wal_level = archive
synchronous_commit = off
wal_buffers = 16MB
checkpoint_segments = 32
checkpoint_completion_target = 0.9
effective_cache_size = 32GB
Workload: OLTP, typically with 500+ concurrent database connections; the same Linux instance is also used as a web server and application server. Far from ideal, but it has worked well for 15 months.

Problem: We have been running PostgreSQL 9.0.4 on SLES11 SP1, last kernel in use was 2.6.32-43-0.4, performance has always been great. Since updating from SLES11 SP1 to SP2 we now experience many database 'stalls' (e.g. normally 'instant' queries taking many seconds, any query will be slow, just connecting to the database will be slow). We have trialled PostgreSQL 9.2.3 under SLES11 SP2 with the exact same results. During these periods the machine is completely responsive but anything accessing the database is extremely slow.

I have tried increasing sched_migration_cost from 500000 to 5000000 and also tried setting sched_compat_yield to 1; neither of these appeared to make a difference. I don't have the parameter 'sched_autogroup_enabled'. Nothing jumps out from top/iostat/sar/pg_stat_activity, however I am very far from an expert in interpreting their output.
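(For reference, these scheduler tunables live under /proc/sys/kernel and can be changed on the fly, for example:

sysctl -w kernel.sched_migration_cost=5000000

or equivalently 'echo 5000000 > /proc/sys/kernel/sched_migration_cost'.)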
 
We have work underway to reduce our number of connections as although it has always worked ok, perhaps it makes us particularly vulnerable to kernel/scheduler changes.
 
I would be very grateful for any suggestions as to the best way to diagnose the source of this problem, and for any general recommendations.

Re: Poor performance after update from SLES11 SP1 to SP2

From: Vasilis Ventirozos
Hello Mark,
if I were you I would start by running some basic IO benchmarks (bonnie++ maybe?) and by checking sysctl.conf; your PostgreSQL parameters look OK to me.
When these delays occur, have you noticed what is causing them? Output from vmstat while the delays are happening would also help.
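For example, something like this captured while a stall is in progress:

vmstat 1 30

(one-second samples for 30 seconds; the r/b, si/so and wa columns should show whether it is run-queue pressure, swapping or IO wait).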


Vasilis Ventirozos

On Thu, Feb 21, 2013 at 11:59 AM, Mark Smith <smithmark662@gmail.com> wrote:
> Hardware: IBM X3650 M3 (2 x Xeon X5680 6C 3.33GHz), 96GB RAM. IBM X3524 with
> RAID 10 ext4 (noatime,nodiratime,data=writeback,barrier=0) volumes for
> pg_xlog / data / indexes.
>
> Software: SLES 11 SP2 3.0.58-0.6.2-default x86_64, PostgreSQL 9.0.4.
>
> [skipped]

Re: Poor performance after update from SLES11 SP1 to SP2

From: Scott Marlowe
There's another thread in the list started by Josh Berkus about poor
performance with a similar kernel.  I'd look that thread up and see if
you and he have enough in common to work together on this.

On Thu, Feb 21, 2013 at 2:59 AM, Mark Smith <smithmark662@gmail.com> wrote:
> Hardware: IBM X3650 M3 (2 x Xeon X5680 6C 3.33GHz), 96GB RAM. IBM X3524 with
> RAID 10 ext4 (noatime,nodiratime,data=writeback,barrier=0) volumes for
> pg_xlog / data / indexes.
>
> Software: SLES 11 SP2 3.0.58-0.6.2-default x86_64, PostgreSQL 9.0.4.


Re: Poor performance after update from SLES11 SP1 to SP2

From: Sergey Konoplev
On Thu, Feb 21, 2013 at 1:59 AM, Mark Smith <smithmark662@gmail.com> wrote:
> Software: SLES 11 SP2 3.0.58-0.6.2-default x86_64, PostgreSQL 9.0.4.

[skipped]

> Problem: We have been running PostgreSQL 9.0.4 on SLES11 SP1, last kernel in
> use was 2.6.32-43-0.4, performance has always been great. Since updating
> from SLES11 SP1 to SP2 we now experience many database 'stalls' (e.g.
> normally 'instant' queries taking many seconds, any query will be slow, just
> connecting to the database will be slow).

It reminds me of a transparent huge pages (THP) defragmentation issue
that was found in recent kernels.

THP defragmentation can lead to unpredictable database stalls on some
Linux kernels. The recommended settings for this are below.

db1: ~ # echo always > /sys/kernel/mm/transparent_hugepage/enabled
db1: ~ # echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
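
The current mode can be checked by reading the same files back; the
active value is the one shown in square brackets:

db1: ~ # cat /sys/kernel/mm/transparent_hugepage/enabled
db1: ~ # cat /sys/kernel/mm/transparent_hugepage/defrag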

I am collecting recommendations for DB server configuration at the
link below. Take a look at it as well if the above doesn't help.

http://code.google.com/p/pgcookbook/wiki/Database_Server_Configuration

> We have trialled PostgreSQL 9.2.3
> under SLES11 SP2 with the exact same results. During these periods the
> machine is completely responsive but anything accessing the database is
> extremely slow.
>
> I have tried increasing sched_migration_cost from 500000 to 5000000 and also
> tried setting sched_compat_yield to 1, neither of these appeared to make a
> difference. I don't have the parameter 'sched_autogroup_enabled'. Nothing
> jumps out from top/iostat/sar/pg_stat_activity however I am very far from
> expert in interpreting their output
>
> We have work underway to reduce our number of connections as although it has
> always worked ok, perhaps it makes us particularly vulnerable to
> kernel/scheduler changes.
>
> I would be very grateful for any suggestions as to the best way to diagnose
> the source of this problem and/or general recommendations?



--
Sergey Konoplev
Database and Software Architect
http://www.linkedin.com/in/grayhemp

Phones:
USA +1 415 867 9984
Russia, Moscow +7 901 903 0499
Russia, Krasnodar +7 988 888 1979

Skype: gray-hemp
Jabber: gray.ru@gmail.com


Re: Poor performance after update from SLES11 SP1 to SP2

From: Mark Smith
On 21 February 2013 16:23, Sergey Konoplev <gray.ru@gmail.com> wrote:
> On Thu, Feb 21, 2013 at 1:59 AM, Mark Smith <smithmark662@gmail.com> wrote:
>> Software: SLES 11 SP2 3.0.58-0.6.2-default x86_64, PostgreSQL 9.0.4.
>
> [skipped]
>
>> Problem: We have been running PostgreSQL 9.0.4 on SLES11 SP1, last kernel in
>> use was 2.6.32-43-0.4, performance has always been great. Since updating
>> from SLES11 SP1 to SP2 we now experience many database 'stalls' (e.g.
>> normally 'instant' queries taking many seconds, any query will be slow, just
>> connecting to the database will be slow).
>
> It reminds me of a transparent huge pages (THP) defragmentation issue
> that was found in recent kernels.
>
> THP defragmentation can lead to unpredictable database stalls on some
> Linux kernels. The recommended settings for this are below.
>
> db1: ~ # echo always > /sys/kernel/mm/transparent_hugepage/enabled
> db1: ~ # echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
>
> [skipped]
 
Sergey - your suggestion to look at transparent huge pages (THP) has resolved the issue for us, thank you so much. We had noticed abnormally high system CPU usage but didn't get much beyond that in our analysis. 
 
We disabled THP altogether and it was quite simply as if we had turned the 'poor performance' tap off. Since then we have had no slow queries or stalls at all, and system CPU is consistently very low. We changed many things whilst trying to resolve this issue, but the THP change was done in isolation, so we can be confident that in our environment leaving THP enabled with the default settings is a killer.
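(For anyone finding this thread later: disabling THP altogether amounts to something like

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

with /etc/init.d/boot.local being a convenient place on SLES to make it persist across reboots.)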
 
At a later point we will experiment with enabling THP with the recommended madvise defrag setting.
 
Thank you to all who responded.
 
Mark