Thread: High CPU usage / load average after upgrading to Ubuntu 12.04
Hello,
We upgraded from Ubuntu 11.04 to Ubuntu 12.04 and almost immediately observed increased CPU usage and a significantly higher load average on our database server.
At the time we were on Postgres 9.0.5. We decided to upgrade to Postgres 9.2 to see if that would resolve the issue, but unfortunately it did not.
Just for illustration purposes, below are a few links to CPU and load graphs pre- and post-upgrade.
https://s3.amazonaws.com/iqtell.ops/Load+Average+Post+Upgrade.png
https://s3.amazonaws.com/iqtell.ops/Load+Average+Pre+Upgrade.png
https://s3.amazonaws.com/iqtell.ops/Server+CPU+Post+Upgrade.png
https://s3.amazonaws.com/iqtell.ops/Server+CPU+Pre+Upgrade.png
We also tried tweaking kernel parameters as mentioned here - http://www.postgresql.org/message-id/50E4AAB1.9040902@optionshouse.com, but have not seen any improvement.
Any advice on how to trace what could be causing the change in CPU usage and load average is appreciated.
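One way to start tracing this is to see whether the extra load is user time (Postgres itself) or system time (the kernel: scheduler, page cache, writeback). A minimal sketch, assuming a Linux box; the `cpu_pct` helper name is made up for illustration:

```shell
# Compute user vs. system CPU percentages from two "cpu ..." lines
# sampled from /proc/stat a few seconds apart.
# /proc/stat layout: cpu user nice system idle iowait irq softirq ...
cpu_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR==1 { u1=$2; s1=$4; t1=$2+$3+$4+$5+$6+$7+$8 }
    NR==2 { u2=$2; s2=$4; t2=$2+$3+$4+$5+$6+$7+$8
            printf "user=%.0f%% system=%.0f%%\n",
                   100*(u2-u1)/(t2-t1), 100*(s2-s1)/(t2-t1) }'
}

# On the live server:
#   s1=$(grep '^cpu ' /proc/stat); sleep 5; s2=$(grep '^cpu ' /proc/stat)
#   cpu_pct "$s1" "$s2"
# A high system share relative to user time points at the kernel
# rather than at Postgres itself.
```

The sysstat tools (`sar -u`, `pidstat -u`) report the same breakdown without scripting, if they are installed.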
Our postgres version is:
PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
OS:
Linux ip-10-189-175-25 3.2.0-37-virtual #58-Ubuntu SMP Thu Jan 24 15:48:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Hardware (this is an Amazon EC2 High-Memory Quadruple Extra Large instance):
8 core Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
68 GB RAM
RAID10 with 8 drives using XFS
Drives are EBS with provisioned IOPS, 1000 IOPS each
Postgres Configuration:
archive_command = rsync -a %p slave:/var/lib/postgresql/replication_load/%f
archive_mode = on
checkpoint_completion_target = 0.9
checkpoint_segments = 64
checkpoint_timeout = 30min
default_text_search_config = pg_catalog.english
external_pid_file = /var/run/postgresql/9.2-main.pid
lc_messages = en_US.UTF-8
lc_monetary = en_US.UTF-8
lc_numeric = en_US.UTF-8
lc_time = en_US.UTF-8
listen_addresses = *
log_checkpoints = on
log_destination = stderr
log_line_prefix = %t [%p]: [%l-1]
log_min_duration_statement = 500
max_connections = 300
max_stack_depth = 2MB
max_wal_senders = 5
shared_buffers = 4GB
synchronous_commit = off
unix_socket_directory = /var/run/postgresql
wal_keep_segments = 128
wal_level = hot_standby
work_mem = 8MB
Thanks,
Dan
Thanks for the reply. We are still using postgresql-9.0-801.jdbc4.jar. It seemed to us that this is more related to the OS than the JDBC driver version, as we had the issue before we upgraded to 9.2.
It might still be worth a try.
Just out of curiosity, has anyone else experienced performance issues (or even tried) with the 9.0 JDBC driver against a 9.2 server?
Dan
From: Eric Haertel [mailto:eric.haertel@groupon.com]
Sent: Tuesday, February 12, 2013 12:52 PM
To: Dan Kogan
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04
I don't know if it helps, but after updating from 8.4 to 9.1 I had extreme problems with my local tests until I changed the JDBC driver to the proper version. I'm not sure whether the load occurred on the client or the server side, as the local integration tests ran on my machine.
--
Eric Härtel
Senior Software Developer
Tel.: +49 (0) 30 240 20 40 35
Mobil: +49 (0) 174 43 38 614
Email: eric.haertel@groupon.com
Groupon GmbH & Co. Service KG | Oberwallstraße 6 | 10117 Berlin
General partner: Groupon Verwaltungs GmbH, HRB 131594 B
Managing directors: Mark S. Hoyt | Bradley Downes | Daniel Köllner
Registered with the district court of Charlottenburg, Berlin, HRA 45265 B | VAT ID No. DE 279 803 459
Hi Will,
Yes, I think we’ve seen some discussions on that. Our servers are hosted on Amazon EC2 and upgrading the kernel does not seem so straightforward.
We did a benchmark using pgbench on 3.5 vs 3.2 and saw an improvement. Unfortunately our production server would not boot off 3.5, so we had to revert to 3.2.
At this point we are contemplating whether it’s better to go back to 11.04 or upgrade to 12.10 (which comes with kernel version 3.5).
Any thoughts on that would be appreciated.
Dan
From: Will Ferguson [mailto:WFerguson@northplains.com]
Sent: Tuesday, February 12, 2013 5:20 PM
To: Dan Kogan; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04
Hey Dan,
If I recall correctly there were some discussions on here related to performance issues with the 3.2 kernel. I'm away at the moment so can't dig them out, but there has been much discussion lately about kernel performance problems in 3.2 which don't seem to be present in 3.4. I'll see if I can find them when I'm next at my desk.
Will
Sent from Samsung Mobile
On 02/12/2013 05:28 PM, Dan Kogan wrote:
> At this point we are contemplating whether it's better to go back to 11.04 or upgrade to 12.10 (which comes with kernel version 3.5).
> Any thoughts on that would be appreciated.

I have a machine running the same version of Ubuntu. I'll run some tests and tell you what I find.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 02/13/2013 11:24 AM, Josh Berkus wrote:
> I have a machine running the same version of Ubuntu. I'll run some tests and tell you what I find.

So I'm running a pgbench. However, I don't really have anything to compare the stats I'm seeing against. CPU usage and load average were high (load 7.9), but that was on -j 8 -c 32, with a TPS of 8500. What numbers are you seeing, exactly?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Just to be clear - I was describing the current situation in our production.

We were running pgbench on different Ubuntu versions today. I don’t have a 12.04 setup at the moment, but I do have 12.10, which seems to be performing about the same as 12.04 in our tests with pgbench.
Running pgbench with 8 jobs and 32 clients resulted in a load average of about 15 and TPS was 51350.

Question - how many cores does your server have? Ours has 8 cores.

Thanks,
Dan

> What numbers are you seeing, exactly?
On Tue, Feb 12, 2013 at 11:25 AM, Dan Kogan <dan@iqtell.com> wrote:
> Any advice on how to trace what could be causing the change in CPU usage and load average is appreciated.

Does your application have a lot of concurrency? History has shown that Postgres is highly sensitive to changes in the OS scheduler (which changes a lot from release to release).

Also check this: zone reclaim (http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)

merlin
Thanks for the info. Our application does have a lot of concurrency.
We checked the zone reclaim parameter and it is turned off (that was the default, we did not have to change it).

Dan

> does your application have a lot of concurrency? ... also check this: zone reclaim
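For anyone following along, the zone reclaim setting discussed above can be checked and pinned off with a sysctl fragment like this (a sketch; 0 means off):

```
# Check the current value (0 = off):
#   cat /proc/sys/vm/zone_reclaim_mode
# Pin it off across reboots in /etc/sysctl.conf:
vm.zone_reclaim_mode = 0
```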
On 02/13/2013 05:30 PM, Dan Kogan wrote:
> Running pgbench with 8 jobs and 32 clients resulted in a load average of about 15 and TPS was 51350.

What size database?

> Question - how many cores does your server have? Ours has 8 cores.

32

I suppose I could throw multiple pgbenches at it. I just don't see the load numbers as unusual, but I don't have a similar pre-12.04 server to compare with.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
We used a scale factor of 3600.

Yeah, maybe other people see similar load averages, we were not sure. However, we saw a clear difference right after the upgrade. We are trying to determine whether it makes sense for us to go back to 11.04, or whether there is something here we are missing.
On 02/14/2013 12:41 PM, Dan Kogan wrote:
> However, we saw a clear difference right after the upgrade.

Well, I'm seeing a higher system % on CPU than I expect (around 15% on each core), and a MUCH higher context-switch rate than I expect (up to 500K). Is that anything like you're seeing?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Yes, we are seeing a higher system % on the CPU; not sure how to quantify it in terms of % right now - will check into that tomorrow.
We were not checking the context switch numbers during our benchmark; will check that tomorrow as well.
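For checking the context-switch numbers, the kernel keeps a cumulative counter on the `ctxt` line of /proc/stat; a small sketch (the `cs_rate` helper is made up for illustration) turns two samples into a per-second rate:

```shell
# Context switches per second from two readings of the cumulative
# "ctxt" counter in /proc/stat, taken dt seconds apart.
cs_rate() {
  awk -v c1="$1" -v c2="$2" -v dt="$3" 'BEGIN { printf "%d\n", (c2 - c1) / dt }'
}

# On the live server:
#   c1=$(awk '/^ctxt/ {print $2}' /proc/stat); sleep 5
#   c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
#   cs_rate "$c1" "$c2" 5
```

The "cs" column of `vmstat 5` or `sar -w` reports the same figure directly.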
If you run your benchmarks for more than a few minutes I highly recommend enabling sysstat service data collection; then you can look at it after the fact with sar. VERY useful stuff, both for benchmarking and post mortem on live servers.

--
To understand recursion, one must first understand recursion.
On 02/14/2013 08:47 PM, Scott Marlowe wrote:
> If you run your benchmarks for more than a few minutes I highly
> recommend enabling sysstat service data collection, then you can look
> at it after the fact with sar.

Well, background sar, by default on Linux, only collects every 30min. For a benchmark run, you want to generate your own sar file, for example:

sar -o hddrun2.sar -A 10 90 &

which says "collect all stats every 10 seconds and write them to the file hddrun2.sar for 15 minutes"

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Feb 15, 2013 at 11:26 AM, Josh Berkus <josh@agliodbs.com> wrote:
> Well, background sar, by default on Linux, only collects every 30min.

On all my machines (debian and ubuntu) it collects every 5.

> sar -o hddrun2.sar -A 10 90 &

Not a bad idea, esp. when benchmarking.
So, our drop in performance is now clearly due to pathological OS behavior during checkpoints. Still trying to pin down what's going on, but it's not system load; it's clearly related to the IO system.

Anyone else see this? I'm getting it both on 3.2 and 3.4. We're using LSI Megaraid.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Scott,

> So do you have generally slow IO, or is it fsync behavior etc?

All tests except pgbench show this system as superfast. Bonnie++ and dd tests are good (200 to 300 MB/s), and test_fsync shows 14K/second. Basically it has no issues until checkpoint kicks in, at which time the entire system basically halts for the duration of the checkpoint.

For that matter, if I run a pgbench and halt it just before checkpoint kicks in, I get around 12000 TPS, which is what I'd expect on this system.

At this point, we've tried 3.2.0.26, 3.2.0.27, 3.4.0, and tried updating the RAID driver, and changing the IO scheduler. Nothing seems to affect the behavior. Testing using Ext4 (instead of XFS) next.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, Feb 18, 2013 at 6:39 PM, Josh Berkus <josh@agliodbs.com> wrote:
> At this point, we've tried 3.2.0.26, 3.2.0.27, 3.4.0, and tried updating
> the RAID driver, and changing the IO scheduler. Nothing seems to affect
> the behavior. Testing using Ext4 (instead of XFS) next.

Did you try turning barriers on or off *manually* (explicitly)? With LSI and barriers *on* and ext4 I had less-optimal performance. With Linux MD or (some) 3Ware configurations I had no performance hit.

--
Jon
> Did you try turning barriers on or off *manually* (explicitly)?

They're off in fstab:

/dev/sdd1 on /data type xfs (rw,noatime,nodiratime,nobarrier)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Mon, Feb 18, 2013 at 5:39 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Basically it has no issues until checkpoint kicks in, at which time the
> entire system basically halts for the duration of the checkpoint.

I assume you've made attempts at write levelling to reduce the impact of checkpoints etc.
On 19/02/13 13:39, Josh Berkus wrote:
> At this point, we've tried 3.2.0.26, 3.2.0.27, 3.4.0, and tried updating
> the RAID driver, and changing the IO scheduler. Nothing seems to affect
> the behavior. Testing using Ext4 (instead of XFS) next.

Might be worth looking at your vm.dirty_ratio, vm.dirty_background_ratio and friends settings. We managed to choke up a system with 16x SSD by leaving them at their defaults...

Cheers

Mark
On 02/18/2013 08:28 PM, Mark Kirkwood wrote:
> Might be worth looking at your vm.dirty_ratio, vm.dirty_background_ratio
> and friends settings. We managed to choke up a system with 16x SSD by
> leaving them at their defaults...

Yeah? Any settings you'd recommend specifically? What did you use on the SSD system?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 20/02/13 06:51, Josh Berkus wrote:
> Yeah? Any settings you'd recommend specifically? What did you use on
> the SSD system?

We set:

vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 1073741824
vm.dirty_ratio = 0
vm.dirty_bytes = 2147483648

i.e. 1G for dirty_background and 2G for dirty. We didn't spend much time afterwards fiddling with the sizes. I'm guessing we could have made them bigger - however the SSDs were happier constantly writing a few G than being handed (say) 50G of buffers to write at once. The system has 512G of RAM and 32 cores (no hyperthreading).

regards

Mark
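Mark's values map onto an /etc/sysctl.conf fragment like the following (a sketch; load it with `sysctl -p` after editing):

```
# Cap the dirty page cache so writeback happens in small, steady batches
# instead of one huge flush at checkpoint time.
vm.dirty_background_ratio = 0            # disable the ratio-based knob
vm.dirty_background_bytes = 1073741824   # start background writeback at 1G
vm.dirty_ratio = 0
vm.dirty_bytes = 2147483648              # block writers once 2G is dirty
```

Note the *_bytes and *_ratio knobs are mutually exclusive: setting one zeroes the other, which is why the ratios are explicitly set to 0 here.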
On 02/19/2013 09:51 AM, Josh Berkus wrote:
> On 02/18/2013 08:28 PM, Mark Kirkwood wrote:
>> Might be worth looking at your vm.dirty_ratio, vm.dirty_background_ratio
>> and friends settings. We managed to choke up a system with 16x SSD by
>> leaving them at their defaults...
>
> Yeah? Any settings you'd recommend specifically? What did you use on
> the SSD system?

NM, I tested lowering dirty_background_ratio, and it didn't help,
because checkpoints are kicking in before pdflush ever gets there.

So the issue seems to be that if you have this combination of factors:

1. large RAM
2. many/fast CPUs
3. a database which fits in RAM but is larger than the RAID controller's
   WB cache
4. pg_xlog on the same volume as pgdata

... then you'll see checkpoint "stalls", and spread checkpoints will
actually make them worse by making the stalls longer.

Moving pg_xlog to a separate partition makes this better. Making
bgwriter more aggressive helps a bit more on top of that.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
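[Josh's "move pg_xlog to a separate partition" advice usually boils down to a stop-move-symlink sequence. The sketch below demonstrates the pattern on throwaway directories so it can be run safely anywhere; the commented lines show the real-world equivalent, where the paths are assumptions (a 9.2 install under /var/lib/postgresql) and the server must be stopped first:]

```shell
# Demonstrate the move-and-symlink pattern on scratch directories.
demo=$(mktemp -d)
mkdir -p "$demo/pgdata/pg_xlog" "$demo/fastdisk"

# Move the WAL directory to the (stand-in for a) separate volume,
# then leave a symlink behind where PostgreSQL expects it.
mv "$demo/pgdata/pg_xlog" "$demo/fastdisk/pg_xlog"
ln -s "$demo/fastdisk/pg_xlog" "$demo/pgdata/pg_xlog"

# Real-world equivalent (hypothetical paths, server stopped):
#   sudo service postgresql stop
#   sudo mv /var/lib/postgresql/9.2/main/pg_xlog /wal_disk/pg_xlog
#   sudo ln -s /wal_disk/pg_xlog /var/lib/postgresql/9.2/main/pg_xlog
#   sudo service postgresql start
```

The "more aggressive bgwriter" half of the advice lives in postgresql.conf: the relevant 9.2 knobs are bgwriter_delay, bgwriter_lru_maxpages, and bgwriter_lru_multiplier.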
On 20/02/13 12:24, Josh Berkus wrote:
> NM, I tested lowering dirty_background_ratio, and it didn't help,
> because checkpoints are kicking in before pdflush ever gets there.
>
> So the issue seems to be that if you have this combination of factors:
>
> 1. large RAM
> 2. many/fast CPUs
> 3. a database which fits in RAM but is larger than the RAID controller's
>    WB cache
> 4. pg_xlog on the same volume as pgdata
>
> ... then you'll see checkpoint "stalls" and spread checkpoint will
> actually make them worse by making the stalls longer.
>
> Moving pg_xlog to a separate partition makes this better. Making
> bgwriter more aggressive helps a bit more on top of that.

We have pg_xlog on a pair of PCIe SSDs. Also, we are running the
deadline IO scheduler.

Regards

Mark
On Tue, Feb 19, 2013 at 4:24 PM, Josh Berkus <josh@agliodbs.com> wrote:
> ... then you'll see checkpoint "stalls" and spread checkpoint will
> actually make them worse by making the stalls longer.

Wait, if they're spread enough then there won't be a checkpoint, so to
speak. Are you saying that spreading them out means that they still
kind of pile up, even with, say, a completion target of 1.0 etc.?
On 02/19/2013 07:15 PM, Scott Marlowe wrote:
> On Tue, Feb 19, 2013 at 4:24 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> ... then you'll see checkpoint "stalls" and spread checkpoint will
>> actually make them worse by making the stalls longer.
>
> Wait, if they're spread enough then there won't be a checkpoint, so to
> speak. Are you saying that spreading them out means that they still
> kind of pile up, even with say a completion target of 1.0 etc?

I'm saying that spreading them makes things worse, because they get
intermixed with the fsyncs for the WAL and cause commits to stall. I
tried setting checkpoint_completion_target = 0.0 and throughput got
about 10% better.

I'm beginning to think that checkpoint_completion_target should be 0.0
by default.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
> Sounds to me like your IO system is stalling on fsyncs or something
> like that. On machines with plenty of IO, cranking up completion
> target usually smooths things out.

It certainly seems like it does. However, I can't demonstrate the issue
using any simpler tool than pgbench ... even running four test_fsyncs in
parallel didn't show any issues, nor do standard FS testing tools.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 02/20/13 19:14, Josh Berkus wrote:
>> Sounds to me like your IO system is stalling on fsyncs or something
>> like that. On machines with plenty of IO, cranking up completion
>> target usually smooths things out.
>
> It certainly seems like it does. However, I can't demonstrate the issue
> using any simpler tool than pgbench ... even running four test_fsyncs in
> parallel didn't show any issues, nor do standard FS testing tools.

We were really starting to think that the system had an IO problem that
we couldn't tickle with any synthetic tools. Then one of our other
customers who upgraded to Ubuntu 12.04 LTS and is also experiencing
issues came across the following LKML thread regarding pdflush on 3.0+
kernels:

https://lkml.org/lkml/2012/10/9/210

So, I went and built a couple of custom kernels with this patch removed:

https://patchwork.kernel.org/patch/825212/

and the bad behavior stopped. Best performance was with a 3.5 kernel
with the patch removed.

--
Jeff Frost <jeff@pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/
On Fri, Feb 15, 2013 at 10:52 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Fri, Feb 15, 2013 at 11:26 AM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 02/14/2013 08:47 PM, Scott Marlowe wrote:
>>> If you run your benchmarks for more than a few minutes I highly
>>> recommend enabling sysstat service data collection, then you can look
>>> at it after the fact with sar. VERY useful stuff, both for
>>> benchmarking and post mortem on live servers.
>>
>> Well, background sar, by default on Linux, only collects every 30min.
>> For a benchmark run, you want to generate your own sar file, for example:
>
> On all my machines (debian and ubuntu) it collects every 5.

All of mine were 10, but once I figured out to edit /etc/cron.d/sysstat
they are now every 1 minute.

sar has some remarkably opaque documentation, but I'm glad I tracked
that down.

Cheers,

Jeff
On Tue, Feb 26, 2013 at 2:30 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Feb 15, 2013 at 10:52 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> On Fri, Feb 15, 2013 at 11:26 AM, Josh Berkus <josh@agliodbs.com> wrote:
>>> On 02/14/2013 08:47 PM, Scott Marlowe wrote:
>>>> If you run your benchmarks for more than a few minutes I highly
>>>> recommend enabling sysstat service data collection, then you can look
>>>> at it after the fact with sar. VERY useful stuff, both for
>>>> benchmarking and post mortem on live servers.
>>>
>>> Well, background sar, by default on Linux, only collects every 30min.
>>> For a benchmark run, you want to generate your own sar file, for example:
>>
>> On all my machines (debian and ubuntu) it collects every 5.
>
> All of mine were 10, but once I figured out to edit /etc/cron.d/sysstat
> they are now every 1 minute.

Oh yeah, it's every 10 on the 5s. I too need to go to 1-minute intervals.

> sar has some remarkably opaque documentation, but I'm glad I tracked
> that down.

It's so incredibly useful. When a machine is acting up, often getting it
back online is more important than fixing it right then, and most of the
system state stuff is lost on reboot / fix.
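[The interval being discussed lives in the cron schedule on Debian/Ubuntu. A sketch of the 1-minute change Jeff describes, plus how to read the collected data back with sar afterwards; the exact cron line and log path vary by release, so take these as illustrative:]

```shell
# /etc/cron.d/sysstat on Debian/Ubuntu runs the collector via cron.
# The stock entry fires every 10 minutes (at :05, :15, :25, ...):
#   5-55/10 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
#
# For 1-minute sampling, change the schedule field to every minute:
#   * * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1
#
# After a benchmark run, replay the day's samples from the saved file
# (on Debian/Ubuntu the files are /var/log/sysstat/saDD, DD = day of month):
#   sar -u -f /var/log/sysstat/sa15    # CPU utilization
#   sar -q -f /var/log/sysstat/sa15    # load average / run queue
#   sar -b -f /var/log/sysstat/sa15    # IO transfer rates
```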
On Wed, Feb 20, 2013 at 3:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 02/19/2013 07:15 PM, Scott Marlowe wrote:
>> On Tue, Feb 19, 2013 at 4:24 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> ... then you'll see checkpoint "stalls" and spread checkpoint will
>>> actually make them worse by making the stalls longer.
>>
>> Wait, if they're spread enough then there won't be a checkpoint, so to
>> speak. Are you saying that spreading them out means that they still
>> kind of pile up, even with say a completion target of 1.0 etc?
>
> I'm saying that spreading them makes things worse, because they get
> intermixed with the fsyncs for the WAL and causes commits to stall. I
> tried setting checkpoint_completion_target = 0.0 and throughput got
> about 10% better.

Sounds to me like your IO system is stalling on fsyncs or something
like that. On machines with plenty of IO, cranking up completion
target usually smooths things out.

I've got some new big servers coming in at work over the next few
months, so I'm gonna test and compare Ubuntu 10.04 and 12.04 and see if
I can see this behaviour. We have a 12.04 machine in production, but
honestly it's not working very hard right now. And since it's in
production, I can't benchmark it without causing problems.
> From: Josh Berkus <josh@agliodbs.com>
> To: Scott Marlowe <scott.marlowe@gmail.com>
> Cc: pgsql-performance@postgresql.org
> Sent: Thursday, 21 February 2013, 3:14
> Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04
>
>> Sounds to me like your IO system is stalling on fsyncs or something
>> like that. On machines with plenty of IO, cranking up completion
>> target usually smooths things out.
>
> It certainly seems like it does. However, I can't demonstrate the issue
> using any simpler tool than pgbench ... even running four test_fsyncs in
> parallel didn't show any issues, nor do standard FS testing tools.

I've missed a load of this thread and just scanned through what I can
see, so apologies if I'm repeating anything.

If the suspicion is the IO system and you've tuned everything you can
think of: is there anything interesting in meminfo/iostat/vmstat before
and during the stalls? If so, can you cause anything similar via
bonnie++ with the "-b" option?
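[A quick way to capture the meminfo/iostat/vmstat numbers being asked about while reproducing a stall. This is a generic sketch: the duration, 1-second interval, and /tmp output paths are arbitrary choices, and iostat is only run if the sysstat package provides it:]

```shell
# Capture system state at 1-second resolution while reproducing the
# stall (run pgbench against the server in another terminal meanwhile).
DURATION=10

# vmstat shows run queue, blocked processes, and block IO per second.
command -v vmstat >/dev/null && vmstat 1 "$DURATION" > /tmp/vmstat.log 2>&1 &

# iostat -x shows per-device utilization and await times.
command -v iostat >/dev/null && iostat -x 1 "$DURATION" > /tmp/iostat.log 2>&1 &

# The Dirty/Writeback counters in /proc/meminfo show how much data the
# kernel is holding back and how fast it is flushing it.
for i in $(seq 1 "$DURATION"); do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sleep 1
done > /tmp/meminfo.log
wait
```

Comparing a quiet window against a checkpoint window in these logs should make it obvious whether the stall coincides with a spike in Writeback and device await, which is what the writeback-tuning discussion above would predict.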