Thread: Perf decreased although server is better

Perf decreased although server is better

From

Benjamin Toueg

Date:

02 November 2016, 16:27:09

Hi everyone,

I'm facing a peformance decrease after switching to a more performant VPS : http://serverfault.com/questions/812702/posgres-perf-decreased-although-server-is-better

My questions are:

What benchmark should I perform before switching to a new server?
What's your rule of thumb regarding my specific issue? What should be investigated first?

Best Regards,

Benjamin

Re: Perf decreased although server is better

From

Tomas Vondra

Date:

02 November 2016, 17:36:34

On 11/02/2016 02:26 PM, Benjamin Toueg wrote:
> Hi everyone,
>
> I'm facing a peformance decrease after switching to a more performant
> VPS :
> http://serverfault.com/questions/812702/posgres-perf-decreased-although-server-is-better
>

Well, changing so many things at once (CPU, RAM, storage, Ubuntu
version, probably kernel version, PostgreSQL version) is a bad idea,
exactly because it makes investigating regressions more complicated.

> My questions are:
>
>  1. What benchmark should I perform before switching to a new server?

Three types of benchmarks, in this order:

1) system-level benchmarks to test various resources (fio to test disks,
etc.)

2) general-purpose PostgreSQL benchmarks (regular pgbench, ...)

3) application-specific benchmarks, or at least pgbench with templates
that match your workload somewhat better than the default one

Start with (1), compare results between machines, if it's OK start with
(2) and so on.

>  2. What's your rule of thumb regarding my specific issue? What should
>     be investigated first?
>

There's a bottleneck somewhere. You need to identify which resource is
it and why, until then it's just wild guessing. Try measuring how long
the requests take at different points - at the app server, at the
database, etc. That will tell you whether it's a database issue, a
network issue etc. If the queries take longer on the database, use
something like perf to profile the system.

ISTM it's not a disk issue (at least the chart shows minimum usage). But
you're doing ~400tps, returning ~5M rows per second.

Also, if it turns out to be a database issue, more info about config and
data set would be useful.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Perf decreased although server is better

From

Kevin Grittner

Date:

02 November 2016, 17:38:50

On Wed, Nov 2, 2016 at 8:26 AM, Benjamin Toueg <btoueg@gmail.com> wrote:

> I'm facing a peformance decrease after switching to a more performant VPS :

In my world, the VPS that performs worse is not considered "more
performant", no matter what the sales materials say.

> What benchmark should I perform before switching to a new server?

Personally, I always like to run bonnie++ and the STREAM RAM test
before even installing the database software.

> What's your rule of thumb regarding my specific issue? What should be
> investigated first?

I would start by comparing the results of both of the
above-mentioned tests for the two environments.

Just one observation, based on the limited data -- a higher network
latency between the client and the database might explain what
you've presented.  I would check that, too.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Perf decreased although server is better

From

Rick Otten

Date:

02 November 2016, 17:56:52

How did you migrate from one system to the other?

[ I recently moved a large time series table from 9.5.4 to 9.6.1 using dump and restore. Although it put the BRIN index on the time column back on, it was borked. Reindexing didn't help. I had to switch it to a regular btree index. I think the data wasn't inserted back into the database in time order. Therefore, because it was all over the place, the efficiency gains from the BRIN index were lost. It was probably because I restored it with "-j 8". -- It is possible something you didn't expect when moving introduce new inefficiencies. I also found running pg_repack after the restore helped performance (and storage size) on my new system too. ]

Do you have all of the kernel settings configured per best practices? Sometimes it is easy to forget them when you get totally focused on just moving the data.

(Things such as your hugepages settings)

With 9.6 you can enable parallel queries. Of course you wouldn't be comparing apples-to-apples then, but take advantage of that feature if you can.

On Wed, Nov 2, 2016 at 9:26 AM, Benjamin Toueg <btoueg@gmail.com> wrote:

Hi everyone,

I'm facing a peformance decrease after switching to a more performant VPS : http://serverfault.com/questions/812702/posgres-perf-decreased-although-server-is-better

My questions are:
What benchmark should I perform before switching to a new server?
What's your rule of thumb regarding my specific issue? What should be investigated first?

Best Regards,

Benjamin

Re: Perf decreased although server is better

From

Benjamin Toueg

Date:

03 November 2016, 17:51:48

Stream gives substantially better results with the new server (before/after)

I've run "bonnie++ -u postgres -d /tmp/ -s 4096M -r 1096" on both machines. I don't know how to read bonnie++ results (before/after) but it looks quite the same, sometimes better for the new, sometimes better for the old.

N.B. tests have been performed while on production on the new server whereas the old it's not solicited anymore.

I used pg_dump 9.6.1 to dump from Postgres 9.4.5 and restored with pg_restore 9.6.1 with "-j 4" option.

The database has many tables and indexes (50+). I haven't isolated which queries performs worse, it's just a general effect.

The loss in terms of response time seems now really consistent, look how the yellow suddenly grew by 100%:

I haven't tried `pg_repack` yet.

As a user, is it expected that restoring could break index efficiencies when moving from one version to another?

I feel really bad now because I'm not confident re-switching back to the old server will restore initial performance :(

2016-11-02 15:55 GMT+01:00 Rick Otten <rottenwindfish@gmail.com>:

How did you migrate from one system to the other?

[ I recently moved a large time series table from 9.5.4 to 9.6.1 using dump and restore. Although it put the BRIN index on the time column back on, it was borked. Reindexing didn't help. I had to switch it to a regular btree index. I think the data wasn't inserted back into the database in time order. Therefore, because it was all over the place, the efficiency gains from the BRIN index were lost. It was probably because I restored it with "-j 8". -- It is possible something you didn't expect when moving introduce new inefficiencies. I also found running pg_repack after the restore helped performance (and storage size) on my new system too. ]

Do you have all of the kernel settings configured per best practices? Sometimes it is easy to forget them when you get totally focused on just moving the data.
(Things such as your hugepages settings)

With 9.6 you can enable parallel queries. Of course you wouldn't be comparing apples-to-apples then, but take advantage of that feature if you can.

On Wed, Nov 2, 2016 at 9:26 AM, Benjamin Toueg <btoueg@gmail.com> wrote:
Hi everyone,

I'm facing a peformance decrease after switching to a more performant VPS : http://serverfault.com/questions/812702/posgres-perf-decreased-although-server-is-better

My questions are:
What benchmark should I perform before switching to a new server?
What's your rule of thumb regarding my specific issue? What should be investigated first?

Best Regards,

Benjamin

Attachment

image.png

Re: Perf decreased although server is better

From

Kevin Grittner

Date:

04 November 2016, 03:05:35

On Thu, Nov 3, 2016 at 9:51 AM, Benjamin Toueg <btoueg@gmail.com> wrote:
>
> Stream gives substantially better results with the new server (before/after)

Yep, the new server can access RAM at about twice the speed of the old.

> I've run "bonnie++ -u postgres -d /tmp/ -s 4096M -r 1096" on both
> machines. I don't know how to read bonnie++ results (before/after)
> but it looks quite the same, sometimes better for the new,
> sometimes better for the old.

On most metrics the new machine looks better, but there are a few
things that look potentially problematic with the new machine: the
new machine uses about 1.57x the CPU time of the old per block
written sequentially ((41 / 143557) / (16 / 87991)); so if the box
becomes CPU starved, you might notice writes getting slower than on
the new box.  Also, several of the latency numbers are worse -- in
some cases far worse.  If I'm understanding that properly, it
suggests that while total throughput from a number of connections
may be better on the new machine, a single connection may not run
the same query as quickly.  That probably makes the new machine
better for handling an OLTP workload from many concurrent clients,
but perhaps not as good at cranking out a single big report or
running dump/restore.

Yes, it is quite possible that the new machine could be faster at
some things and slower at others.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Perf decreased although server is better

From

Benjamin Toueg

Date:

04 November 2016, 14:54:16

I've noticed a network latency increase. Ping between web server and database : 0.6 ms avg before, 5.3 ms avg after -- it wasn't that big 4 days ago :(

I've narrowed my investigation to one particular "Transaction" in terms of the NewRelic APM. It's basically the main HTTP request of my application.

Looks like the ping impacts psycopg2:connect (see http://imgur.com/a/LDH1c): 4 ms up to 16 ms on average.

That I can understand. However, I don't understand the performance decrease of the select queries on table1 (see https://i.stack.imgur.com/QaUqy.png): 80 ms up to 160 ms on average

Same goes for table 2 (see http://imgur.com/a/CnETs): 4 ms up to 20 ms on average

However, there is a commit in my request, and it performs better (see http://imgur.com/a/td8Dc): 12 ms down to 6 ms on average.

I don't see how this can be due to network latency!

I will provide a new bonnie++ benchmark when the requests per minute is at the lowest (remember I can only run benchmarks while the server is in use).

Rick, what did you mean by kernel configuration? The OS is a standard Ubuntu 16.04:

- Linux 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Do you think losing half the number of cores can explain my performance issue ? (AMD 8 cores down to Haswell 4 cores).

Best Regards,

Benjamin

PS : I've edited the SO post http://serverfault.com/questions/812702/posgres-perf-decreased-although-server-is-better

2016-11-04 1:05 GMT+01:00 Kevin Grittner <kgrittn@gmail.com>:

On Thu, Nov 3, 2016 at 9:51 AM, Benjamin Toueg <btoueg@gmail.com> wrote:
>
> Stream gives substantially better results with the new server (before/after)

Yep, the new server can access RAM at about twice the speed of the old.

> I've run "bonnie++ -u postgres -d /tmp/ -s 4096M -r 1096" on both
> machines. I don't know how to read bonnie++ results (before/after)
> but it looks quite the same, sometimes better for the new,
> sometimes better for the old.

On most metrics the new machine looks better, but there are a few
things that look potentially problematic with the new machine: the
new machine uses about 1.57x the CPU time of the old per block
written sequentially ((41 / 143557) / (16 / 87991)); so if the box
becomes CPU starved, you might notice writes getting slower than on
the new box. Also, several of the latency numbers are worse -- in
some cases far worse. If I'm understanding that properly, it
suggests that while total throughput from a number of connections
may be better on the new machine, a single connection may not run
the same query as quickly. That probably makes the new machine
better for handling an OLTP workload from many concurrent clients,
but perhaps not as good at cranking out a single big report or
running dump/restore.

Yes, it is quite possible that the new machine could be faster at
some things and slower at others.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Perf decreased although server is better

From

Will Platnick

Date:

04 November 2016, 16:27:32

My guess would be that your server upgrade wasn't the upgrade you thought it was.

You network latency could definitely be the cause of most of this. The problem is you're not measuring this from the server side. It's not only going to impact connect time, but you're going to get your data a bit slower as well. I'm assuming your pgbouncer is installed on the postgres box. Do you have the pgbouncer addon installed? If you're looking to isolate network transfer, pgbouncer should give the statistics like avg query time to see if they line up with what you're seeing in your APM.

Also, you reduced your number of cores, which is a big problem because basically you just cut your max query capacity in half. Assuming a dedicated box, each CPU can only process up to one postgres query at a time. Previously, you could process up to 8 queries simultaneously, whereas now you can only do 4. Now, since most of your queries are probably in the ms, that can still be quite a bit of queries in a second time frame and you may never hit 4 going at the same exact time, but without seeing your pgbouncer config, this may actually be happening based on all the idle connections you're seeing.

If ALL your connections are coming from a single pgbouncer locally on the postgres box, then you can use your server resources better by setting max_connections to 6 (the number of cores + one or two more for you to connect locally via pgsql). Then, set your default_pool_size in pgbouncer to 4 (number of cores) and reserve_pool_size to 0, and restart. This will keep the number of sessions limited to what your box is actually capable of doing and will help you avoid loading postgres down with more than it's capable of doing, which can make a bad situation worse.

My recommendation would be to go back to the old server if it's available. If not, get a new one in the same data center as your web servers with at least 8 cores to put you back where you were.

On Fri, Nov 4, 2016 at 7:55 AM Benjamin Toueg <btoueg@gmail.com> wrote:

I've noticed a network latency increase. Ping between web server and database : 0.6 ms avg before, 5.3 ms avg after -- it wasn't that big 4 days ago :(

I've narrowed my investigation to one particular "Transaction" in terms of the NewRelic APM. It's basically the main HTTP request of my application.

Looks like the ping impacts psycopg2:connect (see http://imgur.com/a/LDH1c): 4 ms up to 16 ms on average.

That I can understand. However, I don't understand the performance decrease of the select queries on table1 (see https://i.stack.imgur.com/QaUqy.png): 80 ms up to 160 ms on average

Same goes for table 2 (see http://imgur.com/a/CnETs): 4 ms up to 20 ms on average

However, there is a commit in my request, and it performs better (see http://imgur.com/a/td8Dc): 12 ms down to 6 ms on average.

I don't see how this can be due to network latency!

I will provide a new bonnie++ benchmark when the requests per minute is at the lowest (remember I can only run benchmarks while the server is in use).

Rick, what did you mean by kernel configuration? The OS is a standard Ubuntu 16.04:

- Linux 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Do you think losing half the number of cores can explain my performance issue ? (AMD 8 cores down to Haswell 4 cores).

Best Regards,

Benjamin

PS : I've edited the SO post http://serverfault.com/questions/812702/posgres-perf-decreased-although-server-is-better

2016-11-04 1:05 GMT+01:00 Kevin Grittner <kgrittn@gmail.com>:
On Thu, Nov 3, 2016 at 9:51 AM, Benjamin Toueg <btoueg@gmail.com> wrote:
>
> Stream gives substantially better results with the new server (before/after)

Yep, the new server can access RAM at about twice the speed of the old.

> I've run "bonnie++ -u postgres -d /tmp/ -s 4096M -r 1096" on both
> machines. I don't know how to read bonnie++ results (before/after)
> but it looks quite the same, sometimes better for the new,
> sometimes better for the old.

On most metrics the new machine looks better, but there are a few
things that look potentially problematic with the new machine: the
new machine uses about 1.57x the CPU time of the old per block
written sequentially ((41 / 143557) / (16 / 87991)); so if the box
becomes CPU starved, you might notice writes getting slower than on
the new box. Also, several of the latency numbers are worse -- in
some cases far worse. If I'm understanding that properly, it
suggests that while total throughput from a number of connections
may be better on the new machine, a single connection may not run
the same query as quickly. That probably makes the new machine
better for handling an OLTP workload from many concurrent clients,
but perhaps not as good at cranking out a single big report or
running dump/restore.

Yes, it is quite possible that the new machine could be faster at
some things and slower at others.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Perf decreased although server is better

From

Rick Otten

Date:

04 November 2016, 16:30:04

> Rick, what did you mean by kernel configuration? The OS is a standard Ubuntu 16.04:

> - Linux 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

> Do you think losing half the number of cores can explain my performance issue ? (AMD 8 cores down to Haswell 4 cores).

I was referring to some of the tunables discussed on this page:

https://www.postgresql.org/docs/9.6/static/kernel-resources.html

Specifically, in my environment I update /etc/security/limits.conf to include:

* hard nofile 65536
* soft nofile 65536

* hard stack 16384
* soft stack 16384

* hard memlock unlimited
* soft memlock unlimited

And then add this to /etc/pam.d/common-session so that they get picked up when I su to the postgres user:

session required pam_limits.so

I update sysctl.conf with huge pages:

vm.hugetlb_shm_group=5432
vm.nr_hugepages=4300

(The number of huge pages may be different for your environment.)

And create and add the postgres user to the huge pages group:

hugepages:x:5432:postgres

You may also want to look at some TCP tunables, and check your shared memory limits too.

I only mentioned this because sometimes when you move from one system to another, you can get so caught up in getting the database set up and data migration that you overlook the basic system settings...

Regarding the number of cores, most of the postgresql queries are going to be single threaded. The number of cores won't impact the performance of a single query except in certain circumstances:

1) You have parallel queries enabled and the table is doing some sort of expensive sequence scan

2) You have so many concurrent queries running the whole system is cpu starved.

3) There is some other resource contention between the cpus that causes _more_ cpus to actually run slower than fewer. (It happens - I had a server back in the 90's which had severe lock contention over /dev/tcp. Adding more cpus made it slower.)

4) The near-cache memory gets fragmented in a way that processors have to reach deeper in the caches to find what they need. (I'm not explaining that very well, but it is unlikely to be a problem in your case anyway.)

A quick and simple command to get a sense of how busy your cpus are is:

$ mpstat -P ALL 5

(let it run for a few of the 5 second intervals)

If they are all running pretty hot, then more cores might help. If just one is running hot, then more cores probably won't do anything.

On Fri, Nov 4, 2016 at 7:53 AM, Benjamin Toueg <btoueg@gmail.com> wrote:

I've noticed a network latency increase. Ping between web server and database : 0.6 ms avg before, 5.3 ms avg after -- it wasn't that big 4 days ago :(

I've narrowed my investigation to one particular "Transaction" in terms of the NewRelic APM. It's basically the main HTTP request of my application.

Looks like the ping impacts psycopg2:connect (see http://imgur.com/a/LDH1c): 4 ms up to 16 ms on average.

That I can understand. However, I don't understand the performance decrease of the select queries on table1 (see https://i.stack.imgur.com/QaUqy.png): 80 ms up to 160 ms on average

Same goes for table 2 (see http://imgur.com/a/CnETs): 4 ms up to 20 ms on average

However, there is a commit in my request, and it performs better (see http://imgur.com/a/td8Dc): 12 ms down to 6 ms on average.

I don't see how this can be due to network latency!

I will provide a new bonnie++ benchmark when the requests per minute is at the lowest (remember I can only run benchmarks while the server is in use).

Rick, what did you mean by kernel configuration? The OS is a standard Ubuntu 16.04:

- Linux 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Do you think losing half the number of cores can explain my performance issue ? (AMD 8 cores down to Haswell 4 cores).

Best Regards,

Benjamin

PS : I've edited the SO post http://serverfault.com/questions/812702/posgres-perf-decreased-although-server-is-better

2016-11-04 1:05 GMT+01:00 Kevin Grittner <kgrittn@gmail.com>:
On Thu, Nov 3, 2016 at 9:51 AM, Benjamin Toueg <btoueg@gmail.com> wrote:
>
> Stream gives substantially better results with the new server (before/after)

Yep, the new server can access RAM at about twice the speed of the old.

> I've run "bonnie++ -u postgres -d /tmp/ -s 4096M -r 1096" on both
> machines. I don't know how to read bonnie++ results (before/after)
> but it looks quite the same, sometimes better for the new,
> sometimes better for the old.

On most metrics the new machine looks better, but there are a few
things that look potentially problematic with the new machine: the
new machine uses about 1.57x the CPU time of the old per block
written sequentially ((41 / 143557) / (16 / 87991)); so if the box
becomes CPU starved, you might notice writes getting slower than on
the new box. Also, several of the latency numbers are worse -- in
some cases far worse. If I'm understanding that properly, it
suggests that while total throughput from a number of connections
may be better on the new machine, a single connection may not run
the same query as quickly. That probably makes the new machine
better for handling an OLTP workload from many concurrent clients,
but perhaps not as good at cranking out a single big report or
running dump/restore.

Yes, it is quite possible that the new machine could be faster at
some things and slower at others.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Perf decreased although server is better

From

Kevin Grittner

Date:

04 November 2016, 17:05:50

On Fri, Nov 4, 2016 at 6:53 AM, Benjamin Toueg <btoueg@gmail.com> wrote:

> I don't see how this can be due to network latency!

I'm not suggesting it is due to network latency -- it is due to the
latency for storage requests.  That won't depend on network latency
unless you are going to a LAN for storage.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Perf decreased although server is better

From

Benjamin Toueg

Date:

10 November 2016, 12:52:47

Hi,

I tried pg_repack on the new server with no luck, so I've decided to move back to the old server to discard:

performance decrease due to server raw characteristics
performance decrease due to network latencies

I've seen no improvement whatsoever. Could the issue be due to one of, or an interaction between :

Ubuntu 16.04.1 LTS (Linux 3.14.32-vps-grs-ipv6-64 x86_64 x86_64 x86_64 GNU/Linux)
Postgres 9.6.1/Postgis 2.3
pgbouncer 1.7.2

I guess to be sure I would need to find a way to restore my initial performances one way or another 😒

The postgresql.conf is the same as before, and very close to the defaults: http://pastebin.com/ZGYH38ft. I realised I had "track_functions on" but I turned it off and it didn't help.

The only thing that changed purposefully is the postgres and pgbouncer authentication mechanism (it used to be `trust` for both, now it's `md5` for both).

Any help appreciated,

Thanks

2016-11-04 15:05 GMT+01:00 Kevin Grittner <kgrittn@gmail.com>:

On Fri, Nov 4, 2016 at 6:53 AM, Benjamin Toueg <btoueg@gmail.com> wrote:

> I don't see how this can be due to network latency!

I'm not suggesting it is due to network latency -- it is due to the
latency for storage requests. That won't depend on network latency
unless you are going to a LAN for storage.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company