Thread: [PERFORM] Performance loss upgrading from 9.3 to 9.6

[PERFORM] Performance loss upgrading from 9.3 to 9.6

From
Adam Torres
Date:

Good morning all,

 

We have a problem with performance after upgrading from 9.3 to 9.6 where certain queries take 9 times longer to run.  On our initial attempt to upgrade, we noticed the system as a whole was taking longer to run through normal daily processes.  The query with the largest run time was picked to act as a measuring stick.

 

We are using the staging server to test the upgrade.  It has two 1.3TB partitions, each of which holds a copy of the nearly 1TB database.  The active staging 9.3 database is on one of the partitions while a copy was made onto the other.  A second instance of 9.3 was set up to verify the copy was successful and some performance tests were done; the copy was then upgraded to 9.6 via pg_upgrade.  The same performance tests were run again, and this is where the 9-times-slower number comes from.

 

OS Ubuntu 14.04.4

PG9.3 is 9.3.19-1.pgdg14.04+1 from http://apt.postgresql.org/pub/repos/apt/

PG9.6 is 9.6.5-1.pgdg14.04+2 from same.

The server has 24 cores and 64GB RAM.  Data drives are spinning platters in RAID10 - not SSD.

 

upgrade steps:

* Ran rsync (excluding the xlog folder) from the active 9.3 partition to the unused partition while 9.3 was still running.

* Once initial rsync completed, shut down 9.3 and reran the rsync command without the exclude.

* Once second rsync completed, restarted 9.3 and left it alone.

* Copied the active 9.3 configuration files into a new /etc/postgresql/9.3/ folder so we could test before/after numbers.  Changed the config to point to the appropriate data/log/etc folders/files/port.

* Started second 9.3 instance.

* Altered the few foreign servers to account for the second instance's port.

* Ran some arbitrary queries to check performance.

* Installed 9.6 via apt

* Created a 9.6 instance with its data directory on the same partition as the newly cloned 9.3 instance.

* Ran pg_upgrade with --link option (includes running --check and compiling/installing postgis)

* Copied the config from 9.3 and made minimal changes.  Renaming log files, changing folders, and removing checkpoint_segments was about it.

* Started the 9.6 instance and was able to do arbitrary queries.

* Ran the upgrade-suggested vacuumdb command on all databases to generate statistics
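
For reference, the suggested command was vacuumdb --all --analyze-in-stages.  As I understand it, that is roughly equivalent to running the following in each database (a sketch of the staged analyze; pg_upgrade does not carry planner statistics over, so the new cluster has none until this finishes):

    -- Stage 1: crude statistics quickly, so the planner has something to work with
    SET default_statistics_target = 1;
    ANALYZE;
    -- Stage 2: better statistics
    SET default_statistics_target = 10;
    ANALYZE;
    -- Stage 3: full statistics at the default target (or RESET to your configured value)
    SET default_statistics_target = 100;
    ANALYZE;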

 

At that point, the database should be ready to use.  Running the same set of arbitrary queries that were run on 9.3 yielded much worse performance, though.

The data is broken out by dealers, whose customers own vehicles.  The arbitrary queries pull data from 8 tables (really 6 large [millions of records] and 2 smaller [hundreds of records] tables) to populate a new table via "select [fields] into [new table]"; a simplified sketch of the query shape follows.  Twenty different dealers were used for testing, meaning twenty of these queries.  The system which runs these queries has 12 workers, meaning up to 12 of these queries can be running concurrently.  While the workers were offline, all twenty were queued up and then the workers were activated.  For each test, the order of the dealers was the same.  That order was a mix of small and large dealers - not strictly alternating, more like a few large then a few small.  The run time for 9.3 was 21m9s and 9.6 was 3h18m25s.
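
The table and column names below are made up for illustration (the real queries touch 8 tables), but the shape is the same:

    -- Hypothetical sketch of one per-dealer query; names are placeholders.
    SELECT c.id, c.first_name, v.vin, v.model_year
    INTO dealer_results_tmp
    FROM customers_customer c
    JOIN vehicles_vehicle v ON v.customer_id = c.id
    WHERE c.dealer_id = 'EC000079'
      AND v.dealer_id = 'EC000079';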

 

Each test was done while the other database was idle - htop showed little to no activity before each test started.

perf reports (converted to flamegraphs via https://github.com/brendangregg/FlameGraph) for the 9.6 test show about a third of the processor usage in call stacks similar to those in the 9.3 graph.  The other two thirds is still within the postgres process but starts with '[unknown]' and has 'connect', 'socket', and 'close' as the next calls in the chain.  I have not been able to figure out what postgres is doing to make these calls.

 

Changing the configuration based on pgtune (command line version 0.9.3-2) did not make much of a difference.  The online pgtune at http://pgtune.leopard.in.ua/ suggested a couple of different settings, which I have yet to test.

 

The main question is what the connect/socket/close calls in the perf output are and how to make them go away, as they appear to be what is using up the added time.  I'm hoping there is just a setting I've missed.

 

Query plan for a small dealer on 9.6, run while nothing else was running on the server

https://explain.depesz.com/s/z71u

Planning time: 8.218 ms

Execution time: 639319.525 ms

 

Same query as run on 9.3

https://explain.depesz.com/s/gjN3

Total runtime: 272897.150 ms

 

 

-- 

Thanks,

Adam Torres

 

 

Re: [PERFORM] Performance loss upgrading from 9.3 to 9.6

From
Justin Pryzby
Date:
On Mon, Nov 06, 2017 at 01:18:00PM +0000, Adam Torres wrote:
> Good morning all,
> 
> We have a problem with performance after upgrading from 9.3 to 9.6 where certain queries take 9 times longer to run.
> On our initial attempt to upgrade, we noticed the system as a whole was taking longer to run through normal daily
> processes.  The query with the largest run time was picked to act as a measuring stick.

> https://explain.depesz.com/s/z71u
> Planning time: 8.218 ms
> Execution time: 639319.525 ms
> 
> Same query as run on 9.3
> https://explain.depesz.com/s/gjN3
> Total runtime: 272897.150 ms

Actually it looks to me like both query plans are poor..

..because of this:
| Hash Join (cost=85,086.25..170,080.80 ROWS=40 width=115) (actual time=32.673..84.427 ROWS=13,390 loops=1)
|    Hash Cond: (av.customer_id = cc_1.id)

If there are a large number of distinct customer_ids (maybe with nearly equal
frequencies), it might help to
ALTER TABLE av ALTER customer_id SET STATISTICS 400
..same for cc_1.id.  And re-analyze those tables (are they large??).
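
Spelled out, that is something like this (a sketch - I'm assuming from the pg_stats filter below that cc_1 is the customers_customer table; adjust the names to your schema):

    ALTER TABLE av ALTER customer_id SET STATISTICS 400;
    ALTER TABLE customers_customer ALTER id SET STATISTICS 400;
    ANALYZE av;                  -- re-gather statistics at the new target
    ANALYZE customers_customer;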

see if statistics improve:
SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, n_distinct,
       array_length(most_common_vals,1) n_mcv
FROM pg_stats WHERE tablename~'customers_customer' AND attname='id' GROUP BY 1,2,3,4,5 ORDER BY 1

Goal is to get at least an accurate value for n_distinct (but preferably also
storing the most frequent IDs).  I wouldn't bother re-running the query unless
you find that increasing stats target causes the plan to change.

Justin



Re: [PERFORM] Performance loss upgrading from 9.3 to 9.6

From
Adam Torres
Date:
Justin,
Thanks for the reply.

I changed the statistics on av.customer_id as suggested and the number returned by pg_stats went from 202,333 to 904,097.  There are 11.2 million distinct customer_ids on the 14.8 million vehicle records.  Rerunning the query showed no significant change in time (624 seconds vs. 639 seconds) - plan is at https://explain.depesz.com/s/e2fo.
 

I went through the query looking for fields used in joins and conditions and applied the same steps to 7 other fields over 4 of the tables.  Most n_distinct values did not change much, but two did change from 1.# million to -1<x<0, which seems better based on n_distinct's definition.  This improved the query a little; from 624 seconds down to 511 seconds.  That plan is at https://explain.depesz.com/s/te50.  This is the same query that ran in 272 seconds on 9.3 with the same data and previous statistics settings.
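
One thing we have not tried yet: since ANALYZE keeps underestimating, n_distinct can be pinned manually.  A sketch, with av standing in for the real table name and -0.76 being 11.2 million distinct ids over 14.8 million rows:

    -- Negative n_distinct means a fraction of the row count, so it scales as the table grows.
    ALTER TABLE av ALTER COLUMN customer_id SET (n_distinct = -0.76);
    ANALYZE av;  -- the override takes effect at the next analyze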
 

It has now been decided to try upgrading to 9.4 as that is the minimum to support Django 1.11 (which we are trying to upgrade a backend service to).  The hope is whatever feature we have not configured properly in 9.6 is not there in 9.4.


On 11/6/17, 9:21 AM, "Justin Pryzby" <pryzby@telsasoft.com> wrote:
    On Mon, Nov 06, 2017 at 01:18:00PM +0000, Adam Torres wrote:
    > Good morning all,
    >
    > We have a problem with performance after upgrading from 9.3 to 9.6 where certain queries take 9 times longer to run.  On our initial attempt to upgrade, we noticed the system as a whole was taking longer to run through normal daily processes.  The query with the largest run time was picked to act as a measuring stick.
    >
    > https://explain.depesz.com/s/z71u
    > Planning time: 8.218 ms
    > Execution time: 639319.525 ms
    >
    > Same query as run on 9.3
    > https://explain.depesz.com/s/gjN3
    > Total runtime: 272897.150 ms

    Actually it looks to me like both query plans are poor..

    ..because of this:
    | Hash Join (cost=85,086.25..170,080.80 ROWS=40 width=115) (actual time=32.673..84.427 ROWS=13,390 loops=1)
    |    Hash Cond: (av.customer_id = cc_1.id)

    If there are a large number of distinct customer_ids (maybe with nearly equal frequencies), it might help to
    ALTER TABLE av ALTER customer_id SET STATISTICS 400
    ..same for cc_1.id.  And re-analyze those tables (are they large??).

    see if statistics improve:
    SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, n_distinct,
           array_length(most_common_vals,1) n_mcv
    FROM pg_stats WHERE tablename~'customers_customer' AND attname='id' GROUP BY 1,2,3,4,5 ORDER BY 1

    Goal is to get at least an accurate value for n_distinct (but preferably also storing the most frequent IDs).  I wouldn't bother re-running the query unless you find that increasing stats target causes the plan to change.

    Justin
 



Re: [PERFORM] Performance loss upgrading from 9.3 to 9.6

From
Adam Brusselback
Date:
> It has now been decided to try upgrading to 9.4 as that is the minimum to support Django 1.11 (which we are trying to
> upgrade a backend service to).  The hope is whatever feature we have not configured properly in 9.6 is not there in
> 9.4.
It's entirely possible that whatever is causing your performance issue was
introduced by the migration process itself, rather than anything inherently
different in 9.6.  The best test for that is setting up another 9.3 server,
restoring a backup, and testing there.  If that is very different from what
you are getting on 9.6, then it's something which changed in Postgres; if
not, it's just bad stats.

I do think it's probably better to fix your query than to upgrade to 9.4
instead of 9.6.  You have a crazy amount of your query time spent in a
single node.  That plan is not good.  If that's the only query giving you
trouble, work on optimizing it.

Just my $0.02

-Adam



Re: [PERFORM] Performance loss upgrading from 9.3 to 9.6

From
Justin Pryzby
Date:
On 11/6/17, 9:21 AM, "Justin Pryzby" <pryzby@telsasoft.com> wrote:
>     see if statistics improve:
>     SELECT (SELECT sum(x) FROM unnest(most_common_freqs) x) frac_MCV, tablename, attname, n_distinct,
>            array_length(most_common_vals,1) n_mcv
>     FROM pg_stats WHERE tablename~'customers_customer' AND attname='id' GROUP BY 1,2,3,4,5 ORDER BY 1

On Mon, Nov 06, 2017 at 09:12:01PM +0000, Adam Torres wrote:
> I changed the statistics on av.customer_id as suggested and the number
> returned by pg_stats went from 202,333 to 904,097.

Do you mean n_distinct?  It'd be useful to see that query on pg_stats.  Also I
don't know that we've seen \d output for the tables (or at least the joined
columns) or the full query?

> There are 11.2 million distinct customer_ids on the 14.8 million vehicle
> records.

If there are so many distinct ids, updating stats won't help the rowcount
estimate (and could even hurt) - it can only store the 10000 most common values.

Are there as many distinct values for cc.id ?
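
A quick way to check both the real number and what the planner believes (exact counting is slow on a table that size, and substitute the real table name for the cc alias):

    SELECT count(DISTINCT id) FROM cc;        -- exact distinct count
    SELECT n_distinct FROM pg_stats
    WHERE tablename='cc' AND attname='id';    -- the planner's estimate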

I would try to reproduce the rowcount problem with a minimal query:
explain analyze SELECT FROM av JOIN cc ON av.customer_id=cc.id; --WHERE cc.id<99;
Maybe the rows estimate is okay for some values and not for others, so you may
need to try various WHERE clauses (with JOINs to additional tables if need
be... but without reimplementing the whole query).
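
For instance (the values here are arbitrary, just to probe different parts of the distribution):

    EXPLAIN ANALYZE SELECT FROM av JOIN cc ON av.customer_id=cc.id WHERE cc.id<99;
    EXPLAIN ANALYZE SELECT FROM av JOIN cc ON av.customer_id=cc.id WHERE av.dealer_id='EC000079';

Then compare the estimated rows= with the actual rows in each node.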

I just noticed there are two conditions on dealer_id, one from table av and one
from table cc_1.  It seems likely those are correlated/non-independent
conditions... but postgres probably doesn't know that (unless you used PG96 FK
logic, or PG10 multi-variable stats).

As a test, you could try dropping one of those conditions, or maybe a hacky
change like ROW(av.dealer_id, cc_1.dealer_id)=ROW('EC000079', 'EC000079'),
which postgres estimates as no more selective than a single equality test.  BTW
this is all from src/backend/utils/adt/selfuncs.c.
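
In context, the hacky rewrite would look something like this (the dealer_id value is taken from your plan; the rest of the query stays as-is):

    -- Two seemingly independent conditions, whose selectivities postgres multiplies:
    --   WHERE av.dealer_id = 'EC000079' AND cc_1.dealer_id = 'EC000079'
    -- Collapsed into one comparison, estimated as a single equality test:
    WHERE ROW(av.dealer_id, cc_1.dealer_id) = ROW('EC000079', 'EC000079')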

Justin

