Thread: [PERFORM] Slow query after 9.3 to 9.6 migration
Hi there, fellow experts!
I need some advice about a query that became slower after a 9.3 to 9.6 migration.
First of all, I'm from the dev team.
Before the migration, we (programmers) made some modifications to the query, bringing its average time from 8s down to 2-3s.
As this query is the most executed one on our system (it builds the user panel for work), every bit we can squeeze out of it will be nice.
Now, after the server migration to 9.6, we're experiencing bad times with this query again.
Unfortunately, I don't have the old query plan (9.3 version) to show you, but in the current version (9.6) I can see some buffers being written, which tells me that something is wrong.
Our server has 250GB of memory available, but the database team says they can't do anything to make this query better. I'm not sure about that, as some buffers are being written to disk.
Any tip/help will be much appreciated (even from the query side).
Thank you!
The query plan: https://explain.depesz.com/s/5KMn
Note: I already tried to add an index on the kilo_victor table, but PostgreSQL still thinks it is better to do a seq scan.
Flávio Henrique
> Note: I already tried to add an index on the kilo_victor table, but PostgreSQL
> still thinks it is better to do a seq scan.

I don't know about the data distribution in kilo_victor, but maybe a partial index ON kilo_victor (juliet_romeo) WHERE NOT xray_seven?

Gerardo
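P.S. Something along these lines, for example (only a sketch: the index name is a placeholder, and CONCURRENTLY just avoids blocking writes while it builds):

    CREATE INDEX CONCURRENTLY kilo_victor_juliet_romeo_partial
        ON kilo_victor (juliet_romeo)
        WHERE NOT xray_seven;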
The biggest impact on performance you can achieve is by using a materialized view. If the query is as heavily used as you say, even 2-3 seconds in a multi-user OLTP environment is still unacceptable from my point of view. I don't know if this is your case, but if you have 1000 users connecting at 8 am all at the same time, it will freeze the app for a while.
Ask yourself: how stale can the data be? Take into account that you can refresh the materialized view as often as you want, even every 10 seconds if you need to.
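For example, something roughly like this (only a sketch: the view name, the unique key and the query body are hypothetical placeholders; REFRESH ... CONCURRENTLY needs a unique index and is available since 9.4):

    -- hypothetical stand-in for the real panel query (table and columns are made up)
    CREATE MATERIALIZED VIEW user_panel_mv AS
        SELECT user_id, count(*) AS open_items
          FROM panel_items
         GROUP BY user_id;

    -- a unique index is required to refresh without locking out readers
    CREATE UNIQUE INDEX ON user_panel_mv (user_id);

    -- run this as often as freshness requires (cron, pg_cron, app scheduler, ...)
    REFRESH MATERIALIZED VIEW CONCURRENTLY user_panel_mv;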
Besides this, there's still some room for improvement. Perhaps you have not created the right index to avoid seq scans. Have a look at indexes on expressions.
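For instance (illustrative only; the table and expression are made up), if the query filters on an expression rather than a bare column, the index has to be built on that same expression:

    -- a filter like WHERE lower(email) = '...' cannot use a plain index on email,
    -- but it can use this one:
    CREATE INDEX ON users (lower(email));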
On the systems side: ask them whether they changed anything in the effective_cache_size and shared_buffers parameters; I presume they haven't changed anything related to costs.
Regards.
Daniel Blanch.
On Tue, Dec 27, 2016 at 5:50 PM, Flávio Henrique <yoshimit@gmail.com> wrote:
> I can see some buffers being written, which tells me that something is wrong.

Try running VACUUM FREEZE ANALYZE on all tables involved in the query (or just run it as a superuser on the whole database). Do *not* use the FULL option. Among other things, this will ensure that you have somewhat current statistics and that all hint bits are set.

(I remember my surprise the first time I converted a table to PostgreSQL, ran SELECT count(*) on it to make sure all rows made it, and saw a very long run time with disk writes as the bottleneck. That's when I learned about hint bits.)

You should also make sure that autovacuum is aggressive enough on the new cluster. Without that, any performance benefit from the above will slowly disappear.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
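P.S. Concretely, that would be something like this (a sketch; kilo_victor is just one of the tables named in the plan):

    VACUUM (FREEZE, ANALYZE);               -- whole database, as a superuser
    VACUUM (FREEZE, ANALYZE) kilo_victor;   -- or per table involved in the query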
On Tue, Dec 27, 2016 at 5:50 PM, Flávio Henrique <yoshimit@gmail.com> wrote:
> Note: I already tried to add an index on the kilo_victor table, but PostgreSQL
> still thinks it is better to do a seq scan.

Hard to provide more without the query or the 'old' plan. Here are some things you can try:

*) Set effective_io_concurrency high. You have some heap scanning going on and this can sometimes help (but it should be marginal).
*) See if you can get any juice out of parallel query.
*) Try playing with enable_nestloop and enable_seqscan. These are hail-mary passes but worth a shot.

Run the query back to back with the same arguments in the same database session. Does performance improve?

Big gains (if any) are likely due to indexing strategy. I do see some suspicious casting, for example:

    Join Filter: ((four_charlie.delta_tango)::integer = (six_quebec.golf_bravo)::integer)

Are you casting in the query or joining through dissimilar data types?

I suspect your database team might be incorrect.

merlin
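P.S. At the session level that could look like this (only a sketch; the values are starting points to experiment with, not recommendations, and the SELECT is a placeholder for the real query):

    SET effective_io_concurrency = 200;          -- high values mostly help on SSDs / striped arrays
    SET max_parallel_workers_per_gather = 4;     -- enable 9.6 parallel query for this session
    SET enable_nestloop = off;                   -- hail-mary plan experiments; reset afterwards
    SET enable_seqscan = off;
    EXPLAIN (ANALYZE, BUFFERS) SELECT ...;       -- re-run the real query and compare plans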
Hi all!
Sorry for the delay (holidays).
Well, the most expensive sequential scan has been solved.
I asked the db team to drop the index and recreate it and guess what: now PostgreSQL is using it and the time dropped.
(Thank you, @Gerardo Herzig!) I think there's still room for improvement, but the problem is not so crucial right now.
I'll try to investigate every suggestion mentioned here. Thank you all.
@Daniel Blanch
I'll run some tests with a materialized view. Thank you.
> On the systems side: ask them whether they changed anything in the effective_cache_size and shared_buffers parameters; I presume they haven't changed anything related to costs.
Replying to your comment: I think they tuned the server:
effective_cache_size = 196GB
shared_buffers = 24GB (shouldn't this be higher?)
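If it helps, I can also pull the current values straight from the server with something like this (a sketch):

    SELECT name, setting, unit, source
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'effective_cache_size',
                    'work_mem', 'maintenance_work_mem');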
@Kevin Grittner
Sorry, but I'm not sure how to tell whether autovacuum is aggressive enough; here are my related settings:
autovacuum                          | on
autovacuum_analyze_scale_factor     | 0.05
autovacuum_analyze_threshold        | 10
autovacuum_freeze_max_age           | 200000000
autovacuum_max_workers              | 3
autovacuum_multixact_freeze_max_age | 400000000
autovacuum_naptime                  | 15s
autovacuum_vacuum_cost_delay        | 10ms
autovacuum_vacuum_cost_limit        | -1
autovacuum_vacuum_scale_factor      | 0.1
autovacuum_vacuum_threshold         | 10
autovacuum_work_mem                 | -1
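If it helps to judge whether autovacuum is keeping up, I can also run something like this (a sketch) and report back:

    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
      FROM pg_stat_user_tables
     ORDER BY n_dead_tup DESC
     LIMIT 10;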
@Merlin Moncure
> Big gains (if any) are likely due to indexing strategy.
> I do see some suspicious casting, for example:
> Join Filter: ((four_charlie.delta_tango)::integer =
> (six_quebec.golf_bravo)::integer)
> Are you casting in the query or joining through dissimilar data types?
No casts in the query. The joins are on the same data types.
Thank you all for the answers. Happy 2017!
Flávio Henrique
--------------------------------------------------------
"There are only 10 types of people in the world: Those who understand binary, and those who don't"
--------------------------------------------------------
On Thu, Jan 5, 2017 at 10:51 AM, Flávio Henrique <yoshimit@gmail.com> wrote:
> Replying to your comment: I think they tuned the server:
> effective_cache_size = 196GB
> shared_buffers = 24GB (shouldn't this be higher?)

Probably not, although it may be a good idea to try settings either side of that (say, 16GB and 32GB) and monitor performance compared to the current setting.

> autovacuum_max_workers | 3

If you ever see all workers busy at the same time for 30 minutes or more, you should probably consider raising that so that small, frequently updated tables are not neglected for too long.

> autovacuum_vacuum_cost_limit | -1

That is going to default to vacuum_cost_limit, which is usually 200. If the server is actually big enough to merit "effective_cache_size = 196GB" then you should probably bump this setting to something like 2000.

> autovacuum_work_mem | -1

That is going to default to maintenance_work_mem. On a big machine, you probably want that set to somewhere between 1GB and 2GB.

Some other tuning to the cost parameters might be helpful, but there's not enough data on the thread to know what else to suggest. If you hit some other slow query, you might want to report it in the manner suggested here:

https://wiki.postgresql.org/wiki/SlowQueryQuestions

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
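P.S. If the DBA team agrees, those two changes could be applied with ALTER SYSTEM and a reload (a sketch using the example values above; adjust to taste):

    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    ALTER SYSTEM SET autovacuum_work_mem = '1GB';
    SELECT pg_reload_conf();    -- both settings take effect on reload; no restart needed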
Hi,
If simply recreating the index makes PostgreSQL use it again, it might mean that the index was bloated, that is, it grew so big that a seq scan was cheaper.
I've seen another case recently where PostgreSQL 9.6 wasn't using the right index in a query; I was able to reproduce the issue by crafting an index much bigger than it should be.
Can you record the index size as it is now? Keep this info, and if the problem happens again, check the index sizes and see if they have grown too much.
i.e. SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 'index_name';
This might help to confirm whether the problem is that the indexes are growing too much for some reason.
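Recording the size in a more readable form could be done with something like this (again just a sketch; substitute the real index name):

    SELECT relname,
           pg_size_pretty(pg_relation_size(oid)) AS on_disk_size,
           relpages,
           reltuples
      FROM pg_class
     WHERE relname = 'index_name';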
Regards.
P.S. The other parameters don't seem to be the cause of the problem to me.
On Thu, Jan 5, 2017 at 10:51 AM, Flávio Henrique <yoshimit@gmail.com> wrote:
> @Merlin Moncure
>> Big gains (if any) are likely due to indexing strategy.
>> I do see some suspicious casting, for example:
>> Join Filter: ((four_charlie.delta_tango)::integer =
>> (six_quebec.golf_bravo)::integer)
>> Are you casting in the query or joining through dissimilar data types?
>
> No casts in the query. The joins are on the same data types.

well, something is going on.

    create table t(i int);
    create table t2(i int);
    set enable_hashjoin to false;
    set enable_mergejoin to false;

yields:

    postgres=# explain select * from t join t2 on t.i = t2.i;
                                QUERY PLAN
    ──────────────────────────────────────────────────────────────────
     Nested Loop  (cost=0.00..97614.88 rows=32512 width=8)
       Join Filter: (t.i = t2.i)
       ->  Seq Scan on t  (cost=0.00..35.50 rows=2550 width=4)
       ->  Materialize  (cost=0.00..48.25 rows=2550 width=4)
             ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)

please note the non-casted join filter. however,

    postgres=# explain select * from t join t2 on t.i::bigint = t2.i::bigint;
                                QUERY PLAN
    ──────────────────────────────────────────────────────────────────
     Nested Loop  (cost=0.00..130127.38 rows=32512 width=8)
       Join Filter: ((t.i)::bigint = (t2.i)::bigint)
       ->  Seq Scan on t  (cost=0.00..35.50 rows=2550 width=4)
       ->  Materialize  (cost=0.00..48.25 rows=2550 width=4)
             ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)

notice the casts in the join filter. Furthermore, please note the higher query cost due to the server accounting for the casting involved in the join. Any kind of non-equality based operation in a join or on the predicate side of a where condition can get very expensive very quickly. (It remains difficult to see if there's any way to improve the join operation due to lack of visibility on the query string.)

merlin
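P.S. One quick way to double-check the declared types of the joined columns (a sketch; the table/column names below come from the anonymized plan, so substitute the real ones):

    SELECT table_name, column_name, data_type
      FROM information_schema.columns
     WHERE (table_name, column_name) IN (('four_charlie', 'delta_tango'),
                                         ('six_quebec', 'golf_bravo'));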
Can you remove me from your mailing list?
Thanks.

On Fri, Jan 6, 2017 at 6:14 AM, Filipe Oliveira <filipeo_11@hotmail.com> wrote:
> Can you remove me from your mailing list?

There is an unsubscribe action here:
https://www.postgresql.org/community/lists/subscribe/
--
Michael
Thank you for the reply. I had been trying to find that option for a while now.
On Thu, Jan 5, 2017 at 9:51 AM, Daniel Blanch Bataller <daniel.blanch.bataller@gmail.com> wrote:
> If simply recreating the index makes PostgreSQL use it again, it might mean
> that the index was bloated, that is, it grew so big that a seq scan was cheaper.
> [...]
> This might help to confirm whether the problem is that the indexes are growing
> too much for some reason.

Are these unique indexes or not? Did Flavio have a workload with many UPDATEs?

I ask these questions because I think it's possible that this is explained by a regression in 9.5's handling of index bloat, described here:

http://postgr.es/m/CAH2-Wz=SfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA@mail.gmail.com

I'm trying to track down cases where this could be an issue, to get a better sense of the problem.

--
Peter Geoghegan
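P.S. A quick way to check whether they are unique (a sketch; substitute the real index name):

    SELECT c.relname AS index_name, i.indisunique, i.indisprimary
      FROM pg_index i
      JOIN pg_class c ON c.oid = i.indexrelid
     WHERE c.relname = 'index_name';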
On Tue, Dec 27, 2016 at 3:50 PM, Flávio Henrique <yoshimit@gmail.com> wrote:
> Our server has 250GB of memory available, but the database team says they
> can't do anything to make this query better. I'm not sure about that, as some
> buffers are being written to disk.

The second sorts etc. start spilling to disk, your performance is gonna tank. Try increasing work_mem to something moderate like 256MB to 1GB. Note that work_mem is per sort / action, so if you've got 100 users running queries with 2 or 3 sorts at a time, you can exhaust memory real fast. OTOH, a db with proper pooling on connections etc. (aka 10 to 20 live connections at a time) can easily handle 1GB of work_mem if it's got 256GB of RAM.
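For example, this can be tested per session before touching the global setting (a sketch; the SELECT stands for the real panel query):

    SET work_mem = '256MB';                  -- per sort/hash node, per backend
    EXPLAIN (ANALYZE, BUFFERS) SELECT ...;   -- check whether "Sort Method: external merge" (disk) goes away
    RESET work_mem;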