Thread: slow sub-query problem
I'm having problems optimising a query that's very slow due to a sub-query. The query is this: SELECT structure_id, batch_id, property_id, property_data FROM chemcentral.structure_props WHERE structure_id IN (SELECT structure_id FROM chemcentral.structure_props WHERE property_id = 643413) AND property_id IN (1, 643413, 1106201); and it takes 18s to execute. It I replace the sub-query with the inlined 369 values so that the 4th line looks like this: WHERE structure_id IN (1122687,309004,306064 ...) it takes a few ms. The plans are: 1. sub-query "Nested Loop (cost=1132.97..1182.28 rows=43 width=644) (actual time=70.926..18937.669 rows=381 loops=1)" " -> HashAggregate (cost=1091.73..1091.75 rows=2 width=4) (actual time=2.829..3.212 rows=366 loops=1)" " Group Key: structure_props_1.structure_id" " -> Index Scan using idx_sp_property_id on structure_props structure_props_1 (cost=0.43..1090.77 rows=382 width=4) (actual time=0.033..2.380 rows=369 loops=1)" " Index Cond: (property_id = 643413)" " -> Bitmap Heap Scan on structure_props (cost=41.24..45.26 rows=1 width=644) (actual time=51.726..51.727 rows=1 loops=366)" " Recheck Cond: ((structure_id = structure_props_1.structure_id) AND (property_id = ANY ('{1,643413,1106201}'::integer[])))" " Heap Blocks: exact=381" " -> BitmapAnd (cost=41.24..41.24 rows=1 width=0) (actual time=51.714..51.714 rows=0 loops=366)" " -> Bitmap Index Scan on idx_sp_structure_id (cost=0.00..6.80 rows=317 width=0) (actual time=0.046..0.046rows=475 loops=366)" " Index Cond: (structure_id = structure_props_1.structure_id)" " -> Bitmap Index Scan on idx_sp_property_id (cost=0.00..33.90 rows=1146 width=0) (actual time=51.656..51.656 rows=811892 loops=366)" " Index Cond: (property_id = ANY ('{1,643413,1106201}'::integer[]))" "Planning time: 0.497 ms" "Execution time: 18937.868 ms" 2. inlined values "Bitmap Heap Scan on structure_props (cost=2600.48..2645.29 rows=10 width=644) (actual time=71.676..72.724 rows=381 loops=1)" " Recheck Cond: ((property_id = ANY ('{1,643413,1106201}'::integer[])) AND (structure_id = ANY ('{1122687,309004,306064,278852,234066,1122645,412925,280033,423990,568929,448302,278487,278955,40430,40430,467979,467508,288413,289746,306073,355352,265583,4779 (...)" " Heap Blocks: exact=381" " -> BitmapAnd (cost=2600.48..2600.48 rows=10 width=0) (actual time=71.608..71.608 rows=0 loops=1)" " -> Bitmap Index Scan on idx_sp_property_id (cost=0.00..33.90 rows=1146 width=0) (actual time=54.614..54.614 rows=811892 loops=1)" " Index Cond: (property_id = ANY ('{1,643413,1106201}'::integer[]))" " -> Bitmap Index Scan on idx_sp_structure_id (cost=0.00..2566.32 rows=117367 width=0) (actual time=14.487..14.487 rows=173867 loops=1)" " Index Cond: (structure_id = ANY ('{1122687,309004,306064,278852,234066,1122645,412925,280033,423990,568929,448302,278487,278955,40430,40430,467979,467508,288413,289746,306073,355352,265583,477941,326652,326602,233964,15338,397586,1122647,3088 (...)" "Planning time: 1.052 ms" "Execution time: 72.858 ms" Table is like this: CREATE TABLE chemcentral.structure_props ( id serial NOT NULL, source_id integer NOT NULL, structure_id integer NOT NULL, batch_id character varying(16), parent_idinteger, property_id integer NOT NULL, property_data jsonb, CONSTRAINT structure_props_pkey PRIMARY KEY (id) ) All relevant columns are indexed and using PostgreSQL 9.4. Any clues how to re-write it to avoid the slow sub-query. Many thanks Tim
Tim Dudgeon wrote > All relevant columns are indexed and using PostgreSQL 9.4. > Any clues how to re-write it to avoid the slow sub-query. Try using an actual join instead of a subquery. You will have to provide aliases and then setup the where clause appropriately. I am reading the query correctly in that the repeated reference to 643413 is redundant? The lack of a defined natural primary key makes blind reasoning difficult. David J. -- View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827275.html Sent from the PostgreSQL - sql mailing list archive at Nabble.com.
Please reply to the list...
In short...
A natural key prevents duplicate real data which a serially generated made up key does not.
David J.
On Monday, November 17, 2014, Tim Dudgeon <tdudgeon.ml@gmail.com> wrote:
On Monday, November 17, 2014, Tim Dudgeon <tdudgeon.ml@gmail.com> wrote:
On 17/11/2014 18:44, David G Johnston wrote:Tim Dudgeon wroteI'm trying to go in that direction but in the query is entirely within one table, so I need to join the table to itself? I've been trying this but not getting it to work yet.All relevant columns are indexed and using PostgreSQL 9.4.Try using an actual join instead of a subquery. You will have to provide
Any clues how to re-write it to avoid the slow sub-query.
aliases and then setup the where clause appropriately.In this example its sort of redundant, but in a real world case the query for structure_id and property_id are independent and may have nothing in common.
I am reading the query correctly in that the repeated reference to 643413 is
redundant?The lack of a defined natural primary key makes blind reasoning
difficult.
The id column is the primary key.
Tim
David J.
--
View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827275.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.
Tim Dudgeon <tdudgeon.ml@gmail.com> writes: > I'm having problems optimising a query that's very slow due to a sub-query. I think it might get better if you could fix this misestimate: > " -> Bitmap Index Scan on idx_sp_property_id > (cost=0.00..33.90 rows=1146 width=0) (actual time=51.656..51.656 rows=811892 loops=366)" > " Index Cond: (property_id = ANY ('{1,643413,1106201}'::integer[]))" 1146 estimated vs 811892 actual is pretty bad, and it doesn't seem like this is a very hard case to estimate. Are the stats for structure_props up to date? Maybe you need to increase the statistics target for the property_id column. Another component of the bad plan choice is this misestimate: > " -> HashAggregate (cost=1091.73..1091.75 rows=2 width=4) (actual time=2.829..3.212 rows=366 loops=1)" > " Group Key: structure_props_1.structure_id" but it might be harder to do anything about that one, since the result depends on the property_id being probed; without cross-column statistics it may be impossible to do much better. regards, tom lane
Tom, thanks. I did a vacuum of the table and unfortunately it didn't help. But a good spot. Tim On 17/11/2014 20:10, Tom Lane wrote: > Tim Dudgeon <tdudgeon.ml@gmail.com> writes: >> I'm having problems optimising a query that's very slow due to a sub-query. > I think it might get better if you could fix this misestimate: > >> " -> Bitmap Index Scan on idx_sp_property_id >> (cost=0.00..33.90 rows=1146 width=0) (actual time=51.656..51.656 rows=811892 loops=366)" >> " Index Cond: (property_id = ANY ('{1,643413,1106201}'::integer[]))" > 1146 estimated vs 811892 actual is pretty bad, and it doesn't seem like > this is a very hard case to estimate. Are the stats for structure_props > up to date? Maybe you need to increase the statistics target for the > property_id column. > > Another component of the bad plan choice is this misestimate: > >> " -> HashAggregate (cost=1091.73..1091.75 rows=2 width=4) (actual time=2.829..3.212 rows=366 loops=1)" >> " Group Key: structure_props_1.structure_id" > but it might be harder to do anything about that one, since the result > depends on the property_id being probed; without cross-column statistics > it may be impossible to do much better. > > regards, tom lane
Dave,
thanks for the suggestion. I was trying to work on that basis.
Eventually I got this that works quite well:
which has this plan.
It looks a little strange to me, but it works much better.
Tim
thanks for the suggestion. I was trying to work on that basis.
Eventually I got this that works quite well:
SELECT t1.id, t1.structure_id, t1.batch_id, t1.property_id, t1.property_data
FROM chemcentral.structure_props t1
WHERE t2.structure_id IN (SELECT structure_id FROM chemcentral.structure_props WHERE property_id = 643413)
AND t1.property_id IN (1, 643413, 1106201)
;
which has this plan.
"Hash Join (cost=4376.38..6539.42 rows=43 width=648) (actual time=467.265..795.887 rows=381 loops=1)"
" -> Nested Loop (cost=1092.16..1352.77 rows=507201 width=4) (actual time=0.807..84.228 rows=173867 loops=1)"
" -> HashAggregate (cost=1091.73..1091.75 rows=2 width=4) (actual time=0.779..0.897 rows=366 loops=1)"
" Group Key: structure_props.structure_id"
" -> Index Scan using idx_sp_property_id on structure_props (cost=0.43..1090.77 rows=382 width=4) (actual time=0.032..0.592 rows=369 loops=1)"
" Index Cond: (property_id = 643413)"
" -> Index Scan using idx_sp_structure_id on structure_props t2 (cost=0.43..127.34 rows=317 width=8) (actual time=0.010..0.172 rows=475 loops=366)"
" Index Cond: (structure_id = structure_props.structure_id)"
" -> Hash (cost=3269.89..3269.89 rows=1146 width=648) (actual time=464.458..464.458 rows=811892 loops=1)"
" Buckets: 1024 Batches: 32 (originally 1) Memory Usage: 4097kB"
" -> Index Scan using idx_sp_property_id on structure_props t1 (cost=0.44..3269.89 rows=1146 width=648) (actual time=0.033..231.895 rows=811892 loops=1)"
" Index Cond: (property_id = ANY ('{1,643413,1106201}'::integer[]))"
"Planning time: 0.885 ms"
It looks a little strange to me, but it works much better.
Tim
On 17/11/2014 19:19, David Johnston wrote:
Please reply to the list...In short...A natural key prevents duplicate real data which a serially generated made up key does not.David J.
On Monday, November 17, 2014, Tim Dudgeon <tdudgeon.ml@gmail.com> wrote:
On 17/11/2014 18:44, David G Johnston wrote:Tim Dudgeon wroteI'm trying to go in that direction but in the query is entirely within one table, so I need to join the table to itself? I've been trying this but not getting it to work yet.All relevant columns are indexed and using PostgreSQL 9.4.Try using an actual join instead of a subquery. You will have to provide
Any clues how to re-write it to avoid the slow sub-query.
aliases and then setup the where clause appropriately.In this example its sort of redundant, but in a real world case the query for structure_id and property_id are independent and may have nothing in common.
I am reading the query correctly in that the repeated reference to 643413 is
redundant?The lack of a defined natural primary key makes blind reasoning
difficult.
The id column is the primary key.
Tim
David J.
--
View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827275.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.
Tim Dudgeon wrote > SELECT t1.id, t1.structure_id, t1.batch_id, > t1.property_id, t1.property_data > FROM chemcentral.structure_props t1 > JOIN chemcentral.structure_props t2 ON t1.id = t2.id > WHERE t2.structure_id IN (SELECT structure_id FROM > chemcentral.structure_props WHERE property_id = 643413) > AND t1.property_id IN (1, 643413, 1106201) > ; What about: SELECT t1.id, t1.structure_id, t1.batch_id, t1.property_id, t1.property_data FROM chemcentral.structure_props t1 JOIN ( SELECT DISTINCT super.id FROM chemcentral.structure_props super WHERE super.structure_id IN ( SELECT sub.structure_id FROM chemcentral.structure_props sub WHERE sub.property_id = 643413 ) ) t2 ON (t1.id = t2.id) WHERE t1.property_id IN (1, 643413, 1106201) ; ? I do highly suggest using column table prefixes everywhere in this kind of query... Also, AND == INTERSECT so: SELECT ... FROM chemcentral.structure_props WHERE property_id IN (1,643413,1106201) INTERSECT DISTINCT SELECT ... FROM chemcentral.structure_props WHERE structure_id IN (SELECT ... WHERE property_id = 643413) You can even use CTE/WITH expressions and give these subqueries meaningful names. David J. -- View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827453.html Sent from the PostgreSQL - sql mailing list archive at Nabble.com.
Slightly off: I prefer "exists" to "join" if it's possible while on the list I almost never see any answer that uses "exists". Is my existsfixation is some kind of bad practice? Sandor Daku > On 19 Nov 2014, at 02:37, David G Johnston <david.g.johnston@gmail.com> wrote: > > Tim Dudgeon wrote >> SELECT t1.id, t1.structure_id, t1.batch_id, >> t1.property_id, t1.property_data >> FROM chemcentral.structure_props t1 >> JOIN chemcentral.structure_props t2 ON t1.id = t2.id >> WHERE t2.structure_id IN (SELECT structure_id FROM >> chemcentral.structure_props WHERE property_id = 643413) >> AND t1.property_id IN (1, 643413, 1106201) >> ; > > What about: > > SELECT t1.id, t1.structure_id, t1.batch_id, t1.property_id, t1.property_data > FROM chemcentral.structure_props t1 > JOIN ( > SELECT DISTINCT super.id FROM chemcentral.structure_props super > WHERE super.structure_id IN ( > SELECT sub.structure_id > FROM chemcentral.structure_props sub > WHERE sub.property_id = 643413 > ) > ) t2 ON (t1.id = t2.id) > WHERE t1.property_id IN (1, 643413, 1106201) > ; > > ? > > I do highly suggest using column table prefixes everywhere in this kind of > query... > > Also, AND == INTERSECT so: > > SELECT ... FROM chemcentral.structure_props WHERE property_id IN > (1,643413,1106201) > INTERSECT DISTINCT > SELECT ... FROM chemcentral.structure_props WHERE structure_id IN (SELECT > ... WHERE property_id = 643413) > > You can even use CTE/WITH expressions and give these subqueries meaningful > names. > > David J. > > > > > -- > View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827453.html > Sent from the PostgreSQL - sql mailing list archive at Nabble.com. > > > -- > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-sql
Slightly off:
I prefer "exists" to "join" if it's possible while on the list I almost never see any answer that uses "exists". Is my exists fixation is some kind of bad practice?
I tend not to use exists even where it is warranted...mostly because it just sounds wrong to me for some reason. Correlated subqueries are tougher for me to reason and understand quickly so if a join or IN version of a query will work I tend to go that way first.
The optimizer does a decent job of making exists and joins functionally equivalent. Exists and IN are not in the presence of NULL so I am often concerned about using IN instead of Exists but still end up doing it and only changing if the query performs badly.
The ON clause is more obvious than a equality condition in a where clause in a subquery. I tend to use "JOIN (subquery) alias ON" instead of exists too - partially to separate out the subsetting logic (WHERE) from the join logic (ON)
David J.
View this message in context: Re: slow sub-query problem
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.
I tried them all out. Original query: 17039ms Simple join: 889ms Join with SELECT: 1302ms (799ms without DISTINCT which I don't think is needed here) Using INTERSECT: 1454ms (1474 without DISTINCT) So with the current data the simple join and the Join with SELECT but no DISTINCT are the best. Thanks for your help with this. Tim On 19/11/2014 01:37, David G Johnston wrote: > Tim Dudgeon wrote >> SELECT t1.id, t1.structure_id, t1.batch_id, >> t1.property_id, t1.property_data >> FROM chemcentral.structure_props t1 >> JOIN chemcentral.structure_props t2 ON t1.id = t2.id >> WHERE t2.structure_id IN (SELECT structure_id FROM >> chemcentral.structure_props WHERE property_id = 643413) >> AND t1.property_id IN (1, 643413, 1106201) >> ; > What about: > > SELECT t1.id, t1.structure_id, t1.batch_id, t1.property_id, t1.property_data > FROM chemcentral.structure_props t1 > JOIN ( > SELECT DISTINCT super.id FROM chemcentral.structure_props super > WHERE super.structure_id IN ( > SELECT sub.structure_id > FROM chemcentral.structure_props sub > WHERE sub.property_id = 643413 > ) > ) t2 ON (t1.id = t2.id) > WHERE t1.property_id IN (1, 643413, 1106201) > ; > > ? > > I do highly suggest using column table prefixes everywhere in this kind of > query... > > Also, AND == INTERSECT so: > > SELECT ... FROM chemcentral.structure_props WHERE property_id IN > (1,643413,1106201) > INTERSECT DISTINCT > SELECT ... FROM chemcentral.structure_props WHERE structure_id IN (SELECT > ... WHERE property_id = 643413) > > You can even use CTE/WITH expressions and give these subqueries meaningful > names. > > David J. > > > > > -- > View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827453.html > Sent from the PostgreSQL - sql mailing list archive at Nabble.com. > >