Thread: slow sub-query problem

slow sub-query problem

From

Tim Dudgeon

Date:

17 November 2014, 18:34:44

I'm having problems optimising a query that's very slow due to a sub-query.
The query is this:

SELECT  structure_id, batch_id, property_id, property_data
FROM chemcentral.structure_props
WHERE structure_id IN (SELECT structure_id FROM 
chemcentral.structure_props WHERE property_id = 643413)
AND property_id IN (1, 643413, 1106201);

and it takes 18s to execute.
It I replace the sub-query with the inlined 369 values so that the 4th 
line looks like this:

WHERE structure_id IN (1122687,309004,306064 ...)

it takes a few ms.

The plans are:
1. sub-query

"Nested Loop  (cost=1132.97..1182.28 rows=43 width=644) (actual 
time=70.926..18937.669 rows=381 loops=1)"
"  ->  HashAggregate  (cost=1091.73..1091.75 rows=2 width=4) (actual 
time=2.829..3.212 rows=366 loops=1)"
"        Group Key: structure_props_1.structure_id"
"        ->  Index Scan using idx_sp_property_id on structure_props 
structure_props_1  (cost=0.43..1090.77 rows=382 width=4) (actual 
time=0.033..2.380 rows=369 loops=1)"
"              Index Cond: (property_id = 643413)"
"  ->  Bitmap Heap Scan on structure_props  (cost=41.24..45.26 rows=1 
width=644) (actual time=51.726..51.727 rows=1 loops=366)"
"        Recheck Cond: ((structure_id = structure_props_1.structure_id) 
AND (property_id = ANY ('{1,643413,1106201}'::integer[])))"
"        Heap Blocks: exact=381"
"        ->  BitmapAnd  (cost=41.24..41.24 rows=1 width=0) (actual 
time=51.714..51.714 rows=0 loops=366)"
"              ->  Bitmap Index Scan on idx_sp_structure_id  (cost=0.00..6.80 rows=317 width=0) (actual
time=0.046..0.046rows=475 
 
loops=366)"
"                    Index Cond: (structure_id = 
structure_props_1.structure_id)"
"              ->  Bitmap Index Scan on idx_sp_property_id  (cost=0.00..33.90 rows=1146 width=0) (actual
time=51.656..51.656
 
rows=811892 loops=366)"
"                    Index Cond: (property_id = ANY 
('{1,643413,1106201}'::integer[]))"
"Planning time: 0.497 ms"
"Execution time: 18937.868 ms"


2. inlined values

"Bitmap Heap Scan on structure_props  (cost=2600.48..2645.29 rows=10 
width=644) (actual time=71.676..72.724 rows=381 loops=1)"
"  Recheck Cond: ((property_id = ANY ('{1,643413,1106201}'::integer[])) 
AND (structure_id = ANY 

('{1122687,309004,306064,278852,234066,1122645,412925,280033,423990,568929,448302,278487,278955,40430,40430,467979,467508,288413,289746,306073,355352,265583,4779

(...)"
"  Heap Blocks: exact=381"
"  ->  BitmapAnd  (cost=2600.48..2600.48 rows=10 width=0) (actual 
time=71.608..71.608 rows=0 loops=1)"
"        ->  Bitmap Index Scan on idx_sp_property_id  (cost=0.00..33.90 
rows=1146 width=0) (actual time=54.614..54.614 rows=811892 loops=1)"
"              Index Cond: (property_id = ANY 
('{1,643413,1106201}'::integer[]))"
"        ->  Bitmap Index Scan on idx_sp_structure_id  (cost=0.00..2566.32 rows=117367 width=0) (actual
time=14.487..14.487
 
rows=173867 loops=1)"
"              Index Cond: (structure_id = ANY 

('{1122687,309004,306064,278852,234066,1122645,412925,280033,423990,568929,448302,278487,278955,40430,40430,467979,467508,288413,289746,306073,355352,265583,477941,326652,326602,233964,15338,397586,1122647,3088

(...)"
"Planning time: 1.052 ms"
"Execution time: 72.858 ms"

Table is like this:

CREATE TABLE chemcentral.structure_props
(  id serial NOT NULL,  source_id integer NOT NULL,  structure_id integer NOT NULL,  batch_id character varying(16),
parent_idinteger,  property_id integer NOT NULL,  property_data jsonb,  CONSTRAINT structure_props_pkey PRIMARY KEY
(id)
)


All relevant columns are indexed and using PostgreSQL 9.4.
Any clues how to re-write it to avoid the slow sub-query.

Many thanks
Tim

Re: slow sub-query problem

From

David G Johnston

Date:

17 November 2014, 18:44:51

Tim Dudgeon wrote
> All relevant columns are indexed and using PostgreSQL 9.4.
> Any clues how to re-write it to avoid the slow sub-query.

Try using an actual join instead of a subquery.  You will have to provide
aliases and then setup the where clause appropriately.

I am reading the query correctly in that the repeated reference to 643413 is
redundant?  The lack of a defined natural primary key makes blind reasoning
difficult.

David J.





--
View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827275.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

Re: slow sub-query problem

From

David Johnston

Date:

17 November 2014, 19:19:56

Please reply to the list...

In short...

tablea as t1 join tablea as t2 on t1.id = t2.id

A natural key prevents duplicate real data which a serially generated made up key does not.

David J.

On Monday, November 17, 2014, Tim Dudgeon <tdudgeon.ml@gmail.com> wrote:

On 17/11/2014 18:44, David G Johnston wrote:
Tim Dudgeon wrote
All relevant columns are indexed and using PostgreSQL 9.4.
Any clues how to re-write it to avoid the slow sub-query.
Try using an actual join instead of a subquery. You will have to provide
aliases and then setup the where clause appropriately.
I'm trying to go in that direction but in the query is entirely within one table, so I need to join the table to itself? I've been trying this but not getting it to work yet.

I am reading the query correctly in that the repeated reference to 643413 is
redundant?
In this example its sort of redundant, but in a real world case the query for structure_id and property_id are independent and may have nothing in common.

The lack of a defined natural primary key makes blind reasoning
difficult.

The id column is the primary key.

Tim

David J.

--
View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827275.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

Re: slow sub-query problem

From

Tom Lane

Date:

17 November 2014, 20:10:56

Tim Dudgeon <tdudgeon.ml@gmail.com> writes:
> I'm having problems optimising a query that's very slow due to a sub-query.

I think it might get better if you could fix this misestimate:

> "              ->  Bitmap Index Scan on idx_sp_property_id 
>   (cost=0.00..33.90 rows=1146 width=0) (actual time=51.656..51.656 rows=811892 loops=366)"
> "                    Index Cond: (property_id = ANY ('{1,643413,1106201}'::integer[]))"

1146 estimated vs 811892 actual is pretty bad, and it doesn't seem like
this is a very hard case to estimate.  Are the stats for structure_props
up to date?  Maybe you need to increase the statistics target for the
property_id column.

Another component of the bad plan choice is this misestimate:

> "  ->  HashAggregate  (cost=1091.73..1091.75 rows=2 width=4) (actual time=2.829..3.212 rows=366 loops=1)"
> "        Group Key: structure_props_1.structure_id"

but it might be harder to do anything about that one, since the result
depends on the property_id being probed; without cross-column statistics
it may be impossible to do much better.
        regards, tom lane

Re: slow sub-query problem

From

Tim Dudgeon

Date:

18 November 2014, 08:53:45

Tom,

thanks. I did a vacuum of the table and unfortunately it didn't help.
But a good spot.

Tim


On 17/11/2014 20:10, Tom Lane wrote:
> Tim Dudgeon <tdudgeon.ml@gmail.com> writes:
>> I'm having problems optimising a query that's very slow due to a sub-query.
> I think it might get better if you could fix this misestimate:
>
>> "              ->  Bitmap Index Scan on idx_sp_property_id
>>    (cost=0.00..33.90 rows=1146 width=0) (actual time=51.656..51.656 rows=811892 loops=366)"
>> "                    Index Cond: (property_id = ANY ('{1,643413,1106201}'::integer[]))"
> 1146 estimated vs 811892 actual is pretty bad, and it doesn't seem like
> this is a very hard case to estimate.  Are the stats for structure_props
> up to date?  Maybe you need to increase the statistics target for the
> property_id column.
>
> Another component of the bad plan choice is this misestimate:
>
>> "  ->  HashAggregate  (cost=1091.73..1091.75 rows=2 width=4) (actual time=2.829..3.212 rows=366 loops=1)"
>> "        Group Key: structure_props_1.structure_id"
> but it might be harder to do anything about that one, since the result
> depends on the property_id being probed; without cross-column statistics
> it may be impossible to do much better.
>
>             regards, tom lane

Re: slow sub-query problem

From

Tim Dudgeon

Date:

18 November 2014, 08:57:31

Dave,

thanks for the suggestion. I was trying to work on that basis.
Eventually I got this that works quite well:

SELECT t1.id, t1.structure_id, t1.batch_id, t1.property_id, t1.property_data

FROM chemcentral.structure_props t1

JOIN chemcentral.structure_props t2 ON t1.id = t2.id

WHERE t2.structure_id IN (SELECT structure_id FROM chemcentral.structure_props WHERE property_id = 643413)

AND t1.property_id IN (1, 643413, 1106201)

;

which has this plan.

"Hash Join (cost=4376.38..6539.42 rows=43 width=648) (actual time=467.265..795.887 rows=381 loops=1)"

" Hash Cond: (t2.id = t1.id)"

" -> Nested Loop (cost=1092.16..1352.77 rows=507201 width=4) (actual time=0.807..84.228 rows=173867 loops=1)"

" -> HashAggregate (cost=1091.73..1091.75 rows=2 width=4) (actual time=0.779..0.897 rows=366 loops=1)"

" Group Key: structure_props.structure_id"

" -> Index Scan using idx_sp_property_id on structure_props (cost=0.43..1090.77 rows=382 width=4) (actual time=0.032..0.592 rows=369 loops=1)"

" Index Cond: (property_id = 643413)"

" -> Index Scan using idx_sp_structure_id on structure_props t2 (cost=0.43..127.34 rows=317 width=8) (actual time=0.010..0.172 rows=475 loops=366)"

" Index Cond: (structure_id = structure_props.structure_id)"

" -> Hash (cost=3269.89..3269.89 rows=1146 width=648) (actual time=464.458..464.458 rows=811892 loops=1)"

" Buckets: 1024 Batches: 32 (originally 1) Memory Usage: 4097kB"

" -> Index Scan using idx_sp_property_id on structure_props t1 (cost=0.44..3269.89 rows=1146 width=648) (actual time=0.033..231.895 rows=811892 loops=1)"

" Index Cond: (property_id = ANY ('{1,643413,1106201}'::integer[]))"

"Planning time: 0.885 ms"

It looks a little strange to me, but it works much better.

Tim

On 17/11/2014 19:19, David Johnston wrote:

Please reply to the list...

In short...

tablea as t1 join tablea as t2 on t1.id = t2.id

A natural key prevents duplicate real data which a serially generated made up key does not.

David J.

On Monday, November 17, 2014, Tim Dudgeon <tdudgeon.ml@gmail.com> wrote:

On 17/11/2014 18:44, David G Johnston wrote:
Tim Dudgeon wrote
All relevant columns are indexed and using PostgreSQL 9.4.
Any clues how to re-write it to avoid the slow sub-query.
Try using an actual join instead of a subquery. You will have to provide
aliases and then setup the where clause appropriately.
I'm trying to go in that direction but in the query is entirely within one table, so I need to join the table to itself? I've been trying this but not getting it to work yet.

I am reading the query correctly in that the repeated reference to 643413 is
redundant?
In this example its sort of redundant, but in a real world case the query for structure_id and property_id are independent and may have nothing in common.

The lack of a defined natural primary key makes blind reasoning
difficult.

The id column is the primary key.

Tim

David J.

--
View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827275.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

Re: slow sub-query problem

From

David G Johnston

Date:

19 November 2014, 01:37:07

Tim Dudgeon wrote
> SELECT t1.id, t1.structure_id, t1.batch_id, 
> t1.property_id, t1.property_data
> FROM chemcentral.structure_props t1
> JOIN chemcentral.structure_props t2 ON t1.id = t2.id 
> WHERE t2.structure_id IN (SELECT structure_id FROM 
> chemcentral.structure_props WHERE property_id = 643413)
> AND t1.property_id IN (1, 643413, 1106201)
> ;

What about:

SELECT t1.id, t1.structure_id, t1.batch_id, t1.property_id, t1.property_data
FROM chemcentral.structure_props t1
JOIN (
SELECT DISTINCT super.id FROM chemcentral.structure_props super
WHERE super.structure_id IN (
SELECT sub.structure_id
FROM chemcentral.structure_props sub
WHERE sub.property_id = 643413
)
) t2 ON (t1.id = t2.id) 
WHERE t1.property_id IN (1, 643413, 1106201)
;

?

I do highly suggest using column table prefixes everywhere in this kind of
query...

Also, AND == INTERSECT so:

SELECT ... FROM chemcentral.structure_props WHERE property_id IN
(1,643413,1106201)
INTERSECT DISTINCT
SELECT ... FROM chemcentral.structure_props WHERE structure_id IN (SELECT
... WHERE property_id = 643413)

You can even use CTE/WITH expressions and give these subqueries meaningful
names.

David J.




--
View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827453.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

Re: slow sub-query problem

From

daku.sandor@gmail.com

Date:

19 November 2014, 17:48:23

Slightly off:

I prefer "exists" to "join" if it's possible while on the list I almost never see any answer that uses "exists". Is my
existsfixation is some kind of bad practice? 

Sandor Daku

> On 19 Nov 2014, at 02:37, David G Johnston <david.g.johnston@gmail.com> wrote:
>
> Tim Dudgeon wrote
>> SELECT t1.id, t1.structure_id, t1.batch_id,
>> t1.property_id, t1.property_data
>> FROM chemcentral.structure_props t1
>> JOIN chemcentral.structure_props t2 ON t1.id = t2.id
>> WHERE t2.structure_id IN (SELECT structure_id FROM
>> chemcentral.structure_props WHERE property_id = 643413)
>> AND t1.property_id IN (1, 643413, 1106201)
>> ;
>
> What about:
>
> SELECT t1.id, t1.structure_id, t1.batch_id, t1.property_id, t1.property_data
> FROM chemcentral.structure_props t1
> JOIN (
> SELECT DISTINCT super.id FROM chemcentral.structure_props super
> WHERE super.structure_id IN (
> SELECT sub.structure_id
> FROM chemcentral.structure_props sub
> WHERE sub.property_id = 643413
> )
> ) t2 ON (t1.id = t2.id)
> WHERE t1.property_id IN (1, 643413, 1106201)
> ;
>
> ?
>
> I do highly suggest using column table prefixes everywhere in this kind of
> query...
>
> Also, AND == INTERSECT so:
>
> SELECT ... FROM chemcentral.structure_props WHERE property_id IN
> (1,643413,1106201)
> INTERSECT DISTINCT
> SELECT ... FROM chemcentral.structure_props WHERE structure_id IN (SELECT
> ... WHERE property_id = 643413)
>
> You can even use CTE/WITH expressions and give these subqueries meaningful
> names.
>
> David J.
>
>
>
>
> --
> View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827453.html
> Sent from the PostgreSQL - sql mailing list archive at Nabble.com.
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql

Re: slow sub-query problem

From

David G Johnston

Date:

19 November 2014, 17:58:52

On Wed, Nov 19, 2014 at 10:48 AM, daku.sandor [via PostgreSQL] <[hidden email]> wrote:

Slightly off:

I prefer "exists" to "join" if it's possible while on the list I almost never see any answer that uses "exists". Is my exists fixation is some kind of bad practice?

I tend not to use exists even where it is warranted...mostly because it just sounds wrong to me for some reason. Correlated subqueries are tougher for me to reason and understand quickly so if a join or IN version of a query will work I tend to go that way first.

The optimizer does a decent job of making exists and joins functionally equivalent. Exists and IN are not in the presence of NULL so I am often concerned about using IN instead of Exists but still end up doing it and only changing if the query performs badly.

The ON clause is more obvious than a equality condition in a where clause in a subquery. I tend to use "JOIN (subquery) alias ON" instead of exists too - partially to separate out the subsetting logic (WHERE) from the join logic (ON)

David J.

View this message in context: Re: slow sub-query problem
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

Re: slow sub-query problem

From

Tim Dudgeon

Date:

20 November 2014, 14:12:10

I tried them all out.

Original query: 17039ms
Simple join: 889ms
Join with SELECT: 1302ms (799ms without DISTINCT which I don't think is 
needed here)
Using INTERSECT: 1454ms (1474 without DISTINCT)

So with the current data the simple join and the Join with SELECT but no 
DISTINCT are the best.

Thanks for your help with this.

Tim



On 19/11/2014 01:37, David G Johnston wrote:
> Tim Dudgeon wrote
>> SELECT t1.id, t1.structure_id, t1.batch_id,
>> t1.property_id, t1.property_data
>> FROM chemcentral.structure_props t1
>> JOIN chemcentral.structure_props t2 ON t1.id = t2.id
>> WHERE t2.structure_id IN (SELECT structure_id FROM
>> chemcentral.structure_props WHERE property_id = 643413)
>> AND t1.property_id IN (1, 643413, 1106201)
>> ;
> What about:
>
> SELECT t1.id, t1.structure_id, t1.batch_id, t1.property_id, t1.property_data
> FROM chemcentral.structure_props t1
> JOIN (
> SELECT DISTINCT super.id FROM chemcentral.structure_props super
> WHERE super.structure_id IN (
> SELECT sub.structure_id
> FROM chemcentral.structure_props sub
> WHERE sub.property_id = 643413
> )
> ) t2 ON (t1.id = t2.id)
> WHERE t1.property_id IN (1, 643413, 1106201)
> ;
>
> ?
>
> I do highly suggest using column table prefixes everywhere in this kind of
> query...
>
> Also, AND == INTERSECT so:
>
> SELECT ... FROM chemcentral.structure_props WHERE property_id IN
> (1,643413,1106201)
> INTERSECT DISTINCT
> SELECT ... FROM chemcentral.structure_props WHERE structure_id IN (SELECT
> ... WHERE property_id = 643413)
>
> You can even use CTE/WITH expressions and give these subqueries meaningful
> names.
>
> David J.
>
>
>
>
> --
> View this message in context: http://postgresql.nabble.com/slow-sub-query-problem-tp5827273p5827453.html
> Sent from the PostgreSQL - sql mailing list archive at Nabble.com.
>
>