Thread: alternate idioms for large "IN (...)" lists
I'm working on an application (originally written for Oracle) that is being brought to its knees by one commonly-executed query - finding the most common attributes for a given list of assets...

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map
WHERE asset_id IN ( <potentially several thousand asset_id values> )
GROUP BY attribute_id
HAVING COUNT(asset_id) < <cutoff value>
ORDER BY count DESC
LIMIT <limit>

where attribute_map is a many-to-many map of asset_id to attribute_id -

demo=# \d attribute_map
     Table "attribute_map"
    Column    |  Type   | Modifiers
--------------+---------+-----------
 asset_id     | integer |
 attribute_id | integer |
Indexes: am_asset_id_idx,
         am_attribute_id_idx
Unique keys: am_asset_id_attribute_id_un
Triggers: RI_ConstraintTrigger_26912,
          RI_ConstraintTrigger_26906

From what I've read here, postgresql doesn't handle IN (...) queries especially efficiently, and it looks to me like I'm being bitten by that. An EXPLAIN ANALYZE on that query with just under 40k rows in attribute_map and 667 asset_id values in the IN (...) list returns

NOTICE:  QUERY PLAN:

Limit  (cost=64361.51..64361.51 rows=200 width=8) (actual time=24431.51..24431.80 rows=200 loops=1)
  ->  Sort  (cost=64361.51..64361.51 rows=1647 width=8) (actual time=24431.50..24431.61 rows=201 loops=1)
        ->  Aggregate  (cost=0.00..64273.49 rows=1647 width=8) (actual time=55.29..24324.75 rows=1747 loops=1)
              ->  Group  (cost=0.00..64191.13 rows=16473 width=8) (actual time=2.38..24198.18 rows=21308 loops=1)
                    ->  Index Scan using am_attribute_id_idx on attribute_map  (cost=0.00..64149.94 rows=16473 width=8) (actual time=2.37..24034.97 rows=21308 loops=1)
Total runtime: 24433.20 msec

which I'm assuming means that the backend is doing a separate index lookup for each of the 667 asset_id values in the list.

Are there any standard idioms in postgres for getting around the poor handling of IN (...) lists?
Placing the list of asset_id values in a table and then joining against that runs in about 1/15 the time (postgres makes a hash of the asset_id values and then hash joins against attribute_map), but since my asset_id values come from outside the database that approach would require creating and managing a whole lot of temporary tables, which I'd rather avoid.

Much obliged for any suggestions

Russell
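Russell's faster "values in a table, then join" idiom can be sketched as follows. An in-memory SQLite database stands in for PostgreSQL here, and the data, the `wanted` table name, and the id list are all invented for illustration; only the `attribute_map` schema and the query shape come from the thread.

```python
# Sketch of the "load the id list into a table and join" idiom,
# using SQLite as a stand-in for PostgreSQL.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attribute_map (asset_id INTEGER, attribute_id INTEGER)")
# Invented demo data: asset a carries attributes 1 .. (a % 5) + 1.
cur.executemany("INSERT INTO attribute_map VALUES (?, ?)",
                [(a, t) for a in range(1, 101) for t in range(1, (a % 5) + 2)])

asset_ids = [2, 3, 5, 7, 11, 13]  # in the real app these arrive from outside the DB

# Instead of interpolating thousands of literals into IN (...),
# load the ids into a (temporary) table and join against it.
cur.execute("CREATE TEMP TABLE wanted (asset_id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO wanted VALUES (?)", [(a,) for a in asset_ids])

rows = cur.execute("""
    SELECT m.attribute_id, COUNT(m.asset_id) AS cnt
    FROM attribute_map m JOIN wanted w ON m.asset_id = w.asset_id
    GROUP BY m.attribute_id
    ORDER BY cnt DESC
""").fetchall()

# Sanity check: the join form returns the same counts as the IN (...) form.
placeholders = ",".join("?" * len(asset_ids))
in_rows = cur.execute(
    f"SELECT attribute_id, COUNT(asset_id) AS cnt FROM attribute_map "
    f"WHERE asset_id IN ({placeholders}) GROUP BY attribute_id ORDER BY cnt DESC",
    asset_ids).fetchall()
assert in_rows == rows
print(rows)
```

The two forms are semantically equivalent; the thread's point is that (on PostgreSQL 7.2) the planner hash-joins the id table but probes the index once per literal for the IN list.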
Yep, Postgresql's IN implementation is very slow. You are far better off
rewriting your query to use EXISTS or an OUTER JOIN.

See:
http://www3.us.postgresql.org/users-lounge/docs/7.2/postgres/functions-subquery.html

eg.

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map
WHERE EXISTS (SELECT asset_id FROM <potentially several thousand asset_id values>
              WHERE asset_id=...)
GROUP BY attribute_id
HAVING COUNT(asset_id) < <cutoff value>
ORDER BY count DESC
LIMIT <limit>

or something like:

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map atm RIGHT JOIN assets_map aam
  ON atm.asset_id = aam.asset_id
GROUP BY ... blah blah blah ...

Play around with these alternative syntaxes and stay away from putting
thousands of rows in IN() and you will be able to make your query run
orders of magnitude faster.

Chris

----- Original Message -----
From: "russm" <russm@icorp.com.au>
To: <pgsql-sql@postgresql.org>
Sent: Saturday, June 01, 2002 11:10 PM
Subject: [SQL] alternate idioms for large "IN (...)" lists

> Are there any standard idioms in postgres for getting around the poor
> handling of IN (...) lists?
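Chris's EXISTS sketch leaves the source of the id list as a placeholder. One concrete reading, with the ids first loaded into a table, looks like the following; SQLite again stands in for PostgreSQL and the data and `wanted` table name are invented for illustration.

```python
# Concrete EXISTS form of the query: a correlated subquery against an
# id table replaces the long IN (...) literal list.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attribute_map (asset_id INTEGER, attribute_id INTEGER)")
cur.executemany("INSERT INTO attribute_map VALUES (?, ?)",
                [(1, 10), (1, 20), (2, 10), (3, 10), (3, 30), (4, 40)])
cur.execute("CREATE TEMP TABLE wanted (asset_id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO wanted VALUES (?)", [(1,), (2,), (3,)])

rows = cur.execute("""
    SELECT m.attribute_id, COUNT(m.asset_id) AS cnt
    FROM attribute_map m
    WHERE EXISTS (SELECT 1 FROM wanted w WHERE w.asset_id = m.asset_id)
    GROUP BY m.attribute_id
    ORDER BY cnt DESC, m.attribute_id
""").fetchall()
print(rows)
```

Note the hedge Russell raises later in the thread: EXISTS only helps once the ids live in a table, so it doesn't by itself avoid the temporary-table management he wanted to escape.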
I've been playing with other ways of structuring the query to try and make the planner use a hash join, and am currently looking at replacing

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map
WHERE asset_id IN ( 564, 3789, ... , 5240 )

with

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map m,
     ( SELECT 564 AS a
       UNION SELECT 3789 AS a
       UNION ...
       UNION SELECT 5240 AS a ) a
WHERE m.asset_id = a.a

which produces a hash join between attribute_map and the list of asset_id values, and runs in ~7% of the time of the "IN (...)" version of the query. I'll need to see how it scales to our full dataset, but this may be enough of an improvement to get us by. The downside is that the queries I'm passing to the DB are bloated by a factor of 5 - I suspect I can live with that.

Christopher suggested I rewrite the query in terms of EXISTS, but I can't figure out how to do that in a way that improves the performance of this query - am I missing something here?

Just out of curiosity, why does postgresql's IN perform so poorly? Is it because IN is implemented in terms of ANY? I would expect that the best approach for an IN expression is a hash join, but that only works for = and !=, and not any of the other operators that ANY supports.

cheers

Russell

On 3/6/02 4:01 AM, "Christopher Kings-Lynne" <chriskl@familyhealth.com.au> wrote:

> Play around with these alternative syntaxes and stay away from putting
> thousands of rows in IN() and you will be able to make your query run
> orders of magnitude faster.
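Russell's UNION trick can be sketched as below: the literal id list becomes a derived table of single-row SELECTs that the planner can hash-join against. SQLite stands in for PostgreSQL 7.2 and the data is invented; the ids 564/3789/5240 are the ones from his example.

```python
# Sketch of the UNION derived-table rewrite of a large IN (...) list.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attribute_map (asset_id INTEGER, attribute_id INTEGER)")
cur.executemany("INSERT INTO attribute_map VALUES (?, ?)",
                [(564, 1), (564, 2), (3789, 1), (5240, 1), (5240, 3), (99, 4)])

asset_ids = [564, 3789, 5240]

# Build "SELECT 564 AS a UNION SELECT 3789 AS a UNION ..." from the id list.
# Safe to interpolate here only because the values are known integers; real
# code should validate or parameterize anything that comes from outside.
derived = " UNION ".join(f"SELECT {i} AS a" for i in asset_ids)

rows = cur.execute(f"""
    SELECT m.attribute_id, COUNT(m.asset_id) AS cnt
    FROM attribute_map m, ({derived}) a
    WHERE m.asset_id = a.a
    GROUP BY m.attribute_id
    ORDER BY cnt DESC, m.attribute_id
""").fetchall()
print(rows)
```

This keeps everything in a single statement (no temporary table to create and drop), at the cost of the ~5x larger query text Russell mentions.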
russm <russm@icorp.com.au> writes:
> Just out of curiosity, why does postgresql's IN perform so poorly?

Because no one's gotten around to improving it. The current implementation is actually just fine for small numbers of values in the IN-list. Making it work nicely for large numbers of values without pessimizing the small-list case would involve supporting (at least) two implementations and teaching the planner how to choose one. This definitely should be done, but has not gotten to the top of anyone's to-do queue.

Most of the complaints we've seen about IN performance have to do with the IN-subselect case, which is yet a different animal. It's likely that that will get fixed before anyone worries about the large-number-of-scalar-values case...

			regards, tom lane