Thread: alternate idioms for large "IN (...)" lists
I'm working on an application (originally written for Oracle) that is being brought to its knees by one commonly-executed query - finding the most common attributes for a given list of assets...

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map
WHERE asset_id IN ( <potentially several thousand asset_id values> )
GROUP BY attribute_id
HAVING COUNT(asset_id) < <cutoff value>
ORDER BY count DESC
LIMIT <limit>

where attribute_map is a many-to-many map of asset_id to attribute_id -

demo=# \d attribute_map
     Table "attribute_map"
    Column    |  Type   | Modifiers
--------------+---------+-----------
 asset_id     | integer |
 attribute_id | integer |
Indexes: am_asset_id_idx,
         am_attribute_id_idx
Unique keys: am_asset_id_attribute_id_un
Triggers: RI_ConstraintTrigger_26912,
          RI_ConstraintTrigger_26906

From what I've read here, postgresql doesn't handle IN (...) queries especially efficiently, and it looks to me like I'm being bitten by that. An EXPLAIN ANALYZE on that query with just under 40k rows in attribute_map and 667 asset_id values in the IN (...) list returns

NOTICE:  QUERY PLAN:

Limit  (cost=64361.51..64361.51 rows=200 width=8) (actual time=24431.51..24431.80 rows=200 loops=1)
  ->  Sort  (cost=64361.51..64361.51 rows=1647 width=8) (actual time=24431.50..24431.61 rows=201 loops=1)
        ->  Aggregate  (cost=0.00..64273.49 rows=1647 width=8) (actual time=55.29..24324.75 rows=1747 loops=1)
              ->  Group  (cost=0.00..64191.13 rows=16473 width=8) (actual time=2.38..24198.18 rows=21308 loops=1)
                    ->  Index Scan using am_attribute_id_idx on attribute_map  (cost=0.00..64149.94 rows=16473 width=8) (actual time=2.37..24034.97 rows=21308 loops=1)
Total runtime: 24433.20 msec

which I'm assuming means that the backend is doing a separate index lookup for each of the 667 asset_id values in the list.

Are there any standard idioms in postgres for getting around the poor handling of IN (...) lists?
Placing the list of asset_id values in a table and then joining against that runs in about 1/15 the time (postgres makes a hash of the asset_id values and then hash joins against attribute_map), but since my asset_id values come from outside the database that approach would require creating and managing a whole lot of temporary tables, which I'd rather avoid.

Much obliged for any suggestions

Russell
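Russell's faster "values in a table, then join" idiom can be sketched as follows. An in-memory SQLite database stands in for PostgreSQL here, and the data, the `wanted` table name, and the id list are all invented for illustration; only the `attribute_map` schema and the query shape come from the thread.

```python
# Sketch of the "load the id list into a table and join" idiom,
# using SQLite as a stand-in for PostgreSQL.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attribute_map (asset_id INTEGER, attribute_id INTEGER)")
# Invented demo data: asset a carries attributes 1 .. (a % 5) + 1.
cur.executemany("INSERT INTO attribute_map VALUES (?, ?)",
                [(a, t) for a in range(1, 101) for t in range(1, (a % 5) + 2)])

asset_ids = [2, 3, 5, 7, 11, 13]  # in the real app these arrive from outside the DB

# Instead of interpolating thousands of literals into IN (...),
# load the ids into a (temporary) table and join against it.
cur.execute("CREATE TEMP TABLE wanted (asset_id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO wanted VALUES (?)", [(a,) for a in asset_ids])

rows = cur.execute("""
    SELECT m.attribute_id, COUNT(m.asset_id) AS cnt
    FROM attribute_map m JOIN wanted w ON m.asset_id = w.asset_id
    GROUP BY m.attribute_id
    ORDER BY cnt DESC
""").fetchall()

# Sanity check: the join form returns the same counts as the IN (...) form.
placeholders = ",".join("?" * len(asset_ids))
in_rows = cur.execute(
    f"SELECT attribute_id, COUNT(asset_id) AS cnt FROM attribute_map "
    f"WHERE asset_id IN ({placeholders}) GROUP BY attribute_id ORDER BY cnt DESC",
    asset_ids).fetchall()
assert in_rows == rows
print(rows)
```

The two forms are semantically equivalent; the thread's point is that (on PostgreSQL 7.2) the planner hash-joins the id table but probes the index once per literal for the IN list.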
Yep, Postgresql's IN implementation is very slow. You are far better off
rewriting your query to use EXISTS or an OUTER JOIN.

See:
http://www3.us.postgresql.org/users-lounge/docs/7.2/postgres/functions-subquery.html

eg.

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map
WHERE EXISTS (SELECT asset_id FROM <potentially several thousand asset_id values>
              WHERE asset_id=...)
GROUP BY attribute_id
HAVING COUNT(asset_id) < <cutoff value>
ORDER BY count DESC
LIMIT <limit>

or something like:

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map atm RIGHT JOIN assets_map aam
  ON atm.asset_id = aam.asset_id
GROUP BY ... blah blah blah ...

Play around with these alternative syntaxes and stay away from putting
thousands of rows in IN() and you will be able to make your query run
orders of magnitude faster.

Chris

----- Original Message -----
From: "russm" <russm@icorp.com.au>
To: <pgsql-sql@postgresql.org>
Sent: Saturday, June 01, 2002 11:10 PM
Subject: [SQL] alternate idioms for large "IN (...)" lists

> Are there any standard idioms in postgres for getting around the poor
> handling of IN (...) lists?
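Chris's EXISTS sketch leaves the source of the id list as a placeholder. One concrete reading, with the ids first loaded into a table, looks like the following; SQLite again stands in for PostgreSQL and the data and `wanted` table name are invented for illustration.

```python
# Concrete EXISTS form of the query: a correlated subquery against an
# id table replaces the long IN (...) literal list.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attribute_map (asset_id INTEGER, attribute_id INTEGER)")
cur.executemany("INSERT INTO attribute_map VALUES (?, ?)",
                [(1, 10), (1, 20), (2, 10), (3, 10), (3, 30), (4, 40)])
cur.execute("CREATE TEMP TABLE wanted (asset_id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO wanted VALUES (?)", [(1,), (2,), (3,)])

rows = cur.execute("""
    SELECT m.attribute_id, COUNT(m.asset_id) AS cnt
    FROM attribute_map m
    WHERE EXISTS (SELECT 1 FROM wanted w WHERE w.asset_id = m.asset_id)
    GROUP BY m.attribute_id
    ORDER BY cnt DESC, m.attribute_id
""").fetchall()
print(rows)
```

Note the hedge Russell raises later in the thread: EXISTS only helps once the ids live in a table, so it doesn't by itself avoid the temporary-table management he wanted to escape.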
I've been playing with other ways of structuring the query to try and make the planner use a hash join, and am currently looking at replacing

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map
WHERE asset_id IN ( 564, 3789, ... , 5240 )

with

SELECT attribute_id, COUNT(asset_id)
FROM attribute_map m,
     ( SELECT 564 AS a
       UNION SELECT 3789 AS a
       UNION ...
       UNION SELECT 5240 AS a ) a
WHERE m.asset_id = a.a

which produces a hash join between attribute_map and the list of asset_id values, and runs in ~7% of the time of the "IN (...)" version of the query. I'll need to see how it scales to our full dataset, but this may be enough of an improvement to get us by. The downside is that the queries I'm passing to the DB are bloated by a factor of 5 - I suspect I can live with that.

Christopher suggested I rewrite the query in terms of EXISTS, but I can't figure out how to do that in a way that improves the performance of this query - am I missing something here?

Just out of curiosity, why does postgresql's IN perform so poorly? Is it because IN is implemented in terms of ANY? I would expect that the best approach for an IN expression is a hash join, but that only works for = and !=, and not any of the other operators that ANY supports.

cheers

Russell

On 3/6/02 4:01 AM, "Christopher Kings-Lynne" <chriskl@familyhealth.com.au> wrote:

> Play around with these alternative syntaxes and stay away from putting
> thousands of rows in IN() and you will be able to make your query run
> orders of magnitude faster.
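Russell's UNION trick can be sketched as below: the literal id list becomes a derived table of single-row SELECTs that the planner can hash-join against. SQLite stands in for PostgreSQL 7.2 and the data is invented; the ids 564/3789/5240 are the ones from his example.

```python
# Sketch of the UNION derived-table rewrite of a large IN (...) list.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE attribute_map (asset_id INTEGER, attribute_id INTEGER)")
cur.executemany("INSERT INTO attribute_map VALUES (?, ?)",
                [(564, 1), (564, 2), (3789, 1), (5240, 1), (5240, 3), (99, 4)])

asset_ids = [564, 3789, 5240]

# Build "SELECT 564 AS a UNION SELECT 3789 AS a UNION ..." from the id list.
# Safe to interpolate here only because the values are known integers; real
# code should validate or parameterize anything that comes from outside.
derived = " UNION ".join(f"SELECT {i} AS a" for i in asset_ids)

rows = cur.execute(f"""
    SELECT m.attribute_id, COUNT(m.asset_id) AS cnt
    FROM attribute_map m, ({derived}) a
    WHERE m.asset_id = a.a
    GROUP BY m.attribute_id
    ORDER BY cnt DESC, m.attribute_id
""").fetchall()
print(rows)
```

This keeps everything in a single statement (no temporary table to create and drop), at the cost of the ~5x larger query text Russell mentions.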
russm <russm@icorp.com.au> writes:
> Just out of curiosity, why does postgresql's IN perform so poorly?

Because no one's gotten around to improving it. The current implementation is actually just fine for small numbers of values in the IN-list. Making it work nicely for large numbers of values without pessimizing the small-list case would involve supporting (at least) two implementations and teaching the planner how to choose one. This definitely should be done, but has not gotten to the top of anyone's to-do queue.

Most of the complaints we've seen about IN performance have to do with the IN-subselect case, which is yet a different animal. It's likely that that will get fixed before anyone worries about the large-number-of-scalar-values case...

			regards, tom lane