Thread: Have I b0rked something? Slow comparisons on "where x in (...)"
Postgres version 8.0.9 on Solaris 2.8.  I know it's old but...

I have a table with a million rows.

I need to select data from this table based on an indexed column; I need
to select 600 possible values from the column, returning around 24,000
rows of data.

In perl I have a hash which has 600 key values in it.

So I did:

  "select stuff from table where index_key in (" .
     join(",",keys %hash) . ") AND non_index_row in ('xyz','abc','def')"

And in the perl

  while(fetch())
  {
    do_stuff
  }

This resulted in a query string which executed in 12 minutes.  If I did
an "explain" on the query string then I could see it was being expanded
to 600 OR statements:

  where (index_key = 1) OR (index_key = 2) OR .....

Now as an alternative option I did

  select stuff from table where non_index_row in ('xyz','abc','def')

and in the perl

  while(fetch())
  {
    next unless $hash{$_->{index_key}};
    do_stuff
  }

To me this should be slower, since we're selecting more rows, throwing
the data back to the perl and then discarding the values I didn't want.

Imagine my surprise when the result took 3 minutes to execute.

Have I broken something, somewhere?  Or are IN comparisons really that
slow?

For what it's worth, a simple

  explain select count(*) from table where index_key in (1,2,3,4,....)

uses the index up to 156 values but switches to a sequential scan when
there are 157 or more values in the query.

Any thoughts?  I fear my poor tuning attempts may have caused other
slowdowns!

--

rgds
Stephen
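[Editorial aside: the string-concatenation pattern above can also be written
with bound placeholders.  A minimal sketch, in Python with an in-memory
SQLite database standing in for Postgres; the table, columns, and data are
all invented for the example:]

```python
import sqlite3

# SQLite is only a stand-in for Postgres here; names and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (index_key INTEGER, non_index_row TEXT, stuff TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, "xyz", "a"), (2, "abc", "b"), (3, "zzz", "c"), (4, "def", "d")])

keys = {1: True, 4: True}          # the 600-key Perl hash, in miniature
placeholders = ",".join("?" * len(keys))   # "?,?"
sql = (f"SELECT stuff FROM t WHERE index_key IN ({placeholders}) "
       "AND non_index_row IN ('xyz','abc','def')")
rows = [r[0] for r in conn.execute(sql, list(keys))]
print(rows)   # rows whose index_key is in the hash: ['a', 'd']
```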
Have you done a vacuum on the table recently?

I would be curious to see how:

  select stuff from table where index_key = <key1>
    AND non_index_row in ('xyz','abc','def')
  UNION ALL
  select stuff from table where index_key = <key2>
    AND non_index_row in ('xyz','abc','def')
  ...
  UNION ALL
  select stuff from table where index_key = <key600>
    AND non_index_row in ('xyz','abc','def')

performs by comparison.  If, after a vacuum, it performs better than the
IN list, then the IN list might benefit from a bit of analysis for
better tuning chances.

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Stephen Harris
> Sent: Wednesday, May 02, 2007 11:32 AM
> To: Postgres General
> Subject: [GENERAL] Have I b0rked something? Slow comparisons on "where x
> in (...)"
On Wed, May 02, 2007 at 12:45:08PM -0700, Dann Corbit wrote:
> Have you done a vacuum on the table recently?

We vacuum daily and cluster weekly after the nightly activities have
been performed.

> IN list, then the IN list might benefit from a bit of analysis for

The IN list is just a set of integers (it's an integer index) generated
from some selects on other tables earlier in our processing.  I don't
have any choice as to what is in the IN list :-)

--

rgds
Stephen
Stephen Harris <lists@spuddy.org> writes:
> Postgres version 8.0.9 on Solaris 2.8.  I know it's old but...
> I have a table with a million rows.
> I need to select data from this table based on an indexed column; I need
> to select 600 possible values from the column, returning around 24,000
> rows of data.
> So I did:
>   "select stuff from table where index_key in (" .
>      join(",",keys %hash) . ") AND non_index_row in ('xyz','abc','def')
> This resulted in a query string which executed in 12 minutes.  If I
> did an "explain" on the query string then I can see it was being expanded
> to 600 OR statements
>   where (index_key = 1) OR (index_key = 2) OR .....

In what, a seq scan?  That plan will require executing 600 integer
comparisons at each of a million rows, with only some trivial fraction
avoided because of early success.  So it works out that your machine is
able to do something over 800K such comparisons per second, which seems
a bit slow for any modern machine ... but I note 8.0 didn't have any of
the "virtual slot" optimizations added in later releases, and is doing a
fresh heap_getattr() for each access to the variable.  If it's having to
grovel over a lot of variable-width fields to get to that field each
time, I can see where the time might get eaten up.  Where is the
index_key column in the tuples, exactly?

			regards, tom lane
Stephen Harris wrote:
> On Wed, May 02, 2007 at 12:45:08PM -0700, Dann Corbit wrote:
>> Have you done a vacuum on the table recently?
>
> We vacuum daily and cluster weekly after the nightly activities have been
> performed.
>
>> IN list, then the IN list might benefit from a bit of analysis for
>
> The IN list is just a set of integers (it's an integer index) generated
> from some selects on other tables earlier in our processing.  I don't
> have any choice as to what is in the IN list :-)

Try creating a temporary table, populating it with the list and joining
against it.  That's probably your best bet for a long list of target
values.

--
  Richard Huxton
  Archonet Ltd
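[Editorial aside: a minimal sketch of that temp-table-and-join approach, in
Python with SQLite standing in for Postgres; table names, key values, and
data are invented.  Against a real 8.0 server you would CREATE TEMP TABLE
and then COPY or INSERT the 600 keys in:]

```python
import sqlite3

# SQLite stands in for Postgres; names and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (index_key INTEGER, stuff TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(10)])

# Load the target keys (3 here, 600 in the thread) into a temp table once...
conn.execute("CREATE TEMP TABLE wanted (index_key INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)", [(2,), (5,), (7,)])

# ...then join against it instead of issuing a long IN (...) list.
rows = [r[0] for r in conn.execute(
    "SELECT t.stuff FROM t JOIN wanted USING (index_key) "
    "ORDER BY t.index_key")]
print(rows)  # ['row2', 'row5', 'row7']
```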
On Wed, May 02, 2007 at 05:59:49PM -0400, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:
> >   "select stuff from table where index_key in (" .
> >      join(",",keys %hash) . ") AND non_index_row in ('xyz','abc','def')
>
> In what, a seq scan?

Yeah, once the number of comparisons exceeded 156 it switched from an
index scan to a sequential scan.

> time, I can see where the time might get eaten up.  Where is the
> index_key column in the tuples, exactly?

First column.

--

rgds
Stephen
> Try creating a temporary table, populating with the list and joining
> against it.  That's probably your best bet for a long list of target
> values.

Check:

forum_bench=> CREATE TABLE test (value INTEGER NOT NULL);
CREATE TABLE
forum_bench=> INSERT INTO test SELECT * FROM generate_series( 1, 1000000 );
INSERT 0 1000000
forum_bench=> ANALYZE test;

forum_bench=> EXPLAIN ANALYZE SELECT * FROM test;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..14405.24 rows=999924 width=4) (actual time=0.030..349.699 rows=1000000 loops=1)
 Total runtime: 542.914 ms
(2 rows)

OK: 542 ms to grab the data.

IN():

EXPLAIN ANALYZE SELECT * FROM test WHERE value IN ( 1000 values from 0 to 999000 in steps of 1000 ):

 Seq Scan on test  (cost=0.00..1264310.24 rows=1000 width=4) (actual time=17.649..17977.085 rows=999 loops=1)
   Filter: (value = ANY ('{0,1000..........99000}'::integer[]))
 Total runtime: 17978.061 ms

Ouch.

forum_bench=> EXPLAIN ANALYZE SELECT * FROM test WHERE value IN (VALUES (0),(1000),(2000),....................(998000),(999000));
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=19.50..18176.45 rows=200 width=4) (actual time=2.823..736.960 rows=999 loops=1)
   Hash Cond: (test.value = "*VALUES*".column1)
   ->  Seq Scan on test  (cost=0.00..14405.24 rows=999924 width=4) (actual time=0.032..335.680 rows=1000000 loops=1)
   ->  Hash  (cost=17.00..17.00 rows=200 width=4) (actual time=2.108..2.108 rows=1000 loops=1)
         ->  HashAggregate  (cost=15.00..17.00 rows=200 width=4) (actual time=1.165..1.542 rows=1000 loops=1)
               ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.004..0.478 rows=1000 loops=1)
 Total runtime: 737.362 ms

Subtracting the 542 ms needed to read the table, we see that checking
whether the values are in the hash is really, really fast.

So, the obvious truth: a hash probe is faster than a dumb linear
compare.  Much faster.  Postgres should do this on its own, I think.

PS: if the 1000 values are all the same (1000 times 1), IN() doesn't
detect it, so the runtime does not change.  The hash join doesn't care
either, so its runtime doesn't change.
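[Editorial aside: the linear-compare vs. hash-probe gap can be reproduced
outside the database entirely.  A sketch in plain Python, numbers scaled
down; this only mimics what the executor does, it is not Postgres code.
Probing a 1000-element list per row costs O(rows x values), probing a
set costs O(rows):]

```python
import time

rows = list(range(100_000))                   # the table, scaled down
targets_list = list(range(0, 100_000, 100))   # the 1000 IN() values
targets_set = set(targets_list)               # the hash the planner could build

t0 = time.perf_counter()
linear = [r for r in rows if r in targets_list]  # ~ Filter: value = ANY('{...}')
t1 = time.perf_counter()
hashed = [r for r in rows if r in targets_set]   # ~ Hash Join probe
t2 = time.perf_counter()

assert linear == hashed == targets_list          # same rows either way
print(f"linear: {t1 - t0:.2f}s  hash: {t2 - t1:.2f}s")
```

The list scan is typically orders of magnitude slower, mirroring the
17978 ms vs. 737 ms plans above.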
Followup to my previous test, with an index this time.

EXPLAIN ANALYZE SELECT * FROM test WHERE value IN ( 1000 integers )

 Bitmap Heap Scan on test  (cost=3519.09..7156.83 rows=1000 width=4) (actual time=5.843..8.897 rows=999 loops=1)
   Recheck Cond: (value = ANY ('{0,...,999000}'::integer[]))
   ->  Bitmap Index Scan on testindex  (cost=0.00..3518.84 rows=1000 width=0) (actual time=5.594..5.594 rows=999 loops=1)
         Index Cond: (value = ANY ('{0,...,999000}'::integer[]))
 Total runtime: 9.157 ms

EXPLAIN ANALYZE SELECT * FROM test WHERE value IN (VALUES (0),(1000),.......(999000))

 Nested Loop  (cost=15.00..1461.74 rows=200 width=4) (actual time=1.191..26.127 rows=999 loops=1)
   ->  HashAggregate  (cost=15.00..17.00 rows=200 width=4) (actual time=1.169..1.673 rows=1000 loops=1)
         ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.007..0.517 rows=1000 loops=1)
   ->  Index Scan using testindex on test  (cost=0.00..7.21 rows=1 width=4) (actual time=0.023..0.023 rows=1 loops=1000)
         Index Cond: (test.value = "*VALUES*".column1)
 Total runtime: 26.411 ms

Mixing the two would be a win:
- hash the values
- make a bitmap from them
- grab the pages and use the hash in the "Recheck Cond"

ie. something like this:

 ->  HashAggregate  (cost=15.00..17.00 rows=200 width=4) (actual time=1.169..1.673 rows=1000 loops=1)
       ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.007..0.517 rows=1000 loops=1)
 Bitmap Heap Scan on test  (cost=3519.09..7156.83 rows=1000 width=4) (actual time=5.843..8.897 rows=999 loops=1)
   Recheck Cond: (value in hash)
   ->  Bitmap Index Scan on testindex  (cost=0.00..3518.84 rows=1000 width=0) (actual time=5.594..5.594 rows=999 loops=1)
         Index Cond: (value in hash)
Listmail wrote:
> Followup to my previous test, with an index this time
>
> EXPLAIN ANALYZE SELECT * FROM test WHERE value IN ( 1000 integers )

I'm not quite sure what you're trying to measure here, but I don't think
it is what was suggested.  IIRC the suggestion was to move the values
from your IN (...) operator into a temp table and join against that.

Try measuring something like this:

EXPLAIN ANALYZE SELECT * FROM table JOIN test ON (table.column = test.value)

vs.

EXPLAIN ANALYZE SELECT * FROM table WHERE value IN ( 1000 integers )

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
I used VALUES as a replacement for the temporary table since, for this
application, it is a lot more useful.

The point is:

SELECT * FROM table WHERE value IN ( 1000 integers )
  does up to 1000 comparisons for each row.

SELECT * FROM table WHERE value IN ( VALUES (1000 integers) )
  builds a hash with the 1000 values and uses it to test rows, which is
  a lot faster if you have many values to compare with.

The first one is faster if the number of values in the IN() is small.
The second one is faster if the number of values in the IN() is large.

> EXPLAIN ANALYZE SELECT * FROM table JOIN test ON (table.column =
> test.value)

It wouldn't give the same result: both queries above remove duplicates,
this one does not.
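[Editorial aside: that semantic difference is easy to demonstrate.  A
sketch in Python with SQLite standing in for Postgres; since I'm not
certain of SQLite's column naming for bare VALUES lists, the duplicate
values are written as a UNION ALL subquery, which in Postgres you would
write as VALUES (1),(1),(1):]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")

# A duplicate-laden value list, standing in for VALUES (1),(1),(1).
dups = "(SELECT 1 AS v UNION ALL SELECT 1 UNION ALL SELECT 1)"

# IN: the row matches at most once, however many times 1 is listed.
in_rows = conn.execute(
    f"SELECT value FROM t WHERE value IN {dups}").fetchall()

# Plain join: the row comes back once per duplicate on the other side.
join_rows = conn.execute(
    f"SELECT t.value FROM t JOIN {dups} AS d ON t.value = d.v").fetchall()

print(len(in_rows), len(join_rows))  # 1 3
```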