Thread: Have I b0rked something? Slow comparisons on "where x in (...)"

Have I b0rked something? Slow comparisons on "where x in (...)"

From
Stephen Harris
Date:
Postgres version 8.0.9 on Solaris 2.8.  I know it's old but...

I have a table with a million rows.

I need to select data from this table based on an indexed column; I need
to select 600 possible values from the column, returning around 24,000
rows of data.

In perl I have a hash which has 600 key values in it.

So I did:

  "select stuff from table where index_key in (" .
      join(",", keys %hash) . ") AND non_index_row in ('xyz','abc','def')"

And in the perl
  while(fetch())
  {
    do_stuff
  }

This resulted in a query string which executed in 12 minutes.  If I
did an "explain" on the query string then I can see it was being expanded
to 600 OR statements
  where (index_key = 1) OR (index_key = 2) OR .....


Now as an alternative option I did
  select stuff from table where non_index_row in ('xyz','abc','def')
and in the perl
  while(fetch())
  {
    next unless $hash{$_->{index_key}};
    do_stuff
  }

To me this should be slower since we're selecting more rows, throwing
the data back to the perl and then discarding values I didn't want.

Imagine my surprise when the result took 3 minutes to execute.

Have I broken something, somewhere?  Or are IN comparisons really that
slow?

For what it's worth, a simple

  explain select count(*) from table where index_key in (1,2,3,4,....)

uses the index up until 156 values but then switches to sequential scan
when there are 157 or more values in query.

Any thoughts?  I fear my poor tuning attempts may have caused other
slowdowns!

--

rgds
Stephen

Re: Have I b0rked something? Slow comparisons on "where x in (...)"

From
"Dann Corbit"
Date:
Have you done a vacuum on the table recently?

I would be curious to see how:

select stuff from table
where index_key = <key1> AND non_index_row in ('xyz','abc','def')
UNION ALL
select stuff from table
where index_key = <key2> AND non_index_row in ('xyz','abc','def')
...
UNION ALL
select stuff from table
where index_key = <key600> AND non_index_row in ('xyz','abc','def')

performs by comparison.  If, after a vacuum, it performs better than the
IN list, then the IN list might benefit from a bit of analysis for
better tuning chances.

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Stephen Harris
> Sent: Wednesday, May 02, 2007 11:32 AM
> To: Postgres General
> Subject: [GENERAL] Have I b0rked something? Slow comparisons on "where x in (...)"
>
> [snip]

Re: Have I b0rked something? Slow comparisons on "where x in (...)"

From
Stephen Harris
Date:
On Wed, May 02, 2007 at 12:45:08PM -0700, Dann Corbit wrote:
> Have you done a vacuum on the table recently?

We vacuum daily and cluster weekly after the nightly activities have been
performed.

> IN list, then the IN list might benefit from a bit of analysis for

The IN list is just a set of integers (it's an integer index) generated
from some selects on other tables earlier in our processing.  I don't
have any choice as to what is in the IN list :-)

--

rgds
Stephen

Re: Have I b0rked something? Slow comparisons on "where x in (...)"

From
Tom Lane
Date:
Stephen Harris <lists@spuddy.org> writes:
> Postgres version 8.0.9 on Solaris 2.8.  I know it's old but...
> I have a table with a million rows.

> I need to select data from this table based on an indexed column; I need
> to select 600 possible values from the column, returning around 24,000
> rows of data.

> In perl I have a hash which has 600 key values in it.

> So I did:

>   "select stuff from table where index_key in (" .
>       join(",",keys %hash) . ") AND non_index_row in ('xyz','abc','def')"

> And in the perl
>   while(fetch())
>   {
>     do_stuff
>   }

> This resulted in a query string which executed in 12 minutes.  If I
> did an "explain" on the query string then I can see it was being expanded
> to 600 OR statements
>   where (index_key = 1) OR (index_key = 2) OR .....

In what, a seq scan?

That plan will require executing 600 integer comparisons at each of a
million rows, with only some trivial fraction avoided because of early
success.  So it works out that your machine is able to do something over
800K such comparisons per second, which seems a bit slow for any modern
machine ... but I note 8.0 didn't have any of the "virtual slot"
optimizations added in later releases, and is doing a fresh
heap_getattr() for each access to the variable.  If it's having to
grovel over a lot of variable-width fields to get to that field each
time, I can see where the time might get eaten up.  Where is the
index_key column in the tuples, exactly?

            regards, tom lane

Re: Have I b0rked something? Slow comparisons on "where x in (...)"

From
Richard Huxton
Date:
Stephen Harris wrote:
> On Wed, May 02, 2007 at 12:45:08PM -0700, Dann Corbit wrote:
>> Have you done a vacuum on the table recently?
>
> We vacuum daily and cluster weekly after the nightly activities have been
> performed.
>
>> IN list, then the IN list might benefit from a bit of analysis for
>
> The IN list is just a set of integers (it's an integer index) generated
> from some selects on other tables earlier in our processing.  I don't
> have any choice as to what is in the IN list :-)

Try creating a temporary table, populating with the list and joining
against it. That's probably your best bet for a long list of target values.
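
A minimal sketch of that approach, using the OP's placeholder column names
(`stuff_table` stands in for the real table name, since `table` itself is a
reserved word, and the key values shown are illustrative):

```sql
-- Collect the 600 wanted keys in a temp table:
CREATE TEMP TABLE wanted_keys (index_key integer PRIMARY KEY);
INSERT INTO wanted_keys VALUES (1), (2), (3);  -- ... the keys from the app
ANALYZE wanted_keys;

-- Join against it instead of a 600-element IN (...) list:
SELECT t.stuff
FROM stuff_table t
JOIN wanted_keys w USING (index_key)
WHERE t.non_index_row IN ('xyz', 'abc', 'def');
```

With statistics on the temp table, the planner can pick a hash join or an
index-driven nested loop rather than evaluating 600 literal comparisons
against every row.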

--
   Richard Huxton
   Archonet Ltd

Re: Have I b0rked something? Slow comparisons on "where x in (...)"

From
Stephen Harris
Date:
On Wed, May 02, 2007 at 05:59:49PM -0400, Tom Lane wrote:
> Stephen Harris <lists@spuddy.org> writes:

> >   "select stuff from table where index_key in (" .
> >       join(",",keys %hash) . ") AND non_index_row in ('xyz','abc','def')"

> In what, a seq scan?

Yeah, if the number of comparisons exceeds 156 then it switches from an
index scan to a sequential scan.

> time, I can see where the time might get eaten up.  Where is the
> index_key column in the tuples, exactly?

First column.

--

rgds
Stephen

Re: Have I b0rked something? Slow comparisons on "where x in (...)"

From
Listmail
Date:
> Try creating a temporary table, populating with the list and joining
> against it. That's probably your best bet for a long list of target
> values.

    Check :

forum_bench=> CREATE TABLE test (value INTEGER NOT NULL);
CREATE TABLE
forum_bench=> INSERT INTO test SELECT * FROM generate_series( 1, 1000000 );
INSERT 0 1000000
forum_bench=> ANALYZE test;
forum_bench=> EXPLAIN ANALYZE SELECT * FROM test;
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
  Seq Scan on test  (cost=0.00..14405.24 rows=999924 width=4) (actual time=0.030..349.699 rows=1000000 loops=1)
  Total runtime: 542.914 ms
(2 rows)


    OK : 542 ms to grab the data.
    IN() :

EXPLAIN ANALYZE SELECT * FROM test WHERE value IN ( 1000 values from 0 to 999000 in steps of 1000 ):
  Seq Scan on test  (cost=0.00..1264310.24 rows=1000 width=4) (actual time=17.649..17977.085 rows=999 loops=1)
    Filter: (value = ANY ('{0,1000..........99000}'::integer[]))
  Total runtime: 17978.061 ms

    Ouch.

forum_bench=> EXPLAIN ANALYZE SELECT * FROM test WHERE value IN (VALUES (0),(1000),(2000),....................(998000),(999000));
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
  Hash Join  (cost=19.50..18176.45 rows=200 width=4) (actual time=2.823..736.960 rows=999 loops=1)
    Hash Cond: (test.value = "*VALUES*".column1)
    ->  Seq Scan on test  (cost=0.00..14405.24 rows=999924 width=4) (actual time=0.032..335.680 rows=1000000 loops=1)
    ->  Hash  (cost=17.00..17.00 rows=200 width=4) (actual time=2.108..2.108 rows=1000 loops=1)
          ->  HashAggregate  (cost=15.00..17.00 rows=200 width=4) (actual time=1.165..1.542 rows=1000 loops=1)
                ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.004..0.478 rows=1000 loops=1)
  Total runtime: 737.362 ms

    Removing the 542 ms to read the table, we see that checking whether the
values are in the hash is really, really fast.

    So, obvious truth : hash is faster than dumb compare. Much faster.
    Now, postgres should do this on its own, I think.

    PS : if the 1000 values are all the same (1000 times 1), IN() doesn't
detect it, so the runtime does not change. Hash join doesn't care, so the
runtime doesn't change either.

Re: Have I b0rked something? Slow comparisons on "where x in (...)"

From
Listmail
Date:
    Followup to my previous test, with an index this time

EXPLAIN ANALYZE SELECT * FROM test WHERE value IN ( 1000 integers )

  Bitmap Heap Scan on test  (cost=3519.09..7156.83 rows=1000 width=4) (actual time=5.843..8.897 rows=999 loops=1)
    Recheck Cond: (value = ANY ('{0,...,999000}'::integer[]))
    ->  Bitmap Index Scan on testindex  (cost=0.00..3518.84 rows=1000 width=0) (actual time=5.594..5.594 rows=999 loops=1)
          Index Cond: (value = ANY ('{0,...,999000}'::integer[]))
  Total runtime: 9.157 ms

EXPLAIN ANALYZE SELECT * FROM test WHERE value IN (VALUES (0),(1000),.......(999000))

  Nested Loop  (cost=15.00..1461.74 rows=200 width=4) (actual time=1.191..26.127 rows=999 loops=1)
    ->  HashAggregate  (cost=15.00..17.00 rows=200 width=4) (actual time=1.169..1.673 rows=1000 loops=1)
          ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.007..0.517 rows=1000 loops=1)
    ->  Index Scan using testindex on test  (cost=0.00..7.21 rows=1 width=4) (actual time=0.023..0.023 rows=1 loops=1000)
          Index Cond: (test.value = "*VALUES*".column1)
  Total runtime: 26.411 ms

    Mixing the two would be a win :

    - hashing the values
    - making a bitmap from them
    - grabbing the pages and using the hash in "Recheck Cond"

    ie. something like that :

    ->  HashAggregate  (cost=15.00..17.00 rows=200 width=4) (actual time=1.169..1.673 rows=1000 loops=1)
          ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.007..0.517 rows=1000 loops=1)
  Bitmap Heap Scan on test  (cost=3519.09..7156.83 rows=1000 width=4) (actual time=5.843..8.897 rows=999 loops=1)
    Recheck Cond: (value in hash)
    ->  Bitmap Index Scan on testindex  (cost=0.00..3518.84 rows=1000 width=0) (actual time=5.594..5.594 rows=999 loops=1)
          Index Cond: (value in hash)



Re: Have I b0rked something? Slow comparisons on "where x in (...)"

From
Alban Hertroys
Date:
Listmail wrote:
>
>     Followup to my previous test, with an index this time
>
> EXPLAIN ANALYZE SELECT * FROM test WHERE value IN ( 1000 integers )

I'm not quite sure what you're trying to measure here, but I don't think
it is what was suggested.

IIRC the suggestion was to move the values from your IN (...) operator
into a temp table and join against that.

Try measuring something like this:

EXPLAIN ANALYZE SELECT * FROM table JOIN test ON (table.column = test.value)

vs.

EXPLAIN ANALYZE SELECT * FROM table WHERE value IN ( 1000 integers )

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //

Re: Have I b0rked something? Slow comparisons on "where x in (...)"

From
Listmail
Date:

    I used VALUES as a replacement for the temporary table since for this
application, it is a lot more useful.

    The point is :

SELECT * FROM table WHERE value IN ( 1000 integers ) : does 1000
comparisons for each row.
SELECT * FROM table WHERE value IN ( VALUES (1000 integers) ) : builds a
hash of the 1000 values and uses it to test rows, which is a lot faster
when there are many values to compare against.

    The first one is faster if the number of values in the IN() is small. The
second one is faster if the number of values in the IN() is large.
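
    Applied to the original poster's query, the second form would look
something like this (placeholder names are the OP's, with `stuff_table`
standing in for the real table name; note that bare VALUES lists need
PostgreSQL 8.2 or later, newer than the 8.0.9 the OP is running):

```sql
SELECT stuff
FROM stuff_table                              -- "table" in the OP's pseudo-query
WHERE index_key IN (VALUES (1), (2), (3))     -- ... the 600 keys from the app
  AND non_index_row IN ('xyz', 'abc', 'def');
```

    The planner hashes the VALUES list once (the HashAggregate node in the
plans above), so each row costs one hash probe instead of up to 600 equality
tests.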

> EXPLAIN ANALYZE SELECT * FROM table JOIN test ON (table.column =
> test.value)

    It wouldn't give the same result : both queries above remove duplicates,
this one does not.