Thread: Optimising "in" queries

Optimising "in" queries

From: Stephen Davies
I have a PostgreSQL 8.2.4 table with some seven million rows.

The psql query:

select count(rdate),rdate from reading where sensor_id in
(1137,1138,1139) group by rdate order by rdate desc limit 1;

takes a few seconds but:

select count(rdate),rdate from reading where sensor_id in
(1137,1138,1139,1140) group by rdate order by rdate desc limit 1;

(anything with four or more values in the "in" list) takes several
minutes.

Is there any way to make the "larger" queries more efficient?

Both rdate and sensor_id are indexed and the database is vacuumed every
night.

The values in the "in" list are seldom as "neat" as in the above
examples. Actual values can range from 1 to about 2000. The number of
values ranges from 2 to about 10.

Explain outputs are:

benparts=# explain select count(rdate),rdate from reading where
sensor_id in (1137,1138,1139,1140) group by rdate order by rdate desc
limit 1;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..39890.96 rows=1 width=8)
   ->  GroupAggregate  (cost=0.00..7938300.21 rows=199 width=8)
         ->  Index Scan Backward using date on reading  (cost=0.00..7937884.59 rows=82625 width=8)
               Filter: (sensor_id = ANY ('{1137,1138,1139,1140}'::integer[]))
(4 rows)

benparts=# explain select count(rdate),rdate from reading where
sensor_id in (1137,1138,1139) group by rdate order by rdate desc limit
1;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Limit  (cost=48364.32..48364.32 rows=1 width=8)
   ->  Sort  (cost=48364.32..48364.49 rows=69 width=8)
         Sort Key: rdate
         ->  HashAggregate  (cost=48361.35..48362.21 rows=69 width=8)
               ->  Bitmap Heap Scan on reading  (cost=535.53..48218.10 rows=28650 width=8)
                     Recheck Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
                     ->  Bitmap Index Scan on reading_sensor  (cost=0.00..528.37 rows=28650 width=0)
                           Index Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
(8 rows)

TIA,
Stephen Davies

Re: Optimising "in" queries

From: "Scott Marlowe"
On 8/21/07, Stephen Davies <scldad@sdc.com.au> wrote:
> I have a PostgreSQL 8.2.4 table with some seven million rows.
>
> The psql query:
>
> select count(rdate),rdate from reading where sensor_id in
> (1137,1138,1139) group by rdate order by rdate desc limit 1;
>
> takes a few seconds but:
>
> select count(rdate),rdate from reading where sensor_id in
> (1137,1138,1139,1140) group by rdate order by rdate desc limit 1;
>
> (anything with four or more values in the "in" list) takes several
> minutes.

Can we see explain analyze output?  (i.e. not just plain explain)
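
For example, your two queries with ANALYZE added, so we can see actual
row counts and timings next to the estimates:

explain analyze select count(rdate),rdate from reading where sensor_id in
(1137,1138,1139) group by rdate order by rdate desc limit 1;

explain analyze select count(rdate),rdate from reading where sensor_id in
(1137,1138,1139,1140) group by rdate order by rdate desc limit 1;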

Re: Optimising "in" queries

From: Russell Smith
Stephen Davies wrote:
> I have a PostgreSQL 8.2.4 table with some seven million rows.
>
> The psql query:
>
> select count(rdate),rdate from reading where sensor_id in
> (1137,1138,1139) group by rdate order by rdate desc limit 1;
>
> takes a few seconds but:
>
> select count(rdate),rdate from reading where sensor_id in
> (1137,1138,1139,1140) group by rdate order by rdate desc limit 1;
>
>
It would have been helpful to see the table definition here.  I can say
up front that array processing in postgres is SLOW.

> (anything with four or more values in the "in" list) takes several
> minutes.
>
> Is there any way to make the "larger" queries more efficient?
>
> Both rdate and sensor_id are indexed and the database is vacuumed every
> night.
>
> The values in the "in" list are seldom as "neat" as in the above
> examples. Actual values can range from 1 to about 2000. The number of
> values ranges from 2 to about 10.
>
> Explain outputs are:
>
> benparts=# explain select count(rdate),rdate from reading where
> sensor_id in (1137,1138,1139,1140) group by rdate order by rdate desc
> limit 1;
>                                             QUERY PLAN
> ---------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..39890.96 rows=1 width=8)
>    ->  GroupAggregate  (cost=0.00..7938300.21 rows=199 width=8)
>          ->  Index Scan Backward using date on reading  (cost=0.00..7937884.59 rows=82625 width=8)
>                Filter: (sensor_id = ANY ('{1137,1138,1139,1140}'::integer[]))
> (4 rows)
>
Without seeing the table definition, I'm unsure how you produced a plan
like this.
> benparts=# explain select count(rdate),rdate from reading where
> sensor_id in (1137,1138,1139) group by rdate order by rdate desc limit
> 1;
>                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>  Limit  (cost=48364.32..48364.32 rows=1 width=8)
>    ->  Sort  (cost=48364.32..48364.49 rows=69 width=8)
>          Sort Key: rdate
>          ->  HashAggregate  (cost=48361.35..48362.21 rows=69 width=8)
>                ->  Bitmap Heap Scan on reading  (cost=535.53..48218.10 rows=28650 width=8)
>                      Recheck Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
>                      ->  Bitmap Index Scan on reading_sensor  (cost=0.00..528.37 rows=28650 width=0)
>                            Index Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
> (8 rows)
>
>
>
As mentioned already, you need explain analyze.

However, I will say again that array processing in postgres is SLOW.  I
would strongly recommend redesigning your schema to use a table with
sensor_ids that correspond to the primary key in the reading table.

Rethinking the way you are going about this will probably be the most
effective solution, but we will need more information if you are not
comfortable doing that yourself.

Regards

Russell Smith


Re: Optimising "in" queries

From: Michael Glaesemann
On Aug 22, 2007, at 5:58 , Russell Smith wrote:

> Stephen Davies wrote:
>> select count(rdate),rdate from reading where sensor_id in
>> (1137,1138,1139,1140) group by rdate order by rdate desc limit 1;
>>
>>
> It would have been helpful to see the table definition here.  I can
> say up front that array processing in postgres is SLOW.

Um, what array processing are you seeing here? IN (a, b, c) is not an
array construct.

Michael Glaesemann
grzm seespotcode net



Re: Optimising "in" queries

From: "Kevin Grittner"
>>> On Tue, Aug 21, 2007 at  9:40 PM, in message
<200708221210.36770.scldad@sdc.com.au>, Stephen Davies <scldad@sdc.com.au>
wrote:
> Is there any way to make the "larger" queries more efficient?

People would be in a better position to answer that if you posted the table
structure and the results of EXPLAIN ANALYZE (rather than just EXPLAIN).

-Kevin



Re: Optimising "in" queries

From: Michael Glaesemann
[Please don't top post as it makes the discussion more difficult to
follow.]

On Aug 22, 2007, at 18:30 , Stephen Davies wrote:

> I have always thought of array processing as the thing that vector
> processors such as Cray and ETA do/did.

(I've always heard that referred to as vector processing.)

> While superficially equivalent, I have always believed that IN (a,b,c)
> executed faster than =a or =b or =c. Am I wrong for PostgreSQL?

Depending on the number of items in the IN list and other statistics, I
believe PostgreSQL will rewrite z IN (a, b, ...) into either (z =
a) OR (z = b) OR ... or somehow add it to the join list, so
performance will vary.
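
For instance, these three forms should return the same rows; as I
understand it, 8.2 normalises the first into the second (I'm using the
reading table from this thread purely for illustration):

select rdate from reading where sensor_id in (1137, 1138, 1139);
select rdate from reading
    where sensor_id = any ('{1137,1138,1139}'::integer[]);
select rdate from reading
    where sensor_id = 1137 or sensor_id = 1138 or sensor_id = 1139;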

Michael Glaesemann
grzm seespotcode net



Re: Optimising "in" queries

From: Russell Smith
Michael Glaesemann wrote:
>
> On Aug 22, 2007, at 5:58 , Russell Smith wrote:
>
>> Stephen Davies wrote:
>>> select count(rdate),rdate from reading where sensor_id in
>>> (1137,1138,1139,1140) group by rdate order by rdate desc limit 1;
>>>
>>>
>> It would have been helpful to see the table definition here.  I can
>> say up front that array processing in postgres is SLOW.
>
> Um, what array processing are you seeing here? IN (a, b, c) is not an
> array construct.

Filter: (sensor_id = ANY ('{1137,1138,1139,1140}'::integer[]))

I've never seen this plan item except when arrays are involved.  I
could be wrong.  I'd like to know how this is generated when you don't
have an array.

Regards

Russell Smith

Re: Optimising "in" queries

From: Russell Smith
Russell Smith wrote:
>
> Filter: (sensor_id = ANY ('{1137,1138,1139,1140}'::integer[]))
>
> I've never seen this plan item except when arrays are involved.  I
> could be wrong.  I'd like to know how this is generated when you don't
> have an array.
>

I have just discovered that PG 8.2 will turn an IN clause into an array
search instead of an OR list.  Sorry all.
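
If anyone wants to check this on their own 8.2 install, a throwaway
table is enough (the table here is just an example):

create temp table t (x integer);
explain select * from t where x in (1, 2, 3);
-- I'd expect the plan to show: Filter: (x = ANY ('{1,2,3}'::integer[]))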

Regards

Russell Smith

Re: Optimising "in" queries

From: Stephen Davies
Interesting semantics. I have never seen the IN syntax referred to as
"array processing" before.

I have always thought of array processing as the thing that vector
processors such as Cray and ETA do/did.

While superficially equivalent, I have always believed that IN (a,b,c)
executed faster than =a or =b or =c. Am I wrong for PostgreSQL?

Stephen

 On Wednesday 22 August 2007 22:55, Michael Glaesemann wrote:
> On Aug 22, 2007, at 5:58 , Russell Smith wrote:
> > Stephen Davies wrote:
> >> select count(rdate),rdate from reading where sensor_id in
> >> (1137,1138,1139,1140) group by rdate order by rdate desc limit 1;
> >
> > It would have been helpful to see the table definition here.  I can
> > say up front that array processing in postgres is SLOW.
>
> Um, what array processing are you seeing here? IN (a, b, c) is not an
> array construct.
>
> Michael Glaesemann
> grzm seespotcode net


Re: Optimising "in" queries

From: Alvaro Herrera
Stephen Davies wrote:
> Interesting semantics. I have never seen the IN syntax referred to as
> "array processing" before.
>
> I have always thought of array processing as the thing that vector
> processors such as Cray and ETA do/did.
>
> While superficially equivalent, I have always believed that IN (a,b,c)
> executed faster than =a or =b or =c. Am I wrong for PostgreSQL?

Older versions of Postgres translated IN (a, b, c) into an OR'ed list of
equalities.  Nowadays it is treated as an array; I think it's translated
to = ANY ({a,b,c}), as you can see in the message you posted at the
start of this thread.

I don't think you showed us the EXPLAIN ANALYZE results that Scott
requested.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Optimising "in" queries

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Stephen Davies wrote:
>> While superficially equivalent, I have always believed that IN (a,b,c)
>> executed faster than =a or =b or =c. Am I wrong for PostgreSQL?

> Older versions of Postgres translated IN (a, b, c) into an OR'ed list of
> equalities.  Nowadays it is treated as an array; I think it's translated
> to = ANY ({a,b,c}), as you can see in the message you posted at the
> start of this thread.

If you're dealing with tables large enough that the index search work is
the dominant cost, all these variants ought to be exactly the same.
However, for smaller tables the planning time and executor startup time
are interesting, and on those measures the = ANY(array) formulation
should win because there's less "stuff" for the system to look at.
With "x=a OR x=b OR x=c" the planner actually has to deduce three times
that an indexscan on x is possible; with "x = ANY(ARRAY[a,b,c])" it
does that only once.  That's why I changed IN to expand to an array
construct instead of an OR tree.  I have to confess not having tried to
measure the consequences carefully, though.  I suspect it's not all
that interesting at only three items ... it's lists of hundreds or
thousands of items where this becomes a big deal.
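
For anyone who wants to compare the two spellings directly, a sketch
against the reading table from this thread (illustrative, not measured):

-- OR form: the planner proves index usability once per arm
explain select rdate from reading
    where sensor_id = 1137 or sensor_id = 1138 or sensor_id = 1139;
-- array form: the same deduction happens once for the whole list
explain select rdate from reading
    where sensor_id = any (array[1137, 1138, 1139]);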

            regards, tom lane

Re: Optimising "in" queries

From: Stephen Davies
array processing???

There are no arrays. What made you think there might be?

The table definition is:


benparts=# \d reading
                                 Table "public.reading"
  Column   |            Type             |                         Modifiers
-----------+-----------------------------+-----------------------------------------------------------
 id        | integer                     | not null default nextval(('reading_seq'::text)::regclass)
 sensor_id | integer                     |
 rdate     | timestamp without time zone |
 rval      | numeric(7,3)                |
Indexes:
    "reading_pkey" PRIMARY KEY, btree (id)
    "unique_sensor_date" UNIQUE, btree (sensor_id, rdate)
    "date" btree (rdate)
    "reading_sensor" btree (sensor_id)
Foreign-key constraints:
    "$1" FOREIGN KEY (sensor_id) REFERENCES sensor(id)

Cheers,
Stephen

On Wednesday 22 August 2007 20:28, Russell Smith wrote:
> Stephen Davies wrote:
> > I have a PostgreSQL 8.2.4 table with some seven million rows.
> >
> > The psql query:
> >
> > select count(rdate),rdate from reading where sensor_id in
> > (1137,1138,1139) group by rdate order by rdate desc limit 1;
> >
> > takes a few seconds but:
> >
> > select count(rdate),rdate from reading where sensor_id in
> > (1137,1138,1139,1140) group by rdate order by rdate desc limit 1;
>
> It would have been helpful to see the table definition here.  I can
> say up front that array processing in postgres is SLOW.
>
> > (anything with four or more values in the "in" list) takes several
> > minutes.
> >
> > Is there any way to make the "larger" queries more efficient?
> >
> > Both rdate and sensor_id are indexed and the database is vacuumed
> > every night.
> >
> > The values in the "in" list are seldom as "neat" as in the above
> > examples. Actual values can range from 1 to about 2000. The number
> > of values ranges from 2 to about 10.
> >
> > Explain outputs are:
> >
> > benparts=# explain select count(rdate),rdate from reading where
> > sensor_id in (1137,1138,1139,1140) group by rdate order by rdate
> > desc limit 1;
> >                                             QUERY PLAN
> > ---------------------------------------------------------------------------------------------------
> >  Limit  (cost=0.00..39890.96 rows=1 width=8)
> >    ->  GroupAggregate  (cost=0.00..7938300.21 rows=199 width=8)
> >          ->  Index Scan Backward using date on reading  (cost=0.00..7937884.59 rows=82625 width=8)
> >                Filter: (sensor_id = ANY ('{1137,1138,1139,1140}'::integer[]))
> > (4 rows)
>
> Without seeing the table definition, I'm unsure how you produced a
> plan like this.
>
> > benparts=# explain select count(rdate),rdate from reading where
> > sensor_id in (1137,1138,1139) group by rdate order by rdate desc
> > limit 1;
> >                                              QUERY PLAN
> > -----------------------------------------------------------------------------------------------------
> >  Limit  (cost=48364.32..48364.32 rows=1 width=8)
> >    ->  Sort  (cost=48364.32..48364.49 rows=69 width=8)
> >          Sort Key: rdate
> >          ->  HashAggregate  (cost=48361.35..48362.21 rows=69 width=8)
> >                ->  Bitmap Heap Scan on reading  (cost=535.53..48218.10 rows=28650 width=8)
> >                      Recheck Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
> >                      ->  Bitmap Index Scan on reading_sensor  (cost=0.00..528.37 rows=28650 width=0)
> >                            Index Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
> > (8 rows)
>
> As mentioned already, you need explain analyze.
>
> However, I will say again that array processing in postgres is SLOW.
> I would strongly recommend redesigning your schema to use a table
> with sensor_ids that correspond to the primary key in the reading
> table.
>
> Rethinking the way you are going about this will probably be the most
> effective solution, but we will need more information if you are not
> comfortable doing that yourself.
>
> Regards
>
> Russell Smith


Re: Optimising "in" queries

From: Stephen Davies
I thought that I had but I screwed up the addresses.
Here they are:

benparts=# explain select count(rdate),rdate from reading where
sensor_id in (1137,1138,1139,1140) group by rdate order by rdate desc
limit 1;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..39890.96 rows=1 width=8)
   ->  GroupAggregate  (cost=0.00..7938300.21 rows=199 width=8)
         ->  Index Scan Backward using date on reading  (cost=0.00..7937884.59 rows=82625 width=8)
               Filter: (sensor_id = ANY ('{1137,1138,1139,1140}'::integer[]))
(4 rows)

benparts=# explain select count(rdate),rdate from reading where
sensor_id in (1137,1138,1139) group by rdate order by rdate desc limit
1;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Limit  (cost=48364.32..48364.32 rows=1 width=8)
   ->  Sort  (cost=48364.32..48364.49 rows=69 width=8)
         Sort Key: rdate
         ->  HashAggregate  (cost=48361.35..48362.21 rows=69 width=8)
               ->  Bitmap Heap Scan on reading  (cost=535.53..48218.10 rows=28650 width=8)
                     Recheck Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
                     ->  Bitmap Index Scan on reading_sensor  (cost=0.00..528.37 rows=28650 width=0)
                           Index Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
(8 rows)


benparts=# explain analyze select count(rdate),rdate from reading where
sensor_id in (1137,1138,1139) group by rdate order by rdate desc limit
1;

                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=49260.20..49260.20 rows=1 width=8) (actual time=3263.219..3263.221 rows=1 loops=1)
   ->  Sort  (cost=49260.20..49260.38 rows=73 width=8) (actual time=3263.213..3263.213 rows=1 loops=1)
         Sort Key: rdate
         ->  HashAggregate  (cost=49257.03..49257.94 rows=73 width=8) (actual time=3049.667..3093.345 rows=30445 loops=1)
               ->  Bitmap Heap Scan on reading  (cost=541.97..49109.62 rows=29481 width=8) (actual time=1727.021..2908.563 rows=91334 loops=1)
                     Recheck Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
                     ->  Bitmap Index Scan on reading_sensor  (cost=0.00..534.60 rows=29481 width=0) (actual time=1714.980..1714.980 rows=91334 loops=1)
                           Index Cond: (sensor_id = ANY ('{1137,1138,1139}'::integer[]))
 Total runtime: 3264.121 ms
(9 rows)

benparts=# explain analyze select count(rdate),rdate from reading where
sensor_id in (1137,1138,1139,1140) group by rdate order by rdate desc
limit 1;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..41959.54 rows=1 width=8) (actual time=1.284..1.285 rows=1 loops=1)
   ->  GroupAggregate  (cost=0.00..8182110.32 rows=195 width=8) (actual time=1.281..1.281 rows=1 loops=1)
         ->  Index Scan Backward using date on reading  (cost=0.00..8181711.41 rows=79294 width=8) (actual time=1.254..1.261 rows=2 loops=1)
               Filter: (sensor_id = ANY ('{1137,1138,1139,1140}'::integer[]))
 Total runtime: 1.360 ms
(5 rows)

On Friday 24 August 2007 05:16, Alvaro Herrera wrote:
<snip>
> I don't think you showed us the EXPLAIN ANALYZE results that Scott
> requested.
