Thread: huge disparities in =/IN/BETWEEN performance

huge disparities in =/IN/BETWEEN performance

From
"George Pavlov"
Date:
all my SQL-writin' life i have been assuming that expressions like =,
IN, BETWEEN in the WHERE clause are, in the most general sense,
alternative ways of doing the same things. i am hitting some very very
bizarre results in PGSQL:

i have a (very involved) view, say v_foo, largely optimized to be
queried by a user_id column, so:
 select * from v_foo where user_id = 70728;
 Time: 580.620 ms

however an essentially synonymous IN construct takes minutes to complete
(the subquery inside the IN clause will return the 70728 ID and by
itself takes 40ms to complete):
 select * from v_foo where user_id in
   (select user_id from bar group by 1 having count(*) = 10 limit 1);
 Time: 244616.464 ms

a synonymous-looking BETWEEN also takes forever:
 select * from v_foo where user_id between 70728 and 70728;
 Time: 329332.722 ms

there is, admittedly, substantial complexity inside v_foo, with GROUP
BYs on user_id and various subqueries, but my basic thought is that
should not really matter...

i am on 8.1.3. i'd like to hope that the 8.2 optimizer improvements
might help with this but i haven't tested.

george


Re: huge disparities in =/IN/BETWEEN performance

From
Tom Lane
Date:
"George Pavlov" <gpavlov@mynewplace.com> writes:
> there is, admittedly, substantial complexity inside v_foo, with GROUP
> BYs on user_id and various subqueries, but my basic thought is that
> should not really matter...

You're unlikely to get any useful comment on this when you have not
shown any of those details, nor even an EXPLAIN.
        regards, tom lane


Re: huge disparities in =/IN/BETWEEN performance

From
"George Pavlov"
Date:
> > BYs on user_id and various subqueries, but my basic thought is that
> > should not really matter...
>
> You're unlikely to get any useful comment on this when you have not
> shown any of those details, nor even an EXPLAIN.

yes, i know. i guess i was partially just venting. sorry. the problem is
that the view is very complex, and cleansing it for general consumption
and paring it down to some degree of readability is a lot of work.

the basic question i have is fairly clear though: why should saying
"where x = 10" be different (in ANY circumstance, not just mine) from
saying "where x between 10 and 10" or from "where x in (select ... /*
some query that returns 10 */)"?

i am not really looking for help optimizing my view and query, more for
a general idea: should this happen, when might this happen, and why is
it a good idea that it happens? is this a better statement of the issue?
thanks for listening!

george


Re: huge disparities in =/IN/BETWEEN performance

From
Alvaro Herrera
Date:
George Pavlov wrote:

> the basic question i have is fairly clear though: why saying "where x =
> 10" should be  different (in ANY cicumstance, not just mine) from saying
> "where x between 10 and 10" or from "where x in (select ... /* some
> query that returns 10 */)" ??? 

I think the principle here is that the system is not gonna waste cycles
on dumb queries.  Supposedly, morphing "foo BETWEEN 10 and 10" into
"foo=10" is not a trivial transformation, and it'd impose a planning
cost on all non-dumb BETWEEN queries.  That cost is best avoided: if you
optimize for dumb users, the smart users then want you buried because
you've lost performance doing useless work.
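
A quick way to see this for yourself (just a sketch, against a
hypothetical table t with an integer column x) is to compare the filter
conditions EXPLAIN reports for the two spellings:

  create table t (x integer);  -- hypothetical example table

  -- BETWEEN is expanded into two range comparisons, not
  -- collapsed into an equality:
  explain select * from t where x between 10 and 10;
  --   Filter: ((x >= 10) AND (x <= 10))

  -- while the equality form is planned as a plain equality:
  explain select * from t where x = 10;
  --   Filter: (x = 10)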

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: huge disparities in =/IN/BETWEEN performance

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I think the principle here is that the system is not gonna waste cycles
> on dumb queries.  Supposedly, morphing "foo BETWEEN 10 and 10" into
> "foo=10" is not a trivial transformation, and it'd impose a planning
> cost on all non-dumb BETWEEN queries.

There's a datatype abstraction issue involved: what does it take to
prove that "x >= 10 AND x <= 10" is equivalent to "x = 10"?  This
requires a nontrivial amount of knowledge about the operators involved.
We could probably do it for operators appearing in a btree operator
class, but as Alvaro says, it'd be cycles wasted for non-dumb queries.

As for the IN case, I think we do simplify "x IN (one-expression)" to
"x = one-expression", but "x IN (sub-select)" is a whole 'nother matter,
especially when you're comparing it to a case where one-expression is
a constant and so the planner can get good statistics about how many
rows are likely to match.
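
As a quick illustration (just a sketch, using a hypothetical table t
with an indexed integer column x):

  create table t (x integer);
  create index t_x_idx on t (x);  -- hypothetical index name

  -- a one-element IN list is simplified to an equality, so the
  -- planner can use the index and its statistics directly:
  explain select * from t where x in (10);
  --   planned as: (x = 10)

  -- an IN over a sub-select is planned as a subplan or join;
  -- the planner can't assume it yields exactly one constant:
  explain select * from t where x in (select 10);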
        regards, tom lane


Re: huge disparities in =/IN/BETWEEN performance

From
Joe
Date:
Hi Tom,

On Thu, 2007-02-08 at 22:50 -0500, Tom Lane wrote:
> There's a datatype abstraction issue involved: what does it take to
> prove that "x >= 10 AND x <= 10" is equivalent to "x = 10"?  This
> requires a nontrivial amount of knowledge about the operators involved.
> We could probably do it for operators appearing in a btree operator
> class, but as Alvaro says, it'd be cycles wasted for non-dumb queries.

Are you saying the planner is datatype-agnostic and can't tell that x
is, say, as in the example above, an INTEGER and therefore cannot
transform one expression into another?  What about "x = 10 AND x < 5"?
Can't it reduce that to FALSE?

Joe



Re: huge disparities in =/IN/BETWEEN performance

From
Tom Lane
Date:
Joe <dev@freedomcircle.net> writes:
> Are you saying the planner is datatype-agnostic

Certainly, but your other concerns don't follow from that.  The issue at
hand here is whether it's worth expending cycles on every query to try
to detect a situation that only holds for a few.
        regards, tom lane


Re: huge disparities in =/IN/BETWEEN performance

From
Joe
Date:
Hi Tom,

On Thu, 2007-02-08 at 23:24 -0500, Tom Lane wrote:
> Certainly, but your other concerns don't follow from that.  The issue at
> hand here is whether it's worth expending cycles on every query to try
> to detect a situation that only holds for a few.

Those were George's concerns, but I was interested in knowing whether
the planner transforms a query in any way to avoid, let's say, useless
execution.  George didn't provide the inside of his view, but it's
possible that my earlier example could be rephrased as follows:

create view v_foo as select * from tab where x < 5;
select * from v_foo where x = 10;

Presumably the planner has to transform the SELECT into

select * from tab where x < 5 and x = 10;

or something analogous.  This is a simple example, but with complicated
views or even joins and aggregates, the "useless execution" may not be
that obvious to the "naked eye" but it would be to a boolean logic
analyzer.

As to whether such query instances are few or typical, that is arguable,
and will depend on the type of application, the level of knowledge among
users, and what kind of interfaces are used to enter or generate the
queries.

Joe



Re: huge disparities in =/IN/BETWEEN performance

From
Tom Lane
Date:
Joe <dev@freedomcircle.net> writes:
> George didn't provide the inside of his view, but it's
> possible that my earlier example could be rephrased as follows:
> create view v_foo as select * from tab where x < 5;
> select * from v_foo where x = 10;

So try it:

regression=# create table tab (x int);
CREATE TABLE
regression=# create view v_foo as select * from tab where x < 5;
CREATE VIEW
regression=# explain select * from v_foo where x = 10;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on tab  (cost=0.00..46.00 rows=4 width=4)
   Filter: ((x < 5) AND (x = 10))
(2 rows)

regression=# set constraint_exclusion to 1;
SET
regression=# explain select * from v_foo where x = 10;
                 QUERY PLAN
-------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
   One-Time Filter: false
(2 rows)

(This is with HEAD, but I think 8.2 can do it too.)
        regards, tom lane


Re: huge disparities in =/IN/BETWEEN performance

From
"George Pavlov"
Date:
Thanks all for the various useful thoughts. Let me backtrack a bit and
state my real underlying issue with some actual examples. I hope not to
bore you with the length of this. It looks to me like an optimizer
issue, unless I am missing something.

So, suppose I have a query:

select *
from stuff
inner join (
  -- just getting the distinct user-stuff associations,
  -- since there may be multiple; ultimately I need to
  -- query by user
  select stuff_id, user_id
  from stuff_user
  group by 1, 2  -- GROUP BY outperforms DISTINCT
) su using (stuff_id)
left join (
  -- this obtains summary statistics about each stuff item
  select stuff_id, count(*)
  from stuff_events
  group by 1
) se using (stuff_id)
where user_id = 41;

This is a very pared down version of what I have. And yes this specific
query can be rewritten as a single GROUP BY, but in the real world I am
gathering the aggregate statistics from several tables, so I actually
have several sub-recordsets similar to the one called "se" above.
Rewriting ALL those as a single GROUP BY is not feasible. I know, all
this cries for a single summarized rollup table, but let's not go there
(for now).

So running the above is inefficient. This particular user_id has only
one associated stuff_id and does not even have much data for that in
stuff_events. The query runs in ~4600ms. Were I to query by stuff_id
instead, things look great (if I change the where clause to use
stuff_id, it runs in 25ms). When I query based on stuff_id the optimizer uses an
index on stuff_events.stuff_id. However, when I query by user_id it does
a Seq Scan on stuff_events. I somehow wish I could tell the optimizer to
first figure out which stuff_ids are related to the user_id that is
being asked for and then look ONLY those up in the stuff_events table
using the index on stuff_id.

It would seem (and this is where we get back to my original question)
that one should be able to just say:

select *
from stuff
left join (
  select stuff_id, count(*)
  from stuff_events
  group by 1
) se using (stuff_id)
where stuff_id in (
  select distinct stuff_id
  from stuff_user
  where user_id = 41
);

You'd think that the subquery in the IN would be (very quickly) resolved
to a list of stuff_ids, and that stuff_events would then be accessed via
its stuff_id index. Instead, the Seq Scan on stuff_events still happens,
and the query is actually even slower than the original, running in
~5500ms.

So one (very ugly) way to optimize the first query is to add an extra
join to stuff_user INSIDE the "se" subquery:

select *
from stuff
inner join (
  select stuff_id, user_id
  from stuff_user
  group by 1, 2
) su using (stuff_id)
left join (
  select stuff_id, user_id, count(*)
  from stuff_events
  inner join (
    -- same subquery as above
    select stuff_id, user_id
    from stuff_user
    group by 1, 2
  ) su2 using (stuff_id)
  group by 1, 2
) se using (stuff_id)
where user_id = 41;

This does improve things a lot, bringing the execution time for this
particular user down to 3ms (!), but it is quite ugly, and it is still
not fast enough for a user_id with lots of associated stuff_ids.
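
Another rewrite I have been toying with (just a sketch, not fully
tested) is to fold each aggregate into a correlated scalar subquery, so
that the executor can drive it through the index on
stuff_events.stuff_id row by row, instead of aggregating the whole table
up front:

  select s.*,
         (select count(*)
          from stuff_events e
          where e.stuff_id = s.stuff_id) as event_count
  from stuff s
  where s.stuff_id in
    (select stuff_id from stuff_user where user_id = 41);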

George


Re: huge disparities in =/IN/BETWEEN performance

From
Tom Lane
Date:
"George Pavlov" <gpavlov@mynewplace.com> writes:
> I somehow wish I could tell the optimizer to
> first figure out which stuff_ids are related to the user_id that is
> being asked for and then look ONLY those up in the stuff_events table
> using the index on stuff_id. 

This is not really an optimizer problem, or at least not just an
optimizer problem.  The type of plan I think you are wishing for is what
the source code calls a "nestloop with inner index scan", and that
terminology should tip you off that it's only considered when the inner
relation is just a simple indexscannable table.  GROUP BY subqueries
need not apply :-(.
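
The kind of plan being wished for looks roughly like this (schematic
only, with a made-up index name):

  Nested Loop
    ->  Seq Scan on stuff_user
          Filter: (user_id = 41)
    ->  Index Scan using stuff_events_stuff_id_idx on stuff_events
          Index Cond: (stuff_id = stuff_user.stuff_id)

and the point is that the inner side there has to be a plain
indexscannable table, not a GROUP BY subquery.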

I've been speculating recently about how this situation might be
improved, but I fear it will require nontrivial executor changes along
with planner changes.  The executor's present mechanism for passing
variable values from the outer plan to the inner is a hack that only
really works for indexscans.  I got it to work for inheritance cases
too, recently, but that's about as far as it can be pushed.  I think
it might be possible to get rid of it and use the more-recently-invented
subplan parameter mechanism, but I haven't worked out the details.
(And I know that the Greenplum crowd would like to get rid of subplan
parameters, so I'm not sure this idea will go over well anyway.)  The
planner changes needed will be pretty wide-ranging too, likely.

This might happen for 8.4 but I wouldn't promise it for 8.3.
        regards, tom lane