Thread: Different execution plans for semantically equivalent queries

Different execution plans for semantically equivalent queries

From
Mikkel Lauritsen
Date:
Hi all,

I have a execution planner related issue that I'd like to have some help
in understanding a bit deeper -

I have a table which basically contains fields for a value, a timestamp
and
a record type which is an integer. I would like to do a query which
retrieves
the newest record for each type, and the persistence framework that I'm
using
does something which is structurally like

SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE
    t2.type = t1.type AND t2.timestamp > t1.timestamp)

On all of the PostgreSQL versions I've tried (9.0.2 included) this
executes
in about 20-30 seconds, which isn't exactly stellar. I've tried the (I
think)
equivalent

SELECT * FROM table t1 WHERE NOT EXISTS (SELECT * FROM table t2 WHERE
    t2.type = t1.type AND t2.timestamp > t1.timestamp)

instead, and that executes in about 100 ms, so it's about 200 times
faster.

The two statements have completely different execution plans, so I
understand
why there is a difference in performance, but as I'm unable to modify the
SQL that the persistence framework generates I'd like to know if there's
anything that I can do in order to make the first query execute as fast as
the second one.

I'm more specifically thinking whether I'm missing out on a crucial
planner
configuration knob or something like that, which causes the planner to
treat the two cases differently.

Best regards & thanks for an excellent database engine,
  Mikkel Lauritsen


Re: Different execution plans for semantically equivalent queries

From
Tom Lane
Date:
Mikkel Lauritsen <renard@tala.dk> writes:
> I would like to do a query which retrieves the newest record for each
> type, and the persistence framework that I'm using does something
> which is structurally like

> SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE
>     t2.type = t1.type AND t2.timestamp > t1.timestamp)

I suspect that *any* database is going to have trouble optimizing that.
You'd be well advised to lobby the persistence framework's authors to
produce less brain-dead SQL.  The NOT EXISTS formulation seems to
express what's wanted much less indirectly.

            regards, tom lane

Re: Different execution plans for semantically equivalent queries

From
Mikkel Lauritsen
Date:
Hi Tom et al,

Many thanks for your prompt reply - you wrote:

>> SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE
>>     t2.type = t1.type AND t2.timestamp > t1.timestamp)
>
> I suspect that *any* database is going to have trouble optimizing that.

Okay, I expected that much.

Just out of curiosity I've been looking a bit at the optimizer code
in PostgreSQL, and it seems as if it would be at least theoretically
possible to add support for things like transforming the query at
hand into the NOT EXISTS form; a bit like how = NULL is converted
to IS NULL.

Would a change like that be accepted, or would you rather try to
indirectly educate people into writing better SQL?

> You'd be well advised to lobby the persistence framework's authors to
> produce less brain-dead SQL.  The NOT EXISTS formulation seems to
> express what's wanted much less indirectly.

Will do :-)

For now I guess I'll hack it by wrapping a proxy around the JDBC
driver and rewriting the SQL on the fly; I encounter other bugs in
the persistence layer that are probably best handled that way as
well.

Best regards & thanks,
  Mikkel

Re: Different execution plans for semantically equivalent queries

From
Marti Raudsepp
Date:
On Mon, Feb 7, 2011 at 00:03, Mikkel Lauritsen <renard@tala.dk> wrote:
>>> SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE
>>>     t2.type = t1.type AND t2.timestamp > t1.timestamp)
>>
>> I suspect that *any* database is going to have trouble optimizing that.

> Just out of curiosity I've been looking a bit at the optimizer code
> in PostgreSQL, and it seems as if it would be at least theoretically
> possible to add support for things like transforming the query at
> hand into the NOT EXISTS form; a bit like how = NULL is converted
> to IS NULL.
>
> Would a change like that be accepted, or would you rather try to
> indirectly educate people into writing better SQL?

There are some reasonable and generic optimizations that could be done
here. Being able to inline subqueries with aggregates into joins would
be a good thing e.g. transform your query into this:

SELECT t1.* FROM table t1 JOIN table t2 ON (t2.type = t1.type)
WHERE t2.timestamp > t1.timestamp
GROUP BY t1.* HAVING COUNT(t2.*)=0

However, this is probably still worse than a NOT EXISTS query.

I am less excited about turning "COUNT(x)=0" query to NOT EXISTS
because that's just a bad way to write a query.

Regards,
Marti