Thread: Different execution plans for semantically equivalent queries
Hi all, I have a execution planner related issue that I'd like to have some help in understanding a bit deeper - I have a table which basically contains fields for a value, a timestamp and a record type which is an integer. I would like to do a query which retrieves the newest record for each type, and the persistence framework that I'm using does something which is structurally like SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE t2.type = t1.type AND t2.timestamp > t1.timestamp) On all of the PostgreSQL versions I've tried (9.0.2 included) this executes in about 20-30 seconds, which isn't exactly stellar. I've tried the (I think) equivalent SELECT * FROM table t1 WHERE NOT EXISTS (SELECT * FROM table t2 WHERE t2.type = t1.type AND t2.timestamp > t1.timestamp) instead, and that executes in about 100 ms, so it's about 200 times faster. The two statements have completely different execution plans, so I understand why there is a difference in performance, but as I'm unable to modify the SQL that the persistence framework generates I'd like to know if there's anything that I can do in order to make the first query execute as fast as the second one. I'm more specifically thinking whether I'm missing out on a crucial planner configuration knob or something like that, which causes the planner to treat the two cases differently. Best regards & thanks for an excellent database engine, Mikkel Lauritsen
Mikkel Lauritsen <renard@tala.dk> writes: > I would like to do a query which retrieves the newest record for each > type, and the persistence framework that I'm using does something > which is structurally like > SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE > t2.type = t1.type AND t2.timestamp > t1.timestamp) I suspect that *any* database is going to have trouble optimizing that. You'd be well advised to lobby the persistence framework's authors to produce less brain-dead SQL. The NOT EXISTS formulation seems to express what's wanted much less indirectly. regards, tom lane
Hi Tom et al, Many thanks for your prompt reply - you wrote: >> SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE >> t2.type = t1.type AND t2.timestamp > t1.timestamp) > > I suspect that *any* database is going to have trouble optimizing that. Okay, I expected that much. Just out of curiosity I've been looking a bit at the optimizer code in PostgreSQL, and it seems as if it would be at least theoretically possible to add support for things like transforming the query at hand into the NOT EXISTS form; a bit like how = NULL is converted to IS NULL. Would a change like that be accepted, or would you rather try to indirectly educate people into writing better SQL? > You'd be well advised to lobby the persistence framework's authors to > produce less brain-dead SQL. The NOT EXISTS formulation seems to > express what's wanted much less indirectly. Will do :-) For now I guess I'll hack it by wrapping a proxy around the JDBC driver and rewriting the SQL on the fly; I encounter other bugs in the persistence layer that are probably best handled that way as well. Best regards & thanks, Mikkel
On Mon, Feb 7, 2011 at 00:03, Mikkel Lauritsen <renard@tala.dk> wrote: >>> SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE >>> t2.type = t1.type AND t2.timestamp > t1.timestamp) >> >> I suspect that *any* database is going to have trouble optimizing that. > Just out of curiosity I've been looking a bit at the optimizer code > in PostgreSQL, and it seems as if it would be at least theoretically > possible to add support for things like transforming the query at > hand into the NOT EXISTS form; a bit like how = NULL is converted > to IS NULL. > > Would a change like that be accepted, or would you rather try to > indirectly educate people into writing better SQL? There are some reasonable and generic optimizations that could be done here. Being able to inline subqueries with aggregates into joins would be a good thing e.g. transform your query into this: SELECT t1.* FROM table t1 JOIN table t2 ON (t2.type = t1.type) WHERE t2.timestamp > t1.timestamp GROUP BY t1.* HAVING COUNT(t2.*)=0 However, this is probably still worse than a NOT EXISTS query. I am less excited about turning "COUNT(x)=0" query to NOT EXISTS because that's just a bad way to write a query. Regards, Marti