Thread: Predicate migration on complex self joins
In some cases, we have SQL being submitted that has superfluous self-joins. An example would be select count(*) from foo1 a, foo1 b where a.c1 = b.c1 /* PK join */ and a.c2 = 5 and b.c2 = 10; We can recognise that <a> and <b> are the same table because they are joined on the PK. PK is never NULL, so a join b == a in set terms. We can use this to re-write the query as if all predicates on either of the two aliases were on the LHS only. e.g. rewrite query like this: select count(*) from foo1 a, foo1 b where a.c1 = b.c1 and a.c2 = 5 and a.c2 = 10; /* predicate migration */ Predicate migration is important because it either allows us to detect impossible logic, as above, or to use multi-column index access/ bitmap scans, or to allow join removal of the RHS as a superfluous join. (I believe that self-joins were not originally part of the analysis of potentially removable joins). You may well ask who would be stupid enough to write SQL like that. The answer is of course that it is automatically generated by an ORM. Implementing something along these lines is secondary to join removal, but it seems worth noting as non-high priority item for the TODO. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> wrote: > select count(*) > from foo1 a, foo1 b > where a.c1 = b.c1 /* PK join */ > You may well ask who would be stupid enough to write SQL like that. > The answer is of course that it is automatically generated by an > ORM. We had to do something like that to get acceptable performance from Sybase ASE. That code probably has not been changed since migrating to PostgreSQL, and since we have a strong portability mandate, it probably should be left alone, since the penalty for the extra join in PostgreSQL is small and the penalty for not having it in Sybase ASE is large. In short, it would be a welcome optimization here, although (in our case) a relatively minor one. -Kevin
Simon Riggs <simon@2ndQuadrant.com> writes: > In some cases, we have SQL being submitted that has superfluous > self-joins. An example would be > select count(*) > from foo1 a, foo1 b > where a.c1 = b.c1 /* PK join */ > and a.c2 = 5 > and b.c2 = 10; > You may well ask who would be stupid enough to write SQL like that. The > answer is of course that it is automatically generated by an ORM. Seems like the right answer is "fix the damn ORM". It's hard to believe this sort of case comes up often enough to justify the cycles that would be expended (on *every* join query) to try to recognize it. regards, tom lane
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Simon Riggs <simon@2ndQuadrant.com> wrote: >> select count(*) >> from foo1 a, foo1 b >> where a.c1 = b.c1 /* PK join */ > We had to do something like that to get acceptable performance from > Sybase ASE. Writing a join for a single-table query? Why, in heavens name? (Or have you mercifully blotted the details from your memory?) regards, tom lane
On Mon, 2009-07-13 at 13:33 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > In some cases, we have SQL being submitted that has superfluous > > self-joins. An example would be > > > select count(*) > > from foo1 a, foo1 b > > where a.c1 = b.c1 /* PK join */ > > and a.c2 = 5 > > and b.c2 = 10; > > > You may well ask who would be stupid enough to write SQL like that. The > > answer is of course that it is automatically generated by an ORM. > > Seems like the right answer is "fix the damn ORM". It's hard to believe > this sort of case comes up often enough to justify the cycles that would > be expended (on *every* join query) to try to recognize it. Yeh, damn ORMs seem to spring up faster than vines. Not just because of this but I wonder if we might benefit from an optimizer setting specifically aimed at the foolishnesses of automatically generated SQL. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Mon, Jul 13, 2009 at 1:33 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> In some cases, we have SQL being submitted that has superfluous >> self-joins. An example would be > >> select count(*) >> from foo1 a, foo1 b >> where a.c1 = b.c1 /* PK join */ >> and a.c2 = 5 >> and b.c2 = 10; > >> You may well ask who would be stupid enough to write SQL like that. The >> answer is of course that it is automatically generated by an ORM. > > Seems like the right answer is "fix the damn ORM". It's hard to believe > this sort of case comes up often enough to justify the cycles that would > be expended (on *every* join query) to try to recognize it. I think it's more common than you might think. It's been requested on -performance within recent memory, and I've had cases where I needed to deal with it as well. You can't write: DELETE FROM table AS alias LEFT JOIN othertable ... so you end up writing: DELETE FROM table AS alias USING sametable LEFT JOIN othertable ... or sometimes: DELETE FROM table AS alias USING viewthatconstainstable ... ...Robert
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Writing a join for a single-table query? Why, in heavens name? > (Or have you mercifully blotted the details from your memory?) Actually, I had only the vaguest recollection of why, but I found an email where I was explaining the problem to Sybase. Basically, it boiled down to a corner case involving the intersection of named caches and index optimizations similar to what Heikki's currently developing. If we did a search such as: SELECT searchName FROM Party WHERE searchName LIKE 'PET%,RANDY%' where searchName was the first column of an index bound to a named cache, it would scan the range of the index where searchName >= 'PET' and searchName < 'PEU', determine which rows actually matched the whole pattern, and access the heap pages only for those rows which matched the pattern. (In this case, 298 rows.) As long as only columns from the index were returned, there were only 298 access to the heap. Now, if you added a column which was not in the index, it went to the heap before checking the full pattern, so it went to the heap 87,632 times for the above criteria, and returned the same 298 rows. Since the primary key columns were in all indexes (to allow use of READ UNCOMMITTED :-/ ), we could select those columns without driving it to the heap, so we used the first table reference just for selecting the rows, and joined back to the same table on primary key to get the values to return. We could not convince Sybase that they should fix that issue. -Kevin
On Jul 13, 2009, at 1:06 PM, Simon Riggs wrote: > Not just because of this but I wonder if we might benefit from an > optimizer setting specifically aimed at the foolishnesses of > automatically generated SQL. +1. And it's not just ORMs that do stupid things, I've seen crap like this come out of users too (not this exact case, but similar). Perhaps what we really want is an "optimization level" GUC so that users can tell the backend how much overhead they want the optimizer to spend on trying to work around stupidity... :) -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
On Mon, Jul 13, 2009 at 3:48 PM, decibel<decibel@decibel.org> wrote: > On Jul 13, 2009, at 1:06 PM, Simon Riggs wrote: >> >> Not just because of this but I wonder if we might benefit from an >> optimizer setting specifically aimed at the foolishnesses of >> automatically generated SQL. > > > +1. And it's not just ORMs that do stupid things, I've seen crap like this > come out of users too (not this exact case, but similar). > i've this come from users most of the time... the few systems i have seen that generate sql, try to avoid using complex queries and make simple ones and the JOIN them at the app level... > Perhaps what we really want is an "optimization level" GUC so that users can > tell the backend how much overhead they want the optimizer to spend on > trying to work around stupidity... :) i wonder what the levels have to be: - smart_sql - application_generated_sql - normal_user_sql - xtremely_stupid_sql - what_the_hell_sql ;) -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
On Mon, Jul 13, 2009 at 07:06:40PM +0100, Simon Riggs wrote: > On Mon, 2009-07-13 at 13:33 -0400, Tom Lane wrote: > > It's hard to believe > > this sort of case comes up often enough to justify the cycles that would > > be expended (on *every* join query) to try to recognize it. > > Yeh, damn ORMs seem to spring up faster than vines. > > Not just because of this but I wonder if we might benefit from an > optimizer setting specifically aimed at the foolishnesses of > automatically generated SQL. The best suggestion I heard was to carry on optimizing until the plan looked cheap enough or all the options had been exhausted. In practical terms; I think this means doing the planning in two stages, try with all the "simple" optimizations and see if it results in less than "n" page accesses, if it's above this level then try again with all the optimizations turned on. -- Sam http://samason.me.uk/