Thread: Hypothetical suggestions for planner, indexing improvement
Folks, An area in which postgresql planner & indexing could be improved have occurred to me over the last week. I'd like to share this ideas with you in case it is worthy of the todo list. Please excuse me if this issue is already dealt with in CVS; I've been unable to keep up completely on HACKERS lately. Please also excuse me if this issue has been discussed and was tabled due to some theoretical limitation, such as x^n scaling problems THE IDEA: The planner should keep statistics on the correlation of foreign keys and apply them to the expected row counts for EXISTS clause limitations, and possibly for other query types as well. To illustrate: Database "calendar" has two tables, events and event_days. Event_days has FK on column event_id to parent table Events. There is at lease one record in event_days for each record in events, and the average parent-child relationship is 1 event -> 1.15 event_days records. This query: SELECT events.* FROM events WHERE EXISTS (SELECT event_id FROM event_days WHERE event_day BETWEEN '2003-04-08' AND '2003-05-18'); Currently, (in 7.2.4 and 7.3.1) the planner makes the assumption that the above EXISTS restriction will only filter events by 50% and makes other join and execution plans accordingly. In fact, it filters events by 96% and the ideal execution plan should be quite different. It would be really keen if planner statistics could be expanded to include correlation on foriegn keys in order to make more intelligent planner decisions on the above type of query possible. Thanks for your attention! -- -Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > THE IDEA: The planner should keep statistics on the correlation of foreign > keys and apply them to the expected row counts for EXISTS clause limitations, > and possibly for other query types as well. It's a thought. Keeping complete cross-column correlation stats (for every combination of columns in the DB) is obviously out of the question. If you're gonna do it you need a heuristic to tell you which combinations of columns are worth keeping track of --- and foreign-key relationships seem like a reasonable guide to the interesting combinations. I'm not sure about the long-term usefulness of optimizing EXISTS per se. Seems to me that a lot of the present uses of EXISTS are workarounds for Postgres' historic mistreatment of IN ... which we've attacked more directly for 7.4. But cross-column correlations are certainly useful for estimating join sizes in general. regards, tom lane
Tom, > It's a thought. Keeping complete cross-column correlation stats (for > every combination of columns in the DB) is obviously out of the > question. If you're gonna do it you need a heuristic to tell you which > combinations of columns are worth keeping track of --- and foreign-key > relationships seem like a reasonable guide to the interesting > combinations. Yes. It would also make FKs something more than just an annoying (and slow) constraint in PostgreSQL. And it would be a performance feature that most other RDBMSs don't have ;-) > I'm not sure about the long-term usefulness of optimizing EXISTS per se. > Seems to me that a lot of the present uses of EXISTS are workarounds > for Postgres' historic mistreatment of IN ... which we've attacked more > directly for 7.4. But cross-column correlations are certainly useful > for estimating join sizes in general. EXISTS is more flexible than IN; how can you do a 3-column corellation on an IN clause? The reason that I mention EXISTS is because that's where the lack of cross-column corellation is most dramatic; the planner seems to estimate a flat 50% for EXISTS clauses regardless of the content. -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > The reason that I mention EXISTS is because that's where the lack of > cross-column corellation is most dramatic; the planner seems to estimate a > flat 50% for EXISTS clauses regardless of the content. No "seems to" about that one: see src/backend/optimizer/path/clausesel.c else if (is_subplan(clause)) { /* * Just for the moment! FIX ME! - vadim 02/04/98 */ s1 = (Selectivity) 0.5; } Patches to improve this are welcome ;-). But I'm not at all sure how to write something that would extract a reliable selectivity estimate from a subplan. regards, tom lane
On Mon, 5 May 2003, Josh Berkus wrote: > Tom, > > > It's a thought. Keeping complete cross-column correlation stats (for > > every combination of columns in the DB) is obviously out of the > > question. If you're gonna do it you need a heuristic to tell you which > > combinations of columns are worth keeping track of --- and foreign-key > > relationships seem like a reasonable guide to the interesting > > combinations. > > Yes. It would also make FKs something more than just an annoying (and slow) > constraint in PostgreSQL. And it would be a performance feature that most > other RDBMSs don't have ;-) That statement seams really strange, If FKs are really only a safety lock to stop you from putting bad data in your database, It makes them a bit pointless if you want a nice fast database and you can trust your users! This does not make them useless and I still have them but from a purely performance point of view they don't help currently! It may be worth adding Partial Matching FKs so that a user can mark that they think might be a useful match to do. This would help the fact that in many data sets NULL can mean more than one different thing. (Don't Know, None, etc) plus using the index on IS NOT NULL queries would be very handy when you need to know about all the records that you need to find the information out for, or all the records with no relationship. Peter Childs > > > I'm not sure about the long-term usefulness of optimizing EXISTS per se. > > Seems to me that a lot of the present uses of EXISTS are workarounds > > for Postgres' historic mistreatment of IN ... which we've attacked more > > directly for 7.4. But cross-column correlations are certainly useful > > for estimating join sizes in general. > > EXISTS is more flexible than IN; how can you do a 3-column corellation on an > IN clause? > > The reason that I mention EXISTS is because that's where the lack of > cross-column corellation is most dramatic; the planner seems to estimate a > flat 50% for EXISTS clauses regardless of the content. > >
On Mon, 2003-05-05 at 23:25, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > THE IDEA: The planner should keep statistics on the correlation of foreign > > keys and apply them to the expected row counts for EXISTS clause limitations, > > and possibly for other query types as well. > > It's a thought. Keeping complete cross-column correlation stats (for > every combination of columns in the DB) is obviously out of the > question. If you're gonna do it you need a heuristic to tell you which > combinations of columns are worth keeping track of --- and foreign-key > relationships seem like a reasonable guide to the interesting > combinations. How about generalizing this into something useful for all queries? And to make this problem not just guess work on the part of the optimizer, how about having a separate "backslash command" so that the DBA can add specific/important/crucial predicates that he needs optimized. Thus, if there's a query with a large WHERE clause that has an optimized predicate inside it, the statistics would be used. > WHERE event_day BETWEEN '2003-04-08' AND '2003-05-18') What this sounds to me like, though, is that in order for it to work quickly, the postmaster would have to keep track in system tables each value of event_day and it's COUNT(*), thus more I/O overhead. -- +---------------------------------------------------------------+ | Ron Johnson, Jr. mailto:ron.l.johnson@cox.net | | Jefferson, LA USA http://members.cox.net/ron.l.johnson | | | | The purpose of the military isn't to pay your college tuition | | or give you a little extra income; it's to "kill people and | | break things". Surprisingly, not everyone understands that. | +---------------------------------------------------------------+
On Mon, May 05, 2003 at 09:33:33PM -0700, Josh Berkus wrote: > EXISTS is more flexible than IN; how can you do a 3-column corellation on an > IN clause? It would be nice to add support for multi-column IN.. WHERE (a, b, c) IN (SELECT a, b, c ...) BTW, does postgresql handle IN and EXISTS differently? Theoretically if the optimizer was good enough you could transform one to the other and not worry about it. Whenever possible, I try and use IN when the subselect will return a very small number of rows, since IN might be faster than EXISTS in that case, though it seems most optimizers tend to fall apart when they see ORs, and a lot of databases transform IN to a OR b OR c. -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Tue, May 06, 2003 at 12:25:33AM -0400, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > THE IDEA: The planner should keep statistics on the correlation of foreign > > keys and apply them to the expected row counts for EXISTS clause limitations, > > and possibly for other query types as well. > > It's a thought. Keeping complete cross-column correlation stats (for > every combination of columns in the DB) is obviously out of the > question. If you're gonna do it you need a heuristic to tell you which > combinations of columns are worth keeping track of --- and foreign-key > relationships seem like a reasonable guide to the interesting > combinations. What if the optimizer kept on-going statistics on what columns were used in joins to what other columns? Over time this would allow analyze to determine on it's own what cross-column correlation stats should be kept. -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes: > It would be nice to add support for multi-column IN.. > WHERE (a, b, c) IN (SELECT a, b, c ...) RTFM... > BTW, does postgresql handle IN and EXISTS differently? Yes. > Theoretically if the optimizer was good enough you could transform one > to the other and not worry about it. No. They have different responses to NULLs in the subselect result. regards, tom lane
> On Mon, May 05, 2003 at 09:33:33PM -0700, Josh Berkus wrote: > > EXISTS is more flexible than IN; how can you do a 3-column corellation on an > > IN clause? > > It would be nice to add support for multi-column IN.. > > WHERE (a, b, c) IN (SELECT a, b, c ...) Umm....we DO have that... Chris
On Tue, 6 May 2003, Tom Lane wrote: > > It would be nice to add support for multi-column IN.. > > WHERE (a, b, c) IN (SELECT a, b, c ...) > > RTFM... Maybe he did read the manual: 6.15.3. IN (subquery form) expression IN (subquery) The right-hand side of this form of IN is a parenthesized subquery, which must return exactly one column. -- /Dennis
On Tue, May 06, 2003 at 09:45:07AM -0400, Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > It would be nice to add support for multi-column IN.. > > WHERE (a, b, c) IN (SELECT a, b, c ...) > > RTFM... As someone pointed out, the documentation says you can't. In this case the docs are wrong (I've added a note). > > BTW, does postgresql handle IN and EXISTS differently? > > Yes. > > > Theoretically if the optimizer was good enough you could transform one > > to the other and not worry about it. > > No. They have different responses to NULLs in the subselect result. They appear to operate the same... what's different? -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes: > On Tue, May 06, 2003 at 09:45:07AM -0400, Tom Lane wrote: >> RTFM... > As someone pointed out, the documentation says you can't. In this case > the docs are wrong (I've added a note). Perhaps you should have read to the end of the section. >>> BTW, does postgresql handle IN and EXISTS differently? >> >> Yes. > > They appear to operate the same... what's different? Supposing that tab1.col1 contains 1, NULL, 2, then for an outer table row where col2 = 42 WHERE outer.col2 IN (SELECT col1 FROM tab1) will yield NULL (not FALSE). But WHERE EXISTS(SELECT * FROM tab1 WHERE col1 = outer.col2) will yield FALSE (not NULL). The distinction doesn't matter at the top level of WHERE, but it matters a lot underneath a NOT ... regards, tom lane
On Wed, 7 May 2003, Tom Lane wrote: > > As someone pointed out, the documentation says you can't. In this case > > the docs are wrong (I've added a note). > > Perhaps you should have read to the end of the section. Oh, you are right (of course). It is documented later on in that section. I guess it's specified in the SQL spec, but how come a tuple is not an expression? If it had been an expression the first part of that section would still apply, which is why I just read the first part. -- /Dennis
> Supposing that tab1.col1 contains 1, NULL, 2, then for an outer > table row where col2 = 42 > > WHERE outer.col2 IN (SELECT col1 FROM tab1) > > will yield NULL (not FALSE). But > > WHERE EXISTS(SELECT * FROM tab1 WHERE col1 = outer.col2) > > will yield FALSE (not NULL). > > The distinction doesn't matter at the top level of WHERE, but it > matters a lot underneath a NOT ... OK, but even if a true transform can't be done, couldn't they share the same set of code to fetch the data for the subquery? Going back to my original post, I tend to use IN only in cases where I think the subquery will return a small result-set, and use EXISTS elsewhere. Presumably, the subquery for an IN will only be run once, while EXISTS will be run as an inner-loop (I'm guessing here, I could be wrong). It might be useful if the subquery was executed based on how many rows it would/might return. -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
=?ISO-8859-1?Q?Dennis_Bj=F6rklund?= <db@zigo.dhs.org> writes: > I guess it's specified in the SQL spec, but how come a tuple is not an > expression? If it had been an expression the first part of that section > would still apply, which is why I just read the first part. Well, if you want to think of "(42, 'foo')" as being an expression then I suppose it could be considered to be all one syntax. Personally I think that'd be more confusing rather than less so. Another alternative is to put both syntax summaries at the top of the subsection, and combine the two textual descriptions into one. This is the wrong place for this discussion, though. If you want to have a go at rewriting the section, why not put up a sketch on pgsql-docs and see if people like it? regards, tom lane