Thread: Hypothetical suggestions for planner, indexing improvement
Folks, An area in which postgresql planner & indexing could be improved have occurred to me over the last week. I'd like to share this ideas with you in case it is worthy of the todo list. Please excuse me if this issue is already dealt with in CVS; I've been unable to keep up completely on HACKERS lately. Please also excuse me if this issue has been discussed and was tabled due to some theoretical limitation, such as x^n scaling problems THE IDEA: The planner should keep statistics on the correlation of foreign keys and apply them to the expected row counts for EXISTS clause limitations, and possibly for other query types as well. To illustrate: Database "calendar" has two tables, events and event_days. Event_days has FK on column event_id to parent table Events. There is at lease one record in event_days for each record in events, and the average parent-child relationship is 1 event -> 1.15 event_days records. This query: SELECT events.* FROM events WHERE EXISTS (SELECT event_id FROM event_days WHERE event_day BETWEEN '2003-04-08' AND '2003-05-18'); Currently, (in 7.2.4 and 7.3.1) the planner makes the assumption that the above EXISTS restriction will only filter events by 50% and makes other join and execution plans accordingly. In fact, it filters events by 96% and the ideal execution plan should be quite different. It would be really keen if planner statistics could be expanded to include correlation on foriegn keys in order to make more intelligent planner decisions on the above type of query possible. Thanks for your attention! -- -Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > THE IDEA: The planner should keep statistics on the correlation of foreign > keys and apply them to the expected row counts for EXISTS clause limitations, > and possibly for other query types as well. It's a thought. Keeping complete cross-column correlation stats (for every combination of columns in the DB) is obviously out of the question. If you're gonna do it you need a heuristic to tell you which combinations of columns are worth keeping track of --- and foreign-key relationships seem like a reasonable guide to the interesting combinations. I'm not sure about the long-term usefulness of optimizing EXISTS per se. Seems to me that a lot of the present uses of EXISTS are workarounds for Postgres' historic mistreatment of IN ... which we've attacked more directly for 7.4. But cross-column correlations are certainly useful for estimating join sizes in general. regards, tom lane
Tom, > It's a thought. Keeping complete cross-column correlation stats (for > every combination of columns in the DB) is obviously out of the > question. If you're gonna do it you need a heuristic to tell you which > combinations of columns are worth keeping track of --- and foreign-key > relationships seem like a reasonable guide to the interesting > combinations. Yes. It would also make FKs something more than just an annoying (and slow) constraint in PostgreSQL. And it would be a performance feature that most other RDBMSs don't have ;-) > I'm not sure about the long-term usefulness of optimizing EXISTS per se. > Seems to me that a lot of the present uses of EXISTS are workarounds > for Postgres' historic mistreatment of IN ... which we've attacked more > directly for 7.4. But cross-column correlations are certainly useful > for estimating join sizes in general. EXISTS is more flexible than IN; how can you do a 3-column corellation on an IN clause? The reason that I mention EXISTS is because that's where the lack of cross-column corellation is most dramatic; the planner seems to estimate a flat 50% for EXISTS clauses regardless of the content. -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > The reason that I mention EXISTS is because that's where the lack of > cross-column corellation is most dramatic; the planner seems to estimate a > flat 50% for EXISTS clauses regardless of the content. No "seems to" about that one: see src/backend/optimizer/path/clausesel.c else if (is_subplan(clause)) { /* * Just for the moment! FIX ME! - vadim 02/04/98 */ s1 = (Selectivity) 0.5; } Patches to improve this are welcome ;-). But I'm not at all sure how to write something that would extract a reliable selectivity estimate from a subplan. regards, tom lane
"Jim C. Nasby" <jim@nasby.net> writes: > It would be nice to add support for multi-column IN.. > WHERE (a, b, c) IN (SELECT a, b, c ...) RTFM... > BTW, does postgresql handle IN and EXISTS differently? Yes. > Theoretically if the optimizer was good enough you could transform one > to the other and not worry about it. No. They have different responses to NULLs in the subselect result. regards, tom lane
Re: Hypothetical suggestions for planner, indexing improvement
From
"Zeugswetter Andreas SB SD"
Date:
> > The reason that I mention EXISTS is because that's where the lack of > > cross-column corellation is most dramatic; the planner seems to estimate a > > flat 50% for EXISTS clauses regardless of the content. > > No "seems to" about that one: see > src/backend/optimizer/path/clausesel.c > > else if (is_subplan(clause)) > { > /* > * Just for the moment! FIX ME! - vadim 02/04/98 > */ > s1 = (Selectivity) 0.5; > } I think the main issue cannot be the correlation in this case. In this FK PK case an events row exists for each row in the subselect, so the correlation must be above 1, how much above is irrelevant for the selectivity estimate (it is only relevant in the below 1 cases). The selectivity in this case can be estimated by estimating the number of rows returned from the subselect where the (in the example missing) PK-FK join condition is removed (here: select distinct event_id from event_day where event_day BETWEEN '2003-04-08' AND '2003-05-18'). (selectivity = min (1, e. rows of subselect / e. rows of main select)) The "event_day BETWEEN '2003-04-08' AND '2003-05-18'" is what really reduces the result set here, and that is not used. Andreas PS: in the example the subselect join clause event_id=events.event_id is missing
Andreas, > PS: in the example the subselect join clause event_id=events.event_id is > missing Sorry. I was cutting-and-pasting to reduce the complexity of the query. The full query is actually posted on -PERFORMANCE, as is a workaround to the "exists selectivity problem". -- Josh Berkus Aglio Database Solutions San Francisco
On Tue, 2003-05-06 at 00:45, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > The reason that I mention EXISTS is because that's where the lack of > > cross-column corellation is most dramatic; the planner seems to estimate a > > flat 50% for EXISTS clauses regardless of the content. > > No "seems to" about that one: see src/backend/optimizer/path/clausesel.c > > else if (is_subplan(clause)) > { > /* > * Just for the moment! FIX ME! - vadim 02/04/98 > */ > s1 = (Selectivity) 0.5; > } > > Patches to improve this are welcome ;-). But I'm not at all sure how to > write something that would extract a reliable selectivity estimate from > a subplan. > given that we have so few GUC variables... would there be any merit in adding one that would allow folks to change this assumption? Robert Treat
On Tue, 6 May 2003, Tom Lane wrote: > > It would be nice to add support for multi-column IN.. > > WHERE (a, b, c) IN (SELECT a, b, c ...) > > RTFM... Maybe he did read the manual: 6.15.3. IN (subquery form) expression IN (subquery) The right-hand side of this form of IN is a parenthesized subquery, which must return exactly one column. -- /Dennis
On Tue, May 06, 2003 at 09:45:07AM -0400, Tom Lane wrote: > "Jim C. Nasby" <jim@nasby.net> writes: > > It would be nice to add support for multi-column IN.. > > WHERE (a, b, c) IN (SELECT a, b, c ...) > > RTFM... As someone pointed out, the documentation says you can't. In this case the docs are wrong (I've added a note). > > BTW, does postgresql handle IN and EXISTS differently? > > Yes. > > > Theoretically if the optimizer was good enough you could transform one > > to the other and not worry about it. > > No. They have different responses to NULLs in the subselect result. They appear to operate the same... what's different? -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes: > On Tue, May 06, 2003 at 09:45:07AM -0400, Tom Lane wrote: >> RTFM... > As someone pointed out, the documentation says you can't. In this case > the docs are wrong (I've added a note). Perhaps you should have read to the end of the section. >>> BTW, does postgresql handle IN and EXISTS differently? >> >> Yes. > > They appear to operate the same... what's different? Supposing that tab1.col1 contains 1, NULL, 2, then for an outer table row where col2 = 42 WHERE outer.col2 IN (SELECT col1 FROM tab1) will yield NULL (not FALSE). But WHERE EXISTS(SELECT * FROM tab1 WHERE col1 = outer.col2) will yield FALSE (not NULL). The distinction doesn't matter at the top level of WHERE, but it matters a lot underneath a NOT ... regards, tom lane
On Wed, 7 May 2003, Tom Lane wrote: > > As someone pointed out, the documentation says you can't. In this case > > the docs are wrong (I've added a note). > > Perhaps you should have read to the end of the section. Oh, you are right (of course). It is documented later on in that section. I guess it's specified in the SQL spec, but how come a tuple is not an expression? If it had been an expression the first part of that section would still apply, which is why I just read the first part. -- /Dennis
> Supposing that tab1.col1 contains 1, NULL, 2, then for an outer > table row where col2 = 42 > > WHERE outer.col2 IN (SELECT col1 FROM tab1) > > will yield NULL (not FALSE). But > > WHERE EXISTS(SELECT * FROM tab1 WHERE col1 = outer.col2) > > will yield FALSE (not NULL). > > The distinction doesn't matter at the top level of WHERE, but it > matters a lot underneath a NOT ... OK, but even if a true transform can't be done, couldn't they share the same set of code to fetch the data for the subquery? Going back to my original post, I tend to use IN only in cases where I think the subquery will return a small result-set, and use EXISTS elsewhere. Presumably, the subquery for an IN will only be run once, while EXISTS will be run as an inner-loop (I'm guessing here, I could be wrong). It might be useful if the subquery was executed based on how many rows it would/might return. -- Jim C. Nasby (aka Decibel!) jim@nasby.net Member: Triangle Fraternity, Sports Car Club of America Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
=?ISO-8859-1?Q?Dennis_Bj=F6rklund?= <db@zigo.dhs.org> writes: > I guess it's specified in the SQL spec, but how come a tuple is not an > expression? If it had been an expression the first part of that section > would still apply, which is why I just read the first part. Well, if you want to think of "(42, 'foo')" as being an expression then I suppose it could be considered to be all one syntax. Personally I think that'd be more confusing rather than less so. Another alternative is to put both syntax summaries at the top of the subsection, and combine the two textual descriptions into one. This is the wrong place for this discussion, though. If you want to have a go at rewriting the section, why not put up a sketch on pgsql-docs and see if people like it? regards, tom lane