Thread: Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)
Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)
From
Tom Lane
Date:
I wrote:
> Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
>> SELECT COUNT(*) FROM
>> data_main AS dm,
>> postcodes AS p
>> WHERE dm.range BETWEEN p.range_from AND p.range_till

> Planner error ... because it doesn't have any good way to estimate the
> number of matching rows, it thinks that way is a bit more expensive than
> data_main as the outside, but in reality it seems a good deal cheaper:

BTW, it would get the right answer if it had recognized the WHERE clause
as a range restriction --- it still doesn't know exactly what fraction
of rows will match, but its default estimate is a great deal tighter for
"WHERE x > something AND x < somethingelse" than it is for two unrelated
inequality constraints.  Enough tighter that it would have gone for the
correct plan.

The problem is that it doesn't recognize the WHERE as a range constraint
on dm.range.  I thought for a moment that this might be a
recently-introduced bug, but actually the code is operating as designed:
clauselist_selectivity says

     * See if it looks like a restriction clause with a pseudoconstant
     * on one side.  (Anything more complicated than that might not
     * behave in the simple way we are expecting.)

"Pseudoconstant" in this context means "a constant, parameter symbol, or
non-volatile functions of these" ... so comparisons against values from
another table don't qualify.

It seems like we're missing a bet though.  Can anyone suggest a more
general rule?  Do we need for example to consider whether the relation
membership is the same in two clauses that might be opposite sides of a
range restriction?  It seems like

	a.x > b.y AND a.x < b.z

probably can be treated as a range restriction on a.x for this purpose,
but I'm much less sure that the same is true of

	a.x > b.y AND a.x < c.z

Thoughts?

			regards, tom lane
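[Editor's note: a minimal sketch of why the recognized-range default is "enough tighter". The constants mirror the planner's defaults of that era (DEFAULT_INEQ_SEL and DEFAULT_RANGE_INEQ_SEL in the selectivity code); treat the exact values, and the function name, as illustrative assumptions, not planner source.]

```python
# Illustrative sketch (not PostgreSQL code): two unrelated inequalities
# get a default selectivity each and simply multiply, while a recognized
# range restriction gets one much tighter combined default.

DEFAULT_INEQ_SEL = 1.0 / 3.0      # assumed default for a lone "x > c" clause
DEFAULT_RANGE_INEQ_SEL = 0.005    # assumed default for a recognized range

def estimate(nrows, recognized_as_range):
    """Estimated rows surviving 'x > lo AND x < hi' under each treatment."""
    if recognized_as_range:
        # One combined range restriction.
        return nrows * DEFAULT_RANGE_INEQ_SEL
    # Two "unrelated" inequalities: independent selectivities multiply.
    return nrows * DEFAULT_INEQ_SEL * DEFAULT_INEQ_SEL

nrows = 1_000_000
print(estimate(nrows, False))   # ~111111 rows: the plan looks expensive
print(estimate(nrows, True))    # 5000 rows: cheap enough for the good plan
```

The two-orders-of-magnitude gap between the estimates is what flips the plan choice in the query above.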
Re: Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)
From
"Jim C. Nasby"
Date:
On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
> Can anyone suggest a more general rule?  Do we need for example to
> consider whether the relation membership is the same in two clauses
> that might be opposite sides of a range restriction?  It seems like
>
> 	a.x > b.y AND a.x < b.z

In a case like this, you could actually look at the data in b and see
what the average range size is. If you wanted to get really fancy, the
optimizer could decide how best to access a based on each row of b.

> probably can be treated as a range restriction on a.x for this purpose,
> but I'm much less sure that the same is true of
>
> 	a.x > b.y AND a.x < c.z

Well, this could end up being much trickier, since who knows how b and c
are related. Though thinking about it, although I threw out the
row-by-row analysis idea to be glib, that would actually work in this
case; you could take a look at what b and c look like each time
'through the loop'.

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Re: Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)
From
Tom Lane
Date:
"Jim C. Nasby" <decibel@decibel.org> writes:
> On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
>> Can anyone suggest a more general rule?  Do we need for example to
>> consider whether the relation membership is the same in two clauses
>> that might be opposite sides of a range restriction?  It seems like
>>
>> 	a.x > b.y AND a.x < b.z

> In a case like this, you could actually look at the data in b and see
> what the average range size is.

Not with the current statistics --- you'd need some kind of cross-column
statistics involving both y and z.  (That is, I doubt it would be
helpful to estimate the average range width by taking the difference of
independently-calculated mean values of y and z ...)  But yeah, in
principle it would be possible to make a non-default estimate.

			regards, tom lane
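[Editor's note: a hedged illustration of why per-column statistics are not enough here. The two toy "tables" below have identical single-column distributions for y and for z, yet the join `x > y AND x < z` returns very different row counts, so any estimate built from the two columns independently must get at least one of them badly wrong. All data is invented.]

```python
# Two pairings of the same y values with the same z values: the marginal
# (per-column) statistics are identical, but the joint ranges differ.

xs = range(100)                       # stand-in for a.x, uniform 0..99

pairing_a = [(0, 10), (50, 60)]       # two narrow ranges
pairing_b = [(0, 60), (50, 10)]       # same y's, same z's, re-paired

def join_count(xs, ranges):
    """Count (x, row) pairs satisfying x > y AND x < z."""
    return sum(1 for x in xs for (y, z) in ranges if y < x < z)

# The per-column value sets are identical...
assert sorted(y for y, _ in pairing_a) == sorted(y for y, _ in pairing_b)
assert sorted(z for _, z in pairing_a) == sorted(z for _, z in pairing_b)

# ...but the join sizes are not.
print(join_count(xs, pairing_a))   # 18
print(join_count(xs, pairing_b))   # 59
```

This is the cross-column dependency Tom is pointing at: how y and z pair up within a row is invisible to independently-gathered column statistics.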
Re: Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)
From
Tom Lane
Date:
John A Meinel <john@arbash-meinel.com> writes:
> Actually, I think he was saying do a nested loop, and for each item in
> the nested loop, re-evaluate if an index or a sequential scan is more
> efficient.

> I don't think postgres re-plans once it has started, though you could
> test this in a plpgsql function.

It doesn't, and in any case that's a microscopic view of the issue.
The entire shape of the plan might change depending on what we think
the selectivity is --- much more than could be handled by switching
scan types at the bottom level.

Also, I anticipate that bitmap-driven index scans will change things
considerably here.  The range of usefulness of pure seqscans will drop
drastically...

			regards, tom lane
On Wed, 2005-04-06 at 18:09 -0400, Tom Lane wrote:
> I wrote:
> > Arjen van der Meijden <acmmailing@vulcanus.its.tudelft.nl> writes:
> >> SELECT COUNT(*) FROM
> >> data_main AS dm,
> >> postcodes AS p
> >> WHERE dm.range BETWEEN p.range_from AND p.range_till
>
> > Planner error ... because it doesn't have any good way to estimate the
> > number of matching rows, it thinks that way is a bit more expensive than
> > data_main as the outside, but in reality it seems a good deal cheaper:
>
> BTW, it would get the right answer if it had recognized the WHERE clause
> as a range restriction --- it still doesn't know exactly what fraction
> of rows will match, but its default estimate is a great deal tighter for
> "WHERE x > something AND x < somethingelse" than it is for two unrelated
> inequality constraints.  Enough tighter that it would have gone for the
> correct plan.
>
> The problem is that it doesn't recognize the WHERE as a range constraint
> on dm.range.

> Can anyone suggest a more general rule?  Do we need for example to
> consider whether the relation membership is the same in two clauses
> that might be opposite sides of a range restriction?  It seems like
>
> 	a.x > b.y AND a.x < b.z

Not sure we need a more general rule. There are only three ways to view
this pair of clauses:

i) it's a range constraint, i.e. BETWEEN
ii) it's the complement of that, i.e. NOT BETWEEN
iii) it's a mistake, but we're not allowed to take that path

Arjen's query, and your generalisation of it above, is a common type of
query - a lookup against a reference data table with begin/end
effective dates. It would be very useful if this was supported.

> probably can be treated as a range restriction on a.x for this purpose,
> but I'm much less sure that the same is true of
>
> 	a.x > b.y AND a.x < c.z

I can't think of a query that would use such a construct, and might even
conclude that it came from a very poorly normalised model. I would
suggest that this is much less common in practical use.
Best Regards, Simon Riggs
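[Editor's note: a minimal sketch of the begin/end effective-date lookup pattern Simon describes, the same `x > y AND x < z` shape the planner is being asked to recognize. The table contents and names (`rates`, `valid_from`, `valid_till`) are invented for illustration.]

```python
# Effective-date lookup: each probe date joins to the one reference row
# whose [valid_from, valid_till] range covers it -- in SQL,
# "WHERE d BETWEEN valid_from AND valid_till".

from datetime import date

# Reference table: (valid_from, valid_till, tax_rate)
rates = [
    (date(2004, 1, 1), date(2004, 12, 31), 0.19),
    (date(2005, 1, 1), date(2005, 12, 31), 0.21),
]

def rate_for(d):
    """Nested-loop lookup: scan the reference table for the covering range."""
    for valid_from, valid_till, rate in rates:
        if valid_from <= d <= valid_till:
            return rate
    return None

print(rate_for(date(2005, 4, 6)))   # 0.21
```

Because each probe typically matches at most one reference row, the true selectivity of such joins is far below what two independent inequality defaults would suggest.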
Re: Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
From
Bruno Wolff III
Date:
On Wed, Apr 06, 2005 at 18:09:37 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Can anyone suggest a more general rule?  Do we need for example to
> consider whether the relation membership is the same in two clauses
> that might be opposite sides of a range restriction?  It seems like
>
> 	a.x > b.y AND a.x < b.z
>
> probably can be treated as a range restriction on a.x for this purpose,
> but I'm much less sure that the same is true of
>
> 	a.x > b.y AND a.x < c.z
>
> Thoughts?

I think it makes sense to guess that a smaller fraction of the rows
will be returned when a column value is bounded above and below than
if it is only bounded on one side, even if the bounds aren't fixed.
You can certainly be wrong. The difference between this and the normal
case is that column statistics aren't normally going to be that
useful.

If date/time ranges are the common use for this construct, it might be
better to create date and/or time range types that use rtree or gist
indexes.
Re: Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
From
Tom Lane
Date:
Bruno Wolff III <bruno@wolff.to> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Can anyone suggest a more general rule?

> I think it makes sense to guess that a smaller fraction of the rows will
> be returned when a column value is bounded above and below than if it
> is only bounded on one side, even if the bounds aren't fixed. You can
> certainly be wrong.

Yeah, the whole thing is only a heuristic anyway.  I've been coming
around to the view that relation membership shouldn't matter, because
of cases like

	WHERE a.x > b.y AND a.x < 42

which surely should be taken as a range constraint.

			regards, tom lane
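[Editor's note: a hedged sketch (not planner code) of the relaxed rule Tom is converging on: treat a variable as range-restricted whenever one clause bounds it from below and another bounds it from above, regardless of what sits on the other side of each comparison. The clause representation is invented.]

```python
# Each clause is (var, op, other_side) with op in {'>', '<'}; the other
# side may be a constant or a column from any relation -- we ignore it.

def find_range_vars(clauses):
    """Return the set of variables bounded both below and above."""
    lower = {var for var, op, _ in clauses if op == '>'}
    upper = {var for var, op, _ in clauses if op == '<'}
    return lower & upper

# a.x > b.y AND a.x < 42  -->  a.x is range-restricted
print(find_range_vars([('a.x', '>', 'b.y'), ('a.x', '<', 42)]))   # {'a.x'}

# a single one-sided bound is not a range
print(find_range_vars([('a.x', '>', 'b.y')]))                     # set()
```

Under this rule both `a.x > b.y AND a.x < b.z` and `a.x > b.y AND a.x < c.z` would get the tighter range default; whether that is safe for the mixed-relation case is exactly the open question in the thread.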
Re: Recognizing range constraints (was Re: [PERFORM] Plan for relatively simple query seems to be very inefficient)
From
"Jim C. Nasby"
Date:
On Wed, Apr 06, 2005 at 06:35:10PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
> > On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
> >> Can anyone suggest a more general rule?  Do we need for example to
> >> consider whether the relation membership is the same in two clauses
> >> that might be opposite sides of a range restriction?  It seems like
> >>
> >> 	a.x > b.y AND a.x < b.z
>
> > In a case like this, you could actually look at the data in b and see
> > what the average range size is.
>
> Not with the current statistics --- you'd need some kind of cross-column
> statistics involving both y and z.  (That is, I doubt it would be
> helpful to estimate the average range width by taking the difference of
> independently-calculated mean values of y and z ...)  But yeah, in
> principle it would be possible to make a non-default estimate.

Actually, it might be possible to take a SWAG at it using the histogram
and correlation stats.

You know... since getting universally useful cross-column stats seems
to be pretty pie-in-the-sky, would it be possible to generate more
complex stats on the fly from a sampling of a table? If you're looking
at a fairly sizeable table ISTM it would be worth sampling the rows on
10 or 20 random pages to see what you get. In this case, you'd want to
know the average difference between two fields. Other queries might
want something different.

-- 
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
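[Editor's note: a rough sketch of Jim's on-the-fly sampling idea, with all names and data invented. Instead of precomputed cross-column statistics, read the rows on a handful of random pages and measure the average `range_till - range_from` directly.]

```python
# Sample a few random pages of a toy table and estimate the mean range
# width from just those rows.

import random

def sample_avg_width(pages, n_pages=10, seed=0):
    """pages: list of pages, each a list of (range_from, range_till) rows.
    Estimate the mean range width from a small random sample of pages."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    chosen = rng.sample(pages, min(n_pages, len(pages)))
    widths = [till - frm for page in chosen for (frm, till) in page]
    return sum(widths) / len(widths) if widths else None

# Toy table: 100 pages of 50 rows each, every range exactly 5 units wide.
pages = [[(i, i + 5) for i in range(50)] for _ in range(100)]
print(sample_avg_width(pages))   # 5.0
```

The cost is bounded (10-20 page reads) regardless of table size, which is what makes the idea plausible for plan time; the open question is whether a sample that small is reliable for skewed data.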
Tom Lane wrote:
> "Jim C. Nasby" <decibel@decibel.org> writes:
>
>> On Wed, Apr 06, 2005 at 06:09:37PM -0400, Tom Lane wrote:
>>
>>> Can anyone suggest a more general rule?  Do we need for example to
>>> consider whether the relation membership is the same in two clauses
>>> that might be opposite sides of a range restriction?  It seems like
>>>
>>> 	a.x > b.y AND a.x < b.z
>
>> In a case like this, you could actually look at the data in b and see
>> what the average range size is.
>
> Not with the current statistics --- you'd need some kind of cross-column
> statistics involving both y and z.  (That is, I doubt it would be
> helpful to estimate the average range width by taking the difference of
> independently-calculated mean values of y and z ...)  But yeah, in
> principle it would be possible to make a non-default estimate.
>
> 			regards, tom lane

Actually, I think he was saying do a nested loop, and for each item in
the nested loop, re-evaluate if an index or a sequential scan is more
efficient.

I don't think postgres re-plans once it has started, though you could
test this in a plpgsql function.

John
=:->