Thread: Cross-table statistics idea
Since I don't recall any ideas ever having been thrown out on how to do this...

ISTM that we could gain additional insight into how many rows would likely result from a join by comparing the "shape" of the histograms for the joining columns. For example, if the histogram arrays were exactly identical, we're essentially stuck looking at the ratio of reltuples between the two tables. (AFAIK that's the only estimate we make today.) If one histogram ended at a value smaller than the start of the other histogram, we would estimate that no rows would result from an equi-join.

Am I right about how our estimates work right now? Where can I look in the code? Has anyone looked down this path in the past?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
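To make the no-overlap case concrete, here is a minimal sketch in C of the comparison being proposed. All the type and function names (Histogram, histograms_disjoint, equijoin_rows_estimate) are invented for illustration, and the fallback shown is the textbook ntuples_a * ntuples_b / max(nd_a, nd_b) equi-join estimate, not anything taken from the PostgreSQL source:

    #include <math.h>
    #include <stdbool.h>

    typedef struct Histogram
    {
        double *bounds;     /* sorted bucket boundaries, length nbounds */
        int     nbounds;
    } Histogram;

    /* true when the two histogram ranges cannot contain a common value */
    static bool
    histograms_disjoint(const Histogram *a, const Histogram *b)
    {
        double a_max = a->bounds[a->nbounds - 1];
        double b_max = b->bounds[b->nbounds - 1];

        return a_max < b->bounds[0] || b_max < a->bounds[0];
    }

    /* estimated output rows for an equi-join of the two columns */
    static double
    equijoin_rows_estimate(const Histogram *a, double atuples, double nd_a,
                           const Histogram *b, double btuples, double nd_b)
    {
        if (histograms_disjoint(a, b))
            return 0.0;     /* one range ends before the other begins */

        /* fallback: the standard textbook equi-join estimate */
        return atuples * btuples / fmax(nd_a, nd_b);
    }

Note that the disjointness test only needs the first and last histogram entries, which is why even this crude comparison would be cheap.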
"Jim C. Nasby" <jim@nasby.net> writes: > ISTM that we could gain additional insight on how many rows would likely > result from a join be comparing the "shape" of the histogram for the > joining columns. eqjoinsel already does this for the case of comparing the MCV lists. If you're serious about using the histograms: well, maybe, but it seems to require far stronger assumptions about the behavior of the datatype than we use now. > Am I right about how our estimates work right now? Where can I look in > the code? Has anyone looked down this path in the past? src/backend/utils/adt/selfuncs.c regards, tom lane
On Tue, 2006-09-26 at 21:27 -0500, Jim C. Nasby wrote:
> Since I don't recall any ideas ever having been thrown out on how to
> do this...
>
> ISTM that we could gain additional insight into how many rows would
> likely result from a join

One thing we can do is to use cross-column relationships to improve the estimation of Ndistinct.

Suppose we have a table

    Order_line (PK orderId, lineNum)

and we look at lineNum and see it has on average 10 distinct values per orderId. We can then use this to compute that Ndistinct for orderId should be -0.1, i.e. the number of distinct values is proportional to the number of rows, with a factor of 10.

Right now, if there are more than 10 lineNums per orderId on average, we never decide that orderId is a scalable statistic.

I propose adding a final step to ANALYZE that applies a cross-column rule after all columns have been analysed. If all except one column of a PK have very low Ndistinct, we can use that to calculate a minimum Ndistinct for the column with a high number of values. If that minimum is greater than the Ndistinct estimate made in isolation, we overlay the new value.

This is desirable because the estimation of Ndistinct is very sensitive to the number of matching rows in the sample, so Ndistinct estimates are usually very poor when the true Ndistinct is large. The estimates for low Ndistinct are much better, so we can use them, with their lower standard error, to correct the in-isolation estimates of other columns.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com
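A sketch of the proposed overlay rule for the two-column case, using pg_statistic's stadistinct convention (a positive value is an absolute distinct count; a negative value means -stadistinct * ntuples, i.e. the count scales with table size). The function names are hypothetical, and this ignores details like null fractions and PKs with more than two columns:

    /* decode a stadistinct-style value into an absolute distinct count */
    static double
    absolute_ndistinct(double stadistinct, double ntuples)
    {
        return stadistinct < 0 ? -stadistinct * ntuples : stadistinct;
    }

    /*
     * For a PK (a, b), uniqueness of the pair implies
     *     ndistinct(a) * ndistinct(b) >= ntuples
     * so ndistinct(a) >= ntuples / ndistinct(b).  If that floor beats
     * the estimate made in isolation, overlay it.  Returns the result
     * in scaled (negative) form: with 10 lineNums per orderId, the
     * answer for orderId comes out as -0.1.
     */
    static double
    overlay_pk_ndistinct(double stadistinct_a, double stadistinct_b,
                         double ntuples)
    {
        double nd_b    = absolute_ndistinct(stadistinct_b, ntuples);
        double floor_a = ntuples / nd_b;    /* assumes nd_b > 0 */
        double nd_a    = absolute_ndistinct(stadistinct_a, ntuples);

        if (floor_a > nd_a)
            nd_a = floor_a;     /* in-isolation estimate was too low */

        return -(nd_a / ntuples);
    }

For simplicity this always returns the scaled form; the real ANALYZE only stores a negative stadistinct when it decides the statistic scales with the table.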
On Wed, Sep 27, 2006 at 12:30:43PM +0100, Simon Riggs wrote:
> On Tue, 2006-09-26 at 21:27 -0500, Jim C. Nasby wrote:
> > Since I don't recall any ideas ever having been thrown out on how to
> > do this...
> >
> > ISTM that we could gain additional insight into how many rows would
> > likely result from a join
>
> One thing we can do is to use cross-column relationships to improve
> the estimation of Ndistinct.
>
> Suppose we have a table
>
>     Order_line (PK orderId, lineNum)
>
> and we look at lineNum and see it has on average 10 distinct values
> per orderId. We can then use this to compute that Ndistinct for
> orderId should be -0.1, i.e. the number of distinct values is
> proportional to the number of rows, with a factor of 10.
>
> Right now, if there are more than 10 lineNums per orderId on average,
> we never decide that orderId is a scalable statistic.
>
> I propose adding a final step to ANALYZE that applies a cross-column
> rule after all columns have been analysed. If all except one column of
> a PK have very low Ndistinct, we can use that to calculate a minimum
> Ndistinct for the column with a high number of values. If that minimum
> is greater than the Ndistinct estimate made in isolation, we overlay
> the new value.
>
> This is desirable because the estimation of Ndistinct is very
> sensitive to the number of matching rows in the sample, so Ndistinct
> estimates are usually very poor when the true Ndistinct is large. The
> estimates for low Ndistinct are much better, so we can use them, with
> their lower standard error, to correct the in-isolation estimates of
> other columns.

But wouldn't overlaying the value screw us if we wanted to look up something based on the unique field? (i.e., if there was a line_id in order_line and we wanted to look something up based on line_id)
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
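The reason an overlaid value could matter for single-column lookups is that equality selectivity on a non-MCV value is roughly 1/Ndistinct, so whatever number ends up stored feeds straight into the row estimate. A toy illustration of the concern, with made-up numbers:

    #include <stdio.h>

    int main(void)
    {
        double ntuples     = 1000000.0;
        double true_nd     = 1000000.0;  /* line_id really is unique */
        double overlaid_nd = 100000.0;   /* hypothetical bad overlay */

        /* rows expected from "WHERE line_id = ?" under each statistic */
        printf("true stats:     %.0f row(s)\n", ntuples / true_nd);
        printf("overlaid stats: %.0f row(s)\n", ntuples / overlaid_nd);
        return 0;
    }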