Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics - Mailing list pgsql-hackers

From Nathan Boley
Subject Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics
Date
Msg-id 6fa3b6e20806091051j67347c3aj6658920e95855ef8@mail.gmail.com
In response to Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics  (Zeugswetter Andreas OSB sIT <Andreas.Zeugswetter@s-itsolutions.at>)
List pgsql-hackers
> Your argument seems to consider only columns having a normal
> distribution.

My example was based on normally distributed data because people are
generally familiar with the normal distribution and it is reasonably
common in practice.

> How badly does it fall apart for non-normal distributions?

This should work to the extent that there is a correlation between the
number of distinct values in a histogram bin and the 'size' of the
bin. I know that this isn't a very good answer, but it's a hard
problem in the general case.

However, for any non-uniform probability distribution there exists a
measure under which there is a non-zero correlation between the two. I
wrote up a hand-wavy, semi-formal argument at the end, but it's not
tough to see intuitively.

Just think about the graph of the CDF. The histogram bucket boundaries
are horizontal lines that are equally spaced from top to bottom. If we
rescale the x-axis such that there is a fixed distance between adjacent
distinct values, i.e. define the length of an interval as (ndistinct in
that interval)/(total ndistinct), then the only case in which the
correlation vanishes is when the CDF is f(x) = x, i.e. the distribution
is uniform. And even then we just get a zero coefficient, which is
exactly what we are implicitly assuming now.
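
Here is roughly what I mean, as a quick sketch (assuming numpy; the
sample size, number of buckets, and rounding to integers are arbitrary
choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
# an integer-valued, normally distributed column
data = np.round(rng.normal(loc=0, scale=100, size=100_000)).astype(int)

nbuckets = 20
# equal-frequency bucket boundaries, analogous to pg_stats.histogram_bounds
bounds = np.quantile(data, np.linspace(0, 1, nbuckets + 1))

widths, ndistinct = [], []
for lo, hi in zip(bounds[:-1], bounds[1:]):
    bucket = data[(data >= lo) & (data < hi)]  # sloppy at the last bound
    widths.append(hi - lo)
    ndistinct.append(len(np.unique(bucket)))

# positive for non-uniform data: the wide (tail) buckets hold more
# distinct values than the narrow ones near the mode
print(np.corrcoef(widths, ndistinct)[0, 1])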

Obviously we run into problems when
a) we have a poor estimate for ndistinct - but then we have worse problems
b) our length measure doesn't correspond well with ndistinct in an interval

Expanding on (b)...

Your mention of Zipfian distributions is a particularly good example of
where this could be poor. Right now (someone correct me if I'm wrong)
words are sorted alphabetically. However, if we wanted this estimator
to do a good job, we would sort them by their length or, better yet, by
their frequency in the English language (which is highly correlated
with length).

> (For instance, Zipfian distributions seem to be pretty common in database work, from what I've seen.)

This should work well for any power-law distribution. If people are
curious, I can rerun my previous example using a power-law distribution
instead of a normal one.
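
For instance, a power-law rerun of the sketch above would only change
the data-generation line (the exponent is an arbitrary choice):

# a heavy-tailed, power-law distributed column
data = rng.zipf(a=1.5, size=100_000)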

However, the easy way to think about all of this is that we're building
a linear model between ndistinct and histogram bucket width. It's
intuitive to expect a correlation between the two, and so the model
should have some predictive power.
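
Concretely, continuing the sketch above, fitting that model is one line
(np.polyfit here is just a stand-in for wherever the planner would
actually compute and store the coefficients):

# least-squares fit of per-bucket ndistinct against bucket width
slope, intercept = np.polyfit(widths, ndistinct, deg=1)
# predicted ndistinct for a bucket of a given width
est_ndistinct = [slope * w + intercept for w in widths]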

-Nathan


<somewhat formal - probably will be difficult to read without some
basic probability theory>
To see this, first note that we can alternatively define the uniform
distribution on [a,b] as the distribution whose CDF is a straight line
passing through both (a,0) and (b,1) (just integrate the PDF). So any
non-uniform distribution has a CDF whose slope is below 1/(b-a) at some
points and above it at others, implying the existence of an interval
[i1, i2] such that ( CDF(i2) - CDF(i1) ) < ( i2 - i1 )/(b-a). Then,
since the total probability must still sum to 1, there exists a second,
disjoint interval such that ( CDF(i2') - CDF(i1') ) > ( i2' - i1'
)/(b-a). In other words, one interval holds less probability mass than
its length would suggest under uniformity, and another holds more.
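
Or, in LaTeX notation (this is just a transcription of the statement
above, nothing new):

F_{\mathrm{unif}}(x) = \frac{x-a}{b-a}, \qquad
\exists\, [i_1, i_2]:\ F(i_2) - F(i_1) < \frac{i_2 - i_1}{b-a}, \qquad
\exists\, [i_1', i_2'] \text{ disjoint}:\
F(i_2') - F(i_1') > \frac{i_2' - i_1'}{b-a}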

Next, assume that the number of potential distinct values in an
interval scales linearly with the length of the interval. Although it
seems as if this assumption could be somewhat burdensome, there always
exists a measure under which it is true for sets with a defined order
relation. (As Tom remarked earlier, we are already making this
assumption.) To see this, consider defining length(i1, i2) as (the
number of distinct values in [i1, i2])/(total number of distinct
values), where the distinct values in [i1, i2] are those in the set
{ v | v >= i1 and v <= i2 }.
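
That is, with D standing for the column's set of distinct values:

\mathrm{length}(i_1, i_2)
  = \frac{\lvert \{\, v \in D : i_1 \le v \le i_2 \,\} \rvert}{\lvert D \rvert}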

Next, note that the joint distribution of independent, identically
distributed random variables is multinomial, with cell probabilities
given by the value of the PDF at each distinct point. I'll then state
without proof that, for IID draws, the expected number of distinct
values is maximized by the uniform distribution (this is pretty easy to
see: think about the binomial case - would you rather have cell
probabilities of ( 1.0, 0 ) or ( 0.5, 0.5 )?).
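
Spelling that binomial case out as a worked equation, with cell
probabilities (p, 1-p) and n draws:

E[\#\text{distinct}] = (1 - p^n) + (1 - (1-p)^n) = 2 - p^n - (1-p)^n

\frac{d}{dp}\left( 2 - p^n - (1-p)^n \right)
  = -n p^{n-1} + n (1-p)^{n-1} = 0
  \;\Rightarrow\; p = \tfrac{1}{2}

so ( 0.5, 0.5 ) gives 2 - 2^{1-n}, whereas ( 1.0, 0 ) gives exactly 1.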

Finally, note that the expected number of distinct values grows
sub-linearly in the length of the interval. This is pretty clear when
we consider the sparse case. As the number of potential entries (in
this case, proportional to the interval length) approaches infinity,
the probability of a new entry being distinct approaches 1. This means
that, in this limit, every new entry ends up being distinct, i.e. the
number of distinct values scales linearly in the number of new entries.
As the interval shrinks, new entries have some probability of being
repeats, and as the interval shrinks to 0, the probability of a new
entry being unique goes to zero. Since

a) there doesn't exist a linear relationship that contains the two
boundary points
b) the multinomial distribution of the PDF is continuous
c) the relationship is clearly decreasing

we can surmise that it is sub-linear.
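
A quick numeric check of that claim (plain Python; it assumes, for
simplicity, that values within the interval are equally likely - k
plays the role of the interval's potential ndistinct, n is the number
of entries):

n = 1000  # number of entries
for k in (10, 100, 1_000, 10_000, 100_000):
    # expected distinct values among n IID draws over k equally likely values
    expected = k * (1 - (1 - 1 / k) ** n)
    print(k, round(expected, 1))

The expected count flattens out as k grows - doubling the interval
eventually stops doubling the expected ndistinct - which is the
sub-linearity claimed above.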

Therefore, we have two intervals with sub- and super-linear slopes that
cancel one another. However, the change in ndistinct for the
super-linear interval scales super-linearly, while the change in
ndistinct for the sub-linear interval scales sub-linearly. So there is
a net correlation.

</somewhat formal>

