Thread: proposal : cross-column stats

proposal : cross-column stats

From
Tomas Vondra
Date:
Hi everyone,

one of the sessions I attended at PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates (and
then in suboptimal plans).

I was thinking about this issue before, but that session was the last
impulse that pushed me to try to hack up a proof of concept. So here it
is ;-)

Let's talk about one special case - I'll explain how the proposed
solution works, and then I'll explain how to make it more general, what
improvements are possible, what issues are there. Anyway this is by no
means a perfect or complete solution - it's just a starting point.

------------------------ Short introduction ------------------------

Say we have a table with two INT columns - col_a and col_b, and we want
to estimate the number of rows for a condition involving both columns:

   WHERE (col_a BETWEEN m AND n) AND (col_b BETWEEN p AND q)

When the columns are independent, doing the estimate is just a matter of
multiplication. When the columns are dependent, the estimate may be way off.

Let's assume there are histograms with 5 bins for both columns. What I
propose is basically building a 2D histogram. It kind of resembles a
contingency table.

So we have a table like this

col_b \ col_a  || 20% | 20% | 20% | 20% | 20% |
===============||==============================
          20%  ||  4% |  4% |  4% |  4% |  4% |
          20%  ||  4% |  4% |  4% |  4% |  4% |
          20%  ||  4% |  4% |  4% |  4% |  4% |
          20%  ||  4% |  4% |  4% |  4% |  4% |
          20%  ||  4% |  4% |  4% |  4% |  4% |
===============||==============================

where each column / row represents a bin in the original histograms, and
each cell holds the expected fraction of rows falling into it (for truly
independent columns, that's 4%).

For dependent columns the actual values may of course be very different
- e.g. for strongly dependent columns it might look like this

col_b \ col_a  || 20% | 20% | 20% | 20% | 20% |
===============||==============================
          20%  || 20% |  0% |  0% |  0% |  0% |
          20%  ||  0% | 20% |  0% |  0% |  0% |
          20%  ||  0% |  0% | 20% |  0% |  0% |
          20%  ||  0% |  0% |  0% | 20% |  0% |
          20%  ||  0% |  0% |  0% |  0% | 20% |
===============||==============================

To estimate the number of rows matching the condition, you'd sum
estimates for cells matching the condition. I.e. when the condition on
col_a matches the lowest 20% (the first histogram bin) and the condition
on col_b matches the lowest 20% of values, this corresponds to the first
cell (20% of rows).

The current optimizer estimates this to be 4%, as it believes the columns
are independent.

I'm not sure whether I've explained it well enough, but the essence of
the proposal is to build N-dimensional histograms (where N is the number
of columns covered by the statistics) just as we are building histograms
today.
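
To make the estimation step concrete, here is a minimal SQL sketch (not
part of the attached PoC - the table layout and names are invented for
the example) that sums the cells of a 2D histogram stored one row per
cell. Cells only partially covered by the ranges would have to be scaled
down, which this sketch ignores:

   -- hypothetical storage: one row per cell, with bin boundaries and the
   -- fraction of the table's rows observed in that cell during ANALYZE
   CREATE TABLE cross_histogram (
       a_lo INT, a_hi INT,      -- bin boundaries for col_a
       b_lo INT, b_hi INT,      -- bin boundaries for col_b
       cell_fraction FLOAT      -- fraction of rows in this cell
   );

   -- estimate for WHERE (col_a BETWEEN 30 AND 129)
   --              AND (col_b BETWEEN 234 AND 484):
   -- sum the fractions of all cells overlapping the query rectangle
   SELECT sum(cell_fraction) AS estimated_fraction
     FROM cross_histogram
    WHERE a_hi >= 30 AND a_lo <= 129      -- overlaps [30, 129] on col_a
      AND b_hi >= 234 AND b_lo <= 484;    -- overlaps [234, 484] on col_b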

----------------------- Proof of concept ---------------------------

I've hacked a nasty proof of concept in PL/pgSQL (see the attachment).
It creates two tables - test_table and cross_stats. The former contains
the data (in two columns), the latter is used to store cross-column
statistics of test_table.

Then there are several PL/pgSQL functions - two of them are really
important:

collect_stats(p_bins_a INT, p_bins_b INT)
  - this one is used to collect the stats, the parameters represent
    number of bins for the columns (sides of the contingency table)
  - collect_stats(10,10) will build contingency table with 100 cells

get_estimate(p_from_a INT, p_to_a INT, p_from_b INT, p_to_b INT)
  - computes estimate for the condition listed above (ranges on both
    columns)

So to run the PoC, just do something like this:

1) fill the table with some data

   INSERT INTO test_table SELECT round(random()*1000),
   round(random()*1000) FROM generate_series(1,100000);

2) collect the cross-column statistics

   SELECT collect_stats(10,10);

3) see the current estimate and the actual number of rows

   EXPLAIN ANALYZE SELECT * FROM test_table
                           WHERE (col_a BETWEEN 30 AND 129)
                             AND (col_b BETWEEN 234 AND 484);

4) see the estimate based on contingency table

   SELECT get_estimate(30, 129, 234, 484);

Just two very simple tests for now - col_a/col_b show the ranges from
the query, followed by the actual number of matching rows, the current
estimate, and finally the new estimates based on contingency tables with
various numbers of bins.

A) independent columns (proof that it produces just as good estimates
   as the current code)

   col_a   |   col_b   | actual | expected | 10x10 | 20x20 |
  [50,70]  |  [50,70]  |    41  |     40   |   41  |    47 |
  [50,250] |  [50,250] |  4023  |   4024   | 4436  |  3944 |
  [50,250] | [750,950] |  4023  |   3955   | 4509  |  3933 |

B) strongly positively dependent columns (actually col_a = col_b)

   col_a   |   col_b   | actual | expect | 10x10 | 20x20 | 100x100 |
  [50,70]  |  [50,70]  |   2143 |     57 |   391 |   729 |    1866 |
  [50,250] |  [50,250] |  20181 |   4111 | 15401 | 19983 |   19991 |
  [50,250] | [750,950] |      0 |   3977 |     0 |     0 |       0 |

  Which shows that it significantly improves the estimates.

Again, I've hacked that in about 1 hour, so it's really slow and I think
some of the estimates may be further improved. And the precision
obviously depends on the number of cells.

----------------------- Additional notes ---------------------------

1) The whole thing may be easily extended to more than two columns,
   just build N-dimensional cubes.

2) I really don't think we should collect stats for all combinations of
   columns of a table - I do like the Oracle-like approach where a DBA
   has to enable cross-column stats using an ALTER TABLE (for a
   particular list of columns).

   The only exception might be columns from a multi-column index. It
   might be quite efficient I guess?

3) There are independence tests for contingency tables (e.g. Pearson's
   Chi-squared test), so that it's easy to find out whether the columns
   are independent. In that case we can just throw away these stats and
   use the simple estimation (see the sketch below this list of notes).

   http://mathworld.wolfram.com/Chi-SquaredTest.html

4) Or we could store just those cells where expected and observed values
   differ significantly (may help if most of the values are independent,
   but there's a small glitch somewhere).

5) It does not make sense to collect stats for two lists of columns that
   share several columns - just collect cross stats for the union of the
   column lists and then 'drill up' as needed. An extreme case might be
   collecting stats for all columns in a table. But there are practical
   issues with this approach (huge tables, small values within cells).

6) The precision increases with the number of cells, but the number of
   cells grows exponentially. So it's necessary to choose a
   reasonable number of bins for the columns (might be part of the
   ALTER TABLE command, should not be firmly bound to the statistics
   target).

   At the cell level, the simple 'multiplication' estimates are actually
   used (but only when a cell is split by the range boundary).

7) All this is done under the assumption that all the columns are in a
   single table - when doing research in the archives I've noticed
   suggestions to use this to optimize joins. Maybe it can be done but
   I really don't know how to do that.

8) I'm not sure where to store this and in what form - generally the
   table does not need to be significantly more complex than cross_stats
   table from the PoC.

9) I've noticed demands to collect data about columns used frequently
   together in a single WHERE condition. Not sure how to do that and how
   to analyze the collected data. I think collecting data about expected
   and observed numbers of rows should be just fine - but that's kind of
   independent of this proposal.
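
A rough SQL sketch of the chi-squared test from note 3, computed over a
contingency table stored one row per cell - the table and column names
(cells with bin_a, bin_b, cell_count) are assumptions for the example,
not the schema used in the attached PoC:

   -- chi^2 = sum over cells of (observed - expected)^2 / expected,
   -- where expected = row_total * column_total / grand_total
   WITH totals AS (
       SELECT cell_count,
              sum(cell_count) OVER (PARTITION BY bin_a) AS row_total,
              sum(cell_count) OVER (PARTITION BY bin_b) AS col_total,
              sum(cell_count) OVER ()                   AS grand_total
         FROM cells
   )
   SELECT sum(pow(cell_count - row_total::float8 * col_total / grand_total, 2)
              / (row_total::float8 * col_total / grand_total)) AS chi_squared
     FROM totals;

A result that is large compared to the chi-squared distribution with
(bins_a - 1) * (bins_b - 1) degrees of freedom suggests the columns are
not independent; otherwise the cross-column stats could be thrown away
as described in note 3.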

regards
Tomas

Attachment

Re: proposal : cross-column stats

From
Martijn van Oosterhout
Date:
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Hi everyone,
>
> one of the ssesion I've attended on PgDay last week was Heikki's session
> about statistics in PostgreSQL. One of the issues he mentioned (and one
> I regularly run into) is the absence of cross-column stats. When the
> columns are not independent, this usually result in poor estimates (and
> then in suboptimal plans).

Very cool that you're working on this.

> Lets talk about one special case - I'll explain how the proposed
> solution works, and then I'll explain how to make it more general, what
> improvements are possible, what issues are there. Anyway this is by no
> means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
  anything with a b-tree operator class. What you've coded seems to rely
  on calculations on the values. Have you thought about how it could
  work for, for example, strings?

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.
Not that I suggest you fix this, but it's food for thought. Though
strictly speaking this is a different kind of correlation than what
you're looking at.

> 2) I really don't think we should collect stats for all combinations of
>    columns of a table - I do like the Oracle-like approach where a DBA
>    has to enable cross-column stats using an ALTER TABLE (for a
>    particular list of columns).
>
>    The only exception might be columns from a multi-column index. It
>    might be quite efficient I guess?

In the past it has been suggested to only do it for multi-column
indexes, but these days I find that in some situations I prefer to
make individual indexes and let the bitmap scan code combine them. So
perhaps it would be best to let it be configured by the DBA.

> 3) There are independence tests for contingency tables (e.g. Pearson's
>    Chi-squared test), so that it's easy to find out whether the columns
>    are independent. In that case we can just throw away these stats and
>    use the simple estimation.
>
>    http://mathworld.wolfram.com/Chi-SquaredTest.html

I think this would be good to include, if possible.

Actually, I wonder if the existing stats collection code could be
altered to attempt to calculate the correlation between columns as part
of its other work.

> 4) Or we could store just those cells where expected and observed values
>    differ significantly (may help if most of the values are indendent,
>    but there's a small glitch somewhere).

Compressing that grid would be useful, given that for many dimensions
most of the grid will not be interesting. In fact, storing the 20
largest values may be enough. Worth an experiment.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: proposal : cross-column stats

From
Heikki Linnakangas
Date:
On 12.12.2010 15:17, Martijn van Oosterhout wrote:
> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Very cool that you're working on this.

+1

>> Lets talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or complete solution - it's just a starting point.
>
> It looks like you handled most of the issues. Just a few points:
>
> - This is obviously applicable to more than just integers, probably
>    anything with a b-tree operator class. What you've coded seems rely
>    on calculations on the values. Have you thought about how it could
>    work for, for example, strings?
>
> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.

Yeah, and that's actually analogous to the example I used in my 
presentation.

The way I think of that problem is that once you know the postcode, 
knowing the city name doesn't add any information. The postcode implies 
the city name. So the selectivity for "postcode = ? AND city = ?" should 
be the selectivity of "postcode = ?" alone. The measurement we need is 
"implicativeness": How strongly does column A imply a certain value for 
column B. Perhaps that could be measured by counting the number of 
distinct values of column B for each value of column A, or something 
like that. I don't know what the statisticians call that property, or if 
there's some existing theory on how to measure that from a sample.
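
One crude way to measure that on a sample might be something like this
(just a sketch - the table and column names are invented): if the average
number of distinct cities per postcode is close to 1, the postcode
essentially implies the city.

   SELECT avg(cities_per_postcode) AS avg_cities_per_postcode
     FROM (SELECT postcode, count(DISTINCT city) AS cities_per_postcode
             FROM addresses
            GROUP BY postcode) AS per_postcode;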

That's assuming the combination has any matches. It's possible that the 
user chooses a postcode and city combination that doesn't exist, but 
that's no different from a user doing "city = 'fsdfsdfsd'" on a single 
column, returning no matches. We should assume that the combination 
makes sense.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Hi!

Dne 12.12.2010 15:17, Martijn van Oosterhout napsal(a):
>> Lets talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or complete solution - it's just a starting point.
> 
> It looks like you handled most of the issues. Just a few points:
> 
> - This is obviously applicable to more than just integers, probably
>   anything with a b-tree operator class. What you've coded seems rely
>   on calculations on the values. Have you thought about how it could
>   work for, for example, strings?

Yes, I know, I just forgot to address this in my previous e-mail. The
contingency tables have a really nice feature - they are based on
splitting the sets into groups (~ bins of the histograms for each
column). And this can be done whenever you can sort the values - you
really don't need any calculations. So it should work with strings.

And another thing I somehow forgot is handling the case when there is no
histogram, just MCV. That's mostly the same - each of the values might
be a separate group, or the values might be grouped to form fewer groups,
etc.

> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.
> Not that I suggest you fix this, but it's food for though. Though
> strictly speaking this is a different kind of correlation than what
> you're looking at.

Hmmm, I see. I think the proposal does not fix this particular case,
although it might improve the situation a little bit (limit the error
between expected and observed number of rows).

The problem is that once we get to a cell-level of the contingency
table, there is no additional (more detailed) information. So we're
stuck with the multiplication estimate, or something like that.

I was thinking about it actually, and I think we could collect some more
info - a correlation coefficient for each bin, or something like that.
But that was not part of my proposal, and I'm not sure how to do that.

>> 2) I really don't think we should collect stats for all combinations of
>>    columns of a table - I do like the Oracle-like approach where a DBA
>>    has to enable cross-column stats using an ALTER TABLE (for a
>>    particular list of columns).
>>
>>    The only exception might be columns from a multi-column index. It
>>    might be quite efficient I guess?
> 
> In the past it has been suggested to only do it for multi-column
> indexes, but I find these days I find in some situations I prefer to
> make individual indexes and let the bitmap scan code combine them. So
> perhaps it would be best to let it be configured by the DBA.

Yes, I prefer individual indexes too.

The idea behind collecting cross-column stats for multi-column indexes
was that maybe we could 'append' this to the current functionality
(building the index or something like that) so that it does not
introduce significant performance problems.

>> 3) There are independence tests for contingency tables (e.g. Pearson's
>>    Chi-squared test), so that it's easy to find out whether the columns
>>    are independent. In that case we can just throw away these stats and
>>    use the simple estimation.
>>
>>    http://mathworld.wolfram.com/Chi-SquaredTest.html
> 
> I think this would be good to include, if possible. 
> 
> Actually, I wonder if the existing stats collection code could be
> altered to attempt to calculate the correlation between columns as part
> of its other work.

I guess that would be rather expensive - to compute correlation you need
two passes, and you need to do that for each pair of columns. So I'd be
surprised if it is possible (and effective).

Another thing is that you can compute correlation only for numeric
columns, so it's not possible to do that for city/ZIP code mentioned
above. More precisely - it's possible to do that (if you map strings to
numbers somehow), but I doubt you'll get useful results as the
assignment is rather random.

Well, you could ask the governments to assign ZIP codes to cities in
strictly alphabetical order, but I guess they'll say no.

>> 4) Or we could store just those cells where expected and observed values
>>    differ significantly (may help if most of the values are indendent,
>>    but there's a small glitch somewhere).
> 
> Comrpessing that grid would be useful, given that for many dimensions
> most of the grid will be not interesting. In fact, storing the 20
> largest values may be enough. Worth an experiment.

Not exactly just the largest values - rather values that are
significantly different from the expected values. Generally there are
two interesting cases

expected << observed - The optimizer may choose index scan, although
                       the seq scan would be better.

expected >> observed - The optimizer may choose seq scan, although
                       the index scan would be better.

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a):
>> The classic failure case has always been: postcodes and city names.
>> Strongly correlated, but in a way that the computer can't easily see.
> 
> Yeah, and that's actually analogous to the example I used in my
> presentation.
> 
> The way I think of that problem is that once you know the postcode,
> knowing the city name doesn't add any information. The postcode implies
> the city name. So the selectivity for "postcode = ? AND city = ?" should
> be the selectivity of "postcode = ?" alone. The measurement we need is
> "implicativeness": How strongly does column A imply a certain value for
> column B. Perhaps that could be measured by counting the number of
> distinct values of column B for each value of column A, or something
> like that. I don't know what the statisticians call that property, or if
> there's some existing theory on how to measure that from a sample.

Yes, those issues are a righteous punishment for breaking BCNF rules ;-)

I'm not sure it's solvable using the contingency tables, as it requires
knowledge about dependencies between individual values (working with
cells is not enough, although it might improve the estimates).

Well, maybe we could collect these stats (number of cities for a given
ZIP code and number of ZIP codes for a given city). Collecting good
stats about this is a bit tricky, but possible. What about collecting
this for the MCVs from both columns?

Tomas


Re: proposal : cross-column stats

From
Florian Pflug
Date:
On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
> The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any
> information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the
> selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a
> certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for
> each value of column A, or something like that. I don't know what the statisticians call that property, or if there's
> some existing theory on how to measure that from a sample.

The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the
assumption or knowledge of B. The basic tool for working with conditional probabilities is Bayes' theorem, which states
that

P(A|B) = P(A and B) / P(B).

Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like a=3)
does not change under additional assumptions like b=4. Bayes' theorem thus becomes

P(A) = P(A and B) / P(B)    <=>
P(A and B) = P(A)*P(B)

which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".
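
To make that concrete, here is a small sketch (against a hypothetical table t with columns a and b) comparing the
observed P(A and B) with the product P(A)*P(B) that the planner assumes; a large difference means the independence
assumption is what breaks the estimate:

   SELECT sum(CASE WHEN a = 3 THEN 1 ELSE 0 END)::float8 / count(*)           AS p_a,
          sum(CASE WHEN b = 4 THEN 1 ELSE 0 END)::float8 / count(*)           AS p_b,
          sum(CASE WHEN a = 3 AND b = 4 THEN 1 ELSE 0 END)::float8 / count(*) AS p_a_and_b
     FROM t;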

I believe that measuring this by counting the number of distinct values of column B for each A is basically the right
idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and
compare that to the overall number of distinct values of "b"...

A (very) quick search on scholar.google.com for "estimate conditional probability" didn't turn up anything useful, but
it's hard to believe that there isn't at least some literature on the subject.

best regards,
Florian Pflug

Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 12.12.2010 17:33, Florian Pflug napsal(a):
> On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
>> The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any
>> information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the
>> selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a
>> certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for
>> each value of column A, or something like that. I don't know what the statisticians call that property, or if there's
>> some existing theory on how to measure that from a sample.
> 
> The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the
> assumption or knowledge of B. The basic tool for working with conditional probabilities is Bayes' theorem, which
> states that
> 
> P(A|B) = P(A and B) / P(B).
> 
> Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like
> a=3) does not change under additional assumptions like b=4. Bayes' theorem thus becomes
> 
> P(A) = P(A and B) / P(B)    <=>
> P(A and B) = P(A)*P(B)
> 
> which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".
> 
> I believe that measuring this by counting the number of distinct values of column B for each A is basically the right
> idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and
> compare that to the overall number of distinct values of "b"...

Good point!

Well, I was thinking about this too - generally this means creating a
contingency table with the MCV as bins. Then you can compute these
interesting probabilities P(A and B). (OK, now I definitely look like
some contingency table weirdo, who tries to solve everything with a
contingency table. OMG!)

The question is - what are we going to do when the values in the query
are not in the MCV list? Is there some heuristic to estimate the
probability from the MCV, or something like that? Could we use some
"average" probability or what?

Tomas


Re: proposal : cross-column stats

From
Robert Haas
Date:
On Sun, Dec 12, 2010 at 9:43 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> The way I think of that problem is that once you know the postcode, knowing
> the city name doesn't add any information. The postcode implies the city
> name. So the selectivity for "postcode = ? AND city = ?" should be the
> selectivity of "postcode = ?" alone. The measurement we need is
> "implicativeness": How strongly does column A imply a certain value for
> column B. Perhaps that could be measured by counting the number of distinct
> values of column B for each value of column A, or something like that. I
> don't know what the statisticians call that property, or if there's some
> existing theory on how to measure that from a sample.

This is a good idea, but I guess the question is what you do next.  If
you know that the "applicability" is 100%, you can disregard the
restriction clause on the implied column.  And if it has no
implicatory power, then you just do what we do now.  But what if it
has some intermediate degree of implicability?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 13.12.2010 01:05, Robert Haas napsal(a):
> This is a good idea, but I guess the question is what you do next.  If
> you know that the "applicability" is 100%, you can disregard the
> restriction clause on the implied column.  And if it has no
> implicatory power, then you just do what we do now.  But what if it
> has some intermediate degree of implicability?

Well, I think you've missed the e-mail from Florian Pflug - he actually
pointed out that the 'implicativeness' Heikki mentioned is called
conditional probability. And conditional probability can be used to
express the "AND" probability we are looking for (selectiveness).

For two columns, this is actually pretty straightforward - as Florian
wrote, the equation is
  P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

where P(B) may be estimated from the current histogram, and P(A|B) may
be estimated from the contingency table (see the previous mails). And "P(A and
B)" is actually the value we're looking for.

Anyway there really is no "intermediate" degree of applicability, it just
gives you the right estimate.

And AFAIR this is easily extensible to more than two columns, as
 P(A and B and C) = P(A and (B and C)) = P(A|(B and C)) * P(B and C)

so it's basically a recursion.

Well, I hope my statements are really correct - it's been a few years
since I gained my degree in statistics ;-)

regards
Tomas


Re: proposal : cross-column stats

From
Robert Haas
Date:
On Sun, Dec 12, 2010 at 8:46 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Dne 13.12.2010 01:05, Robert Haas napsal(a):
>> This is a good idea, but I guess the question is what you do next.  If
>> you know that the "applicability" is 100%, you can disregard the
>> restriction clause on the implied column.  And if it has no
>> implicatory power, then you just do what we do now.  But what if it
>> has some intermediate degree of implicability?
>
> Well, I think you've missed the e-mail from Florian Pflug - he actually
> pointed out that the 'implicativeness' Heikki mentioned is called
> conditional probability. And conditional probability can be used to
> express the "AND" probability we are looking for (selectiveness).
>
> For two columns, this is actually pretty straighforward - as Florian
> wrote, the equation is
>
>   P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

Well, the question is what data you are actually storing.  It's
appealing to store a measure of the extent to which a constraint on
column X constrains column Y, because you'd only need to store
O(ncolumns^2) values, which would be reasonably compact and would
potentially handle the zip code problem - a classic "hard case" rather
neatly.  But that wouldn't be sufficient to use the above equation,
because there A and B need to be things like "column X has value x",
and it's not going to be practical to store a complete set of MCVs for
column X for each possible value that could appear in column Y.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
> P(A|B) = P(A and B) / P(B).

Well, until this point we've discussed failure cases involving 'AND'
conditions. What about 'OR' conditions? I think the current optimizer
computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
found in backend/optimizer/path/clausesel.c:630).

Sometimes that may return nearly 2x the actual selectivity, but in
general it's a reasonable estimate. Are there any severe failure cases
that produce much worse estimates?

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 13.12.2010 03:00, Robert Haas napsal(a):
> Well, the question is what data you are actually storing.  It's
> appealing to store a measure of the extent to which a constraint on
> column X constrains column Y, because you'd only need to store
> O(ncolumns^2) values, which would be reasonably compact and would
> potentially handle the zip code problem - a classic "hard case" rather
> neatly.  But that wouldn't be sufficient to use the above equation,
> because there A and B need to be things like "column X has value x",
> and it's not going to be practical to store a complete set of MCVs for
> column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible
pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected
groups of columns (not for each possible group). And the contingency
table gives you the ability to estimate the probabilities needed to
compute the selectivity. Or am I missing something?

regards
Tomas


Re: proposal : cross-column stats

From
Robert Haas
Date:
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Dne 13.12.2010 03:00, Robert Haas napsal(a):
>> Well, the question is what data you are actually storing.  It's
>> appealing to store a measure of the extent to which a constraint on
>> column X constrains column Y, because you'd only need to store
>> O(ncolumns^2) values, which would be reasonably compact and would
>> potentially handle the zip code problem - a classic "hard case" rather
>> neatly.  But that wouldn't be sufficient to use the above equation,
>> because there A and B need to be things like "column X has value x",
>> and it's not going to be practical to store a complete set of MCVs for
>> column X for each possible value that could appear in column Y.
>
> O(ncolumns^2) values? You mean collecting such stats for each possible
> pair of columns? Well, I meant something different.
>
> The proposed solution is based on contingency tables, built for selected
> groups of columns (not for each possible group). And the contingency
> table gives you the ability to estimate the probabilities needed to
> compute the selectivity. Or am I missing something?

Well, I'm not real familiar with contingency tables, but it seems like
you could end up needing to store a huge amount of data to get any
benefit out of it, in some cases.  For example, in the United States,
there are over 40,000 postal codes, and some even larger number of
city names, and doesn't the number of entries go as O(m*n)?  Now maybe
this is useful enough anyway that we should Just Do It, but it'd be a
lot cooler if we could find a way to give the planner a meaningful
clue out of some more compact representation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: proposal : cross-column stats

From
Nathan Boley
Date:
>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>> compute the selectivity. Or am I missing something?
>
> Well, I'm not real familiar with contingency tables, but it seems like
> you could end up needing to store a huge amount of data to get any
> benefit out of it, in some cases.  For example, in the United States,
> there are over 40,000 postal codes, and some even larger number of
> city names, and doesn't the number of entries go as O(m*n)?

Not to mention that the model parameters will be impossible to estimate well.

( I've only scanned the thread, so sorry if Im repeating something
that's already been said )

My intuition is that storing the covariance structure will be
unworkable both technically and statistically for all but the simplest
of cases. That being said, I don't think the problem is a lack of ways
to parameterize the covariance structure ( there are several papers on
multi-dim histogram estimators, at least one of which talks about
databases explicitly, not to mention approaches like CART [1] ), but a
complete lack of infrastructure to do anything with the estimates. So
keep working on it ;-) ( but try to make the framework general enough
to allow better estimators ).

I wonder if a good first step might be to focus on the AND and OR
operators since they seem like the most common cases, and union and
intersection commute ( although it's important to remember that the
estimate variances do NOT ). That is, what about estimating the
selectivity of the condition

WHERE X=A and Y=B

by

f(A,B) = x_1*(selectivity(X = A) + selectivity( Y = B )) +
x_2*selectivity(X = A)*selectivity( Y = B ) + x_3

where x_{1,2,3} are parameters to be estimated from the data.

Another quick note: I think that storing the full contingency table is
wasteful since the marginals are already stored in the single column
statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
looking at this a couple years back ).

Best,
Nathan

[1] http://en.wikipedia.org/wiki/Classification_and_regression_tree
[2] http://en.wikipedia.org/wiki/Copula_%28statistics%29


Re: proposal : cross-column stats

From
tv@fuzzy.cz
Date:
> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Dne 13.12.2010 03:00, Robert Haas napsal(a):
>>> Well, the question is what data you are actually storing.  It's
>>> appealing to store a measure of the extent to which a constraint on
>>> column X constrains column Y, because you'd only need to store
>>> O(ncolumns^2) values, which would be reasonably compact and would
>>> potentially handle the zip code problem - a classic "hard case" rather
>>> neatly.  But that wouldn't be sufficient to use the above equation,
>>> because there A and B need to be things like "column X has value x",
>>> and it's not going to be practical to store a complete set of MCVs for
>>> column X for each possible value that could appear in column Y.
>>
>> O(ncolumns^2) values? You mean collecting such stats for each possible
>> pair of columns? Well, I meant something different.
>>
>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>> compute the selectivity. Or am I missing something?
>
> Well, I'm not real familiar with contingency tables, but it seems like
> you could end up needing to store a huge amount of data to get any
> benefit out of it, in some cases.  For example, in the United States,
> there are over 40,000 postal codes, and some even larger number of
> city names, and doesn't the number of entries go as O(m*n)?  Now maybe
> this is useful enough anyway that we should Just Do It, but it'd be a
> lot cooler if we could find a way to give the planner a meaningful
> clue out of some more compact representation.

Yes, storing a complete contingency table is not feasible in such cases.
My original proposal actually did not address this particular issue
(cities and ZIP codes) as it was based on simplified contingency tables
(with bins corresponding to current histograms, not to individual values).
So the number of values to store would grow much slower.

On the other hand, this generalization really makes it unusable in some
cases, and the issue we're discussing here (cities and ZIP codes) is one
of them. I think in such cases we could build a contingency table for MCV
and then use it to estimate those conditional probabilities we need, but I
expect it to be very tricky.

Thanks for the comments.

Tomas



Re: proposal : cross-column stats

From
Yeb Havinga
Date:
On 2010-12-13 03:28, Robert Haas wrote:
> Well, I'm not real familiar with contingency tables, but it seems like
> you could end up needing to store a huge amount of data to get any
> benefit out of it, in some cases.  For example, in the United States,
> there are over 40,000 postal codes, and some even larger number of
> city names, and doesn't the number of entries go as O(m*n)?  Now maybe
> this is useful enough anyway that we should Just Do It, but it'd be a
> lot cooler if we could find a way to give the planner a meaningful
> clue out of some more compact representation.
A sparse matrix that holds only 'implicative' (P(A|B) <> P(A)?)
combinations? Also, some information might be deduced from others. For
Heikki's city/region example, for each city it would be known that it is
100% in one region. In that case it suffices to store only that
information, since 0% in all other regions can be deduced. I wouldn't be
surprised if storing implicatures like this would reduce the size to O(n).

regards,
Yeb Havinga



Re: proposal : cross-column stats

From
tv@fuzzy.cz
Date:
> On 2010-12-13 03:28, Robert Haas wrote:
>> Well, I'm not real familiar with contingency tables, but it seems like
>> you could end up needing to store a huge amount of data to get any
>> benefit out of it, in some cases.  For example, in the United States,
>> there are over 40,000 postal codes, and some even larger number of
>> city names, and doesn't the number of entries go as O(m*n)?  Now maybe
>> this is useful enough anyway that we should Just Do It, but it'd be a
>> lot cooler if we could find a way to give the planner a meaningful
>> clue out of some more compact representation.
> A sparse matrix that holds only 'implicative' (P(A|B) <> P(A*B)?)
> combinations? Also, some information might be deduced from others. For
> Heikki's city/region example, for each city it would be known that it is
> 100% in one region. In that case it suffices to store only that
> information, since 0% in all other regions ca be deduced. I wouldn't be
> surprized if storing implicatures like this would reduce the size to O(n).

OK, but I'll leave this for the future. My plan is to build a small PoC,
just to see whether the contingency-table + probability-estimates approach
works for the failure case mentioned by Heikki. I'd like to do this by
the end of this week, if possible.

I'll read the articles mentioned by Nathan Boley (thanks for those links,
if you have more of them just let me know).

Once we have a solution that solves (or significantly improves) these
failure cases, we can make further plans (how to actually do that in the
code etc.).

BTW thanks for all the comments!

regards
Tomas



Re: proposal : cross-column stats

From
Tom Lane
Date:
Tomas Vondra <tv@fuzzy.cz> writes:
> Well, until this point we've discussed failure cases involving 'AND'
> conditions. What about 'OR' conditions? I think the current optimizer
> computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
> found in backend/optimizer/path/clausesel.c:630).

If you can solve the AND case, the OR case falls out of that.  Just
replace s1*s2 with a more accurate AND calculation.
        regards, tom lane


Re: proposal : cross-column stats

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>> compute the selectivity. Or am I missing something?

> Well, I'm not real familiar with contingency tables, but it seems like
> you could end up needing to store a huge amount of data to get any
> benefit out of it, in some cases.

The reason that this wasn't done years ago is precisely that nobody's
figured out how to do it with a tolerable amount of stats data and a
tolerable amount of processing time (both at ANALYZE time and during
query planning).  It's not hard to see what we'd ideally like to do;
it's getting from there to something useful in production that's hard.
        regards, tom lane


Re: proposal : cross-column stats

From
Joshua Tolley
Date:
On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
> Another quick note: I think that storing the full contingency table is
> wasteful since the marginals are already stored in the single column
> statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
> looking at this a couple years back ).

Josh Tolley still looks at it occasionally, though time hasn't permitted any
sort of significant work for quite some time. The multicolstat branch on my
git.postgresql.org repository will create an empirical copula for each
multi-column index, and stick it in pg_statistic. It doesn't yet do anything
useful with that information, nor am I convinced it's remotely bug-free. In a
brief PGCon discussion with Tom a while back, it was suggested a good place
for the planner to use these stats would be clausesel.c, which is responsible
for handling code such as "...WHERE foo > 4 AND foo > 5".

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 13.12.2010 16:34, Tom Lane napsal(a):
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Well, until this point we've discussed failure cases involving 'AND'
>> conditions. What about 'OR' conditions? I think the current optimizer
>> computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
>> found in backend/optimizer/path/clausesel.c:630).
> 
> If you can solve the AND case, the OR case falls out of that.  Just
> replace s1*s2 with a more accurate AND calculation.

Oh yeah, now I see - it's just the usual equation
  P(A or B) = P(A) + P(B) - P(A and B)

and we're estimating "P(A and B)" as P(A)*P(B).

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 13.12.2010 16:38, Tom Lane napsal(a):
> The reason that this wasn't done years ago is precisely that nobody's
> figured out how to do it with a tolerable amount of stats data and a
> tolerable amount of processing time (both at ANALYZE time and during
> query planning).  It's not hard to see what we'd ideally like to do;
> it's getting from there to something useful in production that's hard.

OK, I fully realize that. My plan is to simply

 (a) find out what statistics do we need to collect and how to use it
 (b) implement a really stupid inefficient solution
 (c) optimize in iterations, i.e. making it faster, consuming less
     space etc.

It will take time, it won't be perfect for a long time, but every
journey starts somewhere.

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 13.12.2010 18:59, Joshua Tolley napsal(a):
> On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
>> Another quick note: I think that storing the full contingency table is
>> wasteful since the marginals are already stored in the single column
>> statistics. Look at copulas [2] ( FWIW I think that Josh Tolley was
>> looking at this a couple years back ).
> 
> Josh Tolley still looks at it occasionally, though time hasn't permitted any
> sort of significant work for quite some time. The multicolstat branch on my
> git.postgresql.org repository will create an empirical copula each
> multi-column index, and stick it in pg_statistic. It doesn't yet do anything
> useful with that information, nor am I convinced it's remotely bug-free. In a
> brief PGCon discussion with Tom a while back, it was suggested a good place
> for the planner to use these stats would be clausesel.c, which is responsible
> for handling code such as "...WHERE foo > 4 AND foo > 5".

Well, that's good news ;-)

I've done a bit of research today, and I've found some quite interesting
papers on this topic (I did not have time to read them properly - in
most cases I've read just the title and abstract).

[1] Selectivity estimators for multidimensional range queries over real
    attributes
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.914

    This seems like a very good starting point. AFAIK it precisely
    describes what data need to be collected, how to do the math etc.

[2] Selectivity Estimation Without the Attribute Value Independence
    Assumption
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.8126

    This obviously deals with the independence problem. Haven't
    investigated it further, but it seems worth reading.

[3] On Analyzing the Cost of Queries with Multi-Attribute Restrictions
    and Sort Operations (A Cost Function for Uniformly Partitioned
    UB-Trees)
    http://mistral.in.tum.de/results/publications/MB00_ideas.pdf

    Describes something called a UB-Tree, and shows how it may be used
    to do estimates. Might be interesting as an alternative to the
    traditional histograms.

    There are more details about UB-Trees at
    http://mistral.in.tum.de/results/publications/

[4] http://www.dbnet.ece.ntua.gr/~nikos/edith/qopt_bibl/
    A rather nice collection of papers related to estimation (including
    some of the papers listed above).

Hm, I planned to finally read the "Understanding MySQL Internals" over
the Xmas ... that obviously won't happen.

regards
Tomas


Re: proposal : cross-column stats

From
Josh Berkus
Date:
Tomas,

>   (a) find out what statistics do we need to collect and how to use it
>   (b) implement a really stupid inefficient solution
>   (c) optimize in iterations, i.e. making it faster, consuming less
>       space etc.

I'll suggest again how to decide *which* columns to cross: whichever
columns are combined in composite indexes.  In version 2, allow the DBA
to specify combinations.

In the unlikely event that correlation could be reduced to a single
float number, it would be conceivable for each column to have an array
of correlation stats for every other column where correlation was
non-random; on most tables (i.e. ones with less than 100 columns) we're
not talking about that much storage space.

The main cost would be the time spent collecting that info ...

--
                                 -- Josh Berkus
                                    PostgreSQL Experts Inc.
                                    http://www.pgexperts.com


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 13.12.2010 22:50, Josh Berkus napsal(a):
> Tomas,
> 
>>   (a) find out what statistics do we need to collect and how to use it
>>   (b) implement a really stupid inefficient solution
>>   (c) optimize in iterations, i.e. making it faster, consuming less
>>       space etc.
> 
> I'll suggest again how to decide *which* columns to cross: whichever
> columns are combined in composite indexes.  In version 2, allow the DBA
> to specify combinations.
> 
> In the unlikely event that correlation could be reduced to a single
> float number, it would be conceivable for each column to have an array
> of correlation stats for every other column where correlation was
> non-random; on most tables (i.e. ones with less than 100 columns) we're
> not talking about that much storage space.
> 
> The main cost would be the time spent collecting that info ...

I think it's a bit early to discuss this, given the fact that we
don't have a working solution yet. But OK, let's discuss these options
anyway.

1) collecting them automatically for composite indexes

   I don't think this is a wise idea. The first versions definitely won't
   be very efficient, and collecting the data for each composite index
   means everyone will be hit by this inefficiency, even if he actually
   does not need that (e.g. the columns are independent so the current
   estimates are quite accurate, or he's not using those columns very
   often in the same WHERE clause).

   Another reason against this is that many DBAs don't actually use
   composite indexes - they simply create indexes on each column and let
   the bitmap index scan work it out. And this would not work for this
   case.

   And actually it's not very complicated to allow the DBA to do this -
   it can be a quite simple PL/pgSQL procedure.

2) collecting correlation for each pair of columns

   Again, you're effectively forcing everyone to pay the price even
   though he may not need the feature. Maybe we'll get there one day,
   but it's not a good idea to do that from the beginning.

   And the correlation itself has very limited use in real life, as
   it's not possible to compute it for character columns and is not
   very useful in case of some numerical columns (e.g. ZIP codes).

regards
Tomas


Re: proposal : cross-column stats

From
Pavel Stehule
Date:
2010/12/13 Josh Berkus <josh@agliodbs.com>:
> Tomas,
>
>>   (a) find out what statistics do we need to collect and how to use it
>>   (b) implement a really stupid inefficient solution
>>   (c) optimize in iterations, i.e. making it faster, consuming less
>>       space etc.
>
> I'll suggest again how to decide *which* columns to cross: whichever
> columns are combined in composite indexes.  In version 2, allow the DBA
> to specify combinations.

Is it really a good idea? A composite index can be created when the
single columns are not unique enough - (name, surname). A DBA
specification can be cheaper. We could set options for a relation, so
that could be used.

Pavel

>
> In the unlikely event that correlation could be reduced to a single
> float number, it would be conceivable for each column to have an array
> of correlation stats for every other column where correlation was
> non-random; on most tables (i.e. ones with less than 100 columns) we're
> not talking about that much storage space.
>
> The main cost would be the time spent collecting that info ...
>
> --
>                                  -- Josh Berkus
>                                     PostgreSQL Experts Inc.
>                                     http://www.pgexperts.com


Re: proposal : cross-column stats

From
Robert Haas
Date:
On Mon, Dec 13, 2010 at 5:45 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> I'll suggest again how to decide *which* columns to cross: whichever
>> columns are combined in composite indexes.  In version 2, allow the DBA
>> to specify combinations.
> I think this is a bit early to discuss this, given the fact that we
> don't have a working solution yet.

Agreed.  Not to mention that adding syntax is pretty easy.  The hard
part is figuring out what data you want to collect and what you want
to do with it.  If we can figure that out, the syntax will be simple
by comparison.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a):
> On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
>> Very cool that you're working on this.
>
> +1
>
>>> Lets talk about one special case - I'll explain how the proposed
>>> solution works, and then I'll explain how to make it more general, what
>>> improvements are possible, what issues are there. Anyway this is by no
>>> means a perfect or complete solution - it's just a starting point.
>>
>> It looks like you handled most of the issues. Just a few points:
>>
>> - This is obviously applicable to more than just integers, probably
>>    anything with a b-tree operator class. What you've coded seems rely
>>    on calculations on the values. Have you thought about how it could
>>    work for, for example, strings?
>>
>> The classic failure case has always been: postcodes and city names.
>> Strongly correlated, but in a way that the computer can't easily see.
>
> Yeah, and that's actually analogous to the example I used in my
> presentation.
>
> The way I think of that problem is that once you know the postcode,
> knowing the city name doesn't add any information. The postcode implies
> the city name. So the selectivity for "postcode = ? AND city = ?" should
> be the selectivity of "postcode = ?" alone. The measurement we need is
> "implicativeness": How strongly does column A imply a certain value for
> column B. Perhaps that could be measured by counting the number of
> distinct values of column B for each value of column A, or something
> like that. I don't know what the statisticians call that property, or if
> there's some existing theory on how to measure that from a sample.
>
> That's assuming the combination has any matches. It's possible that the
> user chooses a postcode and city combination that doesn't exist, but
> that's no different from a user doing "city = 'fsdfsdfsd'" on a single
> column, returning no matches. We should assume that the combination
> makes sense.

Well, I've read several papers this week, in an attempt to find out what
possibilities there are when implementing cross-column statistics. One
of the papers seems like a reasonable solution to this particular
problem - discrete data + equality queries.

The article is "A Bayesian Approach to Estimating The Selectivity of
Conjuctive Predicates" (written by Heimel, Markl and Murthy). It's
freely available as a PDF

http:://subs.emis.de/LNI/Proceedings/Proceedings144/52.pdf

I do have a PDF with my own notes in it (as annotations), let me know if
you're interested ...

The basic principle is that instead of 'attribute value independence'
(which is the problem with correlated columns) they assume 'uniform
correlation' which is a much weaker assumption.

In the end, all they need to compute an estimate is the number of distinct
values for each of the columns (we already have that in pg_stats) and the
number of distinct values for the group of columns in a query. They
really don't need any multidimensional histogram or something like that.
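
Just to illustrate what inputs the estimator needs, this is everything
that has to be collected (a direct, unoptimized sketch, run against the
zip_codes_usa table described below):

   -- per-column distinct counts (we already estimate these in pg_stats)
   -- plus the distinct count for the whole column group
   SELECT count(DISTINCT zip_code)          AS nd_zip,
          count(DISTINCT state)             AS nd_state,
          count(DISTINCT (zip_code, state)) AS nd_group
     FROM zip_codes_usa;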

The funny thing is I've implemented most of the PoC before reading the
article - the only difference was the way the estimates are combined.

------------------------ pros and cons --------------------------

So let's see the pros:

* it's reasonably simple, the overhead should be minimal I guess (we're
  already estimating distinct values for the columns, so I guess we can
  do that for multiple columns with reasonable impact)

* there are no 'advanced data structures' such as multi-dimensional
  histograms that are expensive to build and maintain

* it consistently produces better estimates than the current estimator
  based on attribute value independence assumption, and it's reasonably
  accurate (I've been playing with it for some time, so we'll need
  more tests of course)

* it's easily extensible to more columns

* I guess we could improve the estimates with our own heuristics, to
  catch the special case when one column is perfectly implied by
  another one (e.g. when state is implied by ZIP code) - we can catch
  that by 'distinct for combination = distinct for column' and use just
  the estimate for one of the columns

but there are some cons too:

* this really is not designed to work with real-valued data, it's a
  very nice solution for discrete data (namely 'labels' as for example
  city/state names, ZIP codes etc.)

* you need to know the list of columns in advance (there's nothing like
  'let's build the stats for columns (a,b,c) and then query just (a,b)')
  as you need to know the count of distinct values for the queried
  columns (well, maybe it's possible - I guess we could 'integrate'
  over all possible values of the omitted column)

* it's built for 'equality' queries, although maybe we could modify it
  for inequalities too (but it won't be as elegant), but I guess the
  equality queries are much more common in case of discrete data (OK,
  you can run something like

    WHERE (zip_code BETWEEN '49389' AND '54980') AND state = '48'

  but I guess that's not very frequent). And I think we could still use
  the solution described in the paper.

------------------------ proof of concept --------------------------

I've prepared another 'proof of concept' example - see the attachment.
It works with data from

  http://www.census.gov/geo/www/tiger/zip1999.zip

which is a list of US ZIP codes from 1999, along with info on GPS,
state, county etc. Basically a perfect example of the fail case
described above. It's in a DBF format, so use http://pgdbf.sf.net (or
something like that) to extract the data.

Then use the script to create a table zip_codes_usa (with appropriate
data types), table 'cross_stats' to store the stats (number of distinct
values) and plpgsql functions collect_stats, get_estimate and
get_frequency.

Run the 'collect_stats' with two columns as parameters, i.e.

  SELECT collect_stats('zip_code', 'state');

and then you can run the 'get_estimate' with values for the columns,
i.e. something like this

  SELECT get_estimate('07034', '34');

which prints some debug notices and returns three integers - the estimates
for the two columns separately, and then the combined result. So what
really matters is the third number ...

Some basic tests, compared with the current estimator:

a) queries involving 'zip_code' - I've added the heuristics
   described above, so it actually returns the best estimate (unless
   the combination with state does not make sense, of course)

     ZIP  | state | actual | current estimate | get_estimate
   =========================================================
    07034 | 34    | 32     | 1                | 32
    07034 | 32    | 0      | 1                | 32

   There are 32 rows with this combination as I've copied the data
   multiple times to make the table bigger.

   The 'better performance' of the current estimator in the second case
   is actually just a lucky coincidence, a side-effect of the
   independence assumption. And as Heikki already pointed out, we should
   assume the combination makes sense, so this actually is not an
   advantage of the current estimator.

b) queries involving 'post_name' and 'state' - this can't use the
   heuristics as in (a), as there are 18,940 different post names,
   59 states and 30,113 combinations. On the other hand, post name
   implies state.

    post name | state | actual | current estimate | get_estimate
   =========================================================
    ESSEX     | 09    | 32     | 1                | 75
    ELMSFORD  | 36    | 32     | 4                | 188

   Hm, in this case the estimate is not as good as in the first case,
   but I guess it's still a bit better than the current estimate. In
   the first case it's about 2.5x overestimated (compared to 32x
   underestimate of the current estimate), and about 5x overestimated
   in the second case (compared to 8x underestimate of the current one).

   Again, with the invalid combination (e.g. ESSEX/36) the current
   estimator would perform better.

regards
Tomas



Attachment

Re: proposal : cross-column stats

From
Robert Haas
Date:
On Fri, Dec 17, 2010 at 12:58 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> In the end, all they need to compute an estimate is number of distinct
> values for each of the columns (we already have that in pg_stats) and a
> number of distinct values for the group of columns in a query. They
> really don't need any multidimensional histogram or something like that.

I haven't read the paper yet (sorry) but just off the top of my head,
one possible problem here is that our n_distinct estimates aren't
always very accurate, especially for large tables.  As we've discussed
before, making them accurate requires sampling a significant
percentage of the table, whereas all of our other statistics can be
computed reasonably accurately by sampling a fixed amount of an
arbitrarily large table.  So it's possible that relying more heavily
on n_distinct could turn out worse overall even if the algorithm is
better.  Not sure if that's an issue here, just throwing it out
there...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Hi,

I've read about 10 papers about estimation this week, and I have gained
some insight. I'm not going to describe each of the papers here, I plan
to set up a wiki page about cross column statistics
 http://wiki.postgresql.org/wiki/Cross_Columns_Stats

and I'll put the list of papers and some basic info there. There was a
lot of discussion in this thread, and it's difficult to follow it, so
I'll put there some details about the proposal etc.

So I'll just briefly describe the interesting things I've learned from
those papers.

---------------------- instances of the problem ----------------------

Generally, there are four quite different cases of the problem.

First, the columns may be either real-valued or discrete. And by
'discrete' I mean the value is a label rather than a quantity. It does not
make sense to add or subtract them, etc. So for example city names or
zip codes are discrete values (although zip codes are numbers etc).

Second, the queries consist of equality or inequality (range)
conditions.

So actually there are four instances of the problem:
                 | equality conditions | inequality conditions
 ================|=====================|=======================
 discrete values |          A          |           D
 numeric values  |          C          |           B
 ----------------------------------------------------------------

I think those four problems should be treated separately, although some
of the techniques may be common.

A) discrete values and equality conditions

  One of the papers (A Bayesian Approach to Estimating The Selectivity
  of Conjunctive Predicates) describes a quite interesting solution to
  this problem - I've already posted a description on how to apply it
  to the ZIP code / city name problem - see this
 
  http://archives.postgresql.org/pgsql-hackers/2010-12/msg01576.php

B) numeric values and inequality conditions

  Most of the papers dealing with this problem are based on
  discretization and multi-dimensional histograms to approximate the
  joint distribution. So I believe my initial proposal was not complete
  nonsense.

  Once we have a working histogram-based solution, we can work on
  precision and efficiency (how much space is needed to store it, how
  long does it take to compute an estimate, etc.). There are two ways
  to do that.
 
  First, most of the papers offer an enhanced type of histogram
  (although the histogram proposed in each paper is always evaluated as
  the most efficient one, which is a bit suspicious). Anyway there are
  quite promising advanced ways to build multi-dimensional histograms.

  Second, the paper "Selectivity estimation using probabilistic
  models" describes a solution based on bayesian networks. That seems
  really promising (actually it addresses join selectivity estimation
  too).

  So yes, I'm quite confident we can solve this issue and improve the
  efficiency - either by an advanced histogram or using bayesian
  networks.
 

C) numeric values and equality conditions

  OK, I'm not sure how to solve this case. But the queries against
  numeric columns are range queries in most cases I guess, so maybe
  that's not that big a deal.
 

D) discrete values and inequality conditions

  Basically, this can be handled just like numeric values after
  discretization, i.e. using a histogram. But this is usually

E) a combination of discrete / numeric columns

  I'm not sure how to deal with this. Obviously it's possible to build
  a multi-dimensional histogram, and estimate as many queries as
  possible.
 

-----------------------------------------------------------------------

The papers describe some interesting alternative approaches, e.g. based
on SVD (singular value decomposition) or random sampling (or kernel
estimators, which are an enhanced version of sampling).

But there are various issues associated with those solutions. SVD is
applicable to 2D only, so we'd be stuck with 2 columns, and random
sampling sounds a bit suspicious to me.

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 17.12.2010 19:58, Robert Haas napsal(a):
> I haven't read the paper yet (sorry) but just off the top of my head,
> one possible problem here is that our n_distinct estimates aren't
> always very accurate, especially for large tables.  As we've discussed
> before, making them accurate requires sampling a significant
> percentage of the table, whereas all of our other statistics can be
> computed reasonably accurately by sampling a fixed amount of an
> arbitrarily large table.  So it's possible that relying more heavily
> on n_distinct could turn out worse overall even if the algorithm is
> better.  Not sure if that's an issue here, just throwing it out
> there...

Yes, you're right - the paper really is based on (estimates of) number
of distinct values for each of the columns as well as for the group of
columns.

AFAIK it will work with reasonably precise estimates, but the point is
you need an estimate of distinct values of the whole group of columns.
So when you want to get an estimate for queries on columns (a,b), you
need the number of distinct value combinations of these two columns.

And I think we're not collecting this right now, so this solution
requires scanning the table (or some part of it).

I know this is a weak point of the whole solution, but the truth is
every cross-column stats solution will have to do something like this. I
don't think we'll find a solution with 0 performance impact, without the
need to scan a sufficient part of the table.

That's why I want to make this optional so that the users will use it
only when really needed.

Anyway one possible solution might be to allow the user to set these
values manually (as in case when ndistinct estimates are not precise).

regards
Tomas


Re: proposal : cross-column stats

From
Tom Lane
Date:
Tomas Vondra <tv@fuzzy.cz> writes:
> AFAIK it will work with reasonably precise estimates, but the point is
> you need an estimate of distinct values of the whole group of columns.
> So when you want to get an estimate for queries on columns (a,b), you
> need the number of distinct value combinations of these two columns.

That seems likely to be even more unreliable than our existing
single-column estimates :-(
        regards, tom lane


Re: proposal : cross-column stats

From
Heikki Linnakangas
Date:
On 17.12.2010 23:13, Tomas Vondra wrote:
> Dne 17.12.2010 19:58, Robert Haas napsal(a):
>> I haven't read the paper yet (sorry) but just off the top of my head,
>> one possible problem here is that our n_distinct estimates aren't
>> always very accurate, especially for large tables.  As we've discussed
>> before, making them accurate requires sampling a significant
>> percentage of the table, whereas all of our other statistics can be
>> computed reasonably accurately by sampling a fixed amount of an
>> arbitrarily large table.  So it's possible that relying more heavily
>> on n_distinct could turn out worse overall even if the algorithm is
>> better.  Not sure if that's an issue here, just throwing it out
>> there...
>
> Yes, you're right - the paper really is based on (estimates of) number
> of distinct values for each of the columns as well as for the group of
> columns.
>
> AFAIK it will work with reasonably precise estimates, but the point is
> you need an estimate of distinct values of the whole group of columns.
> So when you want to get an estimate for queries on columns (a,b), you
> need the number of distinct value combinations of these two columns.
>
> And I think we're not collecting this right now, so this solution
> requires scanning the table (or some part of it).

Any idea how sensitive it is to the accuracy of that estimate on 
distinct value combinations? If we get that off by a factor of ten or a 
hundred, what kind of an effect does it have on the final cost estimates?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 17.12.2010 22:24, Tom Lane napsal(a):
> That seems likely to be even more unreliable than our existing
> single-column estimates :-(
> 
>             regards, tom lane

Well, yes :-(

I guess this is a place where we could use a multi-column index, if it
contains all the interesting columns. We could scan the index, not the
whole table.

That could save us a tremendous amount of I/O and should be quite precise
(unless it's severely bloated).

Another thing is those 'discrete' columns are usually quite stable, i.e.
there's usually a limited list of values and it does not change very
often. Think about ZIP codes, states, etc. And the combinations are
quite stable too (counties do not move to other states, etc.).

So I think scanning a reasonable part of a table should be enough in
these cases.

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 17.12.2010 22:41, Heikki Linnakangas napsal(a):
> On 17.12.2010 23:13, Tomas Vondra wrote:
>> Dne 17.12.2010 19:58, Robert Haas napsal(a):
>>> I haven't read the paper yet (sorry) but just off the top of my head,
>>> one possible problem here is that our n_distinct estimates aren't
>>> always very accurate, especially for large tables.  As we've discussed
>>> before, making them accurate requires sampling a significant
>>> percentage of the table, whereas all of our other statistics can be
>>> computed reasonably accurately by sampling a fixed amount of an
>>> arbitrarily large table.  So it's possible that relying more heavily
>>> on n_distinct could turn out worse overall even if the algorithm is
>>> better.  Not sure if that's an issue here, just throwing it out
>>> there...
>>
>> Yes, you're right - the paper really is based on (estimates of) number
>> of distinct values for each of the columns as well as for the group of
>> columns.
>>
>> AFAIK it will work with reasonably precise estimates, but the point is
>> you need an estimate of distinct values of the whole group of columns.
>> So when you want to get an estimate for queries on columns (a,b), you
>> need the number of distinct value combinations of these two columns.
>>
>> And I think we're not collecting this right now, so this solution
>> requires scanning the table (or some part of it).
> 
> Any idea how sensitive it is to the accuracy of that estimate on
> distinct value combinations? If we get that off by a factor of ten or a
> hundred, what kind of an effect does it have on the final cost estimates?

Well, not really - I haven't done any experiments with it. For two
columns the selectivity equation is
     (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

where A and B are columns, dist(X) is number of distinct values in
column X and sel(X) is selectivity of column X.
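Written out as a throwaway function (just a sketch - the function and
argument names are mine, and there's no clamping or special-casing):

  -- estimate of sel(A=a AND B=b) under the 'uniform correlation' assumption
  CREATE FUNCTION uc_estimate(sel_a float8, sel_b float8,
                              dist_a float8, dist_b float8,
                              dist_ab float8) RETURNS float8 AS $$
  BEGIN
    RETURN (dist_a * sel_a + dist_b * sel_b) / (2 * dist_ab);
  END;
  $$ LANGUAGE plpgsql;

  -- e.g. dist(A)=10, dist(B)=100, dist(A,B)=1000, sel(A)=1/10, sel(B)=1/100
  SELECT uc_estimate(0.1, 0.01, 10, 100, 1000);    -- gives 0.001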

dependency on dist(A), dist(B) and dist(A,B)
--------------------------------------------

The estimate is proportional to dist(A) and dist(B), and inversely
proportional to dist(A,B). So when you increase dist(A) and dist(B)
10x, you get 10x the original estimate. Similarly, increasing dist(A,B)
10x gives a 10x lower estimate.

upper bound
-----------

Unless dist(A) or dist(B) is greater than dist(A,B), the estimated
selectivity is bounded by

     (sel(A) + sel(B)) / 2

I guess we can treat (dist(A) > dist(A,B)) - which can only happen due
to estimation errors - as effectively

     dist(A) = dist(A,B)

i.e. we can use the heuristic I've already offered in the PoC.

lower bound
-----------

Not sure what the lower bound is, but it might be 0 if the dist(A) and
dist(B) are very small and/or dist(A,B) is huge.

regards
Tomas



Re: proposal : cross-column stats

From
Florian Pflug
Date:
On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
> Well, not really - I haven't done any experiments with it. For two
> columns selectivity equation is
> 
>      (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
> 
> where A and B are columns, dist(X) is number of distinct values in
> column X and sel(X) is selectivity of column X.

Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
if A and B are independent, the formula must reduce to sel(A) * sel(B),
and I cannot see how that'd work with the formula above.

best regards,
Florian Pflug



Re: proposal : cross-column stats

From
tv@fuzzy.cz
Date:
> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>> Well, not really - I haven't done any experiments with it. For two
>> columns selectivity equation is
>>
>>      (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>
>> where A and B are columns, dist(X) is number of distinct values in
>> column X and sel(X) is selectivity of column X.
>
> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
> if A and B are independent, the formula must reduce to sel(A) * sel(B),
> and I cannot see how that'd work with the formula above.

Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
conditional probability, as
  P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)

and "uniform correlation" assumption so that it's possible to replace the
conditional probabilities with constants. And those constants are then
estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).

So it does not reduce to sel(A)*sel(B) exactly, as the dist(A)/dist(A,B)
is just an estimate of P(B|A). The paper states that this works best for
highly correlated data, while for weakly correlated data it (at least)
matches the usual estimates.

I don't say it's perfect, but it seems to produce reasonable estimates.

Tomas



Re: proposal : cross-column stats

From
Florian Pflug
Date:
On Dec18, 2010, at 08:10 , tv@fuzzy.cz wrote:

>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>>> Well, not really - I haven't done any experiments with it. For two
>>> columns selectivity equation is
>>> 
>>>     (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>> 
>>> where A and B are columns, dist(X) is number of distinct values in
>>> column X and sel(X) is selectivity of column X.
>> 
>> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
>> if A and B are independent, the formula must reduce to sel(A) * sel(B),
>> and I cannot see how that'd work with the formula above.
> 
> Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
> conditional probability, as
> 
>   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)
> 
> and "uniform correlation" assumption so that it's possible to replace the
> conditional probabilities with constants. And those constants are then
> estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).

Ok, I think I understand this now. The selectivity equation actually
*does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
for sel(A).

Take the clause "A = x AND B = y" for example. Without knowing anything
about x and y, reasonable guesses for sel(A=x) and sel(B=y) are
  sel(A=x) = 1 / dist(A)
  sel(B=y) = 1 / dist(B).

This is also what we do currently, according to var_eq_non_const() in
src/backend/utils/adt/selfuncs.c, if we don't have any additional
knowledge about x (Actually, we also factor the probability of A being
NULL into this).

With these estimates, your formula becomes
 sel(A=x,B=y) = 1 / dist(A,B).

and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus
 sel(A=x,B=y) = sel(A=x) * sel(B=y).

If, however, y is a constant, then we use the MCVs to estimate sel(B=y)
(var_eq_const() in src/backend/utils/adt/selfuncs.c). If
 sel(B=y) ~= 0,

we'd currently also conclude that
 sel(A=x,B=y) ~= 0.

With the "uniform correlation" approach, we'd instead estimate
 sel(A=x,B=y) ~= sel(A=x) / dist(B)

assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated.
If dist(B) is small, this estimate is much worse than what we'd currently
get, since we've effectively ignored the information that the restriction
B=y alone guarantees that only very few rows will match.
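To put some (made-up) numbers on that: take dist(A) = 10, dist(B) = 100,
dist(A,B) = 1000 (i.e. uncorrelated), sel(A=x) = 1/dist(A) = 0.1 and an
MCV-based sel(B=y) = 0.0001:

  -- independence assumption vs. the averaged uniform-correlation formula
  SELECT 0.1 * 0.0001                               AS current_estimate,
         (10 * 0.1 + 100 * 0.0001) / (2 * 1000.0)   AS uniform_corr_estimate;
  -- 0.00001 vs. ~0.000505 - the near-zero sel(B=y) is effectively ignored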

best regards,
Florian Pflug



Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 18.12.2010 16:59, Florian Pflug napsal(a):
> On Dec18, 2010, at 08:10 , tv@fuzzy.cz wrote:
> 
>>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>>>> Well, not really - I haven't done any experiments with it. For two
>>>> columns selectivity equation is
>>>>
>>>>     (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>>>
>>>> where A and B are columns, dist(X) is number of distinct values in
>>>> column X and sel(X) is selectivity of column X.
>>>
>>> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
>>> if A and B are independent, the formula must reduce to sel(A) * sel(B),
>>> and I cannot see how that'd work with the formula above.
>>
>> Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
>> conditional probability, as
>>
>>   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)
>>
>> and "uniform correlation" assumption so that it's possible to replace the
>> conditional probabilities with constants. And those constants are then
>> estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).
> 
> Ok, I think I understand this now. The selectivity equation actually
> *does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
> for sel(A).
> 
> Take the clause "A = x AND B = y" for example. Without knowing anything
> about x and y, reasonable guesses for sel(A=x) and sel(B=y) are
> 
>   sel(A=x) = 1 / dist(A)
>   sel(B=y) = 1 / dist(B).
> 
> This is also what we do currently, according to var_eq_non_const() in
> src/backend/utils/adt/selfuncs.c, if we don't have any additional
> knowledge about x (Actually, we also factor the probability of A being
> NULL into this).
> 
> With these estimates, your formula becomes
> 
>   sel(A=x,B=y) = 1 / dist(A,B).
> 
> and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus
> 
>   sel(A=x,B=y) = sel(A=x) * sel(B=y).
> 
> If, however, y is a constant, then we use the MKVs to estimate sel(B=y)
> (var_eq_const() in src/backend/utils/adt/selfuncs.c). If
> 
>   sel(B=y) ~= 0,
> 
> we'd currently also conclude that
> 
>   sel(A=x,B=y) ~= 0.
> 
> With the "uniform correlation" approach, we'd instead estimate
> 
>   sel(A=x,B=y) ~= sel(A=x) / dist(B)
> 
> assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated.
> If dist(B) is small, this estimate is much worse than what we'd currently
> get, since we've effectively ignored the information that the restriction
> B=y alone guarantees that only very few rows will match.

Well, I guess you're right. Let's see two examples - uncorrelated and
higly correlated data (and then a few comments on what I think you're
missing).

uncorrelated
------------
Let there be 10 distinct values in A, 100 distinct values in B, and
there are all possible combinations (10x100 = 1000). All the values (and
combinations) are uniformly distributed.

The current estimate is correct, i.e.
  sel(A=a) = 1/10  = 1/dist(A)
  sel(B=b) = 1/100 = 1/dist(B)
  sel(A=a, B=b) = sel(A) * sel(B) = 1/1000   /* due to AVI */

the proposed estimate is

  sel(A=a, B=b) = (dist(A)*sel(A=a) + dist(B)*sel(B=b)) / (2*dist(A,B))
                = (10 * 1/10 + 100 * 1/100) / 2000
                = 1/1000

so actually it produces the same estimate, but that's due to the uniform
distribution assumption.

Let's say the selectivities for A and B are very different - A is 10x
less common, B is 10x more common (let's say we know this). Then the
current estimate is still correct, i.e. it gives

  sel(A=a) = 1/100
  sel(B=b) = 1/10
  => sel(A=a,B=b) = 1/1000

but the new estimate is

  (10 * 1/100 + 100 * 1/10) / 2000 = (101/10) / 2000 ~ 1/200

So yes, in case of uncorrelated data this overestimates the result.


highly correlated
-----------------
Again, let dist(A)=10, dist(B)=100. But in this case let B=>A, i.e. for
each value in B there is a unique value in A - say A=mod(B,10) or
something like that. So now dist(A,B)=100.

Assuming uniform distribution, the correct estimate is
 sel(A=a,B=b) = sel(B=b) = 1/100

and the current estimate is
 sel(A=a,B=b) = sel(A=a) * sel(B=b) = 1/10 * 1/100 = 1/1000

The proposed estimate is
 (10 * 1/10 + 100 * 1/100) / 200 = 2/200 = 1/100

which is correct.

Without the uniform distribution (again, sel(A)=1/100, sel(B)=1/10), we
currently get (just as before)
 sel(A=a,B=b) = 1/10 * 1/100 = 1/1000

which is 100x less than the correct value (sel(B)=1/10). The new
estimate yields
 (10*1/100 + 100*1/10) / 200 = (1010/100) / 200 ~ 1/20

which is not as accurate as with uniform distribution, but it's
considerably more accurate than the current estimate (1/1000).
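The highly correlated case is easy to reproduce if anyone wants to play
with actual numbers - a throwaway sketch (table and column names are
arbitrary):

  -- 100 distinct values of b, 10 distinct values of a, a implied by b
  CREATE TABLE corr_test AS
    SELECT i % 100 AS b, (i % 100) % 10 AS a
      FROM generate_series(1, 100000) s(i);

  -- actual selectivity of (a = 3 AND b = 53) is 1/100
  SELECT count(*)::float8 / 100000 FROM corr_test WHERE a = 3 AND b = 53;

  -- current estimate:  1/10 * 1/100                    = 1/1000
  -- proposed estimate: (10*1/10 + 100*1/100) / (2*100) = 1/100

Of course ANALYZE would only give us estimates of the ndistinct values,
not the exact 10 / 100 / 100 used in the comments.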


comments
--------
I really don't think this is a perfect solution giving 100% accurate
results in all cases. But the examples above illustrate that it gives
much better estimates in case of correlated data.

It seems to me you're missing one very important thing - this was not
meant as a new default way to do estimates. It was meant as an option
when the user (DBA, developer, ...) realizes the current solution gives
really bad estimates (due to correlation). In that case he could create
'cross-column' statistics on those columns, and the optimizer would use
that info to do the estimates.

This kind of eliminates the issue with uncorrelated data, as it would
not actually happen. OK, the user might create cross-column stats on
uncorrelated columns and he'd get incorrect estimates in that case, but
I really don't think we can solve this kind of problem.

regards
Tomas


Re: proposal : cross-column stats

From
Simon Riggs
Date:
On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> >> The proposed solution is based on contingency tables, built for selected
> >> groups of columns (not for each possible group). And the contingency
> >> table gives you the ability to estimate the probabilities needed to
> >> compute the selectivity. Or am I missing something?
> 
> > Well, I'm not real familiar with contingency tables, but it seems like
> > you could end up needing to store a huge amount of data to get any
> > benefit out of it, in some cases.
> 
> The reason that this wasn't done years ago is precisely that nobody's
> figured out how to do it with a tolerable amount of stats data and a
> tolerable amount of processing time (both at ANALYZE time and during
> query planning).  It's not hard to see what we'd ideally like to do;
> it's getting from there to something useful in production that's hard.

I think we have to face up to the fact that attempting to derive
meaningful cross-column stats will require larger sample sizes.

If we collect everything we're going to have ~10^9 stats slots with
default stats_target 100 and a 100 column table.

We should disconnect sample size from histogram size, and we need to
make the initial column pairings vastly fewer than all combinations.
Manual specification seems like it will be required for the cases where
we decide not to include it automatically, so it seems we'll need manual
specification anyway. In that case, we should do manual specification
first.

-- 
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: proposal : cross-column stats

From
Tomas Vondra
Date:
Dne 19.12.2010 21:21, Simon Riggs napsal(a):
> On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>>> The proposed solution is based on contingency tables, built for selected
>>>> groups of columns (not for each possible group). And the contingency
>>>> table gives you the ability to estimate the probabilities needed to
>>>> compute the selectivity. Or am I missing something?
>>
>>> Well, I'm not real familiar with contingency tables, but it seems like
>>> you could end up needing to store a huge amount of data to get any
>>> benefit out of it, in some cases.
>>
>> The reason that this wasn't done years ago is precisely that nobody's
>> figured out how to do it with a tolerable amount of stats data and a
>> tolerable amount of processing time (both at ANALYZE time and during
>> query planning).  It's not hard to see what we'd ideally like to do;
>> it's getting from there to something useful in production that's hard.
> 
> I think we have to face up to the fact that attempting to derive
> meaningful cross-column stats will require larger sample sizes.

Amen.

> If we collect everything we're going to have ~10^9 stats slots with
> default stats_target 100 and a 100 column table.
>
> We should disconnect sample size from histogram size, and we need to
> make the initial column pairings vastly fewer than all combinations.
> Manual specification seems like it will be required for the cases where
> we decide not to include it automatically, so it seems we'll need manual
> specification anyway. In that case, we should do manual specification
> first.

Well, not really. The more bins you have, the larger sample you need to
get a representative picture of the data. So the histogram size and the
sample size are naturally connected.

And there are some (mostly heuristic) rules to determine how large the
sample should be. E.g. when building a contingency table for a
chi-squared test, a common rule is that each bin should contain at least
5 values. So the more bins you have, the larger sample you need.

I like the way Oracle does this - you can either let it decide what the
proper sample size is, or you can specify how large the sample should
be (what portion of the table).

So the tricky part here is determining the number of bins in the
histogram. In the one-dimensional case, stats_target=100 actually means
each bin contains about 1% of data. So I guess we should use a similar
approach in the multi-dimensional case too, i.e. let the user determine
a desired precision and then derive the number of bins automatically
(which is a bit tricky because of the multiple dimensions).

But yeah, in general you're right - this will require larger samples,
more complex stats to collect, we'll need some space to store the
collected stats ...

regards
Tomas


Re: proposal : cross-column stats

From
Florian Pflug
Date:
On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
> It seems to me you're missing one very important thing - this was not
> meant as a new default way to do estimates. It was meant as an option
> when the user (DBA, developer, ...) realizes the current solution gives
> really bad estimates (due to correlation). In that case he could create
> 'cross-column' statistics on those columns, and the optimizer would use
> that info to do the estimates.

I do understand that. I just have the nagging feeling that there is a
way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense
to apply the uniform bayesian approach or to assume the columns are
unrelated.

I played with this for a bit over the weekend, but unfortunately ran out
of time. So I'm writing up what I found, to prevent it from getting lost.

I tried to pick up Robert's idea of quantifying "Implicativeness" -
i.e., finding a number between 0 and 1 that describes how close the
(A,B) are to representing a function A -> B.

Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the
estimates of dist(?) are consistent. From that you easily get
  dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and
  dist(A,B)/dist(A) <= dist(B) <= dist(A,B)

If dist(A) == dist(A,B), then there is a functional dependency
A -> B, and conversely if dist(B) == dist(A,B) there is a functional
dependency B -> A. Note that you can have both at the same time!

On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the
smallest number of distinct values possible for a given combination
of dist(A,B) and dist(A). This is the anti-function case.

This motivates the definition
 F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ]

(You can probably drop the "-1", it doesn't make much of a difference
for larger values of dist(B).)

F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
a value of 1 indicates that dist(A) == dist(A,B).

So F(A,B) is a suitable measure of "Implicativeness" - it's higher
if the table (A,B) looks more like a function A -> B.
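Computing F is cheap once the three ndistinct values are available - e.g.
for Tomas' ZIP code data something like this (exact counts used here just
to illustrate, an estimator would plug in its ndistinct estimates):

  SELECT (da * db - dab) / (dab * (db - 1)) AS f_zip_implies_state,
         (da * db - dab) / (dab * (da - 1)) AS f_state_implies_zip
    FROM (SELECT count(DISTINCT zip_code)::float8          AS da,
                 count(DISTINCT state)::float8             AS db,
                 count(DISTINCT (zip_code, state))::float8 AS dab
            FROM zip_codes_usa) x;
  -- the first value should come out close to 1, the second close to 0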

You might use F(A,B) to decide if either A->B or B->A looks function-like
enough to use the uniform bayesian approach. Or you might even go further,
and decide *which* bayesian formula to use - the paper you cited always
averages

  P(A=x|B=y)*P(B=y) and
  P(B=y|A=x)*P(A=x)

but they offer no convincing reason for that other than "We don't know
which to pick".

I'd like to find a statistical explanation for that definition of
F(A,B), but so far I couldn't come up with any. I created a Maple 14
worksheet while playing around with this - if you happen to have a
copy of Maple available I'd be happy to send it to you.

This is what I got so far - I hope it may prove to be of use somehow.

best regards,
Florian Pflug



Re: proposal : cross-column stats

From
tv@fuzzy.cz
Date:
> On Dec18, 2010, at 17:59 , Tomas Vondra wrote:
>> It seems to me you're missing one very important thing - this was not
>> meant as a new default way to do estimates. It was meant as an option
>> when the user (DBA, developer, ...) realizes the current solution gives
>> really bad estimates (due to correlation). In that case he could create
>> 'cross-column' statistics on those columns, and the optimizer would use
>> that info to do the estimates.
>
> I do understand that. I just have the nagging feeling that there is a
> way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense
> to apply the uniform bayesian approach or to assume the columns are
> unrelated.

I doubt there is a way to make this decision with just the dist(A),
dist(B) and dist(A,B) values. Well, we could go with a rule

  if [dist(A) == dist(A,B)] then [A => B]

but that's very fragile. Think about estimates (we're not going to work
with exact values of dist(?)), and then about data errors (e.g. a city
matched to an incorrect ZIP code or something like that).

So for a real-world dataset, the condition [dist(A)==dist(A,B)] will
almost never hold. And about the same holds for the "uniform correlation"
assumption which is the basis for the formula I posted.

So actually we're looking for a formula that does reasonable estimates and
is robust even in cases where the correlation is not uniform or the
estimates are somewhat imprecise.

> This motivates the definition
>
> F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [dist(A,B) * ( dist(B) - 1)]
>
> (You can probably drop the "-1", it doesn't make much of a difference
> for larger values of dist(B).
>
> F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
> dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
> a value of 1 indicates that dist(A) == dist(A,B).
>
> So F(A,B) is a suitable measure of "Implicativeness" - it's higher
> if the table (A,B) looks more like a function A -> B.
>
> You might use that to decide if either A->B or B->a looks function-like
> enough to use the uniform bayesian approach. Or you might even go further,
> and decide *with* bayesian formula to use - the paper you cited always
> averages
>
>   P(A=x|B=y)*P(B=y) and
>   P(B=y|A=x)*P(A=x)
>
> but they offer no convincing reason for that other than "We don't know
> which to pick".

Well, the reason why they chose the sum/2 approach is they were unable to
infer which of the values is 'better' and the sum/2 limits the errors.

I haven't studied this thoroughly, but my impression is that you are going
in the same direction as they did, i.e. while they've done

  P(A,B) = (P(A|B)*P(B) + P(B|A)*P(A)) / 2

with P(A|B) = dist(B) / dist(A,B), you've chosen P(A|B) ~ F(B,A) or
something like that.

You'll probably object that you could compute F(A,B) and F(B,A) and then
use only the part corresponding to the larger value, but what if the
F(A,B) and F(B,A) are about the same?

This is the reason why they choose to always combine the values (with
varying weights).

> I'd like to find a statistical explanation for that definition of
> F(A,B), but so far I couldn't come up with any. I created a Maple 14
> worksheet while playing around with this - if you happen to have a
> copy of Maple available I'd be happy to send it to you..

No, I don't have Maple. Have you tried Maxima
(http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/). Sage
even has an online notebook - that seems like a very comfortable way to
exchange this kind of data.

regards
Tomas



Re: proposal : cross-column stats

From
Robert Haas
Date:
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug <fgp@phlo.org> wrote:
> I tried to pick up Robert's idea of quantifying "Implicativeness" -
> i.e., finding a number between 0 and 1 that describes how close the
> (A,B) are to representing a function A -> B.

Actually Heikki's idea...

> Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the
> estimates of dist(?) are consistent. From that you easily get
>
>  dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and
>  dist(A,B)/dist(A) <= dist(B) <= dist(A,B)
>
> If dist(A) == dist(A,B), then there is a functional dependency
> A -> B, and conversely if dist(B) == dist(A,B) there is a functional
> dependency B -> A. Note that you can have both at the same time!
>
> On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the
> smallest number of distinct values possible for a given combination
> of dist(A,B) and dist(A). This is the anti-function case.
>
> This motivates the definition
>
>  F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ]
>
> (You can probably drop the "-1", it doesn't make much of a difference
> for larger values of dist(B).
>
> F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and
> dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while
> a value of 1 indicates that dist(A) == dist(A,B).
>
> So F(A,B) is a suitable measure of "Implicativeness" - it's higher
> if the table (A,B) looks more like a function A -> B.
>
> You might use that to decide if either A->B or B->a looks function-like
> enough to use the uniform bayesian approach. Or you might even go further,
> and decide *with* bayesian formula to use - the paper you cited always
> averages
>
>  P(A=x|B=y)*P(B=y) and
>  P(B=y|A=x)*P(A=x)
>
> but they offer no convincing reason for that other than "We don't know
> which to pick".

Ideally you want to somehow make this a continuous transition between
the available formulas rather than a discrete transition, I think.  If
F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and
if it's 0, then it's P(A=x)*P(B=y).  But suppose F(A,B)=0.5.  Then
what?  A naive approach would be to estimate P(A=x && B=y) = P(A=x) *
(1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and
P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1
we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9)
= 0.055.  Of course I'm just hand-waving here, and this is without any
mathematical basis, being just the simplest formula I could think of
that gets the endpoints right and plots some sort of smooth curve
between them in the middle.  A similar formula with a believable
argument to back it up seems like it would be a big step forward for
this method.
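Just to spell out that arithmetic (nothing beyond the numbers above):

  -- P(A=x) = P(B=y) = 0.1, blended estimate for F(A,B) = 0, 0.5, 1
  SELECT f, 0.1 * (1 - (1 - f) * (1 - 0.1)) AS estimate
    FROM (VALUES (0.0), (0.5), (1.0)) v(f);
  -- f = 0.0 -> 0.010, f = 0.5 -> 0.055, f = 1.0 -> 0.100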

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: proposal : cross-column stats

From
tv@fuzzy.cz
Date:
> On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug <fgp@phlo.org> wrote:
>> You might use that to decide if either A->B or B->a looks function-like
>> enough to use the uniform bayesian approach. Or you might even go
>> further,
>> and decide *with* bayesian formula to use - the paper you cited always
>> averages
>>
>>  P(A=x|B=y)*P(B=y) and
>>  P(B=y|A=x)*P(A=x)
>>
>> but they offer no convincing reason for that other than "We don't know
>> which to pick".
>
> Ideally you want to somehow make this a continuous transaition between
> the available formulas rather than a discrete transition, I think.  If
> F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and
> if it's 0, then it's P(A=x)*P(B=y).  But suppose F(A,B)=0.5.  Then
> what?  A naive approach would be to estimate P(A=x && B=y) = P(A=x) *
> (1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and
> P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1
> we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9)
> = 0.055.  Of course I'm just hand-waving here, and this is without any
> mathematical basis, being just the simplest formula I could think of
> that gets the endpoints right and plots some sort of smooth curve
> between them in the middle.  A similar formula with a believable
> argument to back it up seems like it would be a big step forward for
> this method.

This somehow reminds me of how the various t-norms in fuzzy logic evolved.
I'm not saying we should use fuzzy logic here, but the requirements are
very similar so it might be an interesting inspiration. See for example
this http://plato.stanford.edu/entries/logic-fuzzy (chapter 4).

And there's one additional - IMHO very important - requirement. The whole
thing should easily extend to more than two columns. This "IF (F(A,B) >
F(B,A)) THEN ..." probably is not a good solution regarding this.

For example given 3 columns A,B,C, would you do that comparison for each
pair of columns, or would you do that for A vs (B,C)? Or maybe a
completely different approach? Because that would require collecting a lot
more data (number of distinct values in each combination) etc.

Say, for example, there is a table with (C=A+B)

  A | B | C
  ===========
  1 | 1 | 2
  1 | 2 | 3
  1 | 3 | 4
  2 | 1 | 3
  2 | 2 | 4
  2 | 3 | 5
  3 | 1 | 4
  3 | 2 | 5
  3 | 3 | 6

So that dist(A)=dist(B)=3, dist(C)=5 and dist(A,B,C)=dist(A,B)=9. Given
the paper, you get something like

  P(A,B,C) = [dist(A)*P(A) + dist(B)*P(B) + dist(C)*P(C)] / [3*dist(A,B,C)]
           = [3*P(A) + 3*P(B) + 5*P(C)] / 27

so for example

  P(A=3,B=2,C=5) = [3*1/3 + 3*1/3 + 5*2/9] / 27 = (28/9)/27 = 28/243

which is almost correct (the exact value is 1/9 = 27/243).
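The toy table is trivial to generate if anyone wants to double-check the
distinct counts:

  CREATE TABLE abc_test AS
    SELECT a, b, a + b AS c
      FROM generate_series(1, 3) a, generate_series(1, 3) b;

  SELECT count(DISTINCT a), count(DISTINCT b), count(DISTINCT c),
         count(DISTINCT (a, b, c))
    FROM abc_test;
  -- gives 3, 3, 5, 9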

Don't get me wrong - I'm not a fanatic who thinks this particular formula
is the best formula possible. I'm just saying we could end up with a
formula that works beautifully in 2D, but once we get to 3 columns it
fails miserably.

Hmmm, maybe we could give this possibility (to identify two separate
groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
user would say 'build stats for (A,B) and (C)' - this actually represents
apriori knowledge of dependencies supplied by the user.

In that case we could search for 'implicativeness' between those two
groups (and not within the groups), and we could restrict ourselves to 2D
(and thus use a more sophisticated formula).

But we should be able to do some basic analysis even when the user
supplies a list of columns without such apriori knowledge.

regards
Tomas



Re: proposal : cross-column stats

From
Florian Pflug
Date:
On Dec21, 2010, at 11:37 , tv@fuzzy.cz wrote:
> I doubt there is a way to this decision with just dist(A), dist(B) and
> dist(A,B) values. Well, we could go with a rule
> 
>  if [dist(A) == dist(A,B)] the [A => B]
> 
> but that's very fragile. Think about estimates (we're not going to work
> with exact values of dist(?)), and then about data errors (e.g. a city
> matched to an incorrect ZIP code or something like that).

Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
kind of fragility without penalizing the non-correlated case...

> This is the reason why they choose to always combine the values (with
> varying weights).

There are no varying weights involved there. What they do is to express
P(A=x,B=y) once as
  P(A=x,B=y) = P(B=y|A=x)*P(A=x), and then as
  P(A=x,B=y) = P(A=x|B=y)*P(B=y).

Then they assume
  P(B=y|A=x) ~= dist(A)/dist(A,B) and
  P(A=x|B=y) ~= dist(B)/dist(A,B),

and go on to average the two different ways of computing P(A=x,B=y), which
finally gives
  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
              = dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
              = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

That averaging step adds *no* further data-dependent weights.

>> I'd like to find a statistical explanation for that definition of
>> F(A,B), but so far I couldn't come up with any. I created a Maple 14
>> worksheet while playing around with this - if you happen to have a
>> copy of Maple available I'd be happy to send it to you..
> 
> No, I don't have Maple. Have you tried Maxima
> (http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/). Sage
> even has an online notebook - that seems like a very comfortable way to
> exchange this kind of data.

I haven't tried them, but I will. That java-based GUI of Maple is driving
me nuts anyway... Thanks for the pointers!

best regards,
Florian Pflug




Re: proposal : cross-column stats

From
tv@fuzzy.cz
Date:
> On Dec21, 2010, at 11:37 , tv@fuzzy.cz wrote:
>> I doubt there is a way to this decision with just dist(A), dist(B) and
>> dist(A,B) values. Well, we could go with a rule
>>
>>  if [dist(A) == dist(A,B)] the [A => B]
>>
>> but that's very fragile. Think about estimates (we're not going to work
>> with exact values of dist(?)), and then about data errors (e.g. a city
>> matched to an incorrect ZIP code or something like that).
>
> Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
> kind of fragility without penalizing the non-correlated case...

Yes, I understand the intention, but I'm not sure how exactly you want
to use the F(?,?) function to compute the P(A,B) - which is the value
we're looking for.

If I understand it correctly, you proposed something like this
  IF (F(A,B) > F(B,A)) THEN
    P(A,B) := c*P(A);
  ELSE
    P(A,B) := d*P(B);
  END IF;

or something like that (I guess c=dist(A)/dist(A,B) and
d=dist(B)/dist(A,B)). But what if F(A,B)=0.6 and F(B,A)=0.59? This may
easily happen due to data errors / imprecise estimate.

And this actually matters, because P(A) and P(B) may be significantly
different. So this would be really vulnerable to slight
changes in the estimates etc.

>> This is the reason why they choose to always combine the values (with
>> varying weights).
>
> There are no varying weights involved there. What they do is to express
> P(A=x,B=y) once as
>
> ...
>
>   P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>               = dist(A)*P(A=x)/(2*dist(A,B)) +
> dist(B)*P(B=x)/(2*dist(A,B))
>               = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>
> That averaging steps add *no* further data-dependent weights.

Sorry, by 'varying weights' I didn't mean that the weights are different
for each value of A or B. What I meant is that they combine the values
with different weights (just as you explained).

regards
Tomas



Re: proposal : cross-column stats

From
Florian Pflug
Date:
On Dec21, 2010, at 15:51 , tv@fuzzy.cz wrote:
>>> This is the reason why they choose to always combine the values (with
>>> varying weights).
>> 
>> There are no varying weights involved there. What they do is to express
>> P(A=x,B=y) once as
>> 
>> ...
>> 
>>  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>>              = dist(A)*P(A=x)/(2*dist(A,B)) +
>> dist(B)*P(B=x)/(2*dist(A,B))
>>              = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>> 
>> That averaging steps add *no* further data-dependent weights.
> 
> Sorry, by 'varying weights' I didn't mean that the weights are different
> for each value of A or B. What I meant is that they combine the values
> with different weights (just as you explained).

I'm still not sure we're on the same page here. The resulting formula
is *not* a weighted average of P(A=x) and P(B=y), since in general
dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
syntactically, but that's about it.

The resulting formula instead is an *unweighted* (weights 1) average of
the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just
as well estimate P(A=x,B=y) with
 P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and it'd *still* be the very same uniform bayesian approach, just no
longer symmetric in A and B. Which may easily be preferable if you
have reasons to believe that this estimate is more correct than the
one obtained by swapping A and B. The original paper doesn't deal with
that case simply because they don't mention how P(A=x) and P(B=y)
are obtained at all. The postgres estimator, on the other hand,
knows quite well how it derived P(A=x) and P(B=y) and may have much
higher confidence in one value than in the other.

Assume for example that you're preparing the statement
 SELECT * FROM T WHERE A = ? AND B = 1

We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
without an actual value for the parameter "?". The estimate for P(B=1),
on the other hand, can use the histogram, and will thus very likely be
much more precise. The two estimates for P(A=?,B=1) in this case are
  P(A=?|B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
  P(B=1|A=?)*P(A=?) = dist(A)*P(A=?)/dist(A,B).

There's a good chance that the former beats the latter, and thus also
the average, then.

best regards,
Florian Pflug



Re: proposal : cross-column stats

From
Florian Pflug
Date:
On Dec21, 2010, at 13:25 , tv@fuzzy.cz wrote:
> And there's one additional - IMHO very important - requirement. The whole
> thing should easily extend to more than two columns. This "IF (F(A,B) >
> F(B,A)) THEN ..." probably is not a good solution regarding this.
>
> For example given 3 columns A,B,C, would you do that comparison for each
> pair of columns, or would you do that for A vs (B,C)? Or maybe a
> completely different approach? Because that would require to collect a lot
> more data (number of distinct values in each combination) etc.

That's certainly a valid concern. The uniform bayesian approach avoids that
quite beautifully...

> Hmmm, maybe we could give this possibility (to identify two separate
> groups of columns) to the user. So instead of 'buid stats for (A,B,C)' the
> user would say 'build stats for (A,B) and (C)' - this actually represents
> apriori knowledge of dependencies supplied by the user.
>
> In that case we could search for 'implicativeness' between those two
> groups (and not within the groups), and we could restrict ourselves to 2D
> (and thus use a more sophisticated formula).

Hm, I hated this idea at first, but I'm starting to like it more and more.
It *does* seem rather unrealistic that a user would know that a bunch of
columns are correlated, but have no idea in what way...

Any examples where this'd be the case would be very much appreciated -
maybe we should ask around on -general about this?

> But we should be able to do some basic analysis even when the user
> supplies a list of columns without such apriori knowledge.

That, I think, overcomplicates things, at least for a first cut.

To summarize, I think you've shown quite nicely that the uniform bayesian
approach is a very sensible first step towards better estimates in the case
of correlated columns. It's statistically sound, and the dist(A,B) estimates
it requires are probably a necessary ingredient of any solution to the problem.
If we can make it degrade more gracefully if the columns are uncorrelated we
should do that, but if we can't that's still no reason to drop the whole idea.

So I guess we should turn our attention to how we'd obtain reasonably good estimates
of dist(A,B), and return to the current discussion once the other pieces are in place.

I think that maybe it'd be acceptable to scan a large portion of the
table to estimate dist(A,B), *if* we only do so once in a while. But even with
a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
and spilling them into, say, an on-disk hash table adds even more overhead to the already
expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
the spilling to disk, in exchange for a slightly less precise result...
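As a strawman, the 'scan a large portion of the table' variant could be as
dumb as an exact count over a random sample of a (hypothetical) table t -
the hard part is scaling the sampled count back up to the whole table,
which is exactly where the usual ndistinct troubles come from:

  -- exact ndistinct of (a,b) over roughly 10% of the rows
  SELECT count(DISTINCT (a, b)) AS sample_ndistinct_ab
    FROM t
   WHERE random() < 0.10;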

best regards,
Florian Pflug



Re: proposal : cross-column stats

From
tv@fuzzy.cz
Date:
> On Dec21, 2010, at 15:51 , tv@fuzzy.cz wrote:
>>>> This is the reason why they choose to always combine the values (with
>>>> varying weights).
>>>
>>> There are no varying weights involved there. What they do is to express
>>> P(A=x,B=y) once as
>>>
>>> ...
>>>
>>>  P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
>>>              = dist(A)*P(A=x)/(2*dist(A,B)) +
>>> dist(B)*P(B=x)/(2*dist(A,B))
>>>              = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>>>
>>> That averaging steps add *no* further data-dependent weights.
>>
>> Sorry, by 'varying weights' I didn't mean that the weights are different
>> for each value of A or B. What I meant is that they combine the values
>> with different weights (just as you explained).
>
> I'm still not sure we're on the same page here. The resulting formula
> is *not* a weighted average of P(A=x) and P(B=y), since in general
> dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one
> syntactically, but that's about it.

OK, another crazy usage of 'weights' on my side :-(

What I meant is that in the end you have two equations of P(A,B):
  P(A=x|B=y)*P(B=y) = dist(B)*P(B=y)/dist(A,B)
  P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and you need to combine those two estimates. They did that by averaging,
as they don't know which of the estimates is better.

Generally I think that is a good solution, unless you know one of the
estimates is much more reliable (although I'm not sure we should
completely omit the other estimate).

> The resulting formula instead is an *unweighted* (weights 1) average of
> the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just
> as well estimate P(A=x,B=y) with
>
>   P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)
>
> and it's *still* be the very same uniform bayesian approach, just no
> longer symmetric in A and B. Which may easily be preferable if you
> have reasons to believe that this estimate is more correct than the
> one obtained by swapping A and B. The original paper doesn't deal with
> that case simply because they don't mention how P(A=x) and P(B=y)
> are obtained at all. The postgres estimator, on the other hand,
> knows quite well how it derived P(A=x) and P(B=y) and may have much
> higher confidence in one value than in the other.

OK, good point. I hadn't realized that one of the estimates may be much
more reliable.

But let's assume both estimates are about the same (regarding reliability)
and let's see the following example
  A | B
  =====
  1 | 1
  1 | 1
  1 | 1
  1 | 2
  2 | 1
  2 | 2
  2 | 2
  2 | 2

Thus dist(A)=dist(B)=2, dist(A,B)=4 and
  P(A=1)=P(A=2)=P(B=1)=P(B=2)=1/2
  P(A=1,B=1)=P(A=2,B=2)=3/8
  P(A=1,B=2)=P(A=2,B=1)=1/8

According to the formula presented in the paper, the partial estimates for
P(A=1,B=2) are
  P(A=1|B=2)*P(B=2) = dist(B)/dist(A,B) * P(B=2) = 2/4 * 1/2 = 1/4
  P(B=2|A=1)*P(A=1) = dist(A)/dist(A,B) * P(A=1) = 2/4 * 1/2 = 1/4

Thus P(A=1,B=2) = (1/4 + 1/4)/2 = 1/4, so it's overestimated (2x)
  A | B
  =====
  1 | 1
  1 | 2
  1 | 2
  1 | 2
  2 | 1
  2 | 1
  2 | 1
  2 | 2

This obviously has exactly the same features (regarding number of distinct
values), and the produced estimate is exactly the same. But in this case
  P(A=1,B=2)=P(A=2,B=1)=3/8
  P(A=1,B=1)=P(A=2,B=2)=1/8

and thus the 1/4 is an underestimate (compared to 3/8).

The problem is the F(A,B) does not change at all. It's very simple to
construct examples (just use more rows) where F(A,B) returns exactly the
same value, but the estimates are off. The averaging somehow smooths
this out ...

But I think I'm missing something about how to use the F(?,?) to derive
the final estimate. So maybe the resulting estimate would be better.

Say there are two tables
   A | B | number of such rows
  ==============================
   1 | 1 | 1000
   1 | 2 | 1000
   2 | 1 | 1000
   2 | 2 | 1000

   A | B | number of such rows
  ==============================
   1 | 1 | 1
   1 | 2 | 1999
   2 | 1 | 1999
   2 | 2 | 1

How would you estimate the P(A=1,B=1) in those cases? Assume that both
estimates are equally reliable - i.e. deduced from a histogram or MCV.
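Both tables are easy to generate, if it helps (a quick sketch, names are
arbitrary):

  -- table 1: all four combinations equally frequent (1000 rows each)
  CREATE TABLE t1 AS
    SELECT 1 + i / 2000 AS a, 1 + i % 2 AS b
      FROM generate_series(0, 3999) s(i);

  -- table 2: same dist(A), dist(B), dist(A,B) and per-column selectivities,
  -- completely different joint distribution
  CREATE TABLE t2 (a int, b int);
  INSERT INTO t2 SELECT 1, 1;
  INSERT INTO t2 SELECT 1, 2 FROM generate_series(1, 1999);
  INSERT INTO t2 SELECT 2, 1 FROM generate_series(1, 1999);
  INSERT INTO t2 SELECT 2, 2;

(With identical dist and sel inputs any formula based only on those will
produce the same answer for both - while the actual P(A=1,B=1) is 1/4 in
the first table and 1/4000 in the second.)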

>
> Assume for example that you're preparing the statement
>
>   SELECT * FROM T WHERE A = ? AND B = 1
>
> We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better
> without an actual value for the parameter "?". The estimate for P(B=1),
> on the other hand, can use the histogram, and will thus very likely be
> much more precise. The two estimates for P(A=?,B=1) in this case are
>
>   P(A=?|B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
>   P(B=1|A=?)*P(A=?) = dist(A)*P(A=?)/dist(A,B).
>
> There's a good chance that the former beats the latter, and thus also
> the average, then.

OK, good point. I was not thinking about prepared statements. In this case
it makes sense to use only one of the estimates ...

regards
Tomas



Re: proposal : cross-column stats

From
Tomas Vondra
Date:
On 21.12.2010 16:54, Florian Pflug wrote:
>> Hmmm, maybe we could give this possibility (to identify two separate
>> groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
>> user would say 'build stats for (A,B) and (C)' - this actually represents
>> apriori knowledge of dependencies supplied by the user.
>>
>> In that case we could search for 'implicativeness' between those two
>> groups (and not within the groups), and we could restrict ourselves to 2D
>> (and thus use a more sophisticated formula).
> 
> Hm, I hated this idea at first, but I'm starting to like it more and more.
> It *does* seem rather unrealistic that a user would know that a bunch of
> columns are correlated, but have no idea in what way... 

Yes, that's true. Although sometimes the dependency may be very
complicated - but let's restrict to 2D for now, build something that
solves this simplified case and then we can discuss higher dimensions.

> Any examples where this'd be the case would be very much appreciated - maybe
> we should ask around on -general about this?

Well, I think the ZIP code example is a typical case of this - the users
know about the dependency between ZIP codes and cities. A natural
workaround would be to omit the dependent column from the query, but
that's not always possible (e.g. when an ORM is involved, building the
queries automatically).

>> But we should be able to do some basic analysis even when the user
>> supplies a list of columns without such apriori knowledge.
> 
> That, I think, overcomplicates things, at least for a first cut.
> 
> To summarize, I think you've shown quite nicely that the uniform bayesian
> approach is a very sensible first step towards better estimates in the case
> of correlated columns. It's statistically sound, and the dist(A,B) estimates
> it requires are probably a necessary ingredient of any solution to the problem.
> If we can make it degrade more gracefully if the columns are uncorrelated we
> should do that, but if we can't that's still no reason to drop the whole idea.

Agreed. IMHO the uncorrelated case is not a big concern, as the users
usually know the columns are somehow related before asking for such stats.
We should eventually introduce some kind of 'autodetect', but let's leave
that for the future.

> So I guess we should turn our attention to how we'd obtain reasonably good estimates
> of dist(A,B), and return to the current discussion once the other pieces are in place.
> 
> I think that maybe it'd be acceptable to scan a large portion of the
> table to estimate dist(A,B), *if* we must only do so only once in a while. But even with
> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
> and spilling them into, say, an on-disk hash table adds even more overhead to the already
> expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
> the spilling to disk, in exchange for a slightly less precise result...

I have no idea what a Bloom filter is (shame on me). I was not thinking
about how to collect the stats yet - I was interested primarily in what
data we actually need. And my knowledge of the algorithms currently used
is very limited :-(

But I agree we should at least discuss the possible solutions. Until now
I've done something like this
  SELECT COUNT(DISTINCT a) AS dist_a,
         COUNT(DISTINCT b) AS dist_b,
         COUNT(DISTINCT a || ':' || b) AS dist_ab
    FROM my_table;

but that's not very efficient.
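
The string concatenation is also a bit fragile (NULLs, the delimiter
appearing in the data), so counting distinct row values might be a safer
variant of the same full-scan query - just a sketch:

  SELECT COUNT(DISTINCT a)      AS dist_a,
         COUNT(DISTINCT b)      AS dist_b,
         COUNT(DISTINCT (a, b)) AS dist_ab
    FROM my_table;

It still has to read the whole table, of course.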

My plan for the near future (a few weeks) is to build a simple 'module'
with the ability to estimate the number of rows matching a given condition.
This could actually be quite useful as a stand-alone contrib module, as
users often ask how to get an (approximate) row count fast, usually for
paging.

An exact count may be quite slow when the condition matches many rows,
even when there is an index - often much slower than the actual query
(which usually has a small LIMIT).

An estimate is often sufficient, but pg_class.reltuples does not help
once a condition is involved. So this might be an improvement ...
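
FWIW, the planner's own estimate can already be fished out of EXPLAIN, which
is roughly the kind of number I mean - a crude sketch, the function name is
made up:

  CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
  DECLARE
      rec record;
      cnt bigint;
  BEGIN
      -- run EXPLAIN on the query and pick up the top-level row estimate
      FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
          cnt := substring(rec."QUERY PLAN" FROM ' rows=([0-9]+)');
          EXIT WHEN cnt IS NOT NULL;
      END LOOP;
      RETURN cnt;
  END;
  $$ LANGUAGE plpgsql;

  -- e.g. SELECT count_estimate('SELECT * FROM t WHERE a = 1 AND b = 2');

Such an estimate is of course only as good as the statistics behind it,
which is exactly where cross-column stats would help.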

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
On 21.12.2010 16:54, Florian Pflug wrote:
> I think that maybe it'd be acceptable to scan a large portion of the
> table to estimate dist(A,B), *if* we must only do so only once in a while. But even with
> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
> and spilling them into, say, an on-disk hash table adds even more overhead to the already
> expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
> the spilling to disk, in exchange for a slightly less precise result...

I've read some basics about Bloom filters, and I'm not sure we can use
it in this case. I see two problems with this approach:

1) impossibility to predict the number of elements in advance
  To build a Bloom filter with limited error rate, you need to size it
  properly (number of hash functions and size of the bit array). But
  that's exactly the information we're looking for.

  I guess we could use the highest possible value (equal to the number
  of tuples) - according to wiki you need about 10 bits per element
  with 1% error, i.e. about 10MB of memory for each million of
  elements.

  That's not much, but this needs to be done for each column separately
  and for the combination - for N columns you need (at least) N+1
  filters.

  Sure - increasing the error rate and using a different estimate to
  build the Bloom filter would significantly decrease the memory
  requirements.
 

2) sampling the whole table
  A naive approach would be to sample the whole table each time, but
  for large tables that might be a bit of a problem. So I'm thinking
  about what alternatives there are ...

  One possibility is to build the Bloom filter incrementally and store
  it on disk, or something like that. I guess this could be done during
  VACUUM, for example. Anyway there's an issue with how to set the
  filter size initially (estimate the number of elements), just as in
  the previous section.

  Another possibility is to collect the data from just a small portion
  of a table and then use the result to estimate the number of distinct
  values for the whole table. But I'm not sure we can do this reliably -
  I see many traps in this.

But maybe I'm just missing something important.

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
On 21.12.2010 19:03, Tomas Vondra wrote:
> My plan for the near future (a few weeks) is to build a simple 'module'
> with the ability to estimate the number of rows for a given condition.
> This could actually be quite useful as a stand-alone contrib module, as
> the users often ask how to get a number of rows fast (usually for paging).

Hm, I've been thinking about where to place this new module - I thought
pgfoundry might be the right place, but I've noticed it supports just
CVS. Well, that's a rather antiquated SCM I think - I'm very happy with the
comfort provided by Git or SVN.

Is there some other place commonly used for pg-related projects? I've
been thinking about github or something like that ...

And I've finally set up the wiki-page about this effort - it's just a
summary of this whole thread.

http://wiki.postgresql.org/wiki/Cross_Columns_Stats

regards
Tomas


Re: proposal : cross-column stats

From
Florian Pflug
Date:
On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
> Dne 21.12.2010 16:54, Florian Pflug napsal(a):
>> I think that maybe it'd be acceptable to scan a large portion of the
>> table to estimate dist(A,B), *if* we must only do so only once in a while. But even with
>> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
>> and spilling them into, say, an on-disk hash table adds even more overhead to the already
>> expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
>> the spilling to disk, in exchange for a slightly less precise result...
>
> I've read some basics about a Bloom filters, and I'm not sure we can use
> it in this case. I see two problems with this approach:
>
> 1) impossibility to predict the number of elements in advance
>
>   To build a Bloom filter with limited error rate, you need to size it
>   properly (number of hash function and size of the bit array). But
>   that's exactly the the information we're looking for.
>
>   I guess we could use the highest possible value (equal to the number
>   of tuples) - according to wiki you need about 10 bits per element
>   with 1% error, i.e. about 10MB of memory for each million of
>   elements.
Drat. I had expected these numbers to come out quite a bit lower than
that, at least for a higher error target. But even with 10% false
positive rate, it's still 4.5MB per 1e6 elements. Still too much to
assume the filter will always fit into memory, I fear :-(

> 2) sampling the whole table
>
>   A naive approach would be to sample the whole table each time, but
>   for large tables that might be a bit of a problem. So I'm thinking
>   about what alternatives are there ..


>   One possibility is to build the Bloom filter incrementally and store
>   it on disk, or something like that. I guess this could be done
>   during VACUUM or something like that. Anyway there's an issue how to
>   set the filter size initially (estimate the number of elements),
>   just as in the previous section.
The filter size could be derived from the table's statistics target, or
be otherwise user-definable. We could also auto-resize once it gets too
full. But still, that all seems awfully complex :-(

>   Another possibility is to collect the data from just a small portion
>   of a table and then use the result to estimate the number of distinct
>   values for the whole table. But I'm not sure we can do this reliably,
>   I see many traps in this.
This is how it works currently. The problem with this approach is that
it gives you very few guarantees about how precise the result will be.
Extrapolating works very well for things like MCVs and histograms, because
there you're by definition interested mostly in values which occur often -
and thus with a high probability in the relative few rows you sample. For
the number of distinct values, however, this isn't true - if ndistinct
is an order of magnitude smaller than the number of rows, relatively few
rows can account for a large percentage of the distinct values...

Another idea would be to obtain the ndistinct values from an index somehow.
Postgres cannot currently scan an index in physical order, only in logical
order, due to locking considerations. But since we'd only be interested in
an estimate, maybe a scan in physical block order would work for ndistinct
estimates? Just a wild idea, mind you, I haven't checked at all if that'd
be even remotely feasible.

best regards,
Florian Pflug




Re: proposal : cross-column stats

From
Nicolas Barbier
Date:
2010/12/24 Florian Pflug <fgp@phlo.org>:

> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>
>>   I guess we could use the highest possible value (equal to the number
>>   of tuples) - according to wiki you need about 10 bits per element
>>   with 1% error, i.e. about 10MB of memory for each million of
>>   elements.
>
> Drat. I had expected these number to come out quite a bit lower than
> that, at least for a higher error target. But even with 10% false
> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
> assume the filter will always fit into memory, I fear :-(

I have the impression that both of you are forgetting that there are 8
bits in a byte. 10 bits per element = 1.25MB per million elements.

Nicolas


Re: proposal : cross-column stats

From
tv@fuzzy.cz
Date:
> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>
>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>
>>>   I guess we could use the highest possible value (equal to the number
>>>   of tuples) - according to wiki you need about 10 bits per element
>>>   with 1% error, i.e. about 10MB of memory for each million of
>>>   elements.
>>
>> Drat. I had expected these number to come out quite a bit lower than
>> that, at least for a higher error target. But even with 10% false
>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>> assume the filter will always fit into memory, I fear :-(
>
> I have the impression that both of you are forgetting that there are 8
> bits in a byte. 10 bits per element = 1.25MB per milion elements.

We are aware of that, but we really needed to do some very rough estimates
and it's much easier to do the calculations with 10. Actually according to
wikipedia it's not 10 bits per element but more like 9.6, etc. But it really
does not matter whether it is 10MB or 20MB of data, it's still a lot of data ...

Tomas



Re: proposal : cross-column stats

From
Florian Pflug
Date:
On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:

> 2010/12/24 Florian Pflug <fgp@phlo.org>:
> 
>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>> 
>>>   I guess we could use the highest possible value (equal to the number
>>>   of tuples) - according to wiki you need about 10 bits per element
>>>   with 1% error, i.e. about 10MB of memory for each million of
>>>   elements.
>> 
>> Drat. I had expected these number to come out quite a bit lower than
>> that, at least for a higher error target. But even with 10% false
>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>> assume the filter will always fit into memory, I fear :-(
> 
> I have the impression that both of you are forgetting that there are 8
> bits in a byte. 10 bits per element = 1.25MB per milion elements.

Uh, of course. So in the real universe, the numbers are

~1.2MB per 1e6 elements for a false positive rate of 1%
~0.5MB per 1e6 elements for a false positive rate of 10%

Hm. So for a table with a billion distinct elements, we'd need half
a gigabyte per column for the filter. A tuple with two int columns
takes at least 24+2*4 = 32 bytes to store I think, making such a table
at least 32GB in size. The filter size would thus be 1/64 of the table
size in the worst case. 
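
For reference, plugging the usual sizing formulas m = -n*ln(p)/ln(2)^2 bits
and k = (m/n)*ln(2) hash functions into a quick throwaway query gives
roughly those numbers:

  SELECT n AS elements,
         p AS false_positive_rate,
         round((-n * ln(p) / ln(2.0)^2 / 8 / 1e6)::numeric, 2) AS filter_mb,
         round(-ln(p) / ln(2.0)) AS hash_functions
    FROM (VALUES (1000000,    0.01),
                 (1000000,    0.10),
                 (1000000000, 0.10)) AS t (n, p);

i.e. about 1.2MB and 0.6MB per 1e6 elements at 1% and 10% respectively, and
around 600MB for 1e9 distinct values at 10%.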

best regards,
Florian Pflug



Re: proposal : cross-column stats

From
Tomas Vondra
Date:
On 24.12.2010 13:15, tv@fuzzy.cz wrote:
>> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
>>>>   I guess we could use the highest possible value (equal to the number
>>>>   of tuples) - according to wiki you need about 10 bits per element
>>>>   with 1% error, i.e. about 10MB of memory for each million of
>>>>   elements.
>>>
>>> Drat. I had expected these number to come out quite a bit lower than
>>> that, at least for a higher error target. But even with 10% false
>>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>>> assume the filter will always fit into memory, I fear :-(
>>
>> I have the impression that both of you are forgetting that there are 8
>> bits in a byte. 10 bits per element = 1.25MB per milion elements.
> 
> We are aware of that, but we really needed to do some very rough estimates
> and it's much easier to do the calculations with 10. Actually according to
> wikipedia it's not 10bits per element but 9.6, etc. But it really does not
> matter if there is 10MB or 20MB of data, it's still a lot of data ...

Oooops, now I see what the problem is. I thought you were pointing out
something else, but I had actually used 1B = 1b (which is obviously wrong).
Florian already noticed that and fixed the estimates, though.

Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
On 24.12.2010 13:37, Florian Pflug wrote:
> On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
> 
>> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
>>>>   I guess we could use the highest possible value (equal to the number
>>>>   of tuples) - according to wiki you need about 10 bits per element
>>>>   with 1% error, i.e. about 10MB of memory for each million of
>>>>   elements.
>>>
>>> Drat. I had expected these number to come out quite a bit lower than
>>> that, at least for a higher error target. But even with 10% false
>>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>>> assume the filter will always fit into memory, I fear :-(
>>
>> I have the impression that both of you are forgetting that there are 8
>> bits in a byte. 10 bits per element = 1.25MB per milion elements.
> 
> Uh, of course. So in the real universe, the numbers are
> 
> ~1.2MB per 1e6 elements for a false positive rate of 1%
> ~0.5MB per 1e6 elements for a false positive rate of 10%
> 
> Hm. So for a table with a billion distinct elements, we'd need half
> a gigabyte per column for the filter. A tuple with two int columns
> takes at least 24+2*4 = 32bytes to store I think, making such a table
> at least 32GB in size. The filter size would thus be 1/64 of the table
> size in the worst case. 

Yes, but in reality you need three such filters - one for each column,
one for the combination. So that is 1.5GB (with 10% error rate) or 3.6GB
(with 1% error rate).

But this is severely excessive compared to the real needs, as there are
usually far fewer distinct values than tuples (while these computations
assume the two are equal).

I was thinking about a simple heuristic to size the filter properly,
something like this:

1) sample a small portion of the table and count the distinct values
2) compute "number of distinct values" / "number of sampled tuples"
3) scale this ratio to the whole table and size the filter accordingly

Say there are really 50 distinct values and 1.000 rows are sampled - even
if 20 of the distinct values happen to be missing from the sample, we still
see 30 of them, i.e. 3% in step (2). If the table has 1.000.000 tuples, step
(3) gives 30.000, so the filter needs only about 40kB. Which is a huge
improvement compared to the previous approach (1.2MB).
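
In SQL the heuristic is basically this (a rough sketch - my_table, the
columns and the 1.000-row sample are made up, and ANALYZE of course picks
its sample differently):

  WITH sample AS (
      SELECT a, b FROM my_table ORDER BY random() LIMIT 1000
  )
  SELECT s.dist_ab_sample,
         s.dist_ab_sample::float8 / s.sample_size AS ratio,
         -- scale the per-sample ratio to the whole table
         ceil(s.dist_ab_sample::float8 / s.sample_size * c.reltuples)
             AS scaled_ndistinct
    FROM (SELECT count(DISTINCT (a, b)) AS dist_ab_sample,
                 count(*)               AS sample_size
            FROM sample) s,
         pg_class c
   WHERE c.relname = 'my_table';

and scaled_ndistinct would then be used to size the filter.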

Obviously this will still lead to overestimates in most cases, and there
are probably other failure cases, but I think it's a reasonable solution.
I don't think it can result in an underestimate (which is the case where
you lose precision).

And in case we want to build this incrementally (e.g. from VACUUM), we
really need to use a somewhat larger filter, because rescaling the filter
is not possible AFAIK (short of rebuilding it from scratch).

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
On 24.12.2010 04:41, Florian Pflug wrote:
> The filter size could be derived from the table's statistics target, or
> be otherwise user-definable. We could also auto-resize once it gets too
> full. But still, that all seems awfully complex :-(

Using a statistics target for this is a good idea, I think - we could use
it to determine the error rate of the filter. Something like

  error rate = 10 - 0.01 * (statistics_target - 100)

which gives

  1%  for statistics target 1000
  10% for statistics target 100

or maybe something like this (where the error rate grows faster for
smaller statistics target values)

  error rate = 11 - 91000 / (statistics_target^2)

which gives about

  1%  for statistics target 1000
  10% for statistics target 100
  36% for statistics target 50

But I guess a 10% error rate is about the precision we need, so it
probably does not make much sense to aim for lower values.

>> >   Another possibility is to collect the data from just a small portion
>> >   of a table and then use the result to estimate the number of distinct
>> >   values for the whole table. But I'm not sure we can do this reliably,
>> >   I see many traps in this.
> This is how it works currently. The problem with this approach is that
> it gives you very little guarantees about how precise the result will be.
> Extrapolating works very well for things like MCVs and histograms, because
> there you're by definition interested mostly in values which occur often -
> and thus with a high probability in the relative few rows you sample. For
> the number of distinct values, however, this isn't true - if ndistinct
> is an order of magnitude smaller than the number of rows, relatively few
> rows can account for a large percentage of the distinct values...

That basically means we need to sample a large portion of the table :-(

> Another idea would be to obtain the ndistinct values from an index somehow.
> Postgres cannot currently scan an index in physical order, only in logical
> order, due to locking considerations. But since we'd only be interested in
> an estimate, maybe a scan in physical block order would work for ndistinc
> estimates? Just a wild idea, mind you, I haven't checked at all if that'd
> be even remotely feasible.

I was thinking about that too, and I think we could do it using the
pageinspect contrib module. Sure, there might be a problem with bloated
indexes.

And relying on this would actually require a multi-column index on all
the columns. Individual indexes are not enough, as we need the number of
distinct combinations too.

regards
Tomas


Re: proposal : cross-column stats

From
Tomas Vondra
Date:
On 24.12.2010 13:37, Florian Pflug wrote:
> On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
> 
>> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
>>>>   I guess we could use the highest possible value (equal to the number
>>>>   of tuples) - according to wiki you need about 10 bits per element
>>>>   with 1% error, i.e. about 10MB of memory for each million of
>>>>   elements.
>>>
>>> Drat. I had expected these number to come out quite a bit lower than
>>> that, at least for a higher error target. But even with 10% false
>>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>>> assume the filter will always fit into memory, I fear :-(
>>
>> I have the impression that both of you are forgetting that there are 8
>> bits in a byte. 10 bits per element = 1.25MB per milion elements.
> 
> Uh, of course. So in the real universe, the numbers are
> 
> ~1.2MB per 1e6 elements for a false positive rate of 1%
> ~0.5MB per 1e6 elements for a false positive rate of 10%
> 
> Hm. So for a table with a billion distinct elements, we'd need half
> a gigabyte per column for the filter. A tuple with two int columns
> takes at least 24+2*4 = 32bytes to store I think, making such a table
> at least 32GB in size. The filter size would thus be 1/64 of the table
> size in the worst case. 

I've read several papers about estimating the number of distinct values,
and I have both good and bad news. I don't want this thread to grow
indefinitely, and it's a rather separate topic, so I'll start a new thread
for ndistinct estimates.

regards
Tomas