From Tomas Vondra
Subject Re: multivariate statistics / patch v7
Msg-id 55BA9712.9090608@2ndquadrant.com
In response to Re: multivariate statistics / patch v7  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Hi,

On 07/30/2015 06:58 PM, Heikki Linnakangas wrote:
>
> The problem with a threshold is that around that threshold, even a
> small change in the data set can drastically change the produced
> estimates. For example, imagine that we know from the stats that zip
> code implies city. But then someone adds a single row to the table
> with an odd zip code & city combination, which pushes the estimator
> over the threshold, and the columns are no longer considered
> dependent, and the estimates are now completely different. We should
> avoid steep cliffs like that.
>
> BTW, what is the threshold in the current patch?

There's no simple threshold - the algorithm mining the functional
dependencies is a bit more complicated than that. I tried to explain it
in the comment before build_mv_dependencies (in dependencies.c), but
let me briefly summarize it here.

To mine a dependency [A => B], build_mv_dependencies does this (a C
sketch of the loop follows the list):

(1) sort the sample by {A,B}

(2) split the sample into groups with the same value of A

(3) for each group, decide if it's consistent with the dependency
    (a) if the group is too small (less than 3 rows), ignore it
    (b) if the group is consistent, update
        n_supporting and n_supporting_rows
    (c) if the group is inconsistent, update
        n_contradicting and n_contradicting_rows

(4) decide whether the dependency is "valid" by checking
    n_supporting_rows >= n_contradicting_rows * 10
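
To make that concrete, here's a minimal sketch of the loop in plain C -
not the actual patch code, just an illustration. SampleRow and the
plain int values are simplifications (the real code works on sorted
sample rows), and the group counters n_supporting / n_contradicting
are omitted because the final check only uses the row counts:

    #include <stdbool.h>

    typedef struct
    {
        int a;  /* value of column A */
        int b;  /* value of column B */
    } SampleRow;

    /* rows must already be sorted by (a, b) - step (1) */
    static bool
    dependency_is_valid(const SampleRow *rows, int nrows)
    {
        int n_supporting_rows = 0;
        int n_contradicting_rows = 0;
        int group_start = 0;

        for (int i = 1; i <= nrows; i++)
        {
            /* step (2): a group ends when the value of A changes */
            if (i == nrows || rows[i].a != rows[group_start].a)
            {
                int group_size = i - group_start;

                /* step (3a): ignore groups with less than 3 rows */
                if (group_size >= 3)
                {
                    /* consistent iff all rows share a single B value -
                     * thanks to the sort, comparing the first and last
                     * row of the group is enough */
                    if (rows[group_start].b == rows[i - 1].b)
                        n_supporting_rows += group_size;    /* (3b) */
                    else
                        n_contradicting_rows += group_size; /* (3c) */
                }

                group_start = i;
            }
        }

        /* step (4): the (arbitrary) 10x margin */
        return n_supporting_rows >= n_contradicting_rows * 10;
    }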

The 10x limit is rather arbitrary, and yes - I can imagine a more
complex condition (e.g. one looking at the average number of tuples per
group), but I haven't looked into that - the point was to use something
very simple, just to illustrate the infrastructure.

I think we might come up with some elaborate way of associating a
"degree" with the functional dependency, but at that point we really
lose the simplicity, and also make it indistinguishable from the
remaining statistics types (because it would no longer be possible to
reduce the clauses up front, before performing the regular estimation).
That reduction is exactly what makes the functional dependencies so
neat and efficient, so I'm not overly enthusiastic about doing that.

What seems more interesting is implementing the ndistinct coefficient 
instead, as proposed by Kyotaro-san - that seems to have the nice 
"smooth" behavior you desire, while keeping the simplicity.

Both statistics types (functional dependencies and the ndistinct
coefficient) have one weak point, though - they implicitly assume the
queries use "compatible" values. For example, with a query like
   WHERE city = 'New York' AND zip = 'zip for Detroit'

they can't detect that the combination is contradictory (the actual
result is empty), because those statistics types are oblivious to
individual values. I don't see this as a fatal flaw, though - it's
rather a consequence of the nature of the stats, and I tend to look at
the functional dependencies the same way.
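
To spell out the consequence (simplified - assuming the clauses are
reduced the way described above): with a valid dependency [zip =>
city], the implied clause is dropped and only the implying one is
estimated,

    P(zip = 'zip for Detroit' AND city = 'New York')
        is estimated as
    P(zip = 'zip for Detroit')

which is a reasonable estimate whenever the values are compatible, but
an overestimate (the true selectivity is 0) when they're not.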

If you need stats without these "issues", you'll have to use an MCV
list or a histogram. Trying to fix the simple statistics types is
futile, IMHO.

regards
Tomas

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


