Re: multivariate statistics / patch v7

From | Tomas Vondra
Subject | Re: multivariate statistics / patch v7
Date |
Msg-id | 55BA9712.9090608@2ndquadrant.com
In response to | Re: multivariate statistics / patch v7 (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses | Re: multivariate statistics / patch v7
List | pgsql-hackers
Hi,

On 07/30/2015 06:58 PM, Heikki Linnakangas wrote:
>
> The problem with a threshold is that around that threshold, even a
> small change in the data set can drastically change the produced
> estimates. For example, imagine that we know from the stats that zip
> code implies city. But then someone adds a single row to the table
> with an odd zip code & city combination, which pushes the estimator
> over the threshold, and the columns are no longer considered
> dependent, and the estimates are now completely different. We should
> avoid steep cliffs like that.
>
> BTW, what is the threshold in the current patch?

There's no simple threshold - the algorithm mining the functional
dependencies is a bit more complicated. I tried to explain it in the
comment before build_mv_dependencies (in dependencies.c), but let me
briefly summarize it here. To mine the dependency [A => B],
build_mv_dependencies does this:

(1) sort the sample by {A,B}

(2) split the sample into groups with the same value of A

(3) for each group, decide if it's consistent with the dependency:

    (a) if the group is too small (less than 3 rows), ignore it
    (b) if the group is consistent, update n_supporting_rows
    (c) if the group is inconsistent, update n_contradicting_rows

(4) decide whether the dependency is "valid" by checking

    n_supporting_rows >= n_contradicting_rows * 10

The limit is rather arbitrary, and yes - I can imagine a more complex
condition (e.g. looking at the average number of tuples per group etc.),
but I haven't looked into that - the point was to use something very
simple, only to illustrate the infrastructure.

I think we might come up with some elaborate way of associating a
"degree" with the functional dependency, but at that point we really
lose the simplicity, and also make it indistinguishable from the
remaining statistics types (because it won't be possible to reduce the
clauses like this before performing the regular estimation).
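For illustration, the mining steps above can be sketched roughly like
this (a hypothetical Python sketch of the described algorithm, not the
actual dependencies.c code; the constants mirror the 3-row minimum and
the 10x ratio from the summary):

```python
# Sketch of mining the dependency [A => B] from a sample of (a, b) rows.
from itertools import groupby

MIN_GROUP_SIZE = 3   # (3a) groups smaller than this are ignored
SUPPORT_RATIO = 10   # (4) n_supporting_rows >= n_contradicting_rows * 10

def dependency_holds(sample):
    """sample: list of (a, b) tuples drawn from the table."""
    n_supporting_rows = 0
    n_contradicting_rows = 0

    # (1) sort the sample by {A, B}
    sample = sorted(sample)

    # (2) split into groups with the same value of A
    for _, group in groupby(sample, key=lambda row: row[0]):
        rows = list(group)

        # (3a) ignore groups that are too small
        if len(rows) < MIN_GROUP_SIZE:
            continue

        # (3b/3c) the group is consistent iff all rows share one B value
        if len({b for _, b in rows}) == 1:
            n_supporting_rows += len(rows)
        else:
            n_contradicting_rows += len(rows)

    # (4) validity condition
    return n_supporting_rows >= n_contradicting_rows * SUPPORT_RATIO

# e.g. a sample where every zip maps to exactly one city
sample = [(z, 'CityA') for z in range(10) for _ in range(5)]
print(dependency_holds(sample))  # True
```

Note how a single odd row only flips one group from supporting to
contradicting, so the 10x ratio gives some slack before the dependency
is invalidated - though around the ratio itself the cliff Heikki
describes still exists.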
That clause reduction is exactly what makes the functional dependencies
so neat and efficient, so I'm not overly enthusiastic about doing that.
What seems more interesting is implementing the ndistinct coefficient
instead, as proposed by Kyotaro-san - that seems to have the nice
"smooth" behavior you desire, while keeping the simplicity.

Both statistics types (functional dependencies and the ndistinct
coefficient) have one weak point, though - they somehow assume the
queries use "compatible" values. For example, if you use a query with

    WHERE city = 'New York' AND zip = 'zip for Detroit'

they can't detect cases like this, because those statistics types are
oblivious to the individual values. I don't see this as a fatal flaw,
though - it's rather a consequence of the nature of the stats, and I
tend to look at the functional dependencies the same way. If you need
stats without these "issues", you'll have to use an MCV list or a
histogram. Trying to fix the simple statistics types is futile, IMHO.

regards
Tomas

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
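P.S. The "compatible values" weakness can be shown with a tiny sketch
(hypothetical Python, invented zip codes and a toy 100-row table, not
anything from the patch): once the dependency zip => city is considered
valid, the city clause is dropped and the selectivity is taken from the
zip clause alone, whether or not the (zip, city) pair actually occurs.

```python
# Toy table: two zip codes, each belonging to exactly one city.
rows = [('48201', 'Detroit')] * 50 + [('10001', 'New York')] * 50

def estimate_with_dependency(rows, zip_val, city_val):
    # dependency-based reduction: sel(zip AND city) ~= sel(zip),
    # the city clause is ignored entirely
    return sum(1 for z, _ in rows if z == zip_val) / len(rows)

def true_selectivity(rows, zip_val, city_val):
    return sum(1 for z, c in rows if z == zip_val and c == city_val) / len(rows)

# compatible pair: the estimate matches reality
print(estimate_with_dependency(rows, '48201', 'Detroit'))   # 0.5
print(true_selectivity(rows, '48201', 'Detroit'))           # 0.5

# incompatible pair: the estimate stays 0.5, the truth is 0.0
print(estimate_with_dependency(rows, '48201', 'New York'))  # 0.5
print(true_selectivity(rows, '48201', 'New York'))          # 0.0
```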