Thread: proposal: cross-column stats
Hi everyone,

one of the sessions I attended at PgDay last week was Heikki's session about statistics in PostgreSQL. One of the issues he mentioned (and one I regularly run into) is the absence of cross-column stats. When the columns are not independent, this usually results in poor estimates (and then in suboptimal plans).

I was thinking about this issue before, but that session was the last impulse that pushed me to try to hack up a proof of concept. So here it is ;-)

Let's talk about one special case - I'll explain how the proposed solution works, and then I'll explain how to make it more general, what improvements are possible, what issues are there. Anyway this is by no means a perfect or complete solution - it's just a starting point.

------------------------ Short introduction ------------------------

Say we have a table with two INT columns - col_a and col_b - and we want to estimate the number of rows for a condition involving both columns:

   WHERE (col_a BETWEEN m AND n) AND (col_b BETWEEN p AND q)

When the columns are independent, doing the estimate is just a matter of multiplication. When the columns are dependent, the estimate may be way off.

Let's assume there are histograms with 5 bins for both columns. What I propose is basically building a 2D histogram. It kind of resembles a contingency table, i.e. a table like this:

   col_b \ col_a || 20% | 20% | 20% | 20% | 20% |
   ===============||==============================
             20% ||  4% |  4% |  4% |  4% |  4% |
             20% ||  4% |  4% |  4% |  4% |  4% |
             20% ||  4% |  4% |  4% |  4% |  4% |
             20% ||  4% |  4% |  4% |  4% |  4% |
             20% ||  4% |  4% |  4% |  4% |  4% |
   ===============||==============================

where each column / row represents a bin in the original histograms, and each cell represents the expected fraction of rows in it (for truly independent columns, it's 4%). For dependent columns the actual values may be very different, of course - e.g. for strongly dependent columns it might look like this:

   col_b \ col_a || 20% | 20% | 20% | 20% | 20% |
   ===============||==============================
             20% || 20% |  0% |  0% |  0% |  0% |
             20% ||  0% | 20% |  0% |  0% |  0% |
             20% ||  0% |  0% | 20% |  0% |  0% |
             20% ||  0% |  0% |  0% | 20% |  0% |
             20% ||  0% |  0% |  0% |  0% | 20% |
   ===============||==============================

To estimate the number of rows matching the condition, you sum the estimates for the cells matching the condition. I.e. when the condition on col_a matches the lowest 20% of values (the first histogram bin) and the condition on col_b matches the lowest 20% of values, this corresponds to the first cell (20% of rows). The current optimizer estimates this to be 4%, as it believes the columns are independent.

I'm not sure whether I've explained it well enough, but the essence of the proposal is to build N-dimensional histograms (where N is the number of columns covered by the statistics) just as we are building histograms today.

----------------------- Proof of concept ---------------------------

I've hacked up a nasty proof of concept in PL/pgSQL (see the attachment). It creates two tables - test_table and cross_stats. The former contains the data (in two columns), the latter is used to store cross-column statistics of test_table.
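To make the bin-summing idea concrete, here is a minimal stand-alone SQL sketch of the same computation. It is deliberately simplified: it assumes values in [0,1000] (as in the example data below), uses fixed equal-width bins via width_bucket() instead of the per-column histogram boundaries, and counts boundary cells in full, while the attached PoC handles partially covered cells separately (see the notes below). The name cross_stats_grid is just a scratch name, not the PoC's cross_stats table.

   -- build a 10x10 grid: one row per (bin_a, bin_b) cell with the number of rows falling into it
   CREATE TABLE cross_stats_grid AS
   SELECT width_bucket(col_a, 0, 1001, 10) AS bin_a,
          width_bucket(col_b, 0, 1001, 10) AS bin_b,
          count(*) AS cnt
     FROM test_table
    GROUP BY 1, 2;

   -- estimate rows for (col_a BETWEEN 30 AND 129) AND (col_b BETWEEN 234 AND 484)
   -- by summing the cells touched by both ranges
   SELECT sum(cnt) AS estimated_rows
     FROM cross_stats_grid
    WHERE bin_a BETWEEN width_bucket(30, 0, 1001, 10) AND width_bucket(129, 0, 1001, 10)
      AND bin_b BETWEEN width_bucket(234, 0, 1001, 10) AND width_bucket(484, 0, 1001, 10);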
Then there are several PL/pgSQL functions - two of them are really important:

   collect_stats(p_bins_a INT, p_bins_b INT) - this one is used to collect the stats; the parameters represent the number of bins for the columns (the sides of the contingency table) - collect_stats(10,10) will build a contingency table with 100 cells

   get_estimate(p_from_a INT, p_to_a INT, p_from_b INT, p_to_b INT) - computes the estimate for the condition listed above (ranges on both columns)

So to run the PoC, just do something like this:

1) fill the table with some data

   INSERT INTO test_table SELECT round(random()*1000), round(random()*1000) FROM generate_series(1,100000);

2) collect the cross-column statistics

   SELECT collect_stats(10,10);

3) see the current estimate and the actual number of rows

   EXPLAIN ANALYZE SELECT * FROM test_table WHERE (col_a BETWEEN 30 AND 129) AND (col_b BETWEEN 234 AND 484);

4) see the estimate based on the contingency table

   SELECT get_estimate(30, 129, 234, 484);

Just two very simple tests for now - col_a/col_b contain the range from the query, then there is the actual number of matching rows and the current estimate, and finally the new estimate based on a contingency table with various numbers of bins.

A) independent columns (proof that it produces estimates just as good as the current code)

     col_a    |   col_b   | actual | expected | 10x10 | 20x20 |
   ============================================================
    [50,70]   | [50,70]   |     41 |       40 |    41 |    47 |
    [50,250]  | [50,250]  |   4023 |     4024 |  4436 |  3944 |
    [50,250]  | [750,950] |   4023 |     3955 |  4509 |  3933 |

B) strongly positively dependent columns (actually col_a = col_b)

     col_a    |   col_b   | actual | expected | 10x10 | 20x20 | 100x100 |
   ======================================================================
    [50,70]   | [50,70]   |   2143 |       57 |   391 |   729 |    1866 |
    [50,250]  | [50,250]  |  20181 |     4111 | 15401 | 19983 |   19991 |
    [50,250]  | [750,950] |      0 |     3977 |     0 |     0 |       0 |

Which proves that it actually significantly improves the estimates. Again, I hacked this up in about an hour, so it's really slow and I think some of the estimates may be further improved. And the precision obviously depends on the number of cells.

----------------------- Additional notes ---------------------------

1) The whole thing may be easily extended to more than two columns - just build N-dimensional cubes.

2) I really don't think we should collect stats for all combinations of columns of a table - I do like the Oracle-like approach where a DBA has to enable cross-column stats using an ALTER TABLE (for a particular list of columns).

   The only exception might be columns from a multi-column index. It might be quite efficient I guess?

3) There are independence tests for contingency tables (e.g. Pearson's Chi-squared test), so it's easy to find out whether the columns are independent. In that case we can just throw away these stats and use the simple estimation.

   http://mathworld.wolfram.com/Chi-SquaredTest.html

4) Or we could store just those cells where the expected and observed values differ significantly (may help if most of the values are independent, but there's a small glitch somewhere).

5) It does not make sense to collect stats for two lists of columns that share several columns - just collect cross stats for the union of the column lists and then 'drill up' as needed. An extreme case might be collecting stats for all columns in a table. But there are practical issues with this approach (huge tables, small values within cells).

6) The precision increases with the number of cells, but the number of cells grows exponentially.
   So it's necessary to choose a reasonable number of bins for the columns (this might be part of the ALTER command, and should not be firmly bound to the statistics target). At the cell level, the simple 'multiplication' estimates are actually used (but only when the cell is divided by the range).

7) All this is done under the assumption that all the columns are in a single table - when doing research in the archives I noticed suggestions to use this to optimize joins. Maybe it can be done, but I really don't know how to do that.

8) I'm not sure where to store this and in what form - generally the table does not need to be significantly more complex than the cross_stats table from the PoC.

9) I've noticed demands to collect data about columns used frequently in a single WHERE condition. Not sure how to do that and how to analyze the collected data. I think collecting data about the expected and observed number of rows should be just fine - but that's kind of independent from this proposal.

regards
Tomas
Attachment
On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Hi everyone,
>
> one of the sessions I attended at PgDay last week was Heikki's session
> about statistics in PostgreSQL. One of the issues he mentioned (and one
> I regularly run into) is the absence of cross-column stats. When the
> columns are not independent, this usually results in poor estimates (and
> then in suboptimal plans).

Very cool that you're working on this.

> Let's talk about one special case - I'll explain how the proposed
> solution works, and then I'll explain how to make it more general, what
> improvements are possible, what issues are there. Anyway this is by no
> means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably anything with a b-tree operator class. What you've coded seems to rely on calculations on the values. Have you thought about how it could work for, for example, strings?

  The classic failure case has always been: postcodes and city names. Strongly correlated, but in a way that the computer can't easily see. Not that I suggest you fix this, but it's food for thought. Though strictly speaking this is a different kind of correlation than what you're looking at.

> 2) I really don't think we should collect stats for all combinations of
>    columns of a table - I do like the Oracle-like approach where a DBA
>    has to enable cross-column stats using an ALTER TABLE (for a
>    particular list of columns).
>
>    The only exception might be columns from a multi-column index. It
>    might be quite efficient I guess?

In the past it has been suggested to only do it for multi-column indexes, but these days I find that in some situations I prefer to make individual indexes and let the bitmap scan code combine them. So perhaps it would be best to let it be configured by the DBA.

> 3) There are independence tests for contingency tables (e.g. Pearson's
>    Chi-squared test), so it's easy to find out whether the columns
>    are independent. In that case we can just throw away these stats and
>    use the simple estimation.
>
>    http://mathworld.wolfram.com/Chi-SquaredTest.html

I think this would be good to include, if possible.

Actually, I wonder if the existing stats collection code could be altered to attempt to calculate the correlation between columns as part of its other work.

> 4) Or we could store just those cells where the expected and observed values
>    differ significantly (may help if most of the values are independent,
>    but there's a small glitch somewhere).

Compressing that grid would be useful, given that for many dimensions most of the grid will be not interesting. In fact, storing the 20 largest values may be enough. Worth an experiment.

Hope this helps,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                               - Charles de Gaulle
On 12.12.2010 15:17, Martijn van Oosterhout wrote: > On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote: > Very cool that you're working on this. +1 >> Lets talk about one special case - I'll explain how the proposed >> solution works, and then I'll explain how to make it more general, what >> improvements are possible, what issues are there. Anyway this is by no >> means a perfect or complete solution - it's just a starting point. > > It looks like you handled most of the issues. Just a few points: > > - This is obviously applicable to more than just integers, probably > anything with a b-tree operator class. What you've coded seems rely > on calculations on the values. Have you thought about how it could > work for, for example, strings? > > The classic failure case has always been: postcodes and city names. > Strongly correlated, but in a way that the computer can't easily see. Yeah, and that's actually analogous to the example I used in my presentation. The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for each value of column A, or something like that. I don't know what the statisticians call that property, or if there's some existing theory on how to measure that from a sample. That's assuming the combination has any matches. It's possible that the user chooses a postcode and city combination that doesn't exist, but that's no different from a user doing "city = 'fsdfsdfsd'" on a single column, returning no matches. We should assume that the combination makes sense. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hi!

On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> Let's talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or complete solution - it's just a starting point.
>
> It looks like you handled most of the issues. Just a few points:
>
> - This is obviously applicable to more than just integers, probably
>   anything with a b-tree operator class. What you've coded seems to rely
>   on calculations on the values. Have you thought about how it could
>   work for, for example, strings?

Yes, I know, I just forgot to address this in my previous e-mail. The contingency tables have a really nice feature - they are based on splitting the sets into groups (~ bins of the histograms for each column). And this can be done whenever you can sort the values; you really don't need any calculations. So it should work with strings.

And another thing I somehow forgot is handling the case when there is no histogram, just an MCV list. That's mostly the same - each of the values might be a separate group, or the values might be grouped to form fewer groups, etc.

> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.
> Not that I suggest you fix this, but it's food for thought. Though
> strictly speaking this is a different kind of correlation than what
> you're looking at.

Hmmm, I see. I think the proposal does not fix this particular case, although it might improve the situation a little bit (limit the error between the expected and observed number of rows). The problem is that once we get to the cell level of the contingency table, there is no additional (more detailed) information. So we're stuck with the multiplication estimate, or something like that.

I was thinking about it actually, and I think we could collect some more info - a correlation coefficient for each bin, or something like that. But that was not part of my proposal, and I'm not sure how to do that.

>> 2) I really don't think we should collect stats for all combinations of
>>    columns of a table - I do like the Oracle-like approach where a DBA
>>    has to enable cross-column stats using an ALTER TABLE (for a
>>    particular list of columns).
>>
>>    The only exception might be columns from a multi-column index. It
>>    might be quite efficient I guess?
>
> In the past it has been suggested to only do it for multi-column
> indexes, but these days I find that in some situations I prefer to
> make individual indexes and let the bitmap scan code combine them. So
> perhaps it would be best to let it be configured by the DBA.

Yes, I prefer individual indexes too. The idea behind collecting cross-column stats for multi-column indexes was that maybe we could 'append' this to the current functionality (building the index or something like that) so that it does not introduce significant performance problems.

>> 3) There are independence tests for contingency tables (e.g. Pearson's
>>    Chi-squared test), so it's easy to find out whether the columns
>>    are independent. In that case we can just throw away these stats and
>>    use the simple estimation.
>>
>>    http://mathworld.wolfram.com/Chi-SquaredTest.html
>
> I think this would be good to include, if possible.
>
> Actually, I wonder if the existing stats collection code could be
> altered to attempt to calculate the correlation between columns as part
> of its other work.
I guess that would be rather expensive - to compute the correlation you need two passes, and you need to do that for each pair of columns. So I'd be surprised if it is possible (and effective).

Another thing is that you can compute a correlation only for numeric columns, so it's not possible to do that for the city/ZIP code case mentioned above. More precisely - it's possible to do that (if you map strings to numbers somehow), but I doubt you'll get useful results as the assignment is rather random. Well, you could ask the governments to assign ZIP codes to cities in strictly alphabetical order, but I guess they'll say no.

>> 4) Or we could store just those cells where the expected and observed values
>>    differ significantly (may help if most of the values are independent,
>>    but there's a small glitch somewhere).
>
> Compressing that grid would be useful, given that for many dimensions
> most of the grid will be not interesting. In fact, storing the 20
> largest values may be enough. Worth an experiment.

Not exactly just the largest values - rather the values that are significantly different from the expected values. Generally there are two interesting cases:

   expected << observed - the optimizer may choose an index scan, although a seq scan would be better
   expected >> observed - the optimizer may choose a seq scan, although an index scan would be better

regards
Tomas
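To make the expected-vs-observed comparison concrete, here is a sketch that computes the independence expectation for every cell from the marginals, so the "interesting" cells can be picked out; summing (observed - expected)^2 / expected over all cells also gives the Pearson chi-squared statistic mentioned earlier. It assumes the per-cell counts sit in a table grid(bin_a, bin_b, cnt) like the one the PoC builds - all names here are placeholders.

   SELECT bin_a, bin_b, observed, expected
     FROM (SELECT bin_a, bin_b,
                  cnt AS observed,
                  sum(cnt) OVER (PARTITION BY bin_a)    -- row marginal
                  * sum(cnt) OVER (PARTITION BY bin_b)  -- column marginal
                  / sum(cnt) OVER ()                    -- grand total
                    AS expected
             FROM grid) g
    ORDER BY abs(observed - expected) DESC;             -- most 'interesting' cells first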
On 12.12.2010 15:43, Heikki Linnakangas wrote:
>> The classic failure case has always been: postcodes and city names.
>> Strongly correlated, but in a way that the computer can't easily see.
>
> Yeah, and that's actually analogous to the example I used in my
> presentation.
>
> The way I think of that problem is that once you know the postcode,
> knowing the city name doesn't add any information. The postcode implies
> the city name. So the selectivity for "postcode = ? AND city = ?" should
> be the selectivity of "postcode = ?" alone. The measurement we need is
> "implicativeness": How strongly does column A imply a certain value for
> column B. Perhaps that could be measured by counting the number of
> distinct values of column B for each value of column A, or something
> like that. I don't know what the statisticians call that property, or if
> there's some existing theory on how to measure that from a sample.

Yes, those issues are a righteous punishment for breaking BCNF rules ;-)

I'm not sure it's solvable using the contingency tables, as it requires knowledge about dependencies between individual values (working with cells is not enough, although it might improve the estimates).

Well, maybe we could collect these stats (the number of cities for a given ZIP code and the number of ZIP codes for a given city). Collecting good stats about this is a bit tricky, but possible. What about collecting this for the MCVs from both columns?

Tomas
On Dec 12, 2010, at 15:43, Heikki Linnakangas wrote:
> The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any
> information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the
> selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply
> a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B
> for each value of column A, or something like that. I don't know what the statisticians call that property, or if
> there's some existing theory on how to measure that from a sample.

The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the assumption or knowledge of B. The basic tool for working with conditional probabilities is Bayes' theorem, which states that

   P(A|B) = P(A and B) / P(B).

Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like a=3) does not change under additional assumptions like b=4. Bayes' theorem thus becomes

   P(A) = P(A and B) / P(B)  <=>  P(A and B) = P(A)*P(B)

which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".

I believe that measuring this by counting the number of distinct values of column B for each A is basically the right idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and compare that to the overall number of distinct values of "b"...

A (very) quick search on scholar.google.com for "estimate conditional probability" didn't turn up anything useful, but it's hard to believe that there isn't at least some literature on the subject.

best regards,
Florian Pflug
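As a concrete version of that measurement - the number of distinct values of "b" for each of the most common values of "a", compared with the overall number of distinct values of "b" - here is what it might look like in plain SQL (t, a, b and the LIMIT are placeholders; ANALYZE would of course work from its sample rather than the whole table):

   WITH mcv_a AS (
       SELECT a
         FROM t
        GROUP BY a
        ORDER BY count(*) DESC
        LIMIT 100                                -- "most common values" of a
   )
   SELECT t.a,
          count(DISTINCT t.b)               AS distinct_b_for_this_a,
          (SELECT count(DISTINCT b) FROM t) AS distinct_b_overall
     FROM t
     JOIN mcv_a USING (a)
    GROUP BY t.a;

If distinct_b_for_this_a stays close to 1 for the common values of "a", then "a" essentially implies "b" (the postcode/city case); if it is close to distinct_b_overall, the columns look independent.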
On 12.12.2010 17:33, Florian Pflug wrote:
> On Dec 12, 2010, at 15:43, Heikki Linnakangas wrote:
>> The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any
>> information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the
>> selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply
>> a certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B
>> for each value of column A, or something like that. I don't know what the statisticians call that property, or if
>> there's some existing theory on how to measure that from a sample.
>
> The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the
> assumption or knowledge of B. The basic tool for working with conditional probabilities is Bayes' theorem, which
> states that
>
>    P(A|B) = P(A and B) / P(B).
>
> Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like
> a=3) does not change under additional assumptions like b=4. Bayes' theorem thus becomes
>
>    P(A) = P(A and B) / P(B)  <=>  P(A and B) = P(A)*P(B)
>
> which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".
>
> I believe that measuring this by counting the number of distinct values of column B for each A is basically the
> right idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of
> "a", and compare that to the overall number of distinct values of "b"...

Good point!

Well, I was thinking about this too - generally this means creating a contingency table with the MCVs as bins. Then you can compute these interesting probabilities P(A and B). (OK, now I definitely look like some contingency table weirdo who tries to solve everything with a contingency table. OMG!)

The question is - what are we going to do when the values in the query are not in the MCV list? Are there some heuristics to estimate the probability from the MCVs, or something like that? Could we use some "average" probability or what?

Tomas
On Sun, Dec 12, 2010 at 9:43 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > The way I think of that problem is that once you know the postcode, knowing > the city name doesn't add any information. The postcode implies the city > name. So the selectivity for "postcode = ? AND city = ?" should be the > selectivity of "postcode = ?" alone. The measurement we need is > "implicativeness": How strongly does column A imply a certain value for > column B. Perhaps that could be measured by counting the number of distinct > values of column B for each value of column A, or something like that. I > don't know what the statisticians call that property, or if there's some > existing theory on how to measure that from a sample. This is a good idea, but I guess the question is what you do next. If you know that the "applicability" is 100%, you can disregard the restriction clause on the implied column. And if it has no implicatory power, then you just do what we do now. But what if it has some intermediate degree of implicability? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 13.12.2010 01:05, Robert Haas wrote:
> This is a good idea, but I guess the question is what you do next. If
> you know that the "applicability" is 100%, you can disregard the
> restriction clause on the implied column. And if it has no
> implicatory power, then you just do what we do now. But what if it
> has some intermediate degree of implicability?

Well, I think you've missed the e-mail from Florian Pflug - he actually pointed out that the 'implicativeness' Heikki mentioned is called conditional probability. And conditional probability can be used to express the "AND" probability we are looking for (the selectivity).

For two columns, this is actually pretty straightforward - as Florian wrote, the equation is

   P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)

where P(B) may be estimated from the current histogram, and P(A|B) may be estimated from the contingency table (see the previous mails). And "P(A and B)" is actually the value we're looking for.

Anyway there really is no "intermediate" degree of applicability, it just gives you the right estimate. And AFAIR this is easily extensible to more than two columns, as

   P(A and B and C) = P(A and (B and C)) = P(A|(B and C)) * P(B and C)

so it's basically a recursion.

Well, I hope my statements are really correct - it's been a few years since I gained my degree in statistics ;-)

regards
Tomas
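To put made-up numbers on the recursion (purely illustrative): say P(C) = 0.5, P(B|C) = 0.4 and P(A|B and C) = 0.1. Then P(B and C) = 0.4 * 0.5 = 0.2 and P(A and B and C) = 0.1 * 0.2 = 0.02, while the independence assumption with P(A) = 0.3 and P(B) = 0.25 would give 0.3 * 0.25 * 0.5 = 0.0375 - nearly twice the correct value.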
On Sun, Dec 12, 2010 at 8:46 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > Dne 13.12.2010 01:05, Robert Haas napsal(a): >> This is a good idea, but I guess the question is what you do next. If >> you know that the "applicability" is 100%, you can disregard the >> restriction clause on the implied column. And if it has no >> implicatory power, then you just do what we do now. But what if it >> has some intermediate degree of implicability? > > Well, I think you've missed the e-mail from Florian Pflug - he actually > pointed out that the 'implicativeness' Heikki mentioned is called > conditional probability. And conditional probability can be used to > express the "AND" probability we are looking for (selectiveness). > > For two columns, this is actually pretty straighforward - as Florian > wrote, the equation is > > P(A and B) = P(A|B) * P(B) = P(B|A) * P(A) Well, the question is what data you are actually storing. It's appealing to store a measure of the extent to which a constraint on column X constrains column Y, because you'd only need to store O(ncolumns^2) values, which would be reasonably compact and would potentially handle the zip code problem - a classic "hard case" rather neatly. But that wouldn't be sufficient to use the above equation, because there A and B need to be things like "column X has value x", and it's not going to be practical to store a complete set of MCVs for column X for each possible value that could appear in column Y. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> P(A|B) = P(A and B) / P(B).

Well, until this point we've discussed failure cases involving 'AND' conditions. What about 'OR' conditions?

I think the current optimizer computes the selectivity as 's1+s2 - s1*s2' (at least that's what I found in backend/optimizer/path/clausesel.c:630). Sometimes that may return nearly 2x the actual selectivity, but in general it's a reasonable estimate. Are there any severe failure cases that produce much worse estimates?

regards
Tomas
On 13.12.2010 03:00, Robert Haas wrote:
> Well, the question is what data you are actually storing. It's
> appealing to store a measure of the extent to which a constraint on
> column X constrains column Y, because you'd only need to store
> O(ncolumns^2) values, which would be reasonably compact and would
> potentially handle the zip code problem - a classic "hard case" rather
> neatly. But that wouldn't be sufficient to use the above equation,
> because there A and B need to be things like "column X has value x",
> and it's not going to be practical to store a complete set of MCVs for
> column X for each possible value that could appear in column Y.

O(ncolumns^2) values? You mean collecting such stats for each possible pair of columns? Well, I meant something different.

The proposed solution is based on contingency tables, built for selected groups of columns (not for each possible group). And the contingency table gives you the ability to estimate the probabilities needed to compute the selectivity. Or am I missing something?

regards
Tomas
On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > Dne 13.12.2010 03:00, Robert Haas napsal(a): >> Well, the question is what data you are actually storing. It's >> appealing to store a measure of the extent to which a constraint on >> column X constrains column Y, because you'd only need to store >> O(ncolumns^2) values, which would be reasonably compact and would >> potentially handle the zip code problem - a classic "hard case" rather >> neatly. But that wouldn't be sufficient to use the above equation, >> because there A and B need to be things like "column X has value x", >> and it's not going to be practical to store a complete set of MCVs for >> column X for each possible value that could appear in column Y. > > O(ncolumns^2) values? You mean collecting such stats for each possible > pair of columns? Well, I meant something different. > > The proposed solution is based on contingency tables, built for selected > groups of columns (not for each possible group). And the contingency > table gives you the ability to estimate the probabilities needed to > compute the selectivity. Or am I missing something? Well, I'm not real familiar with contingency tables, but it seems like you could end up needing to store a huge amount of data to get any benefit out of it, in some cases. For example, in the United States, there are over 40,000 postal codes, and some even larger number of city names, and doesn't the number of entries go as O(m*n)? Now maybe this is useful enough anyway that we should Just Do It, but it'd be a lot cooler if we could find a way to give the planner a meaningful clue out of some more compact representation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>> The proposed solution is based on contingency tables, built for selected
>> groups of columns (not for each possible group). And the contingency
>> table gives you the ability to estimate the probabilities needed to
>> compute the selectivity. Or am I missing something?
>
> Well, I'm not real familiar with contingency tables, but it seems like
> you could end up needing to store a huge amount of data to get any
> benefit out of it, in some cases. For example, in the United States,
> there are over 40,000 postal codes, and some even larger number of
> city names, and doesn't the number of entries go as O(m*n)?

Not to mention that the model parameters will be impossible to estimate well. (I've only scanned the thread, so sorry if I'm repeating something that's already been said.)

My intuition is that storing the covariance structure will be unworkable both technically and statistically for all but the simplest of cases. That being said, I don't think the problem is a lack of ways to parameterize the covariance structure (there are several papers on multi-dim histogram estimators, at least one of which talks about databases explicitly, not to mention approaches like CART [1]), but a complete lack of infrastructure to do anything with the estimates. So keep working on it ;-) (but try to make the framework general enough to allow better estimators).

I wonder if a good first step might be to focus on the AND and OR operators since they seem like the most common cases, and union and intersection commute (although it's important to remember that the estimate variances do NOT). That is, what about estimating the selectivity of the condition WHERE X=A AND Y=B by

   f(A,B) = x_1*(selectivity(X = A) + selectivity(Y = B))
          + x_2*selectivity(X = A)*selectivity(Y = B)
          + x_3

where x_{1,2,3} are parameters to be estimated from the data.

Another quick note: I think that storing the full contingency table is wasteful since the marginals are already stored in the single-column statistics. Look at copulas [2] (FWIW I think that Josh Tolley was looking at this a couple years back).

Best,
Nathan

[1] http://en.wikipedia.org/wiki/Classification_and_regression_tree
[2] http://en.wikipedia.org/wiki/Copula_%28statistics%29
> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> Dne 13.12.2010 03:00, Robert Haas napsal(a): >>> Well, the question is what data you are actually storing. It's >>> appealing to store a measure of the extent to which a constraint on >>> column X constrains column Y, because you'd only need to store >>> O(ncolumns^2) values, which would be reasonably compact and would >>> potentially handle the zip code problem - a classic "hard case" rather >>> neatly. But that wouldn't be sufficient to use the above equation, >>> because there A and B need to be things like "column X has value x", >>> and it's not going to be practical to store a complete set of MCVs for >>> column X for each possible value that could appear in column Y. >> >> O(ncolumns^2) values? You mean collecting such stats for each possible >> pair of columns? Well, I meant something different. >> >> The proposed solution is based on contingency tables, built for selected >> groups of columns (not for each possible group). And the contingency >> table gives you the ability to estimate the probabilities needed to >> compute the selectivity. Or am I missing something? > > Well, I'm not real familiar with contingency tables, but it seems like > you could end up needing to store a huge amount of data to get any > benefit out of it, in some cases. For example, in the United States, > there are over 40,000 postal codes, and some even larger number of > city names, and doesn't the number of entries go as O(m*n)? Now maybe > this is useful enough anyway that we should Just Do It, but it'd be a > lot cooler if we could find a way to give the planner a meaningful > clue out of some more compact representation. Yes, storing a complete contingency table is not feasible in such cases. My original proposal actually did not address this particular issue (cities and ZIP codes) as it was based on simplified contingency tables (with bins corresponding to current histograms, not to individual values). So the number of values to store would grow much slower. On the other hand, this generalization really makes it unusable in some cases, and the issue we're discussing here (cities and ZIP codes) is one of them. I think in such cases we could build a contingency table for MCV and then use it to estimate those conditional probabilities we need, but I expect it to be very tricky. Thanks for the comments. Tomas
On 2010-12-13 03:28, Robert Haas wrote:
> Well, I'm not real familiar with contingency tables, but it seems like
> you could end up needing to store a huge amount of data to get any
> benefit out of it, in some cases. For example, in the United States,
> there are over 40,000 postal codes, and some even larger number of
> city names, and doesn't the number of entries go as O(m*n)? Now maybe
> this is useful enough anyway that we should Just Do It, but it'd be a
> lot cooler if we could find a way to give the planner a meaningful
> clue out of some more compact representation.

A sparse matrix that holds only 'implicative' (P(A|B) <> P(A*B)?) combinations? Also, some information might be deduced from others. For Heikki's city/region example, for each city it would be known that it is 100% in one region. In that case it suffices to store only that information, since 0% in all other regions can be deduced. I wouldn't be surprised if storing implications like this would reduce the size to O(n).

regards,
Yeb Havinga
> On 2010-12-13 03:28, Robert Haas wrote:
>> Well, I'm not real familiar with contingency tables, but it seems like
>> you could end up needing to store a huge amount of data to get any
>> benefit out of it, in some cases. For example, in the United States,
>> there are over 40,000 postal codes, and some even larger number of
>> city names, and doesn't the number of entries go as O(m*n)? Now maybe
>> this is useful enough anyway that we should Just Do It, but it'd be a
>> lot cooler if we could find a way to give the planner a meaningful
>> clue out of some more compact representation.
> A sparse matrix that holds only 'implicative' (P(A|B) <> P(A*B)?)
> combinations? Also, some information might be deduced from others. For
> Heikki's city/region example, for each city it would be known that it is
> 100% in one region. In that case it suffices to store only that
> information, since 0% in all other regions can be deduced. I wouldn't be
> surprised if storing implications like this would reduce the size to O(n).

OK, but I'll leave this for the future. My plan is to build a small PoC, just to see whether the contingency-table + probability-estimates approach works for the failure case mentioned by Heikki. I'd like to do this by the end of this week, if possible. I'll read the articles mentioned by Nathan Boley (thanks for those links, if you have more of them just let me know).

Once we have a solution that solves (or significantly improves) these failure cases, we can make further plans (how to actually do that in the code etc.).

BTW thanks for all the comments!

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes: > Well, until this point we've discussed failure cases involving 'AND' > conditions. What about 'OR' conditions? I think the current optimizer > computes the selectivity as 's1+s2 - s1*s2' (at least that's what I > found in backend/optimizer/path/clausesel.c:630). If you can solve the AND case, the OR case falls out of that. Just replace s1*s2 with a more accurate AND calculation. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> The proposed solution is based on contingency tables, built for selected >> groups of columns (not for each possible group). And the contingency >> table gives you the ability to estimate the probabilities needed to >> compute the selectivity. Or am I missing something? > Well, I'm not real familiar with contingency tables, but it seems like > you could end up needing to store a huge amount of data to get any > benefit out of it, in some cases. The reason that this wasn't done years ago is precisely that nobody's figured out how to do it with a tolerable amount of stats data and a tolerable amount of processing time (both at ANALYZE time and during query planning). It's not hard to see what we'd ideally like to do; it's getting from there to something useful in production that's hard. regards, tom lane
On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
> Another quick note: I think that storing the full contingency table is
> wasteful since the marginals are already stored in the single-column
> statistics. Look at copulas [2] (FWIW I think that Josh Tolley was
> looking at this a couple years back).

Josh Tolley still looks at it occasionally, though time hasn't permitted any sort of significant work for quite some time. The multicolstat branch on my git.postgresql.org repository will create an empirical copula for each multi-column index, and stick it in pg_statistic. It doesn't yet do anything useful with that information, nor am I convinced it's remotely bug-free. In a brief PGCon discussion with Tom a while back, it was suggested a good place for the planner to use these stats would be clausesel.c, which is responsible for handling code such as "...WHERE foo > 4 AND foo > 5".

-- 
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com
On 13.12.2010 16:34, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Well, until this point we've discussed failure cases involving 'AND'
>> conditions. What about 'OR' conditions? I think the current optimizer
>> computes the selectivity as 's1+s2 - s1*s2' (at least that's what I
>> found in backend/optimizer/path/clausesel.c:630).
>
> If you can solve the AND case, the OR case falls out of that. Just
> replace s1*s2 with a more accurate AND calculation.

Oh yeah, now I see - it's just the usual equation

   P(A or B) = P(A) + P(B) - P(A and B)

and we're estimating "P(A and B)" as P(A)*P(B).

regards
Tomas
On 13.12.2010 16:38, Tom Lane wrote:
> The reason that this wasn't done years ago is precisely that nobody's
> figured out how to do it with a tolerable amount of stats data and a
> tolerable amount of processing time (both at ANALYZE time and during
> query planning). It's not hard to see what we'd ideally like to do;
> it's getting from there to something useful in production that's hard.

OK, I fully realize that. My plan is simply to

(a) find out what statistics we need to collect and how to use them
(b) implement a really stupid, inefficient solution
(c) optimize in iterations, i.e. making it faster, consuming less space etc.

It will take time, it won't be perfect for a long time, but every journey starts somewhere.

regards
Tomas
On 13.12.2010 18:59, Joshua Tolley wrote:
> On Sun, Dec 12, 2010 at 07:10:44PM -0800, Nathan Boley wrote:
>> Another quick note: I think that storing the full contingency table is
>> wasteful since the marginals are already stored in the single-column
>> statistics. Look at copulas [2] (FWIW I think that Josh Tolley was
>> looking at this a couple years back).
>
> Josh Tolley still looks at it occasionally, though time hasn't permitted any
> sort of significant work for quite some time. The multicolstat branch on my
> git.postgresql.org repository will create an empirical copula for each
> multi-column index, and stick it in pg_statistic. It doesn't yet do anything
> useful with that information, nor am I convinced it's remotely bug-free. In a
> brief PGCon discussion with Tom a while back, it was suggested a good place
> for the planner to use these stats would be clausesel.c, which is responsible
> for handling code such as "...WHERE foo > 4 AND foo > 5".

Well, that's good news ;-)

I've done a bit of research today, and I've found some quite interesting papers on this topic (well, probably - I did not have time to read them; in most cases I've read just the title and abstract).

[1] Selectivity estimators for multidimensional range queries over real attributes
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.914

    This seems like a very good starting point. AFAIK it precisely describes what data need to be collected, how to do the math etc.

[2] Selectivity Estimation Without the Attribute Value Independence Assumption
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.8126

    This obviously deals with the independence problem. Haven't investigated it further, but it seems worth reading.

[3] On Analyzing the Cost of Queries with Multi-Attribute Restrictions and Sort Operations (A Cost Function for Uniformly Partitioned UB-Trees)
    http://mistral.in.tum.de/results/publications/MB00_ideas.pdf

    Describes something called a UB-Tree, and shows how it may be used to do estimates. Might be interesting as an alternative to the traditional histograms. There are more details about UB-Trees at http://mistral.in.tum.de/results/publications/

[4] http://www.dbnet.ece.ntua.gr/~nikos/edith/qopt_bibl/

    A rather nice collection of papers related to estimation (including some of the papers listed above).

Hm, I planned to finally read "Understanding MySQL Internals" over Christmas ... that obviously won't happen.

regards
Tomas
Tomas, > (a) find out what statistics do we need to collect and how to use it > (b) implement a really stupid inefficient solution > (c) optimize in iterations, i.e. making it faster, consuming less > space etc. I'll suggest again how to decide *which* columns to cross: whichever columns are combined in composite indexes. In version 2, allow the DBA to specify combinations. In the unlikely event that correlation could be reduced to a single float number, it would be conceivable for each column to have an array of correlation stats for every other column where correlation was non-random; on most tables (i.e. ones with less than 100 columns) we're not talking about that much storage space. The main cost would be the time spent collecting that info ... -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On 13.12.2010 22:50, Josh Berkus wrote:
> Tomas,
>
>> (a) find out what statistics we need to collect and how to use them
>> (b) implement a really stupid, inefficient solution
>> (c) optimize in iterations, i.e. making it faster, consuming less
>>     space etc.
>
> I'll suggest again how to decide *which* columns to cross: whichever
> columns are combined in composite indexes. In version 2, allow the DBA
> to specify combinations.
>
> In the unlikely event that correlation could be reduced to a single
> float number, it would be conceivable for each column to have an array
> of correlation stats for every other column where correlation was
> non-random; on most tables (i.e. ones with less than 100 columns) we're
> not talking about that much storage space.
>
> The main cost would be the time spent collecting that info ...

I think it is a bit early to discuss this, given the fact that we don't have a working solution yet. But OK, let's discuss these options anyway.

1) collecting the stats automatically for composite indexes

   I don't think this is a wise idea. The first versions definitely won't be very efficient, and collecting the data for each composite index means everyone will be hit by this inefficiency, even if he actually does not need that (e.g. the columns are independent so the current estimates are quite accurate, or he's not using those columns very often in the same WHERE clause).

   Another reason against this is that many DBAs don't actually use composite indexes - they simply create indexes on each column and let the bitmap index scan work it out. And this would not work for that case.

   And actually it's not very complicated to allow the DBA to do this; it can be a quite simple PL/pgSQL procedure.

2) collecting correlation for each pair of columns

   Again, you're effectively forcing everyone to pay the price even though he may not need the feature. Maybe we'll get there one day, but it's not a good idea to do that from the beginning.

   And the correlation itself has very limited use in real life, as it's not possible to compute it for character columns and it is not very useful in case of some numerical columns (e.g. ZIP codes).

regards
Tomas
2010/12/13 Josh Berkus <josh@agliodbs.com>:
> Tomas,
>
>> (a) find out what statistics we need to collect and how to use them
>> (b) implement a really stupid, inefficient solution
>> (c) optimize in iterations, i.e. making it faster, consuming less
>>     space etc.
>
> I'll suggest again how to decide *which* columns to cross: whichever
> columns are combined in composite indexes. In version 2, allow the DBA
> to specify combinations.

Is that really a good idea? A composite index can be created when the single columns are not unique enough - (name, surname). A DBA specification can be cheaper. Could we set options on a relation? Then that could be used.

Pavel

>
> In the unlikely event that correlation could be reduced to a single
> float number, it would be conceivable for each column to have an array
> of correlation stats for every other column where correlation was
> non-random; on most tables (i.e. ones with less than 100 columns) we're
> not talking about that much storage space.
>
> The main cost would be the time spent collecting that info ...
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
On Mon, Dec 13, 2010 at 5:45 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >> I'll suggest again how to decide *which* columns to cross: whichever >> columns are combined in composite indexes. In version 2, allow the DBA >> to specify combinations. > I think this is a bit early to discuss this, given the fact that we > don't have a working solution yet. Agreed. Not to mention that adding syntax is pretty easy. The hard part is figuring out what data you want to collect and what you want to do with it. If we can figure that out, the syntax will be simple by comparison. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12.12.2010 15:43, Heikki Linnakangas wrote:
> On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
>> Very cool that you're working on this.
>
> +1
>
>>> Let's talk about one special case - I'll explain how the proposed
>>> solution works, and then I'll explain how to make it more general, what
>>> improvements are possible, what issues are there. Anyway this is by no
>>> means a perfect or complete solution - it's just a starting point.
>>
>> It looks like you handled most of the issues. Just a few points:
>>
>> - This is obviously applicable to more than just integers, probably
>>   anything with a b-tree operator class. What you've coded seems to rely
>>   on calculations on the values. Have you thought about how it could
>>   work for, for example, strings?
>>
>>   The classic failure case has always been: postcodes and city names.
>>   Strongly correlated, but in a way that the computer can't easily see.
>
> Yeah, and that's actually analogous to the example I used in my
> presentation.
>
> The way I think of that problem is that once you know the postcode,
> knowing the city name doesn't add any information. The postcode implies
> the city name. So the selectivity for "postcode = ? AND city = ?" should
> be the selectivity of "postcode = ?" alone. The measurement we need is
> "implicativeness": How strongly does column A imply a certain value for
> column B. Perhaps that could be measured by counting the number of
> distinct values of column B for each value of column A, or something
> like that. I don't know what the statisticians call that property, or if
> there's some existing theory on how to measure that from a sample.
>
> That's assuming the combination has any matches. It's possible that the
> user chooses a postcode and city combination that doesn't exist, but
> that's no different from a user doing "city = 'fsdfsdfsd'" on a single
> column, returning no matches. We should assume that the combination
> makes sense.

Well, I've read several papers this week, in an attempt to find out what possibilities there are when implementing cross-column statistics. One of the papers seems like a reasonable solution to this particular problem - discrete data + equality queries.

The article is "A Bayesian Approach to Estimating The Selectivity of Conjunctive Predicates" (written by Heimel, Markl and Murthy). It's freely available as a PDF

   http://subs.emis.de/LNI/Proceedings/Proceedings144/52.pdf

I do have a PDF with my own notes in it (as annotations), let me know if you're interested ...

The basic principle is that instead of 'attribute value independence' (which is the problem with correlated columns) they assume 'uniform correlation', which is a much weaker assumption.

In the end, all they need to compute an estimate is the number of distinct values for each of the columns (we already have that in pg_stats) and the number of distinct values for the group of columns in a query. They really don't need any multidimensional histogram or anything like that.
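To show what actually has to be collected, here is the collection step written as plain SQL (some_table, col_a and col_b are placeholders; ANALYZE would estimate these counts from a sample rather than scan the whole table):

   SELECT count(DISTINCT col_a)          AS dist_a,
          count(DISTINCT col_b)          AS dist_b,
          count(DISTINCT (col_a, col_b)) AS dist_ab,   -- distinct combinations
          count(*)                       AS total_rows
     FROM some_table;

   -- If we additionally assume that every combination that actually occurs is roughly
   -- equally frequent, an equality pair "col_a = x AND col_b = y" would be estimated
   -- as total_rows / dist_ab rows. The paper combines dist_a, dist_b and dist_ab in a
   -- more careful (Bayesian) way; this crude rule is only meant to show what the
   -- distinct counts buy us.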
The funny thing is I'd implemented most of the PoC before reading the article - the only difference was the way the estimates are combined.

------------------------ pros and cons --------------------------

So let's see the pros:

* it's reasonably simple, the overhead should be minimal I guess (we're already estimating distinct values for the columns, so I guess we can do that for multiple columns with reasonable impact)

* there are no 'advanced data structures' such as multi-dimensional histograms that are expensive to build and maintain

* it consistently produces better estimates than the current estimator based on the attribute value independence assumption, and it's reasonably accurate (I've been playing with it for some time, but we'll need more tests of course)

* it's easily extensible to more columns

* I guess we could improve the estimates with our own heuristics to catch the special case when one column is perfectly implied by another one (e.g. when the state is implied by the ZIP code) - we can catch that by 'distinct for the combination = distinct for the column' and use just the estimate for one of the columns

but there are some cons too:

* this really is not designed to work with real-valued data, it's a very nice solution for discrete data (namely 'labels' such as city/state names, ZIP codes etc.)

* you need to know the list of columns in advance (there's nothing like 'let's build the stats for columns (a,b,c) and then query just (a,b)') as you need to know the count of distinct values for the queried columns (well, maybe it's possible - I guess we could 'integrate' over all possible values of the omitted column)

* it's built for 'equality' queries, although maybe we could modify it for inequalities too (but it won't be as elegant); I guess equality queries are much more common in case of discrete data (OK, you can run something like

     WHERE (zip_code BETWEEN '49389' AND '54980') AND state = '48'

  but I guess that's not very frequent, and I think we could still use the solution described in the paper)

------------------------ proof of concept --------------------------

I've prepared another 'proof of concept' example - see the attachment. It works with data from

   http://www.census.gov/geo/www/tiger/zip1999.zip

which is a list of US ZIP codes from 1999, along with info on GPS, state, county etc. Basically a perfect example of the failure case described above. It's in DBF format, so use http://pgdbf.sf.net (or something like that) to extract the data.

Then use the script to create a table zip_codes_usa (with appropriate data types), a table 'cross_stats' to store the stats (number of distinct values), and the PL/pgSQL functions collect_stats, get_estimate and get_frequency.

Run collect_stats with two columns as parameters, i.e.

   SELECT collect_stats('zip_code', 'state');

and then you can run get_estimate with values for the columns, i.e. something like this

   SELECT get_estimate('07034', '34');

which prints some debug notices and returns three integers - the estimates for the columns separately, and then the combined result. So what really matters is the third number ...
Some basic tests, compared with the current estimator:

a) queries involving 'zip_code' - I've put in the heuristics described above, so it actually returns the best estimate (unless the combination with state does not make sense, of course)

     ZIP   | state | actual | current estimate | get_estimate
   ==========================================================
    07034  |  34   |     32 |                1 |           32
    07034  |  32   |      0 |                1 |           32

There are 32 rows with this combination as I've copied the data multiple times to make the table bigger. The 'better performance' of the current estimator in the second case is actually just a lucky coincidence, a side-effect of the independence assumption. And as Heikki already pointed out, we should assume the combination makes sense, so this actually is not an advantage of the current estimator.

b) queries involving 'post_name' and 'state' - this can't use the heuristics as in (a), as there are 18,940 different post names, 59 states and 30,113 combinations. On the other hand, post name implies state.

    post name | state | actual | current estimate | get_estimate
   ==============================================================
    ESSEX     |  09   |     32 |                1 |           75
    ELMSFORD  |  36   |     32 |                4 |          188

Hm, in this case the estimate is not as good as in the first case, but I guess it's still a bit better than the current estimate. In the first case it's about 2.5x overestimated (compared to the 32x underestimate of the current estimate), and about 5x overestimated in the second case (compared to the 8x underestimate of the current one). Again, with an invalid combination (e.g. ESSEX/36) the current estimator would perform better.

regards
Tomas
Attachment
On Fri, Dec 17, 2010 at 12:58 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > In the end, all they need to compute an estimate is number of distinct > values for each of the columns (we already have that in pg_stats) and a > number of distinct values for the group of columns in a query. They > really don't need any multidimensional histogram or something like that. I haven't read the paper yet (sorry) but just off the top of my head, one possible problem here is that our n_distinct estimates aren't always very accurate, especially for large tables. As we've discussed before, making them accurate requires sampling a significant percentage of the table, whereas all of our other statistics can be computed reasonably accurately by sampling a fixed amount of an arbitrarily large table. So it's possible that relying more heavily on n_distinct could turn out worse overall even if the algorithm is better. Not sure if that's an issue here, just throwing it out there... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, I've read about 10 papers about estimation this week, and I have gained some insight. I'm not going to describe each of the papers here, I plan to set up a wiki page about cross column statistics

   http://wiki.postgresql.org/wiki/Cross_Columns_Stats

and I'll put the list of papers and some basic info there. There was a lot of discussion in this thread, and it's difficult to follow it, so I'll put there some details about the proposal etc.

So I'll just briefly describe the interesting things I've learned from those papers.

---------------------- instances of the problem ----------------------

Generally, there are four quite different cases of the problem.

First, the columns may be either real-valued or discrete. And by 'discrete' I mean the value is rather a label than a value. It does not make sense to add or subtract them, etc. So for example city names or zip codes are discrete values (although zip codes are numbers etc).

Second, the queries consist of equality or inequality (range) conditions.

So actually there are four instances of the problem:

                   | equality conditions | inequality conditions |
   ================================================================
   discrete values |          A          |           D           |
   numeric values  |          C          |           B           |
   ----------------------------------------------------------------

I think those four problems should be treated separately, although some of the techniques may be common.

A) discrete values and equality conditions

One of the papers (A Bayesian Approach to Estimating The Selectivity of Conjunctive Predicates) describes a quite interesting solution to this problem - I've already posted a description on how to apply it to the ZIP code / city name problem - see this

   http://archives.postgresql.org/pgsql-hackers/2010-12/msg01576.php

B) numeric values and inequality conditions

Most of the papers dealing with this problem are based on discretization and multi-dimensional histograms to approximate the joint distribution. So I believe my initial proposal was not complete nonsense.

Once we have a working histogram-based solution, we can work on precision and efficiency (how much space is needed to store it, how long does it take to compute an estimate, etc.). There are two ways to do that.

First, most of the papers offer an enhanced type of histogram (although the histogram proposed in a paper is always evaluated as the most efficient one, which is a bit suspicious). Anyway there are advanced, quite promising ways to build multi-dimensional histograms.

Second, the paper "Selectivity estimation using probabilistic models" describes a solution based on bayesian networks. That seems really promising (actually it addresses join selectivity estimation too).

So yes, I'm quite confident we can solve this issue and improve the efficiency - either by an advanced histogram or using bayesian networks.

C) numeric values and equality conditions

OK, I'm not sure how to solve this case. But queries against numeric columns are range queries in most cases I guess, so maybe that's not that big a deal.

D) discrete values and inequality conditions

Basically, this can be handled just like numeric values after discretization, i.e. using a histogram. But this is usually ...

E) a combination of discrete / numeric columns

I'm not sure how to deal with this. Obviously it's possible to build a multi-dimensional histogram and estimate as many queries as possible.

-----------------------------------------------------------------------

The papers describe some interesting alternative approaches, e.g.
based on SVD (singular value decomposition) or random sampling (or kernel estimators, which are an enhanced version of sampling). But there are various issues associated with those solutions: SVD is applicable to 2D only, so we'd be stuck with 2 columns, and random sampling sounds a bit suspicious to me.

regards
Tomas
Dne 17.12.2010 19:58, Robert Haas napsal(a): > I haven't read the paper yet (sorry) but just off the top of my head, > one possible problem here is that our n_distinct estimates aren't > always very accurate, especially for large tables. As we've discussed > before, making them accurate requires sampling a significant > percentage of the table, whereas all of our other statistics can be > computed reasonably accurately by sampling a fixed amount of an > arbitrarily large table. So it's possible that relying more heavily > on n_distinct could turn out worse overall even if the algorithm is > better. Not sure if that's an issue here, just throwing it out > there...

Yes, you're right - the paper really is based on (estimates of) the number of distinct values for each of the columns as well as for the group of columns.

AFAIK it will work with reasonably precise estimates, but the point is you need an estimate of the distinct values of the whole group of columns. So when you want to get an estimate for queries on columns (a,b), you need the number of distinct value combinations of these two columns.

And I think we're not collecting this right now, so this solution requires scanning the table (or some part of it).

I know this is a weak point of the whole solution, but the truth is every cross-column stats solution will have to do something like this. I don't think we'll find a solution with 0 performance impact, without the need to scan a sufficient part of the table. That's why I want to make this optional so that the users will use it only when really needed.

Anyway one possible solution might be to allow the user to set these values manually (as in the case when the ndistinct estimates are not precise).

regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes: > AFAIK it will work with reasonably precise estimates, but the point is > you need an estimate of distinct values of the whole group of columns. > So when you want to get an estimate for queries on columns (a,b), you > need the number of distinct value combinations of these two columns. That seems likely to be even more unreliable than our existing single-column estimates :-( regards, tom lane
On 17.12.2010 23:13, Tomas Vondra wrote: > Dne 17.12.2010 19:58, Robert Haas napsal(a): >> I haven't read the paper yet (sorry) but just off the top of my head, >> one possible problem here is that our n_distinct estimates aren't >> always very accurate, especially for large tables. As we've discussed >> before, making them accurate requires sampling a significant >> percentage of the table, whereas all of our other statistics can be >> computed reasonably accurately by sampling a fixed amount of an >> arbitrarily large table. So it's possible that relying more heavily >> on n_distinct could turn out worse overall even if the algorithm is >> better. Not sure if that's an issue here, just throwing it out >> there... > > Yes, you're right - the paper really is based on (estimates of) number > of distinct values for each of the columns as well as for the group of > columns. > > AFAIK it will work with reasonably precise estimates, but the point is > you need an estimate of distinct values of the whole group of columns. > So when you want to get an estimate for queries on columns (a,b), you > need the number of distinct value combinations of these two columns. > > And I think we're not collecting this right now, so this solution > requires scanning the table (or some part of it). Any idea how sensitive it is to the accuracy of that estimate on distinct value combinations? If we get that off by a factor of ten or a hundred, what kind of an effect does it have on the final cost estimates? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Dne 17.12.2010 22:24, Tom Lane napsal(a): > That seems likely to be even more unreliable than our existing > single-column estimates :-( > > regards, tom lane

Well, yes :-(

I guess this is a place where we could use a multi-column index, if it contains all the interesting columns. We could scan the index, not the whole table. That could save us a tremendous amount of I/O and should be quite precise (unless it's severely bloated).

Another thing is that those 'discrete' columns are usually quite stable, i.e. there's usually a limited list of values and it does not change very often. Think about ZIP codes, states, etc. And the combinations are quite stable too (counties do not move to other states, etc.). So I think scanning a reasonable part of a table should be enough in these cases.

regards
Tomas
Dne 17.12.2010 22:41, Heikki Linnakangas napsal(a): > On 17.12.2010 23:13, Tomas Vondra wrote: >> Dne 17.12.2010 19:58, Robert Haas napsal(a): >>> I haven't read the paper yet (sorry) but just off the top of my head, >>> one possible problem here is that our n_distinct estimates aren't >>> always very accurate, especially for large tables. As we've discussed >>> before, making them accurate requires sampling a significant >>> percentage of the table, whereas all of our other statistics can be >>> computed reasonably accurately by sampling a fixed amount of an >>> arbitrarily large table. So it's possible that relying more heavily >>> on n_distinct could turn out worse overall even if the algorithm is >>> better. Not sure if that's an issue here, just throwing it out >>> there... >> >> Yes, you're right - the paper really is based on (estimates of) number >> of distinct values for each of the columns as well as for the group of >> columns. >> >> AFAIK it will work with reasonably precise estimates, but the point is >> you need an estimate of distinct values of the whole group of columns. >> So when you want to get an estimate for queries on columns (a,b), you >> need the number of distinct value combinations of these two columns. >> >> And I think we're not collecting this right now, so this solution >> requires scanning the table (or some part of it). > > Any idea how sensitive it is to the accuracy of that estimate on > distinct value combinations? If we get that off by a factor of ten or a > hundred, what kind of an effect does it have on the final cost estimates?

Well, not really - I haven't done any experiments with it. For two columns the selectivity equation is

   (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

where A and B are columns, dist(X) is the number of distinct values in column X and sel(X) is the selectivity of column X.

dependency on dist(A), dist(B) and dist(A,B)
--------------------------------------------

So it seems to be proportional to dist(A) and dist(B), and inversely proportional to dist(A,B). So when increasing dist(A) and dist(B) 10x, you'll get 10x the original estimate. Similarly, by increasing dist(A,B) 10x, you'll get a 10x lower estimate.

upper bound
-----------

Unless dist(A) or dist(B) is greater than dist(A,B), the estimated selectivity is bounded by

   (sel(A) + sel(B)) / 2

I guess we can say that (dist(A) > dist(A,B)) is generally the same as dist(A) = dist(A,B), i.e. we can use the heuristic I've already offered in the PoC.

lower bound
-----------

Not sure what the lower bound is, but it might be 0 if dist(A) and dist(B) are very small and/or dist(A,B) is huge.

regards
Tomas
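PS: to see how the formula behaves on real data, something like the following might help (a rough, untested sketch - it assumes a test table t(a,b) and the equality condition a = 1 AND b = 2, and computes the new estimate next to the current independence-based estimate and the actual selectivity):

   WITH d AS (
       SELECT count(DISTINCT a)::numeric AS dist_a,
              count(DISTINCT b)::numeric AS dist_b,
              (SELECT count(*) FROM (SELECT DISTINCT a, b FROM t) x)::numeric AS dist_ab,
              count(*)::numeric AS n,
              sum(CASE WHEN a = 1 THEN 1 ELSE 0 END)::numeric           AS rows_a,
              sum(CASE WHEN b = 2 THEN 1 ELSE 0 END)::numeric           AS rows_b,
              sum(CASE WHEN a = 1 AND b = 2 THEN 1 ELSE 0 END)::numeric AS rows_ab
         FROM t
   )
   SELECT (dist_a * (rows_a/n) + dist_b * (rows_b/n)) / (2 * dist_ab) AS new_estimate,
          (rows_a/n) * (rows_b/n)                                     AS independence_estimate,
          rows_ab / n                                                 AS actual_selectivity
     FROM d;

(in a real implementation the dist_* values and selectivities would of course come from the collected stats, not from scanning the table)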
On Dec17, 2010, at 23:12 , Tomas Vondra wrote: > Well, not really - I haven't done any experiments with it. For two > columns selectivity equation is > > (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) > > where A and B are columns, dist(X) is number of distinct values in > column X and sel(X) is selectivity of column X. Huh? This is the selectivity estimate for "A = x AND B = y"? Surely, if A and B are independent, the formula must reduce to sel(A) * sel(B), and I cannot see how that'd work with the formula above. best regards, Florian Pflug
> On Dec17, 2010, at 23:12 , Tomas Vondra wrote: >> Well, not really - I haven't done any experiments with it. For two >> columns selectivity equation is >> >> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) >> >> where A and B are columns, dist(X) is number of distinct values in >> column X and sel(X) is selectivity of column X. > > Huh? This is the selectivity estimate for "A = x AND B = y"? Surely, > if A and B are independent, the formula must reduce to sel(A) * sel(B), > and I cannot see how that'd work with the formula above.

Yes, it's a selectivity estimate for P(A=a and B=b). It's based on conditional probability, as

   P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)

and the "uniform correlation" assumption, so that it's possible to replace the conditional probabilities with constants. And those constants are then estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).

So it does not reduce to sel(A)*sel(B) exactly, as dist(A)/dist(A,B) is just an estimate of P(B|A). The paper states that this works best for highly correlated data, while for weakly correlated data it (at least) matches the usual estimates.

I don't say it's perfect, but it seems to produce reasonable estimates.

Tomas
On Dec18, 2010, at 08:10 , tv@fuzzy.cz wrote: >> On Dec17, 2010, at 23:12 , Tomas Vondra wrote: >>> Well, not really - I haven't done any experiments with it. For two >>> columns selectivity equation is >>> >>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) >>> >>> where A and B are columns, dist(X) is number of distinct values in >>> column X and sel(X) is selectivity of column X. >> >> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely, >> if A and B are independent, the formula must reduce to sel(A) * sel(B), >> and I cannot see how that'd work with the formula above. > > Yes, it's a selectivity estimate for P(A=a and B=b). It's based on > conditional probability, as > > P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a) > > and "uniform correlation" assumption so that it's possible to replace the > conditional probabilities with constants. And those constants are then > estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).

Ok, I think I understand this now. The selectivity equation actually *does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate for sel(A).

Take the clause "A = x AND B = y" for example. Without knowing anything about x and y, reasonable guesses for sel(A=x) and sel(B=y) are

   sel(A=x) = 1 / dist(A)
   sel(B=y) = 1 / dist(B).

This is also what we do currently, according to var_eq_non_const() in src/backend/utils/adt/selfuncs.c, if we don't have any additional knowledge about x (Actually, we also factor the probability of A being NULL into this).

With these estimates, your formula becomes

   sel(A=x,B=y) = 1 / dist(A,B).

and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus

   sel(A=x,B=y) = sel(A=x) * sel(B=y).

If, however, y is a constant, then we use the MCVs to estimate sel(B=y) (var_eq_const() in src/backend/utils/adt/selfuncs.c). If

   sel(B=y) ~= 0,

we'd currently also conclude that

   sel(A=x,B=y) ~= 0.

With the "uniform correlation" approach, we'd instead estimate

   sel(A=x,B=y) ~= sel(A=x) / dist(B)

assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated. If dist(B) is small, this estimate is much worse than what we'd currently get, since we've effectively ignored the information that the restriction B=y alone guarantees that only very few rows will match.

best regards,
Florian Pflug
Dne 18.12.2010 16:59, Florian Pflug napsal(a): > On Dec18, 2010, at 08:10 , tv@fuzzy.cz wrote: > >>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote: >>>> Well, not really - I haven't done any experiments with it. For two >>>> columns selectivity equation is >>>> >>>> (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B)) >>>> >>>> where A and B are columns, dist(X) is number of distinct values in >>>> column X and sel(X) is selectivity of column X. >>> >>> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely, >>> if A and B are independent, the formula must reduce to sel(A) * sel(B), >>> and I cannot see how that'd work with the formula above. >> >> Yes, it's a selectivity estimate for P(A=a and B=b). It's based on >> conditional probability, as >> >> P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a) >> >> and "uniform correlation" assumption so that it's possible to replace the >> conditional probabilities with constants. And those constants are then >> estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B). > > Ok, I think I understand this now. The selectivity equation actually > *does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate > for sel(A). > > Take the clause "A = x AND B = y" for example. Without knowing anything > about x and y, reasonable guesses for sel(A=x) and sel(B=y) are > > sel(A=x) = 1 / dist(A) > sel(B=y) = 1 / dist(B). > > This is also what we do currently, according to var_eq_non_const() in > src/backend/utils/adt/selfuncs.c, if we don't have any additional > knowledge about x (Actually, we also factor the probability of A being > NULL into this). > > With these estimates, your formula becomes > > sel(A=x,B=y) = 1 / dist(A,B). > > and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus > > sel(A=x,B=y) = sel(A=x) * sel(B=y). > > If, however, y is a constant, then we use the MCVs to estimate sel(B=y) > (var_eq_const() in src/backend/utils/adt/selfuncs.c). If > > sel(B=y) ~= 0, > > we'd currently also conclude that > > sel(A=x,B=y) ~= 0. > > With the "uniform correlation" approach, we'd instead estimate > > sel(A=x,B=y) ~= sel(A=x) / dist(B) > > assuming that dist(A,B) ~= dist(A)*dist(B), meaning A,B are uncorrelated. > If dist(B) is small, this estimate is much worse than what we'd currently > get, since we've effectively ignored the information that the restriction > B=y alone guarantees that only very few rows will match.

Well, I guess you're right. Let's see two examples - uncorrelated and highly correlated data (and then a few comments on what I think you're missing).

uncorrelated
------------

Let there be 10 distinct values in A, 100 distinct values in B, and there are all possible combinations (10x100 = 1000). All the values (and combinations) are uniformly distributed.

The current estimate is correct, i.e.

   sel(A=a) = 1/10  = 1/dist(A)
   sel(B=b) = 1/100 = 1/dist(B)

   sel(A=a, B=b) = sel(A) * sel(B) = 1/1000   /* due to AVI */

the proposed estimate is

   sel(A=a, B=b) = (dist(A)*sel(A=a) + dist(B)*sel(B=b)) / (2*dist(A,B))
                 = (10 * 1/10 + 100 * 1/100) / 2000 = 1/1000

so actually it produces the same estimate, but that's due to the uniform distribution assumption.

Let's say the selectivities for A and B are very different - A is 10x less common, B is 10x more common (let's say we know this). Then the current estimate is still correct, i.e.
it gives

   sel(A=a) = 1/100
   sel(B=b) = 1/10    =>    sel(A=a,B=b) = 1/1000

but the new estimate is

   (10 * 1/100 + 100 * 1/10) / 2000 = (101/10) / 2000 ~ 1/200

So yes, in case of uncorrelated data this overestimates the result.

highly correlated
-----------------

Again, let dist(A)=10, dist(B)=100. But in this case let B => A, i.e. for each value in B there is a unique value in A - say A = mod(B,10) or something like that. So now there is dist(A,B)=100.

Assuming uniform distribution, the correct estimate is

   sel(A=a,B=b) = sel(B=b) = 1/100

and the current estimate is

   sel(A=a,B=b) = sel(A=a) * sel(B=b) = 1/10 * 1/100 = 1/1000

The proposed estimate is

   (10 * 1/10 + 100 * 1/100) / 200 = 2/200 = 1/100

which is correct.

Without the uniform distribution (again, sel(A)=1/100, sel(B)=1/10), we currently get (just as before)

   sel(A=a,B=b) = 1/10 * 1/100 = 1/1000

which is 100x less than the correct value (sel(B)=1/10). The new estimate yields

   (10*1/100 + 100*1/10) / 200 = (1010/100) / 200 ~ 1/20

which is not as accurate as with uniform distribution, but it's considerably more accurate than the current estimate (1/1000).

comments
--------

I really don't think this is a perfect solution giving 100% accurate results in all cases. But the examples above illustrate that it gives much better estimates in case of correlated data.

It seems to me you're missing one very important thing - this was not meant as a new default way to do estimates. It was meant as an option when the user (DBA, developer, ...) realizes the current solution gives really bad estimates (due to correlation). In that case he could create 'cross-column' statistics on those columns, and the optimizer would use that info to do the estimates.

This kind of eliminates the issue with uncorrelated data, as it would not actually happen. OK, the user might create cross-column stats on uncorrelated columns and he'd get incorrect estimates in that case, but I really don't think we can solve this kind of problem.

regards
Tomas
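PS: the 'highly correlated' case is easy to reproduce - a rough sketch using a made-up table corr_test with 100k rows, dist(A)=10, dist(B)=100 and A fully determined by B:

   CREATE TEMP TABLE corr_test AS
       SELECT (i % 100)      AS b,
              (i % 100) % 10 AS a
         FROM generate_series(0, 99999) AS i;

   SELECT count(DISTINCT a) AS dist_a,   -- 10
          count(DISTINCT b) AS dist_b,   -- 100
          (SELECT count(*) FROM (SELECT DISTINCT a, b FROM corr_test) x) AS dist_ab   -- 100
     FROM corr_test;

For e.g. a = 5 AND b = 45 the actual selectivity is 1/100, the proposed formula gives (10*1/10 + 100*1/100)/(2*100) = 1/100, while the independence assumption gives 1/1000.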
On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote: > >> The proposed solution is based on contingency tables, built for selected > >> groups of columns (not for each possible group). And the contingency > >> table gives you the ability to estimate the probabilities needed to > >> compute the selectivity. Or am I missing something? > > > Well, I'm not real familiar with contingency tables, but it seems like > > you could end up needing to store a huge amount of data to get any > > benefit out of it, in some cases. > > The reason that this wasn't done years ago is precisely that nobody's > figured out how to do it with a tolerable amount of stats data and a > tolerable amount of processing time (both at ANALYZE time and during > query planning). It's not hard to see what we'd ideally like to do; > it's getting from there to something useful in production that's hard. I think we have to face up to the fact that attempting to derive meaningful cross-column stats will require larger sample sizes. If we collect everything we're going to have ~10^9 stats slots with default stats_target 100 and a 100 column table. We should disconnect sample size from histogram size, and we need to make the initial column pairings vastly fewer than all combinations. Manual specification seems like it will be required for the cases where we decide not to include it automatically, so it seems we'll need manual specification anyway. In that case, we should do manual specification first. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Dne 19.12.2010 21:21, Simon Riggs napsal(a): > On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote: >>>> The proposed solution is based on contingency tables, built for selected >>>> groups of columns (not for each possible group). And the contingency >>>> table gives you the ability to estimate the probabilities needed to >>>> compute the selectivity. Or am I missing something? >> >>> Well, I'm not real familiar with contingency tables, but it seems like >>> you could end up needing to store a huge amount of data to get any >>> benefit out of it, in some cases. >> >> The reason that this wasn't done years ago is precisely that nobody's >> figured out how to do it with a tolerable amount of stats data and a >> tolerable amount of processing time (both at ANALYZE time and during >> query planning). It's not hard to see what we'd ideally like to do; >> it's getting from there to something useful in production that's hard. > > I think we have to face up to the fact that attempting to derive > meaningful cross-column stats will require larger sample sizes.

Amen.

> If we collect everything we're going to have ~10^9 stats slots with > default stats_target 100 and a 100 column table. > > We should disconnect sample size from histogram size, and we need to > make the initial column pairings vastly fewer than all combinations. > Manual specification seems like it will be required for the cases where > we decide not to include it automatically, so it seems we'll need manual > specification anyway. In that case, we should do manual specification > first.

Well, not really. The more bins you have, the larger sample you need to get a representative picture of the stats. So the histogram size and the sample size are naturally connected. And there are some (mostly heuristic) rules to determine how large the sample should be. E.g. when building a contingency table for a chi-squared test, a common rule is that each bin should contain at least 5 values. So the more bins you have, the larger sample you need.

I like the way Oracle does this - you can either let it decide what the proper sample size is, or you can specify how large the sample should be (what portion of the table).

So the tricky part here is determining the number of bins in the histogram. In the one-dimensional case, stats_target=100 actually means each bin contains about 1% of the data. So I guess we should use a similar approach in the multi-dimensional case too, i.e. let the user determine a desired precision and then derive the number of bins automatically (which is a bit tricky because of the multiple dimensions).

But yeah, in general you're right - this will require larger samples, more complex stats to collect, and we'll need some space to store the collected stats ...

regards
Tomas
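PS: just to put rough numbers on the 'at least 5 rows per bin' rule in the 2D case (cells = bins per dimension squared, so the minimum sample grows quadratically with the per-dimension bin count) - a trivial back-of-the-envelope query:

   SELECT bins_per_dim,
          bins_per_dim * bins_per_dim     AS cells,
          5 * bins_per_dim * bins_per_dim AS min_sample_rows
     FROM (VALUES (10), (20), (100)) AS v(bins_per_dim);

   -- 10 bins per dimension  ->    100 cells ->    500 sampled rows at least
   -- 100 bins per dimension -> 10,000 cells -> 50,000 sampled rows at least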
On Dec18, 2010, at 17:59 , Tomas Vondra wrote: > It seems to me you're missing one very important thing - this was not > meant as a new default way to do estimates. It was meant as an option > when the user (DBA, developer, ...) realizes the current solution gives > really bad estimates (due to correlation). In that case he could create > 'cross-column' statistics on those columns, and the optimizer would use > that info to do the estimates.

I do understand that. I just have the nagging feeling that there is a way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense to apply the uniform bayesian approach or to assume the columns are unrelated.

I played with this for a bit over the weekend, but unfortunately ran out of time. So I'm writing up what I found, to prevent it from getting lost.

I tried to pick up Robert's idea of quantifying "Implicativeness" - i.e., finding a number between 0 and 1 that describes how close (A,B) is to representing a function A -> B.

Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the estimates of dist(?) are consistent. From that you easily get

   dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and
   dist(A,B)/dist(A) <= dist(B) <= dist(A,B)

If dist(A) == dist(A,B), then there is a functional dependency A -> B, and conversely if dist(B) == dist(A,B) there is a functional dependency B -> A. Note that you can have both at the same time!

On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the smallest number of distinct values possible for a given combination of dist(A,B) and dist(A). This is the anti-function case.

This motivates the definition

   F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ]

(You can probably drop the "-1", it doesn't make much of a difference for larger values of dist(B).)

F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while a value of 1 indicates that dist(A) == dist(A,B).

So F(A,B) is a suitable measure of "Implicativeness" - it's higher if the table (A,B) looks more like a function A -> B.

You might use that to decide if either A->B or B->A looks function-like enough to use the uniform bayesian approach. Or you might even go further, and decide *which* bayesian formula to use - the paper you cited always averages

   P(A=x|B=y)*P(B=y) and
   P(B=y|A=x)*P(A=x)

but they offer no convincing reason for that other than "We don't know which to pick".

I'd like to find a statistical explanation for that definition of F(A,B), but so far I couldn't come up with any. I created a Maple 14 worksheet while playing around with this - if you happen to have a copy of Maple available I'd be happy to send it to you.

This is what I got so far - I hope it may prove to be of use somehow.

best regards,
Florian Pflug
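PS: plugging in the numbers from the ZIP code example upthread (18,940 post names, 59 states, 30,113 combinations) gives a quick feel for F - an untested one-liner:

   SELECT (18940.0 * 59 - 30113) / (30113 * (59 - 1))    AS f_postname_state,   -- ~0.62
          (59.0 * 18940 - 30113) / (30113 * (18940 - 1)) AS f_state_postname;   -- ~0.002

so the measure does point to post_name -> state as the more function-like direction, even though dist(post_name) is still quite far from dist(post_name, state).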
> On Dec18, 2010, at 17:59 , Tomas Vondra wrote: >> It seems to me you're missing one very important thing - this was not >> meant as a new default way to do estimates. It was meant as an option >> when the user (DBA, developer, ...) realizes the current solution gives >> really bad estimates (due to correlation). In that case he could create >> 'cross-column' statistics on those columns, and the optimizer would use >> that info to do the estimates. > > I do understand that. I just have the nagging feeling that there is a > way to judge from dist(A), dist(B) and dist(A,B) whether it makes sense > to apply the uniform bayesian approach or to assume the columns are > unrelated.

I doubt there is a way to make this decision with just the dist(A), dist(B) and dist(A,B) values. Well, we could go with a rule

   if [dist(A) == dist(A,B)] then [A => B]

but that's very fragile. Think about estimates (we're not going to work with exact values of dist(?)), and then about data errors (e.g. a city matched to an incorrect ZIP code or something like that). So for a real-world dataset, the condition [dist(A)==dist(A,B)] will almost never hold. And about the same holds for the "uniform correlation" assumption which is the basis for the formula I posted.

So actually we're looking for a formula that does reasonable estimates and is robust even in cases where the correlation is not uniform or the estimates are reasonably imprecise.

> This motivates the definition > > F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [dist(A,B) * ( dist(B) - 1)] > > (You can probably drop the "-1", it doesn't make much of a difference > for larger values of dist(B).) > > F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and > dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while > a value of 1 indicates that dist(A) == dist(A,B). > > So F(A,B) is a suitable measure of "Implicativeness" - it's higher > if the table (A,B) looks more like a function A -> B. > > You might use that to decide if either A->B or B->A looks function-like > enough to use the uniform bayesian approach. Or you might even go further, > and decide *which* bayesian formula to use - the paper you cited always > averages > > P(A=x|B=y)*P(B=y) and > P(B=y|A=x)*P(A=x) > > but they offer no convincing reason for that other than "We don't know > which to pick".

Well, the reason why they chose the sum/2 approach is that they were unable to infer which of the values is 'better', and the sum/2 limits the errors. I haven't studied this thoroughly, but my impression is that you are going in the same direction as they did, i.e. while they've done

   P(A,B) = (P(A|B)*P(B) + P(B|A)*P(A)) / 2

with P(A|B) ~ dist(B)/dist(A,B) and P(B|A) ~ dist(A)/dist(A,B), you've chosen P(A|B) ~ F(B,A) or something like that.

You'll probably object that you could compute F(A,B) and F(B,A) and then use only the part corresponding to the larger value, but what if F(A,B) and F(B,A) are about the same? This is the reason why they choose to always combine the values (with varying weights).

> I'd like to find a statistical explanation for that definition of > F(A,B), but so far I couldn't come up with any. I created a Maple 14 > worksheet while playing around with this - if you happen to have a > copy of Maple available I'd be happy to send it to you..

No, I don't have Maple. Have you tried Maxima (http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/)? Sage even has an online notebook - that seems like a very comfortable way to exchange this kind of data.

regards
Tomas
On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug <fgp@phlo.org> wrote: > I tried to pick up Robert's idea of quantifying "Implicativeness" - > i.e., finding a number between 0 and 1 that describes how close the > (A,B) are to representing a function A -> B.

Actually Heikki's idea...

> Observe that dist(A),dist(B) <= dist(A,B) <= dist(A)*dist(B) if the > estimates of dist(?) are consistent. From that you easily get > > dist(A,B)/dist(B) <= dist(A) <= dist(A,B) and > dist(A,B)/dist(A) <= dist(B) <= dist(A,B) > > If dist(A) == dist(A,B), then there is a functional dependency > A -> B, and conversely if dist(B) == dist(A,B) there is a functional > dependency B -> A. Note that you can have both at the same time! > > On the other hand, if dist(B) = dist(A,B)/dist(A), then B has the > smallest number of distinct values possible for a given combination > of dist(A,B) and dist(A). This is the anti-function case. > > This motivates the definition > > F(A,B) = [ dist(A)*dist(B) - dist(A,B) ] / [ dist(A,B) * ( dist(B) - 1) ] > > (You can probably drop the "-1", it doesn't make much of a difference > for larger values of dist(B).) > > F(A,B) specifies where dist(A) lies relative to dist(A,B)/dist(B) and > dist(A,B) - a value of 0 indicates dist(A) = dist(A,B)/dist(B) while > a value of 1 indicates that dist(A) == dist(A,B). > > So F(A,B) is a suitable measure of "Implicativeness" - it's higher > if the table (A,B) looks more like a function A -> B. > > You might use that to decide if either A->B or B->A looks function-like > enough to use the uniform bayesian approach. Or you might even go further, > and decide *which* bayesian formula to use - the paper you cited always > averages > > P(A=x|B=y)*P(B=y) and > P(B=y|A=x)*P(A=x) > > but they offer no convincing reason for that other than "We don't know > which to pick".

Ideally you want to somehow make this a continuous transition between the available formulas rather than a discrete transition, I think. If F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and if it's 0, then it's P(A=x)*P(B=y). But suppose F(A,B)=0.5. Then what? A naive approach would be to estimate

   P(A=x && B=y) = P(A=x) * (1 - (1 - F(A,B))*(1 - P(B=y)))

so that if, say, P(A=x) = 0.1 and P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1 we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9) = 0.055. Of course I'm just hand-waving here, and this is without any mathematical basis, being just the simplest formula I could think of that gets the endpoints right and plots some sort of smooth curve between them in the middle. A similar formula with a believable argument to back it up seems like it would be a big step forward for this method.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
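PS: a quick way to eyeball that interpolation - plain SQL, just plugging P(A=x) = P(B=y) = 0.1 into the formula above for a few values of F(A,B):

   SELECT f,
          0.1 * (1 - (1 - f) * (1 - 0.1)) AS blended_estimate
     FROM (VALUES (0.0), (0.25), (0.5), (0.75), (1.0)) AS v(f);

   -- f = 0.0 -> 0.01    (pure independence)
   -- f = 0.5 -> 0.055
   -- f = 1.0 -> 0.1     (pure implication)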
> On Mon, Dec 20, 2010 at 9:29 PM, Florian Pflug <fgp@phlo.org> wrote: >> You might use that to decide if either A->B or B->A looks function-like >> enough to use the uniform bayesian approach. Or you might even go >> further, >> and decide *which* bayesian formula to use - the paper you cited always >> averages >> >> P(A=x|B=y)*P(B=y) and >> P(B=y|A=x)*P(A=x) >> >> but they offer no convincing reason for that other than "We don't know >> which to pick". > > Ideally you want to somehow make this a continuous transition between > the available formulas rather than a discrete transition, I think. If > F(A,B) = 1 then the selectivity of A = x AND B = y is just P(A=x), and > if it's 0, then it's P(A=x)*P(B=y). But suppose F(A,B)=0.5. Then > what? A naive approach would be to estimate P(A=x && B=y) = P(A=x) * > (1 - (1 - F(A,B))*(1 - P(B = y))), so that if, say, P(A=x) = 0.1 and > P(B=y) = 0.1, then when F(A,B) = 0 we estimate 0.01, when F(A,B) = 1 > we estimate 0.1, and when F(A,B) = 0.5 we estimate (0.1)(1 - 0.5*0.9) > = 0.055. Of course I'm just hand-waving here, and this is without any > mathematical basis, being just the simplest formula I could think of > that gets the endpoints right and plots some sort of smooth curve > between them in the middle. A similar formula with a believable > argument to back it up seems like it would be a big step forward for > this method.

This somehow reminds me of how the various t-norms in fuzzy logic evolved. I'm not saying we should use fuzzy logic here, but the requirements are very similar, so it might be an interesting inspiration. See for example this http://plato.stanford.edu/entries/logic-fuzzy (chapter 4).

And there's one additional - IMHO very important - requirement. The whole thing should easily extend to more than two columns. This "IF (F(A,B) > F(B,A)) THEN ..." probably is not a good solution regarding this.

For example given 3 columns A,B,C, would you do that comparison for each pair of columns, or would you do that for A vs (B,C)? Or maybe a completely different approach? Because that would require collecting a lot more data (number of distinct values in each combination) etc.

Say for example there is a table with (C=A+B)

    A | B | C
   ===========
    1 | 1 | 2
    1 | 2 | 3
    1 | 3 | 4
    2 | 1 | 3
    2 | 2 | 4
    2 | 3 | 5
    3 | 1 | 4
    3 | 2 | 5
    3 | 3 | 6

So that dist(A)=dist(B)=3, dist(C)=5 and dist(A,B,C)=dist(A,B)=9. Given the paper, you get something like

   P(A,B,C) = [dist(A)*P(A) + dist(B)*P(B) + dist(C)*P(C)] / [3*dist(A,B,C)]
            = [3*P(A) + 3*P(B) + 5*P(C)] / 27

so for example

   P(A=3,B=2,C=5) = [3*(1/3) + 3*(1/3) + 5*(2/9)] / 27 = (28/9)/27 = 28/243

which is almost correct (the exact value is 1/9 = 27/243, so it's off by just 1/243).

Don't get me wrong - I'm not a fanatic who thinks this particular formula is the best formula possible. I'm just saying we could end up with a formula that works beautifully in 2D, but once we get to 3 columns it fails miserably.

Hmmm, maybe we could give this possibility (to identify two separate groups of columns) to the user. So instead of 'build stats for (A,B,C)' the user would say 'build stats for (A,B) and (C)' - this actually represents a priori knowledge of dependencies supplied by the user.

In that case we could search for 'implicativeness' between those two groups (and not within the groups), and we could restrict ourselves to 2D (and thus use a more sophisticated formula).

But we should be able to do some basic analysis even when the user supplies a list of columns without such a priori knowledge.

regards
Tomas
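PS: the numbers above are easy to double-check with a throwaway query building the nine (A, B, C=A+B) rows on the fly:

   WITH t AS (
       SELECT a, b, a + b AS c
         FROM generate_series(1, 3) AS a,
              generate_series(1, 3) AS b
   )
   SELECT count(DISTINCT a) AS dist_a,   -- 3
          count(DISTINCT b) AS dist_b,   -- 3
          count(DISTINCT c) AS dist_c,   -- 5
          (SELECT count(*) FROM (SELECT DISTINCT a, b, c FROM t) x) AS dist_abc   -- 9
     FROM t;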
On Dec21, 2010, at 11:37 , tv@fuzzy.cz wrote: > I doubt there is a way to make this decision with just dist(A), dist(B) and > dist(A,B) values. Well, we could go with a rule > > if [dist(A) == dist(A,B)] then [A => B] > > but that's very fragile. Think about estimates (we're not going to work > with exact values of dist(?)), and then about data errors (e.g. a city > matched to an incorrect ZIP code or something like that).

Huh? The whole point of the F(A,B)-exercise is to avoid precisely this kind of fragility without penalizing the non-correlated case...

> This is the reason why they choose to always combine the values (with > varying weights).

There are no varying weights involved there. What they do is to express P(A=x,B=y) once as

   P(A=x,B=y) = P(B=y|A=x)*P(A=x)

and then as

   P(A=x,B=y) = P(A=x|B=y)*P(B=y).

Then they assume P(B=y|A=x) ~= dist(A)/dist(A,B) and P(A=x|B=y) ~= dist(B)/dist(A,B), and go on to average the two different ways of computing P(A=x,B=y), which finally gives

   P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
               = dist(A)*P(A=x)/(2*dist(A,B)) + dist(B)*P(B=y)/(2*dist(A,B))
               = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))

That averaging step adds *no* further data-dependent weights.

>> I'd like to find a statistical explanation for that definition of >> F(A,B), but so far I couldn't come up with any. I created a Maple 14 >> worksheet while playing around with this - if you happen to have a >> copy of Maple available I'd be happy to send it to you.. > > No, I don't have Maple. Have you tried Maxima > (http://maxima.sourceforge.net) or Sage (http://www.sagemath.org/). Sage > even has an online notebook - that seems like a very comfortable way to > exchange this kind of data.

I haven't tried them, but I will. That java-based GUI of Maple is driving me nuts anyway... Thanks for the pointers!

best regards,
Florian Pflug
> On Dec21, 2010, at 11:37 , tv@fuzzy.cz wrote: >> I doubt there is a way to make this decision with just dist(A), dist(B) and >> dist(A,B) values. Well, we could go with a rule >> >> if [dist(A) == dist(A,B)] then [A => B] >> >> but that's very fragile. Think about estimates (we're not going to work >> with exact values of dist(?)), and then about data errors (e.g. a city >> matched to an incorrect ZIP code or something like that). > > Huh? The whole point of the F(A,B)-exercise is to avoid precisely this > kind of fragility without penalizing the non-correlated case...

Yes, I understand the intention, but I'm not sure how exactly you want to use the F(?,?) function to compute the P(A,B) - which is the value we're looking for. If I understand it correctly, you proposed something like this

   IF (F(A,B) > F(B,A)) THEN
       P(A,B) := c*P(A);
   ELSE
       P(A,B) := d*P(B);
   END IF;

or something like that (I guess c=dist(A)/dist(A,B) and d=dist(B)/dist(A,B)).

But what if F(A,B)=0.6 and F(B,A)=0.59? This may easily happen due to data errors / imprecise estimates. And this actually matters, because P(A) and P(B) may be significantly different. So this would be really vulnerable to slight changes in the estimates etc.

>> This is the reason why they choose to always combine the values (with >> varying weights). > > There are no varying weights involved there. What they do is to express > P(A=x,B=y) once as > > ... > > P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2 > = dist(A)*P(A=x)/(2*dist(A,B)) + > dist(B)*P(B=y)/(2*dist(A,B)) > = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B)) > > That averaging step adds *no* further data-dependent weights.

Sorry, by 'varying weights' I didn't mean that the weights are different for each value of A or B. What I meant is that they combine the values with different weights (just as you explained).

regards
Tomas
On Dec21, 2010, at 15:51 , tv@fuzzy.cz wrote: >>> This is the reason why they choose to always combine the values (with >>> varying weights). >> >> There are no varying weights involved there. What they do is to express >> P(A=x,B=y) once as >> >> ... >> >> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2 >> = dist(A)*P(A=x)/(2*dist(A,B)) + >> dist(B)*P(B=y)/(2*dist(A,B)) >> = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B)) >> >> That averaging step adds *no* further data-dependent weights. > > Sorry, by 'varying weights' I didn't mean that the weights are different > for each value of A or B. What I meant is that they combine the values > with different weights (just as you explained).

I'm still not sure we're on the same page here. The resulting formula is *not* a weighted average of P(A=x) and P(B=y), since in general dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one syntactically, but that's about it.

The resulting formula instead is an *unweighted* (weights 1) average of the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just as well estimate P(A=x,B=y) with

   P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and it'd *still* be the very same uniform bayesian approach, just no longer symmetric in A and B. Which may easily be preferable if you have reasons to believe that this estimate is more correct than the one obtained by swapping A and B. The original paper doesn't deal with that case simply because they don't mention how P(A=x) and P(B=y) are obtained at all. The postgres estimator, on the other hand, knows quite well how it derived P(A=x) and P(B=y) and may have much higher confidence in one value than in the other.

Assume for example that you're preparing the statement

   SELECT * FROM T WHERE A = ? AND B = 1

We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better without an actual value for the parameter "?". The estimate for P(B=1), on the other hand, can use the histogram, and will thus very likely be much more precise. The two estimates for P(A=?,B=1) in this case are

   P(A=?|B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and
   P(B=1|A=?)*P(A=?) = dist(A)*P(A=?)/dist(A,B).

There's a good chance that the former beats the latter, and thus also the average, then.

best regards,
Florian Pflug
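PS: for anyone who wants to see that effect, a minimal sketch (hypothetical table t(a, b); with a generic plan the planner has no value for $1, so it falls back to 1/dist(a) for that clause, while b = 1 can use the MCV list / histogram):

   PREPARE q(int) AS SELECT * FROM t WHERE a = $1 AND b = 1;
   EXPLAIN EXECUTE q(42);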
On Dec21, 2010, at 13:25 , tv@fuzzy.cz wrote: > And there's one additional - IMHO very important - requirement. The whole > thing should easily extend to more than two columns. This "IF (F(A,B) > > F(B,A)) THEN ..." probably is not a good solution regarding this. > > For example given 3 columns A,B,C, would you do that comparison for each > pair of columns, or would you do that for A vs (B,C)? Or maybe a > completely different approach? Because that would require to collect a lot > more data (number of distinct values in each combination) etc.

That's certainly a valid concern. The uniform bayesian approach avoids that quite beautifully...

> Hmmm, maybe we could give this possibility (to identify two separate > groups of columns) to the user. So instead of 'build stats for (A,B,C)' the > user would say 'build stats for (A,B) and (C)' - this actually represents > a priori knowledge of dependencies supplied by the user. > > In that case we could search for 'implicativeness' between those two > groups (and not within the groups), and we could restrict ourselves to 2D > (and thus use a more sophisticated formula).

Hm, I hated this idea at first, but I'm starting to like it more and more. It *does* seem rather unrealistic that a user would know that a bunch of columns are correlated, but have no idea in what way...

Any examples where this'd be the case would be very much appreciated - maybe we should ask around on -general about this?

> But we should be able to do some basic analysis even when the user > supplies a list of columns without such a priori knowledge.

That, I think, overcomplicates things, at least for a first cut.

To summarize, I think you've shown quite nicely that the uniform bayesian approach is a very sensible first step towards better estimates in the case of correlated columns. It's statistically sound, and the dist(A,B) estimates it requires are probably a necessary ingredient of any solution to the problem. If we can make it degrade more gracefully if the columns are uncorrelated we should do that, but if we can't that's still no reason to drop the whole idea.

So I guess we should turn our attention to how we'd obtain reasonably good estimates of dist(A,B), and return to the current discussion once the other pieces are in place.

I think that maybe it'd be acceptable to scan a large portion of the table to estimate dist(A,B), *if* we must do so only once in a while. But even with a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory, and spilling them into, say, an on-disk hash table adds even more overhead to the already expensive full scan. Maybe using a bloom filter instead of a hash table could avoid the spilling to disk, in exchange for a slightly less precise result...

best regards,
Florian Pflug
> On Dec21, 2010, at 15:51 , tv@fuzzy.cz wrote: >>>> This is the reason why they choose to always combine the values (with >>>> varying weights). >>> >>> There are no varying weights involved there. What they do is to express >>> P(A=x,B=y) once as >>> >>> ... >>> >>> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2 >>> = dist(A)*P(A=x)/(2*dist(A,B)) + >>> dist(B)*P(B=y)/(2*dist(A,B)) >>> = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B)) >>> >>> That averaging step adds *no* further data-dependent weights. >> >> Sorry, by 'varying weights' I didn't mean that the weights are different >> for each value of A or B. What I meant is that they combine the values >> with different weights (just as you explained). > > I'm still not sure we're on the same page here. The resulting formula > is *not* a weighted average of P(A=x) and P(B=y), since in general > dist(A) + dist(B) = 2*dist(A,B) does *not* hold. It may look like one > syntactically, but that's about it.

OK, another crazy usage of 'weights' on my side :-( What I meant is that in the end you have two equations for P(A,B):

   P(A=x|B=y)*P(B=y) = dist(B)*P(B=y)/dist(A,B)
   P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B)

and you need to combine those two estimates. They did that by averaging, as they don't know which of the estimates is better. Generally I think that is a good solution, unless you know one of the estimates is much more reliable (although I'm not sure we should completely omit the other estimate).

> The resulting formula instead is an *unweighted* (weights 1) average of > the two estimates P(B=y|A=x)*P(A=x) and P(A=x|B=y)*P(B=y). You might just > as well estimate P(A=x,B=y) with > > P(B=y|A=x)*P(A=x) = dist(A)*P(A=x)/dist(A,B) > > and it'd *still* be the very same uniform bayesian approach, just no > longer symmetric in A and B. Which may easily be preferable if you > have reasons to believe that this estimate is more correct than the > one obtained by swapping A and B. The original paper doesn't deal with > that case simply because they don't mention how P(A=x) and P(B=y) > are obtained at all. The postgres estimator, on the other hand, > knows quite well how it derived P(A=x) and P(B=y) and may have much > higher confidence in one value than in the other.

OK, good point. I hadn't realized that one of the estimates may be much more reliable. But let's assume both estimates are about the same (regarding reliability) and let's see the following example

    A | B
   =======
    1 | 1
    1 | 1
    1 | 1
    1 | 2
    2 | 1
    2 | 2
    2 | 2
    2 | 2

Thus dist(A)=dist(B)=2, dist(A,B)=4 and

   P(A=1)=P(A=2)=P(B=1)=P(B=2)=1/2

   P(A=1,B=1)=P(A=2,B=2)=3/8
   P(A=1,B=2)=P(A=2,B=1)=1/8

According to the formula presented in the paper, the partial estimates for P(A=1,B=2) are

   P(A=1|B=2)*P(B=2) = dist(B)/dist(A,B) * P(B=2) = 2/4 * 1/2 = 1/4
   P(B=2|A=1)*P(A=1) = dist(A)/dist(A,B) * P(A=1) = 2/4 * 1/2 = 1/4

Thus P(A=1,B=2) = (1/4 + 1/4)/2 = 1/4, so it's overestimated (2x).

    A | B
   =======
    1 | 1
    1 | 2
    1 | 2
    1 | 2
    2 | 1
    2 | 1
    2 | 1
    2 | 2

This obviously has exactly the same features (regarding the number of distinct values), and the produced estimate is exactly the same. But in this case

   P(A=1,B=2)=P(A=2,B=1)=3/8
   P(A=1,B=1)=P(A=2,B=2)=1/8

and thus the 1/4 is an underestimate (compared to 3/8).

The problem is that F(A,B) does not change at all. It's very simple to construct examples (just use more rows) where F(A,B) returns exactly the same value, but the estimates are off. The averaging somehow smooths this out ...

But I think I'm missing something about how to use the F(?,?) to derive the final estimate.
So maybe the resulting estimate would be better. Say there are two tables

    A | B | number of such rows
   ==============================
    1 | 1 | 1000
    1 | 2 | 1000
    2 | 1 | 1000
    2 | 2 | 1000

    A | B | number of such rows
   ==============================
    1 | 1 |    1
    1 | 2 | 1999
    2 | 1 | 1999
    2 | 2 |    1

How would you estimate the P(A=1,B=1) in those cases? Assume that both estimates are equally reliable - i.e. deduced from a histogram or MCV.

> > Assume for example that you're preparing the statement > > SELECT * FROM T WHERE A = ? AND B = 1 > > We'll then estimate P(A=?) as 1/dist(A), since we cannot do any better > without an actual value for the parameter "?". The estimate for P(B=1), > on the other hand, can use the histogram, and will thus very likely be > much more precise. The two estimates for P(A=?,B=1) in this case are > > P(A=?|B=1)*P(B=1) = dist(B)*P(B=1)/dist(A,B), and > P(B=1|A=?)*P(A=?) = dist(A)*P(A=?)/dist(A,B). > > There's a good chance that the former beats the latter, and thus also > the average, then.

OK, good point. I was not thinking about prepared statements. In this case it makes sense to use only one of the estimates ...

regards
Tomas
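PS: to make that concrete, a rough sketch building the second (skewed) table - both tables have dist(A) = dist(B) = 2 and dist(A,B) = 4, so the formula gives 1/4 in both cases, while the actual P(A=1,B=1) is 1/4 in the first table and 1/4000 in the second:

   CREATE TEMP TABLE skewed AS
             SELECT 1 AS a, 1 AS b
   UNION ALL SELECT 1, 2 FROM generate_series(1, 1999)
   UNION ALL SELECT 2, 1 FROM generate_series(1, 1999)
   UNION ALL SELECT 2, 2;

   SELECT sum(CASE WHEN a = 1 AND b = 1 THEN 1 ELSE 0 END)::numeric / count(*) AS actual_p
     FROM skewed;   -- 0.00025 = 1/4000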
Dne 21.12.2010 16:54, Florian Pflug napsal(a): >> Hmmm, maybe we could give this possibility (to identify two separate >> groups of columns) to the user. So instead of 'build stats for (A,B,C)' the >> user would say 'build stats for (A,B) and (C)' - this actually represents >> a priori knowledge of dependencies supplied by the user. >> >> In that case we could search for 'implicativeness' between those two >> groups (and not within the groups), and we could restrict ourselves to 2D >> (and thus use a more sophisticated formula). > > Hm, I hated this idea at first, but I'm starting to like it more and more. > It *does* seem rather unrealistic that a user would know that a bunch of > columns are correlated, but have no idea in what way...

Yes, that's true. Although sometimes the dependency may be very complicated - but let's restrict to 2D for now, build something that solves this simplified case and then we can discuss higher dimensions.

> Any examples where this'd be the case would be very much appreciated - maybe > we should ask around on -general about this?

Well, I think the ZIP code example is a typical case of this - the users know about the dependency between ZIP codes and cities. A natural workaround would be to omit the dependent column from the query, but that's not always possible (e.g. when an ORM is involved, building the queries automatically).

>> But we should be able to do some basic analysis even when the user >> supplies a list of columns without such a priori knowledge. > > That, I think, overcomplicates things, at least for a first cut. > > To summarize, I think you've shown quite nicely that the uniform bayesian > approach is a very sensible first step towards better estimates in the case > of correlated columns. It's statistically sound, and the dist(A,B) estimates > it requires are probably a necessary ingredient of any solution to the problem. > If we can make it degrade more gracefully if the columns are uncorrelated we > should do that, but if we can't that's still no reason to drop the whole idea.

Agreed. IMHO the uncorrelated case is not a big concern, as the users usually know something's wrong with the columns. We should introduce some 'autodetect' eventually, but let's leave that for the future.

> So I guess we should turn our attention to how we'd obtain reasonably good estimates > of dist(A,B), and return to the current discussion once the other pieces are in place. > > I think that maybe it'd be acceptable to scan a large portion of the > table to estimate dist(A,B), *if* we must do so only once in a while. But even with > a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory, > and spilling them into, say, an on-disk hash table adds even more overhead to the already > expensive full scan. Maybe using a bloom filter instead of a hash table could avoid > the spilling to disk, in exchange for a slightly less precise result...

I have no idea what a Bloom filter is (shame on me). I was not thinking about collecting the stats, I was interested primarily in what data we actually need. And my knowledge about the algorithms currently used is very limited :-( But I agree we should at least discuss the possible solutions.

Until now I've done something like this

   SELECT COUNT(DISTINCT a) AS dist_a,
          COUNT(DISTINCT b) AS dist_b,
          COUNT(DISTINCT a || ':' || b) AS dist_ab
     FROM my_table;

but that's not very efficient.
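A probably cheaper (and concatenation-safe) way to get the same number is to count group-by groups instead of building strings - e.g. a='1', b='2:3' and a='1:2', b='3' would collapse into the same concatenated value above, while this counts them separately:

   SELECT count(*) AS dist_ab
     FROM (SELECT 1 FROM my_table GROUP BY a, b) AS x;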
My plan for the near future (a few weeks) is to build a simple 'module' with the ability to estimate the number of rows for a given condition. This could actually be quite useful as a stand-alone contrib module, as the users often ask how to get a number of rows fast (usually for paging).

That may be quite slow when the query returns too many rows, even when there is an index. It may even be much slower than the actual query (as it usually contains a small LIMIT). An estimate is often sufficient, but pg_class.reltuples does not really work with conditions. So this might be an improvement ...

regards
Tomas
Dne 21.12.2010 16:54, Florian Pflug napsal(a): > I think that maybe it'd be acceptable to scan a large portion of the > table to estimate dist(A,B), *if* we must do so only once in a while. But even with > a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory, > and spilling them into, say, an on-disk hash table adds even more overhead to the already > expensive full scan. Maybe using a bloom filter instead of a hash table could avoid > the spilling to disk, in exchange for a slightly less precise result...

I've read some basics about Bloom filters, and I'm not sure we can use them in this case. I see two problems with this approach:

1) impossibility to predict the number of elements in advance

   To build a Bloom filter with a limited error rate, you need to size it properly (number of hash functions and size of the bit array). But that's exactly the information we're looking for.

   I guess we could use the highest possible value (equal to the number of tuples) - according to wiki you need about 10 bits per element with a 1% error rate, i.e. a bit over 1MB of memory for each million of elements.

   That's not much, but this needs to be done for each column separately and for the combination - for N columns you need (at least) N+1 filters.

   Sure - increasing the error rate and using a different estimate to build the Bloom filter would significantly decrease the memory requirements.

2) sampling the whole table

   A naive approach would be to sample the whole table each time, but for large tables that might be a bit of a problem. So I'm thinking about what alternatives are there ...

   One possibility is to build the Bloom filter incrementally and store it on disk, or something like that. I guess this could be done during VACUUM or something like that. Anyway there's an issue of how to set the filter size initially (estimate the number of elements), just as in the previous section.

   Another possibility is to collect the data from just a small portion of a table and then use the result to estimate the number of distinct values for the whole table. But I'm not sure we can do this reliably, I see many traps in this.

But maybe I'm just missing something important.

regards
Tomas
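PS: the sizing rule of thumb is easy to compute directly - the usual formulas are bits = -n*ln(p)/(ln 2)^2 and hash functions k = (bits/n)*ln(2). For example (plain SQL, n = 1 million elements):

   SELECT p,
          ceil(-(1000000 * ln(p)) / (ln(2.0) * ln(2.0)))  AS filter_bits,
          round(-ln(p) / (ln(2.0) * ln(2.0)), 1)          AS bits_per_element,
          round(-ln(p) / (ln(2.0) * ln(2.0)) * ln(2.0))   AS hash_functions
     FROM (VALUES (0.01), (0.10)) AS v(p);

   -- p = 0.01 -> ~9.6 bits per element (roughly 1.2MB per million elements), 7 hash functions
   -- p = 0.10 -> ~4.8 bits per element (roughly 0.6MB per million elements), 3 hash functions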
Dne 21.12.2010 19:03, Tomas Vondra napsal(a):
> My plan for the near future (a few weeks) is to build a simple 'module'
> with the ability to estimate the number of rows for a given condition.
> This could actually be quite useful as a stand-alone contrib module, as
> users often ask how to get the number of rows quickly (usually for paging).

Hm, I've been thinking about where to place this new module - I thought pgfoundry might be the right place, but I've noticed it supports just CVS. Well, that's a rather antiquated SCM I think - I'm much happier with the comfort provided by Git or SVN. Is there some other place commonly used for pg-related projects? I've been thinking about github or something like that ...

And I've finally set up the wiki page about this effort - it's just a summary of this whole thread.

http://wiki.postgresql.org/wiki/Cross_Columns_Stats

regards
Tomas
On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
> Dne 21.12.2010 16:54, Florian Pflug napsal(a):
>> I think that maybe it'd be acceptable to scan a large portion of the
>> table to estimate dist(A,B), *if* we only do so once in a while. But even with
>> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
>> and spilling them into, say, an on-disk hash table adds even more overhead to the already
>> expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
>> the spilling to disk, in exchange for a slightly less precise result...
>
> I've read some basics about Bloom filters, and I'm not sure we can use
> one in this case. I see two problems with this approach:
>
> 1) impossibility to predict the number of elements in advance
>
> To build a Bloom filter with a limited error rate, you need to size it
> properly (number of hash functions and size of the bit array). But
> that's exactly the information we're looking for.
>
> I guess we could use the highest possible value (equal to the number
> of tuples) - according to wiki you need about 10 bits per element
> with 1% error, i.e. about 10MB of memory for each million elements.

Drat. I had expected these numbers to come out quite a bit lower than that, at least for a higher error target. But even with a 10% false positive rate, it's still 4.5MB per 1e6 elements. Still too much to assume the filter will always fit into memory, I fear :-(

> 2) sampling the whole table
>
> A naive approach would be to sample the whole table each time, but
> for large tables that might be a bit of a problem. So I'm thinking
> about what alternatives there are ...
>
> One possibility is to build the Bloom filter incrementally and store
> it on disk - I guess this could be done during VACUUM or something
> like that. Anyway, there's the issue of how to set the filter size
> initially (estimating the number of elements), just as in the previous point.

The filter size could be derived from the table's statistics target, or be otherwise user-definable. We could also auto-resize once it gets too full. But still, that all seems awfully complex :-(

> Another possibility is to collect the data from just a small portion
> of a table and then use the result to estimate the number of distinct
> values for the whole table. But I'm not sure we can do this reliably,
> I see many traps in this.

This is how it works currently. The problem with this approach is that it gives you very few guarantees about how precise the result will be. Extrapolating works very well for things like MCVs and histograms, because there you're by definition interested mostly in values which occur often - and thus with high probability in the relatively few rows you sample. For the number of distinct values, however, this isn't true - if ndistinct is an order of magnitude smaller than the number of rows, relatively few rows can account for a large percentage of the distinct values...

Another idea would be to obtain the ndistinct values from an index somehow. Postgres cannot currently scan an index in physical order, only in logical order, due to locking considerations. But since we'd only be interested in an estimate, maybe a scan in physical block order would work for ndistinct estimates? Just a wild idea, mind you - I haven't checked at all whether that'd be even remotely feasible.

best regards,
Florian Pflug
2010/12/24 Florian Pflug <fgp@phlo.org>:
> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>
>> I guess we could use the highest possible value (equal to the number
>> of tuples) - according to wiki you need about 10 bits per element
>> with 1% error, i.e. about 10MB of memory for each million elements.
>
> Drat. I had expected these numbers to come out quite a bit lower than
> that, at least for a higher error target. But even with a 10% false
> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
> assume the filter will always fit into memory, I fear :-(

I have the impression that both of you are forgetting that there are 8 bits in a byte. 10 bits per element = 1.25MB per million elements.

Nicolas
> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>
>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>
>>> I guess we could use the highest possible value (equal to the number
>>> of tuples) - according to wiki you need about 10 bits per element
>>> with 1% error, i.e. about 10MB of memory for each million elements.
>>
>> Drat. I had expected these numbers to come out quite a bit lower than
>> that, at least for a higher error target. But even with a 10% false
>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>> assume the filter will always fit into memory, I fear :-(
>
> I have the impression that both of you are forgetting that there are 8
> bits in a byte. 10 bits per element = 1.25MB per million elements.

We are aware of that, but we really only needed some very rough estimates, and it's much easier to do the calculations with 10. Actually, according to wikipedia it's not 10 bits per element but 9.6, etc. But it really does not matter whether it's 10MB or 20MB of data - it's still a lot of data ...

Tomas
On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>
>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>
>>> I guess we could use the highest possible value (equal to the number
>>> of tuples) - according to wiki you need about 10 bits per element
>>> with 1% error, i.e. about 10MB of memory for each million elements.
>>
>> Drat. I had expected these numbers to come out quite a bit lower than
>> that, at least for a higher error target. But even with a 10% false
>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>> assume the filter will always fit into memory, I fear :-(
>
> I have the impression that both of you are forgetting that there are 8
> bits in a byte. 10 bits per element = 1.25MB per million elements.

Uh, of course. So in the real universe, the numbers are

~1.2MB per 1e6 elements for a false positive rate of 1%
~0.5MB per 1e6 elements for a false positive rate of 10%

Hm. So for a table with a billion distinct elements, we'd need half a gigabyte per column for the filter. A tuple with two int columns takes at least 24 + 2*4 = 32 bytes to store, I think, making such a table at least 32GB in size. The filter size would thus be about 1/64 of the table size in the worst case.

best regards,
Florian Pflug
Dne 24.12.2010 13:15, tv@fuzzy.cz napsal(a):
>> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
>>>> I guess we could use the highest possible value (equal to the number
>>>> of tuples) - according to wiki you need about 10 bits per element
>>>> with 1% error, i.e. about 10MB of memory for each million elements.
>>>
>>> Drat. I had expected these numbers to come out quite a bit lower than
>>> that, at least for a higher error target. But even with a 10% false
>>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>>> assume the filter will always fit into memory, I fear :-(
>>
>> I have the impression that both of you are forgetting that there are 8
>> bits in a byte. 10 bits per element = 1.25MB per million elements.
>
> We are aware of that, but we really only needed some very rough estimates,
> and it's much easier to do the calculations with 10. Actually, according to
> wikipedia it's not 10 bits per element but 9.6, etc. But it really does not
> matter whether it's 10MB or 20MB of data - it's still a lot of data ...

Oooops, now I see what the problem is. I thought you were pointing out something else, but I had actually used 1B = 1b (which is obviously wrong). But Florian already noticed that and fixed the estimates.

Tomas
Dne 24.12.2010 13:37, Florian Pflug napsal(a):
> On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
>
>> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
>>>> I guess we could use the highest possible value (equal to the number
>>>> of tuples) - according to wiki you need about 10 bits per element
>>>> with 1% error, i.e. about 10MB of memory for each million elements.
>>>
>>> Drat. I had expected these numbers to come out quite a bit lower than
>>> that, at least for a higher error target. But even with a 10% false
>>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>>> assume the filter will always fit into memory, I fear :-(
>>
>> I have the impression that both of you are forgetting that there are 8
>> bits in a byte. 10 bits per element = 1.25MB per million elements.
>
> Uh, of course. So in the real universe, the numbers are
>
> ~1.2MB per 1e6 elements for a false positive rate of 1%
> ~0.5MB per 1e6 elements for a false positive rate of 10%
>
> Hm. So for a table with a billion distinct elements, we'd need half
> a gigabyte per column for the filter. A tuple with two int columns
> takes at least 24 + 2*4 = 32 bytes to store, I think, making such a table
> at least 32GB in size. The filter size would thus be about 1/64 of the table
> size in the worst case.

Yes, but in reality you need three such filters - one for each column and one for the combination. So that is 1.5GB (with a 10% error rate) or 3.6GB (with a 1% error rate).

But this is severely excessive compared to the real needs, as the number of distinct values is usually far lower than the number of tuples (which is what we assumed in these computations). I was thinking about a simple heuristic to scale the filter properly, something like this:

1) sample a small portion of the table and count the number of distinct values
2) compute "number of distinct values" / "number of sampled tuples"
3) scale this ratio to the whole table and size the filter accordingly

Say the sample consists of 1,000 rows and contains 50 distinct values (even though some of the table's distinct values - say 20 - are missing from the sample). That gives 5% in step (2), and if the table has 1,000,000 tuples you'll get 50,000 in step (3). So at roughly 10 bits per element the filter needs just about 60kB, which is a huge improvement compared to the previous approach (1.2MB).

Obviously this will still lead to overestimates in most cases, and there are probably some other failure cases, but I think it's a reasonable solution. I don't think this can result in an underestimate (which is the case where you lose precision).

And in case we want to build this incrementally (from a VACUUM) we really need to use a somewhat larger filter, because rescaling the filter is not possible AFAIK (without rebuilding it from scratch).

regards
Tomas
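PS: Just to make steps (1)-(3) concrete, an untested sketch against the test_table from the PoC - the 0.1% sample fraction is made up, and pg_class.reltuples needs to be reasonably fresh (i.e. ANALYZE has run recently):

WITH sample AS (
    SELECT col_a, col_b
    FROM test_table
    WHERE random() < 0.001                               -- step 1: crude random sample
)
SELECT s.n_sampled,
       s.n_distinct,
       round(s.n_distinct::float8 / s.n_sampled          -- step 2: distinct/sampled ratio
             * (SELECT reltuples FROM pg_class
                WHERE relname = 'test_table')) AS est_distinct   -- step 3: scale to the table
FROM (SELECT count(*)                       AS n_sampled,
             count(DISTINCT (col_a, col_b)) AS n_distinct
      FROM sample) s;

The filter would then be sized for est_distinct elements (times ~10 bits each) instead of for the full tuple count.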
Dne 24.12.2010 04:41, Florian Pflug napsal(a):
> The filter size could be derived from the table's statistics target, or
> be otherwise user-definable. We could also auto-resize once it gets too
> full. But still, that all seems awfully complex :-(

Using the statistics target is a good idea I think - we could use it to determine the error rate of the filter. Something like

error rate = 10 - 0.01 * (statistics_target - 100)

which gives

1% for statistics target 1000
10% for statistics target 100

or maybe something like this (where the error rate grows faster for smaller statistics target values)

error rate = 0.9 + 91000 / statistics_target^2

which gives about

1% for statistics target 1000
10% for statistics target 100
37% for statistics target 50

But I guess a 10% error rate is about the minimum accuracy we need, so it does not make much sense to use statistics targets much lower than 100.

>> Another possibility is to collect the data from just a small portion
>> of a table and then use the result to estimate the number of distinct
>> values for the whole table. But I'm not sure we can do this reliably,
>> I see many traps in this.
>
> This is how it works currently. The problem with this approach is that
> it gives you very few guarantees about how precise the result will be.
> Extrapolating works very well for things like MCVs and histograms, because
> there you're by definition interested mostly in values which occur often -
> and thus with high probability in the relatively few rows you sample. For
> the number of distinct values, however, this isn't true - if ndistinct
> is an order of magnitude smaller than the number of rows, relatively few
> rows can account for a large percentage of the distinct values...

That basically means we need to sample a large portion of the table :-(

> Another idea would be to obtain the ndistinct values from an index somehow.
> Postgres cannot currently scan an index in physical order, only in logical
> order, due to locking considerations. But since we'd only be interested in
> an estimate, maybe a scan in physical block order would work for ndistinct
> estimates? Just a wild idea, mind you - I haven't checked at all whether that'd
> be even remotely feasible.

I was thinking about that too, and I think we could do this using the pageinspect contrib module. Sure, there might be a problem with bloated indexes. And relying on this actually means a multi-column index on all the columns is required - individual indexes are not enough, as we need the number of distinct combinations too.

regards
Tomas
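PS: A quick throwaway query to eyeball both mappings as written above - the three statistics targets are just sample points:

SELECT t AS statistics_target,
       round(10 - 0.01 * (t - 100), 1)   AS linear_error_pct,
       round(0.9 + 91000.0 / (t * t), 1) AS quadratic_error_pct
FROM unnest(ARRAY[50, 100, 1000]) AS s(t);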
Dne 24.12.2010 13:37, Florian Pflug napsal(a):
> On Dec24, 2010, at 11:23 , Nicolas Barbier wrote:
>
>> 2010/12/24 Florian Pflug <fgp@phlo.org>:
>>
>>> On Dec23, 2010, at 20:39 , Tomas Vondra wrote:
>>>
>>>> I guess we could use the highest possible value (equal to the number
>>>> of tuples) - according to wiki you need about 10 bits per element
>>>> with 1% error, i.e. about 10MB of memory for each million elements.
>>>
>>> Drat. I had expected these numbers to come out quite a bit lower than
>>> that, at least for a higher error target. But even with a 10% false
>>> positive rate, it's still 4.5MB per 1e6 elements. Still too much to
>>> assume the filter will always fit into memory, I fear :-(
>>
>> I have the impression that both of you are forgetting that there are 8
>> bits in a byte. 10 bits per element = 1.25MB per million elements.
>
> Uh, of course. So in the real universe, the numbers are
>
> ~1.2MB per 1e6 elements for a false positive rate of 1%
> ~0.5MB per 1e6 elements for a false positive rate of 10%
>
> Hm. So for a table with a billion distinct elements, we'd need half
> a gigabyte per column for the filter. A tuple with two int columns
> takes at least 24 + 2*4 = 32 bytes to store, I think, making such a table
> at least 32GB in size. The filter size would thus be about 1/64 of the table
> size in the worst case.

I've read several papers about estimating the number of distinct values, and I have some bad as well as some good news. I don't want this thread to grow indefinitely, and it's a rather separate topic, so I'll start a new thread for ndistinct estimates.

regards
Tomas