Thread: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
I've been thinking about the behavior discussed in https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org and it seems to me that there are a couple of things we ought to do about it. First, I think we need a larger hard floor on the number of occurrences of a value that're required to make ANALYZE decide it is a "most common value". The existing coding is willing to believe that anything that appears at least twice in the sample is a potential MCV, but that design originated when we were envisioning stats samples of just a few thousand rows --- specifically, default_statistics_target was originally just 10, leading to a 3000-row sample size. So accepting two-appearance values as MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it could be a tenth or a hundredth of that. As a round number, I'm thinking that a good floor would be a frequency estimate of 1/1000. With today's typical sample size of 30000 rows, a value would have to appear at least 30 times in the sample to be believed to be an MCV. That seems like it gives us a reasonable margin of error against the kind of sampling noise seen in the above-cited thread. Second, the code also has a rule that potential MCVs need to have an estimated frequency at least 25% larger than what it thinks the "average" value's frequency is. A rule of that general form seems like a good idea, but I now think the 25% threshold is far too small to do anything useful. In particular, in any case like this where there are more distinct values than there are sample rows, the "average frequency" estimate will correspond to less than one occurrence in the sample, so that this rule is totally useless to filter anything that we would otherwise consider as an MCV. I wonder if we shouldn't make it be "at least double the estimated average frequency". Thoughts? regards, tom lane
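For a concrete sense of what those two filters amount to in raw sample counts, here is a small Python sketch. The helper names are invented for illustration; this is not the logic in analyze.c.

import math

def min_count_for_floor(sample_rows, floor_denominator=1000):
    # Hard floor: estimated frequency must be at least 1/floor_denominator.
    return math.ceil(sample_rows / floor_denominator)

def min_count_vs_average(sample_rows, n_distinct_estimate, multiplier):
    # Filter relative to the estimated average frequency (1 / n_distinct).
    return math.ceil(multiplier * sample_rows / n_distinct_estimate)

sample_rows = 30000                                       # default_statistics_target = 100

print(min_count_for_floor(sample_rows))                   # 30, as proposed above

# With more distinct values than sample rows, neither 1.25x nor 2x the average
# frequency filters anything, since any candidate already appears >= 2 times:
print(min_count_vs_average(sample_rows, 200000, 1.25))    # 1
print(min_count_vs_average(sample_rows, 200000, 2.0))     # 1

# With relatively few distinct values, the relative rule does have teeth:
print(min_count_vs_average(sample_rows, 500, 1.25))       # 75
print(min_count_vs_average(sample_rows, 500, 2.0))        # 120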
Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From
Robert Haas
Date:
On Sun, Jun 4, 2017 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I've been thinking about the behavior discussed in > https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org > and it seems to me that there are a couple of things we ought to do about > it. > > First, I think we need a larger hard floor on the number of occurrences > of a value that're required to make ANALYZE decide it is a "most common > value". The existing coding is willing to believe that anything that > appears at least twice in the sample is a potential MCV, but that design > originated when we were envisioning stats samples of just a few thousand > rows --- specifically, default_statistics_target was originally just 10, > leading to a 3000-row sample size. So accepting two-appearance values as > MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it > could be a tenth or a hundredth of that. > > As a round number, I'm thinking that a good floor would be a frequency > estimate of 1/1000. With today's typical sample size of 30000 rows, > a value would have to appear at least 30 times in the sample to be > believed to be an MCV. That seems like it gives us a reasonable margin > of error against the kind of sampling noise seen in the above-cited > thread. This kind of math isn't my strong point, but it seems to me that, while sampling noise is a problem, it's not obvious how to tell whether a given result is signal or noise. I mean, if we sample 30,000 rows from a table with a billion rows and we see the same value twice, then it could be the case that the row occurs just twice in the whole table and we unluckily saw both occurrences. Contrariwise, it could also be that the typical value in that table occurs 1000 times and the value in question occurs 50,000 times, so our sample was right on the money. There's no way to know whether, if we were to take a bunch more samples, we'd see that value coming up twice in most of those samples or whether the fact that it showed up twice this time is a statistical fluke. What makes me a bit cautious about this approach overall is that I've seen cases where the length of the MCV list turns out to be key to getting a good plan, and I needed to make it longer in order for things to work. Any non-MCV is, IIRC, assumed to be less frequent than the least-common MCV. If you raise the minimum frequency that is required for something to be considered an MCV, then the cap on the frequency of a non-MCV also goes up, and I think it's not impossible to imagine that being a problem for somebody. Now maybe you're going to say those people can fix it by twiddling n_distinct using DDL. I'm not sure whether that works in all cases, but even if it does, some of those people might not need to twiddle n_distinct today, so from their point of view it would be a regression. Another way to state it is: is this problem one-sided? If we *only* have a problem with things that aren't really MCVs being considered as MCVs, then tightening up the rules for including something in the MCV list will fix it. Also, the statistics will be more stable and planning will be faster, which all sounds like a win. However, I'm guessing (on the basis of some personal experience) that we *also* have a problem with things that really are MCVs *not* being considered MCVs. If it's true that both of those things are problems, then the only real fix is to increase the sample size -- which I notice you recommended on the original thread.
In general, I'm pretty skeptical of the idea that sampling 30,000 rows out of an arbitrarily large table will produce a sufficiently-accurate MCV list. There's a comment in std_typanalyze() citing a 1998 paper by Chaudhuri, Motwani, and Narasayya for the proposition that we'll get a reasonably accurate histogram with that sample size, but histogram bins are not MCVs. Moreover, in practice, we need *increased* accuracy, not just *equal* accuracy, as the table size increases. On a billion row table, a minimum MCV frequency of 1/1000 means that we can't distinguish between a value that occurs 1 time and a value that occurs 999,999 times -- they are both non-MCVs. As you shrink the table that sort of thing becomes a smaller and smaller problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
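A rough illustration of that last point: a fixed minimum MCV sample frequency translates into very different absolute row counts depending on table size. The numbers below are hypothetical, chosen only to show the scaling.

freq_floor = 1.0 / 1000        # proposed minimum MCV frequency

for table_rows in (100_000, 10_000_000, 1_000_000_000):
    cutoff_rows = int(table_rows * freq_floor)
    print(f"{table_rows:>13,} rows: values occurring fewer than "
          f"{cutoff_rows:,} times are all lumped together as non-MCVs")

# On the billion-row table this prints 1,000,000 -- i.e. a value occurring
# once and a value occurring 999,999 times land in the same bucket.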
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Jun 4, 2017 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> First, I think we need a larger hard floor on the number of occurrences >> of a value that're required to make ANALYZE decide it is a "most common >> value". > This kind of math isn't my strong point, but it seems to me that, > while sampling noise is a problem, it's not obvious how to tell > whether a given result is signal or noise. I think that a single count in a 30K-row sample is noise by definition. We really ought to be setting the threshold for "what is an MCV" high enough that it's not drastically affected by variations that are clearly at the sampling-error level. > What makes me a bit cautious about this approach overall is that I've > seen cases where the length of the MCV list turns out to be key to > getting a good plan, and I needed to make it longer in order for > things to work. As long as they actually are MCVs, sure. The problem I've got with the current behavior is that it manufactures a spiky distribution where there is none. That leads directly to bad estimates, as shown in Marko's example. We'd be much better off, both as to planning time and accuracy, if we'd concluded that the table had no MCVs. > Another way to state it is: is this problem one-sided? You know as well as I do that there's no free lunch in this area. Anything we change at all will make things worse for somebody, if only by accident. But I do not think that choosing a bunch of values entirely at random and claiming (incorrectly) that they are more common than other values in the table can possibly lead to better results except by accident. > In general, I've pretty skeptical of the idea that sampling 30,000 > rows out of an arbitrarily large table will produce a > sufficiently-accurate MCV list. Perhaps not, but allowing two occurrences to define an MCV surely isn't helping with that. More to the point maybe, it's a behavior that doesn't go away simply by making the sample larger. Making the sample larger just allows us to falsely invent more pseudo-MCVs. I'm not by any means wedded to the proposition that we have to fix it simply by changing the filter rule. One idea that seems worth considering is to keep a track list that's a bit longer than the maximum allowed number of MCVs, and then to say that we accept only MCVs whose counts are significantly greater than what we find at the tail of the list. I'm not sure what "significantly" should be exactly, but surely a distribution as flat as the ones I was showing upthread should be a red flag. regards, tom lane
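One possible reading of that "longer track list" idea, sketched in Python. The definition of "significantly greater" is exactly the open question in the email; the 2.0 ratio below is an arbitrary placeholder, not a proposal.

def filter_mcvs(track_counts, max_mcvs, ratio=2.0):
    """track_counts: candidate counts sorted descending, kept somewhat longer
    than max_mcvs.  Accept only counts well above the tail of the list."""
    if len(track_counts) <= max_mcvs:
        return track_counts                      # nothing to compare against in this toy version
    tail_count = track_counts[max_mcvs]          # first count beyond the allowed MCV slots
    return [c for c in track_counts[:max_mcvs] if c >= ratio * tail_count]

# A flat distribution like the ones shown upthread yields no MCVs:
print(filter_mcvs([3, 2, 2, 2, 2, 2, 2], max_mcvs=5))       # []
# A genuinely spiky one keeps its spikes:
print(filter_mcvs([900, 850, 40, 3, 2, 2, 2], max_mcvs=5))  # [900, 850, 40]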
Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From
Mark Kirkwood
Date:
On 05/06/17 09:30, Tom Lane wrote: > I've been thinking about the behavior discussed in > https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org > and it seems to me that there are a couple of things we ought to do about > it. > > First, I think we need a larger hard floor on the number of occurrences > of a value that're required to make ANALYZE decide it is a "most common > value". The existing coding is willing to believe that anything that > appears at least twice in the sample is a potential MCV, but that design > originated when we were envisioning stats samples of just a few thousand > rows --- specifically, default_statistics_target was originally just 10, > leading to a 3000-row sample size. So accepting two-appearance values as > MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it > could be a tenth or a hundredth of that. > > As a round number, I'm thinking that a good floor would be a frequency > estimate of 1/1000. With today's typical sample size of 30000 rows, > a value would have to appear at least 30 times in the sample to be > believed to be an MCV. That seems like it gives us a reasonable margin > of error against the kind of sampling noise seen in the above-cited > thread. > > Second, the code also has a rule that potential MCVs need to have an > estimated frequency at least 25% larger than what it thinks the "average" > value's frequency is. A rule of that general form seems like a good idea, > but I now think the 25% threshold is far too small to do anything useful. > In particular, in any case like this where there are more distinct values > than there are sample rows, the "average frequency" estimate will > correspond to less than one occurrence in the sample, so that this rule is > totally useless to filter anything that we would otherwise consider as an > MCV. I wonder if we shouldn't make it be "at least double the estimated > average frequency". > Or possibly calculate the sample standard deviation and make use of that to help decide on a more flexible cutoff than twice the avg frequency? Are there any research papers that might help us here (I'm drowning in a sea of barely relevant search results for most phrases I've tried so far)? I recall there were some that Tom referenced when this stuff was originally written. On the other hand I do have access to some mathematicians specializing in statistics - so can get their thoughts on this issue if you feel it would be worthwhile. Cheers Mark
Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From
Gavin Flower
Date:
On 06/06/17 09:41, Mark Kirkwood wrote: > On 05/06/17 09:30, Tom Lane wrote: > >> I've been thinking about the behavior discussed in >> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org >> >> and it seems to me that there are a couple of things we ought to do >> about >> it. >> >> First, I think we need a larger hard floor on the number of occurrences >> of a value that're required to make ANALYZE decide it is a "most common >> value". The existing coding is willing to believe that anything that >> appears at least twice in the sample is a potential MCV, but that design >> originated when we were envisioning stats samples of just a few thousand >> rows --- specifically, default_statistics_target was originally just 10, >> leading to a 3000-row sample size. So accepting two-appearance >> values as >> MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it >> could be a tenth or a hundredth of that. >> >> As a round number, I'm thinking that a good floor would be a frequency >> estimate of 1/1000. With today's typical sample size of 30000 rows, >> a value would have to appear at least 30 times in the sample to be >> believed to be an MCV. That seems like it gives us a reasonable margin >> of error against the kind of sampling noise seen in the above-cited >> thread. >> >> Second, the code also has a rule that potential MCVs need to have an >> estimated frequency at least 25% larger than what it thinks the >> "average" >> value's frequency is. A rule of that general form seems like a good >> idea, >> but I now think the 25% threshold is far too small to do anything >> useful. >> In particular, in any case like this where there are more distinct >> values >> than there are sample rows, the "average frequency" estimate will >> correspond to less than one occurrence in the sample, so that this >> rule is >> totally useless to filter anything that we would otherwise consider >> as an >> MCV. I wonder if we shouldn't make it be "at least double the estimated >> average frequency". >> > > Or possibly calculate the sample standard deviation and make use of > that to help decide on a more flexible cutoff than twice the avg > frequency? > > Are there any research papers that might help us here (I'm drowning in > a sea of barely relevant search results for most phrases I've tried so > far)? I recall there were some that Tom referenced when this stuff was > originally written. > > On the other hand I do have access to some mathematicians specializing > in statistics - so can get their thoughts on this issue if you feel it > would be worthwhile. > > Cheers > > Mark > > The standard deviation (sd) is proportional to the square root of the number in the sample in a Normal Distribution. In a Normal Distribution, about 2/3 the values will be within plus or minus one sd of the mean. There seems to be an implicit assumption that the distribution of values follows the Normal Distribution - has this been verified? I suspect that real data will have a skewed distribution of values, and may even be multi modal (multiple peaks) The Normal Distribution has one central peak with 2 tails of the same shape & size. So in a sample of 100 the sd is proportional to 10%, for 10,000 the sd is proportional to 1%. So essentially, the larger the sample size the more reliable is knowledge of the most common values (ignoring pathologically extreme distributions!) - the measure of reliability depends on the distribution. 
How about selecting the cut off as the mean plus one sd, or something of that nature? Note that the cut off point may result in no MCVs being selected - especially for small samples. If practicable, it would be good to sample real datasets. Suggest looking at datasets where the current mechanism looks reasonable, and ones where the estimates are too far off. Also, if possible, try any new selection method on the datasets and see what the difference is. The above is based on what I remember from my university statistics papers; I took it up to 4th year level many moons ago. Cheers, Gavin
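As a sketch of the mean-plus-one-standard-deviation cutoff suggested here, applied to per-value counts in a synthetic sample (purely illustrative; the data and function name are made up for the example):

from collections import Counter
from statistics import mean, stdev

def mcv_candidates(sample_values):
    counts = Counter(sample_values)
    freqs = list(counts.values())
    cutoff = mean(freqs) + stdev(freqs)          # mean + 1 sd of the per-value sample counts
    return {v: c for v, c in counts.items() if c > cutoff}

# A skewed sample: one value dominates, the rest appear once or a few times.
skewed = ["a"] * 50 + ["b"] * 3 + list("cdefghij")
print(mcv_candidates(skewed))                    # {'a': 50}

# A flat sample: no value clears the cutoff, so no MCVs are selected -- the
# outcome Gavin notes may happen for small or flat samples.
flat = list("abcdefghij") * 3
print(mcv_candidates(flat))                      # {}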
Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From
Gavin Flower
Date:
On 06/06/17 10:12, Gavin Flower wrote: > On 06/06/17 09:41, Mark Kirkwood wrote: >> On 05/06/17 09:30, Tom Lane wrote: >> >>> I've been thinking about the behavior discussed in >>> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org >>> >>> and it seems to me that there are a couple of things we ought to do >>> about >>> it. >>> >>> First, I think we need a larger hard floor on the number of occurrences >>> of a value that're required to make ANALYZE decide it is a "most common >>> value". The existing coding is willing to believe that anything that >>> appears at least twice in the sample is a potential MCV, but that >>> design >>> originated when we were envisioning stats samples of just a few >>> thousand >>> rows --- specifically, default_statistics_target was originally just >>> 10, >>> leading to a 3000-row sample size. So accepting two-appearance >>> values as >>> MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it >>> could be a tenth or a hundredth of that. >>> >>> As a round number, I'm thinking that a good floor would be a frequency >>> estimate of 1/1000. With today's typical sample size of 30000 rows, >>> a value would have to appear at least 30 times in the sample to be >>> believed to be an MCV. That seems like it gives us a reasonable margin >>> of error against the kind of sampling noise seen in the above-cited >>> thread. >>> >>> Second, the code also has a rule that potential MCVs need to have an >>> estimated frequency at least 25% larger than what it thinks the >>> "average" >>> value's frequency is. A rule of that general form seems like a good >>> idea, >>> but I now think the 25% threshold is far too small to do anything >>> useful. >>> In particular, in any case like this where there are more distinct >>> values >>> than there are sample rows, the "average frequency" estimate will >>> correspond to less than one occurrence in the sample, so that this >>> rule is >>> totally useless to filter anything that we would otherwise consider >>> as an >>> MCV. I wonder if we shouldn't make it be "at least double the >>> estimated >>> average frequency". >>> >> >> Or possibly calculate the sample standard deviation and make use of >> that to help decide on a more flexible cutoff than twice the avg >> frequency? >> >> Are there any research papers that might help us here (I'm drowning >> in a sea of barely relevant search results for most phrases I've >> tried so far)? I recall there were some that Tom referenced when this >> stuff was originally written. >> >> On the other hand I do have access to some mathematicians >> specializing in statistics - so can get their thoughts on this issue >> if you feel it would be worthwhile. >> >> Cheers >> >> Mark >> >> > The standard deviation (sd) is proportional to the square root of the > number in the sample in a Normal Distribution. > > In a Normal Distribution, about 2/3 the values will be within plus or > minus one sd of the mean. > > There seems to be an implicit assumption that the distribution of > values follows the Normal Distribution - has this been verified? I > suspect that real data will have a skewed distribution of values, and > may even be multi modal (multiple peaks) The Normal Distribution has > one central peak with 2 tails of the same shape & size. > > So in a sample of 100 the sd is proportional to 10%, > for 10,000 the sd is proportional to 1%. 
> > So essentially, the larger the sample size the more reliable is > knowledge of the most common values (ignoring pathologically extreme > distributions!) - the measure of reliability depends on the distribution. > > How about selecting the cut off as the mean plus one sd, or something > of that nature? Note that the cut off point may result in no mcv's > being selected - especially for small samples. > > If practicable, it would be good to sample real datasets. Suggest > looking at datasets were the current mechanism looks reasonable, and > ones were the estimates are too far off. Also, if possible, try any > new selection method on the datasets and see what the difference is. > > The above is based on what I remember from my university statistics > papers, I took it up to 4th year level many moons ago. > > > Cheers, > Gavin > > > > Suddenly realized that the distribution of the FREQUENCIES of values in a Normal Distribution is probably NOT normally distributed! For some shapes, and fancy names for them (like 'leptokurtic'), see: http://onlinestatbook.com/2/introduction/distributions.html For multi modal examples: http://www.statisticshowto.com/multimodal-distribution I tried looking for useful stuff to clarify things without success - so I think that asking a practising statistician is probably a very good idea! :-) Cheers, Gavin
Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From
Robert Haas
Date:
On Mon, Jun 5, 2017 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think that a single count in a 30K-row sample is noise by definition. Not if the table only *has* 30K rows. Or even 100K. And anyway we're talking about what to do with a value we hit at least twice, which is not quite the same thing. > As long as they actually are MCVs, sure. The problem I've got with the > current behavior is that it manufactures a spiky distribution where there > is none. That leads directly to bad estimates, as shown in Marko's > example. We'd be much better off, both as to planning time and accuracy, > if we'd concluded that the table had no MCVs. That is so, but that example is also constructed to show the case where the current code falls down. The case where the distribution actually is spiky but the frequency is near the minimum that ANALYZE can detect isn't tested by that example. >> Another way to state it is: is this problem one-sided? > > You know as well as I do that there's no free lunch in this area. > Anything we change at all will make things worse for somebody, if only > by accident. Yep. > But I do not think that choosing a bunch of values entirely > at random and claiming (incorrectly) that they are more common than other > values in the table can possibly lead to better results except by > accident. But accidents happen, and sometimes they happen frequently. I maintain that the root of this problem is that you want to pretend like a 30k row sample on (in this instance) a 100m row table ought to be expected to be enough to reliably produce good results, and I think that's doomed. If you tweak the algorithm to remove what is in this case noise, you'll just break some other case instead. Maybe it'll take somebody a few years to find and post an example of such a case, but surely you can see that there must be cases where most ANALYZE runs will find 2 of some value precisely because it is a lot more common than average. I think what we ought to do is install some kind of auto-scaling behavior so that when we discover that a table contains a very large number of rows, we use a higher statistics target. Admittedly there's some circularity problems there -- if we haven't sampled the table yet, we don't know how wide the rows are. But we could estimate based on the on-disk size for the first ANALYZE and based on the previously-observed row width thereafter, and probably improve things measurably. Now if, in conjunction with that, we also get more aggressive about filtering out low-order MCVs, fine. I think there are plenty of small tables where the low-order MCVs make very little difference and chucking them will do nothing but save planning time. But I believe that for very large tables, shortening the MCV list will hurt -- in some cases, probably a lot -- so I'd like to see some compensating change to avoid that. > I'm not by any means wedded to the proposition that we have to fix it > simply by changing the filter rule. One idea that seems worth considering > is to keep a track list that's a bit longer than the maximum allowed > number of MCVs, and then to say that we accept only MCVs whose counts are > significantly greater than what we find at the tail of the list. I'm not > sure what "significantly" should be exactly, but surely a distribution > as flat as the ones I was showing upthread should be a red flag. 
I'm not sure exactly which bit upthread (or on the other thread) you're referring to here, but I'm worried that your thinking is being excessively shaped by the example presently in front of us. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
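A back-of-the-envelope sketch of the auto-scaling idea from the previous message, with every constant a placeholder rather than a concrete proposal: estimate the row count from the relation's on-disk size the first time around, then grow the statistics target with table size.

import math

def estimated_rows(relation_bytes, row_width_estimate=100):
    # First-time guess from on-disk size; later ANALYZE runs could substitute
    # the previously observed row width for the 100-byte placeholder.
    return relation_bytes // row_width_estimate

def scaled_stats_target(est_rows, base_target=100, base_rows=1_000_000):
    # Placeholder scaling rule: grow the target with the square root of the
    # table size beyond a baseline, capped at the existing maximum of 10000.
    if est_rows <= base_rows:
        return base_target
    return min(10000, int(base_target * math.sqrt(est_rows / base_rows)))

for size_gb in (1, 10, 100):
    rows = estimated_rows(size_gb * 1024**3)
    target = scaled_stats_target(rows)
    # ANALYZE samples 300 rows per unit of statistics target.
    print(f"{size_gb:>4} GB: ~{rows:,} rows, target {target}, sample {300 * target:,} rows")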
Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From
Alvaro Herrera
Date:
Gavin Flower wrote: > The standard deviation (sd) is proportional to the square root of > the number in the sample in a Normal Distribution. > > In a Normal Distribution, about 2/3 the values will be within plus > or minus one sd of the mean. > > There seems to be an implicit assumption that the distribution of > values follows the Normal Distribution - has this been verified? The whole problem here is precisely to determine what is the data distribution -- one side of it is how to represent it for the planner (which we do by storing a number of distinct values, a list of MCVs and their respective frequencies, and a histogram representing values not in the MCV list); the other side is how to figure out what data to put in the MCV list and histogram (i.e. what to compute during ANALYZE). If we knew the distribution was a normal, we wouldn't need any of these things -- we'd just store the mean and standard deviation. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
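For reference, a heavily simplified sketch of how those stored statistics feed an equality-selectivity estimate (roughly the shape of eqsel/var_eq_const, not the actual code): MCVs get their recorded frequency, and everything else shares the leftover probability evenly among the remaining distinct values.

def eq_selectivity(value, mcv_vals, mcv_freqs, n_distinct):
    if value in mcv_vals:
        return mcv_freqs[mcv_vals.index(value)]
    remaining_freq = 1.0 - sum(mcv_freqs)
    remaining_distinct = max(n_distinct - len(mcv_vals), 1)
    return remaining_freq / remaining_distinct

# If ANALYZE manufactures spurious MCVs, both branches suffer: the listed
# values are overestimated, and the leftover share for everything else shrinks.
print(eq_selectivity("x", ["a", "b"], [0.2, 0.1], n_distinct=1000))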
Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From
Mark Kirkwood
Date:
On 06/06/17 10:12, Gavin Flower wrote: > On 06/06/17 09:41, Mark Kirkwood wrote: >> On 05/06/17 09:30, Tom Lane wrote: >> >>> I've been thinking about the behavior discussed in >>> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org >>> >>> and it seems to me that there are a couple of things we ought to do >>> about >>> it. >>> >>> First, I think we need a larger hard floor on the number of occurrences >>> of a value that're required to make ANALYZE decide it is a "most common >>> value". The existing coding is willing to believe that anything that >>> appears at least twice in the sample is a potential MCV, but that >>> design >>> originated when we were envisioning stats samples of just a few >>> thousand >>> rows --- specifically, default_statistics_target was originally just >>> 10, >>> leading to a 3000-row sample size. So accepting two-appearance >>> values as >>> MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it >>> could be a tenth or a hundredth of that. >>> >>> As a round number, I'm thinking that a good floor would be a frequency >>> estimate of 1/1000. With today's typical sample size of 30000 rows, >>> a value would have to appear at least 30 times in the sample to be >>> believed to be an MCV. That seems like it gives us a reasonable margin >>> of error against the kind of sampling noise seen in the above-cited >>> thread. >>> >>> Second, the code also has a rule that potential MCVs need to have an >>> estimated frequency at least 25% larger than what it thinks the >>> "average" >>> value's frequency is. A rule of that general form seems like a good >>> idea, >>> but I now think the 25% threshold is far too small to do anything >>> useful. >>> In particular, in any case like this where there are more distinct >>> values >>> than there are sample rows, the "average frequency" estimate will >>> correspond to less than one occurrence in the sample, so that this >>> rule is >>> totally useless to filter anything that we would otherwise consider >>> as an >>> MCV. I wonder if we shouldn't make it be "at least double the >>> estimated >>> average frequency". >>> >> >> Or possibly calculate the sample standard deviation and make use of >> that to help decide on a more flexible cutoff than twice the avg >> frequency? >> >> Are there any research papers that might help us here (I'm drowning >> in a sea of barely relevant search results for most phrases I've >> tried so far)? I recall there were some that Tom referenced when this >> stuff was originally written. >> >> On the other hand I do have access to some mathematicians >> specializing in statistics - so can get their thoughts on this issue >> if you feel it would be worthwhile. >> >> Cheers >> >> Mark >> >> > The standard deviation (sd) is proportional to the square root of the > number in the sample in a Normal Distribution. > > In a Normal Distribution, about 2/3 the values will be within plus or > minus one sd of the mean. > > There seems to be an implicit assumption that the distribution of > values follows the Normal Distribution - has this been verified? I > suspect that real data will have a skewed distribution of values, and > may even be multi modal (multiple peaks) The Normal Distribution has > one central peak with 2 tails of the same shape & size. > > So in a sample of 100 the sd is proportional to 10%, > for 10,000 the sd is proportional to 1%. 
> > So essentially, the larger the sample size the more reliable is > knowledge of the most common values (ignoring pathologically extreme > distributions!) - the measure of reliability depends on the distribution. > > How about selecting the cut off as the mean plus one sd, or something > of that nature? Note that the cut off point may result in no mcv's > being selected - especially for small samples. > > If practicable, it would be good to sample real datasets. Suggest > looking at datasets were the current mechanism looks reasonable, and > ones were the estimates are too far off. Also, if possible, try any > new selection method on the datasets and see what the difference is. > > The above is based on what I remember from my university statistics > papers, I took it up to 4th year level many moons ago. > > > With respect to the assumption of Normal distribution - it is easy to come up with an example that shows it need not be: consider our Pgbench 'accounts' table. The distribution of 'bid' values is not Normal (it is Uniform). Now I realized I made people think about Normal distributions by mentioning computing the standard deviation of the sample (and I probably should have said 'standard deviation of the *sample frequencies*') - but this was only a suggestion for working out how to (maybe) be less arbitrary about how we decide what values to put in the MCV list (currently 25% different from the mean frequency). regards Mark
Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From
Dean Rasheed
Date:
On 05/06/17 09:30, Tom Lane wrote: > First, I think we need a larger hard floor on the number of occurrences > of a value that're required to make ANALYZE decide it is a "most common > value"... > > Second, the code also has a rule that potential MCVs need to have an > estimated frequency at least 25% larger than what it thinks the "average" > value's frequency is. A rule of that general form seems like a good idea, > but I now think the 25% threshold is far too small to do anything useful. > In particular, in any case like this where there are more distinct values > than there are sample rows, the "average frequency" estimate will > correspond to less than one occurrence in the sample, so that this rule is > totally useless to filter anything that we would otherwise consider as an > MCV. I wonder if we shouldn't make it be "at least double the estimated > average frequency". > I think we should attempt to come up with a more principled approach to this, taking into account the table and sample sizes. Here's what I found, after a bit of research: A common initial rule of thumb is that the value should occur at least 10 times in the sample - see, for example [1], [2]. The idea behind that goes as follows: Suppose that N is the population size (total number of rows in the table), and n is the sample size, and that some particular candidate value appears x times in the sample. Then the "sample proportion" is given by p = x/n. Taking a different sample would, in general, give a different value for x, and hence for the sample proportion, so repeatedly taking different random samples of the table would lead to a distribution of values of p, and in general, this distribution would be a binomial distribution, since x is the count of sampled rows matching a yes/no question (does the row match the candidate value?). The idea behind the rule of thumb above is that if n is large enough, and x is not too close to 0 or n, then the binomial distribution of p can be reasonably approximated by a normal distribution. There are various rules of thumb for this to be true, but this one is actually one of the stronger ones, as can be seen in [2]. Technically the rule should be: x >= 10 and x <= n-10 but it's probably not worth bothering with the second test, since we're looking for MCVs. Note that this says nothing about the margin of error of p, just that it is reasonable to treat it as having a normal distribution, which then allows the margin of error to be analysed using standard techniques. The standard way of doing this is to calculate the "standard error" of the sample proportion - see, for example [3], [4]: SE = sqrt(p*(1-p)/n) Note, however, that this formula assumes that the sample size n is small compared to the population size N, which is not necessarily the case. This can be taken into account by applying the "finite population correction" (see, for example [5]), which involves multiplying by an additional factor: SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1)) This correction factor becomes 0 when n=N (whole table sampled => no error) and it approaches 1 when n is much smaller than N. Having computed the standard error of the sample proportion, it is then possible to establish a confidence interval around p. 
For example, one could say that there is a 95% probability that the total population proportion lies in the range [ pmin = p-2*SE, pmax = p+2*SE ]. This gives rise to the possibility of writing another rule for candidate MCVs: If there are Nd distinct values in the table, so that the average frequency of occurrence of any particular value is 1/Nd, then the test pmin > 1/Nd would imply that there is a 97.5% probability that the candidate value is more common than the average (since the 5% of the distribution of p outside the confidence interval is split evenly below pmin and above pmax). To take a concrete example, suppose that the table has N=1,000,000 rows, with Nd=500 distinct values, so that each value occurs on average 2000 times. If the sample size is 30,000 then the current rule that a value needs to have an estimated frequency 25% over the average means that a value would need to be seen 75 times to be considered as an MCV. If that rule were changed to "at least double the estimated average frequency", then a value would need to be seen at least 150 times. On the other hand, using the above rule based on a 95% confidence interval, a value would only need to be seen 78 times, in which case the estimated total number of occurrences of the value in the table would be 2600+/-579, and using the MCV estimate would almost certainly be better than not using it. Regards, Dean [1] https://onlinecourses.science.psu.edu/stat200/node/43 [2] https://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation [3] http://mathworld.wolfram.com/SampleProportion.html [4] https://en.wikipedia.org/wiki/Population_proportion [5] http://stattrek.com/statistics/dictionary.aspx?Definition=finite_population_correction [6] https://en.wikipedia.org/wiki/Margin_of_error
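A short Python sketch of the rule described above, which reproduces the numbers in the concrete example (N = 1,000,000 rows, Nd = 500 distinct values, n = 30,000 sampled rows). The function names are invented for the example.

import math

def standard_error(p, n, N):
    """Standard error of a sample proportion, with finite population correction."""
    return math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))

def is_mcv_candidate(x, n, N, Nd):
    """Accept a value seen x times if pmin = p - 2*SE exceeds the average
    frequency 1/Nd (roughly the 97.5% one-sided test described above)."""
    p = x / n
    return p - 2 * standard_error(p, n, N) > 1.0 / Nd

N, Nd, n = 1_000_000, 500, 30_000

# Smallest sample count that passes the rule:
x = next(x for x in range(1, n) if is_mcv_candidate(x, n, N, Nd))
print(x)                                                       # 78, as in the text

# Estimated total occurrences in the table, with its 95% margin of error:
p = x / n
print(round(p * N), round(2 * standard_error(p, n, N) * N))    # 2600 579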
Dean Rasheed <dean.a.rasheed@gmail.com> writes: > I think we should attempt to come up with a more principled approach > to this, taking into account the table and sample sizes. Here's what I > found, after a bit of research: Thanks for doing some legwork on this! > A common initial rule of thumb is that the value should occur at least > 10 times in the sample - see, for example [1], [2]. ... > Note that this says nothing about the margin of error of p, just that > it is reasonable to treat it as having a normal distribution, which > then allows the margin of error to be analysed using standard > techniques. Got it. Still, insisting on >= 10 occurrences feels a lot better to me than insisting on >= 2. It's interesting that that can be correlated to whether the margin of error is easily analyzed. > The standard way of doing this is to calculate the "standard error" of > the sample proportion - see, for example [3], [4]: > SE = sqrt(p*(1-p)/n) > Note, however, that this formula assumes that the sample size n is > small compared to the population size N, which is not necessarily the > case. This can be taken into account by applying the "finite > population correction" (see, for example [5]), which involves > multiplying by an additional factor: > SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1)) It's been a long time since college statistics, but that wikipedia article reminds me that the binomial distribution isn't really the right thing for our problem anyway. We're doing sampling without replacement, so that the correct model is the hypergeometric distribution. The article points out that the binomial distribution is a good approximation as long as n << N. Can this FPC factor be justified as converting binomial estimates into hypergeometric ones, or is it ad hoc? > This gives rise to the possibility of writing another rule for candidate MCVs: > If there are Nd distinct values in the table, so that the average > frequency of occurrence of any particular value is 1/Nd, then the test > pmin > 1/Nd Interesting indeed. We'd have to be working with our estimated Nd, of course, not the true Nd. We know empirically that the estimate usually lowballs the true value unless our sample is a fairly large fraction of the whole table. That would tend to make our cutoff pmin too high, so that we'd be excluding some candidate MCVs even though a better sample would show they almost certainly had a frequency more common than average. That behavior seems fine to me; it'd tend to drop questionable MCVs in small samples, which is exactly what I think we should be doing. But it is probably worth doing some empirical experiments with this rule to see what we get. regards, tom lane
Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
From
Dean Rasheed
Date:
On 11 June 2017 at 20:19, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The standard way of doing this is to calculate the "standard error" of >> the sample proportion - see, for example [3], [4]: >> SE = sqrt(p*(1-p)/n) >> Note, however, that this formula assumes that the sample size n is >> small compared to the population size N, which is not necessarily the >> case. This can be taken into account by applying the "finite >> population correction" (see, for example [5]), which involves >> multiplying by an additional factor: >> SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1)) > > It's been a long time since college statistics, but that wikipedia article > reminds me that the binomial distribution isn't really the right thing for > our problem anyway. We're doing sampling without replacement, so that the > correct model is the hypergeometric distribution. Yes that's right. > The article points out > that the binomial distribution is a good approximation as long as n << N. > Can this FPC factor be justified as converting binomial estimates into > hypergeometric ones, or is it ad hoc? No, it's not just ad hoc. It comes from the variance of the hypergeometric distribution [1] divided by the variance of a binomial distribution [2] with p=K/N, in the notation of those articles. This is actually a very widely used formula, used in fields like analysis of survey data, which is inherently sampling without replacement (assuming the questioners don't survey the same people more than once!). Regards, Dean [1] https://en.wikipedia.org/wiki/Hypergeometric_distribution [2] https://en.wikipedia.org/wiki/Binomial_distribution
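A quick numerical check of that claim, using illustrative values: the ratio of the hypergeometric variance to the binomial variance (with p = K/N) is exactly (N-n)/(N-1), whose square root is the finite population correction applied to the standard error.

def binomial_var(n, p):
    return n * p * (1 - p)

def hypergeometric_var(N, K, n):
    # Variance of the number of matches when drawing n rows without
    # replacement from a table of N rows containing K matching rows.
    p = K / N
    return n * p * (1 - p) * (N - n) / (N - 1)

N, K, n = 1_000_000, 2_000, 30_000
ratio = hypergeometric_var(N, K, n) / binomial_var(n, K / N)
print(ratio, (N - n) / (N - 1))      # identical: ~0.97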