Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"? - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"? |
Date | |
Msg-id | CA+TgmoaryvpaEh_XEjVMPBw3-ULezbxDbM_1eO0cstXZwK25TA@mail.gmail.com |
In response to | Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"? (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On Mon, Jun 5, 2017 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think that a single count in a 30K-row sample is noise by definition.

Not if the table only *has* 30K rows. Or even 100K. And anyway we're
talking about what to do with a value we hit at least twice, which is
not quite the same thing.

> As long as they actually are MCVs, sure. The problem I've got with the
> current behavior is that it manufactures a spiky distribution where there
> is none. That leads directly to bad estimates, as shown in Marko's
> example. We'd be much better off, both as to planning time and accuracy,
> if we'd concluded that the table had no MCVs.

That is so, but that example is also constructed to show the case where
the current code falls down. The case where the distribution actually is
spiky but the frequency is near the minimum that ANALYZE can detect isn't
tested by that example.

>> Another way to state it is: is this problem one-sided?
>
> You know as well as I do that there's no free lunch in this area.
> Anything we change at all will make things worse for somebody, if only
> by accident.

Yep.

> But I do not think that choosing a bunch of values entirely
> at random and claiming (incorrectly) that they are more common than other
> values in the table can possibly lead to better results except by
> accident.

But accidents happen, and sometimes they happen frequently.

I maintain that the root of this problem is that you want to pretend like
a 30k row sample on (in this instance) a 100m row table ought to be
expected to be enough to reliably produce good results, and I think
that's doomed. If you tweak the algorithm to remove what is in this case
noise, you'll just break some other case instead. Maybe it'll take
somebody a few years to find and post an example of such a case, but
surely you can see that there must be cases where most ANALYZE runs will
find 2 of some value precisely because it is a lot more common than
average.

I think what we ought to do is install some kind of auto-scaling behavior
so that when we discover that a table contains a very large number of
rows, we use a higher statistics target. Admittedly there's some
circularity problems there -- if we haven't sampled the table yet, we
don't know how wide the rows are. But we could estimate based on the
on-disk size for the first ANALYZE and based on the previously-observed
row width thereafter, and probably improve things measurably.

Now if, in conjunction with that, we also get more aggressive about
filtering out low-order MCVs, fine. I think there are plenty of small
tables where the low-order MCVs make very little difference and chucking
them will do nothing but save planning time. But I believe that for very
large tables, shortening the MCV list will hurt -- in some cases,
probably a lot -- so I'd like to see some compensating change to avoid
that.
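To be concrete about the auto-scaling part, something along these lines
is what I'm picturing. This is only a sketch -- the function name, the
10M-row knee, the log-based growth rate, and the 10x ceiling are all
placeholders I made up, not a worked-out proposal:

    #include <math.h>

    /*
     * Hypothetical sketch only -- nothing like this exists today.  Scale
     * the effective statistics target up with the estimated row count, so
     * that very large tables get a proportionally larger sample.  The
     * 10M-row knee, the log10 growth rate, and the 10x ceiling are
     * made-up numbers for illustration.
     */
    static int
    scaled_statistics_target(double est_reltuples, int base_target)
    {
        double  target = base_target;

        /* Grow the target by one base_target per decade beyond 10M rows. */
        if (est_reltuples > 10000000.0)
            target = base_target * (1.0 + log10(est_reltuples / 10000000.0));

        /* Arbitrary ceiling so a huge table can't make ANALYZE unbearable. */
        if (target > 10.0 * base_target)
            target = 10.0 * base_target;

        return (int) target;
    }

With a base target of 100, a 100m-row table would end up with a target of
200 and a 10-billion-row table with 400. The exact curve is negotiable;
the point is just that the sample ought to grow with the table.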
> I'm not by any means wedded to the proposition that we have to fix it
> simply by changing the filter rule. One idea that seems worth considering
> is to keep a track list that's a bit longer than the maximum allowed
> number of MCVs, and then to say that we accept only MCVs whose counts are
> significantly greater than what we find at the tail of the list. I'm not
> sure what "significantly" should be exactly, but surely a distribution
> as flat as the ones I was showing upthread should be a red flag.

I'm not sure exactly which bit upthread (or on the other thread) you're
referring to here, but I'm worried that your thinking is being
excessively shaped by the example presently in front of us.
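To make sure I'm reading the track-list idea the way you intend, here's
roughly what I picture it meaning. The function name, the array
representation, and the 2x "significantly greater" multiplier are
placeholders of mine, since the multiplier is exactly the part you say is
unsettled:

    /*
     * Sketch of the "longer track list" filter rule -- not a concrete
     * proposal.  track_counts[] holds candidate sample counts sorted in
     * descending order, track_len of them (a bit more than num_mcv, the
     * number of MCVs we're allowed to keep).  The 2x threshold is made up.
     */
    static int
    filter_mcvs(const int *track_counts, int track_len, int num_mcv)
    {
        int     tail_count = track_counts[track_len - 1];
        int     keep = 0;

        /* Accept only candidates clearly more common than the list's tail. */
        while (keep < num_mcv && keep < track_len &&
               track_counts[keep] >= 2 * tail_count)
            keep++;

        /*
         * A perfectly flat distribution has every count equal to the tail,
         * so nothing passes the test and we keep no MCVs at all.
         */
        return keep;
    }

A rule like that does throw out the flat distributions you showed
upthread, but with any fixed multiplier it also throws out a genuinely
spiky distribution whose counts sit just above the detection floor, which
is the case I keep coming back to.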
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company