Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
Date
Msg-id CA+TgmoaryvpaEh_XEjVMPBw3-ULezbxDbM_1eO0cstXZwK25TA@mail.gmail.com
In response to Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Mon, Jun 5, 2017 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think that a single count in a 30K-row sample is noise by definition.

Not if the table only *has* 30K rows.  Or even 100K.  And anyway we're
talking about what to do with a value we hit at least twice, which is
not quite the same thing.

> As long as they actually are MCVs, sure.  The problem I've got with the
> current behavior is that it manufactures a spiky distribution where there
> is none.  That leads directly to bad estimates, as shown in Marko's
> example.  We'd be much better off, both as to planning time and accuracy,
> if we'd concluded that the table had no MCVs.

That is so, but that example is also constructed to show the case
where the current code falls down.  The case where the distribution
actually is spiky but the frequency is near the minimum that ANALYZE
can detect isn't tested by that example.

>> Another way to state it is: is this problem one-sided?
>
> You know as well as I do that there's no free lunch in this area.
> Anything we change at all will make things worse for somebody, if only
> by accident.

Yep.

> But I do not think that choosing a bunch of values entirely
> at random and claiming (incorrectly) that they are more common than other
> values in the table can possibly lead to better results except by
> accident.

But accidents happen, and sometimes they happen frequently.  I
maintain that the root of this problem is that you want to pretend
that a 30k-row sample on (in this instance) a 100M-row table ought to
be enough to reliably produce good results, and I think
that's doomed.  If you tweak the algorithm to remove what is in this
case noise, you'll just break some other case instead.  Maybe it'll
take somebody a few years to find and post an example of such a case,
but surely you can see that there must be cases where most ANALYZE
runs will find 2 of some value precisely because it is a lot more
common than average.
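
To make that concrete, here's a quick binomial sanity check (just a
sketch; the 100M-row table and the example frequencies are assumptions
for illustration, not measurements) of when "seen at least twice in a
30k-row sample" is real signal rather than noise:

# Probability that a value with true frequency f is sampled at least
# twice in an n-row sample, under a simple binomial model of sampling.
def p_at_least_two(f, n=30000):
    p0 = (1 - f) ** n                    # value never sampled
    p1 = n * f * (1 - f) ** (n - 1)      # value sampled exactly once
    return 1 - p0 - p1

# A value holding 0.01% of a 100M-row table (10,000 rows): usually seen twice.
print(p_at_least_two(1e-4))   # ~0.80
# A value holding only 100 of those 100M rows: essentially never seen twice.
print(p_at_least_two(1e-6))   # ~0.0004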

I think what we ought to do is install some kind of auto-scaling
behavior so that when we discover that a table contains a very large
number of rows, we use a higher statistics target.  Admittedly there's
some circularity problems there -- if we haven't sampled the table
yet, we don't know how wide the rows are.  But we could estimate based
on the on-disk size for the first ANALYZE and based on the
previously-observed row width thereafter, and probably improve things
measurably.
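
As a rough sketch of the kind of scaling I mean (the curve, the
constants, and the helper name here are assumptions, not a worked-out
proposal; the 10000 cap matches the current maximum statistics target):

import math

def auto_stats_target(est_rows, base_target=100):
    # Leave small tables at the base target; grow the target by one base
    # unit per decade of rows beyond one million, capped at 10000.
    if est_rows <= 1_000_000:
        return base_target
    decades = math.log10(est_rows / 1_000_000)
    return min(int(base_target * (1 + decades)), 10000)

# The first ANALYZE would estimate est_rows from the on-disk size and an
# assumed row width; later runs could reuse the previously observed width.
print(auto_stats_target(30_000))        # 100 (unchanged)
print(auto_stats_target(100_000_000))   # 300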

Now if, in conjunction with that, we also get more aggressive about
filtering out low-order MCVs, fine.  I think there are plenty of small
tables where the low-order MCVs make very little difference and
chucking them will do nothing but save planning time.  But I believe
that for very large tables, shortening the MCV list will hurt -- in
some cases, probably a lot -- so I'd like to see some compensating
change to avoid that.

> I'm not by any means wedded to the proposition that we have to fix it
> simply by changing the filter rule.  One idea that seems worth considering
> is to keep a track list that's a bit longer than the maximum allowed
> number of MCVs, and then to say that we accept only MCVs whose counts are
> significantly greater than what we find at the tail of the list.  I'm not
> sure what "significantly" should be exactly, but surely a distribution
> as flat as the ones I was showing upthread should be a red flag.

I'm not sure exactly which bit upthread (or on the other thread)
you're referring to here, but I'm worried that your thinking is being
excessively shaped by the example presently in front of us.
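
For reference, my reading of the track-list idea, as a sketch (the 1.5x
"significantly greater" margin is an assumption, and choosing it well is
exactly the open question):

def filter_mcv_candidates(track, max_mcvs, margin=1.5):
    # track: (value, sample_count) pairs sorted by count descending,
    # gathered somewhat longer than the MCV list we actually keep.
    if len(track) <= max_mcvs:
        return track
    tail_count = track[max_mcvs][1]      # count just past the cutoff
    return [(v, c) for v, c in track[:max_mcvs] if c >= margin * tail_count]

# A flat distribution (everything seen twice) yields no MCVs at all,
# while a genuinely spiky one keeps only the spikes.
flat = [("v%d" % i, 2) for i in range(120)]
spiky = [("hot", 900), ("warm", 350)] + [("v%d" % i, 2) for i in range(120)]
print(len(filter_mcv_candidates(flat, 100)))    # 0
print(len(filter_mcv_candidates(spiky, 100)))   # 2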

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
