On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I initially thought about driving the shutoff strictly from the estimate
>>> of the MCV frequency, without involving the more general ndistinct
>>> computation that estimate_hash_bucketsize does. I'm not sure how much
>>> that would do for your concern, but at least the MCV frequency doesn't
>>> involve quite as much extrapolation as ndistinct.
>
>> Hmm, so we could do something like: if the estimated frequency of the
>> least-common MCV is enough to make one bucket overflow work_mem, then
>> don't use a hash join? That would still be prone to some error (in
>> both directions, really) but it seems less likely to spit out
>> completely stupid results than relying on ndistinct, which never gets
>> very big even in a 10TB table.
>
> No, it'd be the *most* common MCV, because we're concerned about the
> worst-case (largest) bucket size. But that's good, really, because the
> highest MCV frequency will be the one we have most statistical
> confidence in. There's generally a whole lot of noise in the tail-end
> MCV numbers.
Oh, right. That's reassuring, as it seems like it has a much better
chance of actually being right.
> Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
> stats. That would be an expected case if the column is believed unique,
> and it's probably a better fallback behavior when we simply don't have
> stats. With the ndistinct-based rule, we'd be shutting off hashjoin
> almost always when we don't have stats. Given how long it took us
> to recognize this problem, that's probably the wrong default.
Right.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company