Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys
Date
Msg-id 4199.1487278280@sss.pgh.pa.us
Whole thread Raw
In response to Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I initially thought about driving the shutoff strictly from the estimate
>> of the MCV frequency, without involving the more general ndistinct
>> computation that estimate_hash_bucketsize does.  I'm not sure how much
>> that would do for your concern, but at least the MCV frequency doesn't
>> involve quite as much extrapolation as ndistinct.

> Hmm, so we could do something like: if the estimated frequency of the
> least-common MCV is enough to make one bucket overflow work_mem, then
> don't use a hash join?  That would still be prone to some error (in
> both directions, really) but it seems less likely to spit out
> completely stupid results than relying on ndistinct, which never gets
> very big even in a 10TB table.

No, it'd be the *most* common MCV, because we're concerned about the
worst-case (largest) bucket size.  But that's good, really, because the
highest MCV frequency will be the one we have most statistical
confidence in.  There's generally a whole lot of noise in the tail-end
MCV numbers.

Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
stats.  That would be an expected case if the column is believed unique,
and it's probably a better fallback behavior when we simply don't have
stats.  With the ndistinct-based rule, we'd be shutting off hashjoin
almost always when we don't have stats.  Given how long it took us
to recognize this problem, that's probably the wrong default.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Fabien COELHO
Date:
Subject: Re: [HACKERS] Small issue in online devel documentation build
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys