Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys
Date
Msg-id CA+TgmoZCFcNs+BuTziv+zY=MAZ2yqtR=0Kq5305s4QaRxCOqiw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys
List pgsql-hackers
On Thu, Feb 16, 2017 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Feb 16, 2017 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I initially thought about driving the shutoff strictly from the estimate
>>> of the MCV frequency, without involving the more general ndistinct
>>> computation that estimate_hash_bucketsize does.  I'm not sure how much
>>> that would do for your concern, but at least the MCV frequency doesn't
>>> involve quite as much extrapolation as ndistinct.
>
>> Hmm, so we could do something like: if the estimated frequency of the
>> least-common MCV is enough to make one bucket overflow work_mem, then
>> don't use a hash join?  That would still be prone to some error (in
>> both directions, really) but it seems less likely to spit out
>> completely stupid results than relying on ndistinct, which never gets
>> very big even in a 10TB table.
>
> No, it'd be the *most* common MCV, because we're concerned about the
> worst-case (largest) bucket size.  But that's good, really, because the
> highest MCV frequency will be the one we have most statistical
> confidence in.  There's generally a whole lot of noise in the tail-end
> MCV numbers.

Oh, right.  That's reassuring, as it seems like it has a much better
chance of actually being right.

> Also, I'd be inclined to do nothing (no shutoff) if we have no MCV
> stats.  That would be an expected case if the column is believed unique,
> and it's probably a better fallback behavior when we simply don't have
> stats.  With the ndistinct-based rule, we'd be shutting off hashjoin
> almost always when we don't have stats.  Given how long it took us
> to recognize this problem, that's probably the wrong default.

Right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys
Next
From: David Christensen
Date:
Subject: [HACKERS] [PATCH] Add pg_disable_checksums() and supporting infrastructure