Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys
Date
Msg-id CAEepm=0XGzmB4qHefEKExLGG=SEXHccoJPfsVCCCsrqnBYWMUQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Mar 8, 2017 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> I have been wondering about a couple of different worst case execution
>> strategies that would be better than throwing our hands up and
>> potentially exploding memory once we detect that further partitioning
>> is not going to help, if we still manage to reach that case despite
>> adding stats-based defences like this due to statistics being missing,
>> bad or confused by joins below us.
>
> Yeah, it would definitely be nice if we could constrain the runtime
> space consumption better.
>
>> 1.  We could probe the fraction of the hash table that we have managed
>> to load into work_mem so far and then rewind the outer batch and do it
>> again for the next work_mem-sized fraction of the inner batch and so
>> on.  For outer joins we'd need to scan for unmatched tuples after each
>> hash table refill.
>
> I do not understand how that works for a left join?  You'd need to track
> whether a given outer tuple has been matched in any one of the fractions
> of the inner batch, so that when you're done with the batch you could know
> which outer tuples need to be emitted null-extended.  Right now we only
> need to track that state for the current outer tuple, but in a rescan
> situation we'd have to remember it for each outer tuple in the batch.
>
> Perhaps it could be done by treating the outer batch file as read/write
> and incorporating a state flag in each tuple; or to reduce write volume,
> maintaining a separate outer batch file parallel to the main one with just
> a bool or even just a bit per outer tuple.  Seems messy though.

Right.  Messy.  I think what I described may fall under the category
of "block nested loop".  It looks doable but not very appealing for
left joins, and performance seems not great, multiplying the probing
scans by the number of fragments.  Whether we actually care about
performance at all when we've reached this emergency state and are
primarily concerned with completing the query I'm not entirely sure.

Another idea would be to identify the offending bucket (how?) and
spill it to disk in its own file, and track it by pushing a special
control object with a distinguishing header flag into the hash table
(or a new overflow table, or extend the duties of the skew table,
or...).  We'd have to deal with the matched flags of spilled inner
tuples for right/full joins.  Matching is really per-key, not
per-tuple, so if there is a control object in memory for each of these
"overflow" buckets then perhaps it could hold the matched flag that
covers all tuples with each distinct key.  What I like about this is
that is doesn't change the join algorithm at all, it just bolts on a
per-bucket escape valve.  The changes might be quite localised, though
I know someone who probably wouldn't like an extra branch in
ExecScanHashBucket().

-- 
Thomas Munro
http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [HACKERS] More stats about skipped vacuums
Next
From: Kyotaro HORIGUCHI
Date:
Subject: Re: Failed to delete old ReorderBuffer spilled files