Re: accounting for memory used for BufFile during hash joins - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: accounting for memory used for BufFile during hash joins
Date 2019-07-11 15:40:11
Msg-id 20190711154011.lmkp7jyd3e2zfcnx@development
In response to Re: accounting for memory used for BufFile during hash joins  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: accounting for memory used for BufFile during hash joins  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
>Okay, so, while I do have specific, actual code review/commitfest-y
>feedback for the patch in this thread registered for this commitfest,
>I wanted to defer that for a later email and use this one to cover off
>on a few higher level issues.
>
>1) How this patch's approach fits into the wider set of problems with
>hybrid hashjoin.
>
>2) Parallel HashJoin implementation of this patch's approach
>
>I think implementing support for parallel hashjoin or explicitly
>disabling it would be the bare minimum for this patch, which is why I
>made 2 its own item. I've marked it as returned to author for this
>reason.
>

OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe adding
BNL on top.

>I do think that accounting for BufFile overhead when estimating the
>size of the hashtable during ExecChooseHashTableSize(), so it can be
>used during planning, is a worthwhile patch by itself (though I know it
>is not even part of this patch).
>

+1 to that
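
Just to illustrate, here's a rough sketch of that accounting (simplified,
standalone C, not the actual ExecChooseHashTableSize() interface; the
BLCKSZ-sized buffer per open BufFile is an assumption):

#include <stddef.h>

#define BLCKSZ 8192             /* assumed buffer size per open BufFile */

/*
 * Sketch: double nbatch until one batch of inner tuples plus the
 * per-batch file buffers (inner and outer side) fit in work_mem, but
 * stop as soon as doubling would no longer reduce the total -- at that
 * point the buffers dominate and more batches only make things worse.
 */
static int
choose_nbatch(double inner_bytes, size_t work_mem_bytes)
{
    int     nbatch = 1;

    for (;;)
    {
        double  cur = inner_bytes / nbatch + 2.0 * nbatch * BLCKSZ;
        double  nxt = inner_bytes / (2.0 * nbatch) + 4.0 * nbatch * BLCKSZ;

        if (cur <= work_mem_bytes || nxt >= cur)
            break;

        nbatch *= 2;
    }

    return nbatch;
}

With a sufficiently bad underestimate this never gets under work_mem at
all, which is exactly the case where slicing (or BNL) has to take over.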

>I'll start with 2 since I have less to say there.
>
>From comments upthread, I take it this would not work with parallel
>hashjoin as expected. Is this because each worker operates on batches
>independently and now batches are lumped into slices?
>
>Thinking through a parallel-aware implementation, it seems like you
>would use slice-based barriers for the build phase but batch-based
>barriers for the probe phase to avoid getting out of sync (workers
>with outer tuples from one batch should not try and join those with
>tuples from another batch, even if in the same slice).
>
>You would, of course, need to add code to make slices work with
>SharedTuplestore--caveat here is I still haven't tried to understand
>how parallel-aware hashjoin works/uses SharedTuplestore.
>

I don't know. I haven't thought about the parallel version very much. I
wonder if Thomas Munro has some thoughts about it ...

>Now, addressing 1, how this patch fits into the wider set of problems
>with current hybrid hashjoin:
>
>Thomas Munro nicely summarized roughly what I'm about to lay out like
>this (upthread) -- he called them "three separate but related
>problems":
>
>> A.  Broken escape valve:  sometimes we generate a huge number of
>> batches while trying to split up many duplicates, because of the
>> presence of other more uniformly distributed keys.  We could fix that
>> with (say) a 95% rule.
>> B.  Lack of good alternative execution strategy when the escape valve
>> is triggered.  A batch cannot be split effectively, but cannot fit in
>> work_mem, so for now we decide to ignore work_mem.
>> C.  Unmetered explosion of batches and thus BufFiles, probably usually
>> caused by problem A, but theoretically also due to a real need for
>> partitions.
>
>However, I would like to lay out the problem space a little bit
>differently (using the end of the alphabet to differentiate).
>
>The following scenarios are how you could end up running out of
>memory:
>
>Y. Plan-time underestimation of the number of required batches with
>relatively uniform data distribution
>
>In this case, the best join execution strategy is a plain hashjoin
>with spilling as needed.
>nbatches should be increased as needed, because the data is ~evenly
>distributed.
>Slicing should be employed when the BufFile overhead exceeds some
>threshold for the fraction of work_mem that may be used for BufFile
>overhead.
>

OK, makes sense. But at some point we get so many slices that the overflow
files alone use more than work_mem. Of course, to hit that the
underestimate needs to be sufficiently serious. My understanding was that
we'd keep slicing until that point and then switch to BNL.
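
To put rough numbers on it (assuming the usual 8kB buffer per open BufFile
and the default 4MB work_mem): the 2/3 limit allows ~340 per-batch files,
so one slice covers ~340 batches, and since each slice adds just one
overflow file, the slice files alone exceed work_mem only at ~500 slices,
i.e. somewhere around 170k batches. That takes a pretty spectacular
underestimate.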

>Z. Plan- and/or execution-time underestimation of the number of
>required batches with skewed data
>
>If you knew this at planning time, you could have picked another join
>type, though there might be cases where it would actually be less
>costly to use plain hashjoin for all batches except the bad batch and
>fall back to hash block nested loop join just for the duplicate
>values.
>
>If you could not have known this at planning time, the best join
>execution strategy is a hybrid hashjoin/hash block nested loop join.
>
>To do this, preview whether increasing nbatches would move tuples and,
>if it would, increase nbatches (also employing slicing if the BufFile
>overhead exceeds the threshold).
>
>If increasing nbatches wouldn't move tuples, process this batch with
>hash block nested loop join.
>

OK.
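
For reference, the shape of that fallback as I understand it (just a
sketch with made-up load_next_chunk()/probe_chunk() helpers, not the
actual patch from [1]; outer-join match bookkeeping is ignored):

#include <stdio.h>
#include <stddef.h>

typedef struct HashChunk HashChunk; /* hypothetical hash table over one chunk */

extern HashChunk *load_next_chunk(FILE *inner_batch, size_t work_mem);
extern void probe_chunk(HashChunk *chunk, FILE *outer_batch);
extern void free_chunk(HashChunk *chunk);

/*
 * Process one overflow batch in work_mem-sized chunks of the inner
 * side, rescanning the outer batch file once per chunk.
 */
static void
hash_block_nested_loop(FILE *inner_batch, FILE *outer_batch, size_t work_mem)
{
    HashChunk  *chunk;

    while ((chunk = load_next_chunk(inner_batch, work_mem)) != NULL)
    {
        rewind(outer_batch);    /* outer batch gets scanned once per chunk */
        probe_chunk(chunk, outer_batch);
        free_chunk(chunk);
    }
}

The price is the repeated scans of the outer batch file, but the memory
use stays bounded no matter how skewed the batch is.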

>Essentially, what we want is logical units of tuples which are
>work_mem-sized. In some cases, each unit may contain multiple batches
>(a slice in Tomas' patch) and in other cases, each unit may contain
>only part of a batch (a chunk is the term I used in my hash block
>nested loop join patch [1]).
>

OK, although with slicing the work_mem-sized unit is still one batch. The
slice just ensures the metadata we need to keep in memory does not grow as
O(N) with the number of batches (instead it's O(log(N)) I think).
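
Roughly, the batch -> slice mapping looks something like this (a
simplified sketch of the idea, not the patch verbatim; max_files stands
for however many per-batch files we're willing to keep open at once):

/*
 * Batches are grouped into slices of consecutive batches; only the
 * current slice keeps per-batch files, the other batches share one
 * overflow file per slice.  Each time nbatch doubles past max_files
 * the slices get twice as wide, so the number of slice files (and the
 * metadata needed for them) stays bounded.
 */
static inline int
batch_to_slice(int batchno, int nbatch, int max_files)
{
    int     batches_per_slice = 1;

    while (nbatch / batches_per_slice > max_files)
        batches_per_slice *= 2;

    return batchno / batches_per_slice;
}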

>For slicing, each unit, a slice, has multiple batches but one spill
>file.
>For hbnlj, each unit, a chunk, is one of multiple chunks in a single
>batch, all of which are in the same spill file (1 batch = 1 spill
>file).
>

Yep.

>Thinking through it, it seems to make the most sense to split the work
>into ~ 3 independent pieces:
>
>patch1 - "preview" a batch increase (not yet written [I think])
>patch2 - slicing (Tomas' patch [2], plus a threshold for the portion of
>work_mem the BufFile overhead may use)
>patch3 - hash block nested loop join (my patch [1])
>
>patch1 allows us to re-enable growth and was mentioned upthread, but I
>will quote it here for simplicity:
>
>> I think we can fix A by relaxing the escape valve condition, and then
>> rechecking it once in a while. So we fill work_mem, realize it didn't
>> actually reduce the batch size significantly and disable nbatch growth.
>> But at the same time we increase the threshold to 2x work_mem, and after
>> reaching it we "consider" a nbatch increase.  That is, we walk the batch
>> and see how many tuples would move if we increased nbatch (that should be
>> fairly cheap) - if it helps, great, enable growth and split the batch. If
>> not, double the threshold again.  Rinse and repeat.
>
>We don't want to fill up work_mem with BufFile overhead by increasing
>nbatches many times just to move a few tuples for one batch, and then
>end up disabling growth so that later we can't increase nbatches and
>repartition a batch that would partition nicely (like Hubert's case, I
>believe [3]).
>

Yes, this seems like a very reasonable plan. Also, I now see it actually
explains what the plan with BNL vs. slicing is.
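
FWIW, the "preview" walk could be as cheap as this sketch (using the
simplified batchno = hash & (nbatch - 1) scheme with nbatch a power of
two; next_tuple_hash() is a made-up iterator over the in-memory batch,
and the 5% cut-off just echoes the "95% rule" mentioned upthread):

#include <stdint.h>
#include <stdbool.h>

extern bool next_tuple_hash(void *batch, uint32_t *hashvalue);

/*
 * Walk the current batch and count how many tuples would move if
 * nbatch doubled.  With batchno = hash & (nbatch - 1), a tuple moves
 * to (batchno + nbatch) exactly when the next hash bit is set.
 */
static bool
doubling_nbatch_helps(void *batch, uint32_t nbatch, double ntuples)
{
    uint32_t    hashvalue;
    double      moved = 0;

    while (next_tuple_hash(batch, &hashvalue))
    {
        if (hashvalue & nbatch)
            moved++;
    }

    /* re-enable nbatch growth only if a meaningful fraction would move */
    return moved > 0.05 * ntuples;
}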

>We want to identify when re-partitioning would help and only do it
>then and, for times when it wouldn't help, use a fallback strategy
>that still allows progress on the hashjoin, and, for some spiky data,
>where we have re-partitioned for the right reasons, but there are
>still a lot of batches that are small enough that they could all fit
>in memory at once, we want to track them with as little overhead as
>possible -- lump them into slices.
>
>We should probably consider deciding to use slices based on some
>threshold for the portion of work_mem which is allowed to be occupied
>by BufFile overhead, instead of waiting until the BufFile overhead is
>literally taking up most of work_mem.
>

But that heuristic is already there, no? That's the "Don't use more than
2*work_mem/3 for batch BufFiles" rule, at which point we start adding
slices.
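
In code terms the trigger is roughly this (a sketch, assuming one
BLCKSZ-sized buffer per open batch file):

#include <stddef.h>
#include <stdbool.h>

#define BLCKSZ 8192             /* assumed buffer size per open BufFile */

static bool
should_use_slices(int nbatch, size_t work_mem_bytes)
{
    /* start grouping batches into slices once the per-batch file
     * buffers would eat more than 2/3 of work_mem */
    return (size_t) nbatch * BLCKSZ > 2 * work_mem_bytes / 3;
}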

>The above summary is to address the concern in this thread about a
>holistic solution.
>
>I think the slicing patch is independent of both the hash block nested
>loop join patch and the "preview" mode for batch increasing.
>
>If slicing is made to work for parallel-aware hashjoin and the code is
>in a committable state (and probably has the threshold I mentioned
>above), then I think that this patch should go in.
>

Yes, I think this seems like a good plan.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



