Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date:
Msg-id: 20200626002232.33jhfr3s5sy3dmzx@development
In response to: Re: Avoiding hash join batch explosions with extreme skew and weird stats (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: Avoiding hash join batch explosions with extreme skew and weird stats
List: pgsql-hackers
On Thu, Jun 25, 2020 at 03:09:44PM -0700, Melanie Plageman wrote:
>On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> I started looking at the patch to refresh my knowledge both of this
>> patch and parallel hash join, but I think it needs a rebase. The
>> changes in 7897e3bb90 apparently touched some of the code.
>
>
>Thanks so much for the review, Tomas!
>
>I've attached a rebased patch which also contains updates discussed
>below.
>

Thanks.

>
>> I assume
>> you're working on a patch addressing the remaining TODOs, right?
>>
>
>I wanted to get some feedback on the patch before working through the
>TODOs to make sure I was on the right track.
>
>Now that you are reviewing this, I will focus all my attention
>on addressing your feedback. If there are any TODOs that you feel are
>most important, let me know, so I can start with those.
>
>Otherwise, I will prioritize parallel batch 0 spilling.

Feel free to work on the batch 0 spilling, please. I still need to get
familiar with various parts of the parallel hash join etc., so I don't
have any immediate feedback on which TODOs to work on first.

>David Kimura plans to do a bit of work on parallel hash join batch 0
>spilling tomorrow. Whatever is left after that, I will pick up next
>week. Parallel hash join batch 0 spilling is the last large TODO that I
>had.
>
>My plan was to then focus on the feedback (either about which TODOs are
>most important or outside of the TODOs I've identified) I get from you
>and anyone else who reviews this.
>

OK.

>>
>> I see you've switched to "stripe" naming - I find that a bit confusing,
>> because when I hear stripe I think about RAID, where it means pieces of
>> data interleaved and stored on different devices. But maybe that's just
>> me and it's a good name. Maybe it'd be better to keep the naming and
>> only tweak it at the end, not to disrupt reviews unnecessarily.
>>
>
>I hear you about "stripe". I still quite like it, especially as compared
>to its predecessor (originally, I called them chunks -- which is
>impossible given that SharedTuplestoreChunks are a thing).
>

I don't think using chunks in one place means we can't use it elsewhere
in a different context. I'm sure we have "chunks" in other places. But
let's not bikeshed on this too much.

>For ease of review, as you mentioned, I will keep the name for now. I am
>open to changing it later, though.
>
>I've been soliciting ideas for alternatives and, so far, folks have
>suggested "stride", "step", "flock", "herd", "cohort", and "school". I'm
>still on team "stripe" though, as it stands.
>

;-)

>
>>
>> nodeHash.c
>> ----------
>>
>>
>> 1) MultiExecPrivateHash says this
>>
>>    /*
>>     * Not subject to skew optimization, so either insert normally
>>     * or save to batch file if it belongs to another stripe
>>     */
>>
>> I wonder what it means to "belong to another stripe". I understand what
>> that means for batches, which are identified by batchno computed from
>> the hash value. But I thought "stripes" are just work_mem-sized pieces
>> of a batch, so I don't quite understand this. Especially when the code
>> does not actually check "which stripe" the row belongs to.
>>
>>
>I agree this was confusing.
>
>"belongs to another stripe" meant here that if batch 0 falls back and we
>are still loading it, once we've filled up work_mem, we need to start
>saving those tuples to a spill file for batch 0. I've changed the
>comment to this:
>
>-        * or save to batch file if it belongs to another stripe
>+       * or save to batch file if batch 0 falls back and we have
>+       * already filled the hashtable up to space_allowed.
>

OK. Silly question - what does "batch 0 falls back" mean? Does it mean
that we realized the hash table for batch 0 would not fit into work_mem,
so we switched to the "hashloop" strategy?
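
If I'm reading the new comment correctly, the decision is roughly this
(just a sketch of my understanding, not the actual patch code --
"batch0_fell_back" is a made-up placeholder for however the patch
actually tracks that state):

    if (batchno == 0 && batch0_fell_back &&
        hashtable->spaceUsed >= hashtable->spaceAllowed)
    {
        /* batch 0 fell back and the hash table is full, so spill */
        ExecHashJoinSaveTuple(tuple, hashvalue,
                              &hashtable->innerBatchFile[0]);
    }
    else
        ExecHashTableInsert(hashtable, slot, hashvalue);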

>
>> 2) I find the hashloop_fallback fields rather confusing. We have one in
>> HashJoinTable (and it's an array of BufFile items) and another one in
>> ParallelHashJoinBatch (this time just a bool).
>>
>> I think the one in HashJoinTable should be renamed to hashloopBatchFile
>> (similarly to the other BufFile arrays).
>
>
>I think you are right about the name. I've changed the name in
>HashJoinTableData to hashloopBatchFile.
>
>The hashloop_fallback array of BufFiles was only used by serial
>hashjoin. The boolean hashloop_fallback variable is used only by
>parallel hashjoin.
>
>The reason I had them named the same thing is that I thought it would be
>nice to have a variable with the same name to indicate if a batch "fell
>back" for both parallel and serial hashjoin--especially since we check
>it in the main hashjoin state machine used by parallel and serial
>hashjoin.
>
>In serial hashjoin, the BufFiles aren't identified by name, so I kept
>them in that array. In parallel hashjoin, each ParallelHashJoinBatch has
>the status saved (in the struct).
>So, both represented the fallback status of a batch.
>
>However, I agree with you, so I've renamed the serial one to
>hashloopBatchFile.
>

OK
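
Just so we're on the same page, after the rename I picture the two
fields roughly like this (a reduced sketch with only the fields we
discussed, everything else omitted, so don't take the comments as
gospel):

    #include <stdbool.h>

    typedef struct BufFile BufFile;     /* opaque, storage/buffile.h */

    typedef struct HashJoinTableData
    {
        /* serial hashjoin: per-batch fallback files (the bitmap?);
         * this is what used to be called hashloop_fallback */
        BufFile   **hashloopBatchFile;
    } HashJoinTableData;

    typedef struct ParallelHashJoinBatch
    {
        /* parallel hashjoin: does this batch use the hashloop
         * strategy? */
        bool        hashloop_fallback;
    } ParallelHashJoinBatch;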

>>
>> Although I'm not sure why we even need
>> this file when we have innerBatchFile? BufFiles are not exactly free;
>> in fact, they're one of the problems for hashjoins with many batches.
>>
>>
>Interesting -- it didn't even occur to me to combine the bitmap with the
>inner side batch file data.
>It definitely seems like a good idea to save the BufFile given that so
>little data will likely go in it and that it has a 1-1 relationship with
>inner side batches.
>
>How might it work? Would you reserve some space at the beginning of the
>file? When would you reserve the bytes (before adding tuples you won't
>know how many bytes you need, so it might be hard to make sure there is
>enough space.) Would all inner side files have space reserved or just
>fallback batches?
>

Oh! So the hashloopBatchFile is only used for the bitmap? I hadn't
realized that. In that case it probably makes sense to keep it separate
from the files with spilled tuples; interleaving them somehow would be
way too complex, I think.

However, do we need an array of those files? I thought we only need the
bitmap until we process all rows from each "stripe" and then we can
throw it away, right? Which would also mean we don't need to worry about
the memory usage too much, because the 8kB buffer will go away after
calling BufFileClose.
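
In other words, I'm imagining a per-batch lifecycle like this
(hand-wavy pseudocode -- BufFileCreateTemp/BufFileClose are the real
buffile.h API, the two helpers are made up for illustration):

    for (batchno = 0; batchno < nbatch; batchno++)
    {
        BufFile    *bitmapFile = NULL;

        if (batch_fell_back(batchno))       /* hypothetical */
            bitmapFile = BufFileCreateTemp(false);

        for (stripeno = 0; stripeno < nstripes(batchno); stripeno++)
        {
            /* load one work_mem-sized stripe of the inner batch,
             * then scan the outer side, setting a bit for each
             * outer tuple that finds a match */
        }

        if (bitmapFile != NULL)
        {
            /* emit unmatched outer tuples (for outer joins), then
             * throw the bitmap away, releasing its 8kB buffer */
            BufFileClose(bitmapFile);
        }
    }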


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


