Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Avoiding hash join batch explosions with extreme skew and weird stats
Msg-id 20190520002211.5uapffoz2tvxnqyh@development
In response to Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Avoiding hash join batch explosions with extreme skew and weird stats
List pgsql-hackers
On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:
>On Sat, May 18, 2019 at 12:15 PM Melanie Plageman
><melanieplageman@gmail.com> wrote:
>> On Thu, May 16, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>> Admittedly I don't have a patch, just a bunch of handwaving.  One
>>> reason I haven't attempted to write it is because although I know how
>>> to do the non-parallel version using a BufFile full of match bits in
>>> sync with the tuples for outer joins, I haven't figured out how to do
>>> it for parallel-aware hash join, because then each loop over the outer
>>> batch could see different tuples in each participant.  You could use
>>> the match bit in HashJoinTuple header, but then you'd have to write
>>> all the tuples out again, which is more IO than I want to do.  I'll
>>> probably start another thread about that.
>>
>> Could you explain more about the implementation you are suggesting?
>>
>> Specifically, what do you mean "BufFile full of match bits in sync with the
>> tuples for outer joins?"
>
>First let me restate the PostgreSQL terminology for this stuff so I
>don't get confused while talking about it:
>
>* The inner side of the join = the right side = the side we use to
>build a hash table.  Right and full joins emit inner tuples when there
>is no matching tuple on the outer side.
>
>* The outer side of the join = the left side = the side we use to
>probe the hash table.  Left and full joins emit outer tuples when
>there is no matching tuple on the inner side.
>
>* Semi and anti joins emit exactly one instance of each outer tuple if
>there is/isn't at least one match on the inner side.
>
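To illustrate the "match bits in sync with the tuples" idea from earlier in the thread, here is a rough sketch in plain Python (a simplification, not the actual C implementation: a bytearray stands in for the on-disk BufFile of match bits, and each inner "fragment" stands in for one loop over a piece of an oversized inner batch):

```python
# Sketch: left outer join of one outer batch against several inner
# fragments, remembering matches across loops in a match-bit array
# kept in sync with the outer tuples.

def loop_join_left_outer(outer_batch, inner_fragments, join_key):
    match_bits = bytearray(len(outer_batch))   # one flag per outer tuple
    results = []
    for fragment in inner_fragments:           # one loop per inner fragment
        hash_table = {}
        for inner_tuple in fragment:           # build phase
            hash_table.setdefault(join_key(inner_tuple), []).append(inner_tuple)
        for i, outer_tuple in enumerate(outer_batch):   # probe phase
            for inner_tuple in hash_table.get(join_key(outer_tuple), []):
                results.append((outer_tuple, inner_tuple))
                match_bits[i] = 1              # match survives across loops
    # Only after the *last* loop do we know which outer tuples never
    # matched anywhere; emit those null-extended.
    for i, outer_tuple in enumerate(outer_batch):
        if not match_bits[i]:
            results.append((outer_tuple, None))
    return results
```

The point is that the unmatched-outer emission has to wait until every inner fragment has been probed, which is why the match state must persist (on disk, in sync with the outer batch) between loops.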

I think you're conflating the inner/outer side with left/right, or rather
assuming the inner side is always the right one and the outer side the
left one.

> ... snip ...
>
>> Could you make an outer "batch" which is the whole of the outer relation? That
>> is, could you do something like: when hashing the inner side, if re-partitioning
>> results in batches that will overflow spaceAllowed, could you set a use_NLJ
>> flag on that batch, and when making batches for the outer side, make one
>> "batch" containing all the outer-side tuples that the flagged inner batch
>> will do NLJ with?
>
>I didn't understand this... you always need to make one outer batch
>corresponding to every inner batch.  The problem is the tricky
>left/full/anti/semi join cases when joining against fragments holding
>less than the full inner batch: we still need some way to implement
>join logic that depends on knowing whether there is a match in *any*
>of the inner fragments/loops.
>
>About the question of when exactly to set the "use_NLJ" flag:  I had
>originally been thinking of this only as a way to deal with the
>extreme skew problem.  But in light of Tomas's complaints about
>unmetered per-batch memory overheads, I had a new thought: it should
>also be triggered whenever doubling the number of batches would halve
>the amount of memory left for the hash table (after including the size
>of all those BufFile objects in the computation as Tomas proposes).  I
>think that might be exactly the right cut-off if you want to do
>as much Grace partitioning as your work_mem can afford, and therefore
>as little looping as possible to complete the join while respecting
>work_mem.
>

Not sure what NLJ flag rule you propose, exactly.

Regarding the threshold value - once the space for BufFiles (and other
overhead) gets over work_mem/2, it does not make any sense to increase
the number of batches because then the work_mem would be entirely
occupied by BufFiles.

The WIP patches don't actually do exactly that though - they just check
if the incremented size would be over work_mem/2. I think we should
instead allow up to work_mem*2/3, i.e. stop adding batches after the
BufFiles start consuming more than work_mem/3 memory.

I think that's actually what you mean by "halving the amount of memory
left for the hash table" because that's what happens after reaching the
work_mem/3.
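To make the arithmetic concrete (the sizes below are assumptions picked purely for illustration, not PostgreSQL's actual values):

```python
# Illustration of the work_mem/3 cut-off. Each batch needs a BufFile
# with a fixed in-memory buffer, so doubling the number of batches
# doubles the BufFile overhead and shrinks what's left for the hash
# table itself.

WORK_MEM = 4 * 1024 * 1024     # 4 MB work_mem, assumed for the example
BUFFILE_SIZE = 8192            # assumed per-BufFile buffer size

def buffile_overhead(nbatches):
    return nbatches * BUFFILE_SIZE

def hash_table_space(nbatches):
    return WORK_MEM - buffile_overhead(nbatches)

def should_stop_doubling(nbatches):
    # Once BufFiles already consume work_mem/3, doubling pushes them to
    # 2/3 of work_mem, halving the hash table's share (2/3 -> 1/3).
    return buffile_overhead(2 * nbatches) > (2 * WORK_MEM) / 3
```

With these assumed numbers the overhead crosses work_mem/3 at roughly 170 batches, so increasing the batch count beyond that point just trades hash-table space for file buffers.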

But I think that rule is irrelevant here, really, because this thread
was discussing cases where adding batches is futile due to skew, no? In
which case we should stop adding batches after reaching some % of tuples
not moving from the batch.
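Such a skew-based stopping rule might look roughly like this (a sketch only: the modulo-based batch assignment and the 90% threshold are simplifying assumptions, not what the hash join code actually does):

```python
# Sketch: decide whether splitting a batch further is futile because of
# skew, i.e. almost all of its tuples would stay in the same batch after
# doubling nbatches. Modulo stands in for the real batch-number
# computation, and the threshold is an arbitrary placeholder.

def splitting_is_futile(batch_tuples, nbatches, hash_value, batchno,
                        threshold=0.9):
    staying = sum(1 for t in batch_tuples
                  if hash_value(t) % (2 * nbatches) == batchno)
    return staying / len(batch_tuples) > threshold
```

With heavy skew (many duplicates of one hash value) the tuples move as a block, so the fraction staying is near 0 or 1 and doubling the batch count cannot shrink the batch.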

Or are you suggesting we should remove that rule, and instead rely on
this rule about halving the hash table space? That might work too, I
guess.

OTOH I'm not sure it's a good idea to handle both those cases the same
way - the "overflow file" idea works pretty well for cases where the hash
table actually can be split into batches, and I'm afraid NLJ will be
much less efficient for those cases.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


