Re: Avoiding hash join batch explosions with extreme skew and weirdstats - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Avoiding hash join batch explosions with extreme skew and weirdstats
Date
Msg-id 20190516225427.uvqthiocuibkgh2s@development
Whole thread Raw
In response to Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Fri, May 17, 2019 at 10:21:56AM +1200, Thomas Munro wrote:
>On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> I think this is a step in the right direction, but as I said on the other
>> thread(s), I think we should not disable growth forever and recheck once
>> in a while. Otherwise we'll end up in sad situation with non-uniform data
>> sets, as poined out by Hubert Zhang in [1]. It's probably even truer with
>> this less strict logic, using 95% as a threshold (instead of 100%).
>>
>> I kinda like the idea with increasing the spaceAllowed value. Essentially,
>> if we decide adding batches would be pointless, increasing the memory
>> budget is the only thing we can do anyway.
>
>But that's not OK, we need to fix THAT.
>

I agree increasing the budget is not ideal, althought at the moment it's
the only thing we can do. If we can improve that, great.

>> The problem however is that we only really look at a single bit - it may
>> be that doubling the batches would not help, but doing it twice would
>> actually reduce the memory usage. For example, assume there are 2 distinct
>> values in the batch, with hash values (in binary)
>
>Yes, that's a good point, and not a case that we should ignore.  But
>if we had a decent fall-back strategy that respected work_mem, we
>wouldn't care so much if we get it wrong in a corner case.  I'm
>arguing that we should use Grace partitioning as our primary
>partitioning strategy, but fall back to looping (or possibly
>sort-merging) for the current batch if Grace doesn't seem to be
>working.  You'll always be able to find cases where if you'd just
>tried one more round, Grace would work, but that seems acceptable to
>me, because getting it wrong doesn't melt your computer, it just
>probably takes longer.  Or maybe it doesn't.  How much longer would it
>take to loop twice?  Erm, twice as long, and each loop makes actual
>progress, unlike extra speculative Grace partition expansions which
>apply not just to the current batch but all batches, might not
>actually work, and you *have* to abandon at some point.  The more I
>think about it, the more I think that a loop-base escape valve, though
>unpalatably quadratic, is probably OK because we're in a sink-or-swim
>situation at this point, and our budget is work_mem, not work_time.
>

True.

>I'm concerned that we're trying to find ways to treat the symptoms,
>allowing us to exceed work_mem but maybe not so much, instead of
>focusing on the fundamental problem, which is that we don't yet have
>an algorithm that is guaranteed to respect work_mem.
>

Yes, that's a good point.

>Admittedly I don't have a patch, just a bunch of handwaving.  One
>reason I haven't attempted to write it is because although I know how
>to do the non-parallel version using a BufFile full of match bits in
>sync with the tuples for outer joins, I haven't figured out how to do
>it for parallel-aware hash join, because then each loop over the outer
>batch could see different tuples in each participant.  You could use
>the match bit in HashJoinTuple header, but then you'd have to write
>all the tuples out again, which is more IO than I want to do.  I'll
>probably start another thread about that.
>

That pesky parallelism ;-)


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Next
From: Tom Lane
Date:
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats