Re: accounting for memory used for BufFile during hash joins - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: accounting for memory used for BufFile during hash joins
Date
Msg-id 20190507135912.qakpj74sv3ngrypz@development
Whole thread Raw
In response to Re: accounting for memory used for BufFile during hash joins  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: accounting for memory used for BufFile during hash joins
List pgsql-hackers
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
>On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>> >Seems expensive for large numbers of slices -- you need to join the
>> >outer batch against each inner slice.
>>
>> Nope, that's not how it works. It's the array of batches that gets
>> sliced, not the batches themselves.
>
>Sorry, I read only the description and not the code, and got confused
>about that.  So, I see three separate but related problems:
>
>A.  Broken escape valve:  sometimes we generate a huge number of
>batches while trying to split up many duplicates, because of the
>presence of other more uniformly distributed keys.  We could fix that
>with (say) a 95% rule.
>B.  Lack of good alternative execution strategy when the escape valve
>is triggered.  A batch cannot be split effectively, but cannot fit in
>work_mem, so for now we decide to ignore work_mem.
>C.  Unmetered explosion of batches and thus BufFiles, probably usually
>caused by problem A, but theoretically also due to a real need for
>partitions.
>

Right. I don't think a single solution addressing all those issues exists.
It's more likely we need multiple improvements.

>> >But I wonder how we'd deal with outer joins, as Tom Lane asked in
>> >another thread:
>> >
>> >https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>>
>> That seems unrelated - we slice the array of batches, to keep memory
>> needed for BufFile under control. The hash table remains intact, so
>> there's no issue with outer joins.
>
>Right, sorry, my confusion.  I thought you were describing
>https://en.wikipedia.org/wiki/Block_nested_loop.  (I actually think we
>can make that work for left outer joins without too much fuss by
>writing out a stream of match bits to a new temporary file.  Googling,
>I see that MySQL originally didn't support BNL for outer joins and
>then added some match flag propagation thing recently.)
>

Possibly, I'm not against implementing that, although I don't have very
good idea what the benefits of BNL joins are (performance-wise). In any
case, I think entirely unrelated to hash joins.

>> I agree we should relax the 0%/100% split condition, and disable the
>> growth sooner. But I think we should also re-evaluate that decision
>> after a while - the data set may be correlated in some way, in which
>> case we may disable the growth prematurely. It may not reduce memory
>> usage now, but it may help in the future.
>>
>> It's already an issue, but it would be even more likely if we disabled
>> growth e.g. with just 5%/95% splits.
>>
>> FWIW I believe this is mostly orthogonal issue to what's discussed in
>> this thread.
>
>But isn't problem A the root cause of problem C, in most cases?  There
>must also be "genuine" cases of problem C that would occur even if we
>fix that, of course: someone has small work_mem, and data that can be
>effectively partitioned to fit it, but it just takes a huge number of
>partitions to do it.  So that we don't behave badly in those cases, I
>agree with you 100%: we should fix the memory accounting to count
>BufFile overheads as you are proposing, and then I guess ideally
>switch to our alternative strategy (BNL or sort-merge or ...) when we
>see that BufFiles are wasting to much work_mem and its time to try
>something else.  It seems you don't actually have one of those cases
>here, though?
>

Maybe. Or maybe not. I don't have enough data to make such judgements
about the causes in general. We have one query from pgsql-performance.
There might be more, but IMO that's probably biased data set.

But even that reported query actually is not the case that A causes C.
The outer side of the hash join was significantly underestimated (34619
vs. 113478127) due to highly-correlated conditions.

And in that case it's trivial to cause nbatch explosion even with perfect
data sets with no duplicates (so no escape valve failure).


>I think we should fix problem A.  Then handle problem C by accounting
>for BufFiles, and figure out a way to switch to our alternative
>strategy (currently: ignore work_mem), when we think that creating
>more BufFiles will be futile (not sure exactly what the rule there
>should be).  And then work on fixing B properly with a good strategy.
>Here's a straw-man idea: we could adopt BNL, and then entirely remove
>our repartitioning code.  If the planner's number of partitions turns
>out to be not enough, we'll just handle it using BNL loops.
>

Yeah, something like that.

I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase.  That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again.  Rinse and repeat.

For C, I think we can use either of the two approaches I proposed. I like
the second option better, as it actually enforces work_mem. The first
option kinda helped with A too, although in different way, ana I think the
solution I outlined in the previous paragraph will work better.

No opinion regarding the switch to BNL, at the moment.

>> Switching to some other algorithm during execution moves the goal posts
>> to the next galaxy, I'm afraid.
>
>The main problem I'm aware of with sort-merge join is: not all that is
>hashable is sortable.  So BNL is actually the only solution I'm aware
>of for problem B that doesn't involve changing a fundamental thing
>about PostgreSQL's data type requirements.
>

Sure, each of those algorithms has limitations. But I think that's mostly
irrelevant to the main issue - switching between algorithms mid-execution.
At that point some of the tuples might have been already sent sent to the
other nodes, and I have no idea how to "resume" the tuple stream short of
buffering everything locally until the join completes. And that would be
rather terrible, I guess.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: PG12, PGXS and linking pgfeutils
Next
From: Tom Lane
Date:
Subject: Re: jsonpath