Re: accounting for memory used for BufFile during hash joins - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: accounting for memory used for BufFile during hash joins
Date:
Msg-id: CA+hUKG+546WK=ya70QMs-oWE94BhsUKvqjr=ds_X_wkv7fP-2w@mail.gmail.com
In response to: Re: accounting for memory used for BufFile during hash joins (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: accounting for memory used for BufFile during hash joins (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
> > Seems expensive for large numbers of slices -- you need to join the
> > outer batch against each inner slice.
>
> Nope, that's not how it works. It's the array of batches that gets
> sliced, not the batches themselves.

Sorry, I read only the description and not the code, and got confused
about that.  So, I see three separate but related problems:

A.  Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other, more uniformly distributed keys.  We could fix that
with (say) a 95% rule (rough sketch at the end of this mail).

B.  Lack of a good alternative execution strategy when the escape
valve is triggered.  A batch cannot be split effectively, but also
cannot fit in work_mem, so for now we decide to ignore work_mem.

C.  Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.

> > But I wonder how we'd deal with outer joins, as Tom Lane asked in
> > another thread:
> >
> > https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>
> That seems unrelated - we slice the array of batches, to keep memory
> needed for BufFile under control. The hash table remains intact, so
> there's no issue with outer joins.

Right, sorry, my confusion.  I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop.  (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file -- see the
last sketch below.  Googling, I see that MySQL originally didn't
support BNL for outer joins and then added some match flag propagation
thing recently.)

> I agree we should relax the 0%/100% split condition, and disable the
> growth sooner. But I think we should also re-evaluate that decision
> after a while - the data set may be correlated in some way, in which
> case we may disable the growth prematurely. It may not reduce memory
> usage now, but it may help in the future.
>
> It's already an issue, but it would be even more likely if we disabled
> growth e.g. with just 5%/95% splits.
>
> FWIW I believe this is mostly orthogonal issue to what's discussed in
> this thread.

But isn't problem A the root cause of problem C, in most cases?  There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it.  So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing (sketched below), and then I
guess ideally switch to our alternative strategy (BNL or sort-merge or
...) when we see that BufFiles are wasting too much work_mem and it's
time to try something else.  It seems you don't actually have one of
those cases here, though?

I think we should fix problem A.  Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem) when we think that creating more
BufFiles will be futile (not sure exactly what the rule there should
be).  And then work on fixing B properly with a good strategy.  Here's
a straw-man idea: we could adopt BNL, and then entirely remove our
repartitioning code.  If the planner's number of partitions turns out
to be not enough, we'll just handle it using BNL loops.
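To make the 95% rule from problem A concrete, here's an untested
sketch.  The threshold and function name are made up; it's loosely
modelled on the existing 0%/100% test in ExecHashIncreaseNumBatches()
in nodeHash.c, which today disables growth only when nfreed == 0 or
nfreed == ninmemory:

    #include <stdbool.h>

    /*
     * Untested sketch, not a patch.  Instead of requiring a perfectly
     * lopsided split before giving up, treat a split as futile when at
     * least 95% of the tuples landed on one side: repartitioning that
     * barely moves anything can't get us under work_mem.
     */
    #define FUTILE_SPLIT_FRACTION 0.95    /* made-up threshold */

    static bool
    split_is_futile(long ninmemory, long nfreed)
    {
        long    nkept = ninmemory - nfreed;

        return nkept >= (long) (FUTILE_SPLIT_FRACTION * ninmemory) ||
               nfreed >= (long) (FUTILE_SPLIT_FRACTION * ninmemory);
    }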
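For the accounting side of problem C, a back-of-the-envelope version
of the futility rule might look like the following.  All names are
invented, and it assumes one inner and one outer BufFile per batch,
each holding a BLCKSZ-sized buffer:

    #include <stdbool.h>
    #include <stddef.h>

    #define BLCKSZ 8192        /* normally from pg_config.h */

    /*
     * Sketch only: with nbatch batches we hold ~2 * nbatch BufFiles,
     * costing ~2 * nbatch * BLCKSZ of memory that is currently not
     * charged against work_mem.  Doubling nbatch roughly halves the
     * hashtable but doubles this overhead, so past the break-even
     * point more repartitioning makes total memory use worse.
     */
    static bool
    worth_doubling_nbatch(size_t hashtable_bytes, int nbatch)
    {
        size_t  overhead_now = 2 * (size_t) nbatch * BLCKSZ;

        /* After doubling: hashtable/2 in memory, but 2x the buffers. */
        return hashtable_bytes / 2 + 2 * overhead_now <
               hashtable_bytes + overhead_now;
    }

That inequality reduces to "keep repartitioning only while the BufFile
buffers cost less than half the hashtable", which may or may not be
the right rule, but it shows the shape of it.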
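And here's what I mean by the match-bit stream for making BNL handle
left outer joins: while joining the outer batch against inner chunk k,
OR a bit into a bitmap file for every outer tuple that finds a match;
after the last chunk, NULL-extend the tuples whose bit never got set.
This toy uses plain stdio rather than the real BufFile API, and all
names are invented:

    #include <stdio.h>
    #include <stdint.h>

    /* Record that outer tuple number 'tupno' found a match. */
    static void
    set_match_bit(FILE *bitmap, long tupno)
    {
        uint8_t byte = 0;

        fseek(bitmap, tupno / 8, SEEK_SET);
        if (fread(&byte, 1, 1, bitmap) != 1)
            byte = 0;                  /* past EOF: no bits set yet */
        byte |= (uint8_t) (1 << (tupno % 8));
        fseek(bitmap, tupno / 8, SEEK_SET);   /* reposition before write */
        fwrite(&byte, 1, 1, bitmap);
    }

    /* After the final inner chunk: emit the never-matched tuples. */
    static void
    emit_unmatched(FILE *bitmap, long ntuples)
    {
        for (long tupno = 0; tupno < ntuples; tupno++)
        {
            uint8_t byte = 0;

            fseek(bitmap, tupno / 8, SEEK_SET);
            if (fread(&byte, 1, 1, bitmap) != 1)
                byte = 0;
            if ((byte & (1 << (tupno % 8))) == 0)
                printf("outer tuple %ld: emit with NULLs\n", tupno);
        }
    }

The bitmap would be a per-batch temporary file opened in update mode
(tmpfile() in this toy), and it costs only one bit per outer tuple per
batch, so the propagation overhead seems small.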
> Switching to some other algorithm during execution moves the goal posts
> to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable.  So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.

--
Thomas Munro
https://enterprisedb.com