Re: accounting for memory used for BufFile during hash joins - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: accounting for memory used for BufFile during hash joins
Date:
Msg-id: CA+hUKG+546WK=ya70QMs-oWE94BhsUKvqjr=ds_X_wkv7fP-2w@mail.gmail.com
In response to: Re: accounting for memory used for BufFile during hash joins (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: accounting for memory used for BufFile during hash joins (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Tue, May 7, 2019 at 3:15 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
> > Seems expensive for large numbers of slices -- you need to join the
> > outer batch against each inner slice.
>
> Nope, that's not how it works. It's the array of batches that gets
> sliced, not the batches themselves.

Sorry, I read only the description and not the code, and got confused
about that.  So, I see three separate but related problems:

A.  Broken escape valve: sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other, more uniformly distributed keys.  We could fix that
with (say) a 95% rule (rough sketch at the end of this mail).

B.  Lack of a good alternative execution strategy when the escape
valve is triggered.  A batch cannot be split effectively, but also
cannot fit in work_mem, so for now we decide to ignore work_mem.

C.  Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.

> > But I wonder how we'd deal with outer joins, as Tom Lane asked in
> > another thread:
> >
> > https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>
> That seems unrelated - we slice the array of batches, to keep memory
> needed for BufFile under control. The hash table remains intact, so
> there's no issue with outer joins.

Right, sorry, my confusion.  I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop.  (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file -- see the
last sketch below.  Googling, I see that MySQL originally didn't
support BNL for outer joins and then added some match flag propagation
thing recently.)

> I agree we should relax the 0%/100% split condition, and disable the
> growth sooner. But I think we should also re-evaluate that decision
> after a while - the data set may be correlated in some way, in which
> case we may disable the growth prematurely. It may not reduce memory
> usage now, but it may help in the future.
>
> It's already an issue, but it would be even more likely if we disabled
> growth e.g. with just 5%/95% splits.
>
> FWIW I believe this is mostly orthogonal issue to what's discussed in
> this thread.

But isn't problem A the root cause of problem C, in most cases?  There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it.  So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing (sketched below), and then I
guess ideally switch to our alternative strategy (BNL or sort-merge or
...) when we see that BufFiles are wasting too much work_mem and it's
time to try something else.  It seems you don't actually have one of
those cases here, though?

I think we should fix problem A.  Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem) when we think that creating more
BufFiles will be futile (not sure exactly what the rule there should
be).  And then work on fixing B properly with a good strategy.  Here's
a straw-man idea: we could adopt BNL, and then entirely remove our
repartitioning code.  If the planner's number of partitions turns out
to be not enough, we'll just handle it using BNL loops.
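To make the 95% rule from problem A concrete, here's an untested
sketch.  The threshold and function name are made up; it's loosely
modelled on the existing 0%/100% test in ExecHashIncreaseNumBatches()
in nodeHash.c, which today disables growth only when nfreed == 0 or
nfreed == ninmemory:

    #include <stdbool.h>

    /*
     * Untested sketch, not a patch.  Instead of requiring a perfectly
     * lopsided split before giving up, treat a split as futile when at
     * least 95% of the tuples landed on one side: repartitioning that
     * barely moves anything can't get us under work_mem.
     */
    #define FUTILE_SPLIT_FRACTION 0.95    /* made-up threshold */

    static bool
    split_is_futile(long ninmemory, long nfreed)
    {
        long    nkept = ninmemory - nfreed;

        return nkept >= (long) (FUTILE_SPLIT_FRACTION * ninmemory) ||
               nfreed >= (long) (FUTILE_SPLIT_FRACTION * ninmemory);
    }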
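For the accounting side of problem C, a back-of-the-envelope version
of the futility rule might look like the following.  All names are
invented, and it assumes one inner and one outer BufFile per batch,
each holding a BLCKSZ-sized buffer:

    #include <stdbool.h>
    #include <stddef.h>

    #define BLCKSZ 8192        /* normally from pg_config.h */

    /*
     * Sketch only: with nbatch batches we hold ~2 * nbatch BufFiles,
     * costing ~2 * nbatch * BLCKSZ of memory that is currently not
     * charged against work_mem.  Doubling nbatch roughly halves the
     * hashtable but doubles this overhead, so past the break-even
     * point more repartitioning makes total memory use worse.
     */
    static bool
    worth_doubling_nbatch(size_t hashtable_bytes, int nbatch)
    {
        size_t  overhead_now = 2 * (size_t) nbatch * BLCKSZ;

        /* After doubling: hashtable/2 in memory, but 2x the buffers. */
        return hashtable_bytes / 2 + 2 * overhead_now <
               hashtable_bytes + overhead_now;
    }

That inequality reduces to "keep repartitioning only while the BufFile
buffers cost less than half the hashtable", which may or may not be
the right rule, but it shows the shape of it.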
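And here's what I mean by the match-bit stream for making BNL handle
left outer joins: while joining the outer batch against inner chunk k,
OR a bit into a bitmap file for every outer tuple that finds a match;
after the last chunk, NULL-extend the tuples whose bit never got set.
This toy uses plain stdio rather than the real BufFile API, and all
names are invented:

    #include <stdio.h>
    #include <stdint.h>

    /* Record that outer tuple number 'tupno' found a match. */
    static void
    set_match_bit(FILE *bitmap, long tupno)
    {
        uint8_t byte = 0;

        fseek(bitmap, tupno / 8, SEEK_SET);
        if (fread(&byte, 1, 1, bitmap) != 1)
            byte = 0;                  /* past EOF: no bits set yet */
        byte |= (uint8_t) (1 << (tupno % 8));
        fseek(bitmap, tupno / 8, SEEK_SET);   /* reposition before write */
        fwrite(&byte, 1, 1, bitmap);
    }

    /* After the final inner chunk: emit the never-matched tuples. */
    static void
    emit_unmatched(FILE *bitmap, long ntuples)
    {
        for (long tupno = 0; tupno < ntuples; tupno++)
        {
            uint8_t byte = 0;

            fseek(bitmap, tupno / 8, SEEK_SET);
            if (fread(&byte, 1, 1, bitmap) != 1)
                byte = 0;
            if ((byte & (1 << (tupno % 8))) == 0)
                printf("outer tuple %ld: emit with NULLs\n", tupno);
        }
    }

The bitmap would be a per-batch temporary file opened in update mode
(tmpfile() in this toy), and it costs only one bit per outer tuple per
batch, so the propagation overhead seems small.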
> Switching to some other algorithm during execution moves the goal posts
> to the next galaxy, I'm afraid.

The main problem I'm aware of with sort-merge join is: not all that is
hashable is sortable.  So BNL is actually the only solution I'm aware
of for problem B that doesn't involve changing a fundamental thing
about PostgreSQL's data type requirements.

--
Thomas Munro
https://enterprisedb.com