Re: accounting for memory used for BufFile during hash joins - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: accounting for memory used for BufFile during hash joins
Date:
Msg-id: 20190507135912.qakpj74sv3ngrypz@development
In response to: Re: accounting for memory used for BufFile during hash joins (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: accounting for memory used for BufFile during hash joins
List: pgsql-hackers
On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
>On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>> >Seems expensive for large numbers of slices -- you need to join the
>> >outer batch against each inner slice.
>>
>> Nope, that's not how it works. It's the array of batches that gets
>> sliced, not the batches themselves.
>
>Sorry, I read only the description and not the code, and got confused
>about that. So, I see three separate but related problems:
>
>A. Broken escape valve: sometimes we generate a huge number of
>batches while trying to split up many duplicates, because of the
>presence of other more uniformly distributed keys. We could fix that
>with (say) a 95% rule.
>B. Lack of a good alternative execution strategy when the escape valve
>is triggered. A batch cannot be split effectively, but cannot fit in
>work_mem, so for now we decide to ignore work_mem.
>C. Unmetered explosion of batches and thus BufFiles, probably usually
>caused by problem A, but theoretically also due to a real need for
>partitions.
>

Right. I don't think a single solution addressing all those issues
exists. It's more likely we need multiple improvements.

>> >But I wonder how we'd deal with outer joins, as Tom Lane asked in
>> >another thread:
>> >
>> >https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us
>>
>> That seems unrelated - we slice the array of batches, to keep the memory
>> needed for BufFile under control. The hash table remains intact, so
>> there's no issue with outer joins.
>
>Right, sorry, my confusion. I thought you were describing
>https://en.wikipedia.org/wiki/Block_nested_loop. (I actually think we
>can make that work for left outer joins without too much fuss by
>writing out a stream of match bits to a new temporary file. Googling,
>I see that MySQL originally didn't support BNL for outer joins and
>then added some match flag propagation thing recently.)
>

Possibly. I'm not against implementing that, although I don't have a
very good idea what the benefits of BNL joins are (performance-wise).
In any case, I think it's entirely unrelated to hash joins.

>> I agree we should relax the 0%/100% split condition, and disable the
>> growth sooner. But I think we should also re-evaluate that decision
>> after a while - the data set may be correlated in some way, in which
>> case we may disable the growth prematurely. It may not reduce memory
>> usage now, but it may help in the future.
>>
>> It's already an issue, but it would be even more likely if we disabled
>> growth e.g. with just 5%/95% splits.
>>
>> FWIW I believe this is a mostly orthogonal issue to what's discussed in
>> this thread.
>
>But isn't problem A the root cause of problem C, in most cases? There
>must also be "genuine" cases of problem C that would occur even if we
>fix that, of course: someone has small work_mem, and data that can be
>effectively partitioned to fit it, but it just takes a huge number of
>partitions to do it. So that we don't behave badly in those cases, I
>agree with you 100%: we should fix the memory accounting to count
>BufFile overheads as you are proposing, and then I guess ideally
>switch to our alternative strategy (BNL or sort-merge or ...) when we
>see that BufFiles are wasting too much work_mem and it's time to try
>something else. It seems you don't actually have one of those cases
>here, though?
>

Maybe. Or maybe not. I don't have enough data to make such judgements
about the causes in general. We have one query from pgsql-performance.
There might be more, but IMO that's probably a biased data set.

But even that reported query is actually not a case of A causing C.
The outer side of the hash join was significantly underestimated
(34619 vs. 113478127) due to highly-correlated conditions. And in such
cases it's trivial to cause an nbatch explosion even with perfect data
sets with no duplicates (so no escape valve failure).

>I think we should fix problem A. Then handle problem C by accounting
>for BufFiles, and figure out a way to switch to our alternative
>strategy (currently: ignore work_mem), when we think that creating
>more BufFiles will be futile (not sure exactly what the rule there
>should be). And then work on fixing B properly with a good strategy.
>Here's a straw-man idea: we could adopt BNL, and then entirely remove
>our repartitioning code. If the planner's number of partitions turns
>out to be not enough, we'll just handle it using BNL loops.
>

Yeah, something like that. I think we can fix A by relaxing the escape
valve condition, and then rechecking it once in a while. So we fill
work_mem, realize the split didn't actually reduce the batch size
significantly, and disable nbatch growth. But at the same time we
increase the threshold to 2x work_mem, and after reaching it we
"consider" an nbatch increase. That is, we walk the batch and see how
many tuples would move if we increased nbatch (that should be fairly
cheap) - if it helps, great, we enable growth and split the batch. If
not, we double the threshold again. Rinse and repeat.
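To make that a bit more concrete, here is a rough sketch of that loop in
plain C. This is a toy model, not executor code - ToyBatch,
fraction_moved_if_split() and consider_nbatch_growth() are names invented
just for illustration, work_mem is modelled as a simple tuple count, and
plain hash % nbatch partitioning is assumed (the flag plays roughly the
role of the existing growEnabled escape valve):

/*
 * Toy sketch, NOT PostgreSQL code - names and thresholds are invented,
 * work_mem is modelled as a plain tuple-count limit, and batches are
 * assumed to use simple hash % nbatch partitioning.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyBatch
{
    unsigned   *hashes;     /* hash values of tuples currently in memory */
    int         ntuples;
} ToyBatch;

/*
 * Walk the batch and compute the fraction of tuples that would move to a
 * different batch if nbatch were doubled - the "fairly cheap" check above.
 */
static double
fraction_moved_if_split(const ToyBatch *batch, int nbatch)
{
    int         moved = 0;

    for (int i = 0; i < batch->ntuples; i++)
    {
        /* with 2*nbatch batches, the tuple would land in batchno + nbatch
         * exactly when this quotient is odd */
        if ((batch->hashes[i] / nbatch) % 2 != 0)
            moved++;
    }

    return (batch->ntuples > 0) ? (double) moved / batch->ntuples : 0.0;
}

/*
 * Called when the current batch exceeds *allowed_tuples. Returns true if
 * the caller should double nbatch and repartition; otherwise the limit is
 * doubled and we keep filling the batch, re-checking next time we hit it.
 */
static bool
consider_nbatch_growth(const ToyBatch *batch, int nbatch,
                       size_t *allowed_tuples, bool *growth_enabled)
{
    double      moved = fraction_moved_if_split(batch, nbatch);

    /* relaxed escape valve: split only if a meaningful fraction moves */
    if (moved >= 0.05 && moved <= 0.95)
    {
        *growth_enabled = true;     /* re-enable growth and split */
        return true;
    }

    /* splitting looks futile: disable growth and double the threshold */
    *growth_enabled = false;
    *allowed_tuples *= 2;
    return false;
}

The 5%/95% cutoffs and the doubling factor are just placeholders, of
course - the point is that the re-check is a single cheap pass over the
in-memory batch, so we can afford to repeat it every time the (raised)
threshold is hit.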
For C, I think we can use either of the two approaches I proposed. I
like the second option better, as it actually enforces work_mem. The
first option kinda helped with A too, although in a different way, and
I think the solution I outlined in the previous paragraph will work
better.

No opinion regarding the switch to BNL, at the moment.

>> Switching to some other algorithm during execution moves the goal posts
>> to the next galaxy, I'm afraid.
>
>The main problem I'm aware of with sort-merge join is: not all that is
>hashable is sortable. So BNL is actually the only solution I'm aware
>of for problem B that doesn't involve changing a fundamental thing
>about PostgreSQL's data type requirements.
>

Sure, each of those algorithms has limitations. But I think that's
mostly irrelevant to the main issue - switching between algorithms
mid-execution. At that point some of the tuples might have already been
sent to the other nodes, and I have no idea how to "resume" the tuple
stream short of buffering everything locally until the join completes.
And that would be rather terrible, I guess.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services