Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers
From | Tomas Vondra
---|---
Subject | Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date |
Msg-id | 20190910131057.r7hhmynayz7k5r33@development
In response to | Re: Avoiding hash join batch explosions with extreme skew and weird stats (Melanie Plageman <melanieplageman@gmail.com>)
List | pgsql-hackers
On Fri, Sep 06, 2019 at 10:54:13AM -0700, Melanie Plageman wrote:
>On Thu, Sep 5, 2019 at 10:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
>> Seems like a good time for me to try to summarise what I think the
>> main problems are here:
>>
>> 1. The match-bit storage problem already discussed. The tuples that
>> each process receives while reading from SharedTupleStore are
>> non-deterministic (like other parallel scans). To use a bitmap-based
>> approach, I guess we'd need to invent some way to give the tuples a
>> stable identifier within some kind of densely packed number space that
>> we could use to address the bitmap, or take the IO hit and write all
>> the tuples back. That might involve changing the way SharedTupleStore
>> holds data.
>>
>
>This I've dealt with by adding a tuplenum to the SharedTupleStore
>itself which I atomically increment in sts_puttuple().
>In ExecParallelHashJoinPartitionOuter(), as each worker writes tuples
>to the batch files, they call sts_puttuple() and this increments the
>number so each tuple has a unique number.
>For persisting this number, I added the tuplenum to the meta data
>section of the MinimalTuple (along with the hashvalue -- there was a
>comment about this meta data that said it could be used for other
>things in the future, so this seemed like a good place to put it) and
>write that out to the batch file.
>
>At the end of ExecParallelHashJoinPartitionOuter(), I make the outer
>match status bitmap file. I use the final tuplenum count to determine
>the number of bytes to write to it. Each worker has a file with a
>bitmap which has the number of bytes required to represent the number
>of tuples in that batch.
>
>Because one worker may beat the other(s) and build the whole batch
>file for a batch before the others have a chance, I also make the
>outer match status bitmap file for workers who missed out in
>ExecParallelHashJoinOuterGetTuple() using the final tuplenum as well.
>

That seems like a perfectly sensible solution to me. I'm sure there are
ways to optimize it (say, having a bitmap optimized for sparse data, or
a bitmap shared by all the workers, or something like that), but that's
definitely not needed for v1.

Even having a bitmap per worker is pretty cheap. Assume we have 1B rows;
the bitmap is 1B/8 bytes = ~120MB per worker, so with 16 workers that's
~2GB, give or take. But with 100B rows (that is, ~100 bytes per row) the
original data is ~100GB. So the bitmaps are not free, but it's not
terrible either.

>>
>> 2. Tricky problems relating to barriers and flow control. First, let
>> me explain why PHJ doesn't support full/right outer joins yet. At
>> first I thought it was going to be easy, because, although the shared
>> memory hash table is read-only after it has been built, it seems safe
>> to weaken that only slightly and let the match flag be set by any
>> process during probing: it's OK if two processes clobber each other's
>> writes, as the only transition is a single bit going strictly from 0
>> to 1, and there will certainly be a full memory barrier before anyone
>> tries to read those match bits. Then during the scan for unmatched,
>> you just have to somehow dole out hash table buckets or ranges of
>> buckets to processes on a first-come-first-served basis. But.... then
>> I crashed into the following problem:
>>
>> * You can't begin the scan for unmatched tuples until every process
>> has finished probing (ie until you have the final set of match bits).
>> * You can't wait for every process to finish probing, because any
>> process that has emitted a tuple might never come back if there is
>> another node that is also waiting for all processes (ie deadlock
>> against another PHJ doing the same thing), and probing is a phase that
>> emits tuples.
>>
>> Generally, it's not safe to emit tuples while you are attached to a
>> Barrier, unless you're only going to detach from it, not wait at it,
>> because emitting tuples lets the program counter escape your control.
>> Generally, it's not safe to detach from a Barrier while accessing
>> resources whose lifetime it controls, such as a hash table, because
>> then it might go away underneath you.
>>
>> The PHJ plans that are supported currently adhere to that programming
>> rule and so don't have a problem: after the Barrier reaches the
>> probing phase, processes never wait for each other again so they're
>> free to begin emitting tuples. They just detach when they're done
>> probing, and the last to detach cleans up (frees the hash table etc).
>> If there is more than one batch, they detach from one batch and attach
>> to another when they're ready (each batch has its own Barrier), so we
>> can consider the batches to be entirely independent.
>>
>> There is probably a way to make a scan-for-unmatched-inner phase work,
>> possibly involving another Barrier or something like that, but I ran
>> out of time trying to figure it out and wanted to ship a working PHJ
>> for the more common plan types. I suppose PHLJ will face two variants
>> of this problem: (1) you need to synchronise the loops (you can't dump
>> the hash table in preparation for the next loop until all have
>> finished probing for the current loop), and yet you've already emitted
>> tuples, so you're not allowed to wait for other processes and they're
>> not allowed to wait for you, and (2) you can't start the
>> scan-for-unmatched-outer until all the probe loops belonging to one
>> batch are done. The first problem is sort of analogous to a problem I
>> faced with batches in the first place, which Robert and I found a
>> solution to by processing the batches in parallel, and could perhaps
>> be solved in the same way: run the loops in parallel (if that sounds
>> crazy, recall that every worker has its own quota of work_mem and the
>> data is entirely prepartitioned up front, which is why we are able to
>> run the batches in parallel; in contrast, single-batch mode makes a
>> hash table with a quota of nparticipants * work_mem). The second
>> problem is sort of analogous to the existing scan-for-unmatched-inner
>> problem that I haven't solved.
>>
>>
>I "solved" these problems for now by having all workers except for one
>detach from the outer batch file after finishing probing. The last
>worker to arrive does not detach from the batch and instead iterates
>through all of the workers' outer match status files (per-participant
>shared mem SharedTuplestoreParticipant) and creates a single unified
>bitmap. All the other workers continue to wait at the barrier until
>the sole remaining worker has finished iterating through the
>outer match status bitmap files.
>

Why did you put "solved" in quotation marks? This seems like a
reasonable solution to me, at least for now, but the quotation marks
kinda suggest you think it's either not correct or not good enough. Or
did I miss some flaw that makes this unacceptable?
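FWIW, here is roughly how I understand the scheme, as a simplified
standalone C sketch. This is not the actual patch code, and the names
(assign_tuplenum, combine_match_bitmaps, etc.) are just made up for
illustration -- in the patch the counter lives in the SharedTupleStore
and the bitmaps live in per-worker batch files rather than in memory:

/*
 * Simplified sketch of the match-bit scheme (not patch code): tuples get
 * a dense number from a shared atomic counter as they are written to the
 * outer batch, each worker keeps a bitmap sized from the final count, and
 * the last worker to finish probing ORs the per-worker bitmaps together.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct OuterBatchState
{
    /* shared counter, initialized with atomic_init(&batch->next_tuplenum, 0) */
    atomic_uint_least64_t next_tuplenum;
} OuterBatchState;

/* Assign a dense identifier to the next tuple written to the outer batch. */
static uint64_t
assign_tuplenum(OuterBatchState *batch)
{
    return atomic_fetch_add(&batch->next_tuplenum, 1);
}

/* One bit per tuple, sized from the final tuple count of the batch. */
static uint8_t *
make_match_bitmap(uint64_t ntuples)
{
    return calloc((ntuples + 7) / 8, 1);
}

/* During probing, any worker may flip a tuple's bit (only ever 0 -> 1). */
static void
set_match_bit(uint8_t *bitmap, uint64_t tuplenum)
{
    bitmap[tuplenum / 8] |= (uint8_t) (1u << (tuplenum % 8));
}

/* Last worker ORs all per-worker bitmaps into one unified bitmap. */
static void
combine_match_bitmaps(uint8_t *unified, uint8_t *const *worker_bitmaps,
                      int nworkers, uint64_t ntuples)
{
    size_t nbytes = (ntuples + 7) / 8;

    for (int w = 0; w < nworkers; w++)
        for (size_t i = 0; i < nbytes; i++)
            unified[i] |= worker_bitmaps[w][i];
}

Obviously the real thing has to go through the buffered batch files
rather than in-memory arrays, but I think that's the gist of it.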
>Admittedly, I'm still fighting with this step a bit, but my intent is
>to have all the backends wait until the lone remaining worker has
>created the unified bitmap; then that worker, which is still attached
>to the outer batch, will scan the outer batch file and the unified
>outer match status bitmap and emit unmatched tuples.
>

Makes sense, I think.

The one "issue" this probably has is that it serializes the last step,
i.e. the search for unmatched tuples is done in a single process,
instead of being parallelized over multiple workers. That's certainly
unfortunate, but is that really an issue in practice? Probably not for
queries with just a small number of unmatched tuples. And for cases
with many unmatched rows it's probably going to degrade to the
non-parallel case.

>I thought that the other workers can move on and stop waiting at the
>barrier once the lone remaining worker has scanned their outer match
>status files. All the probe loops would be done, and the worker that
>is emitting tuples is not referencing the inner side hashtable at all,
>only the outer batch file and the combined bitmap.
>

Why would the workers need to wait for the lone worker to scan their
bitmap files? Or do the files disappear with the workers, or something
like that?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
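PS: To make sure I'm reading that last step right, this is the loop I
imagine the lone remaining worker running once the unified bitmap
exists. Again just an illustrative plain-C sketch, not patch code, with
emit_unmatched_tuple standing in for whatever the executor actually
does with the tuple:

/*
 * Sketch of the scan-for-unmatched-outer step (not patch code): walk the
 * outer batch in tuplenum order and emit every tuple whose match bit was
 * never set during probing.
 */
#include <stdint.h>

static int
match_bit_is_set(const uint8_t *bitmap, uint64_t tuplenum)
{
    return (bitmap[tuplenum / 8] >> (tuplenum % 8)) & 1;
}

static void
emit_unmatched_outer(const uint8_t *unified_bitmap, uint64_t ntuples,
                     void (*emit_unmatched_tuple) (uint64_t tuplenum))
{
    for (uint64_t tuplenum = 0; tuplenum < ntuples; tuplenum++)
    {
        if (!match_bit_is_set(unified_bitmap, tuplenum))
            emit_unmatched_tuple(tuplenum);
    }
}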