From: Thomas Munro
Subject: Re: [HACKERS] WIP: [[Parallel] Shared] Hash
Date:
Msg-id: CAEepm=2PRCtpo6UL4RxSbp=OXpyty0dg3oT3Vyk0eb=r8JwZhg@mail.gmail.com
In response to: Re: [HACKERS] WIP: [[Parallel] Shared] Hash (Rafia Sabih <rafia.sabih@enterprisedb.com>)
List: pgsql-hackers
On Thu, Feb 2, 2017 at 4:57 PM, Rafia Sabih <rafia.sabih@enterprisedb.com> wrote:
> On Thu, Feb 2, 2017 at 1:19 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Thu, Feb 2, 2017 at 3:34 AM, Rafia Sabih
>> <rafia.sabih@enterprisedb.com> wrote:
>>> [ regressions ]
>>
>> Thanks Rafia.  At first glance this plan is using the Parallel Shared
>> Hash in one place where it should pay off, that is loading the orders
>> table, but the numbers are terrible.  I noticed that it uses batch
>> files and then has to increase the number of batch files, generating
>> a bunch of extra work, even though it apparently overestimated the
>> number of rows, though that's only ~9 seconds of ~60.  I am
>> investigating.
>
> Hi Thomas,
> Apart from the previously reported regression, there appears to be one
> more issue in this set of patches.  At times, when running a query using
> parallel hash, it hangs up and all the workers including the master show
> the following backtrace,

Here's a new version to fix the problems reported by Rafia above.  The
patch descriptions are as before, but the series starts from 0002 because
0001 was committed as 7c5d8c16 (thanks, Andres).

First, some quick master-vs-patch numbers for the queries reported as
regressions, using TPCH dbgen scale 10, work_mem = 64MB,
max_parallel_workers_per_gather = 4, shared_buffers = 8GB (the numbers
themselves are not directly comparable to the earlier report because the
scale and hardware differ).  Better except for Q5 and Q8, which for some
mysterious reason plan only one worker and then lose.  I'm looking into
that.

  Q3   19917.682 ->  8649.822
  Q5    4149.983 ->  4192.551
  Q7   14453.721 -> 10303.911
  Q8    1981.540 ->  8030.264
  Q9   26928.102 -> 17384.607
  Q10  16955.240 -> 14563.787

I plan to explore the performance space with a range of worker numbers and
work_mem sizes and do some analysis; more soon.

Changes:

1.  Fixed two bugs that resulted in ExecHashShrink sometimes hanging, as
reported by Rafia.  (1) When splitting the large v3 patch up into smaller
patches for v4, I'd managed to lose the line that initialises
shared->shrink_barrier, causing some occasional strange behaviour.  (2) I
found a bug[1] in condition_variable.c that could cause hangs; I fixed
that via a separate patch, and the fix was committed as 3f3d60d3 (thanks,
Robert).

2.  Simplified barrier.c by removing BarrierWaitSet(), because that turned
out to be unnecessary to implement rescan as I'd originally thought, and
was incompatible with the way BarrierDetach() works.  The latter assumes
that the phase only ever increments, so that combination of features was
broken.

3.  Sorted out the hash table sizing logic that was previously leading to
some strange decisions about batches.  This involved putting the total
estimated number of inner rows into the path and plan when there is a
partial inner plan, because plan_rows only has the partial number.  I need
to size the hash table correctly at execution time.  It seems a bit
strange to do that specifically and only for Hash (see rows_total in the
0008 patch)...  should there be some more generic way?  Should total rows
go into Plan rather than HashPlan, or perhaps the parallel divisor should
go somewhere?  (A rough sketch of the arithmetic I mean follows after this
list.)

4.  Comments fixed and added based on Ashutosh's feedback on patch 0003.

5.  Various small bug fixes.
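To make point 3 a bit more concrete, here is a minimal sketch of the kind
of arithmetic involved.  It is illustrative only, not code from the patch:
the function name and the numbers are invented, and the leader-contribution
formula is just the parallel divisor costsize.c already applies to partial
paths.

    #include <stdio.h>

    /*
     * Illustrative only: plan_rows on a partial path is a per-worker
     * estimate, but a shared hash table must be sized for all inner rows,
     * so the total has to be remembered (rows_total) or re-derived.
     */
    static double
    sketch_parallel_divisor(int parallel_workers)
    {
        double      divisor = parallel_workers;
        double      leader_contribution;

        /* the leader also helps, but less as the worker count grows */
        leader_contribution = 1.0 - (0.3 * parallel_workers);
        if (leader_contribution > 0)
            divisor += leader_contribution;
        return divisor;
    }

    int
    main(void)
    {
        double      partial_plan_rows = 250000.0;   /* per-worker estimate */
        int         workers = 4;

        printf("size the shared hash table for ~%.0f rows\n",
               partial_plan_rows * sketch_parallel_divisor(workers));
        return 0;
    }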
I've also attached a small set of test queries that hit the four "modes"
(for want of a better word) of our hash join algorithm for dealing with
different memory conditions, which I've nicknamed thus:

1.  "Good": We estimate that the hash table will fit in work_mem, and at
execution time it does.  This patch makes that more likely because
[Parallel] Shared Hash gets to use more work_mem as discussed.

2.  "Bad": We estimate that the hash table won't fit in work_mem, but that
if we partition it into N batches using some bits from the hash value (the
batch arithmetic is sketched a little further down) then each batch will
fit in work_mem.  At execution time, each batch does indeed fit into
work_mem.  This is not ideal, because we have to write out and read back
in 1 - (1 / N) of the inner and outer tuples (i.e. all batches except the
first one, although actually costsize.c always charges for all of them).
But it may still be better than other plans, and the IO is sequential.
Currently Shared Hash shouldn't be selected over (private) Hash if it
would require batching anyway, because of the cpu_shared_tuple_cost
tie-breaker: on the one hand it avoids a bunch of copies of the batch
files being written out, but on the other it introduces a bunch of
synchronisation overhead.  Parallel Shared Hash is fairly likely to be
chosen if possible, because dividing the inner relation's cost among
workers outweighs cpu_shared_tuple_cost.

3.  "Ugly": We planned for "good" or "bad" mode, but we ran out of
work_mem at some point during execution: this could be during the initial
hash table load, or while loading a subsequent batch.  So now we double
the number of batches, splitting the current batch and all batches that
haven't been processed yet into two, in the hope of shrinking the hash
table, while generating extra reading and writing of all as-yet
unprocessed tuples.  This patch can do the shrinking work in parallel,
which may help.

4.  "Fail": After reaching "ugly" mode (and perhaps trying multiple times
to shrink the hash table), we deduce that there is a kind of extreme skew
that our partitioning scheme can never help with.  So we stop respecting
work_mem and hope for the best.  The hash join may or may not be able to
complete, depending on how much memory you can successfully allocate
without melting the server or being killed by the OOM reaper.

The "ugly" mode was added in 2005[2][3]; before that we had only "good",
"bad" and "fail".  We don't ever want to be in "ugly" or "fail" mode: a
sort-merge join would have been better, or in any case is guaranteed to be
able to run to completion in the configured space.  However, at the point
where we reach this condition, there isn't anything else we can do.
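For anyone following along, the batch arithmetic behind "bad" and "ugly"
mode is the existing bucket/batch assignment in master's nodeHash.c
(ExecHashGetBucketAndBatch).  Here's a compilable paraphrase, not the
patch itself and with a helper name of my own, showing why doubling the
batch count only ever splits a batch in two:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Paraphrase of ExecHashGetBucketAndBatch() in nodeHash.c: nbuckets
     * and nbatch are powers of two, so MOD is a mask and DIV is a shift.
     * The bucket number uses the low bits of the hash value, the batch
     * number uses the bits above them.
     */
    static void
    bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets, int log2_nbuckets,
                     uint32_t nbatch, uint32_t *bucketno, uint32_t *batchno)
    {
        *bucketno = hashvalue & (nbuckets - 1);
        *batchno = (nbatch > 1) ?
            (hashvalue >> log2_nbuckets) & (nbatch - 1) : 0;
    }

    int
    main(void)
    {
        /*
         * When nbatch doubles, batchno gains one more hash bit: a tuple in
         * batch b lands in either b or b + old_nbatch, which is why "ugly"
         * mode only has to split each not-yet-processed batch in two.
         */
        uint32_t    hash = 0xdeadbeef, bucket, before, after;

        bucket_and_batch(hash, 1024, 10, 8, &bucket, &before);
        bucket_and_batch(hash, 1024, 10, 16, &bucket, &after);
        printf("batch %u -> %u (old nbatch = 8)\n",
               (unsigned) before, (unsigned) after);
        return 0;
    }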
Some other interesting cases that hit new code are: rescan with a single
batch (reuses the hash table contents), rescan with multiple batches
(blows away and rebuilds the hash table), and outer join (scans the hash
table for unmatched tuples).  Outer joins are obviously easy to test, but
rescans are a bit tricky to reach...  one way is to run TPCH Q9 with
cpu_shared_tuple_cost = -10 (I think what's happening there is that it's
essentially running the optimiser in reverse, and a nested loop rescanning
a Gather node (= fork/exit workers for every loop) is about the worst plan
imaginable), but I haven't found a short and sweet test query for that
yet.

Some assorted thoughts:

* Instead of abandoning our work_mem limit in "fail" mode, you might think
we could probe the portion of the hash table that we managed to load so
far, then rewind the outer batch and probe again using the next
work_mem-sized portion of the same inner batch file.  This doesn't work,
though, because in the case of work_mem exhaustion during the initial
batch it's too late to decide to start recording the initial outer batch,
so we have no way to rewind.

* Instead of using the shared hash table for batch mode, we could do just
the initial batch with a shared hash table, but drop back to smaller
private hash tables for later batches and give each worker its own batch
to work on until they're all done, with no further communication.  There
are some problems with this though: inability to handle outer joins (just
like parallel hash join in 9.6), a limit of work_mem (not work_mem * P)
for the private hash tables, and load balancing/granularity problems with
skewed data.  Thanks to my colleague Ashutosh Bapat for this off-list
suggestion.

One of the unpleasant things about this patch is the risk of deadlock, as
already discussed.  I wanted to mention an idea for how to get rid of this
problem eventually.  I am aware of two ways that a deadlock could happen:

1.  A worker is waiting to write into its tuple queue (because the reader
is not consuming fast enough and its fixed buffer has filled up), but the
leader (which should be reading the tuple queue) is stuck waiting for the
worker.  This is avoided currently with the early-exit protocol, at the
cost of losing a CPU core after probing the first batch.

2.  Two different hash joins run in non-deterministic order.  Workers A
and B have each executed hash join nodes 1 and 2 at least once and
attached to the barriers, and now worker A is in hash join node 1 and
worker B is in hash join node 2, each at a barrier wait point.  I am not
aware of any executor nodes that could do that currently, but there is
nothing to say that future nodes couldn't do it.  If I am wrong about that
and this could happen today, that would be fatal for this patch in its
current form.

Once we have asynchronous execution infrastructure, perhaps we could make
those problems go away like this:

1.  Introduce a new way for barrier clients to try to advance to the next
phase, but detach and return immediately if they would have to wait (a toy
sketch follows after this list).

2.  Introduce a way for barriers to participate in the readiness protocol
used for async execution, so that a barrier advance counts as a kind of
readiness.  (The asynchronous scheduler probably doesn't need to know
anything about that, since it's based on latches, which the WaitEventSet
API already knows how to multiplex.)

3.  Teach Hash Join to yield instead of waiting at barriers, asking to be
executed again when the barrier might have advanced.

4.  Make sure the Gather node is suitably asynchronicity-aware.  At a
minimum it should be able to deal with the child plan yielding (in the
case where it runs in the leader due to lack of better things to do) and
be able to try that again when it needs to.
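To illustrate what I mean by point 1, here is a toy sketch of such a
primitive.  The name, the struct and the bookkeeping are invented for
illustration (locking omitted); this is not code from barrier.c, only a
shape that respects the rule that the phase can only ever increment.

    #include <stdbool.h>

    /* Toy barrier for illustration; barrier.c's real struct differs. */
    typedef struct ToyBarrier
    {
        int         participants;   /* currently attached */
        int         arrived;        /* arrived in the current phase */
        int         phase;          /* only ever increments */
    } ToyBarrier;

    /*
     * Hypothetical "advance or bail out": if we are the last participant
     * to arrive, advance the phase (no waiting needed) and stay attached;
     * otherwise detach and return false so the caller can yield to the
     * async scheduler and try again once the barrier may have advanced.
     */
    static bool
    toy_barrier_advance_or_detach(ToyBarrier *b)
    {
        if (b->arrived + 1 == b->participants)
        {
            b->arrived = 0;
            b->phase++;         /* would wake any latch-based waiters */
            return true;
        }
        b->participants--;      /* detach instead of blocking */
        return false;
    }

    int
    main(void)
    {
        ToyBarrier  b = {.participants = 2, .arrived = 1, .phase = 0};

        /* the second of two participants arrives last, so no wait needed */
        return toy_barrier_advance_or_detach(&b) ? 0 : 1;
    }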
[1] https://www.postgresql.org/message-id/CAEepm%3D3a4VaPFnmwcdyUH8gE5_hW4tRvXQkpfQyrzgDQ9gJCYw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/15661.1109887540@sss.pgh.pa.us
[3] 849074f9ae422c64501bb1d53ef840de870bf65c

--
Thomas Munro
http://www.enterprisedb.com