Re: [HACKERS] Parallel Hash take II - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: [HACKERS] Parallel Hash take II
Date:
Msg-id: CAEepm=3jB1NnDXKqYUi_FhoeTn5Usda4XFNphsw2V7bhvad5UQ@mail.gmail.com
In response to: Re: [HACKERS] Parallel Hash take II (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Aug 1, 2017 at 9:28 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-07-26 20:12:56 +1200, Thomas Munro wrote:
>> 2. Simplified costing. There is now just one control knob
>> "parallel_synchronization_cost", which I charge for each time the
>> participants will wait for each other at a barrier, to be set high
>> enough to dissuade the planner from using Parallel Hash for tiny hash
>> tables that would be faster in a parallel-oblivious hash join.
>> Earlier ideas about modelling the cost of shared memory access didn't
>> work out.
>
> Hm. You say, "didn't work out" - could you expand a bit on that? I'm
> quite doubtful that just accounting for barriers will be good enough.

The earlier approach and some variants I played with were based on the idea that we should try to estimate the cost of using shared memory. But there's no precedent for costing the cache hierarchy beyond disk vs memory, and it depends so much on your hardware (NUMA vs UMA) and the data distribution. I have no doubt that variations in memory access costs are important (for example, it was data distribution that determined whether the big-cache-oblivious-shared-hash-table or the MonetDB-style cache-aware approach won in that paper I've mentioned here before [1]), but it seems like a hard problem and I didn't feel it was necessary. Or do you have a different idea here?

Another point is that in the earlier versions I was trying to teach the planner how to choose among Hash, Shared Hash and Parallel Shared Hash. The difference in costing between Hash and Shared Hash (one worker builds, all workers probe) was important and sensitive, because the only difference between them would be the cost of memory sharing. When I dropped Shared Hash from the patch set, it no longer seemed necessary to deal with such subtle costing, because Hash and Parallel Hash (now without the word 'Shared') already have wildly different costs: the latter is divided over N CPUs. So I felt I could get away with a much blunter instrument: just something to avoid parallel build overheads for tiny tables like TPC-H "nation". I still wanted something that makes intuitive sense and that could be set using experimental evidence, though.

parallel_synchronization_cost is an estimate of how long the average backend will have to wait for the last backend to complete the phase and arrive at each barrier. The most interesting case is the build phase: how long will the last backend make us wait before probing can begin? Well, that depends on the parallel grain. Currently, the ultimate source of all parallelism in our executor is Parallel Seq Scan and Parallel Index Scan, and they hand out a page at a time. Of course, any number of nodes may sit between the hash join and the scan, and one of them might include a function that sleeps for 100 years in one backend or performs a join that generates wildly different numbers of tuples in each backend. I don't know what to do about that, other than to assume we have perfectly spherical cows and reason on the basis of an expected parallel grain reaching us from the scans.

One thing to note about parallel_synchronization_cost is that the cost units, where 1 is traditionally the cost of scanning a page, actually make *some* kind of sense here, though it's a bit tenuous: the last worker to complete is the one that scans the final pages, while the others see the scan finished.
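To make the shape of that comparison concrete, here is a toy sketch (the constants and function names are invented for illustration, not taken from the patch): divide the build work over the workers, but charge parallel_synchronization_cost for each barrier wait.

/*
 * Toy illustration only, not the patch's actual costing code: compare a
 * parallel-oblivious hash build against a parallel-aware build that pays
 * a flat synchronization charge per barrier wait.  All constants are
 * made up for the example.
 */
#include <stdio.h>

#define CPU_TUPLE_COST      0.01    /* invented per-tuple build cost */
#define PARALLEL_SYNC_COST  10.0    /* hypothetical value for the new GUC */
#define NUM_BARRIER_WAITS   3       /* illustrative number of barrier waits */

static double
oblivious_build_cost(double tuples)
{
    /* each participant builds its own private hash table */
    return tuples * CPU_TUPLE_COST;
}

static double
parallel_build_cost(double tuples, int workers)
{
    /* build is divided over the workers, but every barrier costs a wait */
    return tuples * CPU_TUPLE_COST / workers
        + NUM_BARRIER_WAITS * PARALLEL_SYNC_COST;
}

int
main(void)
{
    double  sizes[] = {25.0, 1000000.0};    /* TPC-H "nation" vs a big table */
    int     workers = 4;

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        printf("tuples=%.0f oblivious=%.2f parallel=%.2f\n",
               sizes[i],
               oblivious_build_cost(sizes[i]),
               parallel_build_cost(sizes[i], workers));

    return 0;
}

With a per-barrier charge in the mix, the 25-row "nation" case loses to the parallel-oblivious build no matter how many workers you add, while the large build still comes out far ahead; that is the only behaviour this knob is trying to capture.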
What's really wanted here is not simply page scanning cost but rather a percentage of the total cost that represents how much extra work the lanterne rouge of backends has to do. Two relevant projects here are:

1. David Rowley proposes changing the seq scan grain [2], perhaps adaptively. I suppose that as this number increases, the times at which two workers finish can vary more widely.

2. The parallel-append project introduces a completely different type of granularity, based on unrelated and separately costed subplans rather than pages.

Perhaps there are things that could be done here to model the fact that some workers might finish a long time before others, but I don't know. Perhaps what Parallel Hash really needs is not a user-controlled parallel_synchronization_cost, but some number produced by the planner to describe the expected distribution of tuple counts over workers. Armed with something like that and the cost per tuple, you might be able to estimate how long we expect hash join barriers to make you wait without introducing any new GUCs at all. I thought about some of these things a bit, but it seemed like a big research project of its own, and I was persuaded in an off-list discussion by Robert to try to find the simplest thing that would avoid parallel-aware hash for little tables that are already built very cheaply.

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.225.3495
[2] https://www.postgresql.org/message-id/CAKJS1f-XhfQ2-%3D85wgYo5b3WtEs%3Dys%3D2Rsq%3DNuvnmaV4ZsM1XQ%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com