Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CAM3SWZSKbOv2hyxJ+cLtfYmdK2RP57zkiF5Y4g=wfTFP7+RDcQ@mail.gmail.com
In response to: Re: Parallel tuplesort (for parallel B-Tree index creation) (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers
On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> To some extent, sure, absolutely. But it's our job as developers to
> try to foresee and minimize those cases. When Noah was at
> EnterpriseDB a few years ago and we were talking about parallel
> internal sort, Noah started by doing a survey of the literature and
> identified parallel quicksort as the algorithm that seemed best for
> our use case. Of course, every time quicksort partitions the input,
> you get two smaller sorting problems, so it's easy to see how to use 2
> CPUs after the initial partitioning step has been completed and 4 CPUs
> after each of those partitions has been partitioned again, and so on.
> However, that turns out not to be good enough because the first
> partitioning step can consume a significant percentage of the total
> runtime - so if you only start parallelizing after that, you're
> leaving too much on the table. To avoid that, the algorithm he was
> looking at had a (complicated) way of parallelizing the first
> partitioning step; then you can, it seems, do the full sort in
> parallel.
>
> There are some somewhat outdated and perhaps naive ideas about this
> that we wrote up here:
>
> https://wiki.postgresql.org/wiki/Parallel_Sort

I'm familiar with that effort. I think that when researching topics
like sorting, it can sometimes be a mistake to not look at an approach
specifically recommended by the database research community. A lot of
the techniques we've benefited from within tuplesort.c have been a
matter of addressing memory latency as a bottleneck; techniques that
are fairly simple and not worth writing a general interest paper on.
Also, things like abbreviated keys are beneficial in large part
because people tend to follow first normal form, and therefore an
abbreviated key can contain a fair amount of entropy most of the time.
Similarly, radix sort seems really cool, but our requirements around
generality seem to make it impractical.
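To make that concrete for anyone following along, here is a minimal
sketch of the abbreviated-key idea -- illustrative only, not
tuplesort.c's actual code, and all the names here are made up:

#include <stdint.h>
#include <string.h>

/* Illustrative sort tuple: abbreviated key plus the full key. */
typedef struct SortItem
{
    uint64_t    abbrev;   /* first 8 key bytes, packed so that integer
                           * comparison matches byte-wise key order */
    const char *fullkey;  /* the complete, authoritative key */
} SortItem;

/* Pack up to 8 leading bytes of a NUL-terminated key, big-endian. */
static uint64_t
make_abbrev(const char *key)
{
    uint64_t    v = 0;

    for (int i = 0; i < 8 && key[i] != '\0'; i++)
        v |= (uint64_t) (unsigned char) key[i] << (56 - 8 * i);
    return v;
}

/*
 * Comparator: the cheap integer comparison resolves most calls without
 * touching the full key (and so without the cache miss of chasing the
 * pointer); only ties pay for the authoritative comparison.  The more
 * entropy the leading bytes carry, the rarer the fallback -- which is
 * the first-normal-form point above.
 */
static int
abbrev_cmp(const SortItem *a, const SortItem *b)
{
    if (a->abbrev != b->abbrev)
        return (a->abbrev < b->abbrev) ? -1 : 1;
    return strcmp(a->fullkey, b->fullkey);
}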
> Anyway, you're proposing an algorithm that can't be fully
> parallelized. Maybe that's OK. But I'm a little worried about it.
> I'd feel more confident if we knew that the merge could be done in
> parallel and were just leaving that to a later development stage; or
> if we picked an algorithm like the one above that doesn't leave a
> major chunk of the work unparallelizable.

I might be able to resurrect the parallel merge stuff, just to guide
reviewer intuition on how much that can help or hurt. I can probably
repurpose it to show you the mixed picture on how effective it is. I
think it might help more with collatable text that doesn't have
abbreviated keys, for example, because you can use more of the
machine's memory bandwidth for longer. But for integers, it can hurt.
(That's my recollection; I prototyped parallel merge a couple of
months ago now.)

>> Isn't that what you
>> generally see with queries that show off the parallel join
>> capability?
>
> For nested loop joins, no. The whole join operation can be done in
> parallel.

Sure, I know, but I'm suggesting that laws-of-physics problems may
still be more significant than implementation deficiencies, even
though those deficiencies do need to be stamped out. Linear
scalability is really quite rare for most database workloads.

> Obviously, parallel query is subject to a long list of annoying
> restrictions at this point. On queries that don't hit any of those
> restrictions we can get 4-5x speedup with a leader and 4 workers. As
> we expand the range of plan types that we can construct, I think
> we'll see those kinds of speedups for a broader range of queries.
> (The question of exactly why we top out with as few workers as
> currently seems to be the case needs more investigation, too; maybe
> contention effects?)

You're probably bottlenecked on memory bandwidth. Note that I showed
improvements with 8 workers, not 4. Four workers are slower than
eight, but not by that much.

>> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/
>
> I do agree that it is important not to have unrealistic expectations.

Great. My ambition for this patch is that it puts parallel CREATE
INDEX on a competitive footing against the implementations featured
in other major systems. I don't think we need to do everything at
once, but I have no intention of pushing forward with something that
doesn't do respectably there. I also want to avoid partitioning in
the first version of this, and probably in any version that backs
CREATE INDEX. I've only made minimal changes to the tuplesort.h
interface here to support parallelism. That flexibility counts for a
lot, IMV.
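For the archives, a strictly hypothetical sketch of what "minimal
changes" to an interface like tuplesort.h could look like -- these
declarations and names are illustrative assumptions, not the patch's
actual API:

#include <stdbool.h>

typedef struct Tuplesortstate Tuplesortstate;   /* opaque, as today */

/*
 * Hypothetical coordination state.  Passing NULL would mean a plain
 * serial sort, leaving existing callers untouched.
 */
typedef struct SortCoordinate
{
    bool        isWorker;       /* true in a worker, false in the leader */
    int         nParticipants;  /* number of workers producing runs */
    void       *sharedState;    /* shared memory tracking each worker's runs */
} SortCoordinate;

/*
 * Workers sort their slice of the input into runs on shared temp
 * files; the leader merges those runs.  Both sides reuse the same
 * entry point, so the serial code path is shared wholesale.
 */
extern Tuplesortstate *tuplesort_begin_parallel(SortCoordinate *coordinate,
                                                int workMem,
                                                bool randomAccess);

The point is that parallelism would arrive as a small extension of the
existing interface rather than as a separate sort implementation,
which is what keeps it usable by callers other than CREATE INDEX.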
>> As I've said, there is probably a good argument to be made for
>> partitioning to increase parallelism. But, that involves risks around
>> the partitioning being driven by statistics or a cost model
>
> Yes. Rushabh is working on that, and Finalize GroupAggregate ->
> Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking
> pretty sweet.

A "Gather Merge" node doesn't really sound like what I'm talking
about. Isn't that something to do with table-level partitioning? I'm
talking about dynamic partitioning, typically of a single table, of
course.

>>> The work on making the logtape infrastructure parallel-aware seems
>>> very interesting and potentially useful for other things. Sadly, I
>>> don't have time to look at it right now.
>>
>> I would be happy to look at generalizing that further, to help
>> parallel hash join. As you know, Thomas Munro and I have discussed
>> this privately.
>
> Right.

By the way, the patch is in better shape from that perspective, as
compared to the early version Thomas (CC'd) had access to. The BufFile
stuff is now credible as a general-purpose abstraction.

--
Peter Geoghegan