Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CA+TgmoY5JYs4R1g_ZJ-P6SkULSb19xx4zUh7S8LJiXonCgVTuQ@mail.gmail.com
In response to: Re: Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers
On Wed, Aug 3, 2016 at 5:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not going to say it's bad to be able to do things 2-2.5x faster,
>> but linear scalability this ain't - particularly because your 2.58x
>> faster case is using up to 7 or 8 times as much memory. The
>> single-process case would be faster in that case, too: you could
>> quicksort.
>
> [ lengthy counter-argument ]

None of this convinces me that testing this in a way that is not
"apples to apples" is a good idea, nor will any other argument.

>> I also think that Amdahl's law is going to pinch pretty severely here.
>
> Doesn't that almost always happen, though?

To some extent, sure, absolutely. But it's our job as developers to try
to foresee and minimize those cases. When Noah was at EnterpriseDB a few
years ago and we were talking about parallel internal sort, Noah started
by doing a survey of the literature and identified parallel quicksort as
the algorithm that seemed best for our use case. Of course, every time
quicksort partitions the input, you get two smaller sorting problems, so
it's easy to see how to use 2 CPUs after the initial partitioning step
has been completed and 4 CPUs after each of those partitions has been
partitioned again, and so on. However, that turns out not to be good
enough, because the first partitioning step can consume a significant
percentage of the total runtime - so if you only start parallelizing
after that, you're leaving too much on the table. To avoid that, the
algorithm he was looking at had a (complicated) way of parallelizing the
first partitioning step; then you can, it seems, do the full sort in
parallel. There are some somewhat outdated and perhaps naive ideas about
this that we wrote up here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Anyway, you're proposing an algorithm that can't be fully parallelized.
Maybe that's OK. But I'm a little worried about it.
I'd feel more confident if we knew that the merge could be done in
parallel and were just leaving that to a later development stage; or if
we picked an algorithm like the one above that doesn't leave a major
chunk of the work unparallelizable.

> Isn't that what you generally see with queries that show off the
> parallel join capability?

For nested loop joins, no. The whole join operation can be done in
parallel. For hash joins, yes: building the hash table once per worker
can run afoul of Amdahl's law in a big way. That's why Thomas Munro is
working on fixing it:

https://wiki.postgresql.org/wiki/EnterpriseDB_database_server_roadmap

Obviously, parallel query is subject to a long list of annoying
restrictions at this point. On queries that don't hit any of those
restrictions we can get 4-5x speedup with a leader and 4 workers. As we
expand the range of plan types that we can construct, I think we'll see
those kinds of speedups for a broader range of queries. (The question of
exactly why we top out with as few workers as currently seems to be the
case needs more investigation, too; maybe contention effects?)

>> If the final merge phase is a significant percentage of the total
>> runtime, picking an algorithm that can't parallelize the final merge
>> is going to limit the speedups to small multiples. That's an OK place
>> to be as a result of not having done all the work yet, but you don't
>> want to get locked into it. If we're going to have a substantial
>> portion of the work that can never be parallelized, maybe we've picked
>> the wrong algorithm.
>
> I suggest that this work be compared to something with similar
> constraints. I used Google to try to get some indication of how much
> of a difference parallel CREATE INDEX makes in other major database
> systems.
> This is all I could find:
>
> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

I do agree that it is important not to have unrealistic expectations.

> As I've said, there is probably a good argument to be made for
> partitioning to increase parallelism. But, that involves risks around
> the partitioning being driven by statistics or a cost model, and I
> don't think you'd be too on board with the idea of every CREATE INDEX
> after bulk loading needing an ANALYZE first. I tend to think of that
> as more of a parallel query thing, because you can often push down a
> lot more there, dynamic sampling might be possible, and there isn't a
> need to push all the tuples through one point in the end. Nothing I've
> done here precludes your idea of a sort-order-preserving gather node.
> I think that we may well need both.

Yes. Rushabh is working on that, and Finalize GroupAggregate -> Gather
Merge -> Partial GroupAggregate -> Sort -> whatever is looking pretty
sweet.

>> The work on making the logtape infrastructure parallel-aware seems
>> very interesting and potentially useful for other things. Sadly, I
>> don't have time to look at it right now.
>
> I would be happy to look at generalizing that further, to help
> parallel hash join. As you know, Thomas Munro and I have discussed
> this privately.

Right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company