Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Msg-id: CA+TgmoYtNR-kv9Q-bCDPV1-Ebidg55j==E7YYi5D9Kv193+LMg@mail.gmail.com
In response to: Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers
On Mon, Aug 1, 2016 at 6:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> As some of you know, I've been working on parallel sort. I think I've
> gone as long as I can without feedback on the design (and I see that
> we're accepting stuff for September CF now), so I'd like to share what
> I came up with. This project is something that I've worked on
> inconsistently since late last year. It can be thought of as the
> Postgres 10 follow-up to the 9.6 work on external sorting.

I am glad that you are working on this. Just a first thought after reading the email:

> As you can see, the parallel case is 2.58x faster (while using more
> memory, though it's not the case that a higher maintenance_work_mem
> setting speeds up the serial/baseline index build). 8 workers are a
> bit faster than 4, but not by much (not shown). 16 are a bit slower,
> but not by much (not shown).

...

> I've seen cases where a CREATE INDEX ended up more than 3x faster,
> though. I benchmarked this case in the interest of simplicity (the
> serial case is intended to be comparable, making the test fair).
> Encouragingly, as you can see from the trace_sort output, the 8
> parallel workers are 5.67x faster at getting to the final merge (a
> merge that even it performs serially). Note that the final merge for
> each CREATE INDEX is comparable (7 runs vs. 8 runs from each of 8
> workers).

Not bad! I'm not going to say it's bad to be able to do things 2-2.5x faster, but linear scalability this ain't - particularly because your 2.58x-faster case is using up to 7 or 8 times as much memory. The single-process case would be faster in that case, too: you could quicksort.

I feel like for sorting, in particular, we probably ought to be setting the total memory budget, not the per-process memory budget. Or if not, then any CREATE INDEX benchmarking had better compare using scaled values for maintenance_work_mem; otherwise, you're measuring the impact of using more memory as much as anything else.
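The per-process vs. total budgeting point can be sketched numerically. This is an illustrative toy, not PostgreSQL code; the function name and the 1024 MB figure are hypothetical, chosen only to show why an unscaled per-worker maintenance_work_mem makes the benchmark comparison unfair:

```python
# Illustrative sketch: peak sort memory under a per-process budget
# (each worker gets the full maintenance_work_mem) vs. a total budget
# (the budget is shared across all workers). Hypothetical numbers.

def peak_memory(workers, budget_mb, per_process=True):
    """Peak memory used by the sort, in MB, under the chosen policy."""
    if per_process:
        return workers * budget_mb  # every worker claims the full budget
    return budget_mb                # one budget, divided among workers

serial = peak_memory(1, 1024)                    # serial baseline: 1024 MB
parallel = peak_memory(8, 1024)                  # 8 workers: 8192 MB peak
scaled = peak_memory(8, 1024, per_process=False) # fair comparison: 1024 MB

# A "2.58x faster" parallel build that peaks at 8x the memory of the
# serial build is measuring extra memory as much as parallelism.
print(serial, parallel, scaled)
```

A benchmark that scales maintenance_work_mem down by the worker count (the `scaled` case) isolates the speedup attributable to parallelism itself.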
I also think that Amdahl's law is going to pinch pretty severely here. If the final merge phase is a significant percentage of the total runtime, picking an algorithm that can't parallelize the final merge is going to limit the speedups to small multiples. That's an OK place to be as a result of not having done all the work yet, but you don't want to get locked into it. If we're going to have a substantial portion of the work that can never be parallelized, maybe we've picked the wrong algorithm.

The work on making the logtape infrastructure parallel-aware seems very interesting and potentially useful for other things. Sadly, I don't have time to look at it right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
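The Amdahl's law pinch can be made concrete with the standard formula: if a fraction p of the runtime parallelizes perfectly across n workers and the rest (here, the serial final merge) does not, the speedup is 1 / ((1 - p) + p/n). The 20% merge fraction below is a hypothetical figure for illustration, not a measurement from the thread:

```python
# Amdahl's law: overall speedup with n workers when only a fraction p
# of the runtime is parallelizable. A serial final merge caps the gain.

def amdahl_speedup(p, n):
    """Speedup when fraction p parallelizes across n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# Suppose (hypothetically) the serial final merge is 20% of runtime:
print(round(amdahl_speedup(0.8, 8), 2))      # prints 3.33 with 8 workers
print(round(amdahl_speedup(0.8, 10**9), 2))  # prints 5.0: the hard ceiling
```

With a 20% serial fraction, no worker count can beat 5x; only parallelizing the merge itself moves that ceiling.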