Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Msg-id: CA+TgmoYtNR-kv9Q-bCDPV1-Ebidg55j==E7YYi5D9Kv193+LMg@mail.gmail.com
In response to: Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers
On Mon, Aug 1, 2016 at 6:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> As some of you know, I've been working on parallel sort. I think I've
> gone as long as I can without feedback on the design (and I see that
> we're accepting stuff for September CF now), so I'd like to share what
> I came up with. This project is something that I've worked on
> inconsistently since late last year. It can be thought of as the
> Postgres 10 follow-up to the 9.6 work on external sorting.

I am glad that you are working on this.

Just a first thought after reading the email:

> As you can see, the parallel case is 2.58x faster (while using more
> memory, though it's not the case that a higher maintenance_work_mem
> setting speeds up the serial/baseline index build). 8 workers are a
> bit faster than 4, but not by much (not shown). 16 are a bit slower,
> but not by much (not shown).
...
> I've seen cases where a CREATE INDEX ended up more than 3x faster,
> though. I benchmarked this case in the interest of simplicity (the
> serial case is intended to be comparable, making the test fair).
> Encouragingly, as you can see from the trace_sort output, the 8
> parallel workers are 5.67x faster at getting to the final merge (a
> merge that even it performs serially). Note that the final merge for
> each CREATE INDEX is comparable (7 runs vs. 8 runs from each of 8
> workers). Not bad!

I'm not going to say it's bad to be able to do things 2-2.5x faster,
but linear scalability this ain't - particularly because your 2.58x
faster case is using up to 7 or 8 times as much memory.  The
single-process case would be faster with that much memory, too: it
could just quicksort in memory rather than doing an external sort.  I
feel like for sorting, in particular, we probably ought
to be setting the total memory budget, not the per-process memory
budget.  Or if not, then any CREATE INDEX benchmarking had better
compare using scaled values for maintenance_work_mem; otherwise,
you're measuring the impact of using more memory as much as anything
else.
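
To make "scaled values" concrete, here's roughly the comparison I have
in mind (only a sketch: the table and index names and the
1GB-per-worker figure are made up for illustration, and I'm assuming
the patch gives each of the 8 workers its own maintenance_work_mem
worth of sort memory):

    -- Serial baseline: give the lone backend the same total budget the
    -- parallel build consumes in aggregate (8 workers * 1GB = 8GB).
    SET maintenance_work_mem = '8GB';
    CREATE INDEX serial_idx ON test_table (a);

    -- Parallel build: 1GB per worker, however the patch launches its 8
    -- workers, so both builds sort with about the same total memory.
    SET maintenance_work_mem = '1GB';
    CREATE INDEX parallel_idx ON test_table (a);

Otherwise the parallel numbers get credit for memory that the serial
build was never allowed to use.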

I also think that Amdahl's law is going to pinch pretty severely here.
If the final merge phase is a significant percentage of the total
runtime, picking an algorithm that can't parallelize the final merge
is going to limit the speedups to small multiples.  That's an OK place
to be as a result of not having done all the work yet, but you don't
want to get locked into it.  If we're going to have a substantial
portion of the work that can never be parallelized, maybe we've picked
the wrong algorithm.
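
To put a rough number on that (the 20% figure is purely illustrative,
not something I've measured from your trace_sort output): if a
fraction p of the build parallelizes perfectly and the remaining
(1 - p) is the serial final merge, the best you can do with N workers
is

    speedup(N) = 1 / ((1 - p) + p / N)

So if the final merge is 20% of the serial runtime (p = 0.8), 8
workers top out at 1 / (0.2 + 0.8/8) = 3.3x, and even infinitely many
workers can never do better than 1 / 0.2 = 5x.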

The work on making the logtape infrastructure parallel-aware seems
very interesting and potentially useful for other things.  Sadly, I
don't have time to look at it right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


