Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CAM3SWZSKbOv2hyxJ+cLtfYmdK2RP57zkiF5Y4g=wfTFP7+RDcQ@mail.gmail.com
Whole thread Raw
In response to Re: Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Parallel tuplesort (for parallel B-Tree index creation)
List pgsql-hackers
On Fri, Aug 5, 2016 at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> To some extent, sure, absolutely.  But it's our job as developers to
> try to foresee and minimize those cases.  When Noah was at
> EnterpriseDB a few years ago and we were talking about parallel
> internal sort, Noah started by doing a survey of the literature and
> identified parallel quicksort as the algorithm that seemed best for
> our use case.  Of course, every time quicksort partitions the input,
> you get two smaller sorting problems, so it's easy to see how to use 2
> CPUs after the initial partitioning step has been completed and 4 CPUs
> after each of those partitions has been partitioned again, and so on.
> However, that turns out not to be good enough because the first
> partitioning step can consume a significant percentage of the total
> runtime - so if you only start parallelizing after that, you're
> leaving too much on the table.  To avoid that, the algorithm he was
> looking at had a (complicated) way of parallelizing the first
> partitioning step; then you can, it seems, do the full sort in
> parallel.
>
> There are some somewhat outdated and perhaps naive ideas about this
> that we wrote up here:
>
> https://wiki.postgresql.org/wiki/Parallel_Sort

I'm familiar with that effort. I think that when research topics like
sorting, it can sometimes be a mistake to not look at an approach
specifically recommended by the database research community. A lot of
the techniques we've benefited from within tuplesort.c have been a
matter of addressing memory latency as a bottleneck; techniques that
are fairly simple and not worth writing a general interest paper on.
Also, things like abbreviated keys are beneficial in large part
because people tend to follow the first normal form, and therefore an
abbreviated key can contain a fair amount of entropy most of the time.
Similarly, Radix sort seems really cool, but our requirements around
generality seem to make it impractical.

> Anyway, you're proposing an algorithm that can't be fully
> parallelized.  Maybe that's OK.  But I'm a little worried about it.
> I'd feel more confident if we knew that the merge could be done in
> parallel and were just leaving that to a later development stage; or
> if we picked an algorithm like the one above that doesn't leave a
> major chunk of the work unparallelizable.

I might be able to resurrect the parallel merge stuff, just to guide
reviewer intuition on how much that can help or hurt. I can probably
repurpose it to show you the mixed picture on how effective it is. I
think it might help more with collatable text that doesn't have
abbreviated keys, for example, because you can use more of the
machines memory bandwidth for longer. But for integers, it can hurt.
(That's my recollection; I prototyped parallel merge a couple of
months ago now.)

>> Isn't that what you
>> generally see with queries that show off the parallel join capability?
>
> For nested loop joins, no.  The whole join operation can be done in
> parallel.

Sure, I know, but I'm suggesting that laws-of-physics problems may
still be more significant than implementation deficiencies, even
though those deficiencies should need to be stamped out. Linear
scalability is really quite rare for most database workloads.

> Obviously, parallel query is subject to a long list of annoying
> restrictions at this point.  On queries that don't hit any of those
> restrictions we can get 4-5x speedup with a leader and 4 workers.  As
> we expand the range of plan types that we can construct, I think we'll
> see those kinds of speedups for a broader range of queries.  (The
> question of exactly why we top out with as few workers as currently
> seems to be the case needs more investigation, too; maybe contention
> effects?)

You're probably bottlenecked on memory bandwidth. Note that I showed
improvements with 8 workers, not 4. 4 Workers are slower than 8, but
not by that much.

>> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/
>
> I do agree that it is important not to have unrealistic expectations.

Great. My ambition for this patch is that it put parallel CREATE INDEX
on a competitive footing against the implementations featured in other
major systems. I don't think we need to do everything at once, but I
have no intention of pushing forward with something that doesn't do
respectably there. I also want to avoid partitioning in the first
version of this, and probably in any version that backs CREATE INDEX.
I've only made minimal changes to the tuplesort.h interface here to
support parallelism. That flexibility counts for a lot, IMV.

>> As I've said, there is probably a good argument to be made for
>> partitioning to increase parallelism. But, that involves risks around
>> the partitioning being driven by statistics or a cost model

> Yes.  Rushabh is working on that, and Finalize GroupAggregate ->
> Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking
> pretty sweet.

A "Gather Merge" node doesn't really sound like what I'm talking
about. Isn't that something to do with table-level partitioning? I'm
talking about dynamic partitioning, typically of a single table, of
course.

>>> The work on making the logtape infrastructure parallel-aware seems
>>> very interesting and potentially useful for other things.  Sadly, I
>>> don't have time to look at it right now.
>>
>> I would be happy to look at generalizing that further, to help
>> parallel hash join. As you know, Thomas Munro and I have discussed
>> this privately.
>
> Right.

By the way, the patch is in better shape from that perspective, as
compared to the early version Thomas (CC'd) had access to. The BufFile
stuff is now credible as a general-purpose abstraction.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: truncate trigger for foreign data wrappers
Next
From: Andres Freund
Date:
Subject: Re: truncate trigger for foreign data wrappers