Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Parallel tuplesort (for parallel B-Tree index creation) |
Date | |
Msg-id | CAM3SWZQKGDkH-qFj1zEuTcb=AHXGh=59vCe88PVYGiY3zGy7uw@mail.gmail.com Whole thread Raw |
In response to | Re: Parallel tuplesort (for parallel B-Tree index creation) (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Parallel tuplesort (for parallel B-Tree index creation)
|
List | pgsql-hackers |
On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Okay, but what is the proof or why do you think second is going to > better than first? I don't have proof. It's my opinion that it probably would be, based on partial information, and my intuition. It's hard to prove something like that, because it's not really clear what that alternative would look like. Also, finding all of that out would take a long time -- it's hard to prototype. Do tuple table slots need to care about IndexTuples now? What does that even look like? What existing executor code needs to be taught about this new requirement? > One thing which strikes as a major difference > between your approach and Gather Merge is that in your approach leader > has to wait till all the workers have done with their work on sorting > whereas with Gather Merge as soon as first one is done, leader starts > with merging. I could be wrong here, but if I understood it > correctly, then there is a argument that Gather Merge kind of approach > can win in cases where some of the workers can produce sorted outputs > ahead of others and I am not sure if we can dismiss such cases. How can it? You need to have at least one tuple from every worker (before the worker has exhausted its supply of output tuples) in order to merge to return the next tuple to the top level consumer (the thing above the Gather Merge). If you're talking about "eager vs. lazy merging", please see my previous remarks on that, on this thread. (In any case, whether we merge more eagerly seems like an orthogonal question to the one you ask.) The first thing to note about my approach is that I openly acknowledge that this parallel CREATE INDEX patch is not much use for parallel query. I have only generalized tuplesort.c to support parallelizing a sort operation. I think that parallel query needs partitioning to push down parts of a sort to workers, with little or no need for them to be funneled together at the end, since most tuples are eliminated before being passed to the Gather/Gather Merge node. The partitioning part is really hard. I guess that Gather Merge nodes have value because they allow us to preserve the sorted-ness of a parallel path, which might be most useful because it enables things elsewhere. But, that doesn't really recommend making Gather Merge nodes good at batch processing a large number of tuples, I suspect. (One problem with the tuple queue mechanism is that it can be a big bottleneck -- that's why we want to eliminate most tuples before they're passed up to the leader, in the case of parallel sequential scan in 9.6.) I read the following paragraph from the Volcano paper just now: """ During implementation and benchmarking of parallel sorting, we added two more features to exchange. First, we wanted to implement a merge network in which some processors produce sorted streams merge concurrently by other processors. Volcano’s sort iterator can be used to generate a sorted stream. A merge iterator was easily derived from the sort module. It uses a single level merge, instead of the cascaded merge of runs used in sort. The input of a merge iterator is an exchange. Differently from other operators, the merge iterator requires to distinguish the input records by their producer. As an example, for a join operation it does not matter where the input records were created, and all inputs can be accumulated in a single input stream. For a merge operation, it is crucial to distinguish the input records by their producer in order to merge multiple sorted streams correctly. """ I don't really understand this paragraph, but thought I'd ask: why the need to "distinguish the input records by their producer in order to merge multiple sorted streams correctly"? Isn't that talking about partitioning, where each workers *ownership* of a range matters? My patch doesn't care which values belong to which workers. And, it focuses quite a lot on dealing well with the memory bandwidth bound, I/O bound part of the sort where we write out the index itself, just by piggy-backing on tuplesort.c. I don't think that that's useful for a general-purpose executor node -- tuple-at-a-time processing when fetching from workers would kill performance. > By looking at 'workersFinished' usage, it looks like you have devised > a new way for leader to know when workers have finished which might be > required for this patch. However, have you tried to use or > investigate if existing infrastructure which serves same purpose could > be used for it? Yes, I have. I think that Robert's "condition variables" patch would offer a general solution to what I've devised. What I have there is, as you say, fairly ad-hoc, even though my requirements are actually fairly general. I was actually annoyed that there wasn't an easier way to do that myself. Robert has said that he won't commit his "condition variables" work until it's clear that there will be some use for the facility. Well, I'd use it for this patch, if I could. Robert? -- Peter Geoghegan
pgsql-hackers by date: