Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Rushabh Lathia
Subject Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CAGPqQf3M6GDfF60NaKYULAwmqSXpM4DpkQbYQcTsbo58bor5-A@mail.gmail.com
In response to Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
List pgsql-hackers
Thanks for the review.

On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:
> 0001-Add-parallel-B-tree-index-build-sorting_v14.patch

Cool. I'm glad that we now have a patch that applies cleanly against
master, while adding very little to buffile.c. It feels like we're
getting very close here.

>> * You didn't point out the randomAccess restriction in tuplesort.h.
>>
>
> I did, it's there in the file header comments.

I see what you wrote in tuplesort.h here:

> + * algorithm, and are typically only used for large amounts of data. Note
> + * that parallel sorts is not support for random access to the sort result.

This should say "...are not supported when random access is requested".


Done.
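
The comment now reads something like this (a sketch; the exact wording
may differ slightly in the patch):

 * algorithm, and are typically only used for large amounts of data. Note
 * that parallel sorts are not supported when random access to the sort
 * result is requested.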
 
> Added handling for parallel_leader_participation as well as deleted
> compile time option force_single_worker.

I still see this:

> +
> +/*
> + * A parallel sort with one worker process, and without any leader-as-worker
> + * state may be used for testing the parallel tuplesort infrastructure.
> + */
> +#ifdef NOT_USED
> +#define FORCE_SINGLE_WORKER
> +#endif

Looks like you missed this FORCE_SINGLE_WORKER hunk -- please remove it, too.


Done.
 
>> The parallel_leader_participation docs will also need to be updated.
>>
>
> Done.

I don't see this. There is no reference to
parallel_leader_participation in the CREATE INDEX docs, nor is there a
reference to CREATE INDEX in the parallel_leader_participation docs.


I thought parallel_leader_participation is a generic GUC which takes
effect for all parallel operations, isn't it?  On that understanding, I
just updated the documentation of parallel_leader_participation in
config.sgml to make it more general.

 
> Also performed more testing with the patch, with parallel_leader_participation
> ON and OFF.  Found one issue, where earlier we always used to call
> _bt_leader_sort_as_worker() but now need to skip the call if
> parallel_leader_participation is OFF.

Hmm. I think the local variable within _bt_heapscan() should go back.
Its value should be taken directly from a parallel_leader_participation
assignment, done once. There might be some bizarre circumstances where it
is possible for the value of parallel_leader_participation to change
in flight, causing a race condition: we start with the leader as a
participant, and change our mind later within
_bt_leader_sort_as_worker(), causing the whole CREATE INDEX to hang
forever.

Even if that's impossible, it seems like an improvement in style to go
back to one local variable controlling everything.


Yes, to me also it looks like an impossible situation, but even so it
makes sense to use one local variable and always read the
value from that.
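
Something like this sketch, using the leaderasworker name I mention
further below (call sites abbreviated):

    /* at the top of _bt_heapscan(): read the GUC exactly once */
    bool        leaderasworker = parallel_leader_participation;

    ...

    /* later, test only the local copy, never the GUC itself */
    if (leaderasworker)
        _bt_leader_sort_as_worker(...);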
 
Style issue here:

> + long start_block = file->numFiles * BUFFILE_SEG_SIZE;
> + int newNumFiles = file->numFiles + source->numFiles;

Shouldn't start_block conform to the surrounding camelCase style?


Done.
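
That is, the hunk now looks like this (modulo exact whitespace):

    long        startBlock = file->numFiles * BUFFILE_SEG_SIZE;
    int         newNumFiles = file->numFiles + source->numFiles;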
 
Finally, two new thoughts on the patch, that are not responses to
anything you did in v14:

1. Thomas' barrier abstraction was added by commit 1145acc7. I think
that you should use a static barrier in tuplesort.c now, and rip out
the ConditionVariable fields in the Sharedsort struct. It's only a
slightly higher level of abstraction for tuplesort.c, which makes only
a small difference given the simple requirements of tuplesort.c.
However, I see no reason to not go that way if that's the new
standard, which it is. This looks like it will be fairly easy.


Pending; as per Thomas' explanation, it seems like the barrier APIs
need some more work.
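
For the record, my understanding of the suggested shape is roughly this
sketch (workersFinished and nParticipants are hypothetical names, and 0
stands in for a suitable wait_event_info):

    #include "storage/barrier.h"

    /* at shared memory initialization, with the participant count known */
    BarrierInit(&shared->workersFinished, nParticipants);

    /* each participant (workers, plus the leader when it participates)
     * arrives here once its sort contribution is complete */
    BarrierArriveAndWait(&shared->workersFinished, 0);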

 
2. Does the plan_create_index_workers() cost model need to account for
parallel_leader_participation, too, when capping workers? I think that
it does.

The relevant planner code is:

> +   /*
> +    * Cap workers based on available maintenance_work_mem as needed.
> +    *
> +    * Note that each tuplesort participant receives an even share of the
> +    * total maintenance_work_mem budget.  Aim to leave workers (where
> +    * leader-as-worker Tuplesortstate counts as a worker) with no less than
> +    * 32MB of memory.  This leaves cases where maintenance_work_mem is set to
> +    * 64MB immediately past the threshold of being capable of launching a
> +    * single parallel worker to sort.
> +    */
> +   sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
> +   min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
> +   while (parallel_workers > min_parallel_workers &&
> +          sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
> +       parallel_workers--;

This parallel CREATE INDEX planner code snippet caps the number of
parallel workers that we plan to launch when per-worker
maintenance_work_mem availability is low. Each worker tuplesort state
needs at least 32MB. We clearly need to do something here.

While it's always true that "leader-as-worker Tuplesortstate counts as
a worker" in v14, I think that it should only be true in the next
revision of the patch when parallel_leader_participation is actually
true (IOW, we should only add 1 to parallel_workers within the loop
invariant in that case). The reason why we need to consider
parallel_leader_participation within this plan_create_index_workers()
code is simple: During execution, _bt_leader_sort_as_worker() uses
"worker tuplesort states"/btshared->scantuplesortstates to determine
how much of a share of maintenance_work_mem each worker tuplesort
gets. Our planner code needs to take that into account, now that the
nbtsort.c parallel_leader_participation behavior isn't just some
obscure debug option. IOW, the planner code needs to be consistent
with the nbtsort.c execution code.


Ah, nice catch.  I passed the local variable (leaderasworker) of
_bt_heapscan() to plan_create_index_workers() rather than reading the
value directly from parallel_leader_participation (the reasons are the
same as you explained earlier).
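
So the cap loop becomes something like the following sketch, with
leaderasworker passed down from _bt_heapscan() (only the "+ 1" part of
the loop invariant changes):

    while (parallel_workers > min_parallel_workers &&
           sort_mem_blocks / (parallel_workers +
                              (leaderasworker ? 1 : 0)) < min_sort_mem_blocks)
        parallel_workers--;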



Thanks,
Rushabh Lathia
