Thread: Allow parallel DISTINCT

Allow parallel DISTINCT

From
David Rowley
Date:
Back in March 2016, e06a38965 added support for parallel aggregation.
IIRC, because it was fairly late in the release cycle, I dropped
parallel DISTINCT to reduce the scope a little. It's been on my list
of things to fix since then. I just didn't get around to it until
today.

The patch is just some plumbing work to connect all the correct paths
up to make it work. It's all fairly trivial.

I thought about refactoring things a bit more to get rid of the
additional calls to grouping_is_sortable() and grouping_is_hashable(),
but I just don't think it's worth making the code ugly for.  We'll
only call them again if we're considering a parallel plan, in which
case it's most likely not a trivial query.  Those functions are pretty
cheap anyway.

I understand that there's another patch in the September commitfest
that does some stuff with Parallel DISTINCT, but that goes about
things a completely different way by creating multiple queues to
distribute values by hash.  I don't think there's any overlap here.
We'd likely want to still have the planner consider both methods if we
get that patch sometime.

David

Attachment

Re: Allow parallel DISTINCT

From
David Rowley
Date:
On Wed, 11 Aug 2021 at 16:51, David Rowley <dgrowleyml@gmail.com> wrote:
> The patch is just some plumbing work to connect all the correct paths
> up to make it work. It's all fairly trivial.

I looked at this patch again and realise that it could be done a bit
better. For example, the previous version set the distinct_rel's FDW
fields twice, once when making the serial paths and once when
finalizing the partial paths.

I've now added two new functions; create_final_distinct_paths and
create_partial_distinct_paths.  The responsibility of
create_distinct_paths has changed. Instead of it creating the
non-parallel DISTINCT paths, it calls the two new functions and also
takes charge of calling the create_upper_paths_hook for
UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call.   I think
this is nicer as I'd previously added a new parameter to
create_distinct_paths() so I could tell it not to call the hook as I
didn't want to call that twice on the same relation as it would no
doubt result in some plugin just creating the same paths again.

I've also changed my mind about the previous choice I'd made not to
call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT.  I now
think that's ok.

I think this is a fairly trivial patch that just does a bit of wiring
up of paths.  Unless anyone has anything to say about it in the next
few days, I'll be looking at it again with intensions to push it.

David



Re: Allow parallel DISTINCT

From
Zhihong Yu
Date:


On Mon, Aug 16, 2021 at 10:07 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 11 Aug 2021 at 16:51, David Rowley <dgrowleyml@gmail.com> wrote:
> The patch is just some plumbing work to connect all the correct paths
> up to make it work. It's all fairly trivial.

I looked at this patch again and realise that it could be done a bit
better. For example, the previous version set the distinct_rel's FDW
fields twice, once when making the serial paths and once when
finalizing the partial paths.

I've now added two new functions; create_final_distinct_paths and
create_partial_distinct_paths.  The responsibility of
create_distinct_paths has changed. Instead of it creating the
non-parallel DISTINCT paths, it calls the two new functions and also
takes charge of calling the create_upper_paths_hook for
UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call.   I think
this is nicer as I'd previously added a new parameter to
create_distinct_paths() so I could tell it not to call the hook as I
didn't want to call that twice on the same relation as it would no
doubt result in some plugin just creating the same paths again.

I've also changed my mind about the previous choice I'd made not to
call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT.  I now
think that's ok.

I think this is a fairly trivial patch that just does a bit of wiring
up of paths.  Unless anyone has anything to say about it in the next
few days, I'll be looking at it again with intensions to push it.

David


Hi, David:
Can you attach updated patch so that we know more detail about the two new functions; create_final_distinct_paths and
create_partial_distinct_paths ?

Thanks

Re: Allow parallel DISTINCT

From
David Rowley
Date:
On Tue, 17 Aug 2021 at 20:07, Zhihong Yu <zyu@yugabyte.com> wrote:
> Can you attach updated patch so that we know more detail about the two new functions; create_final_distinct_paths
and
> create_partial_distinct_paths ?

Must've fallen off in transit :)

David

Attachment

Re: Allow parallel DISTINCT

From
Zhihong Yu
Date:


On Tue, Aug 17, 2021 at 3:59 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 17 Aug 2021 at 20:07, Zhihong Yu <zyu@yugabyte.com> wrote:
> Can you attach updated patch so that we know more detail about the two new functions; create_final_distinct_paths and
> create_partial_distinct_paths ?

Must've fallen off in transit :)

David
Hi,
Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed to create_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups().

Cheers

Re: Allow parallel DISTINCT

From
David Rowley
Date:
On Wed, 18 Aug 2021 at 02:42, Zhihong Yu <zyu@yugabyte.com> wrote:
> Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed
tocreate_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups().
 

That can't be done. The two calls to estimate_num_groups() are passing
in a different number of input rows.  In
create_partial_distinct_paths() the number of rows is the number of
expected input rows from a partial path.  In
create_final_distinct_paths() when called to complete the final
distinct step, that's the number of distinct values multiplied by the
number of workers.

It might be more possible to do something like cache the value of
distinctExprs, but I just don't feel the need.  If there are partial
paths in the input_rel then it's most likely that planning time is not
going to dominate much between planning and execution. Also, if we
were to calculate the value of distinctExprs in create_distinct_paths
always, then we might end up calculating it for nothing as
create_final_distinct_paths() does not always need it. I don't feel
the need to clutter up the code by doing any lazy calculating of it
either.

David



Re: Allow parallel DISTINCT

From
Zhihong Yu
Date:


On Tue, Aug 17, 2021 at 1:47 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 18 Aug 2021 at 02:42, Zhihong Yu <zyu@yugabyte.com> wrote:
> Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed to create_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups().

That can't be done. The two calls to estimate_num_groups() are passing
in a different number of input rows.  In
create_partial_distinct_paths() the number of rows is the number of
expected input rows from a partial path.  In
create_final_distinct_paths() when called to complete the final
distinct step, that's the number of distinct values multiplied by the
number of workers.

It might be more possible to do something like cache the value of
distinctExprs, but I just don't feel the need.  If there are partial
paths in the input_rel then it's most likely that planning time is not
going to dominate much between planning and execution. Also, if we
were to calculate the value of distinctExprs in create_distinct_paths
always, then we might end up calculating it for nothing as
create_final_distinct_paths() does not always need it. I don't feel
the need to clutter up the code by doing any lazy calculating of it
either.

David
Hi,
Thanks for your explanation.

The patch is good from my point of view.

Re: Allow parallel DISTINCT

From
David Rowley
Date:
On Wed, 18 Aug 2021 at 08:50, Zhihong Yu <zyu@yugabyte.com> wrote:
> The patch is good from my point of view.

Thanks for the review.  I looked over the patch again and the only
thing I adjusted was the order of the RESETs in the regression tests.

I left the " if (distinct_rel->pathlist == NIL)" ERROR case check so
that the ERROR is raised before we call the FDW function and hook
function to add more paths.  Part of me thinks it should probably go
afterwards, but I didn't want to change the behaviour there.   The
other part of me thinks that if you can't do distinct by sorting or
hashing then there's not much hope for the hook to add any paths
either.

I've pushed this to master now.

David



Re: Allow parallel DISTINCT

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> I've pushed this to master now.

... and the buildfarm is pushing back, eg

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-08-22%2011%3A31%3A45

diff -U3 /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out
--- /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out    2021-08-22
11:26:22.000000000+0000 
+++ /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out    2021-08-22
11:38:43.000000000+0000 
@@ -223,7 +223,7 @@
    ->  Sort
          Sort Key: four
          ->  Gather
-               Workers Planned: 2
+               Workers Planned: 5
                ->  HashAggregate
                      Group Key: four
                      ->  Parallel Seq Scan on tenk1
@@ -270,7 +270,7 @@
    ->  Sort
          Sort Key: (distinct_func(1))
          ->  Gather
-               Workers Planned: 2
+               Workers Planned: 5
                ->  Parallel Seq Scan on tenk1
 (6 rows)


            regards, tom lane



Re: Allow parallel DISTINCT

From
David Rowley
Date:
On Mon, 23 Aug 2021 at 01:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > I've pushed this to master now.
>
> ... and the buildfarm is pushing back, eg

Thanks.  I pushed a fix for that.

David



Re: Allow parallel DISTINCT

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> Thanks.  I pushed a fix for that.

Yeah, I saw your commit just after complaining.  Sorry for the noise.

            regards, tom lane