Thread: Allow parallel DISTINCT

Re: Allow parallel DISTINCT

From

David Rowley

Date:

17 August 2021, 05:07:20

On Wed, 11 Aug 2021 at 16:51, David Rowley <dgrowleyml@gmail.com> wrote:
> The patch is just some plumbing work to connect all the correct paths
> up to make it work. It's all fairly trivial.

I looked at this patch again and realise that it could be done a bit
better. For example, the previous version set the distinct_rel's FDW
fields twice, once when making the serial paths and once when
finalizing the partial paths.

I've now added two new functions; create_final_distinct_paths and
create_partial_distinct_paths.  The responsibility of
create_distinct_paths has changed. Instead of it creating the
non-parallel DISTINCT paths, it calls the two new functions and also
takes charge of calling the create_upper_paths_hook for
UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call.   I think
this is nicer as I'd previously added a new parameter to
create_distinct_paths() so I could tell it not to call the hook as I
didn't want to call that twice on the same relation as it would no
doubt result in some plugin just creating the same paths again.

I've also changed my mind about the previous choice I'd made not to
call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT.  I now
think that's ok.

I think this is a fairly trivial patch that just does a bit of wiring
up of paths.  Unless anyone has anything to say about it in the next
few days, I'll be looking at it again with intensions to push it.

David

Re: Allow parallel DISTINCT

From

Zhihong Yu

Date:

17 August 2021, 08:13:23

On Mon, Aug 16, 2021 at 10:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 11 Aug 2021 at 16:51, David Rowley <dgrowleyml@gmail.com> wrote:
> The patch is just some plumbing work to connect all the correct paths
> up to make it work. It's all fairly trivial.

I looked at this patch again and realise that it could be done a bit
better. For example, the previous version set the distinct_rel's FDW
fields twice, once when making the serial paths and once when
finalizing the partial paths.

I've now added two new functions; create_final_distinct_paths and
create_partial_distinct_paths. The responsibility of
create_distinct_paths has changed. Instead of it creating the
non-parallel DISTINCT paths, it calls the two new functions and also
takes charge of calling the create_upper_paths_hook for
UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call. I think
this is nicer as I'd previously added a new parameter to
create_distinct_paths() so I could tell it not to call the hook as I
didn't want to call that twice on the same relation as it would no
doubt result in some plugin just creating the same paths again.

I've also changed my mind about the previous choice I'd made not to
call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT. I now
think that's ok.

I think this is a fairly trivial patch that just does a bit of wiring
up of paths. Unless anyone has anything to say about it in the next
few days, I'll be looking at it again with intensions to push it.

David

Hi, David:

Can you attach updated patch so that we know more detail about the two new functions; create_final_distinct_paths and

create_partial_distinct_paths ?

Thanks

Re: Allow parallel DISTINCT

From

Zhihong Yu

Date:

17 August 2021, 14:47:40

On Tue, Aug 17, 2021 at 3:59 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 17 Aug 2021 at 20:07, Zhihong Yu <zyu@yugabyte.com> wrote:
> Can you attach updated patch so that we know more detail about the two new functions; create_final_distinct_paths and
> create_partial_distinct_paths ?

Must've fallen off in transit :)

David

Hi,

Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed to create_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups().

Cheers

Re: Allow parallel DISTINCT

From

David Rowley

Date:

17 August 2021, 20:47:25

On Wed, 18 Aug 2021 at 02:42, Zhihong Yu <zyu@yugabyte.com> wrote:
> Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed
tocreate_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups().
 

That can't be done. The two calls to estimate_num_groups() are passing
in a different number of input rows.  In
create_partial_distinct_paths() the number of rows is the number of
expected input rows from a partial path.  In
create_final_distinct_paths() when called to complete the final
distinct step, that's the number of distinct values multiplied by the
number of workers.

It might be more possible to do something like cache the value of
distinctExprs, but I just don't feel the need.  If there are partial
paths in the input_rel then it's most likely that planning time is not
going to dominate much between planning and execution. Also, if we
were to calculate the value of distinctExprs in create_distinct_paths
always, then we might end up calculating it for nothing as
create_final_distinct_paths() does not always need it. I don't feel
the need to clutter up the code by doing any lazy calculating of it
either.

David

Re: Allow parallel DISTINCT

From

Zhihong Yu

Date:

17 August 2021, 20:56:23

On Tue, Aug 17, 2021 at 1:47 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 18 Aug 2021 at 02:42, Zhihong Yu <zyu@yugabyte.com> wrote:
> Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed to create_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups().

That can't be done. The two calls to estimate_num_groups() are passing
in a different number of input rows. In
create_partial_distinct_paths() the number of rows is the number of
expected input rows from a partial path. In
create_final_distinct_paths() when called to complete the final
distinct step, that's the number of distinct values multiplied by the
number of workers.

It might be more possible to do something like cache the value of
distinctExprs, but I just don't feel the need. If there are partial
paths in the input_rel then it's most likely that planning time is not
going to dominate much between planning and execution. Also, if we
were to calculate the value of distinctExprs in create_distinct_paths
always, then we might end up calculating it for nothing as
create_final_distinct_paths() does not always need it. I don't feel
the need to clutter up the code by doing any lazy calculating of it
either.

David

Hi,

Thanks for your explanation.

The patch is good from my point of view.

Re: Allow parallel DISTINCT

From

David Rowley

Date:

22 August 2021, 11:37:00

On Wed, 18 Aug 2021 at 08:50, Zhihong Yu <zyu@yugabyte.com> wrote:
> The patch is good from my point of view.

Thanks for the review.  I looked over the patch again and the only
thing I adjusted was the order of the RESETs in the regression tests.

I left the " if (distinct_rel->pathlist == NIL)" ERROR case check so
that the ERROR is raised before we call the FDW function and hook
function to add more paths.  Part of me thinks it should probably go
afterwards, but I didn't want to change the behaviour there.   The
other part of me thinks that if you can't do distinct by sorting or
hashing then there's not much hope for the hook to add any paths
either.

I've pushed this to master now.

David

Re: Allow parallel DISTINCT

From

Tom Lane

Date:

22 August 2021, 13:58:03

David Rowley <dgrowleyml@gmail.com> writes:
> I've pushed this to master now.

... and the buildfarm is pushing back, eg

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-08-22%2011%3A31%3A45

diff -U3 /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out
--- /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out    2021-08-22
11:26:22.000000000+0000 
+++ /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out    2021-08-22
11:38:43.000000000+0000 
@@ -223,7 +223,7 @@
    ->  Sort
          Sort Key: four
          ->  Gather
-               Workers Planned: 2
+               Workers Planned: 5
                ->  HashAggregate
                      Group Key: four
                      ->  Parallel Seq Scan on tenk1
@@ -270,7 +270,7 @@
    ->  Sort
          Sort Key: (distinct_func(1))
          ->  Gather
-               Workers Planned: 2
+               Workers Planned: 5
                ->  Parallel Seq Scan on tenk1
 (6 rows)


            regards, tom lane

Re: Allow parallel DISTINCT

From

David Rowley

Date:

22 August 2021, 14:35:46

On Mon, 23 Aug 2021 at 01:58, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > I've pushed this to master now.
>
> ... and the buildfarm is pushing back, eg

Thanks.  I pushed a fix for that.

David

Re: Allow parallel DISTINCT

From

Tom Lane

Date:

22 August 2021, 15:16:41

David Rowley <dgrowleyml@gmail.com> writes:
> Thanks.  I pushed a fix for that.

Yeah, I saw your commit just after complaining.  Sorry for the noise.

            regards, tom lane

Thread: Allow parallel DISTINCT

Allow parallel DISTINCT

Attachment

Re: Allow parallel DISTINCT

Re: Allow parallel DISTINCT

Re: Allow parallel DISTINCT

Attachment

Re: Allow parallel DISTINCT

Re: Allow parallel DISTINCT

Re: Allow parallel DISTINCT

Re: Allow parallel DISTINCT

Re: Allow parallel DISTINCT

Re: Allow parallel DISTINCT

Re: Allow parallel DISTINCT