Thread: Allow parallel DISTINCT
Back in March 2016, e06a38965 added support for parallel aggregation. IIRC, because it was fairly late in the release cycle, I dropped parallel DISTINCT to reduce the scope a little. It's been on my list of things to fix since then. I just didn't get around to it until today. The patch is just some plumbing work to connect all the correct paths up to make it work. It's all fairly trivial. I thought about refactoring things a bit more to get rid of the additional calls to grouping_is_sortable() and grouping_is_hashable(), but I just don't think it's worth making the code ugly for. We'll only call them again if we're considering a parallel plan, in which case it's most likely not a trivial query. Those functions are pretty cheap anyway. I understand that there's another patch in the September commitfest that does some stuff with Parallel DISTINCT, but that goes about things a completely different way by creating multiple queues to distribute values by hash. I don't think there's any overlap here. We'd likely want to still have the planner consider both methods if we get that patch sometime. David
Attachment
On Wed, 11 Aug 2021 at 16:51, David Rowley <dgrowleyml@gmail.com> wrote: > The patch is just some plumbing work to connect all the correct paths > up to make it work. It's all fairly trivial. I looked at this patch again and realise that it could be done a bit better. For example, the previous version set the distinct_rel's FDW fields twice, once when making the serial paths and once when finalizing the partial paths. I've now added two new functions; create_final_distinct_paths and create_partial_distinct_paths. The responsibility of create_distinct_paths has changed. Instead of it creating the non-parallel DISTINCT paths, it calls the two new functions and also takes charge of calling the create_upper_paths_hook for UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call. I think this is nicer as I'd previously added a new parameter to create_distinct_paths() so I could tell it not to call the hook as I didn't want to call that twice on the same relation as it would no doubt result in some plugin just creating the same paths again. I've also changed my mind about the previous choice I'd made not to call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT. I now think that's ok. I think this is a fairly trivial patch that just does a bit of wiring up of paths. Unless anyone has anything to say about it in the next few days, I'll be looking at it again with intensions to push it. David
On Mon, Aug 16, 2021 at 10:07 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 11 Aug 2021 at 16:51, David Rowley <dgrowleyml@gmail.com> wrote:
> The patch is just some plumbing work to connect all the correct paths
> up to make it work. It's all fairly trivial.
I looked at this patch again and realise that it could be done a bit
better. For example, the previous version set the distinct_rel's FDW
fields twice, once when making the serial paths and once when
finalizing the partial paths.
I've now added two new functions; create_final_distinct_paths and
create_partial_distinct_paths. The responsibility of
create_distinct_paths has changed. Instead of it creating the
non-parallel DISTINCT paths, it calls the two new functions and also
takes charge of calling the create_upper_paths_hook for
UPPERREL_DISTINCT plus the FDW GetForeignUpperPaths() call. I think
this is nicer as I'd previously added a new parameter to
create_distinct_paths() so I could tell it not to call the hook as I
didn't want to call that twice on the same relation as it would no
doubt result in some plugin just creating the same paths again.
I've also changed my mind about the previous choice I'd made not to
call GetForeignUpperPaths for the UPPERREL_PARTIAL_DISTINCT. I now
think that's ok.
I think this is a fairly trivial patch that just does a bit of wiring
up of paths. Unless anyone has anything to say about it in the next
few days, I'll be looking at it again with intensions to push it.
David
Hi, David:
Can you attach updated patch so that we know more detail about the two new functions; create_final_distinct_paths and
create_partial_distinct_paths ?Thanks
On Tue, 17 Aug 2021 at 20:07, Zhihong Yu <zyu@yugabyte.com> wrote: > Can you attach updated patch so that we know more detail about the two new functions; create_final_distinct_paths and > create_partial_distinct_paths ? Must've fallen off in transit :) David
Attachment
On Tue, Aug 17, 2021 at 3:59 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 17 Aug 2021 at 20:07, Zhihong Yu <zyu@yugabyte.com> wrote:
> Can you attach updated patch so that we know more detail about the two new functions; create_final_distinct_paths and
> create_partial_distinct_paths ?
Must've fallen off in transit :)
David
Hi,
Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed to create_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups().
Cheers
On Wed, 18 Aug 2021 at 02:42, Zhihong Yu <zyu@yugabyte.com> wrote: > Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed tocreate_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups(). That can't be done. The two calls to estimate_num_groups() are passing in a different number of input rows. In create_partial_distinct_paths() the number of rows is the number of expected input rows from a partial path. In create_final_distinct_paths() when called to complete the final distinct step, that's the number of distinct values multiplied by the number of workers. It might be more possible to do something like cache the value of distinctExprs, but I just don't feel the need. If there are partial paths in the input_rel then it's most likely that planning time is not going to dominate much between planning and execution. Also, if we were to calculate the value of distinctExprs in create_distinct_paths always, then we might end up calculating it for nothing as create_final_distinct_paths() does not always need it. I don't feel the need to clutter up the code by doing any lazy calculating of it either. David
On Tue, Aug 17, 2021 at 1:47 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 18 Aug 2021 at 02:42, Zhihong Yu <zyu@yugabyte.com> wrote:
> Since create_partial_distinct_paths() calls create_final_distinct_paths(), I wonder if numDistinctRows can be passed to create_final_distinct_paths() so that the latter doesn't need to call estimate_num_groups().
That can't be done. The two calls to estimate_num_groups() are passing
in a different number of input rows. In
create_partial_distinct_paths() the number of rows is the number of
expected input rows from a partial path. In
create_final_distinct_paths() when called to complete the final
distinct step, that's the number of distinct values multiplied by the
number of workers.
It might be more possible to do something like cache the value of
distinctExprs, but I just don't feel the need. If there are partial
paths in the input_rel then it's most likely that planning time is not
going to dominate much between planning and execution. Also, if we
were to calculate the value of distinctExprs in create_distinct_paths
always, then we might end up calculating it for nothing as
create_final_distinct_paths() does not always need it. I don't feel
the need to clutter up the code by doing any lazy calculating of it
either.
David
Hi,
Thanks for your explanation.
The patch is good from my point of view.
On Wed, 18 Aug 2021 at 08:50, Zhihong Yu <zyu@yugabyte.com> wrote: > The patch is good from my point of view. Thanks for the review. I looked over the patch again and the only thing I adjusted was the order of the RESETs in the regression tests. I left the " if (distinct_rel->pathlist == NIL)" ERROR case check so that the ERROR is raised before we call the FDW function and hook function to add more paths. Part of me thinks it should probably go afterwards, but I didn't want to change the behaviour there. The other part of me thinks that if you can't do distinct by sorting or hashing then there's not much hope for the hook to add any paths either. I've pushed this to master now. David
David Rowley <dgrowleyml@gmail.com> writes: > I've pushed this to master now. ... and the buildfarm is pushing back, eg https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2021-08-22%2011%3A31%3A45 diff -U3 /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out --- /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/expected/select_distinct.out 2021-08-22 11:26:22.000000000+0000 +++ /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/regress/results/select_distinct.out 2021-08-22 11:38:43.000000000+0000 @@ -223,7 +223,7 @@ -> Sort Sort Key: four -> Gather - Workers Planned: 2 + Workers Planned: 5 -> HashAggregate Group Key: four -> Parallel Seq Scan on tenk1 @@ -270,7 +270,7 @@ -> Sort Sort Key: (distinct_func(1)) -> Gather - Workers Planned: 2 + Workers Planned: 5 -> Parallel Seq Scan on tenk1 (6 rows) regards, tom lane
On Mon, 23 Aug 2021 at 01:58, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <dgrowleyml@gmail.com> writes: > > I've pushed this to master now. > > ... and the buildfarm is pushing back, eg Thanks. I pushed a fix for that. David
David Rowley <dgrowleyml@gmail.com> writes: > Thanks. I pushed a fix for that. Yeah, I saw your commit just after complaining. Sorry for the noise. regards, tom lane