Thread: Allow DISTINCT to use Incremental Sort
While working on the regression tests added in a14a58329, I noticed that DISTINCT does not make use of Incremental Sort. It'll only ever do full sorts on the cheapest input path or make use of a path that's already got the required pathkeys. Also, I see that create_final_distinct_paths() is a little quirky and if the cheapest input path happens to be sorted, it'll add_path() the same path twice, which seems like a bit of a waste of effort. That could happen if say enable_seqscan is off or if a Merge Join is the cheapest join method. Additionally, the parallel DISTINCT code looks like it should also get the same treatment. I see that I'd coded this to only add a unique path atop of a presorted path and it never considers sorting the cheapest partial path. I've adjusted that in the attached and also made it consider incremental sorting any path with presorted keys. Please see the attached patch. David
Attachment
On Sat, Jan 7, 2023 at 5:47 PM David Rowley <dgrowleyml@gmail.com> wrote:
While working on the regression tests added in a14a58329, I noticed
that DISTINCT does not make use of Incremental Sort. It'll only ever
do full sorts on the cheapest input path or make use of a path that's
already got the required pathkeys. Also, I see that
create_final_distinct_paths() is a little quirky and if the cheapest
input path happens to be sorted, it'll add_path() the same path twice,
which seems like a bit of a waste of effort. That could happen if say
enable_seqscan is off or if a Merge Join is the cheapest join method.
Additionally, the parallel DISTINCT code looks like it should also get
the same treatment. I see that I'd coded this to only add a unique
path atop of a presorted path and it never considers sorting the
cheapest partial path. I've adjusted that in the attached and also
made it consider incremental sorting any path with presorted keys.
Please see the attached patch.
+1 for the changes. A minor comment is that previously on HEAD for
SELECT DISTINCT case, if we have to do an explicit full sort atop the
cheapest path, we try to make sure to always use the more rigorous
ordering.
/* For explicit-sort case, always use the more rigorous clause */
if (list_length(root->distinct_pathkeys) <
list_length(root->sort_pathkeys))
{
needed_pathkeys = root->sort_pathkeys;
/* Assert checks that parser didn't mess up... */
Assert(pathkeys_contained_in(root->distinct_pathkeys,
needed_pathkeys));
}
else
needed_pathkeys = root->distinct_pathkeys;
I'm not sure if this is necessary, as AFAIU the parser should have
ensured that the sortClause is always a prefix of distinctClause.
In the patch this code has been removed. I think we should also remove
the related comments in create_final_distinct_paths.
* When we have DISTINCT ON, we must sort by the more rigorous of
* DISTINCT and ORDER BY, else it won't have the desired behavior.
- * Also, if we do have to do an explicit sort, we might as well use
- * the more rigorous ordering to avoid a second sort later. (Note
- * that the parser will have ensured that one clause is a prefix of
- * the other.)
Also, the comment just above this one is outdated too.
* First, if we have any adequately-presorted paths, just stick a
* Unique node on those. Then consider doing an explicit sort of the
* cheapest input path and Unique'ing that.
The two-step workflow is what is the case on HEAD but not any more in
the patch. And I think we should mention incremental sort on any paths
with presorted keys.
Thanks
Richard
SELECT DISTINCT case, if we have to do an explicit full sort atop the
cheapest path, we try to make sure to always use the more rigorous
ordering.
/* For explicit-sort case, always use the more rigorous clause */
if (list_length(root->distinct_pathkeys) <
list_length(root->sort_pathkeys))
{
needed_pathkeys = root->sort_pathkeys;
/* Assert checks that parser didn't mess up... */
Assert(pathkeys_contained_in(root->distinct_pathkeys,
needed_pathkeys));
}
else
needed_pathkeys = root->distinct_pathkeys;
I'm not sure if this is necessary, as AFAIU the parser should have
ensured that the sortClause is always a prefix of distinctClause.
In the patch this code has been removed. I think we should also remove
the related comments in create_final_distinct_paths.
* When we have DISTINCT ON, we must sort by the more rigorous of
* DISTINCT and ORDER BY, else it won't have the desired behavior.
- * Also, if we do have to do an explicit sort, we might as well use
- * the more rigorous ordering to avoid a second sort later. (Note
- * that the parser will have ensured that one clause is a prefix of
- * the other.)
Also, the comment just above this one is outdated too.
* First, if we have any adequately-presorted paths, just stick a
* Unique node on those. Then consider doing an explicit sort of the
* cheapest input path and Unique'ing that.
The two-step workflow is what is the case on HEAD but not any more in
the patch. And I think we should mention incremental sort on any paths
with presorted keys.
Thanks
Richard
Thanks for having a look at this. On Tue, 10 Jan 2023 at 02:28, Richard Guo <guofenglinux@gmail.com> wrote: > +1 for the changes. A minor comment is that previously on HEAD for > SELECT DISTINCT case, if we have to do an explicit full sort atop the > cheapest path, we try to make sure to always use the more rigorous > ordering. I'm not quite sure I follow what's changed here. As far as I see it the code still does what it did and uses the more rigorous sort. postgres=# explain (costs off) select distinct on (relname) * from pg_Class order by relname, oid; QUERY PLAN ---------------------------------- Unique -> Sort Sort Key: relname, oid -> Seq Scan on pg_class If it didn't, then there'd have been a Sort atop of the Unique to ORDER BY relname,oid in the above. Maybe you were looking at create_partial_distinct_paths()? We don't do anything there for DISTINCT ON, although perhaps we could. Just not for this patch. > > /* For explicit-sort case, always use the more rigorous clause */ > if (list_length(root->distinct_pathkeys) < > list_length(root->sort_pathkeys)) > { > needed_pathkeys = root->sort_pathkeys; > /* Assert checks that parser didn't mess up... */ > Assert(pathkeys_contained_in(root->distinct_pathkeys, > needed_pathkeys)); > } > else > needed_pathkeys = root->distinct_pathkeys; > > I'm not sure if this is necessary, as AFAIU the parser should have > ensured that the sortClause is always a prefix of distinctClause. I think that's true for standard DISTINCT, but it's not for DISTINCT ON. > In the patch this code has been removed. I think we should also remove > the related comments in create_final_distinct_paths. > > * When we have DISTINCT ON, we must sort by the more rigorous of > * DISTINCT and ORDER BY, else it won't have the desired behavior. > - * Also, if we do have to do an explicit sort, we might as well use > - * the more rigorous ordering to avoid a second sort later. (Note > - * that the parser will have ensured that one clause is a prefix of > - * the other.) I'm not quite following what you think has changed here. > Also, the comment just above this one is outdated too. > > * First, if we have any adequately-presorted paths, just stick a > * Unique node on those. Then consider doing an explicit sort of the > * cheapest input path and Unique'ing that. > > The two-step workflow is what is the case on HEAD but not any more in > the patch. And I think we should mention incremental sort on any paths > with presorted keys. Yeah, I agree, that comment should mention incremental sort. I've attached an updated patch David
Attachment
On Tue, Jan 10, 2023 at 10:14 AM David Rowley <dgrowleyml@gmail.com> wrote:
> /* For explicit-sort case, always use the more rigorous clause */
> if (list_length(root->distinct_pathkeys) <
> list_length(root->sort_pathkeys))
> {
> needed_pathkeys = root->sort_pathkeys;
> /* Assert checks that parser didn't mess up... */
> Assert(pathkeys_contained_in(root->distinct_pathkeys,
> needed_pathkeys));
> }
> else
> needed_pathkeys = root->distinct_pathkeys;
>
> I'm not sure if this is necessary, as AFAIU the parser should have
> ensured that the sortClause is always a prefix of distinctClause.
I think that's true for standard DISTINCT, but it's not for DISTINCT ON.
> In the patch this code has been removed. I think we should also remove
> the related comments in create_final_distinct_paths.
>
> * When we have DISTINCT ON, we must sort by the more rigorous of
> * DISTINCT and ORDER BY, else it won't have the desired behavior.
> - * Also, if we do have to do an explicit sort, we might as well use
> - * the more rigorous ordering to avoid a second sort later. (Note
> - * that the parser will have ensured that one clause is a prefix of
> - * the other.)
I'm not quite following what you think has changed here.
Sorry I didn't make myself clear. I mean currently on HEAD in planner.c
from line 4847 to line 4857, we have the code to make sure we always use
the more rigorous clause for explicit-sort case. I think this code is
not necessary, because we have already done that in line 4791 to 4796,
for both DISTINCT ON and standard DISTINCT.
In this patch that code (line 4847 to line 4857) has been removed, which
I agree with. I just wondered if the related comment should be removed
too. But now after a second thought, I think it's OK to keep that
comment, as it still holds true in the new patch.
from line 4847 to line 4857, we have the code to make sure we always use
the more rigorous clause for explicit-sort case. I think this code is
not necessary, because we have already done that in line 4791 to 4796,
for both DISTINCT ON and standard DISTINCT.
In this patch that code (line 4847 to line 4857) has been removed, which
I agree with. I just wondered if the related comment should be removed
too. But now after a second thought, I think it's OK to keep that
comment, as it still holds true in the new patch.
I've attached an updated patch
The patch looks good to me.
Thanks
Richard
Thanks
Richard
On Tue, 10 Jan 2023 at 16:07, Richard Guo <guofenglinux@gmail.com> wrote: > Sorry I didn't make myself clear. I mean currently on HEAD in planner.c > from line 4847 to line 4857, we have the code to make sure we always use > the more rigorous clause for explicit-sort case. I think this code is > not necessary, because we have already done that in line 4791 to 4796, > for both DISTINCT ON and standard DISTINCT. Thanks for explaining. I'm unsure if that code ever did anything useful, but I agree that it does nothing now. >> I've attached an updated patch > > > The patch looks good to me. Thanks for having another look. I've now pushed the patch. David