Thread: Use LIMIT instead of Unique for DISTINCT when all distinct pathkeys are redundant

Over on [1], Klint highlights a query with a DISTINCT which is a
little sub-optimal in PostgreSQL.  ISTM that in cases where all
DISTINCT pathkeys have been marked as redundant due to constants
existing in all of the EquivalenceClasses of the DISTINCT columns,
it should be okay not to bother using a Unique node to remove
duplicate results.

When all the distinct pathkeys are redundant, there can be at most a
single distinct value. There may be many rows with that value, but we
can remove the extra ones with a LIMIT 1 rather than troubling over
needlessly uniquifying them.

This might not be a hugely common case, but: 1) it is very cheap to
detect, and 2) the speedups are likely to be *very* good.
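To illustrate the check (with hypothetical names; these are not the
actual planner structures), detecting the case amounts to counting the
distinct pathkeys that survive redundancy elimination:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the planner's PathKey/EquivalenceClass pair. */
typedef struct
{
    const char *colname;        /* DISTINCT column name */
    bool        ec_has_const;   /* does its EquivalenceClass contain a constant? */
} DemoPathKey;

/*
 * Counts the pathkeys that survive redundancy elimination.  A pathkey whose
 * equivalence class is proven constant (e.g. by WHERE four = 0) is redundant
 * and dropped; when the count reaches zero there can be at most one distinct
 * value, so a LIMIT 1 can stand in for the Unique node.
 */
int
nonredundant_pathkeys(const DemoPathKey *keys, int nkeys)
{
    int         n = 0;

    for (int i = 0; i < nkeys; i++)
    {
        if (!keys[i].ec_has_const)
            n++;
    }
    return n;
}
```

For SELECT DISTINCT four,1,2,3 ... WHERE four = 0, the only pathkey's
class contains the constant 0, so the count is zero; add an
unconstrained column like "two" and one pathkey survives, so the
distinct operation must happen as normal.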

With the attached we get:

regression=# explain (analyze, costs off, timing off) SELECT DISTINCT
four,1,2,3 FROM tenk1 WHERE four = 0;
                   QUERY PLAN
-------------------------------------------------
 Limit (actual rows=1 loops=1)
   ->  Seq Scan on tenk1 (actual rows=1 loops=1)
         Filter: (four = 0)
 Planning Time: 0.215 ms
 Execution Time: 0.071 ms

Naturally, if we removed the WHERE four = 0, we couldn't optimise this
plan using this method.

I see no reason why this can't work for DISTINCT ON too.

regression=# explain (analyze, costs off, timing off) SELECT DISTINCT
ON (four,two) four,two FROM tenk1 WHERE four = 0 order by 1,2;
                        QUERY PLAN
----------------------------------------------------------
 Unique (actual rows=1 loops=1)
   ->  Sort (actual rows=2500 loops=1)
         Sort Key: two
         Sort Method: quicksort  Memory: 175kB
         ->  Seq Scan on tenk1 (actual rows=2500 loops=1)
               Filter: (four = 0)
               Rows Removed by Filter: 7500
 Planning Time: 0.123 ms
 Execution Time: 4.251 ms
(9 rows)

Then, of course, if we introduce some column for which the pathkey is
not redundant, we must do the distinct operation as normal.

regression=# explain (analyze, costs off, timing off) SELECT DISTINCT
four,two FROM tenk1 WHERE four = 0 order by 1,2;
                        QUERY PLAN
----------------------------------------------------------
 Sort (actual rows=1 loops=1)
   Sort Key: two
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate (actual rows=1 loops=1)
         Group Key: four, two
         Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on tenk1 (actual rows=2500 loops=1)
               Filter: (four = 0)
               Rows Removed by Filter: 7500
 Planning Time: 0.137 ms
 Execution Time: 4.274 ms
(11 rows)

Does this seem like something we'd want to do?

Patch attached.

David

[1] https://postgr.es/m/MEYPR01MB7101CD5DA0A07C9DE2B74850A4239@MEYPR01MB7101.ausprd01.prod.outlook.com


On Wed, Oct 12, 2022 at 5:19 PM David Rowley <dgrowleyml@gmail.com> wrote:
When all the distinct pathkeys are redundant, there can be at most a
single distinct value. There may be many rows with that value, but we
can remove the extra ones with a LIMIT 1 rather than troubling over
needlessly uniquifying them.
 
I'm not sure if this case is common enough in practice, but since this
patch is very straightforward and adds no extra cost, I think it's
worth doing.

I also have concerns about the 2 Limit nodes pointed out by the comment
inside the patch. Maybe we can check with limit_needed() and manually
add the Limit node only if there is no LIMIT clause in the original query?

Thanks
Richard

On Wed, Oct 12, 2022 at 5:19 PM David Rowley <dgrowleyml@gmail.com> wrote:
When all the distinct pathkeys are redundant, there can be at most a
single distinct value. There may be many rows with that value, but we
can remove the extra ones with a LIMIT 1 rather than troubling over
needlessly uniquifying them.

This might not be a hugely common case, but: 1) it is very cheap to
detect, and 2) the speedups are likely to be *very* good.

With the attached we get:

regression=# explain (analyze, costs off, timing off) SELECT DISTINCT
four,1,2,3 FROM tenk1 WHERE four = 0;
                   QUERY PLAN
-------------------------------------------------
 Limit (actual rows=1 loops=1)
   ->  Seq Scan on tenk1 (actual rows=1 loops=1)
         Filter: (four = 0)
 Planning Time: 0.215 ms
 Execution Time: 0.071 ms

Naturally, if we removed the WHERE four = 0, we couldn't optimise this
plan using this method.

I see no reason why this can't work for DISTINCT ON too.
 
For DISTINCT ON, if all the distinct pathkeys are redundant but there
are available sort pathkeys, then for adequately-presorted paths I think
we can also apply this optimization, using a Limit 1 rather than Unique.

regression=# explain (analyze, costs off, timing off) select distinct on (four) * from tenk1 where four = 0 order by four, hundred desc;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   ->  Index Scan Backward using tenk1_hundred on tenk1 (actual rows=1 loops=1)
         Filter: (four = 0)
         Rows Removed by Filter: 300
 Planning Time: 0.165 ms
 Execution Time: 0.458 ms
(6 rows)

Thanks
Richard
On Thu, 13 Oct 2022 at 00:30, Richard Guo <guofenglinux@gmail.com> wrote:
> I also have concerns about the 2 Limit nodes pointed out by the comment
> inside the patch. Maybe we can check with limit_needed() and manually
> add the Limit node only if there is no LIMIT clause in the original query?

I wasn't hugely concerned about this. I think we're a little limited
in what we can actually do about it too.

It seems easy enough to skip adding the LimitPath in
create_final_distinct_paths() if the existing query already has
exactly LIMIT 1. However, if they've written LIMIT 10 or LIMIT
random()*1234 then we must add the LimitPath to ensure we only get 1
row rather than 10 or some random number.

As for getting rid of the LIMIT 10 / LIMIT random()*1234, we store the
LIMIT clause information in the parse and currently that's what the
planner uses when creating the LimitPath for the LIMIT clause.  I'd
quite like to avoid making any adjustments to the parse fields here.
(There's a general project desire to move away from the planner
modifying the parse. If the planner never modified the parse, we could
do things like re-plan queries with stronger optimization levels when
they come out too costly.)

We could do something like set some bool flag in PlannerInfo to tell
the planner not to bother adding the final LimitPath as we've already
added another which does the job, but is it really worth adding that
complexity for this patch? You already mentioned that "this patch is
very straightforward". I don't think it would be if we added code to
avoid the LimitPath duplication.

David



On Thu, 13 Oct 2022 at 01:13, Richard Guo <guofenglinux@gmail.com> wrote:
> For DISTINCT ON, if all the distinct pathkeys are redundant but there
> are available sort pathkeys, then for adequately-presorted paths I think
> we can also apply this optimization, using a Limit 1 rather than Unique.
>
> regression=# explain (analyze, costs off, timing off) select distinct on (four) * from tenk1
> where four = 0 order by four, hundred desc;

>                                    QUERY PLAN
> --------------------------------------------------------------------------------
>  Limit (actual rows=1 loops=1)
>    ->  Index Scan Backward using tenk1_hundred on tenk1 (actual rows=1 loops=1)
>          Filter: (four = 0)
>          Rows Removed by Filter: 300

I don't think we can optimise this case, at least not the same way I'm
doing it in the patch I attached.

The problem is that I've only added the LimitPath to the
cheapest_total_path.  I think to make your case work we'd need to add
the LimitPath only in cases where the distinct_pathkeys are empty but
the sort_pathkeys are not, hasDistinctOn is true, and the path
satisfies pathkeys_contained_in(root->sort_pathkeys, path->pathkeys).
I think that's doable, but it becomes quite a bit more complex than
the patch I proposed. Maybe it's worth a 2nd effort for that part?

David




On Thu, Oct 13, 2022 at 4:41 AM David Rowley <dgrowleyml@gmail.com> wrote:
We could do something like set some bool flag in PlannerInfo to tell
the planner not to bother adding the final LimitPath as we've already
added another which does the job, but is it really worth adding that
complexity for this patch? You already mentioned that "this patch is
very straightforward". I don't think it would be if we added code to
avoid the LimitPath duplication.
 
Yeah, maybe this is the right way to do it. I agree that this would
complicate the code. Not sure if it's worth doing.

Thanks
Richard

On Thu, Oct 13, 2022 at 4:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
The problem is that I've only added the LimitPath to the
cheapest_total_path.  I think to make your case work we'd need to add
the LimitPath only in cases where the distinct_pathkeys are empty but
the sort_pathkeys are not, hasDistinctOn is true, and the path
satisfies pathkeys_contained_in(root->sort_pathkeys, path->pathkeys).
I think that's doable, but it becomes quite a bit more complex than
the patch I proposed. Maybe it's worth a 2nd effort for that part?
 
Currently in the patch the optimization is done before we check for
presorted paths or do the explicit sort of the cheapest path. How about
we move this optimization into the branch where we've found any
presorted paths?  Maybe something like:

--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4780,11 +4780,24 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,

  if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
  {
-     add_path(distinct_rel, (Path *)
-              create_upper_unique_path(root, distinct_rel,
-                                       path,
-                                       list_length(root->distinct_pathkeys),
-                                       numDistinctRows));
+     if (root->distinct_pathkeys == NIL)
+     {
+         Node       *limitCount = (Node *) makeConst(INT8OID, -1, InvalidOid,
+                                                     sizeof(int64),
+                                                     Int64GetDatum(1), false,
+                                                     FLOAT8PASSBYVAL);
+
+         add_path(distinct_rel, (Path *)
+                  create_limit_path(root, distinct_rel,
+                                    path, NULL, limitCount,
+                                    LIMIT_OPTION_COUNT, 0, 1));
+     }
+     else
+         add_path(distinct_rel, (Path *)
+                  create_upper_unique_path(root, distinct_rel,
+                                           path,
+                                           list_length(root->distinct_pathkeys),
+                                           numDistinctRows));

This again makes the code less 'straightforward', just to cover an
uncommon case. I'm also not sure if it's worth doing.

Thanks
Richard
On Thu, 13 Oct 2022 at 16:47, Richard Guo <guofenglinux@gmail.com> wrote:
> Currently in the patch the optimization is done before we check for
> presorted paths or do the explicit sort of the cheapest path. How about
> we move this optimization into the branch where we've found any
> presorted paths?  Maybe something like:

I've attached a patch to that effect, but it turned out a bit more
complex than what you had imagined.  We still need to handle the case
where there's no path that has the required pathkeys and we must add a
SortPath to the cheapest path. That requires adding some similar code
to add the LimitPath after the foreach loop over the pathlist is over.

I was also getting some weird plans like:

regression=# explain select distinct on (four) four,hundred from tenk1
where four=0 order by 1,2;
                              QUERY PLAN
----------------------------------------------------------------------
 Sort  (cost=0.20..0.20 rows=1 width=8)
   Sort Key: hundred
   ->  Limit  (cost=0.00..0.19 rows=1 width=8)
         ->  Seq Scan on tenk1  (cost=0.00..470.00 rows=2500 width=8)
               Filter: (four = 0)

To stop the planner from adding that final sort, I opted to hack the
LimitPath's pathkeys to say that it's already sorted by the
PlannerInfo's sort_pathkeys.  That feels slightly icky, but it does
seem a little wasteful to initialise a sort node on every execution of
the plan to sort a single tuple.

David


On Thu, Oct 13, 2022 at 2:48 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 13 Oct 2022 at 16:47, Richard Guo <guofenglinux@gmail.com> wrote:
> Currently in the patch the optimization is done before we check for
> presorted paths or do the explicit sort of the cheapest path. How about
> we move this optimization into the branch where we've found any
> presorted paths?  Maybe something like:

I've attached a patch to that effect, but it turned out a bit more
complex than what you had imagined.  We still need to handle the case
where there's no path that has the required pathkeys and we must add a
SortPath to the cheapest path. That requires adding some similar code
to add the LimitPath after the foreach loop over the pathlist is over.
 
Thanks for the new patch. Previously I had considered applying this
optimization only to adequately-presorted paths, so that we could just
fetch the first row from such a path. But yes, we can also do this
optimization for the explicit-sort case, so that we get the result
from a top-1 heapsort, just like the new patch does.
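Conceptually, a bounded sort with bound = 1 never needs to sort
anything: it just tracks the current best row. A standalone sketch
(hypothetical names, with plain integers standing in for sort keys in
ascending order) of what a Sort -> Limit 1 plan effectively computes:

```c
#include <stddef.h>

/*
 * Returns the index of the row with the smallest sort key: the single row
 * a Sort followed by Limit 1 would emit, found in one pass without
 * materialising a full sort.
 */
size_t
top1_index(const int *keys, size_t nrows)
{
    size_t      best = 0;

    for (size_t i = 1; i < nrows; i++)
    {
        if (keys[i] < keys[best])
            best = i;
    }
    return best;
}
```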
 

I was also getting some weird plans like:

regression=# explain select distinct on (four) four,hundred from tenk1
where four=0 order by 1,2;
                              QUERY PLAN
----------------------------------------------------------------------
 Sort  (cost=0.20..0.20 rows=1 width=8)
   Sort Key: hundred
   ->  Limit  (cost=0.00..0.19 rows=1 width=8)
         ->  Seq Scan on tenk1  (cost=0.00..470.00 rows=2500 width=8)
               Filter: (four = 0)

To stop the planner from adding that final sort, I opted to hack the
LimitPath's pathkeys to say that it's already sorted by the
PlannerInfo's sort_pathkeys.  That feels slightly icky, but it does
seem a little wasteful to initialise a sort node on every execution of
the plan to sort a single tuple.
 
I don't get how this plan comes out. It doesn't seem correct, because a
Limit node above an unsorted path would give us an unpredictable row. I
tried this query without the hack to LimitPath's pathkeys and I get the
plans below, with and without an index scan:

explain (costs off) select distinct on (four) four,hundred from tenk1
where four=0 order by 1,2;
                     QUERY PLAN
-----------------------------------------------------
 Result
   ->  Limit
         ->  Index Scan using tenk1_hundred on tenk1
               Filter: (four = 0)

explain (costs off) select distinct on (four) four,hundred from tenk1
where four=0 order by 1,2;
            QUERY PLAN
----------------------------------
 Limit
   ->  Sort
         Sort Key: hundred
         ->  Seq Scan on tenk1
               Filter: (four = 0)

Thanks
Richard
On Thu, 13 Oct 2022 at 21:17, Richard Guo <guofenglinux@gmail.com> wrote:
>
> On Thu, Oct 13, 2022 at 2:48 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> To stop the planner from adding that final sort, I opted to hack the
>> LimitPath's pathkeys to say that it's already sorted by the
>> PlannerInfo's sort_pathkeys.  That feels slightly icky, but it does
>> seem a little wasteful to initialise a sort node on every execution of
>> the plan to sort a single tuple.
>
>
> I don't get how this plan comes out. It doesn't seem correct, because a Limit
> node above an unsorted path would give us an unpredictable row.

Actually, you're right. That manual setting of the pathkeys is an
unneeded remnant from a bug I fixed before sending out v2.  It can
just be removed.

I've attached the v3 patch.

David


On Thu, Oct 13, 2022 at 6:43 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 13 Oct 2022 at 21:17, Richard Guo <guofenglinux@gmail.com> wrote:
>
> On Thu, Oct 13, 2022 at 2:48 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> To stop the planner from adding that final sort, I opted to hack the
>> LimitPath's pathkeys to say that it's already sorted by the
>> PlannerInfo's sort_pathkeys.  That feels slightly icky, but it does
>> seem a little wasteful to initialise a sort node on every execution of
>> the plan to sort a single tuple.
>
>
> I don't get how this plan comes out. It doesn't seem correct, because a Limit
> node above an unsorted path would give us an unpredictable row.

Actually, you're right. That manual setting of the pathkeys is an
unneeded remnant from a bug I fixed before sending out v2.  It can
just be removed.

I've attached the v3 patch.
 
The v3 patch looks good to me.

Thanks
Richard
On Fri, 14 Oct 2022 at 15:15, Richard Guo <guofenglinux@gmail.com> wrote:
> The v3 patch looks good to me.

Thank you for looking at that.

One other thought I had about the duplicate "Limit" node in the final
plan was that we could make the limit clause an Expr like
LEAST(<existing limit clause>, 1).  That way we could ensure we get at
most 1 row, but perhaps fewer if the expression given in the LIMIT
clause evaluated to 0. This will still work correctly when the
existing limit evaluates to NULL. I'm just not that keen on this idea,
as it means having to either edit the parse's limitCount or store the
limit details in a new field in PlannerInfo and use that when making
the final LimitPath. However, I'm still not sure doing this is worth
the extra complexity.
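As a standalone sketch of those LEAST(<existing limit clause>, 1)
semantics (the NullableInt64 type is hypothetical, not the planner's
datum representation):

```c
#include <stdbool.h>
#include <stdint.h>

/* A nullable bigint standing in for an evaluated LIMIT expression;
 * LIMIT NULL means "no limit" in SQL. */
typedef struct
{
    bool    isnull;
    int64_t value;
} NullableInt64;

/*
 * Models LEAST(existing_limit, 1).  SQL's LEAST() ignores NULL arguments,
 * so a NULL (unlimited) LIMIT collapses to 1, LIMIT 0 stays 0, and any
 * larger limit is capped at 1.
 */
NullableInt64
least_limit(NullableInt64 existing)
{
    NullableInt64 result = {false, 1};

    if (!existing.isnull && existing.value < 1)
        result.value = existing.value;
    return result;
}
```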

If nobody else has any thoughts on this or the patch in general, then
I plan to push it in the next few days.

David




On Wed, Oct 26, 2022 at 4:25 PM David Rowley <dgrowleyml@gmail.com> wrote:
One other thought I had about the duplicate "Limit" node in the final
plan was that we could make the limit clause an Expr like
LEAST(<existing limit clause>, 1).  That way we could ensure we get at
most 1 row, but perhaps fewer if the expression given in the LIMIT
clause evaluated to 0. This will still work correctly when the
existing limit evaluates to NULL. I'm just not that keen on this idea,
as it means having to either edit the parse's limitCount or store the
limit details in a new field in PlannerInfo and use that when making
the final LimitPath. However, I'm still not sure doing this is worth
the extra complexity.
 
I find the duplicate "Limit" node not that concerning after realizing
it can appear in other queries too, such as

explain (analyze, timing off, costs off)
select * from (select * from (select * from generate_series(1,100)i limit 10) limit 5) limit 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit (actual rows=1 loops=1)
   ->  Limit (actual rows=1 loops=1)
         ->  Limit (actual rows=1 loops=1)
               ->  Function Scan on generate_series i (actual rows=1 loops=1)

The situation is different in that those Limit nodes are actually atop
SubqueryScans which are removed afterwards, but the final plan still
appears as a Limit node atop another Limit node.

So I wonder maybe we can just live with it, or resolve it in a separate
patch.

Thanks
Richard
On Thu, 27 Oct 2022 at 15:50, Richard Guo <guofenglinux@gmail.com> wrote:
> I find the duplicate "Limit" node is not that concerning after I realize
> it may appear in other queries, such as
>
> explain (analyze, timing off, costs off)
> select * from (select * from (select * from generate_series(1,100)i limit 10) limit 5) limit 1;

Yeah, the additional limits certainly are not incorrect.  We could
maybe do something better; there just does not seem to be much point.

Perhaps fixing things like this would be better done with the
UniqueKey stuff that Andy Fan and I were working on a while back.
With LIMIT 1 everything would become unique and there'd be no need to
add another LimitPath.

I've now pushed the patch after making a small adjustment to one of
the comments, which mentions rows being "indistinguishable by an
equality check".  That made me think of the '-0.0'::float8 vs +0.0
case again, and I thought I'd better mention it for this patch.

Thanks for reviewing the patch.

David