Thread: [PATCH] Remove useless distinct clauses

[PATCH] Remove useless distinct clauses

From

Pierre Ducroquet

Date:

31 July 2020, 08:41:19

Hi

In a recent audit, I noticed that application developers have a tendency to 
abuse the distinct clause. For instance they use an ORM and add a distinct at 
the top level just because they don't know the cost it has, or they don't know 
that using EXISTS is a better way to express their queries than doing JOINs 
(or worse, they can't do better).

They thus have this kind of queries (considering tbl1 has a PK of course):
SELECT DISTINCT * FROM tbl1;
SELECT DISTINCT * FROM tbl1 ORDER BY a;
SELECT DISTINCT tbl1.* FROM tbl1
    JOIN tbl2 ON tbl2.a = tbl1.id;

These can be transformed into:
SELECT * FROM tbl1 ORDER BY *;
SELECT * FROM tbl1 ORDER BY a;
SELECT tbl1.* FROM tbl1 SEMI-JOIN tbl2 ON tbl2.a = tbl1.id ORDER BY tbl1.*;

The attached patch does that.
I added extra safeties in several place just to be sure I don't touch 
something I can not handle, but I may have been very candid with the distinct 
to sort transformation.
The cost of this optimization is quite low : for queries that don't have any 
distinct, it's just one if. If there is a distinct, we iterate once through 
every target, then we fetch the PK and iterate through the DISTINCT clause 
fields. If it is possible to optimize, we then iterate through the JOINs.

Any comment on this would be more than welcome!

Regards

 Pierre

Attachment

0001-Add-a-new-remove_useless_distinct-function-in-planne.patch

Re: [PATCH] Remove useless distinct clauses

From

Ashutosh Bapat

Date:

31 July 2020, 09:36:31

Hi Pierre,

On Fri, Jul 31, 2020 at 2:11 PM Pierre Ducroquet <p.psql@pinaraf.info> wrote:
>
> Hi
>
> In a recent audit, I noticed that application developers have a tendency to
> abuse the distinct clause. For instance they use an ORM and add a distinct at
> the top level just because they don't know the cost it has, or they don't know
> that using EXISTS is a better way to express their queries than doing JOINs
> (or worse, they can't do better).
>
> They thus have this kind of queries (considering tbl1 has a PK of course):
> SELECT DISTINCT * FROM tbl1;
> SELECT DISTINCT * FROM tbl1 ORDER BY a;
> SELECT DISTINCT tbl1.* FROM tbl1
>         JOIN tbl2 ON tbl2.a = tbl1.id;
>
> These can be transformed into:
> SELECT * FROM tbl1 ORDER BY *;

We don't need an ORDER BY here since there's primary key on tbl1 and
DISTINCT doesn't ensure ordered result.

> SELECT * FROM tbl1 ORDER BY a;
> SELECT tbl1.* FROM tbl1 SEMI-JOIN tbl2 ON tbl2.a = tbl1.id ORDER BY tbl1.*;
>
> The attached patch does that.
> I added extra safeties in several place just to be sure I don't touch
> something I can not handle, but I may have been very candid with the distinct
> to sort transformation.
> The cost of this optimization is quite low : for queries that don't have any
> distinct, it's just one if. If there is a distinct, we iterate once through
> every target, then we fetch the PK and iterate through the DISTINCT clause
> fields. If it is possible to optimize, we then iterate through the JOINs.

We are already discussing this feature at

https://www.postgresql.org/message-id/flat/CAKJS1f-wH83Fi2coEVNUWFxOGQ4BJRRTGqDMvidCoiR9WEwxsw%40mail.gmail.com#56a08b441cc61afaf85c6232c5d40a3f.
You are welcome to contribute your ideas/code/review on that thread.

-- 
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Remove useless distinct clauses

From

David Rowley

Date:

15 September 2020, 10:57:04

On Fri, 31 Jul 2020 at 20:41, Pierre Ducroquet <p.psql@pinaraf.info> wrote:
>
> In a recent audit, I noticed that application developers have a tendency to
> abuse the distinct clause. For instance they use an ORM and add a distinct at
> the top level just because they don't know the cost it has, or they don't know
> that using EXISTS is a better way to express their queries than doing JOINs
> (or worse, they can't do better).
>
> They thus have this kind of queries (considering tbl1 has a PK of course):
> SELECT DISTINCT * FROM tbl1;
> SELECT DISTINCT * FROM tbl1 ORDER BY a;
> SELECT DISTINCT tbl1.* FROM tbl1
>         JOIN tbl2 ON tbl2.a = tbl1.id;

This is a common anti-pattern that I used to see a couple of jobs ago.
What seemed to happen was that someone would modify some query or a
view to join in an additional table to fetch some information that was
now required.  At some later time, there'd be a bug report to say that
the query is returning certain records more than once.  The
developer's solution was to add DISTINCT, instead of figuring out that
the join that was previously added missed some column from the join
clause.

> These can be transformed into:
> SELECT * FROM tbl1 ORDER BY *;
> SELECT * FROM tbl1 ORDER BY a;
> SELECT tbl1.* FROM tbl1 SEMI-JOIN tbl2 ON tbl2.a = tbl1.id ORDER BY tbl1.*;
>
> The attached patch does that.

Unfortunately, there are quite a few issues with what you have:

First off, please see
https://www.postgresql.org/docs/devel/source-format.html about how we
format the source code.  Please pay attention to how we do code
comments and braces on a separate line.

Another problem is that we shouldn't be really wiping out the distinct
clause like you are with "root->parse->distinctClause = NULL;" there's
some discussion in [1] about that.

Also, the processing of the join tree where you switch inner joins to
semi joins looks broken. This would require much more careful and
recursive processing to do properly.  However, I'm not really sure
what that is as I'm not sure of all the cases that you can optimise
this way, and more importantly, which ones you can't.  There's also no
hope of anyone else knowing this as you've not left any comments about
why what you're doing is valid.

If you want an example of what can cause what you have to brake:

create table t (a int primary key);
explain select distinct a from t cross join pg_class cross join pg_attribute;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.15..37682.53 rows=999600 width=4)
   ->  Nested Loop  (cost=0.15..12595.31 rows=999600 width=4)
         ->  Index Only Scan using t_pkey on t  (cost=0.15..82.41
rows=2550 width=4)
         ->  Materialize  (cost=0.00..18.88 rows=392 width=0)
               ->  Seq Scan on pg_class  (cost=0.00..16.92 rows=392 width=0)
   ->  Materialize  (cost=0.00..105.17 rows=3145 width=0)
         ->  Seq Scan on pg_attribute  (cost=0.00..89.45 rows=3145 width=0)
(7 rows)

-- Note the join to pg_attribute remains a cross join.
insert into t values(1);
-- the following should only return 1 row. It returns many more than that.
select distinct a from t cross join pg_class cross join pg_attribute;

I can't figure out why you're doing this either:

+ /**
+ * If there was no sort clause, we change the distinct into a sort clause.
+ */
+ if (!root->parse->sortClause)
+ root->parse->sortClause = root->parse->distinctClause;

It's often better to say "why" rather than "what" when it comes to
code comments. It's pretty easy to see "what". It's the "why" part
that people more often get stuck on. Although, sometimes what you're
doing is complex and it does need a mention of "what". That's not the
case for the above though.

David

[1] https://www.postgresql.org/message-id/CAExHW5t7ALZmaN8gL5DZV%2Ben5G%3D4QTbKSYhBrXnSrKgCYNr_AA%40mail.gmail.com

Re: [PATCH] Remove useless distinct clauses

From

Bruce Momjian

Date:

20 September 2020, 00:46:49

On Tue, Sep 15, 2020 at 10:57:04PM +1200, David Rowley wrote:
> On Fri, 31 Jul 2020 at 20:41, Pierre Ducroquet <p.psql@pinaraf.info> wrote:
> >
> > In a recent audit, I noticed that application developers have a tendency to
> > abuse the distinct clause. For instance they use an ORM and add a distinct at
> > the top level just because they don't know the cost it has, or they don't know
> > that using EXISTS is a better way to express their queries than doing JOINs
> > (or worse, they can't do better).
> >
> > They thus have this kind of queries (considering tbl1 has a PK of course):
> > SELECT DISTINCT * FROM tbl1;
> > SELECT DISTINCT * FROM tbl1 ORDER BY a;
> > SELECT DISTINCT tbl1.* FROM tbl1
> >         JOIN tbl2 ON tbl2.a = tbl1.id;
> 
> This is a common anti-pattern that I used to see a couple of jobs ago.
> What seemed to happen was that someone would modify some query or a
> view to join in an additional table to fetch some information that was
> now required.  At some later time, there'd be a bug report to say that
> the query is returning certain records more than once.  The
> developer's solution was to add DISTINCT, instead of figuring out that
> the join that was previously added missed some column from the join
> clause.

I can 100% imagine that happening.  :-(

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: [PATCH] Remove useless distinct clauses

From

Michael Paquier

Date:

30 September 2020, 07:07:42

On Tue, Sep 15, 2020 at 10:57:04PM +1200, David Rowley wrote:
> Unfortunately, there are quite a few issues with what you have:

This review has not been answered after two weeks, so this is marked
as RwF.
--
Michael

Attachment

signature.asc