Re: Removing useless DISTINCT clauses - Mailing list pgsql-hackers

From David Rowley
Subject Re: Removing useless DISTINCT clauses
Date
Msg-id CAKJS1f862Fh+v0Aw1=h7bOz0j1qBp-dozXrGOz7pkXVJ5BNx_w@mail.gmail.com
Whole thread Raw
In response to Re: Removing useless DISTINCT clauses  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Removing useless DISTINCT clauses
List pgsql-hackers
On 24 August 2018 at 14:12, Stephen Frost <sfrost@snowman.net> wrote:
> This is happening at the same time as some optimizations around GROUP
> BY, so either there's something different about what's happening there
> and I didn't appreciate it, or does that optimization suffer from a
> similar issue?

There are two separate optimisations in Jim's version of the patch:

1) Attempting to remove useless DISTINCT clause items in cases where a
true subset of the clauses make up the entire table's primary key.
This can be applied no matter how many joins are in the query.  If a
table's PK is a,b then it's not going to be any more distinct if we
distinctify on a,b,c. Distification on a,b is the most we'll never
need.  We already apply this optimisation to the GROUP BY clause.

2) Attempt to remove the entire DISTINCT clause if the query is to a
single base table when the DISTINCT clause contains the entire table's
PK. Here there's need uniquify the results.

I started this thread for #1 and Jim would like to add #2 as an
additional optimisation.  I feel that #2 is different enough from #1
that they should be considered independently of each other. The area
where #2 should be implemented is far away at the other end of
planning from where #1 is being done.  I believe that two separate
patches are the best way forward here as it does not really make sense
to reject #1 because there are concerns with #2, vice versa.  I think
if Jim would like to propose #2, then I think a separate thread is a
better option than tagging along on this one. This might also reduce
confusion.

I'd also rather not turn this thread into people discussing/reviewing
Jim's work that he plans to implement in some fork taken from Postgres
10.5. I don't really think even this mailing list is the right place
for that, let alone this thread.

> > I've not read the patch, but David's reaction makes it sound like its
> > processing is done too early.  There are right places and wrong places
> > to do most everything in the planner, and I do not wish to accept a
> > patch that does something in the wrong place.
>
> Right, I definitely agree with you there.  This seemed like a reasonable
> place given the similar optimization (at least in appearance to me)
> being done there for the GROUP BY case.  I'm happy to admit that I
> haven't looked at it in very much depth (hence my question to David) and
> I'm not an expert in this area, but I did want to bring up that the
> general idea and the relative trade-offs at least sounded reasonable.

You might be confusing #1 and #2. My concern is with #2.  The existing
GROUP BY clause optimisation is almost identical to #1. I just wanted
to also apply it to the DISTINCT clause.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Pluggable Storage - Andres's take
Next
From: Tatsuro Yamada
Date:
Subject: Re: Fix help option of contrib/oid2name