Fixing DISTINCT ON for duplicate keys - Mailing list pgsql-hackers

From Tom Lane
Subject Fixing DISTINCT ON for duplicate keys
Date
Msg-id 16673.1217525917@sss.pgh.pa.us
Whole thread Raw
Responses Re: Fixing DISTINCT ON for duplicate keys  (David Fetter <david@fetter.org>)
List pgsql-hackers
I looked into this trouble report:
http://archives.postgresql.org/pgsql-sql/2008-07/msg00123.php

The problem is that by the time we get to transformDistinctClause(),
any duplicate entries in the ORDER BY list have been eliminated
(see addTargetToSortList).  But transformDistinctClause expects
a one-for-one match of the two lists, and so it complains.

Clearly, duplicate DISTINCT ON items are just as redundant as duplicate
ORDER BY items are, and so it seems that suppressing them is a
reasonable thing to do.  But I'm thinking that as long as we're touching
this old code, there are some other things that should be fixed:

* There's not really any semantic significance to the ordering of the
DISTINCT ON list anyway, so it would be reasonable to rearrange the
ordering of the list to match the ORDER BY list, rather than making
the user do it.

* It's really bletcherous that the code physically modifies the
user-given ORDER BY.  This damage is visible in stored rules ---
they don't come out the same way you wrote them.  While I don't
mind the idea of dropping redundant entries, adding ORDER BY entries
that the user never wrote seems bogus.  It overconstrains the query,
in a way that doesn't matter given our current implementation but
could matter in the future.

What I am thinking we could do about the latter is modify the querytree
semantics a bit.  Instead of insisting that the transformed
distinctClause be equal to a prefix of the sortClause, allow either
one to be a prefix of the other.  Then the planner simply takes the
longer one as its internal sort-order target.  With this rule, the
sortClause stays as what the user wrote (less any duplicate keys).
The parser is required to remove any duplicate keys from the
distinctClause and rearrange it if needed so that it has a common
prefix with the sortClause (or throw error if this is not possible).
This would be invisible to the user in plain SELECT DISTINCT, and
in SELECT DISTINCT ON would mean that the list is dumped in a
"canonical" order that matches ORDER BY, but isn't changed in any
semantic way.

Now this is probably too big a change to be prudent to back-patch.
Is it worth coming up with a second patch that just tries to get
transformDistinctClause to remove duplicates?  Since the problem
has existed for a very long time (at least back to 7.0 according
to my testing) with no prior reports, it doesn't seem very important
to fix.  I'm a bit worried about putting a patch into only the
back branches --- it would go out with little testing and so the
odds of introducing a fresh problem seem uncomfortably high compared
to the benefit.

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: pg_regress inputdir
Next
From: "David E. Wheeler"
Date:
Subject: Re: Type Categories for User-Defined Types