Thread: Fixing the representation of ORDER BY/GROUP BY/DISTINCT

Fixing the representation of ORDER BY/GROUP BY/DISTINCT

From
Tom Lane
Date:
So while I was fooling with Steve Midgley's problem I got a bit of a bee
in my bonnet about the way that the parser emits ORDER BY, GROUP BY,
and DISTINCT lists.

* Currently, ORDER BY and DISTINCT use lists of SortClause, while GROUP
BY is a list of GroupClause --- but these are actually the same struct,
and there's a fair amount of code that relies on that.  The advantage of
them being the same is that it's easy to compare them when considering
sort-and-uniq-style grouping plans.  Except it's not easy enough: I
tried to use a list_member test in one place, and of course it didn't
work because equal() never sees distinct node tags as equal.  So I'm
thinking we ought to unify the two node types logically as well as
physically, and just have one node type (SortGroupClause, maybe).

* The current representation is fine for ORDER BY but leaves something
to be desired for GROUP BY and DISTINCT: there isn't anyplace to specify
the equality operator for a hash-based grouping operation.  This results
in repeat lookups in the planner to fetch information that was readily
available at parse time.  But what's worse IMHO is that we simply cannot
represent a grouping query for a datatype that hasn't got a btree sort
opclass --- even though we could implement it, by means of hashing, if
the type has a hashable equality operator.  (This isn't academic: it's
easy to imagine datatypes that have equality but no natural linear sort
order.  A practical example is XID, which in fact has a hash opclass
but not a btree opclass, because it violates the law of transitivity.)
So what I'm thinking we want is something like

typedef struct SortGroupClause
{   NodeTag     type;   Index       tleSortGroupRef;    /* reference into targetlist */   Oid         eqop;
 /* the equality operator ('=' op) */   Oid         sortop;             /* the ordering operator ('<' op), or 0 */
bool       nulls_first;        /* do NULLs come before normal values? */
 
} SortGroupClause;

In an ORDER BY item the sortop and nulls_first flag *must* be supplied.
The eqop isn't really useful for ORDER BY, but it's trivial to get when
we get the sortop, and always including it simplifies comparisons to
GROUP/DISTINCT items.  In GROUP BY/DISTINCT items, the eqop *must* be
supplied, and if it is a btree equality operator then the associated
sortop should be given as well.  We'd continue the current practice of
duplicating the ordering properties of any ORDER BY clause given for the
same targetlist item, so that compatible ordering and grouping items are
just equal().

* Another thing I've gotten tired of is the difficulty of telling
DISTINCT from DISTINCT ON in the parsed representation.  Surely we can
afford to stick another bool field into Query to make that distinction
unambiguous.  This is important for making the world safe for hashed
DISTINCT, since AFAICS we probably can't ever use hashing for DISTINCT
ON --- its definition is too dependent on the assumption of sorting.

Barring objections, I'm going to go off and make this happen.  It won't
immediately result in supporting hashed DISTINCT, but it's another
necessary step on the road to that.
        regards, tom lane


Re: Fixing the representation of ORDER BY/GROUP BY/DISTINCT

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> This is important for making the world safe for hashed DISTINCT, since
> AFAICS we probably can't ever use hashing for DISTINCT ON --- its definition
> is too dependent on the assumption of sorting.

I don't think that's true. We could store the sort key in the hash along with
the resulting tuple and replace the resulting tuple iff the new sort key is
less than the old sort key.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: Fixing the representation of ORDER BY/GROUP BY/DISTINCT

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> typedef struct SortGroupClause
> {
>     NodeTag     type;
>     Index       tleSortGroupRef;    /* reference into targetlist */
>     Oid         eqop;               /* the equality operator ('=' op) */
>     Oid         sortop;             /* the ordering operator ('<' op), or 0 */
>     bool        nulls_first;        /* do NULLs come before normal values? */
> } SortGroupClause;

So ASC/DESC is represented by using > for sortop? 

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: Fixing the representation of ORDER BY/GROUP BY/DISTINCT

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> So ASC/DESC is represented by using > for sortop? 

Yes --- that's always been true.
        regards, tom lane


Re: Fixing the representation of ORDER BY/GROUP BY/DISTINCT

From
Martijn van Oosterhout
Date:
On Thu, Jul 31, 2008 at 08:54:49PM -0400, Tom Lane wrote:
> So while I was fooling with Steve Midgley's problem I got a bit of a bee
> in my bonnet about the way that the parser emits ORDER BY, GROUP BY,
> and DISTINCT lists.

There's an open TODO item in this area: namely that a GROUP BY referring
to a primary key is equivalent to a GROUP BY involving all the rest of
the fields. Now, I don't think anyone has proposed a way to support
that, but I wanted to check we arn't making it harder to do with these
changes...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Fixing the representation of ORDER BY/GROUP BY/DISTINCT

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Thu, Jul 31, 2008 at 08:54:49PM -0400, Tom Lane wrote:
>> So while I was fooling with Steve Midgley's problem I got a bit of a bee
>> in my bonnet about the way that the parser emits ORDER BY, GROUP BY,
>> and DISTINCT lists.

> There's an open TODO item in this area: namely that a GROUP BY referring
> to a primary key is equivalent to a GROUP BY involving all the rest of
> the fields. Now, I don't think anyone has proposed a way to support
> that, but I wanted to check we arn't making it harder to do with these
> changes...

No, these changes shouldn't make any difference for that.  (Offhand it
seems like that might just be a simple modification in the code that
checks for improper use of ungrouped fields.  Not sure though.)
        regards, tom lane