Fixing the representation of ORDER BY/GROUP BY/DISTINCT - Mailing list pgsql-hackers

From Tom Lane
Subject Fixing the representation of ORDER BY/GROUP BY/DISTINCT
Date
Msg-id 25811.1217552089@sss.pgh.pa.us
Whole thread Raw
Responses Re: Fixing the representation of ORDER BY/GROUP BY/DISTINCT  (Gregory Stark <stark@enterprisedb.com>)
Re: Fixing the representation of ORDER BY/GROUP BY/DISTINCT  (Gregory Stark <stark@enterprisedb.com>)
Re: Fixing the representation of ORDER BY/GROUP BY/DISTINCT  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
So while I was fooling with Steve Midgley's problem I got a bit of a bee
in my bonnet about the way that the parser emits ORDER BY, GROUP BY,
and DISTINCT lists.

* Currently, ORDER BY and DISTINCT use lists of SortClause, while GROUP
BY is a list of GroupClause --- but these are actually the same struct,
and there's a fair amount of code that relies on that.  The advantage of
them being the same is that it's easy to compare them when considering
sort-and-uniq-style grouping plans.  Except it's not easy enough: I
tried to use a list_member test in one place, and of course it didn't
work because equal() never sees distinct node tags as equal.  So I'm
thinking we ought to unify the two node types logically as well as
physically, and just have one node type (SortGroupClause, maybe).

* The current representation is fine for ORDER BY but leaves something
to be desired for GROUP BY and DISTINCT: there isn't anyplace to specify
the equality operator for a hash-based grouping operation.  This results
in repeat lookups in the planner to fetch information that was readily
available at parse time.  But what's worse IMHO is that we simply cannot
represent a grouping query for a datatype that hasn't got a btree sort
opclass --- even though we could implement it, by means of hashing, if
the type has a hashable equality operator.  (This isn't academic: it's
easy to imagine datatypes that have equality but no natural linear sort
order.  A practical example is XID, which in fact has a hash opclass
but not a btree opclass, because it violates the law of transitivity.)
So what I'm thinking we want is something like

typedef struct SortGroupClause
{   NodeTag     type;   Index       tleSortGroupRef;    /* reference into targetlist */   Oid         eqop;
 /* the equality operator ('=' op) */   Oid         sortop;             /* the ordering operator ('<' op), or 0 */
bool       nulls_first;        /* do NULLs come before normal values? */
 
} SortGroupClause;

In an ORDER BY item the sortop and nulls_first flag *must* be supplied.
The eqop isn't really useful for ORDER BY, but it's trivial to get when
we get the sortop, and always including it simplifies comparisons to
GROUP/DISTINCT items.  In GROUP BY/DISTINCT items, the eqop *must* be
supplied, and if it is a btree equality operator then the associated
sortop should be given as well.  We'd continue the current practice of
duplicating the ordering properties of any ORDER BY clause given for the
same targetlist item, so that compatible ordering and grouping items are
just equal().

* Another thing I've gotten tired of is the difficulty of telling
DISTINCT from DISTINCT ON in the parsed representation.  Surely we can
afford to stick another bool field into Query to make that distinction
unambiguous.  This is important for making the world safe for hashed
DISTINCT, since AFAICS we probably can't ever use hashing for DISTINCT
ON --- its definition is too dependent on the assumption of sorting.

Barring objections, I'm going to go off and make this happen.  It won't
immediately result in supporting hashed DISTINCT, but it's another
necessary step on the road to that.
        regards, tom lane


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Review: DTrace probes (merged version) ver_03
Next
From: ITAGAKI Takahiro
Date:
Subject: Re: Copy storage parameters on CREATE TABLE LIKE/INHERITS