Re: MAX/MIN optimization via rewrite (plus query rewrites generally) - Mailing list pgsql-hackers

From Greg Stark
Subject Re: MAX/MIN optimization via rewrite (plus query rewrites generally)
Date
Msg-id 87fz3gc4l2.fsf@stark.xeocode.com
Whole thread Raw
In response to Re: MAX/MIN optimization via rewrite (plus query rewrites generally)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> Oh?  How is a first() aggregate going to know what sort order you want
> >> within the group?
> 
> > It would look something like
> 
> > select x,first(a),first(b) from (select x,a,b from table order by x,y) group by x
> 
> > which is equivalent to
> 
> > select DISTINCT ON (x) x,a,b from table ORDER BY x,y
> 
> No, it is not.  The GROUP BY has no commitment to preserve order ---
> consider for example the possibility that we implement the GROUP BY by
> hashing.

It doesn't matter how the group by is implemented. The only thing that matters
is what order the tuples are presented to the aggregate function. Even hashing
presents them to the aggregate function as they're read in from the subquery.

In fact iirc hashing was why the original user that asked for this behaviour
found that it sometimes did work and wondered why it didn't work when the
planner *didn't* use hashing. You added code to allow it to function properly
for that case as well.

> The fact that an optimization is present does not make it part of the
> guaranteed semantics of the language.
>
> Basically, first() is a broken concept in SQL.  Of course DISTINCT ON
> is broken too for the same reasons, but I do not see that first() is
> one whit less of a kluge than DISTINCT ON.

first() is only a whit less of a kludge than DISTINCT ON in that it covers the
more general case of wanting both first value for each group as well as other
aggregates. Otherwise it's exactly equivalent.

Depending on ordering was a broken concept in the abstract theoretical world
of SQL as originally envisioned. But (unfortunately) we don't really live in
that world any more. There are tons of extensions that Postgres and other
databases provide that do depend on the ordering of record sets. And there are
tons of examples of real business problems that require those extensions.

It may be a broken concept but it's one that's extremely popular. Postgres
implemented DISTINCT ON independently, but mysql also has an equivalent
feature: you can include any column not included in the GROUP BY clause and
mysql will implicitly do the equivalent of first().

Oracle on the other hand has a complete suite of functions for processing
order-dependent data. They even implement lead()/lag() for allowing you to
access the data from subsequent or previous records in the data set.

-- 
greg



pgsql-hackers by date:

Previous
From: Thomas Hallgren
Date:
Subject: Re: GUC custom variables broken
Next
From: Patrick B Kelly
Date:
Subject: Re: multiline CSV fields