Thread: Unclear guarantees about sort order on https://www.postgresql.org/docs/current/queries-order.html

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/16/queries-order.html
Description:

The document only says this about unsorted queries:

> After a query has produced an output table (after the select list has been
processed) it can optionally be sorted. If sorting is not chosen, the rows
will be returned in an unspecified order. The actual order in that case will
depend on the scan and join plan types and the order on disk, but it must
not be relied on. A particular output ordering can only be guaranteed if the
sort step is explicitly chosen.

It mentions "If sorting is not chosen". This sort of implies that if you
pick a sort the output order is predictable. However I believe that the only
actual guarantee is if the sort columns selected produce a unique value.

For example if you do `ORDER BY name` and have two rows with the same name I
don't think the order of those rows is predictable.

I think the docs should be updated to either:

1. Clearly state that the order **is** consent as long as any sort clause is
present, and specify what that order is.
2. Update the quoted sentence to refer to "If sorting is not chosen or the
sort columns do not form a unique key" instead of just "If sorting is not
chosen".

On 2023-10-04 16:24 +0200, PG Doc comments form write:
> The following documentation comment has been logged on the website:
> 
> Page: https://www.postgresql.org/docs/16/queries-order.html
> Description:
> 
> The document only says this about unsorted queries:
> 
> > After a query has produced an output table (after the select list has been
> > processed) it can optionally be sorted. If sorting is not chosen, the rows
> > will be returned in an unspecified order. The actual order in that case will
> > depend on the scan and join plan types and the order on disk, but it must
> > not be relied on. A particular output ordering can only be guaranteed if the
> > sort step is explicitly chosen.
> 
> It mentions "If sorting is not chosen". This sort of implies that if you
> pick a sort the output order is predictable. However I believe that the only
> actual guarantee is if the sort columns selected produce a unique value.
> 
> For example if you do `ORDER BY name` and have two rows with the same name I
> don't think the order of those rows is predictable.

Right, without an explicit sorting the order of rows is unpredictable.
When only sorting over some columns/expressions then the ordering is
only predictable in those columns/expressions.  The order of rows with
ties within the same sorted group is also unpredictable.  This is also
implied on the same page after the first example: "When more than one
expression is specified, the later values are used to sort rows that are
equal according to the earlier values."  This implies that without later
values, those rows remain in unpredictable order.

The SQL standards linked by [1] provide some definitions:

SQL:2003, Part 2, Annex C, 26b:

"The relative ordering of two rows that are not distinct with respect to
 the <sort specification> is implementation-dependent."

SQL:2011, Part 2, Annex C, 24e:

"If a <query expression> immediately contains an <order by clause> and
 a group of two or more rows in the table specified by that <query
 expression> contain values that are not distinct in all sort keys
 specified in the <order by clause>, then the ordering of these rows in
 that group is implementation-dependent."

I find the first one quite succinct.

> I think the docs should be updated to either:
> 
> 1. Clearly state that the order **is** consent as long as any sort clause is
> present, and specify what that order is.

What do you mean with "any sort clause"?  Any sort clause at all?  Or a
sort clause that covers all columns?  If the order should be predictable
it must be specified by the client to some degree.  And if the client
does not care about rows with ties than he should not be required to
specify a more specific sorting.

> 2. Update the quoted sentence to refer to "If sorting is not chosen or the
> sort columns do not form a unique key" instead of just "If sorting is not
> chosen".

I think "unique key" is misleading in this case because sorting still
leaves duplicates.  I'd go with something that mentions "sorted groups"
and "tie breaks" or some form of the quoted SQL:2003 definition.

[1] https://wiki.postgresql.org/wiki/Developer_FAQ#Where_can_I_get_a_copy_of_the_SQL_standards.3F

-- 
Erik



On Wed, Oct 4, 2023 at 6:37 PM Erik Wienhold <ewie@ewie.name> wrote:
On 2023-10-04 16:24 +0200, PG Doc comments form write:
> The following documentation comment has been logged on the website:
>
> Page: https://www.postgresql.org/docs/16/queries-order.html
> Description:
>
> The document only says this about unsorted queries:
>
> > After a query has produced an output table (after the select list has been
> > processed) it can optionally be sorted. If sorting is not chosen, the rows
> > will be returned in an unspecified order. The actual order in that case will
> > depend on the scan and join plan types and the order on disk, but it must
> > not be relied on. A particular output ordering can only be guaranteed if the
> > sort step is explicitly chosen.
>
> It mentions "If sorting is not chosen". This sort of implies that if you
> pick a sort the output order is predictable. However I believe that the only
> actual guarantee is if the sort columns selected produce a unique value.
>
> For example if you do `ORDER BY name` and have two rows with the same name I
> don't think the order of those rows is predictable.

"The relative ordering of two rows that are not distinct with respect to
 the <sort specification> is implementation-dependent."

The OP is assuming a promise of a deterministic ordering of all output rows and such a promise is only possible if the order by clause columns uniquely identify every row in the output.  This is because all the order by promises is that output ordering will conform to the order by specification, and indeed if it is under-specified such that multiple rows match a given bin, then there is no deterministic relative ordering among those rows.

I don't feel that the wording makes any such inference regarding determinism of row output due to the mere presence of an order by clause.  Nor doesn't such determinism in the face of an under-specific clause even make logical sense.  I'm mostly inclined to leave the wording alone given this single report.  My only complaints are style-istic at this point.

That said, maybe a final sentence:

Assuming every output row can be uniquely identified by some subset of the output columns, that subset must all be listed within the order by clause if you wish to ensure a fully deterministic ordering.

David J.