Re: `order by random()` makes select-list `random()` invocations deterministic - Mailing list pgsql-bugs

From Tom Lane
Subject Re: `order by random()` makes select-list `random()` invocations deterministic
Date
Msg-id 2784459.1709261187@sss.pgh.pa.us
Whole thread Raw
In response to Re: `order by random()` makes select-list `random()` invocations deterministic  ("Dian Fay" <di@nmfay.com>)
List pgsql-bugs
"Dian Fay" <di@nmfay.com> writes:
> Thanks, that makes sense! Any thoughts on whether it's worth a
> cautionary note in the `order by` and/or random function docs since
> `order by random()` is a fairly well attested solution and I'm not the
> first person to run into this quirk?

The question is where to put the caution, because it's not like this
is somehow specific to random().

I believe what's fundamentally happening here is that ordinarily,
functions in the targetlist are evaluated during the final projection
step that occurs after the scan/join phase.  So for example:

regression=# explain (verbose, costs off) select f1, random(), random() from int4_tbl;
            QUERY PLAN
----------------------------------
 Seq Scan on public.int4_tbl
   Output: f1, random(), random()
(2 rows)

regression=# select f1, random(), random() from int4_tbl;
     f1      |       random        |       random
-------------+---------------------+---------------------
           0 |  0.6623584085865575 |  0.8611211203466376
      123456 |  0.7371018974566144 |   0.707800598912321
     -123456 |  0.5652778572736816 |  0.2684184354600243
  2147483647 | 0.47747487098101504 |   0.531163579706837
 -2147483647 |  0.3829707208069777 | 0.11094005213737002
(5 rows)

The two textually distinct occurrences of random() are evaluated
separately.  However, if you use that same function as a sort key:

regression=# explain (verbose, costs off) select f1, random(), random() from int4_tbl order by random();
               QUERY PLAN
-----------------------------------------
 Result
   Output: f1, (random()), (random())
   ->  Sort
         Output: f1, (random())
         Sort Key: (random())
         ->  Seq Scan on public.int4_tbl
               Output: f1, random()
(7 rows)

regression=# select f1, random(), random() from int4_tbl order by random();
     f1      |       random        |       random
-------------+---------------------+---------------------
     -123456 | 0.39761234612097884 | 0.39761234612097884
      123456 |  0.7227920193261217 |  0.7227920193261217
           0 |  0.7628733460552672 |  0.7628733460552672
  2147483647 |  0.8745622687164281 |  0.8745622687164281
 -2147483647 |   0.929020084273001 |   0.929020084273001
(5 rows)

Reading that explain plan requires a certain amount of expertise,
but the key point is that the extra parens around the upper-level
"random()" expressions indicate that the value is being copied up
from the subplan rather than being evaluated afresh.  So the need
to include "random()" in what passes through the sort step results
in the final projection just re-using that value in both places,
because the targetlist entries (and subexpressions thereof) are
matched literally against what is coming out of the sort step without
consideration for whether those expressions are volatile and what
should happen if they are.  The same would happen with a GROUP BY
expression.

So that's why it happens like that from an implementation standpoint,
but that doesn't inform us much about what we might like to do
differently.  I think the current behavior arose from wanting to
support SQL92 syntax like

    select random() from mytab order by 1;

Here it should surprise nobody if the output appears in sorted order
--- in fact, I bet people would say it's a bug if it doesn't.
However, SQL99 got rid of that syntax and would have us write

    select random() from mytab order by random();

Here it's very much less obvious whether the two occurrences of
random() are meant to denote the same value.  PG has historically
taken the position that they are, mainly because the SQL92 behavior
clearly has some use and you can't readily get that with the SQL99
syntax unless you read it this way.  However, having taken that
position, it's hard to argue that

    select random(), random() from mytab order by random();

shouldn't result in all three instances meaning the same value.

There's an argument certainly that we should require you to write
it differently if you want that behavior, probably by having just
one random() instance in a sub-select.  But I doubt we'd get a
lot of kudos for changing a behavior that's stood for a couple of
decades.

Maybe we could write something like "An expression or subexpression in
the SELECT list that matches an ORDER BY or GROUP BY item is taken to
represent the same value that was sorted or grouped by, even when the
(sub)expression is volatile".  I'm not sure where to put this though.
(TBH, I'm also not very sure that we honor that in absolutely every
case ...)

            regards, tom lane



pgsql-bugs by date:

Previous
From: "Dian Fay"
Date:
Subject: Re: `order by random()` makes select-list `random()` invocations deterministic
Next
From: Noah Misch
Date:
Subject: Re: Facing issue in installing pg_cron extension in Postgresql 15