Thread: Should we optimize the `ORDER BY random() LIMIT x` case?

Should we optimize the `ORDER BY random() LIMIT x` case?

From
Aleksander Alekseev
Date:
Hi,

If I didn't miss anything, currently we don't seem to support sampling
the result of an arbitrary SELECT query efficiently.

To give one specific example:

```
CREATE TABLE temperature(
  ts TIMESTAMP NOT NULL,
  city TEXT NOT NULL,
  temperature INT NOT NULL);

CREATE TABLE humidity(
  ts TIMESTAMP NOT NULL,
  city TEXT NOT NULL,
  humidity INT NOT NULL);

-- imagine having much more data ...
INSERT INTO temperature (ts, city, temperature)
SELECT ts + (INTERVAL '60 minutes' * random()), city, 30*random()
FROM generate_series('2022-01-01' :: TIMESTAMP,
                     '2022-01-31', '1 day') AS ts,
     unnest(array['City A', 'City B']) AS city;

INSERT INTO humidity (ts, city, humidity)
SELECT ts + (INTERVAL '60 minutes' * random()), city, 100*random()
FROM generate_series('2022-01-01' :: TIMESTAMP,
                     '2022-01-31', '1 day') AS ts,
     unnest(array['City A', 'City B']) AS city;

-- "AS OF" join:
SELECT t.ts, t.city, t.temperature, h.humidity
FROM temperature AS t
LEFT JOIN LATERAL
  ( SELECT * FROM humidity
    WHERE city = t.city AND ts <= t.ts
    ORDER BY ts DESC LIMIT 1
  ) AS h ON TRUE
WHERE t.ts < '2022-01-05';
```

One can do `SELECT (the query above) ORDER BY random() LIMIT x` but
this produces an inefficient plan. Alternatively one could create
temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
suboptimal even if we supported global temporary tables.
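
Spelled out against the schema above, the two workarounds look roughly like this (the sample sizes are arbitrary, and the temp-table variant shown is just one possible reading, sampling the driving table before the join):

```
-- Workaround 1: compute and sort the entire result just to keep 10 rows.
SELECT *
FROM (
  SELECT t.ts, t.city, t.temperature, h.humidity
  FROM temperature AS t
  LEFT JOIN LATERAL
    ( SELECT * FROM humidity
      WHERE city = t.city AND ts <= t.ts
      ORDER BY ts DESC LIMIT 1
    ) AS h ON TRUE
  WHERE t.ts < '2022-01-05'
) AS q
ORDER BY random()
LIMIT 10;

-- Workaround 2: sample the driving table into a temp table, then join.
CREATE TEMP TABLE temperature_sample AS
  SELECT * FROM temperature TABLESAMPLE BERNOULLI (20);

SELECT t.ts, t.city, t.temperature, h.humidity
FROM temperature_sample AS t
LEFT JOIN LATERAL
  ( SELECT * FROM humidity
    WHERE city = t.city AND ts <= t.ts
    ORDER BY ts DESC LIMIT 1
  ) AS h ON TRUE
WHERE t.ts < '2022-01-05';
```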

1. Do you think there might be value in addressing this issue?
2. If yes, how would you suggest addressing it from the UI point of
view - by adding a special syntax, some sort of aggregate function, or
...?

-- 
Best regards,
Aleksander Alekseev



Re: Should we optimize the `ORDER BY random() LIMIT x` case?

From
wenhui qiu
Date:
Hi Aleksander
If we can optimize this query, that would be great. Then we won't need to pull a lot of data to the application side and randomly pick the needed rows there.



Re: Should we optimize the `ORDER BY random() LIMIT x` case?

From
Aleksander Alekseev
Date:
wenhui, Andrei,

> if we can optimize the query, that would be great

Thanks for the feedback!

> I think I got your point, but just to be sure:
> Do you want to have some random sampling from an arbitrary subquery with
> the guarantee that N distinct (by tid) tuples will be produced or all
> the tuples if the underlying subquery produces less than N?

Right.

> What kind of optimisation trick may the optimiser use here to provide an
> optimal plan? As I see it, it will need to think that all the tuples
> should be returned from the subquery. The only profit is to skip sorting
> the massive sample.

Doesn't look like a generic optimization trick will help us. I was
thinking about a custom aggregate function, e.g. `SELECT sample(*, 10)
...`. However I doubt that aggregate functions are flexible enough. Or
alternatively a rewrite rule. I never dealt with those before so I
have no idea what I'm talking about :D
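
To make the idea concrete, here is a rough, entirely hypothetical sketch of such an aggregate implemented at the user level (all names are invented; rows are pushed through jsonb because a plain aggregate cannot carry a polymorphic row set in its state, which is exactly the kind of flexibility limitation I mean):

```
-- Hypothetical reservoir sampling (Algorithm R) as a user-level aggregate.
-- The state keeps the number of rows seen plus a reservoir of at most k rows.
CREATE TYPE reservoir_state AS (seen bigint, reservoir jsonb[]);

CREATE FUNCTION reservoir_step(st reservoir_state, val jsonb, k int)
RETURNS reservoir_state
LANGUAGE plpgsql AS $$
DECLARE
    r   jsonb[];
    pos int;
BEGIN
    IF st IS NULL THEN
        st := ROW(0::bigint, ARRAY[]::jsonb[])::reservoir_state;
    END IF;
    st.seen := st.seen + 1;
    r := st.reservoir;
    IF cardinality(r) < k THEN
        r := r || val;                              -- still filling the reservoir
    ELSE
        pos := 1 + floor(random() * st.seen)::int;  -- uniform position in [1, seen]
        IF pos <= k THEN
            r[pos] := val;                          -- replace with probability k/seen
        END IF;
    END IF;
    st.reservoir := r;
    RETURN st;
END;
$$;

CREATE FUNCTION reservoir_final(st reservoir_state)
RETURNS jsonb[]
LANGUAGE sql AS $$ SELECT coalesce(($1).reservoir, ARRAY[]::jsonb[]) $$;

CREATE AGGREGATE sample_rows(jsonb, int) (
    SFUNC     = reservoir_step,
    STYPE     = reservoir_state,
    FINALFUNC = reservoir_final
);

-- Usage: at most 10 rows from an arbitrary subquery, one pass, no sort:
--   SELECT unnest(sample_rows(to_jsonb(q), 10)) FROM (SELECT ...) AS q;
```

Something along these lines should return at most N rows in a single pass, but the result comes back as jsonb values rather than rows of the original type, which is the flexibility problem in a nutshell.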

> As a palliative, you may laterally join your subquery with a stored
> procedure, which will process the incoming tuple and implement the logic
> of random sampling.

Yes, that's a good observation. The query:

```
SELECT t.* FROM mytable AS t
INNER JOIN LATERAL (
    SELECT 1 WHERE random() > 0.99
) AS r ON TRUE
```

samples about 1% of the tuples efficiently. I wonder if there is also an
efficient way to sample "no more than N" tuples without knowing the
size of the entire set / stream beforehand. Implementing such an
algorithm in C is not difficult [1], but it doesn't seem so in SQL.
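
Simply capping the lateral sample above with a LIMIT does bound the result size, but without an ORDER BY it just keeps whichever qualifying rows come out first, so rows later in the scan never get a chance once N have been found:

```
-- Bounded to at most 10 rows, but biased toward rows produced early in the scan.
SELECT t.* FROM mytable AS t
INNER JOIN LATERAL (
    SELECT 1 WHERE random() > 0.99
) AS r ON TRUE
LIMIT 10;
```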

> Implementation of that in the core will need a new "skip result" node
> and new syntax, which may be too much if a workaround is found.

Agreed. I'm not too worried about the new node, but the new syntax
apparently is going to be mandatory; otherwise we would have to check
"wait, what if it's ORDER BY random() LIMIT?" for every single query. I
wonder what the community's opinion is on having such syntax.

[1] e.g. https://samwho.dev/reservoir-sampling/
-- 
Best regards,
Aleksander Alekseev



Re: Should we optimize the `ORDER BY random() LIMIT x` case?

From
Andrei Lepikhov
Date:
On 15/5/2025 11:17, Aleksander Alekseev wrote:
>> What kind of optimisation trick may the optimiser use here to provide an
>> optimal plan? As I see it, it will need to think that all the tuples
>> should be returned from the subquery. The only profit is to skip sorting
>> the massive sample.
> 
> Doesn't look like a generic optimization trick will help us. I was
> thinking about a custom aggregate function, e.g. `SELECT sample(*, 10)
> ...`. However I doubt that aggregate functions are flexible enough. Or
> alternatively a rewrite rule. I never dealt with those before so I
> have no idea what I'm talking about :D
A custom SRF seems great to me. You may propose such an aggregate in the 
core - it seems it doesn't even need any syntax changes. For example:
SELECT * FROM (SELECT sample(q, 10, <type>) FROM (SELECT ...) AS q);
or something like that.

-- 
regards, Andrei Lepikhov



Re: Should we optimize the `ORDER BY random() LIMIT x` case?

From
Tom Lane
Date:
Aleksander Alekseev <aleksander@timescale.com> writes:
> If I'm right about the limitations of aggregate functions and SRFs
> this leaves us the following options:

> 1. Changing the constraints of aggregate functions or SRFs. However I
> don't think we want to do it for such a single niche scenario.
> 2. Custom syntax and a custom node.
> 3. To give up

Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
of the tablesample methods) to allow it to work on a subquery.
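
Concretely, the goal would be for the grammar and planner to accept TABLESAMPLE applied to a derived table, something along these lines (not valid today):

```
SELECT *
FROM (
  SELECT t.ts, t.city, t.temperature, h.humidity
  FROM temperature AS t
  LEFT JOIN LATERAL
    ( SELECT * FROM humidity
      WHERE city = t.city AND ts <= t.ts
      ORDER BY ts DESC LIMIT 1
    ) AS h ON TRUE
  WHERE t.ts < '2022-01-05'
) AS q TABLESAMPLE BERNOULLI (20);
```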

            regards, tom lane



Re: Should we optimize the `ORDER BY random() LIMIT x` case?

From
Vik Fearing
Date:
On 16/05/2025 15:01, Tom Lane wrote:
> Aleksander Alekseev <aleksander@timescale.com> writes:
>> If I'm right about the limitations of aggregate functions and SRFs
>> this leaves us the following options:
>> 1. Changing the constraints of aggregate functions or SRFs. However I
>> don't think we want to do it for such a single niche scenario.
>> 2. Custom syntax and a custom node.
>> 3. To give up
> Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
> of the tablesample methods) to allow it to work on a subquery.


Isn't this a job for <fetch first clause>?


Example:

SELECT ...
FROM ... JOIN ...
FETCH SAMPLE FIRST 10 ROWS ONLY


Then the nodeLimit could do some sort of reservoir sampling.


There are several enhancements to <fetch first clause> coming down the
pipe; this could be one of them.

-- 

Vik Fearing




Re: Should we optimize the `ORDER BY random() LIMIT x` case?

From
Tom Lane
Date:
Vik Fearing <vik@postgresfriends.org> writes:
> On 16/05/2025 15:01, Tom Lane wrote:
>> Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
>> of the tablesample methods) to allow it to work on a subquery.

> Isn't this a job for <fetch first clause>?
> FETCH SAMPLE FIRST 10 ROWS ONLY

How is that an improvement on TABLESAMPLE?  Or did the committee
forget that they already have that feature?

TABLESAMPLE seems strictly better to me here because it affords
the opportunity to specify one of several methods, which seems
like it would be useful in this context.
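
For instance, the existing tsm_system_rows contrib module already provides a count-based method today, although it is block-based rather than a uniform row-level sample and only works on base tables:

```
CREATE EXTENSION tsm_system_rows;

SELECT * FROM temperature TABLESAMPLE SYSTEM_ROWS (10);
```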

            regards, tom lane



Re: Should we optimize the `ORDER BY random() LIMIT x` case?

From
Vik Fearing
Date:
On 16/05/2025 23:21, Tom Lane wrote:
> Vik Fearing <vik@postgresfriends.org> writes:
>> On 16/05/2025 15:01, Tom Lane wrote:
>>> Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
>>> of the tablesample methods) to allow it to work on a subquery.
>> Isn't this a job for <fetch first clause>?
>> FETCH SAMPLE FIRST 10 ROWS ONLY
> How is that an improvement on TABLESAMPLE?  Or did the committee
> forget that they already have that feature?
>
> TABLESAMPLE seems strictly better to me here because it affords
> the opportunity to specify one of several methods, which seems
> like it would be useful in this context.


TABLESAMPLE is hitched to a <table primary> which can be basically 
anything resembling a relation.  So it appears the standard already 
allows this and we just need to implement it.

-- 

Vik Fearing




Re: Should we optimize the `ORDER BY random() LIMIT x` case?

From
Nico Williams
Date:
On Fri, May 16, 2025 at 09:01:50AM -0400, Tom Lane wrote:
> Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
> of the tablesample methods) to allow it to work on a subquery.

The key here is that we need one bit of state between rows: the count of
rows seen so far.



Re: Should we optimize the `ORDER BY random() LIMIT x` case?

From
Nico Williams
Date:
On Fri, May 16, 2025 at 11:10:49PM +0200, Vik Fearing wrote:
> Isn't this a job for <fetch first clause>?
> 
> Example:
> 
> SELECT ...
> FROM ... JOIN ...
> FETCH SAMPLE FIRST 10 ROWS ONLY
> 
> Then the nodeLimit could do some sort of reservoir sampling.

The query might return fewer than N rows.  What reservoir sampling
requires is this bit of state: the count of input rows so far.

The only way I know of to keep such state in a SQL query is with a
RECURSIVE CTE, but unfortunately that would require unbounded CTE size,
and it would require a way to fetch the next rows one per iteration.

Nico
--