Home > mailing lists

Thread: Performance of "distinct with limit"

Performance of "distinct with limit"

From

Klaudie Willis

Date:

28 August 2020, 11:29:58

Hi,

Ran into this under-optimized query execution.

select distinct n from bigtable; -- Lets say this takes 2 minutes

select distinct n from bigtable limit 2 -- This takes approximately the same time

However, the latter should have the potential to be so much quicker. I checked the same query on MSSQL (with 'top 2'), and it seems to do exactly the optimization I would expect.

Is there any way to achieve a similar speedup in Postgresql?

Klaudie

Re: Performance of "distinct with limit"

From

luis.roberto@siscobra.com.br

Date:

28 August 2020, 11:59:08

Hi,

If "n" is indexed, it should run quickly. Can you share the execution plan for your query?

De: "Klaudie Willis" <Klaudie.Willis@protonmail.com>
Para: "pgsql-general" <pgsql-general@lists.postgresql.org>
Enviadas: Sexta-feira, 28 de agosto de 2020 8:29:58
Assunto: Performance of "distinct with limit"

Hi,

Ran into this under-optimized query execution.

select distinct n from bigtable; -- Lets say this takes 2 minutes

select distinct n from bigtable limit 2 -- This takes approximately the same time

However, the latter should have the potential to be so much quicker. I checked the same query on MSSQL (with 'top 2'), and it seems to do exactly the optimization I would expect.

Is there any way to achieve a similar speedup in Postgresql?

Klaudie

Re: Performance of "distinct with limit"

From

Klaudie Willis

Date:

28 August 2020, 12:34:09

No index on n, no. Index might solve it yes, but it seems to me such a trivial optimization even without. Obviously it is not.

QUERY PLAN |

----------------------------------------------------------------------------------|

Limit (cost=1911272.10..1911272.12 rows=2 width=7) |

-> HashAggregate (cost=1911272.10..1911282.45 rows=1035 width=7) |

Group Key: cfi |

-> Seq Scan on bigtable (cost=0.00..1817446.08 rows=37530408 width=7)|

Klaudie

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Friday, August 28, 2020 1:59 PM, <luis.roberto@siscobra.com.br> wrote:

Hi,

If "n" is indexed, it should run quickly. Can you share the execution plan for your query?

De: "Klaudie Willis" <Klaudie.Willis@protonmail.com>
Para: "pgsql-general" <pgsql-general@lists.postgresql.org>
Enviadas: Sexta-feira, 28 de agosto de 2020 8:29:58
Assunto: Performance of "distinct with limit"

Hi,

Ran into this under-optimized query execution.

select distinct n from bigtable; -- Lets say this takes 2 minutes
select distinct n from bigtable limit 2 -- This takes approximately the same time

However, the latter should have the potential to be so much quicker. I checked the same query on MSSQL (with 'top 2'), and it seems to do exactly the optimization I would expect.

Is there any way to achieve a similar speedup in Postgresql?

Klaudie

Re: Performance of "distinct with limit"

From

Jeff Janes

Date:

29 August 2020, 16:23:26

On Fri, Aug 28, 2020 at 8:34 AM Klaudie Willis <Klaudie.Willis@protonmail.com> wrote:

No index on n, no. Index might solve it yes, but it seems to me such a trivial optimization even without. Obviously it is not.

QUERY PLAN                                                                        |
----------------------------------------------------------------------------------|
Limit (cost=1911272.10..1911272.12 rows=2 width=7)                               |
-> HashAggregate (cost=1911272.10..1911282.45 rows=1035 width=7)              |
        Group Key: cfi                                                            |
        -> Seq Scan on bigtable (cost=0.00..1817446.08 rows=37530408 width=7)|

I think it would be nice if the LIMIT functionality could be pushed down into the HashAgg so it could stop early, I've run into this a few times. But it just isn't implemented. It wouldn't be the hardest feature to ever add to PostgreSQL, but it also wouldn't be trivial. It would require coordinated changes both to the planner and to the executor.

Also, the use of LIMIT without an ORDER BY makes the query non-deterministic, which makes it kind of a second-class citizen. There might be more enthusiasm among experienced developers for implementing this if it weren't for that. (Although there may be related deterministic cases in which a similar limited hash agg could be useful.)

In the meantime, an index on "n" would probably cause it to switch to a Unique plan which reads in index order. This plan does get to stop early.

Cheers,

Jeff

Re: Performance of "distinct with limit"

From

Klaudie Willis

Date:

29 August 2020, 16:39:00

Thanks for your insight Jeff. Interesting read!

Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Saturday, August 29, 2020 6:23 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Fri, Aug 28, 2020 at 8:34 AM Klaudie Willis <Klaudie.Willis@protonmail.com> wrote:
No index on n, no. Index might solve it yes, but it seems to me such a trivial optimization even without. Obviously it is not.

QUERY PLAN                                                                        |
----------------------------------------------------------------------------------|
Limit (cost=1911272.10..1911272.12 rows=2 width=7)                               |
-> HashAggregate (cost=1911272.10..1911282.45 rows=1035 width=7)              |
        Group Key: cfi                                                            |
        -> Seq Scan on bigtable (cost=0.00..1817446.08 rows=37530408 width=7)|

I think it would be nice if the LIMIT functionality could be pushed down into the HashAgg so it could stop early, I've run into this a few times. But it just isn't implemented. It wouldn't be the hardest feature to ever add to PostgreSQL, but it also wouldn't be trivial. It would require coordinated changes both to the planner and to the executor.

Also, the use of LIMIT without an ORDER BY makes the query non-deterministic, which makes it kind of a second-class citizen. There might be more enthusiasm among experienced developers for implementing this if it weren't for that. (Although there may be related deterministic cases in which a similar limited hash agg could be useful.)

In the meantime, an index on "n" would probably cause it to switch to a Unique plan which reads in index order. This plan does get to stop early.

Cheers,

Jeff