Re: Faster distinct query? - Mailing list pgsql-general
From | Ryan Booz |
---|---|
Subject | Re: Faster distinct query? |
Date | |
Msg-id | CADyMnEzKD5t0-bOM4n7tor3CWka_oXN-BkrW=FxyqP8xrJ6+8w@mail.gmail.com Whole thread Raw |
In response to | Re: Faster distinct query? ("David G. Johnston" <david.g.johnston@gmail.com>) |
Responses |
Re: Faster distinct query?
|
List | pgsql-general |
[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted as trying to "sell" something), but feels like an opportunity to talk about DISTINCT queries and opportunities. Because you have that index, Timescale 2.3 added a "Skip Scan" query planner node that works on regular BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at all). In this case, your distinct query would likely run in a few milliseconds based on the counts you mention (170 stations, 3 channels per station), and then the outer aggregation would do the GROUP BY. So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.
I realize this might not be the most accepted answer (could be interpreted as trying to "sell" something), but feels like an opportunity to talk about DISTINCT queries and opportunities. Because you have that index, Timescale 2.3 added a "Skip Scan" query planner node that works on regular BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at all). In this case, your distinct query would likely run in a few milliseconds based on the counts you mention (170 stations, 3 channels per station), and then the outer aggregation would do the GROUP BY. So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.
- A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for now) and what we did to overcome it: https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
- Plans for a similar feature in PostgreSQL proper that we'd totally support but hasn't made forward progress yet: https://commitfest.postgresql.org/19/1741/
Anyway, it might be worth a shot. HTH
Ryan B
On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables,I believe we consider it acceptable to link to an explain viewer, which is what the OP did. Reading explain output in email has its own challenges, and I'd rather have the website than a text attachment.
How does the below work? It should do a very simple index scan only, then aggregate the relative few rows after the fact.
select station, array_agg(distinct(channel)) as channelsFROM(SELECT station,channel FROM data GROUP BY station,channel) AS subgroup by station;Yeah, am pondering this too, though seems like the queries should be identical so the plan/execution should be the same either way.If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choicesThere is no where clause so I'm doubtful there is much to be gained going down this path. The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that. The aggregation path might vary though it seems like that shouldn't be the case here.David J.
pgsql-general by date: