Re: Faster distinct query? - Mailing list pgsql-general

From Ryan Booz
Subject Re: Faster distinct query?
Date
Msg-id CADyMnEwHC_r1nGt+RH7cp63Jh6KooOfOwCNHVWKg3B=JnM2kFg@mail.gmail.com
Whole thread Raw
In response to Re: Faster distinct query?  (Israel Brewster <ijbrewster@alaska.edu>)
List pgsql-general
Sorry - break for dinner! So much happens on a global scale in a few hours.  :-)!

I took a few minutes and created a simple example here of what I imagine you have on that table. I only inserted ~80 million rows of test data, but hopefully, it's somewhat representative.

TimescaleDB's current implementation of SkipScan only allows distinct on one column, and because of where we can place the hook to read the query cost, a LATERAL JOIN (or similar) can't be used to get both columns like you want. So, to outsmart the planner, you can get your results with one of the DISTINCT queries in a function. I realize this is getting a bit specific, so it might not be an exact fit for you, but this query comes back on my 78 million rows in 67ms. YMMV

Step 1: Create a function that returns the array of channels per station

CREATE FUNCTION channel_array(TEXT) RETURNS text[]
AS $$
SELECT array_agg(channel) FROM (
select distinct on (channel) channel from stations where station=$1
) a
$$
LANGUAGE SQL;

Step 2: Use a CTE for the distinct stations, querying the function for each station

WITH s1 AS (
SELECT DISTINCT ON (station) station FROM stations
)
SELECT station, channel_array(station) channel
FROM s1;

If it's using the index, you should see something like:

Subquery Scan on s1  (cost=0.57..16.22 rows=19 width=34) (actual time=0.580..4.809 rows=19 loops=1)                                                              
  ->  Unique  (cost=0.57..11.28 rows=19 width=2) (actual time=0.043..0.654 rows=19 loops=1)                                                                      
        ->  Custom Scan (SkipScan) on stations  (cost=0.57..11.23 rows=19 width=2) (actual time=0.042..0.647 rows=19 loops=1)                                    
              ->  Index Only Scan using idx_station_channel on stations  (cost=0.57..1807691.34 rows=76000032 width=2) (actual time=0.040..0.641 rows=19 loops=1)
                    Index Cond: (station > NULL::text)                                                                                                           
                    Heap Fetches: 19                                                                                                                             

HTH,
Ryan                                                                                                    

On Wed, Sep 22, 2021 at 6:22 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
On Sep 22, 2021, at 2:05 PM, Ryan Booz <ryan@timescale.com> wrote:

Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as one of the execution nodes. I also realize I was making a few assumptions about your data, are channels shared among stations, or are all channels unique (like an ID) per station? That would impact the index and approach.

Ok, that may be a good point: “channel” is currently a varchar column, containing something like ‘BHZ’, ‘EHZ’, ‘BHE’ etc. There are only a handful of possible channels that I am currently aware of, which are shared among stations - most stations have a ‘BHZ’ channel, for example. That would be fairly simple to normalize out if that would help.


Something like:

station | channel
----------|-----------
1            1
1            2
2            3
2            4

or:
station | channel
----------|-----------
1            1
1            2
2            1
2            2




On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
On Sep 22, 2021, at 1:50 PM, Ryan Booz <ryan@timescale.com> wrote:

Cool. I'd be interested to see the explain on it if you ever try it again. On that cardinality, I'd expect it to be really fast, so I'm interested to see if the (SkipScan) nodes were actually used.

With timescaledb extension installed, the explain is what I posted in the original message (https://explain.depesz.com/s/mtxB#html). Without timescaledb installed, the explain looks the same, except it takes twice as long to run.

Unless I missed something in your message, i.e. some sort of tweak to the query to get it to use the timescaledb features?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

On Sep 22, 2021, at 12:49 PM, Ryan Booz <ryan@timescale.com> wrote:

[Timescale Dev Advocate here]
I realize this might not be the most accepted answer (could be interpreted as trying to "sell" something), but feels like an opportunity to talk about DISTINCT queries and opportunities. Because you have that index, Timescale 2.3 added a "Skip Scan" query planner node that works on regular BTree indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at all). In this case, your distinct query would likely run in a few milliseconds based on the counts you mention (170 stations, 3 channels per station), and then the outer aggregation would do the GROUP BY. So, you **could** add the TimescaleDB extension to your database (or a copy of) and give it a try. You don't actually need to use any TimescaleDB features otherwise.

I had actually already done that, as I was considering, in spite of past negative experiences with timescaledb, experimenting with it on this DB to see if it worked any better with this data. Out of curiosity, I tried removing the timescaledb extension, whereupon the query in question took roughly twice as long. So you are right that installing timescaledb speeds things up, even when not using any timescaledb specific functions. So that was a good call. Thanks!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


Anyway, it might be worth a shot. HTH

Ryan B

On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:
In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables,

I believe we consider it acceptable to link to an explain viewer, which is what the OP did.  Reading explain output in email has its own challenges, and I'd rather have the website than a text attachment.


How does the below work? It should do a very simple index scan only, then aggregate the relative few rows after the fact.

select station, array_agg(distinct(channel)) as channels
FROM(
SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

Yeah, am pondering this too, though seems like the queries should be identical so the plan/execution should be the same either way.


If there is correlation between station & channel, then you might look at creating a multivariate statistics object and analyzing the table so the planner can make better choices

There is no where clause so I'm doubtful there is much to be gained going down this path.  The Index-Only scan seems like an optimal way to obtain this data and the existing query already does that.  The aggregation path might vary though it seems like that shouldn't be the case here.

David J.



pgsql-general by date:

Previous
From: David Rowley
Date:
Subject: Re: Faster distinct query?
Next
From: Mladen Gogala
Date:
Subject: Re: Faster distinct query?