Re: Faster distinct query? - Mailing list pgsql-general

From Israel Brewster
Subject Re: Faster distinct query?
Date
Msg-id 681B5BFB-31D4-4CFE-895D-B2502BA86665@alaska.edu
Whole thread Raw
In response to Re: Faster distinct query?  (hubert depesz lubaczewski <depesz@depesz.com>)
List pgsql-general
> On Sep 22, 2021, at 11:04 PM, hubert depesz lubaczewski <depesz@depesz.com> wrote:
>
> On Wed, Sep 22, 2021 at 12:05:22PM -0800, Israel Brewster wrote:
>> I was wondering if there was any way to improve the performance of this query:
>>
>> SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY station;
>>
>> The explain execution plan can be found here:
>> https://explain.depesz.com/s/mtxB#html <https://explain.depesz.com/s/mtxB#html>
>>
>> and it looks pretty straight forward. It does an index_only scan, followed by an aggregate, to produce a result that
isa list of stations along with a list of channels associated with each (there can be anywhere from 1 to 3 channels
associatedwith each station). This query takes around 5 minutes to run. 
>>
>> To work around the issue, I created a materialized view that I can update periodically, and of course I can query
saidview in no time flat. However, I’m concerned that as the dataset grows, the time it takes to refresh the view will
alsogrow (correct me if I am wrong there). 
>>
>> This is running PostgreSQL 13, and the index referenced is a two-column index on data(station, channel)
>
> It looks that there is ~ 170 stations, and ~ 800 million rows int he
> table.
>
> can you tell us how many rows has this:
>
> select distinct station, channel from data;

At the moment, about 170, but I would expect it to stabilize at around 510 or less once I am pulling in all the
channels.Getting this query (or the stored/cached results thereof, as it shouldn’t change too often) working fast
enoughto be used in the live system is simply the first step to pulling in three times as much data (that’ll be fun!)   

>
> If this is not huge, then you can make the query run much faster using
> skip scan - recursive cte.

Sounds like something to look into. Of course, if I go with a lookup table, updated by an on insert trigger, it becomes
amoot point. I’ll have to spend some time wrapping my head around the concept, and figuring out how to write it so that
Iget distinct per station rather than just a straight up distinct, but theoretically at least it makes sense. 

Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

>
> Best regards,
>
> depesz
>




pgsql-general by date:

Previous
From: Garfield Lewis
Date:
Subject: Re: Currently running queries with actual arguments?
Next
From: Israel Brewster
Date:
Subject: Re: Faster distinct query?