Home > mailing lists

Re: Faster distinct query? - Mailing list pgsql-general

From	David Rowley
Subject	Re: Faster distinct query?
Date	September 23, 2021 04:10:38
Msg-id	CAApHDvoT_UG8AQgyvJhBOwRQjWfd_SCyAkqmNZmp+3K+QfpnFA@mail.gmail.com Whole thread Raw
In response to	Re: Faster distinct query? ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses	Re: Faster distinct query? (Israel Brewster <ijbrewster@alaska.edu>) Re: Faster distinct query? ("David G. Johnston" <david.g.johnston@gmail.com>)
List	pgsql-general

Tree view

On Thu, 23 Sept 2021 at 08:27, David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mlewis@entrata.com> wrote:
>> If there is correlation between station & channel, then you might look at creating a multivariate statistics object
andanalyzing the table so the planner can make better choices 
>
>
> There is no where clause so I'm doubtful there is much to be gained going down this path.  The Index-Only scan seems
likean optimal way to obtain this data and the existing query already does that.  The aggregation path might vary
thoughit seems like that shouldn't be the case here. 

ndistinct extended statistics would be used to estimate the number of
groups in the GROUP BY clause for the version of the query that
performs GROUP BY station, channel.   We've not seen the EXPLAIN
ANALYZE for that query, so don't know if there's any use for extended
statistics there.  However, if the planner was to think there were
more groups than there actually are, then it would be less inclined to
do parallelise the GROUP BY.  I think writing the query in such a way
that allows it to be parallelised is likely going to result in some
quite good performance improvements. i.e:

select station, array_agg(channel) as channels
FROM(
   SELECT station,channel FROM data GROUP BY station,channel
) AS sub
group by station;

which is just the same as Michael's version but without DISTINCT.
Also, Tom's point about lots of heap fetches is going to count for
quite a bit too, especially so if I/O plays a large part in the total
query time.

David

pgsql-general by date:

From: David Rowley
Date: 23 September 2021, 03:58:31
Subject: Re: Faster distinct query?

From: Israel Brewster
Date: 23 September 2021, 04:21:53
Subject: Re: Faster distinct query?

Re: Faster distinct query? - Mailing list pgsql-general

Previous

Next