Thread: optimizing query with multiple aggregates

optimizing query with multiple aggregates

From: Doug Cole
I have a reporting query that is taking nearly all of its time in aggregate functions and I'm trying to figure out how to optimize it.  The query takes approximately 170ms when run with "select *", but when run with all the aggregate functions the query takes 18 seconds.  The slowness comes from our attempt to find distribution data using selects of the form:

SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)

repeated across many different x,y values and fields to build out several histograms of the data.  The main culprit appears to be the CASE statement, but I'm not sure what to use instead.  I'm sure other people have had similar queries and I was wondering what methods they used to build out data like this?
Thanks for your help,
Doug

Re: optimizing query with multiple aggregates

From: Merlin Moncure
On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@gmail.com> wrote:
> I have a reporting query that is taking nearly all of its time in aggregate
> functions and I'm trying to figure out how to optimize it.  The query takes
> approximately 170ms when run with "select *", but when run with all the
> aggregate functions the query takes 18 seconds.  The slowness comes from our
> attempt to find distribution data using selects of the form:
>
> SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
>
> repeated across many different x,y values and fields to build out several
> histograms of the data.  The main culprit appears to be the CASE statement,
> but I'm not sure what to use instead.  I'm sure other people have had
> similar queries and I was wondering what methods they used to build out data
> like this?

have you tried:

count(*) where field >= x AND field < y;

??

merlin

Re: optimizing query with multiple aggregates

From: Doug Cole
On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
> On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@gmail.com> wrote:
> > I have a reporting query that is taking nearly all of its time in aggregate
> > functions and I'm trying to figure out how to optimize it.  The query takes
> > approximately 170ms when run with "select *", but when run with all the
> > aggregate functions the query takes 18 seconds.  The slowness comes from our
> > attempt to find distribution data using selects of the form:
> >
> > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
> >
> > repeated across many different x,y values and fields to build out several
> > histograms of the data.  The main culprit appears to be the CASE statement,
> > but I'm not sure what to use instead.  I'm sure other people have had
> > similar queries and I was wondering what methods they used to build out data
> > like this?
>
> have you tried:
>
> count(*) where field >= x AND field < y;
>
> ??
>
> merlin

Unless I'm misunderstanding you, that would require breaking each bin
into a separate SQL statement, and since I'm trying to calculate more
than 100 bins across the different fields, any improvement in the
aggregate functions would be overwhelmed by the cost of rerunning the
actual query, which is about 170ms each time.
Thanks,
Doug

Re: optimizing query with multiple aggregates

From: David Wilson


On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@gmail.com> wrote:

repeated across many different x,y values and fields to build out several histograms of the data.  The main culprit appears to be the CASE statement, but I'm not sure what to use instead.  I'm sure other people have had similar queries and I was wondering what methods they used to build out data like this?

Use group by with an appropriate division/rounding to create the buckets, if they're all the same size.

select round(field/100) as bucket, count(*) as cnt from foo group by round(field/100);
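
If you need the same bucketing over several fields, a sketch along these
lines (hypothetical foo/field_a/field_b names) folds the histograms into
one statement, though it still pays one scan per field:

select 'field_a' as series, round(field_a/100) as bucket, count(*) as cnt
  from foo group by 2
union all
select 'field_b', round(field_b/100), count(*)
  from foo group by 2;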

--
- David T. Wilson
david.t.wilson@gmail.com

Re: optimizing query with multiple aggregates

From: Nikolas Everett
So you've got a query like:

SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END) as zeroToTen,
              SUM(CASE WHEN field >= 10 AND field < 20 THEN 1 ELSE 0 END) as tenToTwenty,
              SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0 END) as twentyToThirty,
...
FROM  bigtable


My guess is this is forcing a whole bunch of if checks and you're getting CPU bound.  Could you try something like:

SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN count ELSE 0 END) as zeroToTen,
              SUM(CASE WHEN field >= 10 AND field < 20 THEN count ELSE 0 END) as tenToTwenty,
              SUM(CASE WHEN field >= 20 AND field < 30 THEN count ELSE 0 END) as twentyToThirty,
...
FROM  (SELECT field, count(*) AS count FROM bigtable GROUP BY field) AS counts

which will allow a hash aggregate?  You'd do a hash aggregate on the whole table which should be quick and then you'd summarize your bins.

This all supposes that you don't want to just query postgres's column statistics.
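
For the statistics route, the planner's own histogram can be read straight
out of pg_stats, e.g.:

SELECT histogram_bounds
  FROM pg_stats
 WHERE tablename = 'bigtable' AND attname = 'field';

Bear in mind that's an equi-depth estimate from the last ANALYZE, with
bucket boundaries chosen by the planner rather than by you.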

On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole <dougcole@gmail.com> wrote:
On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
> On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@gmail.com> wrote:
> > I have a reporting query that is taking nearly all of its time in aggregate
> > functions and I'm trying to figure out how to optimize it.  The query takes
> > approximately 170ms when run with "select *", but when run with all the
> > aggregate functions the query takes 18 seconds.  The slowness comes from our
> > attempt to find distribution data using selects of the form:
> >
> > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
> >
> > repeated across many different x,y values and fields to build out several
> > histograms of the data.  The main culprit appears to be the CASE statement,
> > but I'm not sure what to use instead.  I'm sure other people have had
> > similar queries and I was wondering what methods they used to build out data
> > like this?
>
> have you tried:
>
> count(*) where field >= x AND field < y;
>
> ??
>
> merlin

Unless I'm misunderstanding you, that would require breaking each bin
into a separate SQL statement, and since I'm trying to calculate more
than 100 bins across the different fields, any improvement in the
aggregate functions would be overwhelmed by the cost of rerunning the
actual query, which is about 170ms each time.
Thanks,
Doug


Re: optimizing query with multiple aggregates

From: Kenneth Marshall
On Wed, Oct 21, 2009 at 03:51:25PM -0700, Doug Cole wrote:
> I have a reporting query that is taking nearly all of its time in aggregate
> functions and I'm trying to figure out how to optimize it.  The query takes
> approximately 170ms when run with "select *", but when run with all the
> aggregate functions the query takes 18 seconds.  The slowness comes from our
> attempt to find distribution data using selects of the form:
>
> SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
>
> repeated across many different x,y values and fields to build out several
> histograms of the data.  The main culprit appears to be the CASE statement,
> but I'm not sure what to use instead.  I'm sure other people have had
> similar queries and I was wondering what methods they used to build out data
> like this?
> Thanks for your help,
> Doug

Hi Doug,

Have you tried using the width_bucket() function? Here is a nice
article describing its use for making histograms:

http://quantmeditate.blogspot.com/2005/03/creating-histograms-using-sql-function.html
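
For example, a one-pass, 100-bucket histogram might look something like
this (untested sketch; assumes field runs from 0 to 1000):

SELECT width_bucket(field, 0, 1000, 100) AS bucket,
       count(*) AS cnt
  FROM bigtable
 GROUP BY bucket
 ORDER BY bucket;

Note that empty buckets simply won't show up in the output.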

Regards,
Ken

Re: optimizing query with multiple aggregates

From: Robert Haas
On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole <dougcole@gmail.com> wrote:
> On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>
>> On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole <dougcole@gmail.com> wrote:
>> > I have a reporting query that is taking nearly all of its time in aggregate
>> > functions and I'm trying to figure out how to optimize it.  The query takes
>> > approximately 170ms when run with "select *", but when run with all the
>> > aggregate functions the query takes 18 seconds.  The slowness comes from our
>> > attempt to find distribution data using selects of the form:
>> >
>> > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
>> >
>> > repeated across many different x,y values and fields to build out several
>> > histograms of the data.  The main culprit appears to be the CASE statement,
>> > but I'm not sure what to use instead.  I'm sure other people have had
>> > similar queries and I was wondering what methods they used to build out data
>> > like this?
>>
>> have you tried:
>>
>> count(*) where field >= x AND field < y;
>>
>> ??
>>
>> merlin
>
> Unless I'm misunderstanding you, that would require breaking each bin
> into a separate SQL statement, and since I'm trying to calculate more
> than 100 bins across the different fields, any improvement in the
> aggregate functions would be overwhelmed by the cost of rerunning the
> actual query, which is about 170ms each time.

Well, you might be able to use subselects to fetch all the results in
a single query, but it might still be slow.
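
A sketch, reusing the hypothetical bigtable/field names from upthread:

SELECT
  (SELECT count(*) FROM bigtable WHERE field >= 0  AND field < 10) AS bucket_1,
  (SELECT count(*) FROM bigtable WHERE field >= 10 AND field < 20) AS bucket_2,
  (SELECT count(*) FROM bigtable WHERE field >= 20 AND field < 30) AS bucket_3;

Each subselect is an independent scan, though, so with 100+ bins this
probably still loses to a single-pass GROUP BY.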

...Robert

Re: optimizing query with multiple aggregates

From: Scott Carey


On 10/21/09 3:51 PM, "Doug Cole" <dougcole@gmail.com> wrote:

> I have a reporting query that is taking nearly all of its time in aggregate
> functions and I'm trying to figure out how to optimize it.  The query takes
> approximately 170ms when run with "select *", but when run with all the
> aggregate functions the query takes 18 seconds.  The slowness comes from our
> attempt to find distribution data using selects of the form:
>
> SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
>
> repeated across many different x,y values and fields to build out several
> histograms of the data.  The main culprit appears to be the CASE statement,
> but I'm not sure what to use instead.  I'm sure other people have had similar
> queries and I was wondering what methods they used to build out data like
> this?

You might be able to do this with plain aggregates.  Define a function that
generates your partitions and that you can group by, then use aggregate
functions for the outputs.

In either case, rather than each result being a column in one result row,
each result will be its own row.

Each row would have a column that defines the type of the result (that you
grouped on), and one with the result value.  If each is just a sum, it's
easy.  If there are lots of different calculation types, it would be harder.
Potentially, you could wrap that in a subselect to pull out each into its
own column but that is a bit messy.

Also, in 8.4 window functions could be helpful.  PARTITION BY something that
represents your buckets perhaps?
http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html

This will generally force a sort, but shouldn't be that bad.

The function used for the group by or partition by could just be a big case
statement to generate a unique int per bucket, or a truncate/rounding
function.  It just needs to spit out a unique result for each bucket for the
group or partition.
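
A minimal sketch of the group-by flavor (hypothetical bucket_of function;
three buckets just for illustration):

CREATE FUNCTION bucket_of(val numeric) RETURNS int AS $$
  SELECT CASE WHEN $1 < 10 THEN 1
              WHEN $1 < 20 THEN 2
              ELSE 3 END;
$$ LANGUAGE sql IMMUTABLE;

SELECT bucket_of(field) AS bucket, count(*) AS cnt
  FROM bigtable
 GROUP BY bucket_of(field);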


> Thanks for your help,
> Doug
>


Re: optimizing query with multiple aggregates

From: Marc Mamin
Hello,

I didn't try it, but the following should be slightly faster:

COUNT( CASE WHEN field >= x AND field < y THEN true END)

instead of

SUM( CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)

(COUNT ignores NULLs, so the CASE only needs a THEN branch.)

HTH,

Marc Mamin




Re: optimizing query with multiple aggregates

From: Doug Cole
On Thu, Oct 22, 2009 at 6:22 AM, Kenneth Marshall <ktm@rice.edu> wrote:
> On Wed, Oct 21, 2009 at 03:51:25PM -0700, Doug Cole wrote:
>> I have a reporting query that is taking nearly all of its time in aggregate
>> functions and I'm trying to figure out how to optimize it.  The query takes
>> approximately 170ms when run with "select *", but when run with all the
>> aggregate functions the query takes 18 seconds.  The slowness comes from our
>> attempt to find distribution data using selects of the form:
>>
>> SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
>>
>> repeated across many different x,y values and fields to build out several
>> histograms of the data.  The main culprit appears to be the CASE statement,
>> but I'm not sure what to use instead.  I'm sure other people have had
>> similar queries and I was wondering what methods they used to build out data
>> like this?
>> Thanks for your help,
>> Doug
>
> Hi Doug,
>
> Have you tried using the width_bucket() function? Here is a nice
> article describing its use for making histograms:
>
> http://quantmeditate.blogspot.com/2005/03/creating-histograms-using-sql-function.html
>
> Regards,
> Ken
>

Thanks Ken,
  I ended up going with this approach - it meant I had to break it
into a lot more queries, one for each histogram, but even with that
added overhead I cut the time down from 18 seconds to right around 1
second.
Doug