Optimizing sum() operations

From
"Dobes Vandermeer"
Date:
I'm currently using sum() to compute historical values in reports;
basically select sum(amount) on records where date <= '...' and date
>= '...' and who = X.

Of course, I have an index on the table for who and date, but that
still leaves potentially thousands of rows to scan.

First, should I be worried about the performance of this, or will
postgres sum a few thousand rows in a few milliseconds on a decent
system anyway?

Second, if this is a concern, is there a best practice for optimizing
these kinds of queries?

Any help or advice appreciated!

Re: Optimizing sum() operations

From
"Sean Davis"
Date:
On Fri, Oct 3, 2008 at 4:51 AM, Dobes Vandermeer <dobesv@gmail.com> wrote:
> I'm currently using sum() to compute historical values in reports;
> basically select sum(amount) on records where date <= '...' and date
> >= '...' and who = X.
>
> Of course, I have an index on the table for who and date, but that
> still leaves potentially thousands of rows to scan.
>
> First, should I be worried about the performance of this, or will
> postgres sum a few thousand rows in a few milliseconds on a decent
> system anyway?
>
> Second, if this is a concern, is there a best practice for optimizing
> these kinds of queries?

You'll need to test to see what performance you get.  That said,
indexing is a good place to start.  You can always run explain and
explain analyze on the queries to double-check the planner.
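Concretely, that might look like this (a sketch; the table and column names are guessed from the original description):

```sql
-- Hypothetical schema matching the original post.
CREATE INDEX records_who_date_idx ON records (who, date);

-- EXPLAIN shows the plan; EXPLAIN ANALYZE actually runs the
-- query and reports real row counts and timings.
EXPLAIN ANALYZE
SELECT sum(amount)
FROM records
WHERE who = 42
  AND date >= '2008-01-01'
  AND date <= '2008-09-30';
```

In the output, an Index Scan (or Bitmap Index Scan) on records_who_date_idx, rather than a Seq Scan, indicates the planner is actually using the index.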

Sean

Re: Optimizing sum() operations

From
"Dobes Vandermeer"
Date:
On Fri, Oct 3, 2008 at 4:51 AM, Sean Davis <sdavis2@mail.nih.gov> wrote:
> On Fri, Oct 3, 2008 at 4:51 AM, Dobes Vandermeer <dobesv@gmail.com> wrote:
>> I'm currently using sum() to compute historical values in reports;
>> basically select sum(amount) on records where date <= '...' and date
>> >= '...' and who = X.
>>
>> Second, if this is a concern, is there a best practice for optimizing
>> these kinds of queries?
>
> You'll need to test to see what performance you get.  That said,
> indexing is a good place to start.  You can always run explain and
> explain analyze on the queries to double-check the planner.

Could I create an index that includes a sum() function - like:

create index sumidx on records (who, date, sum(amount)) ?

I'm sure that theoretically this is possible, but does postgres support it?

--

Dobes Vandermeer
Director, Habitsoft Inc.
dobesv@habitsoft.com
778-891-2922

Re: Optimizing sum() operations

From
"Sean Davis"
Date:
On Fri, Oct 3, 2008 at 3:13 PM, Dobes Vandermeer <dobesv@gmail.com> wrote:
> On Fri, Oct 3, 2008 at 4:51 AM, Sean Davis <sdavis2@mail.nih.gov> wrote:
>> On Fri, Oct 3, 2008 at 4:51 AM, Dobes Vandermeer <dobesv@gmail.com> wrote:
>>> I'm currently using sum() to compute historical values in reports;
>>> basically select sum(amount) on records where date <= '...' and date
>>> >= '...' and who = X.
>>>
>>> Second, if this is a concern, is there a best practice for optimizing
>>> these kinds of queries?
>>
>> You'll need to test to see what performance you get.  That said,
>> indexing is a good place to start.  You can always run explain and
>> explain analyze on the queries to double-check the planner.
>
> Could I create an index that includes a sum() function - like:
>
> create index sumidx on records (who, date, sum(amount)) ?
>
> I'm sure that theoretically this is possible, but does postgres support it?

I'm not sure what you want to do.  Trying to make an index on a sum()
doesn't make any sense because the sum() depends on the rows used in a
query; i.e., sum() is an aggregate and cannot be used in an index.
You CAN index functions that operate on a single row.
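For example, an expression computed from a single row can be indexed, while the aggregate form is rejected (a sketch; the `date_trunc` expression is just an illustration):

```sql
-- Allowed: an expression over a single row's columns.
CREATE INDEX records_month_idx
    ON records (who, date_trunc('month', date));

-- Not allowed: sum() is an aggregate, so PostgreSQL rejects
-- this with an error saying aggregates cannot appear in
-- index expressions.
-- CREATE INDEX sumidx ON records (who, date, sum(amount));
```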

What is wrong with an index on who and date and then doing the sum?

Sean

Re: Optimizing sum() operations

From
"Dobes Vandermeer"
Date:
On Fri, Oct 3, 2008 at 12:23 PM, Sean Davis <sdavis2@mail.nih.gov> wrote:
> On Fri, Oct 3, 2008 at 3:13 PM, Dobes Vandermeer <dobesv@gmail.com> wrote:
>> On Fri, Oct 3, 2008 at 4:51 AM, Sean Davis <sdavis2@mail.nih.gov> wrote:
>>> On Fri, Oct 3, 2008 at 4:51 AM, Dobes Vandermeer <dobesv@gmail.com> wrote:
>>>> I'm currently using sum() to compute historical values in reports;
>>>> basically select sum(amount) on records where date <= '...' and date
>>>> >= '...' and who = X.
>>>>
>>>> Second, if this is a concern, is there a best practice for optimizing
>>>> these kinds of queries?
>>>
>>> You'll need to test to see what performance you get.  That said,
>>> indexing is a good place to start.  You can always run explain and
>>> explain analyze on the queries to double-check the planner.
>>
>> Could I create an index that includes a sum() function - like:
>>
>> create index sumidx on records (who, date, sum(amount)) ?
>>
>> I'm sure that theoretically this is possible, but does postgres support it?
>
> I'm not sure what you want to do.  Trying to make an index on a sum()
> doesn't make any sense because the sum() depends on the rows used in a
> query; i.e., sum() is an aggregate and cannot be used in an index.

Well, if you think about the structure of a B-Tree, each page of the
index contains a list of references to other subtrees, and the key
value ranges for those subtrees.  In the case of sum() you could cache
the sum total of that column for an entire subtree and, if the range
of the query includes the whole subtree, you could skip descending
into the subtree and take the sum straight from that page in the
index.

This is just a general idea that occurred to me; it's probably a
pretty specialized kind of indexing that isn't supported by any RDBMS,
but it's possible there's a postgres plugin that does this.
Has anyone heard of something like that?

> What is wrong with an index on who and date and then doing the sum?

I think that if there are a lot of rows that match the query, it'll
take a long time, so I thought I'd start inquiring about whether
anyone has a good algorithm for accelerating these kinds of queries.

--

Dobes Vandermeer
Director, Habitsoft Inc.
dobesv@habitsoft.com
778-891-2922

Re: Optimizing sum() operations

From
Mark Roberts
Date:
> This is a just a general theory that occurred to me, it's probably a
> pretty specialized kind of indexing that isn't supported by any RDBMS,
> but it's possible there could be a postgres plugin which did this,
> though.  Has anyone heard of something like that?

The problem with this is that there's generally more than one metadata
field for each key set, and frequently you might want to sum or max
each value.  I think the value of having this kind of structure would
quickly be offset by larger node sizes and slower node retrieval.

> I think that if there are a lot of rows that match the query, it'll
> take a long time, so I thought I'd start inquiring about whether
> anyone has a good algorithm for accelerating these kinds of queries.

The best solution that I've found for things like this is to look to
data warehousing: if you have a frequently used aggregation of facts,
then preaggregate (summarize) it and pull from there instead.
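For instance, with a daily summary table refreshed from the detail table (a sketch; all names here are illustrative):

```sql
-- One pre-summed row per (who, day) instead of many detail rows.
CREATE TABLE records_daily_summary AS
SELECT who, date, sum(amount) AS total
FROM records
GROUP BY who, date;

CREATE INDEX records_daily_summary_idx
    ON records_daily_summary (who, date);

-- The report query then aggregates far fewer rows.
SELECT sum(total)
FROM records_daily_summary
WHERE who = 42
  AND date >= '2008-01-01'
  AND date <= '2008-09-30';
```

The summary table has to be refreshed (or incrementally maintained) whenever the detail table changes, which is the usual trade-off of this approach.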

-Mark


Re: Optimizing sum() operations

From
"Dobes Vandermeer"
Date:
On Fri, Oct 3, 2008 at 4:06 PM, Mark Roberts
<mailing_lists@pandapocket.com> wrote:
>> I think that if there are a lot of rows that match the query, it'll
>> take a long time, so I thought I'd start inquiring about whether
>> anyone has a good algorithm for accelerating these kinds of queries.
>
> The best solution that I've found for things like this is to look to
> data warehousing: if you have a frequently used aggregation of facts,
> then preaggregate (summarize) it and pull from there instead.

Do you mean creating another table and manually caching the values in
there?  I'm not sure what "data warehousing" means in this context.


--

Dobes Vandermeer
Director, Habitsoft Inc.
dobesv@habitsoft.com
778-891-2922

Re: Optimizing sum() operations

From
"Harold A. Giménez Ch."
Date:
Your best bet is probably to do some bookkeeping in a different table and keep it up to date with the aggregations you will require (i.e., cache them).  Maintain this table via triggers on inserts to the main table, or through whatever mechanism inserts the data in the first place, such as a web app or a script.

This would definitely scale, although the indexing strategy proposed earlier seems exotic ;)
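A minimal sketch of the trigger approach (table, column, and function names are all assumed; this tracks all-time totals per who, so for date-bounded sums you would keep one row per (who, date) instead):

```sql
CREATE TABLE records_totals (
    who   integer PRIMARY KEY,
    total numeric NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION records_totals_upd() RETURNS trigger AS $$
BEGIN
    -- Try to bump an existing row; insert one if the who is new.
    UPDATE records_totals SET total = total + NEW.amount
        WHERE who = NEW.who;
    IF NOT FOUND THEN
        INSERT INTO records_totals (who, total)
            VALUES (NEW.who, NEW.amount);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER records_totals_trg
    AFTER INSERT ON records
    FOR EACH ROW EXECUTE PROCEDURE records_totals_upd();
```

Deletes and updates on the main table would need corresponding triggers to keep the totals honest.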


--
Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-novice