Re: The Future of Aggregation - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: The Future of Aggregation
Date
Msg-id 1986970674.211372.1433943568062.JavaMail.yahoo@mail.yahoo.com
Whole thread Raw
In response to Re: The Future of Aggregation  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: The Future of Aggregation  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 10 June 2015 at 02:52, Kevin Grittner <kgrittn@ymail.com> wrote:
>> David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> The idea I discussed in the link in item 5 above gets around this
>>> problem, but it's a perhaps more surprise filled implementation
>>> as it will mean "select avg(x),sum(x),count(x) from t" is
>>> actually faster than "select sum(x),count(x) from t" as the agg
>>> state for avg() will satisfy sum and count too.
>>
>> I'm skeptical that it will be noticeably faster.  It's easy to see
>> why this optimization will make a query *with all three* faster,
>> but I would not expect the process of accumulating the sum and
>> count to be about the same speed whether performed by one
>> transition function or two.  Of course I could be persuaded by a
>> benchmark showing otherwise.

Of course, after reading Tom's post and digging into what
aggregates share a transition function, I was already prepared to
eat my words above.  Since the sum() aggregate is using the
xxx_avg_accum transition function, it is clearly doing the work of
accumulating the count already, so it's clear that the above can
be a win.

> Assuming that if we reuse the avg(x) state for count(x) and
> sum(x) then it will perform almost exactly like a query
> containing just avg(x), the only additional overhead is the call
> to the final functions per group, so in the following case that's
> likely immeasurable:
>
> /* setup */ create table millionrowtable as select
> generate_series(1,1000000)::numeric as x;
> /* test 1 */ SELECT sum(x) / count(x)  from millionrowtable;
> /* test 2 */ SELECT avg(x) from millionrowtable;
>
> Test 1:
> 274.979 ms
> 272.104 ms
> 269.915 ms
>
> Test 2:
> 229.619 ms
> 220.703 ms
> 234.743 ms
>

> (About 19% slower)

Of course, with Tom's approach you would see the benefit; the two
statements should run at about the same speed.

I am a little curious what sort of machine you're running on,
because my i7 is much slower.  I ran a few other tests with your
table for perspective.

To get the raw time to just pass the tuples:

SELECT from millionrowtable where xmin = '0';
Time: 125.340 ms
Time: 124.443 ms
Time: 115.629 ms

Just the count(*) of those rows didn't boost the time much:

SELECT count(*) from millionrowtable;
Time: 132.128 ms
Time: 128.036 ms
Time: 125.400 ms

The NULL check added by specifying count(x) boosted it more:

SELECT count(x) from millionrowtable;
Time: 165.858 ms
Time: 163.872 ms
Time: 165.448 ms

A NULL check plus numeric addition gets expensive:

SELECT sum(x) from millionrowtable;
Time: 366.879 ms
Time: 364.503 ms
Time: 365.418 ms

Since sum() and avg() use the same transition function, I was
suprised to see a difference here:

SELECT avg(x) from millionrowtable;
Time: 374.339 ms
Time: 372.294 ms
Time: 366.933 ms

Here's the statement you are talking about optimizing:

SELECT sum(x), count(x)  from millionrowtable;
Time: 441.331 ms
Time: 442.501 ms
Time: 436.930 ms

To confirm that projecting the extra column compared to avg() was
not significant:

SELECT sum(x) / count(x)  from millionrowtable;
Time: 442.404 ms
Time: 436.241 ms
Time: 442.381 ms

So this can reasonably be compared to the avg(x) time above.

On my machine this optimization could be expected to shave off
about 16% of current run time.

One question that arose in my mind running this was whether might
be able to combine sum(x) with count(*) if x was NOT NULL, even
though the arguments don't match.  It might not be worth the
gymnastics of recognizing the special case, and I certainly
wouldn't recommend looking at that optimization in a first pass;
but it might be worth jotting down on a list somewhere....

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: s_lock() seems too aggressive for machines with many sockets
Next
From: Noah Misch
Date:
Subject: Re: [COMMITTERS] pgsql: Add pg_audit, an auditing extension