Thread: sum(int4)/sum(int2) improvement

sum(int4)/sum(int2) improvement

From

Atsushi Ogawa

Date:

01 September 2005, 13:32:40

When sum(int4) or sum(int2) is executed, many cycles are spent by
AllocSetReset. Because per-tuple context is used to allocate the
first data of each group.

An attached patch uses AggState->aggcontext instead of per-tuple
context to allocate the data. As a result, per-tuple context is not
used, and the cycles of AllocSetReset is reduced.

test data:
pgbench -i -s 5

SQL:
select a.bid, sum(a.abalance)
from accounts a
group by a.bid;

execution time(compile option "-O2"):
 original: 1.530s
 patched:  1.441s

profile result of original code(compile option "-g -pg"):
----------------------------------------------------------------------------
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 15.64      0.35     0.35  1500000     0.00     0.00  slot_deform_tuple
 11.67      0.62     0.27  1000002     0.00     0.00  AllocSetReset
  6.61      0.77     0.15  1999995     0.00     0.00  slot_getattr
  5.29      0.89     0.12   500002     0.00     0.00  heapgettup
  3.52      0.97     0.08   524420     0.00     0.00  hash_search

profile result of patched code(compile option "-g -pg"):
----------------------------------------------------------------------------
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 17.39      0.32     0.32  1500000     0.00     0.00  slot_deform_tuple
  6.52      0.44     0.12   500002     0.00     0.00  heapgettup
  6.25      0.56     0.12  1999995     0.00     0.00  slot_getattr
  4.35      0.64     0.08   524420     0.00     0.00  hash_search
  4.35      0.71     0.08   499995     0.00     0.00  execTuplesMatch
   (skip ...)
  0.54      1.67     0.01  1000002     0.00     0.00  AllocSetReset

regards,

--- Atsushi Ogawa

Attachment

sum.patch

Re: sum(int4)/sum(int2) improvement

From

Tom Lane

Date:

01 September 2005, 13:53:20

Atsushi Ogawa <a_ogawa@hi-ho.ne.jp> writes:
> An attached patch uses AggState->aggcontext instead of per-tuple
> context to allocate the data. As a result, per-tuple context is not
> used, and the cycles of AllocSetReset is reduced.

Why is this better than the fix already in place?

            regards, tom lane

Re: sum(int4)/sum(int2) improvement

From

Atsushi Ogawa

Date:

01 September 2005, 13:56:50

Tom Lane wrote:
> Atsushi Ogawa <a_ogawa@hi-ho.ne.jp> writes:
> > An attached patch uses AggState->aggcontext instead of per-tuple
> > context to allocate the data. As a result, per-tuple context is not
> > used, and the cycles of AllocSetReset is reduced.
>
> Why is this better than the fix already in place?

Because per-tuple context is reset many times. If per-tuple context is
never used, the following codes of AllocSetReset become effective.

    /* Nothing to do if context has never contained any data */
    if (block == NULL)
        return;

--- Atsushi Ogawa

Re: sum(int4)/sum(int2) improvement

From

Tom Lane

Date:

01 September 2005, 17:15:07

Atsushi Ogawa <a_ogawa@hi-ho.ne.jp> writes:
> Tom Lane wrote:
>> Why is this better than the fix already in place?

> Because per-tuple context is reset many times. If per-tuple context is
> never used, the following codes of AllocSetReset become effective.

>     /* Nothing to do if context has never contained any data */
>     if (block == NULL)
>         return;

Hm.  It's still pretty ugly though, since fixing only those two places
isn't very exciting.

I went back and looked at your AllocSetReset patch from last May, and
liked it better on second consideration.  The point I had missed before
is that even though we're adding an instruction to palloc, it's only
needed in the "slow" paths.  The cases that we have to be worried about,
IMHO, are:

1. Loop containing palloc(s) and pfree(s).

2. Loop containing palloc(s) and MemoryContextReset.

3. Loop containing only MemoryContextReset.

(We need not worry about cases doing only palloc since that is obviously
not sustainable over any long period.)  Case 1 is the one that was
bothering me, but in practice, once the loop is rolling the pallocs
should mostly be reusing previously-pfreed chunks.  That is the "fast"
path through AllocSetAlloc, and it doesn't need any extra overhead
because it doesn't need to clear the alloc-set-is-empty flag --- the
set is clearly already nonempty if it has free chunks.  There's enough
instructions in the other paths that one more isn't going to hurt.

In case 2 we're adding instructions to both palloc and
MemoryContextReset, for no gain :-( ... but again, these are in code
paths that aren't the fastest possible.  I don't think it'll hurt
noticeably.

I notice also that MemoryContextIsEmpty could return a better answer
if we maintained a flag like this.  This is not real important in the
current scheme of things, but might be more interesting someday.

I'll fix up the old patch and apply it.  I think that's a better answer
than kluging up the aggregate functions themselves.

            regards, tom lane

Re: sum(int4)/sum(int2) improvement

From

Atsushi Ogawa

Date:

02 September 2005, 12:15:56

Tom Lane wrote:
> I went back and looked at your AllocSetReset patch from last May, and
> liked it better on second consideration.

> I'll fix up the old patch and apply it.  I think that's a better answer
> than kluging up the aggregate functions themselves.

Thanks. I also think this fix is better. It speeds up hash join etc. besides
sum(int4) and sum(int2).

Example:
- SQL
select b.bid, count(*)
from accounts a, branches b
where a.bid = b.bid
group by b.bid;

- Query Plan
HashAggregate  (cost=23198.06..23198.08 rows=5 width=4)
  ->  Hash Join  (cost=1.06..20698.06 rows=500000 width=4)
        Hash Cond: ("outer".bid = "inner".bid)
        ->  Seq Scan on accounts a  (cost=0.00..13197.00 rows=500000
width=4)
        ->  Hash  (cost=1.05..1.05 rows=5 width=4)
              ->  Seq Scan on branches b  (cost=0.00..1.05 rows=5 width=4)

- Result
 original: 2.416s
 patched:  2.286s

regards,

--- Atsushi Ogawa

Re: sum(int4)/sum(int2) improvement

From

Bruce Momjian

Date:

16 September 2005, 01:08:39

This has been saved for the 8.2 release:

    http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Atsushi Ogawa wrote:
>
> When sum(int4) or sum(int2) is executed, many cycles are spent by
> AllocSetReset. Because per-tuple context is used to allocate the
> first data of each group.
>
> An attached patch uses AggState->aggcontext instead of per-tuple
> context to allocate the data. As a result, per-tuple context is not
> used, and the cycles of AllocSetReset is reduced.
>
> test data:
> pgbench -i -s 5
>
> SQL:
> select a.bid, sum(a.abalance)
> from accounts a
> group by a.bid;
>
> execution time(compile option "-O2"):
>  original: 1.530s
>  patched:  1.441s
>
> profile result of original code(compile option "-g -pg"):
> ----------------------------------------------------------------------------
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  15.64      0.35     0.35  1500000     0.00     0.00  slot_deform_tuple
>  11.67      0.62     0.27  1000002     0.00     0.00  AllocSetReset
>   6.61      0.77     0.15  1999995     0.00     0.00  slot_getattr
>   5.29      0.89     0.12   500002     0.00     0.00  heapgettup
>   3.52      0.97     0.08   524420     0.00     0.00  hash_search
>
> profile result of patched code(compile option "-g -pg"):
> ----------------------------------------------------------------------------
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  17.39      0.32     0.32  1500000     0.00     0.00  slot_deform_tuple
>   6.52      0.44     0.12   500002     0.00     0.00  heapgettup
>   6.25      0.56     0.12  1999995     0.00     0.00  slot_getattr
>   4.35      0.64     0.08   524420     0.00     0.00  hash_search
>   4.35      0.71     0.08   499995     0.00     0.00  execTuplesMatch
>    (skip ...)
>   0.54      1.67     0.01  1000002     0.00     0.00  AllocSetReset
>
> regards,
>
> --- Atsushi Ogawa

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: sum(int4)/sum(int2) improvement

From

Tom Lane

Date:

16 September 2005, 02:18:51

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> This has been saved for the 8.2 release:
>     http://momjian.postgresql.org/cgi-bin/pgpatches_hold

I believe that patch has been withdrawn because we did it in another
way.

            regards, tom lane

Re: sum(int4)/sum(int2) improvement

From

Bruce Momjian

Date:

16 September 2005, 04:08:44

Patch already applied in 8.1 in different way.

---------------------------------------------------------------------------

Atsushi Ogawa wrote:
>
> When sum(int4) or sum(int2) is executed, many cycles are spent by
> AllocSetReset. Because per-tuple context is used to allocate the
> first data of each group.
>
> An attached patch uses AggState->aggcontext instead of per-tuple
> context to allocate the data. As a result, per-tuple context is not
> used, and the cycles of AllocSetReset is reduced.
>
> test data:
> pgbench -i -s 5
>
> SQL:
> select a.bid, sum(a.abalance)
> from accounts a
> group by a.bid;
>
> execution time(compile option "-O2"):
>  original: 1.530s
>  patched:  1.441s
>
> profile result of original code(compile option "-g -pg"):
> ----------------------------------------------------------------------------
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  15.64      0.35     0.35  1500000     0.00     0.00  slot_deform_tuple
>  11.67      0.62     0.27  1000002     0.00     0.00  AllocSetReset
>   6.61      0.77     0.15  1999995     0.00     0.00  slot_getattr
>   5.29      0.89     0.12   500002     0.00     0.00  heapgettup
>   3.52      0.97     0.08   524420     0.00     0.00  hash_search
>
> profile result of patched code(compile option "-g -pg"):
> ----------------------------------------------------------------------------
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  17.39      0.32     0.32  1500000     0.00     0.00  slot_deform_tuple
>   6.52      0.44     0.12   500002     0.00     0.00  heapgettup
>   6.25      0.56     0.12  1999995     0.00     0.00  slot_getattr
>   4.35      0.64     0.08   524420     0.00     0.00  hash_search
>   4.35      0.71     0.08   499995     0.00     0.00  execTuplesMatch
>    (skip ...)
>   0.54      1.67     0.01  1000002     0.00     0.00  AllocSetReset
>
> regards,
>
> --- Atsushi Ogawa

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073