Thread: AllocSetReset improvement

AllocSetReset improvement

From
a_ogawa
Date:
In SQL that executes aggregation, AllocSetReset is called many times and
spend a lot of cycles.
This patch saves the cycles spent by AllocSetReset.

An idea of the patch is to add a flag to AllocSetContext. This flag
shows whether AllocSetReset should work.

The effect of the patch that I measured is as follows:

o Data for test was created by 'pgbench -i -s 5'.

o Test SQL:
select b.bid, sum(a.abalance), avg(a.abalance)
from accounts a, branches b
where a.bid = b.bid
group by b.bid;

o I measured time that executed the SQL ten times.
(1)Linux(CPU: Pentium III, Compiler option: -O2)
 - original: 31.310s
 - patched : 28.812s

(2)Linux(CPU: Pentium 4, Compiler option: -O2)
 - original: 8.953s
 - patched : 7.753s

(3)Solaris(CPU: Ultra SPARC III, Compiler option: -O2)
 - original: 41.8s
 - patched : 38.6s

o gprof result(Linux, Compiler option: -O2 -pg -DLINUX_PROFILE)
- original
Each sample counts as 0.01 seconds.
  %   cumulative  self            self   total
 time   seconds  seconds    calls s/call s/call  name
  9.20      3.06    3.06 38500155   0.00   0.00  AllocSetReset
  8.99      6.05    2.99 27500055   0.00   0.00  slot_deform_tuple
  7.40      8.51    2.46 44000000   0.00   0.00  slot_getattr
  4.81     10.11    1.60 27500110   0.00   0.00  ExecEvalVar
  3.64     11.32    1.21 38500143   0.00   0.00  MemoryContextReset
  3.64     12.53    1.21  6007086   0.00   0.00  LWLockRelease
  3.31     13.63    1.10  5500079   0.00   0.00  heapgettup

- patched
Each sample counts as 0.01 seconds.
  %   cumulative  self            self   total
 time   seconds  seconds    calls s/call s/call  name
  8.76      2.82    2.82 27500055   0.00   0.00  slot_deform_tuple
  7.73      5.31    2.49 44000000   0.00   0.00  slot_getattr
  4.72      6.83    1.52 27500110   0.00   0.00  ExecEvalVar
  4.32      8.22    1.39  5500011   0.00   0.00  ExecHashJoin
  4.28      9.60    1.38  6007086   0.00   0.00  LWLockRelease
  4.04     10.90    1.30 38500143   0.00   0.00  MemoryContextReset
  3.63     12.07    1.17  5500079   0.00   0.00  heapgettup
  3.04     13.05    0.98  5499989   0.00   0.00
ExecMakeFunctionResultNoSets
  2.67     13.91    0.86  5500110   0.00   0.00  ExecProject
  2.61     14.75    0.84 11000000   0.00   0.00  advance_transition_function
  2.55     15.57    0.82 38500155   0.00   0.00  AllocSetReset

regards,

---
Atsushi Ogawa

Attachment

Re: AllocSetReset improvement

From
Tom Lane
Date:
a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> In SQL that executes aggregation, AllocSetReset is called many times and
> spend a lot of cycles.
> This patch saves the cycles spent by AllocSetReset.

Hmm.  It doesn't seem like this could be a big win overall.  It's not
possible to save a whole lot of cycles inside AllocSetReset, because if
there isn't anything for it to do, it should fall through pretty quickly
anyway.  And I'm worried about adding even a small amount of overhead to
palloc/pfree --- on the vast majority of the profiles I look at, those
are more expensive than AllocSetReset.

I duplicated your test case to see where the reset calls were coming
from, and got this:

                0.00    0.00      25/17500065     agg_retrieve_hash_table [453]
                0.29    0.10 2499975/17500065     execTuplesMatch [33]
                0.29    0.10 2500000/17500065     agg_fill_hash_table <cycle 3> [22]
                0.29    0.10 2500000/17500065     ExecScanHashBucket [32]
                0.29    0.10 2500025/17500065     ExecHashGetHashValue [53]
                0.29    0.10 2500035/17500065     ExecScan [52]
                0.58    0.20 5000005/17500065     ExecHashJoin <cycle 3> [15]
[30]     5.8    2.04    0.70 17500065         MemoryContextReset [30]
                0.70    0.00 17500065/17500065     MemoryContextResetChildren [56]

(Does this match your profile?  I only ran the query 5 times not 10.)

This shows that the majority of the resets are coming from the hashjoin
code not the aggregation code.  I wonder if we could reduce the number
of reset calls in that logic.  I think I threw in a few extras in the
recent hashjoin revisions, feeling that they were cheap enough that I
didn't need to think too hard about whether any were unnecessary.
I think what you've really proven here is that that was a wrong guess
...

            regards, tom lane

Re: AllocSetReset improvement

From
a_ogawa
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> > In SQL that executes aggregation, AllocSetReset is called many times and
> > spend a lot of cycles.
> > This patch saves the cycles spent by AllocSetReset.
>
> Hmm.  It doesn't seem like this could be a big win overall.  It's not
> possible to save a whole lot of cycles inside AllocSetReset, because if
> there isn't anything for it to do, it should fall through pretty quickly
> anyway.

I thought that I was able to avoid MemSet in AllocSetReset.

    MemSet(set->freelist, 0, sizeof(set->freelist));

My profile result in previous mail is as follows:
     %   cumulative  self            self   total
    time   seconds  seconds    calls s/call s/call  name
     9.20      3.06    3.06 38500155   0.00   0.00  AllocSetReset

Therefore, sizeof(set->freelist) * (number of calls) =
                                 44 bytes * 38500155 = 1615 Mbytes.

> And I'm worried about adding even a small amount of overhead to
> palloc/pfree --- on the vast majority of the profiles I look at, those
> are more expensive than AllocSetReset.

I don't worry about palloc. Because overhead increases only when malloc
is executed in AllocSetAlloc. But I'm wooried about pfree, too. However,
when palloc/pfree was executed many times, I did not see a bad influence.
It is a result of executing 'select * from accounts' 20 times as follows.

original code:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  6.79      4.03     4.03 90000599     0.00     0.00  appendBinaryStringInfo
  6.57      7.93     3.90 50005879     0.00     0.00  AllocSetAlloc
  5.63     11.27     3.34 10000000     0.00     0.00  printtup
  5.61     14.60     3.33 10000000     0.00     0.00  slot_deform_tuple
  5.36     17.78     3.18 50001421     0.00     0.00  AllocSetFree

patched code:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  8.07      4.78     4.78 90000599     0.00     0.00  appendBinaryStringInfo
  7.23      9.06     4.28 50005879     0.00     0.00  AllocSetAlloc
  5.40     12.26     3.20 10000000     0.00     0.00  printtup
  5.20     15.34     3.08 10000000     0.00     0.00  slot_deform_tuple
  5.13     18.38     3.04 50001421     0.00     0.00  AllocSetFree

I think that it is difficult to measure the influence that this patch
gives palloc/pfree.

> I duplicated your test case to see where the reset calls were coming
> from, and got this:
>
> (Does this match your profile?  I only ran the query 5 times not 10.)

I'm sorry. My profile in previous mail were 11 times not 10. And your
profile and my profile are match.

> This shows that the majority of the resets are coming from the hashjoin
> code not the aggregation code.

You are right. I measured where MemoryContextReset had been called.
(The SQL was executed once)

  filename(line)        function             number of calls
 -------------------------------------------------------------
  execGrouping.c(65)    execTuplesMatch           499995
  execScan.c(86)        ExecScan                  500007
  nodeAgg.c(924)        agg_fill_hash_table       500000
  nodeAgg.c(979)        agg_retrieve_hash_table        5
  nodeHash.c(669)       ExecHashGetHashValue      500005
  nodeHash.c(785)       ExecScanHashBucket        500000
  nodeHashjoin.c(108)   ExecHashJoin              500001
  nodeHashjoin.c(217)   ExecHashJoin              500000
 -------------------------------------------------------------
  Total                                          3500013

Many are the one from hashjoin. Other is the one from grouping,
table/index scan, and aggregation by hash.
And I measured the number of times that was able to avoid MemSet in
AllocSetReset.

  avoided MemSet       3500008
  executed MemSet            7
 ---------------------------------------
  Total                3500015

(The execution time of AllocSetReset is more twice than MemoryContextReset
because there is MemoryContextResetAndDeleteChildren in PostgresMain)

regards,

---
Atsushi Ogawa


Re: AllocSetReset improvement

From
Tom Lane
Date:
a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Hmm.  It doesn't seem like this could be a big win overall.  It's not
>> possible to save a whole lot of cycles inside AllocSetReset, because if
>> there isn't anything for it to do, it should fall through pretty quickly
>> anyway.

> I thought that I was able to avoid MemSet in AllocSetReset.

>     MemSet(set->freelist, 0, sizeof(set->freelist));

True, but the memset is only zeroing a dozen or so pointers, so there's
not *that* much to it.

It might be worth changing it to MemSetAligned --- should be safe on any
architecture I can think of, and would save at least a few instructions.

>> And I'm worried about adding even a small amount of overhead to
>> palloc/pfree --- on the vast majority of the profiles I look at, those
>> are more expensive than AllocSetReset.

> I don't worry about palloc. Because overhead increases only when malloc
> is executed in AllocSetAlloc. But I'm wooried about pfree, too. However,
> when palloc/pfree was executed many times, I did not see a bad influence.

In most of the tests I've looked at, palloc/pfree are executed far more
often than AllocSetReset, and so adding even one instruction there to
sometimes save a little work in AllocSetReset is a bad tradeoff.  You
can't optimize to make just one test case look good.

I have another idea though: in the case you are looking at, I think
that the context in question never gets any allocations at all, which
means its blocks list stays null.  We could move the MemSet inside the
"if (blocks)" test --- if there are no blocks allocated to the context,
it surely hasn't got any chunks either, so the MemSet is unnecessary.
That should give us most of the speedup without any extra cost in
palloc/pfree.

            regards, tom lane

Re: AllocSetReset improvement

From
a_ogawa
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> And I'm worried about adding even a small amount of overhead to
> >> palloc/pfree --- on the vast majority of the profiles I look at, those
> >> are more expensive than AllocSetReset.
>
> > I don't worry about palloc. Because overhead increases only when malloc
> > is executed in AllocSetAlloc. But I'm wooried about pfree, too. However,
> > when palloc/pfree was executed many times, I did not see a bad
influence.
>
> In most of the tests I've looked at, palloc/pfree are executed far more
> often than AllocSetReset, and so adding even one instruction there to
> sometimes save a little work in AllocSetReset is a bad tradeoff.  You
> can't optimize to make just one test case look good.

I agree. I give up adding instruction to palloc/pfree.

> I have another idea though: in the case you are looking at, I think
> that the context in question never gets any allocations at all, which
> means its blocks list stays null.  We could move the MemSet inside the
> "if (blocks)" test --- if there are no blocks allocated to the context,
> it surely hasn't got any chunks either, so the MemSet is unnecessary.
> That should give us most of the speedup without any extra cost in
> palloc/pfree.

It is a reasonable idea. However, the majority part of MemSet was not
able to be avoided by this idea. Because the per-tuple contexts are used
at the early stage of executor.

 function that calls                  number    context    set->blocks
 MemoryContextReset                   of calls  address    is null
---------------------------------------------------------------------
 execTuplesMatch(execGrouping.c:65)   499995    0x836dd28  false
 agg_fill_hash_table(nodeAgg.c:924)   500000    0x836dd28  false
 ExecHashJoin(nodeHashjoin.c:108)     500001    0x836dec0  false
 ExecHashJoin(nodeHashjoin.c:217)     500000    0x836dec0  false
 ExecHashGetHashValue(nodeHash.c:669) 500005    0x836dec0  false
 ExecScanHashBucket(nodeHash.c:785)   500000    0x836dec0  false
 ExecScan(execScan.c:86)              500007    0x836df48  true

I am considering another idea: I think that we can change behavior of the
context by switching the method table of context.

An simple version of AllocSetAlloc and AllocSetReset is made. These API
can be accelerated instead of using neither a freelist nor the blocks
(The keeper excludes it). When the context is initialized and reset,
these new API is set to the method table. And, when a freelist or a new
block is needed, the method table is switched to normal API. This can be
executed by doing the pfree/repalloc hook. As a result, overhead of pfree
becomes only once about one context.

I think that this idea is effective in context that executes repeatedly
reset after small allocations such as per-tuple context.  And I think that
overhead given to the context that executes a lot of palloc/pfree is a
very few.

An attached patch is a draft of that implementation. Test and comment on
the source code are insufficient yet.

regards,

---
Atsushi Ogawa

Attachment

Re: AllocSetReset improvement

From
Tom Lane
Date:
a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> It is a reasonable idea. However, the majority part of MemSet was not
> able to be avoided by this idea. Because the per-tuple contexts are used
> at the early stage of executor.

Drat.  Well, what about changing that?  We could introduce additional
contexts or change the startup behavior so that the ones that are
frequently reset don't have any data in them unless you are working
with pass-by-ref values inside the inner loop.

> I am considering another idea: I think that we can change behavior of the
> context by switching the method table of context.

That's a possible solution but it feels a bit klugy somehow.  I can't
quite articulate what is bothering me ... will think more.

            regards, tom lane

Re: AllocSetReset improvement

From
Karel Zak
Date:
On Thu, 2005-05-12 at 11:26 -0400, Tom Lane wrote:

> I have another idea though: in the case you are looking at, I think
> that the context in question never gets any allocations at all, which
> means its blocks list stays null.  We could move the MemSet inside the
> "if (blocks)" test --- if there are no blocks allocated to the context,
> it surely hasn't got any chunks either, so the MemSet is unnecessary.

 Good point. There's same MemSet in AllocSetDelete() too.

    Karel

--
Karel Zak <zakkr@zf.jcu.cz>


Re: AllocSetReset improvement

From
a_ogawa
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> > It is a reasonable idea. However, the majority part of MemSet was not
> > able to be avoided by this idea. Because the per-tuple contexts are used
> > at the early stage of executor.
>
> Drat.  Well, what about changing that?  We could introduce additional
> contexts or change the startup behavior so that the ones that are
> frequently reset don't have any data in them unless you are working
> with pass-by-ref values inside the inner loop.

That might be possible. However, I think that we should change only
aset.c about this article.
I thought further: We can check whether context was used from the last
reset even when blocks list is not empty. Please see attached patch.

The effect of the patch that I measured is as follows:

o Execution time that executed the SQL ten times.
(1)Linux(CPU: Pentium III, Compiler option: -O2)
 - original: 24.960s
 - patched : 23.114s

(2)Linux(CPU: Pentium 4, Compiler option: -O2)
 - original: 8.730s
 - patched : 7.962s

(3)Solaris(CPU: Ultra SPARC III, Compiler option: -O2)
 - original: 37.0s
 - patched : 33.7s

regards,

---
Atsushi Ogawa

Attachment

Re: AllocSetReset improvement

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


a_ogawa wrote:
>
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> > > It is a reasonable idea. However, the majority part of MemSet was not
> > > able to be avoided by this idea. Because the per-tuple contexts are used
> > > at the early stage of executor.
> >
> > Drat.  Well, what about changing that?  We could introduce additional
> > contexts or change the startup behavior so that the ones that are
> > frequently reset don't have any data in them unless you are working
> > with pass-by-ref values inside the inner loop.
>
> That might be possible. However, I think that we should change only
> aset.c about this article.
> I thought further: We can check whether context was used from the last
> reset even when blocks list is not empty. Please see attached patch.
>
> The effect of the patch that I measured is as follows:
>
> o Execution time that executed the SQL ten times.
> (1)Linux(CPU: Pentium III, Compiler option: -O2)
>  - original: 24.960s
>  - patched : 23.114s
>
> (2)Linux(CPU: Pentium 4, Compiler option: -O2)
>  - original: 8.730s
>  - patched : 7.962s
>
> (3)Solaris(CPU: Ultra SPARC III, Compiler option: -O2)
>  - original: 37.0s
>  - patched : 33.7s
>
> regards,
>
> ---
> Atsushi Ogawa

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: AllocSetReset improvement

From
Bruce Momjian
Date:
Patch applied.  Thanks.  (The first if == NULL test was already in CVS).

---------------------------------------------------------------------------


a_ogawa wrote:
>
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> > a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> > > It is a reasonable idea. However, the majority part of MemSet was not
> > > able to be avoided by this idea. Because the per-tuple contexts are used
> > > at the early stage of executor.
> >
> > Drat.  Well, what about changing that?  We could introduce additional
> > contexts or change the startup behavior so that the ones that are
> > frequently reset don't have any data in them unless you are working
> > with pass-by-ref values inside the inner loop.
>
> That might be possible. However, I think that we should change only
> aset.c about this article.
> I thought further: We can check whether context was used from the last
> reset even when blocks list is not empty. Please see attached patch.
>
> The effect of the patch that I measured is as follows:
>
> o Execution time that executed the SQL ten times.
> (1)Linux(CPU: Pentium III, Compiler option: -O2)
>  - original: 24.960s
>  - patched : 23.114s
>
> (2)Linux(CPU: Pentium 4, Compiler option: -O2)
>  - original: 8.730s
>  - patched : 7.962s
>
> (3)Solaris(CPU: Ultra SPARC III, Compiler option: -O2)
>  - original: 37.0s
>  - patched : 33.7s
>
> regards,
>
> ---
> Atsushi Ogawa

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: AllocSetReset improvement

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Patch applied.  Thanks.  (The first if == NULL test was already in CVS).

The first if == NULL test was the only part I wanted to apply ...
I do not think this patch is a performance win in general.

            regards, tom lane

Re: AllocSetReset improvement

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Patch applied.  Thanks.  (The first if == NULL test was already in CVS).
>
> The first if == NULL test was the only part I wanted to apply ...
> I do not think this patch is a performance win in general.

OK, patch reverted.  a_ogawa, would you run tests with just the part I
reverted to see if it is a win. Thanks.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: AllocSetReset improvement

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Patch applied.  Thanks.  (The first if == NULL test was already in CVS).
>
> The first if == NULL test was the only part I wanted to apply ...
> I do not think this patch is a performance win in general.

Attached is the part I backed out of CVS.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Index: aset.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/mmgr/aset.c,v
retrieving revision 1.60
retrieving revision 1.61
diff -c -c -r1.60 -r1.61
*** aset.c    14 May 2005 20:29:13 -0000    1.60
--- aset.c    4 Jun 2005 20:14:12 -0000    1.61
***************
*** 399,404 ****
--- 399,415 ----
      if (block == NULL)
          return;

+     /*
+      * When blocks list has only "keeper" block and freeptr of the block
+      * is initial value, the context is not used from last reset.
+      */
+     if (block == set->keeper && block->next == NULL)
+     {
+         char       *datastart = ((char *) block) + ALLOC_BLOCKHDRSZ;
+         if (block->freeptr == datastart)
+             return;
+     }
+
      /* Clear chunk freelists */
      MemSetAligned(set->freelist, 0, sizeof(set->freelist));