Thread: sum(int4)/sum(int2) improvement
When sum(int4) or sum(int2) is executed, many cycles are spent by AllocSetReset. Because per-tuple context is used to allocate the first data of each group. An attached patch uses AggState->aggcontext instead of per-tuple context to allocate the data. As a result, per-tuple context is not used, and the cycles of AllocSetReset is reduced. test data: pgbench -i -s 5 SQL: select a.bid, sum(a.abalance) from accounts a group by a.bid; execution time(compile option "-O2"): original: 1.530s patched: 1.441s profile result of original code(compile option "-g -pg"): ---------------------------------------------------------------------------- % cumulative self self total time seconds seconds calls s/call s/call name 15.64 0.35 0.35 1500000 0.00 0.00 slot_deform_tuple 11.67 0.62 0.27 1000002 0.00 0.00 AllocSetReset 6.61 0.77 0.15 1999995 0.00 0.00 slot_getattr 5.29 0.89 0.12 500002 0.00 0.00 heapgettup 3.52 0.97 0.08 524420 0.00 0.00 hash_search profile result of patched code(compile option "-g -pg"): ---------------------------------------------------------------------------- % cumulative self self total time seconds seconds calls s/call s/call name 17.39 0.32 0.32 1500000 0.00 0.00 slot_deform_tuple 6.52 0.44 0.12 500002 0.00 0.00 heapgettup 6.25 0.56 0.12 1999995 0.00 0.00 slot_getattr 4.35 0.64 0.08 524420 0.00 0.00 hash_search 4.35 0.71 0.08 499995 0.00 0.00 execTuplesMatch (skip ...) 0.54 1.67 0.01 1000002 0.00 0.00 AllocSetReset regards, --- Atsushi Ogawa
Attachment
Atsushi Ogawa <a_ogawa@hi-ho.ne.jp> writes: > An attached patch uses AggState->aggcontext instead of per-tuple > context to allocate the data. As a result, per-tuple context is not > used, and the cycles of AllocSetReset is reduced. Why is this better than the fix already in place? regards, tom lane
Tom Lane wrote: > Atsushi Ogawa <a_ogawa@hi-ho.ne.jp> writes: > > An attached patch uses AggState->aggcontext instead of per-tuple > > context to allocate the data. As a result, per-tuple context is not > > used, and the cycles of AllocSetReset is reduced. > > Why is this better than the fix already in place? Because per-tuple context is reset many times. If per-tuple context is never used, the following codes of AllocSetReset become effective. /* Nothing to do if context has never contained any data */ if (block == NULL) return; --- Atsushi Ogawa
Atsushi Ogawa <a_ogawa@hi-ho.ne.jp> writes: > Tom Lane wrote: >> Why is this better than the fix already in place? > Because per-tuple context is reset many times. If per-tuple context is > never used, the following codes of AllocSetReset become effective. > /* Nothing to do if context has never contained any data */ > if (block == NULL) > return; Hm. It's still pretty ugly though, since fixing only those two places isn't very exciting. I went back and looked at your AllocSetReset patch from last May, and liked it better on second consideration. The point I had missed before is that even though we're adding an instruction to palloc, it's only needed in the "slow" paths. The cases that we have to be worried about, IMHO, are: 1. Loop containing palloc(s) and pfree(s). 2. Loop containing palloc(s) and MemoryContextReset. 3. Loop containing only MemoryContextReset. (We need not worry about cases doing only palloc since that is obviously not sustainable over any long period.) Case 1 is the one that was bothering me, but in practice, once the loop is rolling the pallocs should mostly be reusing previously-pfreed chunks. That is the "fast" path through AllocSetAlloc, and it doesn't need any extra overhead because it doesn't need to clear the alloc-set-is-empty flag --- the set is clearly already nonempty if it has free chunks. There's enough instructions in the other paths that one more isn't going to hurt. In case 2 we're adding instructions to both palloc and MemoryContextReset, for no gain :-( ... but again, these are in code paths that aren't the fastest possible. I don't think it'll hurt noticeably. I notice also that MemoryContextIsEmpty could return a better answer if we maintained a flag like this. This is not real important in the current scheme of things, but might be more interesting someday. I'll fix up the old patch and apply it. I think that's a better answer than kluging up the aggregate functions themselves. regards, tom lane
Tom Lane wrote: > I went back and looked at your AllocSetReset patch from last May, and > liked it better on second consideration. > I'll fix up the old patch and apply it. I think that's a better answer > than kluging up the aggregate functions themselves. Thanks. I also think this fix is better. It speeds up hash join etc. besides sum(int4) and sum(int2). Example: - SQL select b.bid, count(*) from accounts a, branches b where a.bid = b.bid group by b.bid; - Query Plan HashAggregate (cost=23198.06..23198.08 rows=5 width=4) -> Hash Join (cost=1.06..20698.06 rows=500000 width=4) Hash Cond: ("outer".bid = "inner".bid) -> Seq Scan on accounts a (cost=0.00..13197.00 rows=500000 width=4) -> Hash (cost=1.05..1.05 rows=5 width=4) -> Seq Scan on branches b (cost=0.00..1.05 rows=5 width=4) - Result original: 2.416s patched: 2.286s regards, --- Atsushi Ogawa
This has been saved for the 8.2 release: http://momjian.postgresql.org/cgi-bin/pgpatches_hold --------------------------------------------------------------------------- Atsushi Ogawa wrote: > > When sum(int4) or sum(int2) is executed, many cycles are spent by > AllocSetReset. Because per-tuple context is used to allocate the > first data of each group. > > An attached patch uses AggState->aggcontext instead of per-tuple > context to allocate the data. As a result, per-tuple context is not > used, and the cycles of AllocSetReset is reduced. > > test data: > pgbench -i -s 5 > > SQL: > select a.bid, sum(a.abalance) > from accounts a > group by a.bid; > > execution time(compile option "-O2"): > original: 1.530s > patched: 1.441s > > profile result of original code(compile option "-g -pg"): > ---------------------------------------------------------------------------- > % cumulative self self total > time seconds seconds calls s/call s/call name > 15.64 0.35 0.35 1500000 0.00 0.00 slot_deform_tuple > 11.67 0.62 0.27 1000002 0.00 0.00 AllocSetReset > 6.61 0.77 0.15 1999995 0.00 0.00 slot_getattr > 5.29 0.89 0.12 500002 0.00 0.00 heapgettup > 3.52 0.97 0.08 524420 0.00 0.00 hash_search > > profile result of patched code(compile option "-g -pg"): > ---------------------------------------------------------------------------- > % cumulative self self total > time seconds seconds calls s/call s/call name > 17.39 0.32 0.32 1500000 0.00 0.00 slot_deform_tuple > 6.52 0.44 0.12 500002 0.00 0.00 heapgettup > 6.25 0.56 0.12 1999995 0.00 0.00 slot_getattr > 4.35 0.64 0.08 524420 0.00 0.00 hash_search > 4.35 0.71 0.08 499995 0.00 0.00 execTuplesMatch > (skip ...) > 0.54 1.67 0.01 1000002 0.00 0.00 AllocSetReset > > regards, > > --- Atsushi Ogawa [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > This has been saved for the 8.2 release: > http://momjian.postgresql.org/cgi-bin/pgpatches_hold I believe that patch has been withdrawn because we did it in another way. regards, tom lane
Patch already applied in 8.1 in different way. --------------------------------------------------------------------------- Atsushi Ogawa wrote: > > When sum(int4) or sum(int2) is executed, many cycles are spent by > AllocSetReset. Because per-tuple context is used to allocate the > first data of each group. > > An attached patch uses AggState->aggcontext instead of per-tuple > context to allocate the data. As a result, per-tuple context is not > used, and the cycles of AllocSetReset is reduced. > > test data: > pgbench -i -s 5 > > SQL: > select a.bid, sum(a.abalance) > from accounts a > group by a.bid; > > execution time(compile option "-O2"): > original: 1.530s > patched: 1.441s > > profile result of original code(compile option "-g -pg"): > ---------------------------------------------------------------------------- > % cumulative self self total > time seconds seconds calls s/call s/call name > 15.64 0.35 0.35 1500000 0.00 0.00 slot_deform_tuple > 11.67 0.62 0.27 1000002 0.00 0.00 AllocSetReset > 6.61 0.77 0.15 1999995 0.00 0.00 slot_getattr > 5.29 0.89 0.12 500002 0.00 0.00 heapgettup > 3.52 0.97 0.08 524420 0.00 0.00 hash_search > > profile result of patched code(compile option "-g -pg"): > ---------------------------------------------------------------------------- > % cumulative self self total > time seconds seconds calls s/call s/call name > 17.39 0.32 0.32 1500000 0.00 0.00 slot_deform_tuple > 6.52 0.44 0.12 500002 0.00 0.00 heapgettup > 6.25 0.56 0.12 1999995 0.00 0.00 slot_getattr > 4.35 0.64 0.08 524420 0.00 0.00 hash_search > 4.35 0.71 0.08 499995 0.00 0.00 execTuplesMatch > (skip ...) > 0.54 1.67 0.01 1000002 0.00 0.00 AllocSetReset > > regards, > > --- Atsushi Ogawa [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073