Re: Combining Aggregates - Mailing list pgsql-hackers

From David Rowley
Subject Re: Combining Aggregates
Date
Msg-id CAKJS1f9KbzhOi89qPWg-kEy1h_gEpHM+uGP2gRkEz_1-OWPXHg@mail.gmail.com
In response to Re: Combining Aggregates  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Combining Aggregates  (Haribabu Kommi <kommi.haribabu@gmail.com>)
Re: Combining Aggregates  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 18 January 2016 at 16:44, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 17, 2016 at 9:26 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> hmm, so wouldn't that mean that the transition function would need to (for
>> each input tuple):
>>
>> 1. Parse that StringInfo into tokens.
>> 2. Create a new aggregate state object.
>> 3. Populate the new aggregate state based on the tokenised StringInfo, this
>> would perhaps require that various *_in() functions are called on each
>> token.
>> 4. Add the new tuple to the aggregate state.
>> 5. Build a new StringInfo based on the aggregate state modified in 4.
>>
>> ?

> I don't really know what you mean by parse the StringInfo into tokens.
> The whole point of the expanded-object interface is to be able to keep
> things in an expanded internal form so that you *don't* have to
> repeatedly construct and deconstruct internal data structures.

That was a response to Haribabu's proposal, although perhaps I misunderstood it. However, I'm not sure how a PolyNumAggState could be converted to a string and back again without doing any string parsing.
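
Just so we're looking at the same thing, here's the sort of round-trip I'm imagining -- a deliberately simplified sketch using a made-up two-field state rather than the real PolyNumAggState, which carries more than this:

#include "postgres.h"
#include "lib/stringinfo.h"

/* Hypothetical cut-down state; the real PolyNumAggState has more fields. */
typedef struct ToyAggState
{
    int64       N;          /* number of values aggregated */
    int64       sumX;       /* running sum */
} ToyAggState;

/* Flatten the state into a string such as "42:12345". */
static char *
toy_state_out(ToyAggState *state)
{
    StringInfoData buf;

    initStringInfo(&buf);
    appendStringInfo(&buf, INT64_FORMAT ":" INT64_FORMAT,
                     state->N, state->sumX);
    return buf.data;
}

/* Tokenise the string back into a state -- the parsing step in question. */
static void
toy_state_in(const char *str, ToyAggState *state)
{
    if (sscanf(str, INT64_FORMAT ":" INT64_FORMAT,
               &state->N, &state->sumX) != 2)
        elog(ERROR, "malformed serialized aggregate state: \"%s\"", str);
}

Even in this toy form, each call to toy_state_in() pays for the parse; with numeric sums the per-tuple *_in()/*_out() work gets considerably worse.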

> I worked up an example of this approach using string_agg(), which I
> attach here.  This changes the transition type of string_agg() from
> internal to text.  The same code would work for bytea_string_agg(),
> which would allow removal of some other code, but this patch doesn't
> do that, because the point of this is to elucidate the approach.


Many thanks for working up that patch. I was clearly missing what you meant previously. I understand it much better now. Thank you for taking the time on that.
 
> In my tests, this seems to be slightly slower than what we're doing
> today; worst of all, it adds a handful of cycles to
> advance_transition_function() even when the aggregate is not an
> expanded object or, indeed, not even pass-by-reference.  Some of this
> might be able to be fixed by a little massaging - in particular,
> DatumIsReadWriteExpandedObject() seems like it could be partly or
> entirely inlined, and maybe there's some other way to improve the
> coding here.

It also seems that an expanded object requires a new memory context, which must be malloc()d and free()d. This adds quite an overhead in my testing. I assume we must be doing that so we can ensure that all memory is properly free()d once we're done with the expanded object.
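
For reference, the pattern I'm referring to is roughly what the existing expanded-object machinery does when it builds a new object -- a simplified sketch, not the exact code from your patch or from expandeddatum.c:

#include "postgres.h"
#include "utils/expandeddatum.h"
#include "utils/memutils.h"

/*
 * Each expanded object lives in its own memory context, so building one
 * per group means one context create (a malloc) and one context delete
 * (a free) per group.
 */
static ExpandedObjectHeader *
make_expanded_state(MemoryContext parentcontext,
                    const ExpandedObjectMethods *methods)
{
    MemoryContext objcxt;
    ExpandedObjectHeader *eoh;

    /* dedicated context for this one object */
    objcxt = AllocSetContextCreate(parentcontext,
                                   "expanded aggregate state",
                                   ALLOCSET_SMALL_MINSIZE,
                                   ALLOCSET_SMALL_INITSIZE,
                                   ALLOCSET_DEFAULT_MAXSIZE);

    eoh = (ExpandedObjectHeader *)
        MemoryContextAlloc(objcxt, sizeof(ExpandedObjectHeader));
    EOH_init_header(eoh, methods, objcxt);

    return eoh;
}

In the real code the header is embedded in a larger state struct rather than allocated on its own, but the per-object context create/delete is the part that shows up as malloc()/free() in the profile below.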

create table ab(a int, b text);
insert into ab select x,'aaaaaaaaaaaaaaaaaaaaaaaaaaa' from generate_Series(1,1000000) x(x);
set work_mem='1GB';
vacuum analyze ab;

explain analyze select a%1000000,length(string_agg(b,',')) from ab group by 1;

Patched
1521.045 ms
1515.905 ms
1530.920 ms

Master
932.457 ms
959.285 ms
991.021 ms

Although the performance of the patched version is closer to master when there are fewer groups, I felt it necessary to show the extreme case. I feel this puts a bit of a dampener on the future of this idea; I've previously had patches rejected for adding 2-5% to planning time alone, so adding over 50% to execution time seems like a dead end.

I've run profiles on this, and with the patched version malloc() and free() are both at the top of the profile, with about 30% CPU time each.


> Generally, I think finding a way to pass expanded objects through
> nodeAgg.c would be a good thing to pursue, if we can make it work.
> The immediate impetus for changing things this way would be that we
> wouldn't need to add a mechanism for serializing and deserializing
> internal functions just to pass around partial aggregates.  But
> there's another advantage, too: right now,
> advance_transition_function() does a lot of data copying to move data
> from per-call context to the per-aggregate context.  When an expanded
> object is in use, this can be skipped.

The part I quite liked about the serialize/deserialize approach is that it adds no overhead at all when the aggregation is done in a single backend process. We'd simply have the planner pass make_agg()'s serializeStates as false when we're working within a single backend. That does not appear to be possible with your proposed implementation, since it changes each transition function. It's my understanding that we normally bend over backwards with new code to avoid any performance regression, and I'm not quite sure the expanded-object idea works well in that regard, though I do agree your approach seems neater. I just don't want to waste my time writing all the code required to replace all of the INTERNAL aggregate states when I know the performance regression is going to be unacceptable.
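
To be concrete about what I mean, the single-backend escape hatch would look roughly like this; serializeStates is the flag mentioned above, and the serialfn is illustrative rather than committed code:

#include "postgres.h"
#include "fmgr.h"

/*
 * Sketch only: emitting a partial aggregate state.  When everything runs in
 * one backend the transition value is handed over as-is, so the serial
 * function costs nothing; it is only called when the state must cross a
 * process boundary.
 */
static Datum
output_partial_state(Datum transValue, bool serializeStates, FmgrInfo *serialfn)
{
    if (serializeStates)
        return FunctionCall1(serialfn, transValue);

    return transValue;          /* single-backend case: zero extra cost */
}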

I also witnessed another regression with your patch when testing on another machine: it caused the plan to change from a GroupAggregate to a HashAggregate, causing a significant slowdown.

Unpatched

# explain analyze select a%1000000,length(string_agg(b,',')) from ab group by 1;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=119510.84..144510.84 rows=1000000 width=32) (actual time=538.938..1015.278 rows=1000000 loops=1)
   Group Key: ((a % 1000000))
   ->  Sort  (cost=119510.84..122010.84 rows=1000000 width=32) (actual time=538.917..594.194 rows=1000000 loops=1)
         Sort Key: ((a % 1000000))
         Sort Method: quicksort  Memory: 102702kB
         ->  Seq Scan on ab  (cost=0.00..19853.00 rows=1000000 width=32) (actual time=0.016..138.964 rows=1000000 loops=1)
 Planning time: 0.146 ms
 Execution time: 1047.511 ms


Patched
# explain analyze select a%1000000,length(string_agg(b,',')) from ab group by 1;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=24853.00..39853.00 rows=1000000 width=32) (actual time=8072.346..144424.872 rows=1000000 loops=1)
   Group Key: (a % 1000000)
   ->  Seq Scan on ab  (cost=0.00..19853.00 rows=1000000 width=32) (actual time=0.025..481.332 rows=1000000 loops=1)
 Planning time: 0.164 ms
 Execution time: 263288.332 ms

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
