Thread: Parallel Aggregates for string_agg and array_agg

Parallel Aggregates for string_agg and array_agg

From: Tomer Praizler
Hey!

Just wanted to check if there is any magic that can be done to make this happen.
The closest thing I ran into was this guy's patch - https://www.postgresql.org/message-id/CAKJS1f8LV7AT%3DAAhdYGKtGrGkSkEgO6C_SW2Ztz1sR3encisqw%40mail.gmail.com

I didn't try it yet, but I wanted to check: is there any way to deal with the need to aggregate on a string, perhaps by creating an array while doing a GROUP BY? Should I manipulate my data to make that possible, maybe by generating an int out of those strings? Any other ideas?

Thanks!

Re: Parallel Aggregates for string_agg and array_agg

From: David Rowley
On Thu, 14 Mar 2019 at 09:53, Tomer Praizler <tomer.praizler@gmail.com> wrote:
> Just wanted to check if there is any magic that can be done to make this happen.
> The closest thing I ran into was this guy's patch -
> https://www.postgresql.org/message-id/CAKJS1f8LV7AT%3DAAhdYGKtGrGkSkEgO6C_SW2Ztz1sR3encisqw%40mail.gmail.com

If you're able to change the SQLs and point them at some other
aggregate function, then you could create an extension with the code
from that patch and CREATE AGGREGATE your own version of those
functions using the combine and [de]serial functions.  I know that
creating your own extension is pretty out there for the novice mailing
list, but if I'd thought of another, easier way I'd have told you that
instead.
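To sketch what that might look like (a minimal sketch only; the support
function names below are illustrative, the real ones would come from the
code in that patch):

CREATE AGGREGATE my_string_agg(text, text) (
    SFUNC        = my_string_agg_transfn,     -- per-row transition function
    STYPE        = internal,
    FINALFUNC    = my_string_agg_finalfn,     -- builds the final text value
    COMBINEFUNC  = my_string_agg_combine,     -- merges two partial states
    SERIALFUNC   = my_string_agg_serialize,   -- ships state from worker to leader
    DESERIALFUNC = my_string_agg_deserialize,
    PARALLEL     = SAFE
);

Your queries would then call my_string_agg(...) wherever they currently
call string_agg(...).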

> I didn't try it, but wanted to check if there is any way to deal with the need to aggregate on a string, by creating
anarray while doing a group by? Should I manipulate my data to be able to do it, maybe by generating an int out of
thosestrings? Any other idea? 

Well, array_agg is non-parallel too, so I don't see how having an array
and converting that into a string later would help.  The other
aggregates that are parallel aware don't really let you get individual
values back out of the aggregated state, so there's not really a way
to turn that into an array or a string containing all the values that
were aggregated.

Does your use-case really need parallel versions of these aggregates?
I imagined that these would perform best when the underlying scan had
to skip lots of values, or when the aggregate had a FILTER (WHERE ...)
clause. Maybe if the filtering can be done by using an index then
performance would be up to the level you need?
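For example (the table, columns, and index here are hypothetical), if only
a small fraction of rows qualify, moving the condition into the query's
WHERE clause lets a partial index do the filtering:

CREATE INDEX ON events (customer_id) WHERE flagged;  -- partial index

SELECT customer_id, string_agg(note, '; ') AS flagged_notes
FROM events
WHERE flagged
GROUP BY customer_id;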

If you could share a simplified version of your use case perhaps
someone can suggest a way to speed it up another way.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Aggregates for string_agg and array_agg

From: Tomer Praizler
Thank you, David!
I guess I wasn't 100% clear; an example should do the trick.

So let's take the following records as an example:

timestamp            port  super_name  name  ids   count
2019-03-06 10:00:00  22    ssh         abc   1,2   10
2019-03-06 10:00:00  22    ssh         xyz   3,4   20
2019-03-06 10:00:00  22    ssh         abc   5,6   30
2019-03-06 10:00:00  22    ssh         abc   7,8   40
2019-03-06 10:00:00  22    ssh         foo   9,10  50


The primary key is a composite of several columns; in this simplified example it is (timestamp, port, super_name), and I have around 8 rows for every key.
My table is partitioned by day, and I have around 2-3M rows in each partition. I am trying to aggregate over the last 7 days, which is around 23M rows.

My query looks something like this:

SELECT x.timestamp, x.port, x.super_name,
       max(x.timestamp) AS last_seen,
       coalesce(array_length(array_merge_agg(x.ids), 1), 0) AS my_count,
       coalesce(array_length(array_agg(DISTINCT x.name), 1), 0) AS names,
       sum(x.count) AS final_count
FROM x
GROUP BY x.timestamp, x.port, x.super_name
ORDER BY sum(x.count);

This results in a plan without parallel execution because of the array_agg on a string field. When I remove it, the query planner produces a parallel execution plan.
That reduces the time from 5 minutes to around 1 minute, which is still quite a lot. (If there are any ideas on how to optimize further, please help :) By the way, hardware resources are not a problem.)

Thanks!



Re: Parallel Aggregates for string_agg and array_agg

From: David Rowley
On Thu, 14 Mar 2019 at 12:43, Tomer Praizler <tomer.praizler@gmail.com> wrote:
> My query looks something like this:
>
> SELECT x.timestamp, x.port, x.super_name,
>        max(x.timestamp) AS last_seen,
>        coalesce(array_length(array_merge_agg(x.ids), 1), 0) AS my_count,
>        coalesce(array_length(array_agg(DISTINCT x.name), 1), 0) AS names,
>        sum(x.count) AS final_count
> FROM x
> GROUP BY x.timestamp, x.port, x.super_name
> ORDER BY sum(x.count);
>
> This results in a plan without parallel execution because of the array_agg on a string field. When I remove it, the
> query planner produces a parallel execution plan.
> That reduces the time from 5 minutes to around 1 minute, which is still quite a lot. (If there are any ideas on how
> to optimize further, please help :) By the way, hardware resources are not a problem.)

I think most of that speedup will be coming from removing the aggregate
that has a DISTINCT, rather than because removing it allows the plan to
be parallelised. You could find that out by seeing how fast the query is
without the array_agg after having run SET max_parallel_workers_per_gather
TO 0;
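That test would look something like this, i.e. your query with the
array_agg(DISTINCT ...) taken out:

SET max_parallel_workers_per_gather TO 0;

EXPLAIN (ANALYZE, BUFFERS)
SELECT x.timestamp, x.port, x.super_name,
       max(x.timestamp) AS last_seen,
       coalesce(array_length(array_merge_agg(x.ids), 1), 0) AS my_count,
       sum(x.count) AS final_count
FROM x
GROUP BY x.timestamp, x.port, x.super_name
ORDER BY sum(x.count);

If that runs in about the same time as the parallel version, the DISTINCT
was the real cost, not the lack of parallelism.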

FWIW, even with the patch you mentioned earlier, this query would not
use parallel aggregates due to that DISTINCT.  All other aggregates
don't store the individual aggregated values, so ordinarily, it's not
possible for the aggregate combine phase to combine the aggregate
states from parallel workers and ignore the values that are duplicated
between workers.  array_agg does happen to store the individual
values, but the parallel aggregate infrastructure has no support in
the combine phase to combine and eliminate the duplicates. I don't
think it would be entirely impossible to add it, but, out of the
standard set of aggregate functions, it would only be string_agg and
array_agg along with some xml and json aggs that could use that, so it
might not be worth the trouble.

The reason your query will be taking so long with the
array_agg(DISTINCT ...) is that internally the executor performs a
sort of the entire results so that it can perform the duplicate
elimination by checking if the current value is the same as the
previous.  I think the best thing that can be done to speed that up is
the thing that Tom hints at in the final paragraph in [1].  This would
have the planner consider choosing a plan that provides pre-sorted
input to the aggregate node. In your case that would be x.timestamp,
x.port, x.super_name, x.name.  If there was an index supporting that
then no sorting would need to be done for the aggregation phase at
all. Anyway, that's all future stuff that does not exist today.
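(For concreteness, such an index might look like the following, though as
said, today's planner would not use it to avoid the aggregate's sort:)

CREATE INDEX ON x ("timestamp", port, super_name, name);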

For today, I think the only way you'll get an improvement is to
somehow precalculate the DISTINCT x.name.  Alternatively, you could
consider some normalisation and have x.name be an int column and put
the actual names into another table, then just look those up after
aggregation.  This would only help in the sense that sorting an int
column is faster and uses less memory than sorting a text column.
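A rough sketch of that normalisation (table and column names are just for
illustration):

CREATE TABLE name_lookup (
    name_id serial PRIMARY KEY,
    name    text NOT NULL UNIQUE
);

-- with x.name_id (int) in place of x.name (text), the DISTINCT
-- sort is done on cheap fixed-width integers:
SELECT x.timestamp, x.port, x.super_name,
       array_agg(DISTINCT x.name_id) AS name_ids
FROM x
GROUP BY x.timestamp, x.port, x.super_name;

The resulting name_ids can then be joined back against name_lookup to
recover the actual names.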

[1] https://www.postgresql.org/message-id/22068.1522103326%40sss.pgh.pa.us

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services