Home > mailing lists

Thread: GroupAggregate and Integer Arrays

GroupAggregate and Integer Arrays

From

David Osborne

Date:

23 October 2015, 14:29:24

Hi,

Wondering if anyone could suggest how we could improve the performance of this type of query?

The intensive part is the summing of integer arrays as far as I can see.

We're thinking there's not much we can do to improve performance apart from throw more CPU at it... would love to be proven wrong though!

Query:

explain (analyse,buffers)

select

sum(s2.array_a),sum(s2.array_b)

from mytable s1 left join mytable s2

on s1.code=s2.code and s1.buyer=s2.seller and s2.seller='XX'

where s1.buyer='XX'

group by s1.buyer,s1.code

;

Depesz Explain Link:

http://explain.depesz.com/s/m3XP

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

GroupAggregate (cost=275573.49..336223.36 rows=2547 width=524) (actual time=1059.340..22946.772 rows=22730 loops=1)

Buffers: shared hit=113596 read=1020 dirtied=15

-> Merge Left Join (cost=275573.49..278850.09 rows=113560 width=524) (actual time=1058.773..1728.186 rows=240979 loops=1)

Merge Cond: ((s1.code)::text = (s2.code)::text)

Join Filter: (s1.buyer = (s2.seller)::bpchar)

Buffers: shared hit=113596 read=1020 dirtied=15

-> Index Only Scan using mytable_buyer_idx on mytable s1 (cost=0.42..1226.06 rows=25465 width=12) (actual time=0.015..35.790 rows=22730 loops=1)

Index Cond: (buyer = 'XX'::bpchar)

Heap Fetches: 3739

Buffers: shared hit=16805 dirtied=1

-> Sort (cost=275573.07..275818.33 rows=98106 width=525) (actual time=1058.736..1141.560 rows=231662 loops=1)

Sort Key: s2.code

Sort Method: quicksort Memory: 241426kB

Buffers: shared hit=96791 read=1020 dirtied=14

-> Bitmap Heap Scan on mytable s2 (cost=12256.28..267439.07 rows=98106 width=525) (actual time=60.330..325.730 rows=231662 loops=1)

Recheck Cond: ((seller)::text = 'XX'::text)

Filter: ((seller)::bpchar = 'XX'::bpchar)

Buffers: shared hit=96791 read=1020 dirtied=14

-> Bitmap Index Scan on mytable_seller_idx (cost=0.00..12231.75 rows=254844 width=0) (actual time=40.474..40.474 rows=233244 loops=1)

Index Cond: ((seller)::text = 'XX'::text)

Buffers: shared hit=30 read=1020

Total runtime: 22968.292 ms

(22 rows)

Table size:

=> select count(*) from mytable;

count

--------

602669

(1 row)

Array types:

# select array_a,array_b from mytable limit 1;

array_a | array_b

---------------------------+---------------------------

{0,0,0,0,0,0,0,0,0,0,0,0} | {0,0,0,0,0,0,0,0,0,0,0,0}

Example schema:

# \d mytable

Table "public.mytable"

Column | Type | Modifiers

-------------------+-----------------------+------------------------

buyer | character(2) | not null

code | character varying(20) | not null

seller | character varying(50) |

array_a | integer[] |

array_b | integer[] |

Indexes:

"mytable_buyer_code_idx" UNIQUE, btree (buyer, code) CLUSTER

"mytable_buyer_idx" btree (buyer)

"mytable_code_idx" btree (code)

"mytable_seller_idx" btree (seller)

Version:

> SELECT version() ;

version

--------------------------------------------------------------------------------------------------------------

PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2), 64-bit

(1 row)

This is running on an AWS RDS instance.

Thanks for any pointers

David

Re: GroupAggregate and Integer Arrays

From

Jeff Janes

Date:

23 October 2015, 16:15:46

On Fri, Oct 23, 2015 at 7:29 AM, David Osborne <david@qcode.co.uk> wrote:

Hi,

Wondering if anyone could suggest how we could improve the performance of this type of query?
The intensive part is the summing of integer arrays as far as I can see.

Postgres does not ship with any 'sum' function which takes array arguments.

> select sum('{1,2,3,4,5,6}'::int[]);

ERROR: function sum(integer[]) does not exist

Are you using a user defined function? If so, how did you define it?

Cheers,

Jeff

Re: GroupAggregate and Integer Arrays

From

David Osborne

Date:

23 October 2015, 16:26:31

Ah yes sorry:

I think these cover it...

CREATE AGGREGATE sum (

sfunc = array_add,

basetype = INTEGER[],

stype = INTEGER[],

initcond = '{}'

);

CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$

-- Add two arrays.

select

ARRAY (

SELECT coalesce($1[i],0) + coalesce($2[i],0)

FROM (

select generate_series(least(array_lower($1, 1),array_lower($2, 1)), greatest(array_upper($1, 1),array_upper($2, 1)), 1) AS i

) sub

GROUP BY i

ORDER BY i

);

$$ LANGUAGE sql STRICT IMMUTABLE;

On 23 October 2015 at 17:15, Jeff Janes <jeff.janes@gmail.com> wrote:

On Fri, Oct 23, 2015 at 7:29 AM, David Osborne <david@qcode.co.uk> wrote:

Hi,

Wondering if anyone could suggest how we could improve the performance of this type of query?
The intensive part is the summing of integer arrays as far as I can see.

Postgres does not ship with any 'sum' function which takes array arguments.

> select sum('{1,2,3,4,5,6}'::int[]);

ERROR: function sum(integer[]) does not exist

Are you using a user defined function? If so, how did you define it?

Cheers,

Jeff

Re: GroupAggregate and Integer Arrays

From

Merlin Moncure

Date:

23 October 2015, 17:35:28

On Friday, October 23, 2015, David Osborne <david@qcode.co.uk> wrote:

Hi,

Wondering if anyone could suggest how we could improve the performance of this type of query?
The intensive part is the summing of integer arrays as far as I can see.
We're thinking there's not much we can do to improve performance apart from throw more CPU at it... would love to be proven wrong though!

Query:

explain (analyse,buffers)
select
sum(s2.array_a),sum(s2.array_b)
from mytable s1 left join mytable s2
on s1.code=s2.code and s1.buyer=s2.seller and s2.seller='XX'
where s1.buyer='XX'
group by s1.buyer,s1.code
;

Depesz Explain Link:

http://explain.depesz.com/s/m3XP

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=275573.49..336223.36 rows=2547 width=524) (actual time=1059.340..22946.772 rows=22730 loops=1)
Buffers: shared hit=113596 read=1020 dirtied=15
-> Merge Left Join (cost=275573.49..278850.09 rows=113560 width=524) (actual time=1058.773..1728.186 rows=240979 loops=1)
Merge Cond: ((s1.code)::text = (s2.code)::text)
Join Filter: (s1.buyer = (s2.seller)::bpchar)
Buffers: shared hit=113596 read=1020 dirtied=15
-> Index Only Scan using mytable_buyer_idx on mytable s1 (cost=0.42..1226.06 rows=25465 width=12) (actual time=0.015..35.790 rows=22730 loops=1)
Index Cond: (buyer = 'XX'::bpchar)
Heap Fetches: 3739
Buffers: shared hit=16805 dirtied=1
-> Sort (cost=275573.07..275818.33 rows=98106 width=525) (actual time=1058.736..1141.560 rows=231662 loops=1)
Sort Key: s2.code
Sort Method: quicksort Memory: 241426kB
Buffers: shared hit=96791 read=1020 dirtied=14
-> Bitmap Heap Scan on mytable s2 (cost=12256.28..267439.07 rows=98106 width=525) (actual time=60.330..325.730 rows=231662 loops=1)
Recheck Cond: ((seller)::text = 'XX'::text)
Filter: ((seller)::bpchar = 'XX'::bpchar)
Buffers: shared hit=96791 read=1020 dirtied=14
-> Bitmap Index Scan on mytable_seller_idx (cost=0.00..12231.75 rows=254844 width=0) (actual time=40.474..40.474 rows=233244 loops=1)
Index Cond: ((seller)::text = 'XX'::text)
Buffers: shared hit=30 read=1020
Total runtime: 22968.292 ms
(22 rows)

Table size:

=> select count(*) from mytable;
count
--------
602669
(1 row)

Array types:

# select array_a,array_b from mytable limit 1;
array_a | array_b
---------------------------+---------------------------
{0,0,0,0,0,0,0,0,0,0,0,0} | {0,0,0,0,0,0,0,0,0,0,0,0}

Example schema:

# \d mytable
Table "public.mytable"
Column | Type | Modifiers
-------------------+-----------------------+------------------------
buyer | character(2) | not null
code | character varying(20) | not null
seller | character varying(50) |
array_a | integer[] |
array_b | integer[] |
Indexes:
"mytable_buyer_code_idx" UNIQUE, btree (buyer, code) CLUSTER
"mytable_buyer_idx" btree (buyer)
"mytable_code_idx" btree (code)
"mytable_seller_idx" btree (seller)

Version:

> SELECT version() ;
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2), 64-bit
(1 row)

This is running on an AWS RDS instance.

Thanks for any pointers
--
David

What's physical memory and setting of work_mem?

merlin

Re: GroupAggregate and Integer Arrays

From

Jeff Janes

Date:

24 October 2015, 19:27:13

On Fri, Oct 23, 2015 at 9:26 AM, David Osborne <david@qcode.co.uk> wrote:
> Ah yes sorry:
>
> I think these cover it...
>
> CREATE AGGREGATE sum (
>       sfunc = array_add,
>       basetype = INTEGER[],
>       stype = INTEGER[],
>       initcond = '{}'
>    );
>
> CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$
>    -- Add two arrays.
>    select
>       ARRAY (
>          SELECT coalesce($1[i],0) + coalesce($2[i],0)
>          FROM (
>             select generate_series(least(array_lower($1, 1),array_lower($2,
> 1)), greatest(array_upper($1, 1),array_upper($2, 1)), 1) AS i
>          ) sub
>    GROUP BY i
>    ORDER BY i
>    );
> $$ LANGUAGE sql STRICT IMMUTABLE;

You are paying a lot for the convenience of using a sql language
function here.  If you want much better performance, you would
probably have to rewrite it into C.  But that would be a drag, and I
would try just throwing more CPU at it first.

Cheers,

Jeff

Re: GroupAggregate and Integer Arrays

From

Marc Mamin

Date:

24 October 2015, 20:23:50

> CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$
>    -- Add two arrays.
>    select
>       ARRAY (
>          SELECT coalesce($1[i],0) + coalesce($2[i],0)
>          FROM (
>             select generate_series(least(array_lower($1, 1),array_lower($2,
> 1)), greatest(array_upper($1, 1),array_upper($2, 1)), 1) AS i
>          ) sub
>    GROUP BY i
>    ORDER BY i
>    );
> $$ LANGUAGE sql STRICT IMMUTABLE;

it seems that both the GROUP and ORDER BY are superfluous and adding some cycles.

regards,

Marc Mamin

Re: GroupAggregate and Integer Arrays

From

David Osborne

Date:

26 October 2015, 17:45:42

Physical memory is 61GB at the moment.

work_mem is 1,249,104kB

What's physical memory and setting of work_mem?

merlin

David Osborne

Qcode Software Limited

http://www.qcode.co.uk

T: +44 (0)1463 896484

Re: GroupAggregate and Integer Arrays

From

Merlin Moncure

Date:

26 October 2015, 18:21:14

On Mon, Oct 26, 2015 at 12:45 PM, David Osborne <david@qcode.co.uk> wrote:
> Physical memory is 61GB at the moment.
>
> work_mem is 1,249,104kB

I'm not sure if this query is a candidate because of the function, but
you can try progressively cranking work_mem and running explain to see
what it'd take to get a hashaggregate plan.

merlin