Thread: GroupAggregate and Integer Arrays
Hi,
Wondering if anyone could suggest how we could improve the performance of this type of query?
The intensive part is the summing of integer arrays as far as I can see.
We're thinking there's not much we can do to improve performance apart from throw more CPU at it... would love to be proven wrong though!
Query:
explain (analyse,buffers)
select
sum(s2.array_a),sum(s2.array_b)
from mytable s1 left join mytable s2
on s1.code=s2.code and s1.buyer=s2.seller and s2.seller='XX'
where s1.buyer='XX'
group by s1.buyer,s1.code
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=275573.49..336223.36 rows=2547 width=524) (actual time=1059.340..22946.772 rows=22730 loops=1)
Buffers: shared hit=113596 read=1020 dirtied=15
-> Merge Left Join (cost=275573.49..278850.09 rows=113560 width=524) (actual time=1058.773..1728.186 rows=240979 loops=1)
Merge Cond: ((s1.code)::text = (s2.code)::text)
Join Filter: (s1.buyer = (s2.seller)::bpchar)
Buffers: shared hit=113596 read=1020 dirtied=15
-> Index Only Scan using mytable_buyer_idx on mytable s1 (cost=0.42..1226.06 rows=25465 width=12) (actual time=0.015..35.790 rows=22730 loops=1)
Index Cond: (buyer = 'XX'::bpchar)
Heap Fetches: 3739
Buffers: shared hit=16805 dirtied=1
-> Sort (cost=275573.07..275818.33 rows=98106 width=525) (actual time=1058.736..1141.560 rows=231662 loops=1)
Sort Key: s2.code
Sort Method: quicksort Memory: 241426kB
Buffers: shared hit=96791 read=1020 dirtied=14
-> Bitmap Heap Scan on mytable s2 (cost=12256.28..267439.07 rows=98106 width=525) (actual time=60.330..325.730 rows=231662 loops=1)
Recheck Cond: ((seller)::text = 'XX'::text)
Filter: ((seller)::bpchar = 'XX'::bpchar)
Buffers: shared hit=96791 read=1020 dirtied=14
-> Bitmap Index Scan on mytable_seller_idx (cost=0.00..12231.75 rows=254844 width=0) (actual time=40.474..40.474 rows=233244 loops=1)
Index Cond: ((seller)::text = 'XX'::text)
Buffers: shared hit=30 read=1020
Total runtime: 22968.292 ms
(22 rows)
Table size:
=> select count(*) from mytable;
count
--------
602669
(1 row)
Array types:
# select array_a,array_b from mytable limit 1;
array_a | array_b
---------------------------+---------------------------
{0,0,0,0,0,0,0,0,0,0,0,0} | {0,0,0,0,0,0,0,0,0,0,0,0}
Example schema:
# \d mytable
Table "public.mytable"
Column | Type | Modifiers
-------------------+-----------------------+------------------------
buyer | character(2) | not null
code | character varying(20) | not null
seller | character varying(50) |
array_a | integer[] |
array_b | integer[] |
Indexes:
"mytable_buyer_code_idx" UNIQUE, btree (buyer, code) CLUSTER
"mytable_buyer_idx" btree (buyer)
"mytable_code_idx" btree (code)
"mytable_seller_idx" btree (seller)
Version:
> SELECT version() ;
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2), 64-bit
(1 row)
This is running on an AWS RDS instance.
Thanks for any pointers
-- David
On Fri, Oct 23, 2015 at 7:29 AM, David Osborne <david@qcode.co.uk> wrote:
Hi,Wondering if anyone could suggest how we could improve the performance of this type of query?The intensive part is the summing of integer arrays as far as I can see.
Postgres does not ship with any 'sum' function which takes array arguments.
> select sum('{1,2,3,4,5,6}'::int[]);
ERROR: function sum(integer[]) does not exist
Are you using a user defined function? If so, how did you define it?
Cheers,
Jeff
Ah yes sorry:
I think these cover it...
CREATE AGGREGATE sum (
sfunc = array_add,
basetype = INTEGER[],
stype = INTEGER[],
initcond = '{}'
);
CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$
-- Add two arrays.
select
ARRAY (
SELECT coalesce($1[i],0) + coalesce($2[i],0)
FROM (
select generate_series(least(array_lower($1, 1),array_lower($2, 1)), greatest(array_upper($1, 1),array_upper($2, 1)), 1) AS i
) sub
GROUP BY i
ORDER BY i
);
$$ LANGUAGE sql STRICT IMMUTABLE;
On 23 October 2015 at 17:15, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, Oct 23, 2015 at 7:29 AM, David Osborne <david@qcode.co.uk> wrote:Hi,Wondering if anyone could suggest how we could improve the performance of this type of query?The intensive part is the summing of integer arrays as far as I can see.Postgres does not ship with any 'sum' function which takes array arguments.> select sum('{1,2,3,4,5,6}'::int[]);ERROR: function sum(integer[]) does not existAre you using a user defined function? If so, how did you define it?Cheers,Jeff
On Friday, October 23, 2015, David Osborne <david@qcode.co.uk> wrote:
Hi,Wondering if anyone could suggest how we could improve the performance of this type of query?The intensive part is the summing of integer arrays as far as I can see.We're thinking there's not much we can do to improve performance apart from throw more CPU at it... would love to be proven wrong though!Query:explain (analyse,buffers)selectsum(s2.array_a),sum(s2.array_b)from mytable s1 left join mytable s2on s1.code=s2.code and s1.buyer=s2.seller and s2.seller='XX'where s1.buyer='XX'group by s1.buyer,s1.code;QUERY PLAN------------------------------------------------------------------------------------------------------------------------------------------------------------------------GroupAggregate (cost=275573.49..336223.36 rows=2547 width=524) (actual time=1059.340..22946.772 rows=22730 loops=1)Buffers: shared hit=113596 read=1020 dirtied=15-> Merge Left Join (cost=275573.49..278850.09 rows=113560 width=524) (actual time=1058.773..1728.186 rows=240979 loops=1)Merge Cond: ((s1.code)::text = (s2.code)::text)Join Filter: (s1.buyer = (s2.seller)::bpchar)Buffers: shared hit=113596 read=1020 dirtied=15-> Index Only Scan using mytable_buyer_idx on mytable s1 (cost=0.42..1226.06 rows=25465 width=12) (actual time=0.015..35.790 rows=22730 loops=1)Index Cond: (buyer = 'XX'::bpchar)Heap Fetches: 3739Buffers: shared hit=16805 dirtied=1-> Sort (cost=275573.07..275818.33 rows=98106 width=525) (actual time=1058.736..1141.560 rows=231662 loops=1)Sort Key: s2.codeSort Method: quicksort Memory: 241426kBBuffers: shared hit=96791 read=1020 dirtied=14-> Bitmap Heap Scan on mytable s2 (cost=12256.28..267439.07 rows=98106 width=525) (actual time=60.330..325.730 rows=231662 loops=1)Recheck Cond: ((seller)::text = 'XX'::text)Filter: ((seller)::bpchar = 'XX'::bpchar)Buffers: shared hit=96791 read=1020 dirtied=14-> Bitmap Index Scan on mytable_seller_idx (cost=0.00..12231.75 rows=254844 width=0) (actual time=40.474..40.474 rows=233244 loops=1)Index Cond: ((seller)::text = 'XX'::text)Buffers: shared hit=30 read=1020Total runtime: 22968.292 ms(22 rows)Table size:=> select count(*) from mytable;count--------602669(1 row)Array types:# select array_a,array_b from mytable limit 1;array_a | array_b---------------------------+---------------------------{0,0,0,0,0,0,0,0,0,0,0,0} | {0,0,0,0,0,0,0,0,0,0,0,0}Example schema:# \d mytableTable "public.mytable"Column | Type | Modifiers-------------------+-----------------------+------------------------buyer | character(2) | not nullcode | character varying(20) | not nullseller | character varying(50) |array_a | integer[] |array_b | integer[] |Indexes:"mytable_buyer_code_idx" UNIQUE, btree (buyer, code) CLUSTER"mytable_buyer_idx" btree (buyer)"mytable_code_idx" btree (code)"mytable_seller_idx" btree (seller)Version:> SELECT version() ;version--------------------------------------------------------------------------------------------------------------PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2), 64-bit(1 row)This is running on an AWS RDS instance.Thanks for any pointers--David
What's physical memory and setting of work_mem?
merlin
On Fri, Oct 23, 2015 at 9:26 AM, David Osborne <david@qcode.co.uk> wrote: > Ah yes sorry: > > I think these cover it... > > CREATE AGGREGATE sum ( > sfunc = array_add, > basetype = INTEGER[], > stype = INTEGER[], > initcond = '{}' > ); > > CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$ > -- Add two arrays. > select > ARRAY ( > SELECT coalesce($1[i],0) + coalesce($2[i],0) > FROM ( > select generate_series(least(array_lower($1, 1),array_lower($2, > 1)), greatest(array_upper($1, 1),array_upper($2, 1)), 1) AS i > ) sub > GROUP BY i > ORDER BY i > ); > $$ LANGUAGE sql STRICT IMMUTABLE; You are paying a lot for the convenience of using a sql language function here. If you want much better performance, you would probably have to rewrite it into C. But that would be a drag, and I would try just throwing more CPU at it first. Cheers, Jeff
> CREATE OR REPLACE FUNCTION array_add(int[],int[]) RETURNS int[] AS $$ > -- Add two arrays. > select > ARRAY ( > SELECT coalesce($1[i],0) + coalesce($2[i],0) > FROM ( > select generate_series(least(array_lower($1, 1),array_lower($2, > 1)), greatest(array_upper($1, 1),array_upper($2, 1)), 1) AS i > ) sub > GROUP BY i > ORDER BY i > ); > $$ LANGUAGE sql STRICT IMMUTABLE; it seems that both the GROUP and ORDER BY are superfluous and adding some cycles. regards, Marc Mamin
Physical memory is 61GB at the moment.
--
work_mem is 1,249,104kB
What's physical memory and setting of work_mem?merlin
David Osborne
Qcode Software Limited
T: +44 (0)1463 896484
On Mon, Oct 26, 2015 at 12:45 PM, David Osborne <david@qcode.co.uk> wrote: > Physical memory is 61GB at the moment. > > work_mem is 1,249,104kB I'm not sure if this query is a candidate because of the function, but you can try progressively cranking work_mem and running explain to see what it'd take to get a hashaggregate plan. merlin