Significant performance issues with array_agg() + HashAggregate plans on Postgres 17 - Mailing list pgsql-performance

From Scott Carey
Subject Significant performance issues with array_agg() + HashAggregate plans on Postgres 17
Msg-id CA+vubOGaanYaGGb98etJPeJCbN=RAawx652MdmJVRK8+BeEP0w@mail.gmail.com
List pgsql-performance
Problem Summary:

A simple aggregate using array_agg() runs significantly faster the smaller the effective hash memory limit (work_mem * hash_mem_multiplier), under the same simple query plan: a HashAggregate over a sequential scan.  A scalar aggregate such as max() does not show this behavior and is always fast.  Switching to another aggregate whose state grows with each element, such as json_agg or string_agg, also does not show it.  If I add an ORDER BY clause inside array_agg, performance improves significantly because the plan changes from a HashAggregate over a sequential scan to a GroupAggregate over a sort over a sequential scan.  Something seems specifically broken with array_agg + HashAggregate.

These queries are anywhere from 10x to 1000x slower on Postgres 17.9 than they were on Postgres 12.19 on production data; some of our OLTP queries have gone from minutes to 6 hours.  I do not know if this happens on Postgres 18, but I can confirm it also happens on Postgres 16.8.  I do not know about 13 through 15.

Below is a simplified reproduction with a test table:



show server_version;
server_version  
----------------
17.9

create table array_agg_test(product_id bigint not null, region_id bigint not null, available boolean not null);

insert into array_agg_test (product_id, region_id, available) SELECT generate_series(1, 50000) as product_id, (ARRAY[1,2,3,4])[floor(random()*4)+1] as region_id, true as available;   
insert into array_agg_test (product_id, region_id, available) SELECT generate_series(1, 50000) as product_id, (ARRAY[11,12,13,14])[floor(random()*4)+1] as region_id, true as available;   
insert into array_agg_test (product_id, region_id, available) SELECT generate_series(1, 50000) as product_id, (ARRAY[111,112,113,114])[floor(random()*4)+1] as region_id, true as available;   
insert into array_agg_test (product_id, region_id, available) SELECT generate_series(1, 50000) as product_id, (ARRAY[1111,1112,1113,1114])[floor(random()*4)+1] as region_id, true as available;   
vacuum analyze array_agg_test;

We now have a table with 200000 rows: 50000 distinct product_id values with 4 rows each, each row having a distinct region_id.  It is simple enough that default statistics are fine here; we want to trigger HashAggregate over a SeqScan anyway.
Query plans that avoid HashAggregate -- index scans, group aggregates -- don't have the issue.
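As a sanity check (my own addition, not part of the original repro), the per-group shape of the generated data can be confirmed with:

```sql
-- Expect 200000 total rows, 50000 groups, exactly 4 rows per group.
select sum(cnt)  as total_rows,
       count(*)  as distinct_products,
       min(cnt)  as min_rows_per_group,
       max(cnt)  as max_rows_per_group
from (select product_id, count(*) as cnt
      from array_agg_test
      group by product_id) s;
```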


set hash_mem_multiplier = 2;
set work_mem = '100MB';
 
explain (analyze, buffers) select product_id, array_agg(region_id) from array_agg_test group by product_id;
                                                        QUERY PLAN                                                           
-----------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=4274.00..4771.49 rows=49749 width=40) (actual time=4628.278..4643.765 rows=50000 loops=1)
  Group Key: product_id
  Batches: 1  Memory Usage: 55649kB
  Buffers: shared hit=1274
  ->  Seq Scan on array_agg_test  (cost=0.00..3274.00 rows=200000 width=16) (actual time=0.030..16.694 rows=200000 loops=1)
        Buffers: shared hit=1274
Planning Time: 0.067 ms
Execution Time: 4648.698 ms


set work_mem = '20MB';

explain (analyze, buffers) select product_id, array_agg(region_id) from array_agg_test group by product_id;
                                                        QUERY PLAN                                                           
-----------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=16086.50..18537.11 rows=49749 width=40) (actual time=2568.837..2672.140 rows=50000 loops=1)
  Group Key: product_id
  Planned Partitions: 4  Batches: 5  Memory Usage: 40954kB  Disk Usage: 3352kB
  Buffers: shared hit=1274, temp read=243 written=594
  ->  Seq Scan on array_agg_test  (cost=0.00..3274.00 rows=200000 width=16) (actual time=0.013..15.266 rows=200000 loops=1)
        Buffers: shared hit=1274
Planning Time: 0.051 ms
Execution Time: 2674.329 ms

set work_mem = '10MB';

explain (analyze, buffers) select product_id, array_agg(region_id) from array_agg_test group by product_id;
                                                        QUERY PLAN                                                           
-----------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=16086.50..18537.11 rows=49749 width=40) (actual time=635.816..888.167 rows=50000 loops=1)
  Group Key: product_id
  Planned Partitions: 8  Batches: 9  Memory Usage: 20474kB  Disk Usage: 7272kB
  Buffers: shared hit=1274, temp read=566 written=1388
  ->  Seq Scan on array_agg_test  (cost=0.00..3274.00 rows=200000 width=16) (actual time=0.018..12.689 rows=200000 loops=1)
        Buffers: shared hit=1274
Planning Time: 0.057 ms
Execution Time: 890.987 ms


set work_mem = '5MB';

explain (analyze, buffers) select product_id, array_agg(region_id) from array_agg_test group by product_id;
                                                        QUERY PLAN                                                           
-----------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=16086.50..18537.11 rows=49749 width=40) (actual time=172.948..341.847 rows=50000 loops=1)
  Group Key: product_id
  Planned Partitions: 16  Batches: 17  Memory Usage: 10234kB  Disk Usage: 7080kB
  Buffers: shared hit=1274, temp read=731 written=1553
  ->  Seq Scan on array_agg_test  (cost=0.00..3274.00 rows=200000 width=16) (actual time=0.010..11.715 rows=200000 loops=1)
        Buffers: shared hit=1274
Planning Time: 0.064 ms
Execution Time: 344.248 ms


set work_mem = '1MB';

explain (analyze, buffers) select product_id, array_agg(region_id) from array_agg_test group by product_id;
                                                        QUERY PLAN                                                           
-----------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=16086.50..18537.11 rows=49749 width=40) (actual time=56.102..144.350 rows=50000 loops=1)
  Group Key: product_id
  Planned Partitions: 64  Batches: 65  Memory Usage: 2050kB  Disk Usage: 12200kB
  Buffers: shared hit=1274, temp read=892 written=2374
  ->  Seq Scan on array_agg_test  (cost=0.00..3274.00 rows=200000 width=16) (actual time=0.017..12.346 rows=200000 loops=1)
        Buffers: shared hit=1274
Planning Time: 0.053 ms
Execution Time: 147.254 ms

Below this work_mem size, the planner chooses a GroupAggregate over a sorted scan instead.

set hash_mem_multiplier = 20;

explain (analyze, buffers) select product_id, array_agg(region_id) from array_agg_test group by product_id;
                                                        QUERY PLAN                                                           
-----------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=16086.50..18537.11 rows=49749 width=40) (actual time=654.729..890.816 rows=50000 loops=1)
  Group Key: product_id
  Planned Partitions: 8  Batches: 9  Memory Usage: 20480kB  Disk Usage: 7264kB
  Buffers: shared read=1274, temp read=561 written=1384
  ->  Seq Scan on array_agg_test  (cost=0.00..3274.00 rows=200000 width=16) (actual time=0.044..18.521 rows=200000 loops=1)
        Buffers: shared read=1274
Planning Time: 0.067 ms
Execution Time: 893.208 ms


Note that performance is an inverse function of hash_mem_multiplier * work_mem, and appears tied to the number of batches: the more batches, the faster the query runs.
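As a cross-check (my own addition, not one of the runs above), the GroupAggregate path can also be forced directly for comparison using the standard enable_hashagg planner setting, rather than shrinking work_mem:

```sql
-- Force the planner away from HashAggregate to compare timings.
set enable_hashagg = off;
explain (analyze, buffers)
select product_id, array_agg(region_id)
from array_agg_test
group by product_id;
set enable_hashagg = on;  -- restore the default
```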

Below, note json_agg does not have this problem:

set hash_mem_multiplier = 2;
set work_mem = '500MB';

explain (analyze, buffers) select product_id, json_agg(region_id) from array_agg_test group by product_id;
                                                        QUERY PLAN                                                           
-----------------------------------------------------------------------------------------------------------------------------
HashAggregate  (cost=4274.00..4895.86 rows=49749 width=40) (actual time=110.975..122.781 rows=50000 loops=1)
  Group Key: product_id
  Batches: 1  Memory Usage: 67089kB
  Buffers: shared read=1274
  ->  Seq Scan on array_agg_test  (cost=0.00..3274.00 rows=200000 width=16) (actual time=0.034..17.659 rows=200000 loops=1)
        Buffers: shared read=1274
Planning:
  Buffers: shared hit=10
Planning Time: 0.054 ms
Execution Time: 124.929 ms


Adding a semantically useless ORDER BY clause inside array_agg triggers a GroupAggregate, which is also fine:

explain (analyze, buffers) select product_id, array_agg(region_id order by available) from array_agg_test group by product_id;
                                                           QUERY PLAN                                                              
-----------------------------------------------------------------------------------------------------------------------------------
GroupAggregate  (cost=20883.64..22881.13 rows=49749 width=40) (actual time=40.013..73.385 rows=50000 loops=1)
  Group Key: product_id
  Buffers: shared hit=5 read=1274
  ->  Sort  (cost=20883.64..21383.64 rows=200000 width=17) (actual time=40.000..46.090 rows=200000 loops=1)
        Sort Key: product_id, available
        Sort Method: quicksort  Memory: 13957kB
        Buffers: shared hit=5 read=1274
        ->  Seq Scan on array_agg_test  (cost=0.00..3274.00 rows=200000 width=17) (actual time=0.033..17.911 rows=200000 loops=1)
              Buffers: shared read=1274
Planning:
  Buffers: shared hit=7
Planning Time: 0.063 ms
Execution Time: 74.823 ms

The "missing time" in the slow plans is between the end of the sequential scan, which takes < 20ms, and the start of the aggregate node, which in the worst-case HashAggregate example above is several seconds later.


I am fairly stuck here.  I am looking at modifying client code to use json_agg instead of array_agg where possible as a workaround, but ideally that would not be needed; array_agg should not behave so differently.
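For reference, the two workarounds discussed above can be sketched as follows. The conversion back from JSON to bigint[] in the second query is my assumption about what client code might need, not something from the original queries:

```sql
-- Workaround 1: a no-op ORDER BY inside array_agg forces a GroupAggregate plan.
select product_id, array_agg(region_id order by available)
from array_agg_test
group by product_id;

-- Workaround 2: aggregate with json_agg, then convert back to bigint[] if needed.
select product_id,
       array(select e::bigint
             from json_array_elements_text(agg) as e) as region_ids
from (select product_id, json_agg(region_id) as agg
      from array_agg_test
      group by product_id) s;
```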


A secondary observation, related but not the issue at hand:
The row size estimate for the aggregate output is always `width=40` here, no matter how large the resulting arrays are expected to be.  In extreme cases this can lead to hash memory consumption far larger than predicted.  On Postgres 12, several years ago, I saw a query with work_mem = 1000MB use 290GB and crash the server: it was running a complex json_agg across a large number of values per bucket, and the planner did not expect to store JSON data that large per output row of the aggregate.  Disk-backed aggregation now prevents the crash, but it would help the planner if array_agg (and other accumulating aggregates like json_agg) could provide an output size estimate that is a function of the expected number of elements aggregated per group.
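One quick way to see the mismatch (my own illustration, not from the report above) is to compare the planner's width estimate with the actual on-disk size of one aggregated value:

```sql
-- The planner reports the aggregate output row as width=40 regardless of array size.
explain
select product_id, array_agg(region_id)
from array_agg_test
group by product_id;

-- Actual size in bytes of one group's aggregated array.
select pg_column_size(array_agg(region_id)) as actual_bytes
from array_agg_test
where product_id = 1
group by product_id;
```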


Thanks in advance for any help here!

Scott Carey
