Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17 - Mailing list pgsql-performance

From Scott Carey
Subject Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17
Date
Msg-id CA+vubOEbfm5+BxtuFGb_tTj3sSX1u0Y+=zHUxoF1Q9feGpRfNg@mail.gmail.com
In response to Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17
List pgsql-performance
On Tue, Mar 31, 2026 at 12:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Scott Carey <scott.carey@algonomy.com> writes:
>> On Tue, Mar 31, 2026 at 5:03 AM David Rowley <dgrowleyml@gmail.com> wrote:
>>> I tried and failed to recreate this locally on 17.9. For me the
>>> json_agg query is slower than array_agg(). I tried making the table
>>> 10x bigger and still don't see the same issue. The one with more
>>> work_mem and fewer batches is always faster for me.

>> I don't know what other differences there could be, other than OS.  This
>> reproduces for me on Linux with the above on a RHEL 9 clone (pg 17) or with
>> Ubuntu 25.10 (pg 16) so I suspect it is not too picky about the distro used.

Like David, I can't reproduce the described behavior.  I tried on
RHEL8/x86_64 and on macOS/M4, and got runtimes that barely vary
across different work_mem settings, all sub-100ms.  It should be
noted that I tested v17 branch tip not precisely 17.9 --- but there's
nothing in the commit log to suggest that we changed v17's behavior
since February.

One thing I find interesting is that your results show significantly
more memory consumption as well as runtime.  I had to add a run with
work_mem = "200MB" to get the no-batching behavior you show at
work_mem = "100MB", and then my results look like


The memory difference is strange.
I now have six systems that I have tested this on.  One of them behaves just like yours above, with the same memory usage and reasonable performance: 5 batches and 9009kB at work_mem = "100MB".

The other five all misbehave, with ~50x worse performance when there is only one batch (work_mem = "1000MB").  Those use a little over 3x the memory for the single batch.
 
Some thoughts:

* Does it repro without the "vector" extension?  Seems unlikely that
that is related, but we're at the grasping-at-straws stage.


Although I cannot safely remove the vector extension in production, I tried adding the extension to the system that does not reproduce the problem, and that did not trigger it.
 
* More grasping at straws: is this stock community Postgres, or
some vendor's modification (eg RDS or Aurora)?

This is stock community Postgres from the PGDG repo for RHEL 9, via the postgresql.org website.
 

* It would be worth doing the EXPLAINs with the SETTINGS option,
just to make sure that there's not some non-default setting you
forgot to mention.

I did not mention a few values that differ between the servers that reproduce this, such as autovacuum tuning parameters and maintenance_work_mem.  Adding SETTINGS to the EXPLAIN surfaces a couple more, unlikely to be related to the problem:

Settings: temp_buffers = '512MB', work_mem = '1000MB', effective_io_concurrency = '16', effective_cache_size = '150GB'
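For anyone following along, a sketch of the EXPLAIN invocation with the SETTINGS option enabled (the query below is a placeholder for illustration, not the one from the original report):

```sql
-- SETTINGS appends a line listing every planner-relevant setting that
-- differs from its built-in default; ANALYZE and BUFFERS as usual.
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT id, array_agg(val)    -- placeholder query/table
FROM some_table
GROUP BY id;
```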


While writing this, I decided to test a few more vector extension cases, and discovered something new and mind-boggling:

On systems that reproduce the problem, if I create a new test database and run the query there, the problem does not occur.

e.g. 

create database test;
\c test
.... run all the commands in my first email

In the 'test' database everything is fine: queries are fast and memory use matches yours.  If I go back to the production database, the problem occurs again.
One thing the problem systems have in common is that they have had the pgvector extension installed for a while and have gone through some pg_upgrade cycles.

I have no idea where to go from here in identifying why one database behaves like this but not another -- on the same Postgres instance.
This _could_ still be the pgvector extension, or at least something to do with carrying it through a pg_upgrade.  The extension upgrade was executed after pg_upgrade, but maybe something went wrong there for this extension.
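One cheap check for a pg_upgrade-related mismatch is comparing the installed extension version against what the packaging ships, in both the misbehaving production database and the fresh test database (a diagnostic sketch, not something from the thread):

```sql
-- Version actually installed in the current database:
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';

-- Version the packages on this host make available:
SELECT name, default_version, installed_version
FROM pg_available_extensions WHERE name = 'vector';
```

If installed_version lags default_version in one database but not the other, that would at least confirm the two databases are not in the same extension state.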


Before discovering the above, my next plan was to set up Linux perf and capture some OS-level profiling on one of the near-idle read-only standbys that shows the problem, to see whether anything stands out.  I haven't investigated how to get decent stacks from it; I assume installing the debug packages from the repo would enable that, but I haven't researched it or used `perf` on Postgres before.
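A rough sketch of what that profiling run might look like (assumptions: perf is installed, the debuginfo packages for the matching postgresql17 build are available for symbol resolution, and the backend of interest can be found in pg_stat_activity):

```sh
# Install symbols first, e.g.:  sudo dnf debuginfo-install postgresql17-server
# (package name assumed; adjust to the installed PGDG package.)

# Find the backend PID running the slow query:
PID=$(psql -X -At -c "SELECT pid FROM pg_stat_activity
                      WHERE state = 'active' AND query LIKE '%array_agg%'
                      LIMIT 1")

# Sample that backend with call graphs for 30 seconds:
sudo perf record -g -p "$PID" -- sleep 30

# Summarize the hottest stacks:
sudo perf report --stdio | head -50
```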

 

                        regards, tom lane
 
