Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17 - Mailing list pgsql-performance
| From | Scott Carey |
|---|---|
| Subject | Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17 |
| Date | |
| Msg-id | CA+vubOGqg7f2E0_OW8EMruAObejnCS1gCgguT+cnwHG9bRxMDQ@mail.gmail.com Whole thread |
| In response to | Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17 (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses |
Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17
|
| List | pgsql-performance |
That is a very strong clue. Check for property differences (e.g.
with psql's "\l+" and "\drds") between the new test database and
the database where you see the problem.
This database is old. It pre-dates Postgres 8.4 which introduced array_agg. Apparently, in some version prior to 8.4 array_agg was added as a user function, defined as below for bigint:
create AGGREGATE array_agg(
BASETYPE = bigint,
SFUNC = array_append,
STYPE = bigint[],
INITCOND = '{}'
);
BASETYPE = bigint,
SFUNC = array_append,
STYPE = bigint[],
INITCOND = '{}'
);
So if you create a test database and run the previous test, performance will be fine and the query will be fast. Then run:
create AGGREGATE array_agg(BASETYPE = bigint, SFUNC = array_append,STYPE = bigint[], INITCOND = '{}');
create AGGREGATE array_agg(BASETYPE = bigint, SFUNC = array_append,STYPE = bigint[], INITCOND = '{}');
It will be slow and reproduce this behavior.
This leaves a few open questions:
Why would this run so much more slowly after updating from postgres 12 to 17? It is a user defined aggregate, although maybe not as optimized as the intrinsic one it shouldn't behave this way.
If instead I create the same thing with a different name: "array_agg_alt" , the performance of that aggregate is awful too when combined with HashAggregate query plans. So this is not due to name aliasing. How many other user defined aggregates does this affect? Would it affect simple ones like an alternate for "min" or only those that grow in size and accumulate data?
It looks like I can fix this with a simple "drop aggregate array_agg(bigint)", as the built-in function remains after removing this. But I am left wondering how many user defined aggregates have a similar problem.
A bit more history / info in case someone stumbles upon this:
The fact that the problem did not reproduce on a new / fresh database on the same postgres instance that otherwise had the problem was a huge clue. I just had to think more about all the ways the databases differ. Server settings, hardware, background activity -- these could all be ruled out now as they were the same for both. My first thoughts were related to the age of the database and all of the upgrade cycles it had been through, and the size of the database. There are tens of thousands of tables in the production db (many partition tables). But one of the systems reproducing the problem identically (my laptop) had the same schema, but almost no partition tables and only small test data. So I assumed the pure db size was not to blame.
The fresh database was an almost empty schema, the one with the problem was large, old, and crufty. Schemas for this system are built from scratch regularly for testing by running a sequence of schema update files -- almost 2000 of them, essentially a history of all schema updates since the birth of the database. So I decided to do a binary search 'bisect' on these 2000 update files, halving the number of candidate changes with each iteration. It turned out one of the earliest schema changes was the one to blame, adding a user defined array_agg. This was dated from 2008, before Postgres had its own built-in array_agg function.
This did not cause any problems until the upgrade from Postgres 12 to 17 triggered this behavior. One of my test systems on Postgres 16 also reproduces the problem, so I assume this was introduced between version 13 and 16 inclusive.
pgsql-performance by date: