Thread: to_jsonb performance on array aggregated correlated subqueries
Good day,

consider the following query:

WITH aggregation AS (
    SELECT
        a.*,
        (SELECT array_agg(b.*) FROM b WHERE b.a_id = a.id) as "bs",
        (SELECT array_agg(c.*) FROM c WHERE c.a_id = a.id) as "cs",
        (SELECT array_agg(d.*) FROM d WHERE d.a_id = a.id) as "ds",
        (SELECT array_agg(e.*) FROM e WHERE e.a_id = a.id) as "es"
    FROM a WHERE a.id IN (<some big list, ranging from 20-180 entries>)
)
SELECT to_jsonb(aggregation.*) as "value" FROM aggregation;

Imagine that for each "a" there exist between 5-100 "b", "c", "d" and "e", which makes the result of this pretty big (worst case: around 300 kB when saved to a text file). I noticed that adding the "to_jsonb" increases the query time by 100%, from 9-10 ms to 17-23 ms on average.

This may not seem slow at all, but this query has another issue: on an AWS Aurora Serverless V2 instance, under high load (> 1000 queries/sec), we are running into a RAM usage of around 30-50 GB, compared to < 10 GB when using a simple LEFT JOINed query. Furthermore, the CPU usage is quite high.

Is there anything I could improve? I am open to other solutions, but I am wondering if I ran into an edge case of "to_jsonb" for "anonymous records" (these are just rows without a defined UDT) - this is just a wild guess though. I am mostly looking to decrease the load (CPU and memory) on Postgres itself, and I would like to know why the memory usage is so significant. Any tips on how to analyze this issue are appreciated as well - my knowledge is limited to being average at interpreting EXPLAIN ANALYZE results.
Here's a succinct list of the whys, what I have found out so far, and solutions I already tried or don't want to consider:

- LEFT JOINing potentially creates a huge result set because of the cartesian product - that's a no-no.
- Not using "to_jsonb" is sadly also not possible: Postgres' array + record syntax is very unfriendly and hard to parse in my application (it's barely documented, if at all, and the quoting rules are cumbersome; furthermore, I lack column names in the array, which would make the parsing sensitive to future table changes and thus cumbersome to maintain).
- I know I could solve this with a separate query for a, b, c, d and e while "joining" the results in my application, but I am looking for another way to do this (bear with me, treat this as an academic question :)).
- I am using "to_jsonb" to simply map the result to my data model via a JSON mapper.
- EXPLAIN ANALYZE is not showing anything special when using "to_jsonb" vs. not using it; the outermost (hash) join just takes more time. Is there a more granular EXPLAIN that shows me the runtime of functions like "to_jsonb"?
- I tried an approach where b, c, d, e were array columns of UDTs: UDTs are not well supported by my application stack (JDBC) and are generally undesirable for me (because of a lack of migration possibilities).
- I don't want to duplicate my data into another table (e.g. one that has jsonb columns).
- MATERIALIZED VIEWs are also undesirable because their refresh is manual and non-incremental, which would make a refresh on a big data set take a long time.
- Splitting the query into chunks to reduce the IN()-list size makes no measurable difference.
- I don't want to use JSONB columns for b, c, d and e because future changes of their structure (e.g. new fields, changing a datatype) are harder to achieve with JSONB, and it lacks constraint checks on insert (e.g. NOT NULL on column b.xy).

Kind regards and thank you for your time,
Nico Heller

P.S: Sorry for the long list of "I don't want to do this" - some of them are not possible because of other requirements.
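P.P.S: To make the "array + record syntax" point concrete, here is a toy example (made-up table, not my real schema); the escaping shown in the comments is approximate:

```sql
-- Toy stand-in for one of the child tables.
CREATE TABLE b (id int, a_id int, name text);
INSERT INTO b VALUES (1, 1, 'foo'), (2, 1, 'baz,qux');

-- Raw record-array output: positional fields, nested quoting rules.
SELECT array_agg(b.*) FROM b WHERE a_id = 1;
-- roughly: {"(1,1,foo)","(2,1,\"baz,qux\")"}

-- jsonb output: named fields, standard JSON quoting.
SELECT to_jsonb(array_agg(b.*)) FROM b WHERE a_id = 1;
-- [{"id": 1, "a_id": 1, "name": "foo"}, {"id": 2, "a_id": 1, "name": "baz,qux"}]
```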
What version of Postgres? I wonder if you're hitting the known memory leak involving JIT. Try with jit=off or jit_inline_above_cost=-1. -- Justin
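For example, to test this in a single session (these are stock Postgres GUCs; whether Aurora honors them is another question):

```sql
-- Turn JIT off entirely for this session:
SET jit = off;

-- Or keep JIT but disable inlining, which is where the suspected leak lives:
SET jit_inline_above_cost = -1;

-- EXPLAIN (ANALYZE) output ends with a "JIT:" section when JIT was used,
-- so you can confirm whether the setting took effect.
```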
I knew I forgot something: We are currently on 13.6. When was this issue fixed?
On 12.08.2022 at 21:02, Rick Otten wrote:
> On Fri, Aug 12, 2022 at 2:50 PM Nico Heller <nico.heller@posteo.de> wrote:
>> Good day, consider the following query: [...]
>
> - You do have an index on `b.a_id` and `c.a_id`, etc... ? You didn't say...

Yes, there are indices on all referenced columns of the subselects (they are all primary keys anyway).

> - Are you sure it is the `to_jsonb` that is making this query slow?

Yes, EXPLAIN ANALYZE shows a doubling of execution time - I don't have numbers on the memory usage difference though.

> - Since you are serializing this for easy machine-readable consumption outside of the database, does it make a difference if you use `to_json` instead?

Using to_json vs. to_jsonb makes no difference in regards to runtime; I will check if the memory consumption is different on Monday - thank you for the idea!
On Fri, Aug 12, 2022 at 07:02:36PM +0000, Nico Heller wrote:
> I knew I forgot something: We are currently on 13.6. When was this issue
> fixed?

There's a WIP/proposed fix, but the fix is not released. I asked about your version because jit was disabled by default in v11, but it's enabled by default in v12.

https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items#Older_bugs_affecting_stable_branches

-- Justin
Hi,

On 2022-08-12 18:49:58 +0000, Nico Heller wrote:
> WITH aggregation AS (
>     SELECT
>         a.*,
>         (SELECT array_agg(b.*) FROM b WHERE b.a_id = a.id) as "bs",
>         (SELECT array_agg(c.*) FROM c WHERE c.a_id = a.id) as "cs",
>         (SELECT array_agg(d.*) FROM d WHERE d.a_id = a.id) as "ds",
>         (SELECT array_agg(e.*) FROM e WHERE e.a_id = a.id) as "es"
>     FROM a WHERE a.id IN (<some big list, ranging from 20-180 entries>)
> )
> SELECT to_jsonb(aggregation.*) as "value" FROM aggregation;
>
> Imagine that for each "a" there exists between 5-100 "b", "c", "d" and "e"
> which makes the result of this pretty big (worst case: around 300kb when
> saved to a text file).
> I noticed that adding the "to_jsonb" increases the query time by 100%,
> from 9-10ms to 17-23ms on average.

Could we see the explain? Have you tried using json[b]_agg()?

> This may not seem slow at all but this query has another issue: on an AWS
> Aurora Serverless V2 instance we are running into a RAM usage of around
> 30-50 GB compared to < 10 GB when using a simple LEFT JOINed query when
> under high load (> 1000 queries / sec). Furthermore the CPU usage is
> quite high.

We can't say much about Aurora. It's a heavily modified fork of Postgres. Did you reproduce this with vanilla Postgres? And if so, do you have it in a form that somebody could try out?

Greetings,

Andres Freund
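Something like this, perhaps (untested sketch against your anonymized table names):

```sql
SELECT to_jsonb(a.*) || jsonb_build_object(
         'bs', (SELECT jsonb_agg(b) FROM b WHERE b.a_id = a.id),
         'cs', (SELECT jsonb_agg(c) FROM c WHERE c.a_id = a.id),
         'ds', (SELECT jsonb_agg(d) FROM d WHERE d.a_id = a.id),
         'es', (SELECT jsonb_agg(e) FROM e WHERE e.a_id = a.id)
       ) AS "value"
FROM a
WHERE a.id IN (<some big list>);
```

That skips the intermediate record arrays entirely: jsonb_agg(b) aggregates the rows straight into a JSON array of objects keyed by column name.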
On Fri, Aug 12, 2022 at 3:07 PM Nico Heller <nico.heller@posteo.de> wrote:
> [...]
One other thought: does it help if you convert the arrays to json first, before you convert the whole row? I.e., add some to_json()'s around the bs, cs, ds, es columns in the CTE. I'm wondering if breaking the json conversion up into smaller pieces will let the outer to_json() have less work to do and overall run faster. You could even separately serialize the elements inside the arrays too. I wouldn't think it would make a huge difference - you'd be making a bunch of extra to_json calls - but maybe it avoids some large memory structure that would otherwise have to be constructed to serialize all of those objects in all of the arrays at the same time.
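Something along these lines (untested sketch):

```sql
WITH aggregation AS (
    SELECT
        a.*,
        (SELECT to_jsonb(array_agg(b.*)) FROM b WHERE b.a_id = a.id) as "bs",
        (SELECT to_jsonb(array_agg(c.*)) FROM c WHERE c.a_id = a.id) as "cs",
        (SELECT to_jsonb(array_agg(d.*)) FROM d WHERE d.a_id = a.id) as "ds",
        (SELECT to_jsonb(array_agg(e.*)) FROM e WHERE e.a_id = a.id) as "es"
    FROM a WHERE a.id IN (<some big list>)
)
SELECT to_jsonb(aggregation.*) as "value" FROM aggregation;
```

The outer to_jsonb then only has to serialize the "a" columns and can pass the already-converted jsonb arrays through as-is.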
Using jsonb_agg and another to_jsonb at the end (it's still needed to create one value and to include the columns "a.*") worsens the query performance by 100%. I can't speak for the memory usage because I would have to push these changes to preproduction - will try this on Monday, thanks.
Here are the query plans (I hope my anonymization didn't break them). I ran every query a couple times before copying the plan to avoid timing issues because of disk access.
Ignore the sequential scan on one of the tables; it's very small (this will change in the future), so Postgres opts for a faster sequential scan. The other sequential scan is on the IN()-statement, which uses a VALUES list in the actual query (using a non-VALUES list makes no difference).
Overall the plan is quite optimal for me and performs really well considering the amount of rows it extracts and converts to json.
Notice how removing to_jsonb improves the query performance significantly (see last query plan) and how the cost is attributed to the hash join.
Using to_json instead of to_jsonb, or json_agg instead of jsonb_agg, makes no difference in query plan or execution time.
I used random ids, so I don't know how big the result got, but it shouldn't matter for the query plan:
array_agg, then to_jsonb (my initially posted query)
Hash Semi Join (cost=5.00..15947.39 rows=200 width=32) (actual time=0.266..18.128 rows=200 loops=1)
" Hash Cond: (a.id = ""*VALUES*"".column1)"
-> Seq Scan on a (cost=0.00..41.02 rows=502 width=422) (actual time=0.013..0.268 rows=502 loops=1)
-> Hash (cost=2.50..2.50 rows=200 width=32) (actual time=0.091..0.092 rows=200 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 21kB
" -> Values Scan on ""*VALUES*"" (cost=0.00..2.50 rows=200 width=32) (actual time=0.001..0.040 rows=200 loops=1)"
SubPlan 1
-> Aggregate (cost=42.20..42.21 rows=1 width=32) (actual time=0.020..0.020 rows=1 loops=200)
-> Bitmap Heap Scan on b (cost=4.38..42.17 rows=12 width=156) (actual time=0.012..0.017 rows=12 loops=200)
Recheck Cond: (a_id = a.id)
Heap Blocks: exact=382
-> Bitmap Index Scan on fk_b_idx (cost=0.00..4.37 rows=12 width=0) (actual time=0.008..0.008 rows=14 loops=200)
Index Cond: (a_id = a.id)
SubPlan 2
-> Aggregate (cost=27.68..27.69 rows=1 width=32) (actual time=0.012..0.012 rows=1 loops=200)
-> Bitmap Heap Scan on c (cost=4.35..27.66 rows=9 width=98) (actual time=0.009..0.010 rows=5 loops=200)
Recheck Cond: (a_id = a.id)
Heap Blocks: exact=169
-> Bitmap Index Scan on fk_c_idx (cost=0.00..4.35 rows=9 width=0) (actual time=0.007..0.007 rows=5 loops=200)
Index Cond: (a_id = a.id)
SubPlan 3
-> Aggregate (cost=8.30..8.31 rows=1 width=32) (actual time=0.009..0.010 rows=1 loops=200)
-> Index Scan using fk_d_idx on d (cost=0.28..8.29 rows=1 width=81) (actual time=0.008..0.008 rows=1 loops=200)
Index Cond: (a_id = a.id)
SubPlan 4
-> Aggregate (cost=1.27..1.28 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=200)
-> Seq Scan on e (cost=0.00..1.26 rows=1 width=76) (actual time=0.004..0.004 rows=0 loops=200)
Filter: (a_id = a.id)
Rows Removed by Filter: 21
Planning Time: 0.520 ms
Execution Time: 18.650 ms
jsonb_agg instead of array_agg, then to_jsonb
Hash Semi Join (cost=5.00..15947.39 rows=200 width=32) (actual time=0.338..23.921 rows=200 loops=1)
" Hash Cond: (a.id = ""*VALUES*"".column1)"
-> Seq Scan on a (cost=0.00..41.02 rows=502 width=422) (actual time=0.012..0.244 rows=502 loops=1)
-> Hash (cost=2.50..2.50 rows=200 width=32) (actual time=0.090..0.091 rows=200 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 21kB
" -> Values Scan on ""*VALUES*"" (cost=0.00..2.50 rows=200 width=32) (actual time=0.001..0.040 rows=200 loops=1)"
SubPlan 1
-> Aggregate (cost=42.20..42.21 rows=1 width=32) (actual time=0.050..0.050 rows=1 loops=200)
-> Bitmap Heap Scan on b (cost=4.38..42.17 rows=12 width=156) (actual time=0.012..0.018 rows=12 loops=200)
Recheck Cond: (a_id = a.id)
Heap Blocks: exact=382
-> Bitmap Index Scan on fk_b_idx (cost=0.00..4.37 rows=12 width=0) (actual time=0.008..0.008 rows=14 loops=200)
Index Cond: (a_id = a.id)
SubPlan 2
-> Aggregate (cost=27.68..27.69 rows=1 width=32) (actual time=0.028..0.028 rows=1 loops=200)
-> Bitmap Heap Scan on c (cost=4.35..27.66 rows=9 width=98) (actual time=0.009..0.011 rows=5 loops=200)
Recheck Cond: (a_id = a.id)
Heap Blocks: exact=169
-> Bitmap Index Scan on fk_c_idx (cost=0.00..4.35 rows=9 width=0) (actual time=0.007..0.007 rows=5 loops=200)
Index Cond: (a_id = a.id)
SubPlan 3
-> Aggregate (cost=8.30..8.31 rows=1 width=32) (actual time=0.014..0.014 rows=1 loops=200)
-> Index Scan using fk_d_idx on d (cost=0.28..8.29 rows=1 width=81) (actual time=0.008..0.008 rows=1 loops=200)
Index Cond: (a_id = a.id)
SubPlan 4
-> Aggregate (cost=1.27..1.28 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=200)
-> Seq Scan on e (cost=0.00..1.26 rows=1 width=76) (actual time=0.004..0.004 rows=0 loops=200)
Filter: (a_id = a.id)
Rows Removed by Filter: 21
Planning Time: 0.513 ms
Execution Time: 24.020 ms
array_agg without to_jsonb at the end
Hash Semi Join (cost=5.00..15946.89 rows=200 width=550) (actual time=0.209..9.784 rows=200 loops=1)
" Hash Cond: (a.id = ""*VALUES*"".column1)"
-> Seq Scan on a (cost=0.00..41.02 rows=502 width=422) (actual time=0.013..0.190 rows=502 loops=1)
-> Hash (cost=2.50..2.50 rows=200 width=32) (actual time=0.079..0.080 rows=200 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 21kB
" -> Values Scan on ""*VALUES*"" (cost=0.00..2.50 rows=200 width=32) (actual time=0.001..0.040 rows=200 loops=1)"
SubPlan 1
-> Aggregate (cost=42.20..42.21 rows=1 width=32) (actual time=0.019..0.019 rows=1 loops=200)
-> Bitmap Heap Scan on b (cost=4.38..42.17 rows=12 width=156) (actual time=0.012..0.017 rows=12 loops=200)
Recheck Cond: (a_id = a.id)
Heap Blocks: exact=382
-> Bitmap Index Scan on fk_b_idx (cost=0.00..4.37 rows=12 width=0) (actual time=0.008..0.008 rows=14 loops=200)
Index Cond: (a_id = a.id)
SubPlan 2
-> Aggregate (cost=27.68..27.69 rows=1 width=32) (actual time=0.012..0.012 rows=1 loops=200)
-> Bitmap Heap Scan on c (cost=4.35..27.66 rows=9 width=98) (actual time=0.008..0.010 rows=5 loops=200)
Recheck Cond: (a_id = a.id)
Heap Blocks: exact=169
-> Bitmap Index Scan on fk_c_idx (cost=0.00..4.35 rows=9 width=0) (actual time=0.007..0.007 rows=5 loops=200)
Index Cond: (a_id = a.id)
SubPlan 3
-> Aggregate (cost=8.30..8.31 rows=1 width=32) (actual time=0.009..0.009 rows=1 loops=200)
-> Index Scan using fk_d_idx on d (cost=0.28..8.29 rows=1 width=81) (actual time=0.008..0.008 rows=1 loops=200)
Index Cond: (a_id = a.id)
SubPlan 4
-> Aggregate (cost=1.27..1.28 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=200)
-> Seq Scan on e (cost=0.00..1.26 rows=1 width=76) (actual time=0.004..0.004 rows=0 loops=200)
Filter: (a_id = a.id)
Rows Removed by Filter: 21
Planning Time: 0.496 ms
Execution Time: 9.892 ms