Thread: json_to_recordset() and CTE performance
Hello,

I've written an sql function that takes a single JSON parameter, in this case an array of objects each with eight properties (example below). This function uses json_to_recordset() in a CTE to insert three rows across two tables per record. It takes nearly 7 minutes to insert my dataset of 8935 records (in JSON), using a small python script (the script parses the JSON in < 1s) and PostgreSQL 13 on my MacBook Air.

As an experiment I wrote a second sql function with 8 parameters representing each property of the JSON objects, requiring a discrete function call per record. I wrote a python script to insert all 8935 records (making as many calls to the sql function), which executed in around 2 minutes.

I'm very much a novice at interpreting EXPLAIN (ANALYZE) and am hoping someone can help me better optimize my original function. Both the function and the results of EXPLAIN ANALYZE are provided below. Is it perhaps a limitation of CTEs or json_to_recordset(), such that an entirely different approach is necessary (like the second function I wrote, with one call per record)?

I was really hoping to make this work. I've written a small API to my database using sql and plpgsql functions, each taking a single JSON parameter, and my web backend acts almost as an http proxy to the database. It's a different way of doing things as far as webdev goes, but (for me) an interesting experiment. I like PostgreSQL and really want to take advantage of it.

I've tried a few "optimizations". Removing the final SELECT and returning VOID saves around 1.5 min. Removing some extra JOINs saves a little time but nothing substantial (the joins against account.t_account in some places are to check "ownership" of the record by a given client_id). Using a subquery seems like it could be much faster than a CTE, but I don't know how to insert three rows across two tables using a subquery.

Any advice is appreciated, thank you in advance.

Matt

INPUT: (The CTE illustrates all the properties and types)

[
  {
    "bank_account_id": 1324,
    "transaction_id": "abc123",
    "transaction_date": "2020-10-20",
    ...
  },
  ...
]

OUTPUT: (Not sure what I've done to create the nested arrays, but it's unnecessary; I can shave off ~1.5 min by returning VOID.)

[
  [
    { "id": 250185 },
    { "id": 250186 },
    ...
  ]
]

FUNCTION:

CREATE OR REPLACE FUNCTION journal.create_with_categories(in_json JSON)
RETURNS JSON AS $$
    WITH data AS (
        SELECT
            (in_json#>>'{context, client_id}')::BIGINT AS client_id,
            nextval(pg_get_serial_sequence('journal', 'id')) AS journal_id,
            bank_transaction_type,
            x.*
        FROM json_to_recordset(in_json->'data') AS x (
            bank_account_id   BIGINT,
            transaction_id    TEXT,
            transaction_date  DATE,
            posted_date       DATE,
            amount            finance.monetary,
            description       TEXT,
            parent_account_id BIGINT,
            child_account_id  BIGINT
        ),
        LATERAL bank_account.get_transaction_type_by_id(x.bank_account_id, x.amount) AS bank_transaction_type
    ),
    insert_journal_entry AS (
        INSERT INTO journal.journal (
            client_id, id, bank_account_id, transaction_id,
            transaction_date, posted_date, description
        )
        SELECT
            client_id, journal_id, bank_account_id, transaction_id,
            transaction_date, posted_date, description
        FROM data
    ),
    insert_account_entries AS (
        INSERT INTO journal.account2journal (account_id, journal_id, amount, type)
        -- T account
        SELECT
            t.id,
            d.journal_id,
            @ d.amount,
            CASE WHEN d.bank_transaction_type = 'debit'::transaction_type
                 THEN 'credit'::transaction_type
                 ELSE 'debit'::transaction_type
            END
        FROM data d
        LEFT JOIN account.t_account t
            ON (t.id = COALESCE(d.child_account_id, d.parent_account_id))
        WHERE t.client_id = d.client_id OR t.id IS NULL
        UNION ALL
        -- bank account
        SELECT
            t.id,
            d.journal_id,
            @ d.amount,
            d.bank_transaction_type
        FROM data d
        JOIN bank_account.bank_account b ON (b.id = d.bank_account_id)
        JOIN account.t_account t ON (t.id = b.t_account_id)
        WHERE t.client_id = d.client_id
    )
    SELECT json_agg(d) FROM (SELECT d.journal_id AS id FROM data AS d) AS d;
$$ LANGUAGE sql;

EXPLAIN ANALYZE: (From logs)

Aggregate  (cost=24.24..24.25 rows=1 width=32) (actual time=388926.249..388926.371 rows=1 loops=1)
  Buffers: shared hit=53877 dirtied=2
  CTE data
    ->  Nested Loop  (cost=0.26..4.76 rows=100 width=148) (actual time=183.906..388716.550 rows=8935 loops=1)
          Buffers: shared hit=53877 dirtied=2
          ->  Function Scan on json_to_recordset x  (cost=0.01..1.00 rows=100 width=128) (actual time=130.645..142.316 rows=8935 loops=1)
          ->  Function Scan on get_transaction_type_by_id bank_transaction_type  (cost=0.25..0.26 rows=1 width=4) (actual time=0.154..0.156 rows=1 loops=8935)
                Buffers: shared hit=18054
  CTE insert_journal_entry
    ->  Insert on journal  (cost=0.00..2.00 rows=100 width=96) (actual time=453.563..453.563 rows=0 loops=1)
          Buffers: shared hit=79242 dirtied=295
          ->  CTE Scan on data  (cost=0.00..2.00 rows=100 width=96) (actual time=0.006..10.001 rows=8935 loops=1)
  CTE insert_account_entries
    ->  Insert on account2journal  (cost=4.86..15.23 rows=2 width=52) (actual time=816.381..816.381 rows=0 loops=1)
          Buffers: shared hit=159273 dirtied=335 written=17
          ->  Result  (cost=4.86..15.23 rows=2 width=52) (actual time=0.206..109.222 rows=17870 loops=1)
                Buffers: shared hit=5
                ->  Append  (cost=4.86..15.20 rows=2 width=52) (actual time=0.197..95.060 rows=17870 loops=1)
                      Buffers: shared hit=5
                      ->  Hash Left Join  (cost=4.86..7.14 rows=1 width=52) (actual time=0.195..35.512 rows=8935 loops=1)
                            Hash Cond: (COALESCE(d_1.child_account_id, d_1.parent_account_id) = t.id)
                            Filter: ((t.client_id = d_1.client_id) OR (t.id IS NULL))
                            Buffers: shared hit=2
                            ->  CTE Scan on data d_1  (cost=0.00..2.00 rows=100 width=68) (actual time=0.004..6.544 rows=8935 loops=1)
                            ->  Hash  (cost=3.27..3.27 rows=127 width=16) (actual time=0.137..0.137 rows=127 loops=1)
                                  Buffers: shared hit=2
                                  ->  Seq Scan on t_account t  (cost=0.00..3.27 rows=127 width=16) (actual time=0.026..0.073 rows=127 loops=1)
                                        Buffers: shared hit=2
                      ->  Hash Join  (cost=3.80..8.03 rows=1 width=52) (actual time=40.182..53.796 rows=8935 loops=1)
                            Hash Cond: ((t_1.id = b.t_account_id) AND (t_1.client_id = d_2.client_id))
                            Buffers: shared hit=3
                            ->  Seq Scan on t_account t_1  (cost=0.00..3.27 rows=127 width=16) (actual time=0.022..0.079 rows=127 loops=1)
                                  Buffers: shared hit=2
                            ->  Hash  (cost=3.59..3.59 rows=14 width=60) (actual time=40.118..40.118 rows=8935 loops=1)
                                  Buffers: shared hit=1
                                  ->  Hash Join  (cost=1.32..3.59 rows=14 width=60) (actual time=0.071..17.863 rows=8935 loops=1)
                                        Hash Cond: (d_2.bank_account_id = b.id)
                                        Buffers: shared hit=1
                                        ->  CTE Scan on data d_2  (cost=0.00..2.00 rows=100 width=60) (actual time=0.005..3.740 rows=8935 loops=1)
                                        ->  Hash  (cost=1.14..1.14 rows=14 width=16) (actual time=0.030..0.030 rows=14 loops=1)
                                              Buffers: shared hit=1
                                              ->  Seq Scan on bank_account b  (cost=0.00..1.14 rows=14 width=16) (actual time=0.012..0.016 rows=14 loops=1)
                                                    Buffers: shared hit=1
  ->  CTE Scan on data d  (cost=0.00..2.00 rows=100 width=8) (actual time=183.918..388812.950 rows=8935 loops=1)
        Buffers: shared hit=53877 dirtied=2
Trigger for constraint journal_client_id_fkey on journal: time=194.194 calls=8935
Trigger for constraint journal_bank_account_id_fkey on journal: time=204.014 calls=8935
Trigger trigger_journal_import_sequence on journal: time=373.344 calls=1
Trigger for constraint account2journal_account_id_fkey on account2journal: time=580.482 calls=17870
Trigger trigger_debits_equal_credits_on_insert on account2journal: time=116.653 calls=1
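For reference, the function is invoked with a single JSON value shaped roughly like this (values made up and payload abbreviated, but it shows the "context" and "data" keys the function expects):

SELECT journal.create_with_categories('{
    "context": { "client_id": 42 },
    "data": [
        {
            "bank_account_id": 1324,
            "transaction_id": "abc123",
            "transaction_date": "2020-10-20",
            "posted_date": "2020-10-21",
            "amount": 125.00,
            "description": "Example transaction",
            "parent_account_id": 10,
            "child_account_id": null
        }
    ]
}'::json);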
Version? What is the value for work_mem and other configs that are non-default? I see some estimates that are rather off like -
-> Nested Loop (cost=0.26..4.76 rows=100 width=148) (actual time=183.906..388716.550 rows=8935 loops=1)
Buffers: shared hit=53877 dirtied=2
-> Function Scan on json_to_recordset x (cost=0.01..1.00 rows=100 width=128) (actual time=130.645..142.316 rows=8935 loops=1)
-> Function Scan on get_transaction_type_by_id bank_transaction_type (cost=0.25..0.26 rows=1 width=4) (actual time=0.154..0.156 rows=1 loops=8935)
Buffers: shared hit=18054
Sometimes putting data into a temp table and analyzing it can be rather helpful to ensure the planner has statistics on the number of records, ndistinct, most common values, etc. I would try doing that with the result of json_to_recordset and skipping the function call to get_transaction_type_by_id until later, just to see how it performs.
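A rough, untested sketch of what I mean ($1 stands in for the JSON argument, and tmp_import is just a name I picked; in practice this would probably live in a plpgsql variant of the function, or be run ad hoc with the JSON literal pasted in, just to compare timings):

CREATE TEMP TABLE tmp_import AS
SELECT x.*
FROM json_to_recordset($1->'data') AS x (
    bank_account_id   BIGINT,
    transaction_id    TEXT,
    transaction_date  DATE,
    posted_date       DATE,
    amount            finance.monetary,
    description       TEXT,
    parent_account_id BIGINT,
    child_account_id  BIGINT
);

-- collect row counts, ndistinct, most common values, etc. so the planner stops guessing
ANALYZE tmp_import;

-- then run the two INSERT ... SELECT statements against tmp_import instead of the
-- "data" CTE, and bring in bank_account.get_transaction_type_by_id() in a later step.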
That said, it seems like a hardcoded estimate from json_to_recordset is 100 perhaps. I haven't checked source code, but I know when defining a set returning function, there is a ROWS option which provides the planner a static value to assume will come out of that function so it would make sense perhaps.
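Just to illustrate the ROWS option (on a hypothetical wrapper function here, since json_to_recordset itself is built in and not something you'd normally alter):

-- hypothetical example: ROWS gives the planner a row-count estimate for a set-returning function
CREATE OR REPLACE FUNCTION demo_unpack(in_json JSON)
RETURNS TABLE (bank_account_id BIGINT, amount NUMERIC)
ROWS 9000  -- planner assumes ~9000 result rows instead of the generic default
AS $$
    SELECT bank_account_id, amount
    FROM json_to_recordset(in_json->'data')
         AS x (bank_account_id BIGINT, amount NUMERIC);
$$ LANGUAGE sql STABLE;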
PostgreSQL 13.0.
You’d have to be specific about the configs you’re looking for, I’m using Postgres.app (postgresapp.com) and am uncertain if it’s distributed with non-default configs.
But, a quick grep shows these items that are configured:
max_wal_size = 1GB
min_wal_size = 80MB
shared_buffers = 128MB
work_mem is not configured so presumably it’s the default of 4MB.
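If it helps, this is the query I used to double-check the grep; I believe source = 'default' marks settings that haven't been changed anywhere:

SELECT name, setting, unit, source
FROM pg_settings
WHERE source <> 'default'
ORDER BY name;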
I'll try the temp tables. That sounds similar to what I found searching online - are you suggesting that as a permanent solution, or just as a means to better analyze performance?
Thanks,
Matt