json_to_recordset() and CTE performance

From: Matt DeLuco
Date:
Hello,

I’ve written an SQL function that takes a single JSON parameter, in this case an array of objects each with eight
properties (example below). The function uses json_to_recordset() in a CTE to insert three rows across two tables for
each input record. It takes nearly 7 minutes to insert my dataset of 8935 records (in JSON), using a small Python script
(the script parses the JSON in < 1 s) and PostgreSQL 13 on my MacBook Air.

As an experiment I wrote a second SQL function with 8 parameters, one per property of the JSON objects, which requires
a discrete function call per record. I wrote a Python script to insert all 8935 records (making as many calls to the
SQL function), and it executed in around 2 minutes.

I’m very much a novice at interpreting EXPLAIN (ANALYZE) and am hoping someone can help me better optimize my original
function. Both the function and the EXPLAIN (ANALYZE) output are provided below. Is this perhaps a limitation of CTEs or
json_to_recordset(), so that an entirely different approach is necessary (like the second function I wrote, with one
call per record)? I was really hoping to make this work: I’ve written a small API to my database using SQL and PL/pgSQL
functions, each taking a single JSON parameter, and my web backend acts almost as an HTTP proxy to the database. It’s a
different way of doing things as far as web development goes, but (for me) an interesting experiment. I like PostgreSQL
and really want to take advantage of it.

I’ve tried a few “optimizations”. Removing the final SELECT and returning VOID saves around 1.5 minutes. Removing some
extra JOINs saves a little time but nothing substantial (the joins against account.t_account in some places are there to
check “ownership” of the record by a given client_id).

Using a subquery seems like it could be much faster than a CTE, but I don’t know how to insert three rows across two
tables using a subquery.
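
One idea (sketched below, completely untested; the _v2 function name and the _data temp table name are made up) would
be a PL/pgSQL version that parses the JSON once into a temporary table and then runs plain INSERT ... SELECT statements
instead of data-modifying CTEs:

CREATE OR REPLACE FUNCTION journal.create_with_categories_v2(in_json JSON)
RETURNS VOID
AS $$
BEGIN
    -- Sketch only: a second call in the same transaction would need DROP TABLE IF EXISTS first.
    CREATE TEMP TABLE _data (
        client_id         BIGINT,
        journal_id        BIGINT,
        bank_account_id   BIGINT,
        transaction_id    TEXT,
        transaction_date  DATE,
        posted_date       DATE,
        amount            finance.monetary,
        description       TEXT,
        parent_account_id BIGINT,
        child_account_id  BIGINT
    ) ON COMMIT DROP;

    INSERT INTO _data
    SELECT
        (in_json #>> '{context, client_id}')::BIGINT,
        nextval(pg_get_serial_sequence('journal', 'id')),
        x.*
    FROM json_to_recordset(in_json -> 'data') AS x (
        bank_account_id BIGINT,
        transaction_id TEXT,
        transaction_date DATE,
        posted_date DATE,
        amount finance.monetary,
        description TEXT,
        parent_account_id BIGINT,
        child_account_id BIGINT
    );

    -- Unlike a CTE, the temp table can be ANALYZEd so the planner sees real row counts.
    ANALYZE _data;

    INSERT INTO journal.journal (
        client_id, id, bank_account_id,
        transaction_id, transaction_date, posted_date,
        description
    )
    SELECT
        client_id, journal_id, bank_account_id,
        transaction_id, transaction_date, posted_date,
        description
    FROM _data;

    -- The second INSERT (into journal.account2journal) would reuse the UNION ALL
    -- query from the original function, reading FROM _data instead of FROM data.
END;
$$ LANGUAGE plpgsql;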

Any advice is appreciated, thank you in advance.

Matt


INPUT:
(The CTE illustrates all the properties and types)

[
  {
    "bank_account_id": 1324,
    "transaction_id": "abc123",
    "transaction_date": "2020-10-20",
    …
  },
  …
]

OUTPUT:
(Not sure what I’ve done to create the nested arrays, but it’s unnecessary;
I can shave off ~1.5 min by returning VOID.)

[
  [
    {
      "id": 250185
    },
    {
      "id": 250186
    },
    ...
  ]
]

FUNCTION:

CREATE OR REPLACE FUNCTION journal.create_with_categories(in_json JSON)
RETURNS JSON
AS $$

    WITH data AS (
        SELECT
            (in_json#>>'{context, client_id}')::BIGINT AS client_id,
            nextval(pg_get_serial_sequence('journal', 'id')) AS journal_id,
            bank_transaction_type,
            x.*
        FROM json_to_recordset(in_json->'data')
        AS x (
            bank_account_id BIGINT,
            transaction_id TEXT,
            transaction_date DATE,
            posted_date DATE,
            amount finance.monetary,
            description TEXT,
            parent_account_id BIGINT,
            child_account_id BIGINT
        ),
        LATERAL bank_account.get_transaction_type_by_id(x.bank_account_id, x.amount) AS bank_transaction_type
    ),
    insert_journal_entry AS (
        INSERT INTO journal.journal (
            client_id, id, bank_account_id,
            transaction_id, transaction_date, posted_date,
            description
        )
        SELECT
            client_id, journal_id, bank_account_id,
            transaction_id, transaction_date, posted_date,
            description
        FROM data
    ),
    insert_account_entries AS (
        INSERT INTO journal.account2journal(
            account_id, journal_id, amount, type
        )
        -- T account
        SELECT
            t.id,
            d.journal_id,
            @ d.amount,
            CASE WHEN d.bank_transaction_type = 'debit'::transaction_type
                THEN 'credit'::transaction_type
                ELSE 'debit'::transaction_type
            END
        FROM data d
        LEFT JOIN account.t_account t
            ON (t.id = COALESCE(d.child_account_id, d.parent_account_id))
        WHERE t.client_id = d.client_id OR t.id IS NULL

        UNION ALL

        -- bank account
        SELECT
            t.id, d.journal_id, @ d.amount, d.bank_transaction_type
        FROM data d
        JOIN bank_account.bank_account b
            ON (b.id = d.bank_account_id)
        JOIN account.t_account t
            ON (t.id = b.t_account_id)
        WHERE
            t.client_id = d.client_id
    )
    SELECT json_agg(d) FROM (SELECT d.journal_id AS id FROM data AS d) AS d;

$$ LANGUAGE sql;


EXPLAIN ANALYZE:
(From logs)

        Aggregate  (cost=24.24..24.25 rows=1 width=32) (actual time=388926.249..388926.371 rows=1 loops=1)
          Buffers: shared hit=53877 dirtied=2
          CTE data
            ->  Nested Loop  (cost=0.26..4.76 rows=100 width=148) (actual time=183.906..388716.550 rows=8935 loops=1)
                  Buffers: shared hit=53877 dirtied=2
                  ->  Function Scan on json_to_recordset x  (cost=0.01..1.00 rows=100 width=128) (actual time=130.645..142.316 rows=8935 loops=1)
                  ->  Function Scan on get_transaction_type_by_id bank_transaction_type  (cost=0.25..0.26 rows=1 width=4) (actual time=0.154..0.156 rows=1 loops=8935)
                        Buffers: shared hit=18054
          CTE insert_journal_entry
            ->  Insert on journal  (cost=0.00..2.00 rows=100 width=96) (actual time=453.563..453.563 rows=0 loops=1)
                  Buffers: shared hit=79242 dirtied=295
                  ->  CTE Scan on data  (cost=0.00..2.00 rows=100 width=96) (actual time=0.006..10.001 rows=8935 loops=1)
          CTE insert_account_entries
            ->  Insert on account2journal  (cost=4.86..15.23 rows=2 width=52) (actual time=816.381..816.381 rows=0 loops=1)
                  Buffers: shared hit=159273 dirtied=335 written=17
                  ->  Result  (cost=4.86..15.23 rows=2 width=52) (actual time=0.206..109.222 rows=17870 loops=1)
                        Buffers: shared hit=5
                        ->  Append  (cost=4.86..15.20 rows=2 width=52) (actual time=0.197..95.060 rows=17870 loops=1)
                              Buffers: shared hit=5
                              ->  Hash Left Join  (cost=4.86..7.14 rows=1 width=52) (actual time=0.195..35.512 rows=8935 loops=1)
                                    Hash Cond: (COALESCE(d_1.child_account_id, d_1.parent_account_id) = t.id)
                                    Filter: ((t.client_id = d_1.client_id) OR (t.id IS NULL))
                                    Buffers: shared hit=2
                                    ->  CTE Scan on data d_1  (cost=0.00..2.00 rows=100 width=68) (actual time=0.004..6.544 rows=8935 loops=1)
                                    ->  Hash  (cost=3.27..3.27 rows=127 width=16) (actual time=0.137..0.137 rows=127 loops=1)
                                          Buffers: shared hit=2
                                          ->  Seq Scan on t_account t  (cost=0.00..3.27 rows=127 width=16) (actual time=0.026..0.073 rows=127 loops=1)
                                                Buffers: shared hit=2
                              ->  Hash Join  (cost=3.80..8.03 rows=1 width=52) (actual time=40.182..53.796 rows=8935 loops=1)
                                    Hash Cond: ((t_1.id = b.t_account_id) AND (t_1.client_id = d_2.client_id))
                                    Buffers: shared hit=3
                                    ->  Seq Scan on t_account t_1  (cost=0.00..3.27 rows=127 width=16) (actual time=0.022..0.079 rows=127 loops=1)
                                          Buffers: shared hit=2
                                    ->  Hash  (cost=3.59..3.59 rows=14 width=60) (actual time=40.118..40.118 rows=8935 loops=1)
                                          Buffers: shared hit=1
                                          ->  Hash Join  (cost=1.32..3.59 rows=14 width=60) (actual time=0.071..17.863 rows=8935 loops=1)
                                                Hash Cond: (d_2.bank_account_id = b.id)
                                                Buffers: shared hit=1
                                                ->  CTE Scan on data d_2  (cost=0.00..2.00 rows=100 width=60) (actual time=0.005..3.740 rows=8935 loops=1)
                                                ->  Hash  (cost=1.14..1.14 rows=14 width=16) (actual time=0.030..0.030 rows=14 loops=1)
                                                      Buffers: shared hit=1
                                                      ->  Seq Scan on bank_account b  (cost=0.00..1.14 rows=14 width=16) (actual time=0.012..0.016 rows=14 loops=1)
                                                            Buffers: shared hit=1
          ->  CTE Scan on data d  (cost=0.00..2.00 rows=100 width=8) (actual time=183.918..388812.950 rows=8935 loops=1)
                Buffers: shared hit=53877 dirtied=2
        Trigger for constraint journal_client_id_fkey on journal: time=194.194 calls=8935
        Trigger for constraint journal_bank_account_id_fkey on journal: time=204.014 calls=8935
        Trigger trigger_journal_import_sequence on journal: time=373.344 calls=1
        Trigger for constraint account2journal_account_id_fkey on account2journal: time=580.482 calls=17870
        Trigger trigger_debits_equal_credits_on_insert on account2journal: time=116.653 calls=1





Re: json_to_recordset() and CTE performance

From: Michael Lewis
Date:
Version? What is the value for work_mem and other configs that are non-default? I see some estimates that are rather off like -

            ->  Nested Loop  (cost=0.26..4.76 rows=100 width=148) (actual time=183.906..388716.550 rows=8935 loops=1)
                  Buffers: shared hit=53877 dirtied=2
                  ->  Function Scan on json_to_recordset x  (cost=0.01..1.00 rows=100 width=128) (actual time=130.645..142.316 rows=8935 loops=1)
                  ->  Function Scan on get_transaction_type_by_id bank_transaction_type  (cost=0.25..0.26 rows=1 width=4) (actual time=0.154..0.156 rows=1 loops=8935)
                        Buffers: shared hit=18054

Sometimes putting data into a temp table and analyzing it can be rather helpful to ensure the planner has statistics on the number of records, ndistinct, most common values, etc. I would try doing that with the result of json_to_recordset and skipping the function call to get_transaction_type_by_id until later, just to see how it performs.

That said, it seems like a hardcoded estimate from json_to_recordset is 100 perhaps. I haven't checked source code, but I know when defining a set returning function, there is a ROWS option which provides the planner a static value to assume will come out of that function so it would make sense perhaps.
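
A rough sketch of that experiment in psql (the tmp_data table and the :payload variable holding the JSON are made up;
the column definitions and function names come from the original query), just to see what the plan looks like once the
planner has real statistics:

BEGIN;

-- Materialize only the columns needed for this test.
CREATE TEMP TABLE tmp_data ON COMMIT DROP AS
SELECT x.*
FROM json_to_recordset((:'payload')::json -> 'data') AS x (
    bank_account_id BIGINT,
    amount finance.monetary
);

-- Now the planner sees ~8935 rows rather than the default estimate of 100.
ANALYZE tmp_data;

-- Check the joins on their own, leaving out get_transaction_type_by_id entirely for now.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.id, d.bank_account_id, d.amount
FROM tmp_data AS d
JOIN bank_account.bank_account b ON b.id = d.bank_account_id
JOIN account.t_account t ON t.id = b.t_account_id;

ROLLBACK;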

Re: json_to_recordset() and CTE performance

From: Matt DeLuco
Date:
PostgreSQL 13.0.

You’d have to be specific about which configs you’re looking for; I’m using Postgres.app (postgresapp.com) and am uncertain whether it’s distributed with non-default configs.

But, a quick grep shows these items that are configured:
max_wal_size = 1GB
min_wal_size = 80MB
shared_buffers = 128MB

work_mem is not configured so presumably it’s the default of 4MB.
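
(For completeness, a query along these lines lists anything not at its built-in default, without grepping the config
files; pg_settings is a standard system view:)

SELECT name, setting, unit, source
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;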

I’ll try the temp tables. That seems similar to what I found searching online - are you suggesting that as a permanent solution, or just as a means to better analyze performance?

Thanks,

Matt
