Re: Collecting statistics about contents of JSONB columns - Mailing list pgsql-hackers

From Nikita Glukhov
Subject Re: Collecting statistics about contents of JSONB columns
Date
Msg-id bf7e1b50-8a35-a98a-251f-2a243558e470@postgrespro.ru
In response to Re: Collecting statistics about contents of JSONB columns  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Collecting statistics about contents of JSONB columns  (Mahendra Singh Thalor <mahi6run@gmail.com>)
Re: Collecting statistics about contents of JSONB columns  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers


On 04.02.2022 05:47, Tomas Vondra wrote:
On 1/25/22 17:56, Mahendra Singh Thalor wrote:
>
...

For the last few days I have been trying to understand these patches, and based on Tomas's suggestion I ran some performance tests.

With the attached .SQL file, I can see that ANALYZE takes more time with these patches.

I haven't found the root cause, but I suspect the time is spent in the loop over all the paths.
In my test data there are 951 distinct paths in total. During analysis, we first scan all the sample rows (30000) and collect all the distinct paths (here 951); after that, for every single path, we loop over all the sample rows again to collect stats for that particular path. I suspect these loops are what takes the time.
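The flow described above can be sketched roughly like this (a toy Python model with invented helper names, not the patch's actual C code), which makes the O(paths × rows) cost visible:

```python
def extract_paths(doc, prefix=()):
    """Return the set of all key paths in a nested dict."""
    paths = set()
    for key, val in doc.items():
        path = prefix + (key,)
        paths.add(path)
        if isinstance(val, dict):
            paths |= extract_paths(val, path)
    return paths

def lookup(doc, path):
    """Fetch the value at a path, or None if the path is absent."""
    for key in path:
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc

def analyze(sample):
    # pass 1: scan every row once to collect all distinct paths
    paths = set()
    for doc in sample:
        paths |= extract_paths(doc)
    # pass 2: one full scan of the sample per path -> O(paths * rows)
    stats = {}
    for path in paths:
        values = [lookup(doc, path) for doc in sample]
        stats[path] = sum(v is not None for v in values)  # stand-in for real stats
    return stats

sample = [{"a": {"b": 1}}, {"a": {"c": 2}}, {"a": {"b": 3}}]
print(analyze(sample)[("a", "b")])  # path "a.b" appears in 2 of 3 rows
```

With 951 paths and 30000 rows, pass 2 alone performs ~28.5 million lookups, which matches the slowdown observed.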

Thanks, I've been doing some performance tests too, and you're right, it takes quite a bit of time.

That is absolutely not surprising, I have warned about poor performance
in cases with a large number of paths.

I agree the slowness is largely due to extracting all paths and then processing them one by one - which means we have to loop over the tuples over and over. In this case there are about 850k distinct paths extracted, so we do ~850k loops over 30k tuples. That's gotta take time.

I don't know what exactly to do about this, but I already mentioned we may need to pick a subset of paths to keep, similarly to how we pick items for MCV. I mean, if we only saw a path once or twice, it's unlikely to be interesting enough to build stats for it. I haven't tried, but I'd bet most of the 850k paths might be ignored like this.
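The MCV-style pruning idea could look something like this (a hedged sketch over nested Python dicts rather than JSONB; the min_count cutoff is an invented knob, not something from the patch):

```python
from collections import Counter

def extract_paths(doc, prefix=()):
    """Yield every key path in a nested dict."""
    for key, val in doc.items():
        path = prefix + (key,)
        yield path
        if isinstance(val, dict):
            yield from extract_paths(val, path)

def frequent_paths(sample, min_count=2):
    """Count how often each path occurs across the sample and keep only
    paths seen at least min_count times; rare paths get no per-path stats."""
    counts = Counter()
    for doc in sample:
        counts.update(extract_paths(doc))
    return {p for p, c in counts.items() if c >= min_count}

sample = [{"a": 1, "b": 2}, {"a": 3}, {"a": 4, "c": {"d": 5}}]
print(frequent_paths(sample))  # only ("a",) occurs in 2 or more rows
```

Paths below the cutoff would fall back to some generic estimate, the same way values outside the MCV list do today.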

The other thing we might do is make the loops more efficient. For example, we might track which documents contain each path (by a small bitmap or something), so that in the loop we can skip rows that don't contain the path we're currently processing. Or something like that.
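The skip idea can be illustrated like this (a hypothetical Python sketch; plain row-id sets stand in for the bitmaps):

```python
def extract_paths(doc, prefix=()):
    """Yield every key path in a nested dict."""
    for key, val in doc.items():
        path = prefix + (key,)
        yield path
        if isinstance(val, dict):
            yield from extract_paths(val, path)

def build_path_rows(sample):
    """One pass over the sample: remember which rows contain each path."""
    path_rows = {}
    for i, doc in enumerate(sample):
        for path in extract_paths(doc):
            path_rows.setdefault(path, set()).add(i)
    return path_rows

sample = [{"a": 1}, {"b": 2}, {"a": 3, "b": 4}]
path_rows = build_path_rows(sample)
# the per-path loop now visits only matching rows instead of the whole sample
print(path_rows[("a",)])  # rows 0 and 2
```

For sparse paths (present in few documents) this cuts each per-path loop from the full sample size down to just the matching rows.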

Apart from this performance issue, I haven't found any crashes or other problems.


Well, I haven't seen any crashes either, but as I mentioned, for complex documents (2 levels, many distinct keys) the ANALYZE starts consuming a lot of memory and may get killed by the OOM killer. For example, if you generate documents like this

 ./json-generate.py 30000 2 8 1000 6 1000

and then run ANALYZE, that'll take ages and it very quickly gets into a situation like this (generated from gdb by calling MemoryContextStats on TopMemoryContext):

-------------------------------------------------------------------------
TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
  ...
  TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
    PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
      Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
        Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
          Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
            Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
          JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
      Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
...
Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
-------------------------------------------------------------------------


Yes, that's the backend using 3GB of memory, out of which 1.6GB is in the "analyze path table" context, 400MB in the "analyze" and 900MB in the "analyze column" contexts. I mean, that seems a bit excessive. And it grows over time, so after a while my laptop gives up and kills the backend.

I'm not sure if it's a memory leak (which would be fixable), or it's due to keeping stats for all the extracted paths. I mean, in this particular case we have 850k paths - even if stats are just 1kB per path, that's 850MB. This requires more investigation.

Thank you for the tests and investigation.

I have tried to reduce memory consumption and speed up row scanning:

1. The "JSON analyze path table" context contained a ~1KB JsonPathAnlStats structure per JSON path in the global hash table. I have moved JsonPathAnlStats to the stack of compute_json_stats(), and the hash table now consumes ~70 bytes per path.

2. I have fixed the copying of the resulting JSONB stats into the context, which reduced the size of the "Analyze Column" context.

3. I have optimized memory consumption of the single-pass algorithm by storing only value lists in the non-temporary context. That helped to execute the "2 64 64" test case in 30 seconds. Single-pass is a bit faster in non-TOASTed cases, and much faster in TOASTed ones, but it consumes much more memory and still goes to OOM in the cases with more than ~100k paths.

4. I have implemented per-path document lists/bitmaps, which really speed up the "2 8 1000" case. A list is converted into a bitmap when it becomes larger than the bitmap.

5. Also I have fixed some bugs.
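The list-to-bitmap switch in point 4 might look roughly like this (a Python sketch, not the patch's C code; the size accounting of 8 bytes per list entry versus one bit per sample row is an assumption for illustration):

```python
class RowSet:
    """Store row ids as a list, and switch to a bitmap once the list
    would occupy more memory than the bitmap."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.ids = []        # row ids, while small
        self.bitmap = None   # bytearray, once the list grows too large

    def add(self, row):
        if self.bitmap is not None:
            self.bitmap[row // 8] |= 1 << (row % 8)
            return
        self.ids.append(row)
        # assumed costs: ~8 bytes per list entry vs n_rows/8 bytes of bitmap
        if len(self.ids) * 8 > self.n_rows // 8:
            self.bitmap = bytearray((self.n_rows + 7) // 8)
            for r in self.ids:
                self.bitmap[r // 8] |= 1 << (r % 8)
            self.ids = None

    def __contains__(self, row):
        if self.bitmap is not None:
            return bool(self.bitmap[row // 8] & (1 << (row % 8)))
        return row in self.ids
```

For a 30000-row sample the bitmap is a fixed ~3.7KB, so paths present in many rows stay cheap, while sparse paths keep the compact list.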


You can find all these changes in commit form in our GitHub repository
on the branch jsonb_stats_20220310 [1].


Updated results of the test:

levels  keys  uniq keys    paths   master    multi-pass    single-pass
                                       ms      ms     MB      ms     MB
-----------------------------------------------------------------------
     1     1          1        2      153     122     10      82     14
     1     1       1000     1001      134     105     11      78     38
     1     8          8        9      157     384     19     328     32
     1     8       1000     1001      155     454     23     402     72
     1    64         64       65      129    2889     45    2386    155
     1    64       1000     1001      158    3990     94    1447    177
     2     1          1        3      237     147     10      91     16
     2     1       1000    30577      152     264     32     394    234
     2     8          8       72      245    1943     37    1692    139
     2     8       1000   852333      152    9175    678            OOM
     2    64         64     4161     1784  ~1 hour    53   30018   1750
     2    64       1000  1001001     4715  ~4 hours 1600            OOM

The last two multi-pass results are so slow because the JSONB values
become TOASTed.  For measuring master in these tests, I disabled the
WIDTH_THRESHOLD check, which skips TOASTed values > 1KB.


Next, I am going to try disabling all-paths collection and implementing
collection of the most common paths (and/or hashed paths, maybe).
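The hashed-paths variant could bound memory like this (a hedged sketch; the bucket count and hash function are invented for illustration): collapse each path into one of a fixed number of buckets and keep stats per bucket, accepting some collisions in exchange for a hard memory cap.

```python
import zlib

def path_bucket(path, n_buckets=1024):
    """Map a path (a tuple of keys) to one of n_buckets slots. Memory for
    per-bucket stats stays bounded no matter how many distinct paths the
    documents contain, at the cost of colliding paths sharing stats."""
    key = ".".join(path).encode()
    return zlib.crc32(key) % n_buckets

print(path_bucket(("a", "b", "c")) < 1024)  # always lands in a fixed slot
```

With 1024 buckets, even the 1001001-path test case above would keep a fixed-size stats table instead of one entry per path.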


[1] https://github.com/postgrespro/postgres/tree/jsonb_stats_20220310
-- 
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
