Re: Collecting statistics about contents of JSONB columns - Mailing list pgsql-hackers

From Mahendra Singh Thalor
Subject Re: Collecting statistics about contents of JSONB columns
Date
Msg-id CAKYtNAq=Qcy2SGLQKVcaQ8UXu0dVxQUDdTQGyMyZGTPakvGpNg@mail.gmail.com
In response to Re: Collecting statistics about contents of JSONB columns  (Nikita Glukhov <n.gluhov@postgrespro.ru>)
Responses Re: Collecting statistics about contents of JSONB columns
List pgsql-hackers
On Fri, 11 Mar 2022 at 04:29, Nikita Glukhov <n.gluhov@postgrespro.ru> wrote:
>
>
> On 04.02.2022 05:47, Tomas Vondra wrote:
>
> On 1/25/22 17:56, Mahendra Singh Thalor wrote:
> >
>
> ...
>
> For the last few days, I have been trying to understand these patches, and based on Tomas's suggestion, I have been doing some performance tests.
>
> With the attached .SQL file, I can see that analyze is taking more time with these patches.
>
> I haven't found the root cause of this, but I suspect the extra time is due to looping over all the paths.
> In my test data, there are 951 distinct paths in total. While doing the analysis, we first scan all the sample rows (30,000) and collect all the distinct paths (here 951), and after that, for every single path, we loop over all the sample rows again to collect stats for that particular path. I suspect these loops are what takes the time.
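(Just to illustrate my understanding of the above, a rough Python sketch of the current behaviour; extract_paths() and collect_stats() are hypothetical helpers standing in for the patch's C code:)

    def analyze_json_column(sample_docs, extract_paths, collect_stats):
        all_paths = set()
        for doc in sample_docs:              # pass 1: find every distinct path
            all_paths |= set(extract_paths(doc))
        for path in all_paths:               # pass 2: one full scan per path
            for doc in sample_docs:          # visits all 30,000 rows each time
                collect_stats(path, doc)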
>
> Thanks, I've been doing some performance tests too, and you're right it takes quite a bit of time.
>
>
> That is absolutely not surprising; I have warned about poor performance
> in cases with a large number of paths.
>
>
> I agree the slowness is largely due to extracting all paths and then processing them one by one - which means we have to loop over the tuples over and over. In this case there's about 850k distinct paths extracted, so we do ~850k loops over 30k tuples. That's gotta take time.
>
> I don't know what exactly to do about this, but I already mentioned we may need to pick a subset of paths to keep, similarly to how we pick items for MCV. I mean, if we only saw a path once or twice, it's unlikely to be interesting enough to build stats for it. I haven't tried, but I'd bet most of the 850k paths might be ignored like this.
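A rough sketch of this MCV-like pruning idea, in Python just for illustration (min_count and extract_paths are made-up names, not from the patch):

    from collections import Counter

    def frequent_paths(sample_docs, extract_paths, min_count=3):
        counts = Counter()
        for doc in sample_docs:
            counts.update(extract_paths(doc))
        # keep only paths seen often enough to be worth building stats for,
        # similar to how only common enough values make it into an MCV list
        return {path for path, n in counts.items() if n >= min_count}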
>
> The other thing we might do is make the loops more efficient. For example, we might track which documents contain each path (by a small bitmap or something), so that in the loop we can skip rows that don't contain the path we're currently processing. Or something like that.
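And a sketch of the skip-rows idea, under the same assumptions (illustrative Python only, not the patch code):

    from collections import defaultdict

    def build_path_index(sample_docs, extract_paths):
        rows_with_path = defaultdict(list)   # path -> rows containing it
        for rowno, doc in enumerate(sample_docs):
            for path in extract_paths(doc):
                rows_with_path[path].append(rowno)
        return rows_with_path

    def per_path_pass(sample_docs, rows_with_path, collect_stats):
        for path, rownos in rows_with_path.items():
            for rowno in rownos:             # only rows that have this path
                collect_stats(path, sample_docs[rowno])

For rare paths this reduces the inner loop from 30,000 rows to a handful.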
>
> Apart from this performance issue, I haven't found any crashes or issues.
>
>
> Well, I haven't seen any crashes either, but as I mentioned, for complex documents (2 levels, many distinct keys) ANALYZE starts consuming a lot of memory and may get killed by the OOM killer. For example, if you generate documents like this
>
>  ./json-generate.py 30000 2 8 1000 6 1000
>
> and then run ANALYZE, that'll take ages and it very quickly gets into a situation like this (generated from gdb by calling MemoryContextStats on TopMemoryContext):
>
> -------------------------------------------------------------------------
> TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
>   ...
>   TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
>     PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
>       Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
>         Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
>           Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
>             Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
>           JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
>       Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
> ...
> Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
> -------------------------------------------------------------------------
>
>
> Yes, that's the backend using 3GB of memory, out of which 1.6GB is in the "analyze path table" context, 400MB in the "analyze" context and 900MB in the "analyze column" context. I mean, that seems a bit excessive. And it grows over time, so after a while my laptop gives up and kills the backend.
>
> I'm not sure if it's a memory leak (which would be fixable), or if it's due to keeping stats for all the extracted paths. I mean, in this particular case we have 850k paths - even if the stats are just 1kB per path, that's 850MB. This requires more investigation.
>
> Thank you for the tests and investigation.
>
> I have tried to reduce memory consumption and speed up row scanning:
>
> 1. "JSON analyze path table" context contained ~1KB JsonPathAnlStats
>    structure per JSON path in the global hash table.  I have moved
>    JsonPathAnlStats to the stack of compute_json_stats(), and hash
>    table now consumes ~70 bytes per path.
>
> 2. I have fixed copying of resulting JSONB stats into context, which
>    reduced the size of "Analyze Column" context.
>
> 3. I have optimized memory consumption of the single-pass algorithm by
>    storing only value lists in the non-temporary context.  That helped
>    to execute the "2 64 64" test case in 30 seconds.  Single-pass is a
>    bit faster in non-TOASTed cases, and much faster in TOASTed ones.
>    But it consumes much more memory and still runs into OOM in the
>    cases with more than ~100k paths.
>
> 4. I have implemented per-path document lists/bitmaps, which really
>    speed up the case "2 8 1000".  A list is converted into a bitmap
>    when it becomes larger than the bitmap would be (see the sketch
>    after this list).
>
> 5. Also I have fixed some bugs.
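My reading of the list/bitmap switchover in point 4, as an illustrative Python sketch (the ~4 bytes per list entry is my assumption, not a number taken from the patch):

    class RowSet:
        # sample rows (by row number) that contain a given path
        def __init__(self, nrows):
            self.nrows = nrows
            self.rows = []        # small case: explicit list of row numbers
            self.bitmap = None    # large case: one bit per sample row

        def add(self, rowno):
            if self.bitmap is not None:
                self.bitmap[rowno // 8] |= 1 << (rowno % 8)
                return
            self.rows.append(rowno)
            # switch once the list would be bigger than the bitmap
            if len(self.rows) * 4 > (self.nrows + 7) // 8:
                self.bitmap = bytearray((self.nrows + 7) // 8)
                for r in self.rows:
                    self.bitmap[r // 8] |= 1 << (r % 8)
                self.rows = None

With 30,000 sample rows the bitmap caps the per-path overhead at about 3.75kB, so only paths that occur in many documents pay that cost.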
>
>
> You can find all these changes in commit form in our GitHub repository,
> on the branch jsonb_stats_20220310 [1].
>
>
> Updated results of the test (times in milliseconds, memory consumption in MB; the master column shows time only):
>
> levels keys  uniq keys  paths   master    multi-pass    single-pass
>                                               ms  MB       ms    MB
> -------------------------------------------------------------------
>      1    1         1       2      153      122   10       82    14
>      1    1      1000    1001      134      105   11       78    38
>      1    8         8       9      157      384   19      328    32
>      1    8      1000    1001      155      454   23      402    72
>      1   64        64      65      129     2889   45     2386   155
>      1   64      1000    1001      158     3990   94     1447   177
>      2    1         1       3      237      147   10       91    16
>      2    1      1000   30577      152      264   32      394   234
>      2    8         8      72      245     1943   37     1692   139
>      2    8      1000  852333      152     9175  678         OOM
>      2   64        64    4161     1784 ~1 hour    53    30018  1750
>      2   64      1000 1001001     4715 ~4 hours 1600         OOM
>
> The last two multi-pass results are too slow because the JSONB values
> become TOASTed.  For measuring master in these tests, I disabled the
> WIDTH_THRESHOLD check, which skips TOASTed values > 1KB.
>
>
> Next, I am going to try to disable all-paths collection and implement
> collection of the most common paths (and/or hashed paths, maybe).

Hi Nikita,
Tomas and I discussed the design for disabling all-paths collection (collecting stats for only some paths). Below are some thoughts/doubts/questions.

Point 1) Can you please elaborate on how you plan to implement this (collecting stats for only some paths)?
Point 2) Since collecting JSON stats takes extra time, should we add an on/off switch for collecting them?
Point 3) We also thought of another design: let the user specify an explicit path so stats are collected only for that path, or pass a subset of the JSON values. But that would likely require a lot of code changes (new syntax and so on), so we think it would be better to collect stats only for some common paths (limited by count or some other criterion).

Thoughts?

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
