Re: Collecting statistics about contents of JSONB columns - Mailing list pgsql-hackers

From Mahendra Singh Thalor
Subject Re: Collecting statistics about contents of JSONB columns
Date
Msg-id CAKYtNAo1Pf0c-4t2UYobg73UR+fkkfBA8ZYHkeUGpV5bE2i7Sw@mail.gmail.com
In response to Re: Collecting statistics about contents of JSONB columns  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Collecting statistics about contents of JSONB columns  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On Fri, 4 Feb 2022 at 08:30, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 2/4/22 03:47, Tomas Vondra wrote:
> > ./json-generate.py 30000 2 8 1000 6 1000
>
> Sorry, this should be (different order of parameters):
>
> ./json-generate.py 30000 2 1000 8 6 1000
>

Thanks, Tomas, for this test case.

Hi Hackers,

For the last few days, I have been testing on top of these JSON optimizer patches, with help from Tomas Vondra in understanding the patches and the test results.
Thanks, Tomas, for your feedback and suggestions.

Below is the summary:
Point 1) ANALYZE takes a very long time for large documents:
For large JSON documents, ANALYZE takes far longer than on the current head. For reference, I am attaching the test file (./json-generate.py 30000 2 1000 8 6 1000).

Head:       analyze test; Time: 120.864 ms
With patch: analyze test; Time: more than 2 hours

ANALYZE takes so long because, with these patches, we first iterate over all sample rows (30000 in the above case) and store all the paths (around 850k here).
In a second pass, we take one path at a time and collect stats for that particular path by scanning all the sample rows again, repeating this for all 850k paths. In other words, we do 850k loops, and in each loop we extract the values for a single path.
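
To see the scale of the problem, here is a tiny back-of-the-envelope model (a standalone sketch using the numbers from this test case, not patch code):

#include <stdio.h>

int
main(void)
{
	long long	nrows = 30000;		/* sample rows in the test case */
	long long	npaths = 852689;	/* paths collected in the first pass */

	/* pass 1: each sample document is walked once */
	long long	pass1_walks = nrows;

	/* pass 2: every sample document is walked once per path */
	long long	pass2_walks = npaths * nrows;

	printf("pass 1 document walks: %lld\n", pass1_walks);
	printf("pass 2 document walks: %lld\n", pass2_walks);	/* ~25.6 billion */

	return 0;
}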

Point 2) memory consumption grows rapidly for large documents:
In the above test case, there are about 852k paths in total, and to keep stats for one path we allocate 1120 bytes.

Total paths: 852689 ~ 852k
Memory to keep stats for 1 path: 1120 bytes ~ 1 KB (sizeof(JsonValueStats) = 1120, allocated in "Analyze Column")
Total memory for all paths: 852689 * 1120 = 955011680 bytes ~ 955 MB

On top of that, each path allocates some additional memory while being analyzed, depending on value frequencies and other factors.

Keeping all entries (852k paths) in the hash takes around 1 GB of memory, which is also very large.
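
The arithmetic above can be reproduced directly (a standalone sketch; the ~1 GB hash figure is an observed total, not derived here):

#include <stdio.h>

int
main(void)
{
	long long	npaths = 852689;	/* distinct paths in the sample */
	long long	per_path = 1120;	/* sizeof(JsonValueStats) */

	/* base allocation for the per-path stats alone, before any extras */
	printf("%lld bytes\n", npaths * per_path);	/* 955011680 ~ 955 MB */

	return 0;
}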

Point 3) Review comment from Tomas Vondra:

+       oldcxt = MemoryContextSwitchTo(ctx->stats->anl_context);
+       pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+       MemoryContextSwitchTo(oldcxt);

Above should be:
+       oldcxt = MemoryContextSwitchTo(ctx->mcxt);
+       pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+       MemoryContextSwitchTo(oldcxt);

Response from Tomas Vondra:
The problem is that "anl_context" is actually "Analyze", i.e. the context for the whole ANALYZE command, covering all the columns. But we only want to keep those path stats while processing a particular column. At the end, after processing all paths from a column, we need to "build" the final stats for the column, and that result needs to go into the "Analyze" context. But all the partial results need to go into the "Analyze Column" context.
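
In code terms, the discipline Tomas describes looks roughly like this (a sketch; jsonAnalyzeBuildColumnStats is a hypothetical name for the final "build" step, and only the first hunk corresponds to the posted code):

/* per-path partial results: short-lived "Analyze Column" context */
oldcxt = MemoryContextSwitchTo(ctx->mcxt);
pstats->stats = jsonAnalyzeBuildPathStats(pstats);
MemoryContextSwitchTo(oldcxt);

/* ... after all paths of the column have been processed ... */

/* final column stats: long-lived "Analyze" context */
oldcxt = MemoryContextSwitchTo(ctx->stats->anl_context);
finalstats = jsonAnalyzeBuildColumnStats(ctx);	/* hypothetical */
MemoryContextSwitchTo(oldcxt);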

Point 4)

+/*
+ * jsonAnalyzeCollectPath
+ *             Extract a single path from JSON documents and collect its values.
+ */
+static void
+jsonAnalyzeCollectPath(JsonAnalyzeContext *ctx, Jsonb *jb, void *param)
+{
+       JsonPathAnlStats *pstats = (JsonPathAnlStats *) param;
+       JsonbValue      jbvtmp;
+       JsonbValue *jbv = JsonValueInitBinary(&jbvtmp, jb);
+       JsonPathEntry *path;
+       JsonPathEntry **entries;
+       int                     i;
+
+       entries = palloc(sizeof(*entries) * pstats->depth);
+
+       /* Build entry array in direct order */
+       for (path = &pstats->path, i = pstats->depth - 1;
+                path->parent && i >= 0;
+                path = path->parent, i--)
+               entries[i] = path;
+
+       jsonAnalyzeCollectSubpath(ctx, pstats, jbv, entries, 0);
+
+       pfree(entries);

Many times we end up calling palloc() with size zero (because pstats->depth == 0), and then entries points at invalid memory, so I think we should not call palloc() with size 0.

Fix:

+       if (pstats->depth)
+               entries = palloc(sizeof(*entries) * pstats->depth);
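
A slightly fuller version of the fix (a sketch; everything beyond the posted hunk is my assumption) would also initialize entries and guard the pfree(), since with the guard alone entries stays uninitialized when pstats->depth == 0:

+       JsonPathEntry **entries = NULL;
...
+       /* avoid palloc(0) when the path is the root (depth == 0) */
+       if (pstats->depth)
+               entries = palloc(sizeof(*entries) * pstats->depth);
...
+       if (entries)
+               pfree(entries);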



Given these points, I think we should rethink the design of collecting stats for all paths.

We could set limits (like MCV) on the number of paths, or accept an explicit path so that stats are collected only for that particular path, or pass a subset of the JSON values.

In the above case, there are 852k paths in total, but if we collect stats for only the 1000 most common paths, we minimize both time and memory, and we might even keep at least frequencies for the non-analyzed paths (see the sketch below).
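
As a rough illustration of the MCV-style cap (all names below are made up for this sketch; this is not patch code):

#define JSON_ANALYZE_MAX_PATHS	1000	/* hypothetical knob */

/* after the first pass has counted per-path frequencies */
sort_paths_by_frequency_desc(paths, npaths);

for (i = 0; i < npaths; i++)
{
	if (i < JSON_ANALYZE_MAX_PATHS)
		collect_full_stats_for_path(ctx, paths[i]);	/* scans the sample rows */
	else
		keep_frequency_only(paths[i]);	/* no further scans, just the count */
}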

Next, I will take the latest patches from Nikita's last email and do more tests.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
