Thread: Collecting statistics about contents of JSONB columns

Collecting statistics about contents of JSONB columns

From
Tomas Vondra
Date:
Hi,

One of the complaints I sometimes hear from users and customers using 
Postgres to store JSON documents (as JSONB type, of course) is that the 
selectivity estimates are often pretty poor.

Currently we only really have MCV lists and histograms of whole documents, 
and we can deduce some stats from that. But that is somewhat bogus 
because there are only ~100 documents in either the MCV list or histogram 
(with the default statistics target). Moreover, we discard all "oversized" 
values (over 1kB) before even calculating those stats, which makes it 
even less representative.

A couple of weeks ago I started playing with this, and I experimented with 
improving extended statistics in this direction. After a while I noticed 
a forgotten development branch from 2016 which tried to do this by 
adding a custom typanalyze function, which seemed like a more natural 
idea (because it's really statistics for a single column).

But then I went to pgconf NYC in early December, and I spoke to Oleg 
about various JSON-related things, and he mentioned they had been working 
on this topic some time ago too, but did not have time to pursue it. So 
he pointed me to a branch [1] developed by Nikita Glukhov.

I like Nikita's branch because it solved a couple of architectural issues 
I had been aware of but had only solved in a rather hackish way myself.

I had a discussion with Nikita about his approach and what we can do to 
move it forward. He's focusing on other JSON stuff, but he's OK with me 
taking over and moving it forward. So here we go ...

Nikita rebased his branch recently, and I've kept improving it in various 
ways (mostly a lot of comments and docs, plus some minor fixes and tweaks). 
I've pushed my version with a couple of extra commits in [2], but you can 
ignore that unless you want to see what I added/changed.

Attached are a couple of patches adding the main part of the feature. 
There are a couple more commits in the github repositories, adding more 
advanced features - I'll briefly explain those later, but I'm not 
including them here because they are optional and it'd be distracting.

There are 6 patches in the series, but the magic mostly happens in parts 
0001 and 0006. The other parts are mostly just adding infrastructure, 
which may be a sizeable amount of code, but the changes are fairly 
simple and obvious. So let's focus on 0001 and 0006.

To add JSON statistics we need to do two basic things - we need to build 
the statistics and then we need to allow using them while estimating 
conditions.


1) building stats

Let's talk about building the stats first. The patch does one of the 
things I experimented with - 0006 adds a jsonb_typanalyze function, and 
it associates it with the data type. The function extracts paths and 
values from the JSONB document, builds the statistics, and then stores 
the result in pg_statistic as a new stakind.
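
For context, the association is visible in pg_type - a quick way to check 
which typanalyze function (if any) a type uses. This is just a sketch for 
illustration; with the 0006 patch applied I'd expect the jsonb row to 
point at the new function:

    -- which typanalyze function is associated with the jsonb type
    -- (with 0006 this should show jsonb_typanalyze instead of "-")
    SELECT typname, typanalyze FROM pg_type WHERE typname = 'jsonb';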

I've been planning to store the stats in pg_statistic too, but I've been 
considering using a custom data type. The patch does something far more 
elegant - it simply uses stavalues to store an array of JSONB documents, 
each describing stats for one path extracted from the sampled documents.

One (very simple) element of the array might look like this:

   {"freq": 1,
    "json": {
      "mcv": {
         "values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
         "numbers": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
      "width": 19,
      "distinct": 10,
      "nullfrac": 0,
      "correlation": 0.10449},
    "path": "$.\"a\"",
    "freq_null": 0, "freq_array": 0, "freq_object": 0,
    "freq_string": 0, "freq_boolean": 0, "freq_numeric": 0}

In this case there's only an MCV list (represented by two arrays, just 
like in pg_statistic), but there might be another part with a histogram. 
There are also the other columns we'd expect to see in pg_statistic.

In principle, we need pg_statistic-like stats for each path we extract 
from the JSON documents and decide is interesting enough for estimation. 
There are probably other ways to serialize/represent this, but I find 
using JSONB for this pretty convenient because we need to deal with a mix 
of data types (for the same path), and other JSON-specific stuff. Storing 
that in Postgres arrays would be problematic.
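
Just to show where this ends up, here's a rough sketch of how one might 
peek at the per-path array after ANALYZE (table/column names are made up, 
and which stakind/stavalues slot the new stats land in depends on the 
patch, so take this as a sketch only):

    -- hypothetical table "docs" with a JSONB column "data"
    SELECT s.stakind1, s.stavalues1
      FROM pg_statistic s
      JOIN pg_class c ON c.oid = s.starelid
      JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = s.staattnum
     WHERE c.relname = 'docs' AND a.attname = 'data';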

I'm sure there are plenty of open questions - for example I think we'll 
need some logic to decide which paths to keep, otherwise the statistics 
can get quite big if we're dealing with large / variable documents. We're 
already doing something similar for MCV lists.

One of Nikita's patches not included in this thread allows "selective" 
statistics, where you can define in advance a "filter" restricting which 
parts are considered interesting by ANALYZE. That's interesting, but I 
think we need some simple MCV-like heuristics first anyway.

Another open question is how deep the stats should be. Imagine documents 
like this:

   {"a" : {"b" : {"c" : {"d" : ...}}}}

The current patch builds stats for all possible paths:

  "a"
  "a.b"
  "a.b.c"
  "a.b.c.d"

and of course many of the paths will often have JSONB documents as 
values, not simple scalar values. I wonder if we should limit the depth 
somehow, and maybe build stats only for scalar values.
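
Just to make the path explosion concrete, here's a plain SQL sketch 
(unrelated to how the patch actually extracts paths during ANALYZE) that 
enumerates the object paths of the document above, the same way they're 
listed here:

    WITH RECURSIVE paths(path, value) AS (
        SELECT key, value
          FROM jsonb_each('{"a": {"b": {"c": {"d": 1}}}}'::jsonb)
        UNION ALL
        SELECT p.path || '.' || e.key, e.value
          FROM paths p
          CROSS JOIN LATERAL jsonb_each(
              CASE WHEN jsonb_typeof(p.value) = 'object'
                   THEN p.value ELSE '{}'::jsonb END) e
    )
    SELECT path, jsonb_typeof(value) AS value_type
      FROM paths ORDER BY path;

Only the last path ("a.b.c.d") has a scalar value, the intermediate ones 
are all objects - which is exactly the question about depth / non-scalar 
values above.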


2) applying the statistics

One of the problems is how to actually use the statistics. For @> 
operator it's simple enough, because it's (jsonb @> jsonb) so we have 
direct access to the stats. But often the conditions look like this:

     jsonb_column ->> 'key' = 'value'

so the condition is actually on an expression, not on the JSONB column 
directly. My solutions were pretty ugly hacks, but Nikita had a neat 
idea - we can define a custom procedure for each operator, which is 
responsible for "calculating" the statistics for the expression.

So in this case "->>" will have such "oprstat" procedure, which fetches 
stats for the JSONB column, extracts stats for the "key" path. And then 
we can use that for estimation of the (text = text) condition.
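
A trivial example of the kind of query this should help with - nothing 
patch-specific in the SQL itself, the difference is only in the row 
estimate the planner produces (currently such an expression gets a generic 
default estimate):

    CREATE TABLE docs (data jsonb);
    INSERT INTO docs
    SELECT jsonb_build_object('key', (i % 10)::text)
      FROM generate_series(1, 100000) i;
    ANALYZE docs;

    -- without per-path stats this gets a generic estimate; with the patch
    -- the estimate should come from the MCV list built for the "key" path
    EXPLAIN SELECT * FROM docs WHERE data ->> 'key' = '3';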

This is what 0001 does, pretty much. We simply look for expression stats 
provided by an index, extended statistics, and then - if oprstat is 
defined for the operator - we try to derive the stats.

This opens other interesting opportunities for the future - one of the 
parts adds oprstat for basic arithmetic operators, which allows deducing 
statistics for expressions like (a+10) from statistics on column (a).
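
For example (this refers to one of the optional commits not included in 
the attached series, so it's only a sketch of the idea):

    CREATE TABLE t (a int);
    INSERT INTO t SELECT i FROM generate_series(1, 100000) i;
    ANALYZE t;

    -- without expression stats the inequality on (a + 10) gets a default
    -- estimate; with an oprstat for "+" the stats for column "a" could be
    -- shifted by 10 and used directly
    EXPLAIN SELECT * FROM t WHERE a + 10 < 100;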

Which seems like a neat feature on its own, but it also interacts with 
the extended statistics in somewhat non-obvious ways (especially when 
estimating GROUP BY cardinalities).

Of course, there's a limit of what we can reasonably estimate - for 
example, there may be statistical dependencies between paths, and this 
patch does not even attempt to deal with that. In a way, this is similar 
to correlation between columns, except that here we have a dynamic set 
of columns, which makes it much much harder. We'd need something like 
extended stats on steroids, pretty much.


I'm sure I've forgotten various important bits - many of them are 
mentioned or explained in comments, but I'm sure others are not. And I'd 
bet there are things I forgot about entirely or got wrong. So feel free 
to ask.


In any case, I think this seems like a good first step to improve our 
estimates for JSONB columns.

regards


[1] https://github.com/postgrespro/postgres/tree/jsonb_stats

[2] https://github.com/tvondra/postgres/tree/jsonb_stats_rework

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Collecting statistics about contents of JSONB columns

From
Zhihong Yu
Date:


Hi,
For patch 1:

+   List       *statisticsName = NIL;   /* optional stats estimat. procedure */

I think if the variable is named estimatorName (or something similar), it would be easier for people to grasp its purpose.

+       /* XXX perhaps full "statistics" wording would be better */
+       else if (strcmp(defel->defname, "stats") == 0)

I would recommend (stats sounds too general):

+       else if (strcmp(defel->defname, "statsestimator") == 0)

+       statisticsOid = ValidateStatisticsEstimator(statisticsName);

statisticsOid -> statsEstimatorOid

For get_oprstat():

+   }
+   else
+       return (RegProcedure) InvalidOid;

The else keyword is not needed (considering the return statement in the if block).

For patch 06:

+   /* FIXME Could be before the memset, I guess? Checking vardata->statsTuple. */
+   if (!data->statsTuple)
+       return false;

I would agree the check can be lifted above the memset call.

+ * XXX This does not really extract any stats, it merely allocates the struct?
+ */
+static JsonPathStats
+jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)

As the comment says, I think allocJsonPathStats() would be a better name for the func.

+ * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
+ * jsonPathStatsGetLengthStats does?

I think a `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check should be added to jsonPathStatsGetArrayLengthStats().

To be continued ...

Re: Collecting statistics about contents of JSONB columns

From
Zhihong Yu
Date:


Hi,

+static JsonPathStats
+jsonStatsFindPathStats(JsonStats jsdata, char *path, int pathlen)

Stats appears twice in the method name. I think findJsonPathStats() should suffice.
It should check `if (jsdata->nullfrac >= 1.0)` as jsonStatsGetPathStatsStr does.

+JsonPathStats
+jsonStatsGetPathStatsStr(JsonStats jsdata, const char *subpath, int subpathlen)

This func can be static, right?
I think findJsonPathStatsWithPrefix() would be a better name for the func.

+ * XXX Doesn't this need ecape_json too?
+ */
+static void
+jsonPathAppendEntryWithLen(StringInfo path, const char *entry, int len)
+{
+   char *tmpentry = pnstrdup(entry, len);
+   jsonPathAppendEntry(path, tmpentry);

escape_json() is called within jsonPathAppendEntry(), so the XXX comment can be dropped.

+jsonPathStatsGetArrayIndexSelectivity(JsonPathStats pstats, int index)

It seems getJsonSelectivityWithArrayIndex() would be a better name.

+       sel = scalarineqsel(NULL, operator,
+                           operator == JsonbGtOperator ||
+                           operator == JsonbGeOperator,
+                           operator == JsonbLeOperator ||
+                           operator == JsonbGeOperator,

Looking at the comment for scalarineqsel():

 *  scalarineqsel       - Selectivity of "<", "<=", ">", ">=" for scalars.
 *
 * This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
 * The isgt and iseq flags distinguish which of the four cases apply.

It seems JsonbLtOperator doesn't appear in the call; can I ask why?

Cheers

Re: Collecting statistics about contents of JSONB columns

From
Thomas Munro
Date:
On Sat, Jan 1, 2022 at 11:07 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> 0006-Add-jsonb-statistics-20211230.patch

Hi Tomas,

-CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
+CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)

https://cirrus-ci.com/task/6405547984420864

It looks like there is a Unicode BOM sequence in there, which is
upsetting our Windows testing but not the 3 Unixes (not sure why).
Probably added by an editor.



Re: Collecting statistics about contents of JSONB columns

From
Simon Riggs
Date:
On Fri, 31 Dec 2021 at 22:07, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> The patch does something far more
> elegant - it simply uses stavalues to store an array of JSONB documents,
> each describing stats for one path extracted from the sampled documents.

Sounds good.

> I'm sure there's plenty open questions - for example I think we'll need
> some logic to decide which paths to keep, otherwise the statistics can
> get quite big, if we're dealing with large / variable documents. We're
> already doing something similar for MCV lists.
>
> One of Nikita's patches not included in this thread allow "selective"
> statistics, where you can define in advance a "filter" restricting which
> parts are considered interesting by ANALYZE. That's interesting, but I
> think we need some simple MCV-like heuristics first anyway.
>
> Another open question is how deep the stats should be. Imagine documents
> like this:
>
>    {"a" : {"b" : {"c" : {"d" : ...}}}}
>
> The current patch build stats for all possible paths:
>
>   "a"
>   "a.b"
>   "a.b.c"
>   "a.b.c.d"
>
> and of course many of the paths will often have JSONB documents as
> values, not simple scalar values. I wonder if we should limit the depth
> somehow, and maybe build stats only for scalar values.

The user interface for designing filters sounds hard, so I'd hope we
can ignore that for now.

--
Simon Riggs                http://www.EnterpriseDB.com/



Re: Collecting statistics about contents of JSONB columns

From
Tomas Vondra
Date:
On 1/1/22 16:33, Zhihong Yu wrote:
> Hi,
> For patch 1:
> 
> +   List       *statisticsName = NIL;   /* optional stats estimat. 
> procedure */
> 
> I think if the variable is named estimatorName (or something similar), 
> it would be easier for people to grasp its purpose.
> 

I agree "statisticsName" might be too generic or confusing, but I'm not 
sure "estimator" is an improvement. Because this is not an "estimator" 
(in the sense of estimating selectivity). It "transforms" statistics to 
match the expression.

> +       /* XXX perhaps full "statistics" wording would be better */
> +       else if (strcmp(defel->defname, "stats") == 0)
> 
> I would recommend (stats sounds too general):
> 
> +       else if (strcmp(defel->defname, "statsestimator") == 0)
> 
> +       statisticsOid = ValidateStatisticsEstimator(statisticsName);
> 
> statisticsOid -> statsEstimatorOid
> 

Same issue with the "estimator" bit :-(

> For get_oprstat():
> 
> +   }
> +   else
> +       return (RegProcedure) InvalidOid;
> 
> keyword else is not needed (considering the return statement in if block).
> 

True, but this is how the other get_ functions in lsyscache.c do it. 
Like get_oprjoin().

> For patch 06:
> 
> +   /* FIXME Could be before the memset, I guess? Checking 
> vardata->statsTuple. */
> +   if (!data->statsTuple)
> +       return false;
> 
> I would agree the check can be lifted above the memset call.
> 

OK.

> + * XXX This does not really extract any stats, it merely allocates the 
> struct?
> + */
> +static JsonPathStats
> +jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)
> 
> As comments says, I think allocJsonPathStats() would be better name for 
> the func.
> 
> + * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
> + * jsonPathStatsGetLengthStats does?
> 
> I think `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check 
> should be added for jsonPathStatsGetArrayLengthStats().
> 
> To be continued ...

OK. I'll see if Nikita has some ideas about the naming changes.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Collecting statistics about contents of JSONB columns

From
Tomas Vondra
Date:
On 1/1/22 22:16, Zhihong Yu wrote:
> Hi,
> 
> +static JsonPathStats
> +jsonStatsFindPathStats(JsonStats jsdata, char *path, int pathlen)
> 
> Stats appears twice in the method name. I think findJsonPathStats() 
> should suffice.
> It should check `if (jsdata->nullfrac >= 1.0)` 
> as jsonStatsGetPathStatsStr does.
> 
> +JsonPathStats
> +jsonStatsGetPathStatsStr(JsonStats jsdata, const char *subpath, int 
> subpathlen)
> 
> This func can be static, right ?
> I think findJsonPathStatsWithPrefix() would be a better name for the func.
> 
> + * XXX Doesn't this need ecape_json too?
> + */
> +static void
> +jsonPathAppendEntryWithLen(StringInfo path, const char *entry, int len)
> +{
> +   char *tmpentry = pnstrdup(entry, len);
> +   jsonPathAppendEntry(path, tmpentry);
> 
> ecape_json() is called within jsonPathAppendEntry(). The XXX comment can 
> be dropped.
> 
> +jsonPathStatsGetArrayIndexSelectivity(JsonPathStats pstats, int index)
> 
> It seems getJsonSelectivityWithArrayIndex() would be a better name.
> 

Thanks. I'll think about the naming changes.

> +       sel = scalarineqsel(NULL, operator,
> +                           operator == JsonbGtOperator ||
> +                           operator == JsonbGeOperator,
> +                           operator == JsonbLeOperator ||
> +                           operator == JsonbGeOperator,
> 
> Looking at the comment for scalarineqsel():
> 
>   *  scalarineqsel       - Selectivity of "<", "<=", ">", ">=" for scalars.
>   *
>   * This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
>   * The isgt and iseq flags distinguish which of the four cases apply.
> 
> It seems JsonbLtOperator doesn't appear in the call, can I ask why ?
> 

Because the scalarineqsel signature is this

     scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
                   Oid collation,
                   VariableStatData *vardata, Datum constval,
                   Oid consttype)

so

     /* is it greater or greater-or-equal? */
     isgt = operator == JsonbGtOperator ||
            operator == JsonbGeOperator;

     /* does it include equality, i.e. is it <= or >=? */
     iseq = operator == JsonbLeOperator ||
            operator == JsonbGeOperator;

So I think this is correct. A comment explaining this would be nice.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Collecting statistics about contents of JSONB columns

From
Tomas Vondra
Date:
On 1/5/22 21:13, Thomas Munro wrote:
> On Sat, Jan 1, 2022 at 11:07 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> 0006-Add-jsonb-statistics-20211230.patch
> 
> Hi Tomas,
> 
> -CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
> +CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
> 
> https://cirrus-ci.com/task/6405547984420864
> 
> It looks like there is a Unicode BOM sequence in there, which is
> upsetting our Windows testing but not the 3 Unixes (not sure why).
> Probably added by an editor.
> 

Thanks, fixed along with some whitespace issues.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Collecting statistics about contents of JSONB columns

From
Tomas Vondra
Date:

On 1/5/22 21:22, Simon Riggs wrote:
> On Fri, 31 Dec 2021 at 22:07, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>> The patch does something far more
>> elegant - it simply uses stavalues to store an array of JSONB documents,
>> each describing stats for one path extracted from the sampled documents.
> 
> Sounds good.
> 
>> I'm sure there's plenty open questions - for example I think we'll need
>> some logic to decide which paths to keep, otherwise the statistics can
>> get quite big, if we're dealing with large / variable documents. We're
>> already doing something similar for MCV lists.
>>
>> One of Nikita's patches not included in this thread allow "selective"
>> statistics, where you can define in advance a "filter" restricting which
>> parts are considered interesting by ANALYZE. That's interesting, but I
>> think we need some simple MCV-like heuristics first anyway.
>>
>> Another open question is how deep the stats should be. Imagine documents
>> like this:
>>
>>     {"a" : {"b" : {"c" : {"d" : ...}}}}
>>
>> The current patch build stats for all possible paths:
>>
>>    "a"
>>    "a.b"
>>    "a.b.c"
>>    "a.b.c.d"
>>
>> and of course many of the paths will often have JSONB documents as
>> values, not simple scalar values. I wonder if we should limit the depth
>> somehow, and maybe build stats only for scalar values.
> 
> The user interface for designing filters sounds hard, so I'd hope we
> can ignore that for now.
> 

Not sure I understand. I wasn't suggesting any user-defined filtering, 
but something done by default, similarly to what we do for regular MCV 
lists, based on frequency. We'd include frequent paths while excluding 
rare ones.

So no need for a user interface.

That might not work for documents with stable schema and a lot of 
top-level paths, because all the top-level paths have 1.0 frequency. But 
for documents with dynamic schema (different documents having different 
schemas/paths) it might help.

Similarly for the non-scalar values - I don't think we can really keep 
regular statistics on such values (for the same reason why it's not 
enough for whole JSONB columns), so why build/store them at all.


Nikita did implement a way to specify custom filters using jsonpath, but 
I did not include that in this patch series. Questions regarding the 
interface were one of the reasons.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Collecting statistics about contents of JSONB columns

From
Nikita Glukhov
Date:
Hi!

I am glad that you found my very old patch interesting and started to
work on it.  We failed to post it in 2016 mostly because we were not
satisfied with JSONB storage.  Also we decided to wait for completion
of work on extended statistics as we thought that it could help us.
But in early 2017 we switched to SQL/JSON and forgot about this patch.


I think a custom datatype is necessary for better performance. With a
plain JSONB we need to do a lot of work for extraction of path stats:
  - iterate through MCV/Histogram JSONB arrays
  - cast numeric values to float, strings to text etc.
  - build PG arrays from extracted datums
  - form pg_statistic tuple.

With a custom data type we could store the pg_statistic tuple unmodified
and use it without any overhead.  But then we would need to modify
VariableStatData and several functions a bit to pass additional nullfrac
corrections.

Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
would be enough.



Also there is an idea to store per-path stats separately in pg_statistic_ext
rows using expressions like (jb #> '{foo,bar}') as stxexprs.  This could
also help the user to select which paths to analyze, simply by using some
sort of CREATE STATISTICS (a sketch follows below).  But it is really
unclear how to:
  * store pg_statistic_ext rows from typanalyze
  * form path expressions for array elements (maybe add a new jsonpath
    operator)
  * match various -> operator chains to stxexprs
  * jsonpath-like languages need a simple API for searching by stxexprs



Per-path statistics should only be collected for scalars.  This can be
enabled by the flag JsonAnalyzeContext.scalarsOnly.  But there is a
problem: computed MCVs and histograms will be broken and we will not be
able to use them for queries like (jb > const) in the general case.  Also
we will not be able to use them internally in scalarineqsel() and
var_eq_const() (see jsonSelectivity()).  Instead, we will have to implement
our own estimator functions for JSONB comparison operators that will
correctly use our hacked MCVs and histograms (and maybe not all cases will
be supported; for example, comparison to scalars only).

It's possible to replace objects and arrays with empty ones when
scalarsOnly is set, to keep correct frequencies of non-scalars.
But there is an old bug in JSONB comparison: empty arrays are placed
before other values in the JSONB sort order, although they should go
after all scalars.  So we also need to distinguish empty and non-empty
arrays here.
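
To illustrate the quirk (this is the behaviour described above, easy to
check with a couple of comparisons):

    -- an empty top-level array sorts before everything else, even null
    SELECT '[]'::jsonb < 'null'::jsonb;   -- true
    -- although arrays are normally supposed to sort after all scalars
    SELECT '[]'::jsonb < '1'::jsonb;      -- true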



I tried to fix most of the places marked as XXX and FIXME; the new
version of the patches is attached.  There are a lot of changes, you
can see them in a step-by-step form in the corresponding branch
jsonb_stats_20220122 in our GitHub repo [1].


Below is the explanation of fixed XXXs:

Patch 0001

src/backend/commands/operatorcmds.c:

> XXX Maybe not the right name, because "estimator" implies we're 
> calculating selectivity. But we're actually deriving statistics for 
> an expression.

Renamed to "Derivator".

> XXX perhaps full "statistics" wording would be better

Renamed to "statistics".

src/backend/utils/adt/selfuncs.c:

> examine_operator_expression():

> XXX Not sure why this returns bool - we ignore the return value
> anyway. We might as well return the calculated vardata (or NULL).

Oprstat was changed to return void.

> XXX Not sure what to do about recursion - there can be another OpExpr
> in one of the arguments, and we should call this recursively from the
> oprstat procedure. But that's not possible, because it's marked as
> static.

Oprstat call chain: get_restriction_variable() => examine_variable() =>
examine_operator_expression().

The main thing here is that OpExpr with oprstat acts like ordinary Var.

> examine_variable():
> XXX Shouldn't this put more restrictions on the OpExpr? E.g. that
> one of the arguments has to be a Const or something?

This is simply a responsibility of oprstat.


Patch 0004

> XXX Not sure we want to add these functions to jsonb.h, which is the
> public API. Maybe it belongs rather to jsonb_typanalyze.c or
> elsewhere, closer to how it's used?

Maybe it needs to be moved to a new file, jsonb_utils.h.  I think these
functions become very useful when we start to build JSONBs with a
predefined structure.


> pushJsonbValue():
> XXX I'm not quite sure why we actually do this? Why do we need to
> change how JsonbValue is converted to Jsonb for the statistics patch?

Scalars in JSONB are encoded as one-element pseudo-array containers.
So when we insert binary JsonbValues that were initialized directly from
JSONB datums into another non-empty JSONB in pushJsonbValue(), all values
except scalars are inserted as expected, but scalars become [scalar].  So
we need to extract the scalar values either in the caller functions or in
pushJsonbValue().  I think auto-extraction in pushJsonbValue() is better.
This case simply was not used before the introduction of JSONB stats.


Patch 0006

src/backend/utils/adt/jsonb_selfuncs.c

> jsonPathStatsGetSpecialStats()
> XXX This does not really extract any stats, it merely allocates the
> struct?

Renamed with "Alloc" suffix.

> jsonPathStatsGetArrayLengthStats()
> XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to
> what jsonPathStatsGetLengthStats does?

"length" stats were collected inside parent paths, but "array_length"
and "avg_array_length" stats were collected inside child array paths.
This resulted in inconsistencies in TypeFreq checks.

I have removed "length" stats, moved "array_length" and
"avg_array_length" to the parent path, and added separate
"object_length" stats.  TypeFreq checks become consistent.

> XXX Seems to do essentially what jsonStatsFindPath, except that it
> also considers jsdata->prefix. Seems fairly easy to combine those
> into a single function.

I don't think it would be better to combine those functions into one with
a considerPrefix flag, because jsonStatsFindPathStats() is a kind of
low-level function which is called in only two places.  In all other
places considerPrefix would be true.  Also, jsonStatsFindPathStatsStr() is
exported in jsonb_selfuncs.h to give external jsonpath-like query
operators the ability to use JSON statistics.

> jsonPathStatsCompare()
> XXX Seems a bit convoluted to first cast it to Datum, then Jsonb ...
> Datum const *pdatum = pv2;
> Jsonb	   *jsonb = DatumGetJsonbP(*pdatum);

The problem of simply using 'jsonb = *(Jsonb **) pv2' is that
DatumGetJsonbP() may deTOAST datums.

> XXX Not sure about this? Does empty path mean global stats?

Empty "path" is simply a sign of invalid stats data.


> jsonStatsConvertArray()
> FIXME Does this actually work on all 32/64-bit systems? What if typid
> is FLOAT8OID or something? Should look at TypeCache instead, probably.

Used get_typlenbyvalalign() instead of hardcoded values.

> jsonb_stats()
>   XXX It might be useful to allow recursion, i.e.
>   get_restriction_variable might derive statistics too.
>   I don't think it does that now, right?

get_restriction_variable() already might derive statistics: it calls
examine_variable() on its both operands, and examine_variable() calls
examine_operator_expression().

> XXX Could we also get varonleft=false in useful cases?

All supported JSONB operators have the signature (jsonb OP arg).
If varonleft = false, then we need to derive stats for an expression like

    '{"foo": "bar"}' -> text_column

having stats for text_column.  It is possible to implement that too, but
it is not related to JSONB stats and needs to be done in a separate
patch.


> jsonSelectivityContains():
>   XXX This really needs more comments explaining the logic.

I have refactored this function and added comments.


> jsonGetFloat4():
>   XXX Not sure assert is a sufficient protection against different
>   types of JSONB values to be passed in.

I have refactored this function by passing a default value for
non-numeric JSONB case.

> jsonPathAppendEntryWithLen()
> XXX Doesn't this need ecape_json too?

Comment is removed, because jsonPathAppendEntry() is called inside.

> jsonPathStatsGetTypeFreq()
> FIXME This is really hard to read/understand, with two nested ternary
> operators.

I have refactored this place.

> XXX Seems more like an error, no? Why ignore it?

It's not an error, it's a request for a non-numeric type.  Length values
are always numeric, and the frequency of non-numeric values is 0.

The possible case when it could happen is the jsonpath '$.size() == "foo"'.
The estimator for operator == will check the frequency of strings in
.size(), and it will be 0.


> jsonPathStatsFormTuple()
> FIXME What does this mean?

There is no need to transform ordinary root path stats tuple, it can be
simply copied.


jsonb_typanalyze.c

> XXX We need entry+lenth because JSON path elements may contain null
> bytes, I guess?

'entry' may not be zero-terminated when it is pointing into JSONB keys,
so 'len' is a necessary field.  'len' is also used for faster entry
comparison and to distinguish array entries ('len' == -1).

> XXX Sould be JsonPathEntryMatch as it deals with JsonPathEntry nodes
> not whole paths, no?
> XXX Again, maybe JsonPathEntryHash would be a better name?

Functions renamed using JsonPathEntry prefix, JsonPath typedef removed.


> JsonPathMatch()
> XXX Seems a bit silly to return int, when the return statement only
> really returns bool (because of how it compares paths). It's not really
> a comparator for sorting, for example.

This function is an implementation of HashCompareFunc, so it needs to
return int.

> jsonAnalyzeJson()
> XXX The name seems a bit weird, with the two json bits.

Renamed to jsonAnalyzeCollectPaths(). The two similar functions were
renamed too.

> jsonAnalyzeJson():
> XXX The mix of break/return statements in this block is really
> confusing.

I have refactored this place using only breaks.

> XXX not sure why we're doing this?

Manual recursion into containers, by creating a child iterator together
with the skipNested=true flag, is used to give jsonAnalyzeJsonValue() the
ability to access jbvBinary containers.


> compute_json_stats()
> XXX Not sure what the first branch is doing (or supposed to)?

> XXX It's not immediately clear why this is (-1) and not simply
> NULL. It crashes, so presumably it's used to tweak the behavior,
> but it's not clear why/how, and it affects place that is pretty
> far away, and so not obvious. We should use some sort of flag
> with a descriptive name instead.

> XXX If I understand correctly, we simply collect all paths first,
> without accumulating any Values. And then in the next step we
> process each path independently, probably to save memory (we
> don't want to accumulate all values for all paths, with a lot
> of duplicities).

There are two variants of stats collection:
  * single-pass - collect all values for all paths
  * multi-pass  - collect values for only one path in each pass

The first variant can consume too much memory (jsonb iteration produces
a lot of garbage etc.), but works faster than the second.

The first 'if (false)' is used for manual selection of one of these
variants.  This selection should be controlled by some user-specified
option (maybe a GUC), or the first variant can simply be removed.

jsonAnalyzeJson()'s parameter of type JsonPathAnlStats * determines
which paths we need to consider for the value collection:
  * NULL  - collect values for all paths
  * -1    - do not collect any values
  * stats - collect values only for a given path

The last variant is really unused because we already have another
function, jsonAnalyzeJsonPath(), which is optimized for selective path
value collection (it uses object key accessor JSONB functions instead of
full JSONB iteration).  I have replaced this strange parameter with a
simple boolean flag.

> XXX Could the parameters be different on other platforms?

Used get_typlenbyvalalign(JSONBOID) instead of hardcoded values.

> jsonAnalyzePathValues()
> XXX Do we need to initialize all slots?

I have copied here the following comment from extended_stats.c:
    /*
     * The fields describing the stats->stavalues[n] element types default
     * to the type of the data being analyzed, but the type-specific
     * typanalyze function can change them if it wants to store something
     * else.
     */

> XXX Not sure why we divide it by the number of json values?

We divide the counts of lengths by the total number of json values to
compute the correct nullfrac, i.e. not all input jsons have lengths; the
length of scalar jsons is NULL.



[1] https://github.com/postgrespro/postgres/tree/jsonb_stats_20220122

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Collecting statistics about contents of JSONB columns

From
Tomas Vondra
Date:
On 1/23/22 01:24, Nikita Glukhov wrote:
> Hi!
> 
> I am glad that you found my very old patch interesting and started to
> work on it.  We failed to post it in 2016 mostly because we were not
> satisfied with JSONB storage.  Also we decided to wait for completion
> of work on extended statistics as we thought that it could help us.
> But in early 2017 we switched to SQL/JSON and forgot about this patch.
>

Understood. Let's see how feasible this idea is and if we can move this 
forward.

> 
> I think custom datatype is necessary for better performance. With a
> plain JSONB we need to do a lot of work for extraction of path stats:
>   - iterate through MCV/Histogram JSONB arrays
>   - cast numeric values to float, string to text etc.
>   - build PG arrays from extracted datums
>   - form pg_statistic tuple.
> 
> With a custom data type we could store pg_statistic tuple unmodified
> and use it without any overhead.  But then we need modify a bit
> VariableStatData and several functions to pass additional nullfrac
> corrections.
> 

I'm not against evaluating/exploring alternative storage formats, but my 
feeling is the real impact on performance will be fairly low. At least I 
haven't seen this as very expensive while profiling the patch so far. Of 
course, I may be wrong, and it may be more significant in some cases.

> Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
> would be enough.
> 
Maybe, but then you still need to store a bunch of those, right? So 
either an array (likely toasted) or a 1:M table. I'm not sure it's going 
to be much cheaper than JSONB.

I'd suggest we focus on what we need to store first, which seems like 
the primary question, and worry about the exact storage format then.


> Also there is an idea to store per-path separately in pg_statistic_ext
> rows using expression like (jb #> '{foo,bar}') as stxexprs.  This could
> also help user to select which paths to analyze simply by using some
> sort of CREATE STATISTICS.  But it is really unclear how to:
>   * store pg_statistic_ext rows from typanalyze
>   * form path expressions for array elements (maybe add new jsonpath
>     operator)
>   * match various -> operator chains to stxexprs
>   * jsonpath-like languages need simple API for searching by stxexprs
> 

Sure, you can do statistics on expressions, right? Of course, if that 
expression produces JSONB value, that's not very useful at the moment. 
Maybe we could have two typanalyze functions - one for regular analyze, 
one for extended statistics?

That being said, I'm not sure extended stats are a good match for this. 
My feeling was we should collect these stats for all JSONB columns, 
which is why I argued for putting that in pg_statistic.

> 
> 
> Per-path statistics should only be collected for scalars.  This can be
> enabled by flag JsonAnalyzeContext.scalarsOnly.  But there are is a
> problem: computed MCVs and histograms will be broken and we will not be
> able to use them for queries like (jb > const) in general case.  Also
> we will not be and use internally in scalarineqsel() and var_eq_const()
> (see jsonSelectivity()).  Instead, we will have to implement own
> estimator functions for JSONB comparison operators that will correctly
> use our hacked MCVs and histograms (and maybe not all cases will be
> supported; for example, comparison to scalars only).
>

Yeah, but maybe that's an acceptable trade-off? I mean, if we can 
improve estimates for most clauses, and there's some number of clauses 
that are estimated just like without stats, that's still an improvement, 
right?

> It's possible to replace objects and arrays with empty ones when
> scalarsOnly is set to keep correct frequencies of non-scalars.
> But there is an old bug in JSONB comparison: empty arrays are placed
> before other values in the JSONB sort order, although they should go
> after all scalars.  So we need also to distinguish empty and non-empty
> arrays here.
> 

Hmmm ...
> 
> 
> I tried to fix a major part of places marked as XXX and FIXME, the new
> version of the patches is attached.  There are a lot of changes, you
> can see them in a step-by-step form in the corresponding branch
> jsonb_stats_20220122 in our GitHub repo [1].
> 

Thanks! I'll go through the changes soon.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Collecting statistics about contents of JSONB columns

From
Mahendra Singh Thalor
Date:
On Tue, 25 Jan 2022 at 03:50, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 1/23/22 01:24, Nikita Glukhov wrote:
> > Hi!
> >
> > I am glad that you found my very old patch interesting and started to
> > work on it.  We failed to post it in 2016 mostly because we were not
> > satisfied with JSONB storage.  Also we decided to wait for completion
> > of work on extended statistics as we thought that it could help us.
> > But in early 2017 we switched to SQL/JSON and forgot about this patch.
> >
>
> Understood. Let's see how feasible this idea is and if we can move this
> forward.
>
> >
> > I think custom datatype is necessary for better performance. With a
> > plain JSONB we need to do a lot of work for extraction of path stats:
> >   - iterate through MCV/Histogram JSONB arrays
> >   - cast numeric values to float, string to text etc.
> >   - build PG arrays from extracted datums
> >   - form pg_statistic tuple.
> >
> > With a custom data type we could store pg_statistic tuple unmodified
> > and use it without any overhead.  But then we need modify a bit
> > VariableStatData and several functions to pass additional nullfrac
> > corrections.
> >
>
> I'm not against evaluating/exploring alternative storage formats, but my
> feeling is the real impact on performance will be fairly low. At least I
> haven't seen this as very expensive while profiling the patch so far. Of
> course, I may be wrong, and it may be more significant in some cases.
>
> > Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
> > would be enough.
> >
> Maybe, but then you still need to store a bunch of those, right? So
> either an array (likely toasted) or 1:M table. I'm not sure it's goiing
> to be much cheaper than JSONB.
>
> I'd suggest we focus on what we need to store first, which seems like
> tha primary question, and worry about the exact storage format then.
>
>
> > Also there is an idea to store per-path separately in pg_statistic_ext
> > rows using expression like (jb #> '{foo,bar}') as stxexprs.  This could
> > also help user to select which paths to analyze simply by using some
> > sort of CREATE STATISTICS.  But it is really unclear how to:
> >   * store pg_statistic_ext rows from typanalyze
> >   * form path expressions for array elements (maybe add new jsonpath
> >     operator)
> >   * match various -> operator chains to stxexprs
> >   * jsonpath-like languages need simple API for searching by stxexprs
> >
>
> Sure, you can do statistics on expressions, right? Of course, if that
> expression produces JSONB value, that's not very useful at the moment.
> Maybe we could have two typanalyze functions - one for regular analyze,
> one for extended statistics?
>
> That being said, I'm not sure extended stats are a good match for this.
> My feeling was we should collect these stats for all JSONB columns,
> which is why I argued for putting that in pg_statistic.
>
> >
> >
> > Per-path statistics should only be collected for scalars.  This can be
> > enabled by flag JsonAnalyzeContext.scalarsOnly.  But there is a
> > problem: computed MCVs and histograms will be broken and we will not be
> > able to use them for queries like (jb > const) in general case.  Also
> > we will not be able to use them internally in scalarineqsel() and
> > var_eq_const() (see jsonSelectivity()).  Instead, we will have to
> > implement our own estimator functions for JSONB comparison operators
> > that will correctly use our hacked MCVs and histograms (and maybe not
> > all cases will be supported; for example, comparison to scalars only).
> >
>
> Yeah, but maybe that's an acceptable trade-off? I mean, if we can
> improve estimates for most clauses, and there's some number of clauses
> that are estimated just like without stats, that's still an improvement,
> right?
>
> > It's possible to replace objects and arrays with empty ones when
> > scalarsOnly is set to keep correct frequencies of non-scalars.
> > But there is an old bug in JSONB comparison: empty arrays are placed
> > before other values in the JSONB sort order, although they should go
> > after all scalars.  So we need also to distinguish empty and non-empty
> > arrays here.
> >
>
> Hmmm ...
> >
> >
> > I tried to fix a major part of places marked as XXX and FIXME, the new
> > version of the patches is attached.  There are a lot of changes, you
> > can see them in a step-by-step form in the corresponding branch
> > jsonb_stats_20220122 in our GitHub repo [1].
> >
>
> Thanks! I'll go through the changes soon.
>
>

Thanks, Nikita and Tomas for these patches.

For the last few days, I was trying to understand these patches, and based on Tomas's suggestion, I was doing some performance tests.

With the attached .SQL file, I can see that analyze is taking more time with these patches.

Setup:
autovacuum=off
all other settings are at their defaults.

Load the attached file (with and without the patch) to compare the time taken by analyze.

With json patches:
postgres=# analyze test ;
ANALYZE
Time: 28464.062 ms (00:28.464)
postgres=#
postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
 pg_size_pretty
----------------
 328 kB
(1 row)
--

Without json patches:
postgres=# analyze test ;
ANALYZE
Time: 120.864 ms
postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
 pg_size_pretty
----------------
 272 kB

I haven't found the root cause of this, but I feel the extra time is due to looping over all the paths.
In my test data, there are 951 distinct paths in total. During analysis, we first scan all the sample rows (30000) and collect all the distinct paths (here 951), and after that, for every single path, we loop over all the sample rows again to collect stats for that particular path. I feel these loops might be taking the time, as sketched below.
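
Roughly, the flow looks like this (a Python-style sketch of my understanding, not the actual patch code; extract_paths(), extract_values_for_path() and build_path_stats() are hypothetical helpers):

    def compute_json_stats(sample_rows):
        # pass 1: collect every distinct path across all sampled documents
        paths = set()
        for doc in sample_rows:
            paths |= extract_paths(doc)            # here: 951 distinct paths

        # pass 2: one full scan of the sample per path
        stats = {}
        for path in paths:                         # 951 iterations ...
            values = []
            for doc in sample_rows:                # ... times 30000 rows each
                values.extend(extract_values_for_path(doc, path))
            stats[path] = build_path_stats(values)
        return stats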

I will run perf and will try to find out the root cause of this.

Apart from this performance issue, I haven't found any crashes or issues.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Collecting statistics about contents of JSONB columns

From
Greg Stark
Date:
On Thu, 6 Jan 2022 at 14:56, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
> Not sure I understand. I wasn't suggesting any user-defined filtering,
> but something done by default, similarly to what we do for regular MCV
> lists, based on frequency. We'd include frequent paths while excluding
> rare ones.
>
> So no need for a user interface.

Not sure, but I think he was agreeing with you: that we should figure
out the baseline behaviour and get it as useful as possible first, then
later look at adding some way to customize it. I agree -- I don't
think the user interface will be hard technically, but I think it will
require some good ideas and there could be lots of bikeshedding. And a
lot of users will never even use it anyway, so it's important to get
the defaults as useful as possible.

> Similarly for the non-scalar values - I don't think we can really keep
> regular statistics on such values (for the same reason why it's not
> enough for whole JSONB columns), so why to build/store that anyway.

For a default behaviour I wonder if it wouldn't be better to just
flatten and extract all the scalars. So if there's no matching path
then at least we have some way to estimate how often a scalar appears
anywhere in the json document.

That amounts to assuming the user knows the right path to find a given
scalar and there isn't a lot of overlap between keys. So it would at
least do something useful if you have something like {gender: female,
name: {first: nancy, last: reagan}, state: california, country: usa}.
It might get things slightly wrong if you have some people named
"georgia" or have names that can be first or last names.

But it would generally be doing something more or less useful as long
as they look for "usa" in the country field and "male" in the gender
field. If they looked for "male" in $.name.first path it would give
bad estimates but assuming they know their data structure they won't
be doing that.
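
To make the flattening idea concrete, here's a rough Python sketch (nothing from the patch; the helper and the sample document are made up for illustration):

    # Collect every scalar that appears anywhere in a document,
    # ignoring which path it was found under.
    def flatten_scalars(value, out):
        if isinstance(value, dict):
            for v in value.values():
                flatten_scalars(v, out)
        elif isinstance(value, list):
            for v in value:
                flatten_scalars(v, out)
        else:
            out.append(value)      # scalar: string, number, bool or None
        return out

    doc = {"gender": "female",
           "name": {"first": "nancy", "last": "reagan"},
           "state": "california", "country": "usa"}
    print(flatten_scalars(doc, []))
    # ['female', 'nancy', 'reagan', 'california', 'usa'] - one "bag of
    # scalars" per document, from which frequencies could be estimated
    # for any path lookup.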

-- 
greg



Re: Collecting statistics about contents of JSONB columns

From
Tomas Vondra
Date:

On 1/25/22 17:56, Mahendra Singh Thalor wrote:
 >
> ...
> 
> For the last few days, I was trying to understand these patches, and 
> based on Tomas's suggestion, I was doing some performance tests.
> 
> With the attached .SQL file, I can see that analyze is taking more time 
> with these patches.
> 
> *Setup: *
> autovacuum=off
> rest all are default settings.
> 
> Insert attached file with and without the patch to compare the time 
> taken by analyze.
> 
> *With json patches:*
> postgres=# analyze test ;
> ANALYZE
> Time: *28464.062 ms (00:28.464)*
> postgres=#
> postgres=# SELECT pg_size_pretty( 
> pg_total_relation_size('pg_catalog.pg_statistic') );
>   pg_size_pretty
> ----------------
>   328 kB
> (1 row)
> -- 
> 
> *Without json patches:*
> postgres=# analyze test ;
> ANALYZE
> *Time: 120.864* ms
> postgres=# SELECT pg_size_pretty( 
> pg_total_relation_size('pg_catalog.pg_statistic') );
>   pg_size_pretty
> ----------------
>   272 kB
> 
> I haven't found the root cause of this, but I feel the extra time is 
> due to looping over all the paths.
> In my test data, there are 951 distinct paths in total. During 
> analysis, we first scan all the sample rows (30000) and collect all 
> the distinct paths (here 951), and after that, for every single path, 
> we loop over all the sample rows again to collect stats for that 
> particular path. I feel these loops might be taking the time.
> 
> I will run perf and will try to find out the root cause of this.
> 

Thanks, I've been doing some performance tests too, and you're right it 
takes quite a bit of time. I wanted to compare how the timing changes 
with complexity of the JSON documents (nesting, number of keys, ...) so 
I wrote a simple python script to generate random JSON documents with 
different parameters - see the attached json-generate.py script.

It's a bit crude, but it generates synthetic documents with a chosen 
number of levels, keys per level, distinct key values, etc. The 
generated documents are loaded directly into a "json_test" database, 
into a table "test_table" with a single jsonb column called "v". 
Tweaking this to connect to a different database, or just dump the 
generated documents to a file, should be trivial.
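
For reference, a stripped-down approximation of what the generator does (the attached json-generate.py is the authoritative version; the names and parameter values below are made up):

    #!/usr/bin/env python3
    # Generate random nested JSON documents with a given nesting depth,
    # number of keys per level and number of distinct key names.
    import json, random

    def gen_doc(levels, keys_per_level, distinct_keys):
        doc = {}
        for _ in range(keys_per_level):
            key = "key_%d" % random.randrange(distinct_keys)
            if levels > 1:
                doc[key] = gen_doc(levels - 1, keys_per_level, distinct_keys)
            else:
                doc[key] = random.randrange(1000)   # scalar leaf value
        return doc

    if __name__ == "__main__":
        for _ in range(30000):
            print(json.dumps(gen_doc(levels=2, keys_per_level=8,
                                     distinct_keys=1000)))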

The attached bash script runs the data generator for a couple of 
combinations, and then measures how long it takes to analyze the table, 
how large the statistics are (in a rather crude way), etc.

The results look like this (the last two columns are analyze duration in 
milliseconds, for master and with the patch):

   levels   keys  unique keys    paths      master    patched
   ----------------------------------------------------------
        1      1            1        2         153        122
        1      1         1000     1001         134       1590
        1      8            8        9         157        367
        1      8         1000     1001         155       1838
        1     64           64       65         189       2298
        1     64         1000     1001          46       9322
        2      1            1        3         237        197
        2      1         1000    30580         152      46468
        2      8            8       73         245       1780

So yeah, it's significantly slower - in most cases not as much as you 
observed, but an order of magnitude slower than master. For size of the 
statistics, it's similar:

   levels   keys  unique keys   paths  table size   master    patched
   ------------------------------------------------------------------
        1      1            1       2     1843200    16360      24325
        1      1         1000    1001     1843200    16819    1425400
        1      8            8       9     4710400    28948      88837
        1      8         1000    1001     6504448    42694    3915802
        1     64           64      65    30154752   209713     689842
        1     64         1000    1001    49086464     1093    7755214
        2      1            1       3     2572288    24883      48727
        2      1         1000   30580     2572288    11422   26396365
        2      8            8      73    23068672   164785     862258

This is measured by dumping pg_statistic for the column, so in the 
database it might be compressed etc. It's larger, but that's somewhat 
expected because we simply store more detailed stats. The size grows 
with the number of paths extracted - which is expected, of course.

If you're wondering why this doesn't show data for additional combinations 
(e.g. 2 levels, 8 keys and 1000 distinct key values), that's the bad 
news - those cases take ages (multiple minutes) and then get killed by 
the OOM killer because they eat gigabytes of memory.

I agree the slowness is largely due to extracting all paths and then 
processing them one by one - which means we have to loop over the tuples 
over and over. In this case there's about 850k distinct paths extracted, 
so we do ~850k loops over 30k tuples. That's gotta take time.

I don't know what exactly to do about this, but I already mentioned we 
may need to pick a subset of paths to keep, similarly to how we pick 
items for MCV. I mean, if we only saw a path once or twice, it's 
unlikely to be interesting enough to build stats for it. I haven't 
tried, but I'd bet most of the 850k paths might be ignored like this.

The other thing we might do is make the loops more efficient. For 
example, we might track which documents contain each path (by a small 
bitmap or something), so that in the loop we can skip rows that don't 
contain the path we're currently processing. Or something like that - 
see the sketch below.
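
Conceptually something like this (a Python sketch of the idea, not the patch code; extract_paths(), extract_values_for_path() and build_path_stats() are hypothetical helpers):

    # Remember which sample rows contain each path, so the per-path pass
    # only visits those rows instead of scanning the whole sample again.
    def compute_json_stats_with_row_sets(sample_rows):
        path_rows = {}                              # path -> set of row numbers
        for rownum, doc in enumerate(sample_rows):
            for path in extract_paths(doc):
                path_rows.setdefault(path, set()).add(rownum)

        stats = {}
        for path, rows in path_rows.items():
            values = []
            for rownum in rows:                     # skip rows without the path
                values.extend(extract_values_for_path(sample_rows[rownum], path))
            stats[path] = build_path_stats(values)
        return stats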

Of course, this can't eliminate all the overhead - we're building more 
stats and that has a cost. In the "common" case of a stable "fixed" schema 
with the same paths in all documents we'll still need to do a loop for 
each of them. So it's bound to be slower than master.

Which probably means it's a bad idea to do this for all JSONB columns, 
because in many cases the extra stats are not needed so the extra 
analyze time would be a waste. So I guess we'll need some way to enable 
this only for selected columns ... I argued against the idea to 
implement this as extended statistics in the first message, but it's a 
reasonably nice way to do such stuff (expression stats are a precedent).


> Apart from this performance issue, I haven't found any crashes or issues.
> 

Well, I haven't seen any crashes either, but as I mentioned for complex 
documents (2 levels, many distinct keys) the ANALYZE starts consuming a 
lot of memory and may get killed by OOM. For example if you generate 
documents like this

  ./json-generate.py 30000 2 8 1000 6 1000

and then run ANALYZE, that'll take ages and it very quickly gets into a 
situation like this (generated from gdb by calling MemoryContextStats on 
TopMemoryContext):


-------------------------------------------------------------------------
TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
   ...
   TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
     PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
       Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
         Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
           Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
             Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
           JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
       Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
...
Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
-------------------------------------------------------------------------


Yes, that's the backend using 3GB of memory, out of which 1.6GB is in the 
"analyze path table" context, 400MB in "analyze" and 900MB in "analyze 
column" contexts. I mean, that seems a bit excessive. And it grows over 
time, so after a while my laptop gives up and kills the backend.

I'm not sure if it's a memory leak (which would be fixable), or it's due 
to keeping stats for all the extracted paths. I mean, in this particular 
case we have 850k paths - even if stats are just 1kB per path,  that's 
850MB. This requires more investigation.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Collecting statistics about contents of JSONB columns

From
Tomas Vondra
Date:

On 2/4/22 03:47, Tomas Vondra wrote:
> ./json-generate.py 30000 2 8 1000 6 1000

Sorry, this should be (different order of parameters):

./json-generate.py 30000 2 1000 8 6 1000


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Collecting statistics about contents of JSONB columns

From
Nikita Glukhov
Date:


On 04.02.2022 05:47, Tomas Vondra wrote:
> On 1/25/22 17:56, Mahendra Singh Thalor wrote:
> >
> > ...
> >
> > For the last few days, I was trying to understand these patches, and based on Tomas's suggestion, I was doing some performance tests.
> >
> > With the attached .SQL file, I can see that analyze is taking more time with these patches.
> >
> > I haven't found the root cause of this but I feel that this time is due to a loop of all the paths.
> > In my test data, there is a total of 951 distinct paths. While doing analysis, first we check all the sample rows (30000) and we collect all the distinct paths (here 951), and after that for every single path, we loop over all the sample rows again to collect stats for a particular path. I feel that these loops might be taking time.
>
> Thanks, I've been doing some performance tests too, and you're right it takes quite a bit of time.

That is absolutely not surprising, I have warned about poor performance
in cases with a large number of paths.

> I agree the slowness is largely due to extracting all paths and then processing them one by one - which means we have to loop over the tuples over and over. In this case there's about 850k distinct paths extracted, so we do ~850k loops over 30k tuples. That's gotta take time.
>
> I don't know what exactly to do about this, but I already mentioned we may need to pick a subset of paths to keep, similarly to how we pick items for MCV. I mean, if we only saw a path once or twice, it's unlikely to be interesting enough to build stats for it. I haven't tried, but I'd bet most of the 850k paths might be ignored like this.
>
> The other thing we might do is making the loops more efficient. For example, we might track which documents contain each path (by a small bitmap or something), so that in the loop we can skip rows that don't contain the path we're currently processing. Or something like that.
>
> > Apart from this performance issue, I haven't found any crashes or issues.
>
> Well, I haven't seen any crashes either, but as I mentioned for complex documents (2 levels, many distinct keys) the ANALYZE starts consuming a lot of memory and may get killed by OOM. For example if you generate documents like this
>
>  ./json-generate.py 30000 2 8 1000 6 1000
>
> and then run ANALYZE, that'll take ages and it very quickly gets into a situation like this (generated from gdb by calling MemoryContextStats on TopMemoryContext):
>
> -------------------------------------------------------------------------
> TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
>   ...
>   TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
>     PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
>       Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
>         Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
>           Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
>             Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
>           JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
>       Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
> ...
> Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
> -------------------------------------------------------------------------
>
> Yes, that's the backend using 3GB of memory, out of which 1.6GB is in "analyze path table" context, 400MB in "analyze" and 900MB in "analyze column" contexts. I mean, that seems a bit excessive. And it grows over time, so after a while my laptop gives up and kills the backend.
>
> I'm not sure if it's a memory leak (which would be fixable), or it's due to keeping stats for all the extracted paths. I mean, in this particular case we have 850k paths - even if stats are just 1kB per path, that's 850MB. This requires more investigation.

Thank you for the tests and investigation.

I have tried to reduce memory consumption and speed up row scanning:

1. "JSON analyze path table" context contained ~1KB JsonPathAnlStats   structure per JSON path in the global hash table.  I have moved   JsonPathAnlStats to the stack of compute_json_stats(), and hash    table now consumes ~70 bytes per path.

2. I have fixed copying of resulting JSONB stats into context, which   reduced the size of "Analyze Column" context.

3. I have optimized consumption of single-pass algorithm by storing   only value lists in the non-temporary context.  That helped to    execute "2 64 64" test case in 30 seconds.  Single-pass is a    bit faster in non-TOASTed cases, and much faster in TOASTed.   But it consumes much more memory and still goes to OOM in the    cases with more than ~100k paths.

4. I have implemented per-path document lists/bitmaps, which really    speed up the case "2 8 1000".  List is converted into bitmap when    it becomes larger than bitmap.

5. Also I have fixed some bugs.


All these changes you can find in commit form in our GitHub repository 
on the branch jsonb_stats_20220310 [1].


Updated results of the test:

levels keys  uniq keys  paths   master    multi-pass    single-pass
                                              ms  MB       ms    MB
-------------------------------------------------------------------
     1    1         1       2      153      122   10       82    14
     1    1      1000    1001      134      105   11       78    38
     1    8         8       9      157      384   19      328    32
     1    8      1000    1001      155      454   23      402    72
     1   64        64      65      129     2889   45     2386   155
     1   64      1000    1001      158     3990   94     1447   177
     2    1         1       3      237      147   10       91    16
     2    1      1000   30577      152      264   32      394   234
     2    8         8      72      245     1943   37     1692   139
     2    8      1000  852333      152     9175  678         OOM
     2   64        64    4161     1784 ~1 hour    53    30018  1750
     2   64      1000 1001001     4715 ~4 hours 1600         OOM

The two last multi-pass results are too slow, because the JSONBs become 
TOASTed.  For measuring master in these tests, I disabled the 
WIDTH_THRESHOLD check which skips TOASTed values > 1KB.


Next, I am going to try to disable all-paths collection and implement 
collection of most common paths (and/or hashed paths maybe).


[1] https://github.com/postgrespro/postgres/tree/jsonb_stats_20220310
-- 
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Collecting statistics about contents of JSONB columns

From
Mahendra Singh Thalor
Date:
On Fri, 4 Feb 2022 at 08:30, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 2/4/22 03:47, Tomas Vondra wrote:
> > ./json-generate.py 30000 2 8 1000 6 1000
>
> Sorry, this should be (different order of parameters):
>
> ./json-generate.py 30000 2 1000 8 6 1000
>

Thanks, Tomas for this test case.

Hi Hackers,

For the last few days, I was doing testing on top of these JSON optimizer patches, with help from Tomas Vondra to understand the patches and the testing results.
Thanks, Tomas for your feedback and suggestions.

Below is the summary:
Point 1) analyze is taking a very long time for large documents:
For large JSON documents, analyze took a very long time compared to the current head. For reference, I am attaching a test file (./json-generate.py 30000 2 1000 8 6 1000).

Head: analyze test ; Time: 120.864 ms
With patch: analyze test ; Time: more than 2 hours

analyze is taking a very long time because with these patches we first iterate over all sample rows (in the above case 30000) and store all the paths (here around 850k paths).
In a second pass, we take one path at a time and collect stats for that particular path by analyzing all the sample rows, and we continue this process for all 850k paths. In other words, we do 850k loops, and in each loop we extract values for a single path.

Point 2) memory consumption increases rapidly for large documents:
In the above test case, there are 851k paths in total, and to keep stats for one path, we allocate 1120 bytes.

Total paths : 852689 ~ 852k

Memory for 1 path to keep stats: 1120 ~ 1 KB

(sizeof(JsonValueStats) = 1120 from “Analyze Column”)

Total memory for all paths: 852689 * 1120 = 955011680 ~ 955 MB

On top of that, while analyzing each path we allocate some more memory depending on frequency and other factors.

To keep all entries (851k paths) in the hash table, we use around 1 GB of memory, which is also very large.

Point 3) Review comment noticed by Tomas Vondra:

+       oldcxt = MemoryContextSwitchTo(ctx->stats->anl_context);
+       pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+       MemoryContextSwitchTo(oldcxt);

Above should be:
+       oldcxt = MemoryContextSwitchTo(ctx->mcxt);
+       pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+       MemoryContextSwitchTo(oldcxt);

Response from Tomas Vondra:
The problem is "anl_context" is actually "Analyze", i.e. the context for the whole ANALYZE command, for all the columns. But we only want to keep those path stats while processing a particular column. At the end, after processing all paths from a column, we need to "build" the final stats in the column, and this result needs to go into "Analyze" context. But all the partial results need to go into "Analyze Column" context.

Point 4)

+/*
+ * jsonAnalyzeCollectPath
+ *             Extract a single path from JSON documents and collect its values.
+ */
+static void
+jsonAnalyzeCollectPath(JsonAnalyzeContext *ctx, Jsonb *jb, void *param)
+{
+       JsonPathAnlStats *pstats = (JsonPathAnlStats *) param;
+       JsonbValue      jbvtmp;
+       JsonbValue *jbv = JsonValueInitBinary(&jbvtmp, jb);
+       JsonPathEntry *path;
+       JsonPathEntry **entries;
+       int                     i;
+
+       entries = palloc(sizeof(*entries) * pstats->depth);
+
+       /* Build entry array in direct order */
+       for (path = &pstats->path, i = pstats->depth - 1;
+                path->parent && i >= 0;
+                path = path->parent, i--)
+               entries[i] = path;
+
+       jsonAnalyzeCollectSubpath(ctx, pstats, jbv, entries, 0);
+
+       pfree(entries);

Many times we call palloc with zero size here, and entries then points to invalid memory (because pstats->depth = 0), so I think we should not palloc with size 0.

Fix:

+       if (pstats->depth)
+               entries = palloc(sizeof(*entries) * pstats->depth);



From these points, we can say that we should rethink the design of collecting stats for all paths.

We could set limits (like MCV) on the number of paths, or let the user give an explicit path to collect stats for only that path, or pass a subset of the JSON values.

In the above case, there are 851k paths in total, but we could collect stats for only the 1000 most common paths; this way we can minimize both time and memory, and we might even keep at
least frequencies for the non-analyzed paths.

Next, I will take the latest patches from Nikita's last email and I will do more tests.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Re: Collecting statistics about contents of JSONB columns

From
Greg Stark
Date:
This patch has bitrotted, presumably after the other JSON patchset was
applied. It looks like it's failing in the json header file so it may
be as simple as additional functions added on nearby lines.

Please rebase. Reminder, it's the last week of the commitfest so time
is of the essence....



Re: Collecting statistics about contents of JSONB columns

From
Justin Pryzby
Date:
I noticed some typos.

diff --git a/src/backend/utils/adt/jsonb_selfuncs.c b/src/backend/utils/adt/jsonb_selfuncs.c
index f5520f88a1d..d98cd7020a1 100644
--- a/src/backend/utils/adt/jsonb_selfuncs.c
+++ b/src/backend/utils/adt/jsonb_selfuncs.c
@@ -1342,7 +1342,7 @@ jsonSelectivityContains(JsonStats stats, Jsonb *jb)
                     path->stats = jsonStatsFindPath(stats, pathstr.data,
                                                     pathstr.len);
 
-                /* Appeend path string entry for array elements, get stats. */
+                /* Append path string entry for array elements, get stats. */
                 jsonPathAppendEntry(&pathstr, NULL);
                 pstats = jsonStatsFindPath(stats, pathstr.data, pathstr.len);
                 freq = jsonPathStatsGetFreq(pstats, 0.0);
@@ -1367,7 +1367,7 @@ jsonSelectivityContains(JsonStats stats, Jsonb *jb)
             case WJB_END_ARRAY:
             {
                 struct Path *p = path;
-                /* Absoulte selectivity of the path with its all subpaths */
+                /* Absolute selectivity of the path with its all subpaths */
                 Selectivity abs_sel = p->sel * p->freq;
 
                 /* Pop last path entry */
diff --git a/src/backend/utils/adt/jsonb_typanalyze.c b/src/backend/utils/adt/jsonb_typanalyze.c
index 7882db23a87..9a759aadafb 100644
--- a/src/backend/utils/adt/jsonb_typanalyze.c
+++ b/src/backend/utils/adt/jsonb_typanalyze.c
@@ -123,10 +123,9 @@ typedef struct JsonScalarStats
 /*
  * Statistics calculated for a set of values.
  *
- *
  * XXX This seems rather complicated and needs simplification. We're not
  * really using all the various JsonScalarStats bits, there's a lot of
- * duplication (e.g. each JsonScalarStats contains it's own array, which
+ * duplication (e.g. each JsonScalarStats contains its own array, which
  * has a copy of data from the one in "jsons").
  */
 typedef struct JsonValueStats
@@ -849,7 +848,7 @@ jsonAnalyzePathValues(JsonAnalyzeContext *ctx, JsonScalarStats *sstats,
     stats->stanullfrac = (float4)(1.0 - freq);
 
     /*
-     * Similarly, we need to correct the MCV frequencies, becuse those are
+     * Similarly, we need to correct the MCV frequencies, because those are
      * also calculated only from the non-null values. All we need to do is
      * simply multiply that with the non-NULL frequency.
      */
@@ -1015,7 +1014,7 @@ jsonAnalyzeBuildPathStats(JsonPathAnlStats *pstats)
 
     /*
      * We keep array length stats here for queries like jsonpath '$.size() > 5'.
-     * Object lengths stats can be useful for other query lanuages.
+     * Object lengths stats can be useful for other query languages.
      */
     if (vstats->arrlens.values.count)
         jsonAnalyzeMakeScalarStats(&ps, "array_length", &vstats->arrlens.stats);
@@ -1069,7 +1068,7 @@ jsonAnalyzeCalcPathFreq(JsonAnalyzeContext *ctx, JsonPathAnlStats *pstats,
  * We're done with accumulating values for this path, so calculate the
  * statistics for the various arrays.
  *
- * XXX I wonder if we could introduce some simple heuristict on which
+ * XXX I wonder if we could introduce some simple heuristic on which
  * paths to keep, similarly to what we do for MCV lists. For example a
  * path that occurred just once is not very interesting, so we could
  * decide to ignore it and not build the stats. Although that won't
@@ -1414,7 +1413,7 @@ compute_json_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 
     /*
      * Collect and analyze JSON path values in single or multiple passes.
-     * Sigle-pass collection is faster but consumes much more memory than
+     * Single-pass collection is faster but consumes much more memory than
      * collecting and analyzing by the one path at pass.
      */
     if (ctx.single_pass)



Re: Collecting statistics about contents of JSONB columns

From
Mahendra Singh Thalor
Date:
On Fri, 1 Apr 2022 at 20:21, Greg Stark <stark@mit.edu> wrote:
>
> This patch has bitrotted, presumably after the other JSON patchset was
> applied. It looks like it's failing in the json header file so it may
> be as simple as additional functions added on nearby lines.
>
> Please rebase. Reminder, it's the last week of the commitfest so time
> is of the essence....

Thanks, Greg for the report.

Here, I am attaching rebased patches of the v05 series. These patches
are rebased on commit 7dd3ee508432730d15c5.

> I noticed some typos.

> diff --git a/src/backend/utils/adt/jsonb_selfuncs.c b/src/backend/utils/adt/jsonb_selfuncs.c
> index f5520f88a1d..d98cd7020a1 100644

Thanks, Justin for the review. We will fix these comments in the next version.


> Next, I am going to try to disable all-paths collection and implement
> collection of most common paths (and/or hashed paths maybe).

Thanks, Nikita for the v04 series of patches. I tested on top of
your patches and verified that the time taken by analyze is reduced for
large complex JSON docs.

With the v03 patches, it was more than 2 hours, and with the v04 patches,
it is only 39 sec (for Tomas's test case).

I am waiting for your patches (disable all-paths collection and
implement collection of most common paths)

Just for testing purposes, I am posting re-based patches here.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Collecting statistics about contents of JSONB columns

From
Mahendra Singh Thalor
Date:
On Fri, 11 Mar 2022 at 04:29, Nikita Glukhov <n.gluhov@postgrespro.ru> wrote:
>
>
> On 04.02.2022 05:47, Tomas Vondra wrote:
>
> On 1/25/22 17:56, Mahendra Singh Thalor wrote:
> >
>
> ...
>
> For the last few days, I was trying to understand these patches, and based on Tomas's suggestion, I was doing some performance tests.
>
> With the attached .SQL file, I can see that analyze is taking more time with these patches.
>
> I haven't found the root cause of this but I feel that this time is due to a loop of all the paths.
> In my test data, there is a total of 951 different-2 paths. While doing analysis, first we check all the sample rows(30000) and we collect all the different-2 paths (here 951), and after that for every single path, we loop over all the sample rows again to collect stats for a particular path. I feel that these loops might be taking time.
>
> Thanks, I've been doing some performance tests too, and you're right it takes quite a bit of time.
>
>
> That is absolutely not surprising, I have warned about poor performance
> in cases with a large number of paths.
>
>
> I agree the slowness is largely due to extracting all paths and then processing them one by one - which means we have to loop over the tuples over and over. In this case there's about 850k distinct paths extracted, so we do ~850k loops over 30k tuples. That's gotta take time.
>
> I don't know what exactly to do about this, but I already mentioned we may need to pick a subset of paths to keep, similarly to how we pick items for MCV. I mean, if we only saw a path once or twice, it's unlikely to be interesting enough to build stats for it. I haven't tried, but I'd bet most of the 850k paths might be ignored like this.
>
> The other thing we might do is making it the loops more efficient. For example, we might track which documents contain each path (by a small bitmap or something), so that in the loop we can skip rows that don't contain the path we're currently processing. Or something like that.
>
> Apart from this performance issue, I haven't found any crashes or issues.
>
>
> Well, I haven't seen any crashes either, but as I mentioned for complex documents (2 levels, many distinct keys) the ANALYZE starts consuming a lot of memory and may get killed by OOM. For example if you generate documents like this
>
>  ./json-generate.py 30000 2 8 1000 6 1000
>
> and then run ANALYZE, that'll take ages and it very quickly gets into a situation like this (generated from gdb by calling MemoryContextStats on TopMemoryContext):
>
> -------------------------------------------------------------------------
> TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
>   ...
>   TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
>     PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
>       Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
>         Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
>           Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
>             Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
>           JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
>       Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
> ...
> Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
> -------------------------------------------------------------------------
>
>
> Yes, that's backend 3GB of memory, out of which 1.6GB is in "analyze path table" context, 400MB in "analyze" and 900MB in "analyze column" contexts. I mean, that seems a bit excessive. And it grows over time, so after a while my laptop gives up and kills the backend.
>
> I'm not sure if it's a memory leak (which would be fixable), or it's due to keeping stats for all the extracted paths. I mean, in this particular case we have 850k paths - even if stats are just 1kB per path,  that's 850MB. This requires more investigation.
>
> Thank you for the tests and investigation.
>
> I have tried to reduce memory consumption and speed up row scanning:
>
> 1. "JSON analyze path table" context contained ~1KB JsonPathAnlStats
>    structure per JSON path in the global hash table.  I have moved
>    JsonPathAnlStats to the stack of compute_json_stats(), and hash
>    table now consumes ~70 bytes per path.
>
> 2. I have fixed copying of resulting JSONB stats into context, which
>    reduced the size of "Analyze Column" context.
>
> 3. I have optimized consumption of single-pass algorithm by storing
>    only value lists in the non-temporary context.  That helped to
>    execute "2 64 64" test case in 30 seconds.  Single-pass is a
>    bit faster in non-TOASTed cases, and much faster in TOASTed.
>    But it consumes much more memory and still goes to OOM in the
>    cases with more than ~100k paths.
>
> 4. I have implemented per-path document lists/bitmaps, which really
>    speed up the case "2 8 1000".  List is converted into bitmap when
>    it becomes larger than bitmap.
>
> 5. Also I have fixed some bugs.
>
>
> All these changes you can find commit form in our GitHub repository
> on the branch jsonb_stats_20220310 [1].
>
>
> Updated results of the test:
>
> levels keys  uniq keys  paths   master    multi-pass    single-pass
>                                               ms  MB       ms    MB
> -------------------------------------------------------------------
>      1    1         1       2      153      122   10       82    14
>      1    1      1000    1001      134      105   11       78    38
>      1    8         8       9      157      384   19      328    32
>      1    8      1000    1001      155      454   23      402    72
>      1   64        64      65      129     2889   45     2386   155
>      1   64      1000    1001      158     3990   94     1447   177
>      2    1         1       3      237      147   10       91    16
>      2    1      1000   30577      152      264   32      394   234
>      2    8         8      72      245     1943   37     1692   139
>      2    8      1000  852333      152     9175  678         OOM
>      2   64        64    4161     1784 ~1 hour    53    30018  1750
>      2   64      1000 1001001     4715 ~4 hours 1600         OOM
>
> The two last multi-pass results are too slow, because JSONBs becomes
> TOASTed.  For measuring master in these tests, I disabled
> WIDTH_THRESHOLD check which skipped TOASTed values > 1KB.
>
>
> Next, I am going to try to disable all-paths collection and implement
> collection of most common paths (and/or hashed paths maybe).

Hi Nikita,
Tomas and I discussed the design for disabling all-paths collection (i.e. collecting stats for only some paths). Below are some thoughts/doubts/questions.

Point 1) Can you please elaborate on how you are going to implement this (collecting stats for only some paths)?
Point 2) As JSON stats are taking time, should we add an on/off switch for collecting JSON stats?
Point 3) We thought of one more design: we could let the user give an explicit path to collect stats for that particular path only, or pass a subset of the JSON values. But this may require a lot of code changes (syntax and so on), so we are thinking it would be good if we could collect stats only for some common paths (by a limit or some other way).

Thoughts?

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Re: Collecting statistics about contents of JSONB columns

From
Tomas Vondra
Date:
On 5/17/22 13:44, Mahendra Singh Thalor wrote:
> ...
>
> Hi Nikita,
> I and Tomas discussed the design for disabling all-paths
> collection(collect stats for only some paths). Below are some
> thoughts/doubts/questions.
> 
> *Point 1)* Please can you elaborate more that how are you going to
> implement this(collect stats for only some paths).

I think Nikita mentioned he plans to build stats only for the most
common paths, which seems generally straightforward:

1) first pass over the documents, collect distinct paths and track how
many times we saw each one

2) in the second pass extract stats only for the most common paths (e.g.
the top 100 most common ones, or whatever the statistics target says)

I guess we might store at least the frequency for uncommon paths,
which seems somewhat useful for selectivity estimation (see the sketch
below).
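
Conceptually something like this (a Python sketch, not the patch code; extract_paths() is a hypothetical helper):

    # Keep detailed stats only for the most common paths, but remember a
    # bare frequency even for the paths we don't analyze in detail.
    from collections import Counter

    def select_common_paths(sample_rows, statistics_target=100):
        counts = Counter()
        for doc in sample_rows:
            counts.update(extract_paths(doc))

        nrows = len(sample_rows)
        common = [p for p, _ in counts.most_common(statistics_target)]
        freqs = {p: c / nrows for p, c in counts.items()}   # kept for all paths
        return common, freqs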


I wonder if we might further optimize this for less common paths. AFAICS
one of the reasons why we process the paths one by one (in the second
pass) is to limit memory consumption. By processing a single path, we
only need to accumulate values for that path.

But if we know the path is uncommon, we know there'll be few values. For
example the path may be only in 100 documents, not the whole sample. So
maybe we might process multiple paths at once (which would mean we don't
need to detoast the JSON documents that often, etc.).

OTOH that may be pointless, because if the paths are uncommon, chances
are the subsets of documents will be different, in which case it's
probably cheaper to just process the paths one by one.


> *Point 2) *As JSON stats are taking time so should we add an on/off
> switch to collect JSON stats?

IMHO we should think about doing that. I think it's not really possible
to eliminate (significant) regressions for all corner cases, and in many
cases people don't even need these statistics (e.g. when just storing and
retrieving JSON docs, without querying contents of the docs).

I don't know how exactly to enable/disable this - it very much depends
on how we store the stats. If we store that in pg_statistic, then ALTER
TABLE ... ALTER COLUMN seems like the right way to enable/disable these
path stats. We might also have a new "json" statistics kind and do this
through CREATE STATISTICS. Or something else, not sure.


> *Point 3)* We thought of one more design: we can give an explicit path
> to collect stats for a particular path only or we can pass a subset of
> the JSON values but this may require a lot of code changes like syntax
> and all so we are thinking that it will be good if we can collect stats
> only for some common paths(by limit or any other way)
> 

I'm not sure I understand what this is saying, particularly the part
about subset of JSON values. Can you elaborate?

I can imagine specifying a list of interesting paths, and we'd only
collect stats for the matching subset of the JSON documents. So if you
have huge JSON documents with complex schema, but you only query a very
limited subset of paths, we could restrict ANALYZE to this subset.

In fact, that's what the 'selective analyze' commit [1] in Nikita's
original patch set does in principle. We'd probably need to improve this
in some ways (e.g. to allow defining the column filter not just in
ANALYZE itself). I left it out of this patch to keep the patch as simple
as possible.

But why/how exactly would we limit the "JSON values"? Can you give some
example demonstrating that in practice?

regards



[1]
https://github.com/postgrespro/postgres/commit/7ab7397450df153e5a8563c978728cb731a0df33

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company