Thread: Collecting statistics about contents of JSONB columns
Attachment
- 0001-Add-pg_operator.oprstat-for-derived-operato-20211230.patch
- 0002-Add-stats_form_tuple-20211230.patch
- 0003-Add-symbolic-names-for-some-jsonb-operators-20211230.patch
- 0004-Add-helper-jsonb-functions-and-macros-20211230.patch
- 0005-Export-scalarineqsel-20211230.patch
- 0006-Add-jsonb-statistics-20211230.patch
Hi,
One of the complaints I sometimes hear from users and customers using
Postgres to store JSON documents (as JSONB type, of course) is that the
selectivity estimates are often pretty poor.
Currently we only really have MCV lists and histograms built on whole
documents, and we can deduce some stats from that. But that is somewhat bogus
because there's only ~100 documents in either MCV/histogram (with the
default statistics target). And moreover we discard all "oversized"
values (over 1kB) before even calculating those stats, which makes it
even less representative.
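Just to illustrate the problem (a made-up table, not from any of the patches),
the typical case looks something like this:

  CREATE TABLE docs (v jsonb);
  -- load documents like {"country": "usa", "price": 150, ...}
  ANALYZE docs;

  EXPLAIN SELECT * FROM docs WHERE v @> '{"country": "usa"}';
  EXPLAIN SELECT * FROM docs WHERE (v ->> 'price')::numeric < 100;

Unless the whole document happens to show up in the MCV list, the estimates
for conditions like these fall back to more or less constant defaults,
regardless of what the data actually looks like.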
A couple weeks ago I started playing with this, and I experimented with
improving extended statistics in this direction. After a while I noticed
a forgotten development branch from 2016 which tried to do this by
adding a custom typanalyze function, which seemed like a more natural
idea (because it's really statistics for a single column).
But then I went to pgconf NYC in early December, and I spoke to Oleg
about various JSON-related things, and he mentioned they had been working
on this topic some time ago too, but did not have time to pursue it. So
he pointed me to a branch [1] developed by Nikita Glukhov.
I like Nikita's branch because it solved a couple of architectural issues
that I had been aware of but had only addressed in a rather hackish way myself.
I had a discussion with Nikita about his approach and what we can do to
move it forward. He's focusing on other JSON stuff, but he's OK with me
taking over and moving it forward. So here we go ...
Nikita rebased his branch recently, and I've kept improving it in various
ways (mostly a lot of comments and docs, plus some minor fixes and tweaks).
I've pushed my version with a couple extra commits in [2], but you can
ignore that except if you want to see what I added/changed.
Attached are a couple of patches adding the main part of the feature.
There are a couple more commits in the github repositories, adding more
advanced features - I'll briefly explain those later, but I'm not
including them here because they are optional and it'd be distracting
to discuss them now.
There are 6 patches in the series, but the magic mostly happens in parts
0001 and 0006. The other parts are mostly just adding infrastructure,
which may be a sizeable amount of code, but the changes are fairly
simple and obvious. So let's focus on 0001 and 0006.
To add JSON statistics we need to do two basic things - we need to build
the statistics and then we need to allow using them while estimating
conditions.
1) building stats
Let's talk about building the stats first. The patch does one of the
things I experimented with - 0006 adds a jsonb_typanalyze function, and
it associates it with the data type. The function extracts paths and
values from the JSONB document, builds the statistics, and then stores
the result in pg_statistic as a new stakind.
I've been planning to store the stats in pg_statistic too, but I'd been
considering using a custom data type. The patch does something far more
elegant - it simply uses stavalues to store an array of JSONB documents,
each describing stats for one path extracted from the sampled documents.
One (very simple) element of the array might look like this:
{"freq": 1,
"json": {
"mcv": {
"values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
"numbers": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
"width": 19,
"distinct": 10,
"nullfrac": 0,
"correlation": 0.10449},
"path": "$.\"a\"",
"freq_null": 0, "freq_array": 0, "freq_object": 0,
"freq_string": 0, "freq_boolean": 0, "freq_numeric": 0}
In this case there's only an MCV list (represented by two arrays, just
like in pg_statistic), but there might be another part with a histogram.
There's also the other columns we'd expect to see in pg_statistic.
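It's essentially the same information we already have for ordinary scalar
columns, e.g. (hypothetical table/column names):

  SELECT null_frac, avg_width, n_distinct,
         most_common_vals, most_common_freqs, correlation
    FROM pg_stats
   WHERE tablename = 'docs' AND attname = 'id';

except that here we get one such "row" per path, serialized as a JSONB
document.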
In principle, we need a pg_statistic entry for each path we extract from
the JSON documents and decide is interesting enough for estimation. There
are probably other ways to serialize/represent this, but I find using
JSONB for this pretty convenient because we need to deal with a mix of
data types (for the same path), and other JSON specific stuff. Storing
that in Postgres arrays would be problematic.
I'm sure there are plenty of open questions - for example I think we'll need
some logic to decide which paths to keep, otherwise the statistics can
get quite big if we're dealing with large / variable documents. We're
already doing something similar for MCV lists.
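To illustrate what such a heuristic might look at - the per-path frequencies
are fairly cheap to compute. For top-level keys it's roughly this (assuming
the documents are objects, table name made up for this message):

  SELECT key, count(*)::float8 / (SELECT count(*) FROM docs) AS freq
    FROM docs, jsonb_object_keys(v) AS key
   GROUP BY key
   ORDER BY freq DESC;

and we'd only keep paths with frequency above some threshold, similar to the
MCV cutoff.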
One of Nikita's patches not included in this thread allows "selective"
statistics, where you can define in advance a "filter" restricting which
parts are considered interesting by ANALYZE. That's interesting, but I
think we need some simple MCV-like heuristics first anyway.
Another open question is how deep the stats should be. Imagine documents
like this:
{"a" : {"b" : {"c" : {"d" : ...}}}}
The current patch builds stats for all possible paths:
"a"
"a.b"
"a.b.c"
"a.b.c.d"
and of course many of the paths will often have JSONB documents as
values, not simple scalar values. I wonder if we should limit the depth
somehow, and maybe build stats only for scalar values.
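To illustrate how quickly this grows, here's a rough query (not from the
patch, and assuming a made-up "docs" table where all documents are objects)
that enumerates the object-key paths the same way ANALYZE would have to:

  WITH RECURSIVE paths(path, value) AS (
      SELECT kv.key, kv.value
        FROM docs, jsonb_each(v) AS kv
      UNION ALL
      SELECT p.path || '.' || kv.key, kv.value
        FROM paths p,
             jsonb_each(CASE WHEN jsonb_typeof(p.value) = 'object'
                             THEN p.value ELSE '{}'::jsonb END) AS kv
  )
  SELECT path, count(*) AS ndocs
    FROM paths GROUP BY path ORDER BY ndocs DESC;

(Array elements are ignored here, while the patch descends into those too.)
With nested and variable documents the number of paths explodes pretty
quickly, which is another argument for some depth or frequency limit.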
2) applying the statistics
One of the problems is how to actually use the statistics. For @>
operator it's simple enough, because it's (jsonb @> jsonb) so we have
direct access to the stats. But often the conditions look like this:
jsonb_column ->> 'key' = 'value'
so the condition is actually on an expression, not on the JSONB column
directly. My solutions were pretty ugly hacks, but Nikita had a neat
idea - we can define a custom procedure for each operator, which is
responsible for "calculating" the statistics for the expression.
So in this case "->>" will have such "oprstat" procedure, which fetches
stats for the JSONB column, extracts stats for the "key" path. And then
we can use that for estimation of the (text = text) condition.
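As a concrete (made-up) example, the goal is that

  SELECT * FROM docs WHERE v ->> 'country' = 'usa';

gets estimated from the per-path stats for "country", pretty much as if the
user had created expression statistics manually, e.g. by

  CREATE INDEX ON docs ((v ->> 'country'));
  ANALYZE docs;

(or extended statistics on the same expression) - except without having to
create any such object in advance.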
This is what 0001 does, pretty much. We simply look for expression stats
provided by an index, extended statistics, and then - if oprstat is
defined for the operator - we try to derive the stats.
This opens other interesting opportunities for the future - one of the
parts adds oprstat for basic arithmetic operators, which allows deducing
statistics for expressions like (a+10) from statistics on column (a).
Which seems like a neat feature on its own, but it also interacts with
the extended statistics in somewhat non-obvious ways (especially when
estimating GROUP BY cardinalities).
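For example (hypothetical integer column), conditions and groupings like

  SELECT * FROM t WHERE a + 10 < 100;
  SELECT a + 10 FROM t GROUP BY 1;

could be estimated by transforming the MCV list / histogram of "a" (shifting
the values by 10), instead of falling back to defaults or requiring
expression statistics on (a + 10).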
Of course, there's a limit to what we can reasonably estimate - for
example, there may be statistical dependencies between paths, and this
patch does not even attempt to deal with that. In a way, this is similar
to correlation between columns, except that here we have a dynamic set
of columns, which makes it much much harder. We'd need something like
extended stats on steroids, pretty much.
I'm sure I've forgotten various important bits - many of them are
mentioned or explained in comments, but I'm sure others are not. And I'd
bet there are things I forgot about entirely or got wrong. So feel free
to ask.
In any case, I think this is a good first step towards improving our
estimates for JSONB columns.
regards
[1] https://github.com/postgrespro/postgres/tree/jsonb_stats
[2] https://github.com/tvondra/postgres/tree/jsonb_stats_rework
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 31, 2021 at 2:07 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,

For patch 1:

+ List *statisticsName = NIL; /* optional stats estimat. procedure */

I think if the variable is named estimatorName (or something similar), it
would be easier for people to grasp its purpose.

+ /* XXX perhaps full "statistics" wording would be better */
+ else if (strcmp(defel->defname, "stats") == 0)

I would recommend (stats sounds too general):

+ else if (strcmp(defel->defname, "statsestimator") == 0)

+ statisticsOid = ValidateStatisticsEstimator(statisticsName);

statisticsOid -> statsEstimatorOid

For get_oprstat():

+ }
+ else
+ return (RegProcedure) InvalidOid;

keyword else is not needed (considering the return statement in if block).

For patch 06:

+ /* FIXME Could be before the memset, I guess? Checking vardata->statsTuple. */
+ if (!data->statsTuple)
+ return false;

I would agree the check can be lifted above the memset call.

+ * XXX This does not really extract any stats, it merely allocates the struct?
+ */
+static JsonPathStats
+jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)

As comments says, I think allocJsonPathStats() would be better name for
the func.

+ * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
+ * jsonPathStatsGetLengthStats does?

I think `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check
should be added for jsonPathStatsGetArrayLengthStats().

To be continued ...
Hi,

+static JsonPathStats
+jsonStatsFindPathStats(JsonStats jsdata, char *path, int pathlen)

Stats appears twice in the method name. I think findJsonPathStats() should
suffice. It should check `if (jsdata->nullfrac >= 1.0)` as
jsonStatsGetPathStatsStr does.

+JsonPathStats
+jsonStatsGetPathStatsStr(JsonStats jsdata, const char *subpath, int subpathlen)

This func can be static, right ? I think findJsonPathStatsWithPrefix() would
be a better name for the func.

+ * XXX Doesn't this need ecape_json too?
+ */
+static void
+jsonPathAppendEntryWithLen(StringInfo path, const char *entry, int len)
+{
+ char *tmpentry = pnstrdup(entry, len);
+ jsonPathAppendEntry(path, tmpentry);

ecape_json() is called within jsonPathAppendEntry(). The XXX comment can be
dropped.

+jsonPathStatsGetArrayIndexSelectivity(JsonPathStats pstats, int index)

It seems getJsonSelectivityWithArrayIndex() would be a better name.

+ sel = scalarineqsel(NULL, operator,
+ operator == JsonbGtOperator ||
+ operator == JsonbGeOperator,
+ operator == JsonbLeOperator ||
+ operator == JsonbGeOperator,

Looking at the comment for scalarineqsel():

 * scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
 *
 * This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
 * The isgt and iseq flags distinguish which of the four cases apply.

It seems JsonbLtOperator doesn't appear in the call, can I ask why ?
On Sat, Jan 1, 2022 at 11:07 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> 0006-Add-jsonb-statistics-20211230.patch

Hi Tomas,

-CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
+CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)

https://cirrus-ci.com/task/6405547984420864

It looks like there is a Unicode BOM sequence in there, which is upsetting
our Windows testing but not the 3 Unixes (not sure why). Probably added by
an editor.
On Fri, 31 Dec 2021 at 22:07, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> The patch does something far more
> elegant - it simply uses stavalues to store an array of JSONB documents,
> each describing stats for one path extracted from the sampled documents.

Sounds good.

> I'm sure there's plenty open questions - for example I think we'll need
> some logic to decide which paths to keep, otherwise the statistics can
> get quite big, if we're dealing with large / variable documents. We're
> already doing something similar for MCV lists.
>
> One of Nikita's patches not included in this thread allow "selective"
> statistics, where you can define in advance a "filter" restricting which
> parts are considered interesting by ANALYZE. That's interesting, but I
> think we need some simple MCV-like heuristics first anyway.
>
> Another open question is how deep the stats should be. Imagine documents
> like this:
>
> {"a" : {"b" : {"c" : {"d" : ...}}}}
>
> The current patch build stats for all possible paths:
>
> "a"
> "a.b"
> "a.b.c"
> "a.b.c.d"
>
> and of course many of the paths will often have JSONB documents as
> values, not simple scalar values. I wonder if we should limit the depth
> somehow, and maybe build stats only for scalar values.

The user interface for designing filters sounds hard, so I'd hope we
can ignore that for now.

--
Simon Riggs
http://www.EnterpriseDB.com/
On 1/1/22 16:33, Zhihong Yu wrote:
> Hi,
> For patch 1:
>
> + List *statisticsName = NIL; /* optional stats estimat. procedure */
>
> I think if the variable is named estimatorName (or something similar),
> it would be easier for people to grasp its purpose.
>

I agree "statisticsName" might be too generic or confusing, but I'm not
sure "estimator" is an improvement. Because this is not an "estimator"
(in the sense of estimating selectivity). It "transforms" statistics to
match the expression.

> + /* XXX perhaps full "statistics" wording would be better */
> + else if (strcmp(defel->defname, "stats") == 0)
>
> I would recommend (stats sounds too general):
>
> + else if (strcmp(defel->defname, "statsestimator") == 0)
>
> + statisticsOid = ValidateStatisticsEstimator(statisticsName);
>
> statisticsOid -> statsEstimatorOid
>

Same issue with the "estimator" bit :-(

> For get_oprstat():
>
> + }
> + else
> + return (RegProcedure) InvalidOid;
>
> keyword else is not needed (considering the return statement in if block).
>

True, but this is how the other get_ functions in lsyscache.c do it.
Like get_oprjoin().

> For patch 06:
>
> + /* FIXME Could be before the memset, I guess? Checking
> vardata->statsTuple. */
> + if (!data->statsTuple)
> + return false;
>
> I would agree the check can be lifted above the memset call.
>

OK.

> + * XXX This does not really extract any stats, it merely allocates the
> struct?
> + */
> +static JsonPathStats
> +jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)
>
> As comments says, I think allocJsonPathStats() would be better name for
> the func.
>
> + * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
> + * jsonPathStatsGetLengthStats does?
>
> I think `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check
> should be added for jsonPathStatsGetArrayLengthStats().
>
> To be continued ...

OK. I'll see if Nikita has some ideas about the naming changes.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/1/22 22:16, Zhihong Yu wrote:
> Hi,
>
> +static JsonPathStats
> +jsonStatsFindPathStats(JsonStats jsdata, char *path, int pathlen)
>
> Stats appears twice in the method name. I think findJsonPathStats()
> should suffice.
> It should check `if (jsdata->nullfrac >= 1.0)`
> as jsonStatsGetPathStatsStr does.
>
> +JsonPathStats
> +jsonStatsGetPathStatsStr(JsonStats jsdata, const char *subpath, int
> subpathlen)
>
> This func can be static, right ?
> I think findJsonPathStatsWithPrefix() would be a better name for the func.
>
> + * XXX Doesn't this need ecape_json too?
> + */
> +static void
> +jsonPathAppendEntryWithLen(StringInfo path, const char *entry, int len)
> +{
> + char *tmpentry = pnstrdup(entry, len);
> + jsonPathAppendEntry(path, tmpentry);
>
> ecape_json() is called within jsonPathAppendEntry(). The XXX comment can
> be dropped.
>
> +jsonPathStatsGetArrayIndexSelectivity(JsonPathStats pstats, int index)
>
> It seems getJsonSelectivityWithArrayIndex() would be a better name.
>

Thanks. I'll think about the naming changes.

> + sel = scalarineqsel(NULL, operator,
> + operator == JsonbGtOperator ||
> + operator == JsonbGeOperator,
> + operator == JsonbLeOperator ||
> + operator == JsonbGeOperator,
>
> Looking at the comment for scalarineqsel():
>
> * scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
> *
> * This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
> * The isgt and iseq flags distinguish which of the four cases apply.
>
> It seems JsonbLtOperator doesn't appear in the call, can I ask why ?
>

Because the scalarineqsel signature is this

scalarineqsel(PlannerInfo *root, Oid operator,
              bool isgt, bool iseq,
              Oid collation,
              VariableStatData *vardata,
              Datum constval, Oid consttype)

so

/* is it greater or greater-or-equal */
isgt = operator == JsonbGtOperator ||
       operator == JsonbGeOperator

/* is it equality? */
iseq = operator == JsonbLeOperator ||
       operator == JsonbGeOperator,

So I think this is correct. A comment explaining this would be nice.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/5/22 21:13, Thomas Munro wrote:
> On Sat, Jan 1, 2022 at 11:07 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> 0006-Add-jsonb-statistics-20211230.patch
>
> Hi Tomas,
>
> -CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
> +CREATE OR REPLACE FUNCTION explain_jsonb(sql_query text)
>
> https://cirrus-ci.com/task/6405547984420864
>
> It looks like there is a Unicode BOM sequence in there, which is
> upsetting our Windows testing but not the 3 Unixes (not sure why).
> Probably added by an editor.
>

Thanks, fixed along with some whitespace issues.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- 0001-Add-pg_operator.oprstat-for-derived-operato-20220106.patch
- 0002-Add-stats_form_tuple-20220106.patch
- 0003-Add-symbolic-names-for-some-jsonb-operators-20220106.patch
- 0004-Add-helper-jsonb-functions-and-macros-20220106.patch
- 0005-Export-scalarineqsel-20220106.patch
- 0006-Add-jsonb-statistics-20220106.patch
On 1/5/22 21:22, Simon Riggs wrote:
> On Fri, 31 Dec 2021 at 22:07, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>
>> The patch does something far more
>> elegant - it simply uses stavalues to store an array of JSONB documents,
>> each describing stats for one path extracted from the sampled documents.
>>
>> I'm sure there's plenty open questions - for example I think we'll need
>> some logic to decide which paths to keep, otherwise the statistics can
>> get quite big, if we're dealing with large / variable documents. We're
>> already doing something similar for MCV lists.
>>
>> One of Nikita's patches not included in this thread allow "selective"
>> statistics, where you can define in advance a "filter" restricting which
>> parts are considered interesting by ANALYZE. That's interesting, but I
>> think we need some simple MCV-like heuristics first anyway.
>>
>> Another open question is how deep the stats should be. Imagine documents
>> like this:
>>
>> {"a" : {"b" : {"c" : {"d" : ...}}}}
>>
>> The current patch build stats for all possible paths:
>>
>> "a"
>> "a.b"
>> "a.b.c"
>> "a.b.c.d"
>>
>> and of course many of the paths will often have JSONB documents as
>> values, not simple scalar values. I wonder if we should limit the depth
>> somehow, and maybe build stats only for scalar values.
>
> The user interface for designing filters sounds hard, so I'd hope we
> can ignore that for now.
>

Not sure I understand. I wasn't suggesting any user-defined filtering,
but something done by default, similarly to what we do for regular MCV
lists, based on frequency. We'd include frequent paths while excluding
rare ones.

So no need for a user interface.

That might not work for documents with stable schema and a lot of
top-level paths, because all the top-level paths have 1.0 frequency.
But for documents with dynamic schema (different documents having
different schemas/paths) it might help.

Similarly for the non-scalar values - I don't think we can really keep
regular statistics on such values (for the same reason why it's not
enough for whole JSONB columns), so why to build/store that anyway.

Nikita did implement a way to specify custom filters using jsonpath,
but I did not include that into this patch series. And questions
regarding the interface were one of the reasons.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi!

I am glad that you found my very old patch interesting and started to
work on it. We failed to post it in 2016 mostly because we were not
satisfied with JSONB storage. Also we decided to wait for completion
of work on extended statistics as we thought that it could help us.
But in early 2017 we switched to SQL/JSON and forgot about this patch.

I think a custom datatype is necessary for better performance. With a
plain JSONB we need to do a lot of work for extraction of path stats:
- iterate through MCV/Histogram JSONB arrays
- cast numeric values to float, string to text etc.
- build PG arrays from extracted datums
- form pg_statistic tuple.

With a custom data type we could store the pg_statistic tuple unmodified
and use it without any overhead. But then we need to modify a bit
VariableStatData and several functions to pass additional nullfrac
corrections.

Maybe a simple record type (path text, data pg_statistic, ext_data jsonb)
would be enough.

Also there is an idea to store per-path stats separately in
pg_statistic_ext rows using expressions like (jb #> '{foo,bar}') as
stxexprs. This could also help the user to select which paths to analyze
simply by using some sort of CREATE STATISTICS. But it is really unclear
how to:
* store pg_statistic_ext rows from typanalyze
* form path expressions for array elements (maybe add new jsonpath
  operator)
* match various -> operator chains to stxexprs
* jsonpath-like languages need simple API for searching by stxexprs

Per-path statistics should only be collected for scalars. This can be
enabled by the flag JsonAnalyzeContext.scalarsOnly. But there is a
problem: the computed MCVs and histograms will be broken and we will not
be able to use them for queries like (jb > const) in the general case.
Also we will not be able to use them internally in scalarineqsel() and
var_eq_const() (see jsonSelectivity()). Instead, we will have to
implement our own estimator functions for JSONB comparison operators
that will correctly use our hacked MCVs and histograms (and maybe not
all cases will be supported; for example, comparison to scalars only).

It's possible to replace objects and arrays with empty ones when
scalarsOnly is set to keep correct frequencies of non-scalars.
But there is an old bug in JSONB comparison: empty arrays are placed
before other values in the JSONB sort order, although they should go
after all scalars. So we need also to distinguish empty and non-empty
arrays here.

I tried to fix a major part of the places marked as XXX and FIXME, the
new version of the patches is attached. There are a lot of changes, you
can see them in a step-by-step form in the corresponding branch
jsonb_stats_20220122 in our GitHub repo [1].

Below is the explanation of the fixed XXXs:

Patch 0001

src/backend/commands/operatorcmds.c:

> XXX Maybe not the right name, because "estimator" implies we're
> calculating selectivity. But we're actually deriving statistics for
> an expression.

Renamed to "Derivator".

> XXX perhaps full "statistics" wording would be better

Renamed to "statistics".

src/backend/utils/adt/selfuncs.c:

> examine_operator_expression():
> XXX Not sure why this returns bool - we ignore the return value
> anyway. We might as well return the calculated vardata (or NULL).

Oprstat was changed to return void.

> XXX Not sure what to do about recursion - there can be another OpExpr
> in one of the arguments, and we should call this recursively from the
> oprstat procedure. But that's not possible, because it's marked as
> static.

Oprstat call chain: get_restriction_variable() => examine_variable() =>
examine_operator_expression(). The main thing here is that an OpExpr
with oprstat acts like an ordinary Var.

> examine_variable():
> XXX Shouldn't this put more restrictions on the OpExpr? E.g. that
> one of the arguments has to be a Const or something?

This is simply a responsibility of oprstat.

Patch 0004

> XXX Not sure we want to add these functions to jsonb.h, which is the
> public API. Maybe it belongs rather to jsonb_typanalyze.c or
> elsewhere, closer to how it's used?

Maybe it needs to be moved to the new file jsonb_utils.h. I think these
functions become very useful when we start to build JSONBs with a
predefined structure.

> pushJsonbValue():
> XXX I'm not quite sure why we actually do this? Why do we need to
> change how JsonbValue is converted to Jsonb for the statistics patch?

Scalars in JSONB are encoded as one-element pseudo-array containers. So
when we are inserting binary JsonbValues, that were initialized directly
from JSONB datums, inside other non-empty JSONB in pushJsonbValue(), all
values except scalars are inserted as expected but scalars become
[scalar]. So we need to extract scalar values in the caller functions or
in pushJsonbValue(). I think autoextraction in pushJsonbValue() is
better. This case simply was not used before the JSONB stats
introduction.

Patch 0006

src/backend/utils/adt/jsonb_selfuncs.c

> jsonPathStatsGetSpecialStats()
> XXX This does not really extract any stats, it merely allocates the
> struct?

Renamed with "Alloc" suffix.

> jsonPathStatsGetArrayLengthStats()
> XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to
> what jsonPathStatsGetLengthStats does?

"length" stats were collected inside parent paths, but "array_length"
and "avg_array_length" stats were collected inside child array paths.
This resulted in inconsistencies in TypeFreq checks. I have removed
"length" stats, moved "array_length" and "avg_array_length" to the
parent path, and added separate "object_length" stats. TypeFreq checks
become consistent.

> XXX Seems to do essentially what jsonStatsFindPath, except that it
> also considers jsdata->prefix. Seems fairly easy to combine those
> into a single function.

I don't think that it would be better to combine those functions into
one with a considerPrefix flag, because jsonStatsFindPathStats() is some
kind of low-level function which is called only in two places. In all
other places considerPrefix will be true. Also
jsonStatsFindPathStatsStr() is exported in jsonb_selfuncs.h to give
external jsonpath-like query operators the ability to use JSON
statistics.

> jsonPathStatsCompare()
> XXX Seems a bit convoluted to first cast it to Datum, then Jsonb ...
> Datum const *pdatum = pv2;
> Jsonb *jsonb = DatumGetJsonbP(*pdatum);

The problem of simply using 'jsonb = *(Jsonb **) pv2' is that
DatumGetJsonbP() may deTOAST datums.

> XXX Not sure about this? Does empty path mean global stats?

Empty "path" is simply a sign of invalid stats data.

> jsonStatsConvertArray()
> FIXME Does this actually work on all 32/64-bit systems? What if typid
> is FLOAT8OID or something? Should look at TypeCache instead, probably.

Used get_typlenbyvalalign() instead of hardcoded values.

> jsonb_stats()
> XXX It might be useful to allow recursion, i.e.
> get_restriction_variable might derive statistics too.
> I don't think it does that now, right?

get_restriction_variable() already might derive statistics: it calls
examine_variable() on both its operands, and examine_variable() calls
examine_operator_expression().

> XXX Could we also get varonleft=false in useful cases?

All supported JSONB operators have the signature jsonb OP arg. If
varonleft = false, then we need to derive stats for an expression like
'{"foo": "bar"}' -> text_column having stats for text_column. It is
possible to implement too, but it is not related to JSONB stats and
needs to be done in a separate patch.

> jsonSelectivityContains():
> XXX This really needs more comments explaining the logic.

I have refactored this function and added comments.

> jsonGetFloat4():
> XXX Not sure assert is a sufficient protection against different
> types of JSONB values to be passed in.

I have refactored this function by passing a default value for the
non-numeric JSONB case.

> jsonPathAppendEntryWithLen()
> XXX Doesn't this need ecape_json too?

Comment is removed, because jsonPathAppendEntry() is called inside.

> jsonPathStatsGetTypeFreq()
> FIXME This is really hard to read/understand, with two nested ternary
> operators.

I have refactored this place.

> XXX Seems more like an error, no? Why ignore it?

It's not an error, it's a request for a non-numeric type. Length values
are always numeric, and the frequency of non-numeric values is 0. The
possible case when it could happen is jsonpath '$.size() == "foo"'. The
estimator for operator == will check the frequency of strings in .size()
and it will be 0.

> jsonPathStatsFormTuple()
> FIXME What does this mean?

There is no need to transform the ordinary root path stats tuple, it can
be simply copied.

jsonb_typanalyze.c

> XXX We need entry+lenth because JSON path elements may contain null
> bytes, I guess?

'entry' can be not zero-terminated when it is pointing to JSONB keys, so
'len' is a necessary field. 'len' is also used for faster entry
comparison, to distinguish array entries ('len' == -1).

> XXX Sould be JsonPathEntryMatch as it deals with JsonPathEntry nodes
> not whole paths, no?
> XXX Again, maybe JsonPathEntryHash would be a better name?

Functions renamed using the JsonPathEntry prefix, JsonPath typedef
removed.

> JsonPathMatch()
> XXX Seems a bit silly to return int, when the return statement only
> really returns bool (because of how it compares paths). It's not really
> a comparator for sorting, for example.

This function is an implementation of HashCompareFunc and it needs to
return int.

> jsonAnalyzeJson()
> XXX The name seems a bit weird, with the two json bits.

Renamed to jsonAnalyzeCollectPaths(). The two similar functions were
renamed too.

> jsonAnalyzeJson():
> XXX The mix of break/return statements in this block is really
> confusing.

I have refactored this place using only breaks.

> XXX not sure why we're doing this?

Manual recursion into containers by creating a child iterator together
with the skipNested=true flag is used to give jsonAnalyzeJsonValue() the
ability to access jbvBinary containers.

> compute_json_stats()
> XXX Not sure what the first branch is doing (or supposed to)?
> XXX It's not immediately clear why this is (-1) and not simply
> NULL. It crashes, so presumably it's used to tweak the behavior,
> but it's not clear why/how, and it affects place that is pretty
> far away, and so not obvious. We should use some sort of flag
> with a descriptive name instead.
> XXX If I understand correctly, we simply collect all paths first,
> without accumulating any Values. And then in the next step we
> process each path independently, probably to save memory (we
> don't want to accumulate all values for all paths, with a lot
> of duplicities).

There are two variants of stats collection:
* single-pass - collect all values for all paths
* multi-pass - collect only values for one path at each pass

The first variant can consume too much memory (jsonb iteration produces
a lot of garbage etc.), but works faster than the second. The first
'if (false)' is used for manual selection of one of these variants.
This selection should be controlled by some user-specified option (maybe
a GUC), or the first variant can be simply removed.

jsonAnalyzeJson()'s parameter of type JsonPathAnlStats * determines
which paths we need to consider for the value collection:
* NULL - collect values for all paths
* -1 - do not collect any values
* stats - collect values only for a given path

The last variant really is unused because we already have another
function jsonAnalyzeJsonPath(), which is optimized for selective path
values collection (it uses object key accessor JSONB functions instead
of full JSONB iteration). I have replaced this strange parameter with a
simple boolean flag.

> XXX Could the parameters be different on other platforms?

Used get_typlenbyvalalign(JSONBOID) instead of hardcoded values.

> jsonAnalyzePathValues()
> XXX Do we need to initialize all slots?

I have copied here the following comment from extended_stats.c:

/*
 * The fields describing the stats->stavalues[n] element types default
 * to the type of the data being analyzed, but the type-specific
 * typanalyze function can change them if it wants to store something
 * else.
 */

> XXX Not sure why we divide it by the number of json values?

We divide the counts of lengths by the total number of json values to
compute a correct nullfrac. I.e. not all input jsons have lengths, the
length of scalar jsons is NULL.

[1] https://github.com/postgrespro/postgres/tree/jsonb_stats_20220122

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On 1/23/22 01:24, Nikita Glukhov wrote:
> Hi!
>
> I am glad that you found my very old patch interesting and started to
> work on it. We failed to post it in 2016 mostly because we were not
> satisfied with JSONB storage. Also we decided to wait for completion
> of work on extended statistics as we thought that it could help us.
> But in early 2017 we switched to SQL/JSON and forgot about this patch.
>

Understood. Let's see how feasible this idea is and if we can move this
forward.

>
> I think custom datatype is necessary for better performance. With a
> plain JSONB we need to do a lot of work for extraction of path stats:
> - iterate through MCV/Histogram JSONB arrays
> - cast numeric values to float, string to text etc.
> - build PG arrays from extracted datums
> - form pg_statistic tuple.
>
> With a custom data type we could store pg_statistic tuple unmodified
> and use it without any overhead. But then we need modify a bit
> VariableStatData and several functions to pass additional nullfrac
> corrections.
>

I'm not against evaluating/exploring alternative storage formats, but my
feeling is the real impact on performance will be fairly low. At least I
haven't seen this as very expensive while profiling the patch so far. Of
course, I may be wrong, and it may be more significant in some cases.

> Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
> would be enough.
>

Maybe, but then you still need to store a bunch of those, right? So
either an array (likely toasted) or 1:M table. I'm not sure it's going
to be much cheaper than JSONB.

I'd suggest we focus on what we need to store first, which seems like
the primary question, and worry about the exact storage format then.

> Also there is an idea to store per-path separately in pg_statistic_ext
> rows using expression like (jb #> '{foo,bar}') as stxexprs. This could
> also help user to select which paths to analyze simply by using some
> sort of CREATE STATISTICS. But it is really unclear how to:
> * store pg_statistic_ext rows from typanalyze
> * form path expressions for array elements (maybe add new jsonpath
> operator)
> * match various -> operator chains to stxexprs
> * jsonpath-like languages need simple API for searching by stxexprs
>

Sure, you can do statistics on expressions, right? Of course, if that
expression produces JSONB value, that's not very useful at the moment.
Maybe we could have two typanalyze functions - one for regular analyze,
one for extended statistics?

That being said, I'm not sure extended stats are a good match for this.
My feeling was we should collect these stats for all JSONB columns,
which is why I argued for putting that in pg_statistic.

>
>
> Per-path statistics should only be collected for scalars. This can be
> enabled by flag JsonAnalyzeContext.scalarsOnly. But there are is a
> problem: computed MCVs and histograms will be broken and we will not be
> able to use them for queries like (jb > const) in general case. Also
> we will not be and use internally in scalarineqsel() and var_eq_const()
> (see jsonSelectivity()). Instead, we will have to implement own
> estimator functions for JSONB comparison operators that will correctly
> use our hacked MCVs and histograms (and maybe not all cases will be
> supported; for example, comparison to scalars only).
>

Yeah, but maybe that's an acceptable trade-off? I mean, if we can
improve estimates for most clauses, and there's some number of clauses
that are estimated just like without stats, that's still an improvement,
right?
> It's possible to replace objects and arrays with empty ones when
> scalarsOnly is set to keep correct frequencies of non-scalars.
> But there is an old bug in JSONB comparison: empty arrays are placed
> before other values in the JSONB sort order, although they should go
> after all scalars. So we need also to distinguish empty and non-empty
> arrays here.
>

Hmmm ...

>
>
> I tried to fix a major part of places marked as XXX and FIXME, the new
> version of the patches is attached. There are a lot of changes, you
> can see them in a step-by-step form in the corresponding branch
> jsonb_stats_20220122 in our GitHub repo [1].
>

Thanks! I'll go through the changes soon.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On 1/23/22 01:24, Nikita Glukhov wrote:
> > Hi!
> >
> > I am glad that you found my very old patch interesting and started to
> > work on it. We failed to post it in 2016 mostly because we were not
> > satisfied with JSONB storage. Also we decided to wait for completion
> > of work on extended statistics as we thought that it could help us.
> > But in early 2017 we switched to SQL/JSON and forgot about this patch.
> >
>
> Understood. Let's see how feasible this idea is and if we can move this
> forward.
>
> >
> > I think custom datatype is necessary for better performance. With a
> > plain JSONB we need to do a lot of work for extraction of path stats:
> > - iterate through MCV/Histogram JSONB arrays
> > - cast numeric values to float, string to text etc.
> > - build PG arrays from extracted datums
> > - form pg_statistic tuple.
> >
> > With a custom data type we could store pg_statistic tuple unmodified
> > and use it without any overhead. But then we need modify a bit
> > VariableStatData and several functions to pass additional nullfrac
> > corrections.
> >
>
> I'm not against evaluating/exploring alternative storage formats, but my
> feeling is the real impact on performance will be fairly low. At least I
> haven't seen this as very expensive while profiling the patch so far. Of
> course, I may be wrong, and it may be more significant in some cases.
>
> > Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
> > would be enough.
> >
> Maybe, but then you still need to store a bunch of those, right? So
> either an array (likely toasted) or 1:M table. I'm not sure it's goiing
> to be much cheaper than JSONB.
>
> I'd suggest we focus on what we need to store first, which seems like
> tha primary question, and worry about the exact storage format then.
>
>
> > Also there is an idea to store per-path separately in pg_statistic_ext
> > rows using expression like (jb #> '{foo,bar}') as stxexprs. This could
> > also help user to select which paths to analyze simply by using some
> > sort of CREATE STATISTICS. But it is really unclear how to:
> > * store pg_statistic_ext rows from typanalyze
> > * form path expressions for array elements (maybe add new jsonpath
> > operator)
> > * match various -> operator chains to stxexprs
> > * jsonpath-like languages need simple API for searching by stxexprs
> >
>
> Sure, you can do statistics on expressions, right? Of course, if that
> expression produces JSONB value, that's not very useful at the moment.
> Maybe we could have two typanalyze functions - one for regular analyze,
> one for extended statistics?
>
> That being said, I'm not sure extended stats are a good match for this.
> My feeling was we should collect these stats for all JSONB columns,
> which is why I argued for putting that in pg_statistic.
>
> >
> >
> > Per-path statistics should only be collected for scalars. This can be
> > enabled by flag JsonAnalyzeContext.scalarsOnly. But there are is a
> > problem: computed MCVs and histograms will be broken and we will not be
> > able to use them for queries like (jb > const) in general case. Also
> > we will not be and use internally in scalarineqsel() and var_eq_const()
> > (see jsonSelectivity()). Instead, we will have to implement own
> > estimator functions for JSONB comparison operators that will correctly
> > use our hacked MCVs and histograms (and maybe not all cases will be
> > supported; for example, comparison to scalars only).
> >
>
> Yeah, but maybe that's an acceptable trade-off? I mean, if we can
> improve estimates for most clauses, and there's a some number of clauses
> that are estimated just like without stats, that's still an improvement,
> right?
>
> > It's possible to replace objects and arrays with empty ones when
> > scalarsOnly is set to keep correct frequencies of non-scalars.
> > But there is an old bug in JSONB comparison: empty arrays are placed
> > before other values in the JSONB sort order, although they should go
> > after all scalars. So we need also to distinguish empty and non-empty
> > arrays here.
> >
>
> Hmmm ...
> >
> >
> > I tried to fix a major part of places marked as XXX and FIXME, the new
> > version of the patches is attached. There are a lot of changes, you
> > can see them in a step-by-step form in the corresponding branch
> > jsonb_stats_20220122 in our GitHub repo [1].
> >
>
> Thanks! I'll go through the changes soon.
>
>
For the last few days, I was trying to understand these patches, and based
on Tomas's suggestion, I was doing some performance tests.

With the attached .SQL file, I can see that analyze is taking more time
with these patches.

Setup:
autovacuum=off
rest all are default settings.

Insert the attached file with and without the patch to compare the time
taken by analyze.

With json patches:
postgres=# analyze test ;
ANALYZE
Time: 28464.062 ms (00:28.464)
postgres=#
postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
pg_size_pretty
----------------
328 kB
(1 row)
Without json patches:
postgres=# analyze test ;
ANALYZE
Time: 120.864 ms
postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
pg_size_pretty
----------------
272 kB
(1 row)

I haven't found the root cause of this but I feel that this time is due to
a loop over all the paths. In my test data, there is a total of 951
different paths. While doing analysis, first we check all the sample rows
(30000) and we collect all the different paths (here 951), and after that
for every single path, we loop over all the sample rows again to collect
stats for that particular path. I feel that these loops might be taking
time.

I will run perf and will try to find out the root cause of this.

Apart from this performance issue, I haven't found any crashes or issues.

--
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, 6 Jan 2022 at 14:56, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Not sure I understand. I wasn't suggesting any user-defined filtering,
> but something done by default, similarly to what we do for regular MCV
> lists, based on frequency. We'd include frequent paths while excluding
> rare ones.
>
> So no need for a user interface.

Not sure, but I think he was agreeing with you. That we should figure out
the baseline behaviour and get it as useful as possible first, then later
look at adding some way to customize it.

I agree -- I don't think the user interface will be hard technically, but
I think it will require some good ideas and there could be lots of
bikeshedding. And a lot of users will never even use it anyway, so it's
important to get the defaults as useful as possible.

> Similarly for the non-scalar values - I don't think we can really keep
> regular statistics on such values (for the same reason why it's not
> enough for whole JSONB columns), so why to build/store that anyway.

For a default behaviour I wonder if it wouldn't be better to just flatten
and extract all the scalars. So if there's no matching path then at least
we have some way to estimate how often a scalar appears anywhere in the
json document.

That amounts to assuming the user knows the right path to find a given
scalar and there isn't a lot of overlap between keys. So it would at least
do something useful if you have something like {gender: female, name:
{first: nancy, last: reagan}, state: california, country: usa}. It might
get things slightly wrong if you have some people named "georgia" or have
names that can be first or last names. But it would generally be doing
something more or less useful as long as they look for "usa" in the
country field and "male" in the gender field. If they looked for "male" in
the $.name.first path it would give bad estimates, but assuming they know
their data structure they won't be doing that.

--
greg
On 1/25/22 17:56, Mahendra Singh Thalor wrote:
>
> ...
>
> For the last few days, I was trying to understand these patches, and
> based on Tomas's suggestion, I was doing some performance tests.
>
> With the attached .SQL file, I can see that analyze is taking more time
> with these patches.
>
> *Setup:*
> autovacuum=off
> rest all are default settings.
>
> Insert attached file with and without the patch to compare the time
> taken by analyze.
>
> *With json patches:*
> postgres=# analyze test ;
> ANALYZE
> Time: *28464.062 ms (00:28.464)*
> postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
> pg_size_pretty
> ----------------
> 328 kB
> (1 row)
>
> *Without json patches:*
> postgres=# analyze test ;
> ANALYZE
> *Time: 120.864 ms*
> postgres=# SELECT pg_size_pretty( pg_total_relation_size('pg_catalog.pg_statistic') );
> pg_size_pretty
> ----------------
> 272 kB
>
> I haven't found the root cause of this but I feel that this time is due
> to a loop over all the paths.
> In my test data, there is a total of 951 different paths. While doing
> analysis, first we check all the sample rows (30000) and we collect all
> the different paths (here 951), and after that for every single path,
> we loop over all the sample rows again to collect stats for that
> particular path. I feel that these loops might be taking time.
>
> I will run perf and will try to find out the root cause of this.

Thanks, I've been doing some performance tests too, and you're right it takes quite a bit of time.

I wanted to compare how the timing changes with the complexity of the JSON documents (nesting, number of keys, ...) so I wrote a simple python script to generate random JSON documents with different parameters - see the attached json-generate.py script. It's a bit crude, but it generates synthetic documents with a chosen number of levels, keys per level, distinct key values, etc.

The generated documents are loaded directly into a "json_test" database, into a table "test_table" with a single jsonb column called "v". Tweaking this to connect to a different database, or to just dump the generated documents to a file, should be trivial.

The attached bash script runs the data generator for a couple of combinations, and then measures how long it takes to analyze the table, how large the statistics are (in a rather crude way), etc.

The results look like this (the last two columns are analyze duration in milliseconds, for master and with the patch):

 levels   keys   unique keys    paths   master   patched
 ----------------------------------------------------------
      1      1             1        2      153       122
      1      1          1000     1001      134      1590
      1      8             8        9      157       367
      1      8          1000     1001      155      1838
      1     64            64       65      189      2298
      1     64          1000     1001       46      9322
      2      1             1        3      237       197
      2      1          1000    30580      152     46468
      2      8             8       73      245      1780

So yeah, it's significantly slower - in most cases not as much as you observed, but still an order of magnitude slower than master.

For the size of the statistics, it's similar:

 levels   keys   unique keys    paths   table size   master    patched
 ------------------------------------------------------------------
      1      1             1        2      1843200    16360      24325
      1      1          1000     1001      1843200    16819    1425400
      1      8             8        9      4710400    28948      88837
      1      8          1000     1001      6504448    42694    3915802
      1     64            64       65     30154752   209713     689842
      1     64          1000     1001     49086464     1093    7755214
      2      1             1        3      2572288    24883      48727
      2      1          1000    30580      2572288    11422   26396365
      2      8             8       73     23068672   164785     862258

This is measured by dumping pg_statistic for the column, so in the database it might be compressed etc.
It's larger, but that's somewhat expected because we simply store more detailed stats. The size grows with the number of paths extracted - which is expected, of course.

If you noticed this doesn't show data for additional combinations (e.g. 2 levels, 8 keys and 1000 distinct key values), then that's the bad news - that takes ages (multiple minutes) and then it gets killed by the OOM killer because it eats gigabytes of memory.

I agree the slowness is largely due to extracting all paths and then processing them one by one - which means we have to loop over the tuples over and over. In this case there are about 850k distinct paths extracted, so we do ~850k loops over 30k tuples. That's gotta take time.

I don't know what exactly to do about this, but I already mentioned we may need to pick a subset of paths to keep, similarly to how we pick items for MCV. I mean, if we only saw a path once or twice, it's unlikely to be interesting enough to build stats for it. I haven't tried, but I'd bet most of the 850k paths might be ignored like this.

The other thing we might do is making the loops more efficient. For example, we might track which documents contain each path (by a small bitmap or something), so that in the loop we can skip rows that don't contain the path we're currently processing. Or something like that.

Of course, this can't eliminate all the overhead - we're building more stats and that has a cost. In the "common" case of a stable "fixed" schema with the same paths in all documents we'll still need to do a loop for each of them. So it's bound to be slower than master. Which probably means it's a bad idea to do this for all JSONB columns, because in many cases the extra stats are not needed, so the extra analyze time would be a waste. So I guess we'll need some way to enable this only for selected columns ... I argued against the idea of implementing this as extended statistics in the first message, but it's a reasonably nice way to do such stuff (expression stats are a precedent).

> Apart from this performance issue, I haven't found any crashes or issues.
>

Well, I haven't seen any crashes either, but as I mentioned, for complex documents (2 levels, many distinct keys) the ANALYZE starts consuming a lot of memory and may get killed by OOM. For example if you generate documents like this

./json-generate.py 30000 2 8 1000 6 1000

and then run ANALYZE, that'll take ages and it very quickly gets into a situation like this (generated from gdb by calling MemoryContextStats on TopMemoryContext):

-------------------------------------------------------------------------
TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
...
TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
...
Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
-------------------------------------------------------------------------

Yes, that's the backend using 3GB of memory, out of which 1.6GB is in the "analyze path table" context, 400MB in "analyze" and 900MB in "analyze column" contexts. I mean, that seems a bit excessive. And it grows over time, so after a while my laptop gives up and kills the backend.

I'm not sure if it's a memory leak (which would be fixable), or it's due to keeping stats for all the extracted paths. I mean, in this particular case we have 850k paths - even if the stats are just 1kB per path, that's 850MB. This requires more investigation.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 2/4/22 03:47, Tomas Vondra wrote:
> ./json-generate.py 30000 2 8 1000 6 1000

Sorry, this should be (different order of parameters):

./json-generate.py 30000 2 1000 8 6 1000

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04.02.2022 05:47, Tomas Vondra wrote:
> On 1/25/22 17:56, Mahendra Singh Thalor wrote:
>> For the last few days, I was trying to understand these patches, and based on Tomas's suggestion, I was doing some performance tests.
>> With the attached .SQL file, I can see that analyze is taking more time with these patches.
>> I haven't found the root cause of this but I feel that this time is due to a loop over all the paths.
>> In my test data, there is a total of 951 different paths. While doing analysis, first we check all the sample rows (30000) and we collect all the different paths (here 951), and after that for every single path, we loop over all the sample rows again to collect stats for that particular path. I feel that these loops might be taking time.
>
> Thanks, I've been doing some performance tests too, and you're right it takes quite a bit of time.

That is absolutely not surprising, I have warned about poor performance in cases with a large number of paths.

> I agree the slowness is largely due to extracting all paths and then processing them one by one - which means we have to loop over the tuples over and over. In this case there are about 850k distinct paths extracted, so we do ~850k loops over 30k tuples. That's gotta take time.
>
> I don't know what exactly to do about this, but I already mentioned we may need to pick a subset of paths to keep, similarly to how we pick items for MCV. I mean, if we only saw a path once or twice, it's unlikely to be interesting enough to build stats for it. I haven't tried, but I'd bet most of the 850k paths might be ignored like this.
>
> The other thing we might do is making the loops more efficient. For example, we might track which documents contain each path (by a small bitmap or something), so that in the loop we can skip rows that don't contain the path we're currently processing. Or something like that.
>
>> Apart from this performance issue, I haven't found any crashes or issues.
>
> Well, I haven't seen any crashes either, but as I mentioned, for complex documents (2 levels, many distinct keys) the ANALYZE starts consuming a lot of memory and may get killed by OOM. For example if you generate documents like this
>
> ./json-generate.py 30000 2 8 1000 6 1000
>
> and then run ANALYZE, that'll take ages and it very quickly gets into a situation like this (generated from gdb by calling MemoryContextStats on TopMemoryContext):
>
> -------------------------------------------------------------------------
> TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
> ...
> TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
> PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
> Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
> Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
> Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
> Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
> JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
> Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
> ...
> Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
> -------------------------------------------------------------------------
>
> Yes, that's the backend using 3GB of memory, out of which 1.6GB is in the "analyze path table" context, 400MB in "analyze" and 900MB in "analyze column" contexts. I mean, that seems a bit excessive. And it grows over time, so after a while my laptop gives up and kills the backend.
>
> I'm not sure if it's a memory leak (which would be fixable), or it's due to keeping stats for all the extracted paths. I mean, in this particular case we have 850k paths - even if the stats are just 1kB per path, that's 850MB. This requires more investigation.
Thank you for the tests and investigation.

I have tried to reduce memory consumption and speed up row scanning:

1. The "JSON analyze path table" context contained a ~1KB JsonPathAnlStats structure per JSON path in the global hash table. I have moved JsonPathAnlStats to the stack of compute_json_stats(), and the hash table now consumes ~70 bytes per path.

2. I have fixed copying of the resulting JSONB stats into the caller's context, which reduced the size of the "Analyze Column" context.

3. I have optimized memory consumption of the single-pass algorithm by storing only value lists in the non-temporary context. That helped to execute the "2 64 64" test case in 30 seconds. Single-pass is a bit faster in non-TOASTed cases, and much faster in TOASTed ones. But it consumes much more memory and still goes to OOM in the cases with more than ~100k paths.

4. I have implemented per-path document lists/bitmaps, which really speed up the "2 8 1000" case (a rough sketch of the idea is below). The list is converted into a bitmap when it becomes larger than the bitmap.

5. Also I have fixed some bugs.

You can find all these changes in commit form in our GitHub repository on the branch jsonb_stats_20220310 [1].

Updated results of the test:

 levels  keys  uniq keys     paths   master   multi-pass    single-pass
                                         ms     ms     MB      ms    MB
 -----------------------------------------------------------------------
      1     1          1         2      153    122     10      82    14
      1     1       1000      1001      134    105     11      78    38
      1     8          8         9      157    384     19     328    32
      1     8       1000      1001      155    454     23     402    72
      1    64         64        65      129   2889     45    2386   155
      1    64       1000      1001      158   3990     94    1447   177
      2     1          1         3      237    147     10      91    16
      2     1       1000     30577      152    264     32     394   234
      2     8          8        72      245   1943     37    1692   139
      2     8       1000    852333      152   9175    678           OOM
      2    64         64      4161     1784  ~1 hour   53   30018  1750
      2    64       1000   1001001     4715  ~4 hours 1600           OOM

The last two multi-pass results are too slow because the JSONBs become TOASTed. For measuring master in these tests, I disabled the WIDTH_THRESHOLD check which skips TOASTed values > 1kB.

Next, I am going to try to disable all-paths collection and implement collection of most common paths (and/or hashed paths maybe).

[1] https://github.com/postgrespro/postgres/tree/jsonb_stats_20220310
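To illustrate the idea in point 4, here is a rough Python sketch of the per-path document set (the real implementation is in C and differs in details, e.g. the exact per-entry list size):

class PathDocSet:
    """Tracks which sampled rows contain a given path: a plain list of row
    numbers while small, switched to a bitmap once the list would take more
    space than the bitmap."""

    LIST_ENTRY_SIZE = 4                    # assumed bytes per list entry

    def __init__(self, total_rows):
        self.total_rows = total_rows
        self.rows = []                     # small representation
        self.bitmap = None                 # large representation, 1 bit/row

    def add(self, row):
        if self.bitmap is not None:
            self.bitmap[row // 8] |= 1 << (row % 8)
            return
        self.rows.append(row)
        # convert once the list outgrows the bitmap
        if len(self.rows) * self.LIST_ENTRY_SIZE > (self.total_rows + 7) // 8:
            self.bitmap = bytearray((self.total_rows + 7) // 8)
            for r in self.rows:
                self.bitmap[r // 8] |= 1 << (r % 8)
            self.rows = []

    def contains(self, row):
        if self.bitmap is not None:
            return bool(self.bitmap[row // 8] & (1 << (row % 8)))
        return row in self.rows

During the per-path pass we can then skip any sample row for which contains() is false, instead of deforming and scanning the whole document again.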
-- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
- 0001-Add-pg_operator.oprstat-for-derived-operator-statist-20220310.patch
- 0002-Add-stats_form_tuple-20220310.patch
- 0003-Add-symbolic-names-for-some-jsonb-operators-20220310.patch
- 0004-Add-helper-jsonb-functions-and-macros-20220310.patch
- 0005-Export-scalarineqsel-20220310.patch
- 0006-Add-jsonb-statistics-20220310.patch
>
>
>
> On 2/4/22 03:47, Tomas Vondra wrote:
> > ./json-generate.py 30000 2 8 1000 6 1000
>
> Sorry, this should be (different order of parameters):
>
> ./json-generate.py 30000 2 1000 8 6 1000
>
Thanks, Tomas for this test case.
Hi Hackers,
Below is the summary:
Point 1) analyze is taking a lot of time for large documents:
For large JSON documents, analyze took a very long time compared to the current head. For reference, I am attaching the test file (./json-generate.py 30000 2 1000 8 6 1000).
Head: analyze test ; Time: 120.864 ms
With patch: analyze test ; Time: more than 2 hours
analyze is taking a very long time because with these patches we first iterate over all sample rows (in the above case 30000), and we store all the paths (here around 850k paths).
Point 2) memory consumption increases rapidly for large documents:
In the above test case, there are a total of 851k paths, and to keep stats for one path we allocate 1120 bytes.
Memory for 1 path to keep stats: 1120 ~ 1 KB
(sizeof(JsonValueStats) = 1120 from "Analyze Column")
Total memory for all paths: 852689 * 1120 = 955011680 ~ 955 MB
On top of that, each path allocates some more memory while being analyzed, based on frequency and other details.
To keep all entries (851k paths) in the hash table, we use around 1 GB of memory, which is also very large.
Point 3) Review comment noticed by Tomas Vondra:
+ oldcxt = MemoryContextSwitchTo(ctx->stats->anl_context);
+ pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+ MemoryContextSwitchTo(oldcxt);
Above should be:
+ oldcxt = MemoryContextSwitchTo(ctx->mcxt);
+ pstats->stats = jsonAnalyzeBuildPathStats(pstats);
+ MemoryContextSwitchTo(oldcxt);
+/*
+ * jsonAnalyzeCollectPath
+ * Extract a single path from JSON documents and collect its values.
+ */
+static void
+jsonAnalyzeCollectPath(JsonAnalyzeContext *ctx, Jsonb *jb, void *param)
+{
+ JsonPathAnlStats *pstats = (JsonPathAnlStats *) param;
+ JsonbValue jbvtmp;
+ JsonbValue *jbv = JsonValueInitBinary(&jbvtmp, jb);
+ JsonPathEntry *path;
+ JsonPathEntry **entries;
+ int i;
+
+ entries = palloc(sizeof(*entries) * pstats->depth);
+
+ /* Build entry array in direct order */
+ for (path = &pstats->path, i = pstats->depth - 1;
+ path->parent && i >= 0;
+ path = path->parent, i--)
+ entries[i] = path;
+
+ jsonAnalyzeCollectSubpath(ctx, pstats, jbv, entries, 0);
+
+ pfree(entries);
---- many times we are trying to palloc with zero size, and entries then points to invalid memory (because pstats->depth = 0), so I think we should not try to palloc with 0?
Fix:
+ if (pstats->depth)
+     entries = palloc(sizeof(*entries) * pstats->depth);
We can set limits (like MCV) for paths, or we can give an explicit path to collect stats for a particular path only, or we can pass a subset of the JSON values.
In the above case, there are a total of 851k paths, but we could collect stats for only the 1000 most common paths; that way we can minimize both time and memory, and we might even keep at least frequencies for the non-analyzed paths.
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
This patch has bitrotted, presumably after the other JSON patchset was applied. It looks like it's failing in the json header file so it may be as simple as additional functions added on nearby lines. Please rebase. Reminder, it's the last week of the commitfest so time is of the essence....
I noticed some typos.

diff --git a/src/backend/utils/adt/jsonb_selfuncs.c b/src/backend/utils/adt/jsonb_selfuncs.c
index f5520f88a1d..d98cd7020a1 100644
--- a/src/backend/utils/adt/jsonb_selfuncs.c
+++ b/src/backend/utils/adt/jsonb_selfuncs.c
@@ -1342,7 +1342,7 @@ jsonSelectivityContains(JsonStats stats, Jsonb *jb)
 				path->stats = jsonStatsFindPath(stats, pathstr.data, pathstr.len);
 
-				/* Appeend path string entry for array elements, get stats. */
+				/* Append path string entry for array elements, get stats. */
 				jsonPathAppendEntry(&pathstr, NULL);
 				pstats = jsonStatsFindPath(stats, pathstr.data, pathstr.len);
 				freq = jsonPathStatsGetFreq(pstats, 0.0);
@@ -1367,7 +1367,7 @@ jsonSelectivityContains(JsonStats stats, Jsonb *jb)
 			case WJB_END_ARRAY:
 			{
 				struct Path *p = path;
-				/* Absoulte selectivity of the path with its all subpaths */
+				/* Absolute selectivity of the path with its all subpaths */
 				Selectivity abs_sel = p->sel * p->freq;
 
 				/* Pop last path entry */
diff --git a/src/backend/utils/adt/jsonb_typanalyze.c b/src/backend/utils/adt/jsonb_typanalyze.c
index 7882db23a87..9a759aadafb 100644
--- a/src/backend/utils/adt/jsonb_typanalyze.c
+++ b/src/backend/utils/adt/jsonb_typanalyze.c
@@ -123,10 +123,9 @@ typedef struct JsonScalarStats
 /*
  * Statistics calculated for a set of values.
  *
- *
  * XXX This seems rather complicated and needs simplification. We're not
  * really using all the various JsonScalarStats bits, there's a lot of
- * duplication (e.g. each JsonScalarStats contains it's own array, which
+ * duplication (e.g. each JsonScalarStats contains its own array, which
  * has a copy of data from the one in "jsons").
  */
 typedef struct JsonValueStats
@@ -849,7 +848,7 @@ jsonAnalyzePathValues(JsonAnalyzeContext *ctx, JsonScalarStats *sstats,
 	stats->stanullfrac = (float4)(1.0 - freq);
 
 	/*
-	 * Similarly, we need to correct the MCV frequencies, becuse those are
+	 * Similarly, we need to correct the MCV frequencies, because those are
 	 * also calculated only from the non-null values. All we need to do is
 	 * simply multiply that with the non-NULL frequency.
 	 */
@@ -1015,7 +1014,7 @@ jsonAnalyzeBuildPathStats(JsonPathAnlStats *pstats)
 	/*
 	 * We keep array length stats here for queries like jsonpath '$.size() > 5'.
-	 * Object lengths stats can be useful for other query lanuages.
+	 * Object lengths stats can be useful for other query languages.
 	 */
 	if (vstats->arrlens.values.count)
 		jsonAnalyzeMakeScalarStats(&ps, "array_length", &vstats->arrlens.stats);
@@ -1069,7 +1068,7 @@ jsonAnalyzeCalcPathFreq(JsonAnalyzeContext *ctx, JsonPathAnlStats *pstats,
 	 * We're done with accumulating values for this path, so calculate the
 	 * statistics for the various arrays.
 	 *
-	 * XXX I wonder if we could introduce some simple heuristict on which
+	 * XXX I wonder if we could introduce some simple heuristic on which
 	 * paths to keep, similarly to what we do for MCV lists. For example a
 	 * path that occurred just once is not very interesting, so we could
 	 * decide to ignore it and not build the stats. Although that won't
@@ -1414,7 +1413,7 @@ compute_json_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 	/*
 	 * Collect and analyze JSON path values in single or multiple passes.
-	 * Sigle-pass collection is faster but consumes much more memory than
+	 * Single-pass collection is faster but consumes much more memory than
 	 * collecting and analyzing by the one path at pass.
 	 */
 	if (ctx.single_pass)
On Fri, 1 Apr 2022 at 20:21, Greg Stark <stark@mit.edu> wrote:
>
> This patch has bitrotted, presumably after the other JSON patchset was
> applied. It looks like it's failing in the json header file so it may
> be as simple as additional functions added on nearby lines.
>
> Please rebase. Reminder, it's the last week of the commitfest so time
> is of the essence....

Thanks, Greg, for the report.

Here, I am attaching re-based patches of the v05 series. These patches are re-based on commit 7dd3ee508432730d15c5.

> I noticed some typos.
> diff --git a/src/backend/utils/adt/jsonb_selfuncs.c b/src/backend/utils/adt/jsonb_selfuncs.c
> index f5520f88a1d..d98cd7020a1 100644

Thanks, Justin, for the review. We will fix these comments in the next version.

> Next, I am going to try to disable all-paths collection and implement
> collection of most common paths (and/or hashed paths maybe).

Thanks, Nikita, for the v04 series of patches.

I tested on top of your patches and verified that the time taken by analyze is reduced for large complex JSON docs. With the v03 patches it was more than 2 hours, and with the v04 patches it is only 39 sec (time for Tomas's test case).

I am waiting for your patches (disable all-paths collection and implement collection of most common paths). Just for testing purposes, I am posting re-based patches here.

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Attachment
>
>
> On 04.02.2022 05:47, Tomas Vondra wrote:
>
> On 1/25/22 17:56, Mahendra Singh Thalor wrote:
> >
>
> ...
>
> For the last few days, I was trying to understand these patches, and based on Tomas's suggestion, I was doing some performance tests.
>
> With the attached .SQL file, I can see that analyze is taking more time with these patches.
>
> I haven't found the root cause of this but I feel that this time is due to a loop of all the paths.
> In my test data, there is a total of 951 different-2 paths. While doing analysis, first we check all the sample rows(30000) and we collect all the different-2 paths (here 951), and after that for every single path, we loop over all the sample rows again to collect stats for a particular path. I feel that these loops might be taking time.
>
> Thanks, I've been doing some performance tests too, and you're right it takes quite a bit of time.
>
>
> That is absolutely not surprising, I have warned about poor performance
> in cases with a large number of paths.
>
>
> I agree the slowness is largely due to extracting all paths and then processing them one by one - which means we have to loop over the tuples over and over. In this case there's about 850k distinct paths extracted, so we do ~850k loops over 30k tuples. That's gotta take time.
>
> I don't know what exactly to do about this, but I already mentioned we may need to pick a subset of paths to keep, similarly to how we pick items for MCV. I mean, if we only saw a path once or twice, it's unlikely to be interesting enough to build stats for it. I haven't tried, but I'd bet most of the 850k paths might be ignored like this.
>
> The other thing we might do is making it the loops more efficient. For example, we might track which documents contain each path (by a small bitmap or something), so that in the loop we can skip rows that don't contain the path we're currently processing. Or something like that.
>
> Apart from this performance issue, I haven't found any crashes or issues.
>
>
> Well, I haven't seen any crashes either, but as I mentioned for complex documents (2 levels, many distinct keys) the ANALYZE starts consuming a lot of memory and may get killed by OOM. For example if you generate documents like this
>
> ./json-generate.py 30000 2 8 1000 6 1000
>
> and then run ANALYZE, that'll take ages and it very quickly gets into a situation like this (generated from gdb by calling MemoryContextStats on TopMemoryContext):
>
> -------------------------------------------------------------------------
> TopMemoryContext: 80776 total in 6 blocks; 13992 free (18 chunks); 66784 used
> ...
> TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
> PortalContext: 1024 total in 1 blocks; 488 free (0 chunks); 536 used: <unnamed>
> Analyze: 472726496 total in 150 blocks; 3725776 free (4 chunks); 469000720 used
> Analyze Column: 921177696 total in 120 blocks; 5123256 free (238 chunks); 916054440 used
> Json Analyze Tmp Context: 8192 total in 1 blocks; 5720 free (1 chunks); 2472 used
> Json Analyze Pass Context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
> JSON analyze path table: 1639706040 total in 25084 blocks; 1513640 free (33 chunks); 1638192400 used
> Vacuum: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
> ...
> Grand total: 3035316184 bytes in 25542 blocks; 10971120 free (352 chunks); 3024345064 used
> -------------------------------------------------------------------------
>
>
> Yes, that's backend 3GB of memory, out of which 1.6GB is in "analyze path table" context, 400MB in "analyze" and 900MB in "analyze column" contexts. I mean, that seems a bit excessive. And it grows over time, so after a while my laptop gives up and kills the backend.
>
> I'm not sure if it's a memory leak (which would be fixable), or it's due to keeping stats for all the extracted paths. I mean, in this particular case we have 850k paths - even if stats are just 1kB per path, that's 850MB. This requires more investigation.
>
> Thank you for the tests and investigation.
>
> I have tried to reduce memory consumption and speed up row scanning:
>
> 1. "JSON analyze path table" context contained ~1KB JsonPathAnlStats
> structure per JSON path in the global hash table. I have moved
> JsonPathAnlStats to the stack of compute_json_stats(), and hash
> table now consumes ~70 bytes per path.
>
> 2. I have fixed copying of resulting JSONB stats into context, which
> reduced the size of "Analyze Column" context.
>
> 3. I have optimized consumption of single-pass algorithm by storing
> only value lists in the non-temporary context. That helped to
> execute "2 64 64" test case in 30 seconds. Single-pass is a
> bit faster in non-TOASTed cases, and much faster in TOASTed.
> But it consumes much more memory and still goes to OOM in the
> cases with more than ~100k paths.
>
> 4. I have implemented per-path document lists/bitmaps, which really
> speed up the case "2 8 1000". List is converted into bitmap when
> it becomes larger than bitmap.
>
> 5. Also I have fixed some bugs.
>
>
> All these changes you can find commit form in our GitHub repository
> on the branch jsonb_stats_20220310 [1].
>
>
> Updated results of the test:
>
> levels keys uniq keys paths master multi-pass single-pass
> ms MB ms MB
> -------------------------------------------------------------------
> 1 1 1 2 153 122 10 82 14
> 1 1 1000 1001 134 105 11 78 38
> 1 8 8 9 157 384 19 328 32
> 1 8 1000 1001 155 454 23 402 72
> 1 64 64 65 129 2889 45 2386 155
> 1 64 1000 1001 158 3990 94 1447 177
> 2 1 1 3 237 147 10 91 16
> 2 1 1000 30577 152 264 32 394 234
> 2 8 8 72 245 1943 37 1692 139
> 2 8 1000 852333 152 9175 678 OOM
> 2 64 64 4161 1784 ~1 hour 53 30018 1750
> 2 64 1000 1001001 4715 ~4 hours 1600 OOM
>
> The two last multi-pass results are too slow, because JSONBs becomes
> TOASTed. For measuring master in these tests, I disabled
> WIDTH_THRESHOLD check which skipped TOASTed values > 1KB.
>
>
> Next, I am going to try to disable all-paths collection and implement
> collection of most common paths (and/or hashed paths maybe).
Hi Nikita,
Tomas and I discussed the design for disabling all-paths collection (collecting stats for only some paths). Below are some thoughts/doubts/questions.
Point 1) Can you please elaborate on how you are going to implement this (collecting stats for only some paths)?
Point 2) As JSON stats are taking time, should we add an on/off switch for collecting JSON stats?
Point 3) We thought of one more design: we could give an explicit path to collect stats for a particular path only, or we could pass a subset of the JSON values, but this may require a lot of code changes (syntax and so on), so we are thinking it would be good if we could collect stats only for some common paths (by a limit or any other way).
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
On 5/17/22 13:44, Mahendra Singh Thalor wrote:
>
> ...
>
> Hi Nikita,
> Tomas and I discussed the design for disabling all-paths collection
> (collecting stats for only some paths). Below are some
> thoughts/doubts/questions.
>
> *Point 1)* Can you please elaborate on how you are going to implement
> this (collecting stats for only some paths)?

I think Nikita mentioned he plans to build stats only for the most common paths, which seems generally straightforward:

1) first pass over the documents, collect distinct paths and track how many times we saw each one

2) in the second pass extract stats only for the most common paths (e.g. the top 100 most common ones, or whatever the statistics target says)

I guess we might store at least the frequency for the uncommon paths, which seems somewhat useful for selectivity estimation. (A rough sketch of this two-pass scheme is at the end of this message.)

I wonder if we might further optimize this for less common paths. AFAICS one of the reasons why we process the paths one by one (in the second pass) is to limit memory consumption. By processing a single path, we only need to accumulate values for that path. But if we know the path is uncommon, we know there'll be few values. For example the path may be in only 100 documents, not the whole sample. So maybe we might process multiple paths at once (which would mean we don't need to detoast the JSON documents that often, etc.).

OTOH that may be pointless, because if the paths are uncommon, chances are the subsets of documents will be different, in which case it's probably cheaper to just process the paths one by one.

> *Point 2)* As JSON stats are taking time, should we add an on/off
> switch for collecting JSON stats?

IMHO we should think about doing that. I think it's not really possible to eliminate (significant) regressions for all corner cases, and in many cases people don't even need these statistics (e.g. when just storing and retrieving JSON docs, without querying the contents of the docs).

I don't know how exactly to enable/disable this - it very much depends on how we store the stats. If we store them in pg_statistic, then ALTER TABLE ... ALTER COLUMN seems like the right way to enable/disable these path stats. We might also have a new "json" statistics kind and do this through CREATE STATISTICS. Or something else, not sure.

> *Point 3)* We thought of one more design: we could give an explicit
> path to collect stats for a particular path only, or we could pass a
> subset of the JSON values, but this may require a lot of code changes
> (syntax and so on), so we are thinking it would be good if we could
> collect stats only for some common paths (by a limit or any other way).
>

I'm not sure I understand what this is saying, particularly the part about a subset of JSON values. Can you elaborate?

I can imagine specifying a list of interesting paths, and we'd only collect stats for the matching subset of the JSON documents. So if you have huge JSON documents with a complex schema, but you only query a very limited subset of paths, we could restrict ANALYZE to this subset.

In fact, that's what the 'selective analyze' commit [1] in Nikita's original patch set does in principle. We'd probably need to improve this in some ways (e.g. to allow defining the column filter not just in ANALYZE itself). I left it out of this patch series to keep it as simple as possible.

But why/how exactly would we limit the "JSON values"? Can you give some example demonstrating that in practice?
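Here is the rough sketch I mentioned above, in Python just to illustrate the two-pass idea (the helpers extract_paths, extract_values and build_stats are made up; in the patch this logic would live in jsonb_typanalyze.c):

from collections import Counter

def analyze_json_sample(sample_docs, extract_paths, extract_values,
                        build_stats, target=100):
    # pass 1: count in how many sampled documents each path occurs
    path_counts = Counter()
    for doc in sample_docs:
        path_counts.update(set(extract_paths(doc)))

    common = [path for path, _ in path_counts.most_common(target)]

    # pass 2: accumulate values and build detailed stats only for the
    # most common paths (one loop over the sample per path, as today)
    stats = {}
    for path in common:
        values = [v for doc in sample_docs for v in extract_values(doc, path)]
        stats[path] = build_stats(values)

    # for everything else keep just the fraction of documents containing
    # the path, which is still somewhat useful for selectivity estimation
    n = len(sample_docs)
    rare_freqs = {path: count / n
                  for path, count in path_counts.items()
                  if path not in stats}

    return stats, rare_freqs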
regards

[1] https://github.com/postgrespro/postgres/commit/7ab7397450df153e5a8563c978728cb731a0df33

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company