Re: Collecting statistics about contents of JSONB columns - Mailing list pgsql-hackers

From: Zhihong Yu
Subject: Re: Collecting statistics about contents of JSONB columns
Msg-id: CALNJ-vQ7p=ZaZ=prwU=tV3FqXhF6CrW+u2Wve3iJVn2wGMBS8A@mail.gmail.com
In response to: Collecting statistics about contents of JSONB columns (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Collecting statistics about contents of JSONB columns
List: pgsql-hackers
On Fri, Dec 31, 2021 at 2:07 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,
One of the complaints I sometimes hear from users and customers using
Postgres to store JSON documents (as JSONB type, of course) is that the
selectivity estimates are often pretty poor.
Currently we only really have MCV and histograms with whole documents,
and we can deduce some stats from that. But that is somewhat bogus
because there are only ~100 documents in either the MCV list or histogram (with the
default statistics target). And moreover we discard all "oversized"
values (over 1kB) before even calculating those stats, which makes it
even less representative.
A couple weeks ago I started playing with this, and I experimented with
improving extended statistics in this direction. After a while I noticed
a forgotten development branch from 2016 which tried to do this by
adding a custom typanalyze function, which seemed like a more natural
idea (because it's really statistics for a single column).
But then I went to pgconf NYC in early December, and I spoke to Oleg
about various JSON-related things, and he mentioned they've been working
on this topic some time ago too, but did not have time to pursue it. So
he pointed me to a branch [1] developed by Nikita Glukhov.
I like Nikita's branch because it solved a couple architectural issues
that I've been aware of, but only solved them in a rather hackish way.
I had a discussion with Nikita about his approach and what we can do to move
it forward. He's focusing on other JSON stuff, but he's OK with me
taking over and moving it forward. So here we go ...
Nikita rebased his branch recently, and I've kept improving it in various
ways (mostly a lot of comments and docs, and some minor fixes and tweaks).
I've pushed my version with a couple extra commits in [2], but you can
ignore that except if you want to see what I added/changed.
Attached are a couple of patches adding the main part of the feature.
There are a couple more commits in the github repositories, adding more
advanced features - I'll briefly explain those later, but I'm not
including them here because they are optional and would be distracting.
There are 6 patches in the series, but the magic mostly happens in parts
0001 and 0006. The other parts are mostly just adding infrastructure,
which may be a sizeable amount of code, but the changes are fairly
simple and obvious. So let's focus on 0001 and 0006.
To add JSON statistics we need to do two basic things - we need to build
the statistics and then we need to allow using them while estimating
conditions.
1) building stats
Let's talk about building the stats first. The patch does one of the
things I experimented with - 0006 adds a jsonb_typanalyze function, and
it associates it with the data type. The function extracts paths and
values from the JSONB document, builds the statistics, and then stores
the result in pg_statistic as a new stakind.
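For readers who haven't looked at typanalyze hooks before, such functions
usually follow the shape below. This is only a minimal sketch modeled on
the existing array_typanalyze(); compute_jsonb_stats is my placeholder
name, not necessarily what the patch calls it:

    Datum
    jsonb_typanalyze(PG_FUNCTION_ARGS)
    {
        VacAttrStats *stats = (VacAttrStats *) PG_GETARG_POINTER(0);

        /* let the standard analyzer fill in the usual defaults */
        if (!std_typanalyze(stats))
            PG_RETURN_BOOL(false);

        /*
         * Install our own compute function.  ANALYZE later calls it
         * with the sampled rows; it extracts paths/values and stores
         * the per-path stats in pg_statistic as a new stakind.
         */
        stats->compute_stats = compute_jsonb_stats;

        PG_RETURN_BOOL(true);
    }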
I've been planning to store the stats in pg_statistic too, but I've been
considering using a custom data type. The patch does something far more
elegant - it simply uses stavalues to store an array of JSONB documents,
each describing stats for one path extracted from the sampled documents.
One (very simple) element of the array might look like this:
{"freq": 1,
"json": {
"mcv": {
"values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
"numbers": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
"width": 19,
"distinct": 10,
"nullfrac": 0,
"correlation": 0.10449},
"path": "$.\"a\"",
"freq_null": 0, "freq_array": 0, "freq_object": 0,
"freq_string": 0, "freq_boolean": 0, "freq_numeric": 0}
In this case there's only a MCV list (represented by two arrays, just
like in pg_statistic), but there might be another part with a histogram.
There are also the other columns we'd expect to see in pg_statistic.
In principle, we need a pg_statistic-like entry for each path we extract
from the JSON documents and decide is interesting enough for estimation. There
are probably other ways to serialize/represent this, but I find using
JSONB for this pretty convenient because we need to deal with a mix of
data types (for the same path), and other JSON specific stuff. Storing
that in Postgres arrays would be problematic.
I'm sure there's plenty open questions - for example I think we'll need
some logic to decide which paths to keep, otherwise the statistics can
get quite big, if we're dealing with large / variable documents. We're
already doing something similar for MCV lists.
One of Nikita's patches not included in this thread allows "selective"
statistics, where you can define in advance a "filter" restricting which
parts are considered interesting by ANALYZE. That's interesting, but I
think we need some simple MCV-like heuristics first anyway.
Another open question is how deep the stats should be. Imagine documents
like this:
{"a" : {"b" : {"c" : {"d" : ...}}}}
The current patch builds stats for all possible paths:
"a"
"a.b"
"a.b.c"
"a.b.c.d"
and of course many of the paths will often have JSONB documents as
values, not simple scalar values. I wonder if we should limit the depth
somehow, and maybe build stats only for scalar values.
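FWIW here's a minimal sketch (mine, not code from the patch - the depth
constant and the remember_path helper are made up) of what walking the
paths with the jsonb iterator API and a depth cutoff could look like:

    /* needs postgres.h and utils/jsonb.h */
    #define JSON_STATS_MAX_DEPTH    4   /* hypothetical cutoff */

    static void
    collect_paths(Jsonb *jb)
    {
        JsonbIterator *it = JsonbIteratorInit(&jb->root);
        JsonbIteratorToken tok;
        JsonbValue  v;
        int         depth = 0;

        while ((tok = JsonbIteratorNext(&it, &v, false)) != WJB_DONE)
        {
            switch (tok)
            {
                case WJB_BEGIN_OBJECT:
                case WJB_BEGIN_ARRAY:
                    depth++;
                    break;
                case WJB_END_OBJECT:
                case WJB_END_ARRAY:
                    depth--;
                    break;
                case WJB_KEY:
                    /* v.val.string holds the key; extend the current
                     * path here, but only while under the limit */
                    if (depth <= JSON_STATS_MAX_DEPTH)
                        remember_path(&v);  /* hypothetical helper */
                    break;
                default:
                    break;
            }
        }
    }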
2) applying the statistics
One of the problems is how to actually use the statistics. For @>
operator it's simple enough, because it's (jsonb @> jsonb) so we have
direct access to the stats. But often the conditions look like this:
jsonb_column ->> 'key' = 'value'
so the condition is actually on an expression, not on the JSONB column
directly. My solutions were pretty ugly hacks, but Nikita had a neat
idea - we can define a custom procedure for each operator, which is
responsible for "calculating" the statistics for the expression.
So in this case "->>" will have such an "oprstat" procedure, which fetches
stats for the JSONB column and extracts stats for the "key" path. And then
we can use that for estimation of the (text = text) condition.
This is what 0001 does, pretty much. We simply look for expression stats
provided by an index, extended statistics, and then - if oprstat is
defined for the operator - we try to derive the stats.
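I haven't studied the exact oprstat signature in the patch, so the
following is only a conceptual sketch of what such a procedure has to do
for "->>" (all the names here are mine):

    /*
     * Conceptual only -- derive stats for (col ->> 'key') from the
     * per-path stats stored for the jsonb column.
     */
    static bool
    jsonb_arrow_oprstat(VariableStatData *colvar,  /* stats on col */
                        Const *key,                /* the 'key' const */
                        VariableStatData *result)  /* derived stats */
    {
        /*
         * 1. fetch the jsonb column's stats tuple
         * 2. find the stavalues element whose "path" matches $."key"
         * 3. build a synthetic pg_statistic tuple from its MCV and
         *    histogram parts and hand it back in *result
         */
        return false;           /* stub: no stats derived */
    }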
This opens other interesting opportunities for the future - one of the
parts adds oprstat for basic arithmetic operators, which allows deducing
statistics for expressions like (a+10) from statistics on column (a).
Which seems like a neat feature on its own, but it also interacts with
the extended statistics in somewhat non-obvious ways (especially when
estimating GROUP BY cardinalities).
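As a concrete illustration (my example, not the patch's): if column a has
MCV values {1, 5, 9} with frequencies {0.3, 0.2, 0.1}, the derived MCV
for (a+10) is simply {11, 15, 19} with the same frequencies, histogram
bounds shift by 10 as well, and ndistinct and the null fraction carry
over unchanged.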
Of course, there's a limit of what we can reasonably estimate - for
example, there may be statistical dependencies between paths, and this
patch does not even attempt to deal with that. In a way, this is similar
to correlation between columns, except that here we have a dynamic set
of columns, which makes it much much harder. We'd need something like
extended stats on steroids, pretty much.
I'm sure I've forgotten various important bits - many of them are
mentioned or explained in comments, but I'm sure others are not. And I'd
bet there are things I forgot about entirely or got wrong. So feel free
to ask.
In any case, I think this seems like a good first step to improve our
estimates for JSONB columns.
regards
[1] https://github.com/postgrespro/postgres/tree/jsonb_stats
[2] https://github.com/tvondra/postgres/tree/jsonb_stats_rework
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
For patch 0001:
+ List *statisticsName = NIL; /* optional stats estimat. procedure */
I think if the variable is named estimatorName (or something similar), it would be easier for people to grasp its purpose.
+ /* XXX perhaps full "statistics" wording would be better */
+ else if (strcmp(defel->defname, "stats") == 0)
I would recommend (stats sounds too general):
+ else if (strcmp(defel->defname, "statsestimator") == 0)
+ statisticsOid = ValidateStatisticsEstimator(statisticsName);
statisticsOid -> statsEstimatorOid
For get_oprstat():
+ }
+ else
+ return (RegProcedure) InvalidOid;
The else keyword is not needed (given the return statement in the if block).
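In other words, the tail of get_oprstat() could simply read (a sketch of
the suggested simplification, assuming the usual syscache-lookup shape
used elsewhere in lsyscache.c; the oprstat column is the patch's
addition):

    tp = SearchSysCache1(OPEROID, ObjectIdGetDatum(opno));
    if (HeapTupleIsValid(tp))
    {
        Form_pg_operator optup = (Form_pg_operator) GETSTRUCT(tp);
        RegProcedure result = optup->oprstat;

        ReleaseSysCache(tp);
        return result;
    }

    /* no "else" needed -- the if block always returns */
    return (RegProcedure) InvalidOid;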
For patch 0006:
+ /* FIXME Could be before the memset, I guess? Checking vardata->statsTuple. */
+ if (!data->statsTuple)
+ return false;
I would agree the check can be lifted above the memset call.
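i.e. something like this (sketch; "result" is just a placeholder for
whatever struct the original memset actually clears):

    /* check first, then initialize the output struct */
    if (!data->statsTuple)
        return false;

    memset(&result, 0, sizeof(result));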
+ * XXX This does not really extract any stats, it merely allocates the struct?
+ */
+static JsonPathStats
+jsonPathStatsGetSpecialStats(JsonPathStats pstats, JsonPathStatsType type)
As the comment says, I think allocJsonPathStats() would be a better name for the function.
+ * XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to what
+ * jsonPathStatsGetLengthStats does?
I think a `jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0` check should be added to jsonPathStatsGetArrayLengthStats().
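i.e. a guard at the top of the function, something like (sketch; I'm
assuming the function returns a pointer, so NULL signals "no stats"):

    /* bail out if the path never contains arrays */
    if (jsonPathStatsGetTypeFreq(pstats, jbvArray, 0.0) <= 0.0)
        return NULL;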
To be continued ...