Re: Collecting statistics about contents of JSONB columns - Mailing list pgsql-hackers

From Nikita Glukhov
Subject Re: Collecting statistics about contents of JSONB columns
Date
Msg-id c802791a-beb4-471a-3f63-4d40e3564aa8@postgrespro.ru
Whole thread Raw
In response to Re: Collecting statistics about contents of JSONB columns  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Collecting statistics about contents of JSONB columns
List pgsql-hackers
Hi!

I am glad that you found my very old patch interesting and started to
work on it.  We failed to post it in 2016 mostly because we were not
satisfied with JSONB storage.  Also we decided to wait for completion
of work on extended statistics as we thought that it could help us.
But in early 2017 we switched to SQL/JSON and forgot about this patch.


I think custom datatype is necessary for better performance. With a
plain JSONB we need to do a lot of work for extraction of path stats: - iterate through MCV/Histogram JSONB arrays - cast numeric values to float, string to text etc. - build PG arrays from extracted datums - form pg_statistic tuple.

With a custom data type we could store pg_statistic tuple unmodified
and use it without any overhead.  But then we need modify a bit
VariableStatData and several functions to pass additional nullfrac
corrections.

Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
would be enough.



Also there is an idea to store per-path separately in pg_statistic_ext
rows using expression like (jb #> '{foo,bar}') as stxexprs.  This could
also help user to select which paths to analyze simply by using some
sort of CREATE STATISTICS.  But it is really unclear how to: * store pg_statistic_ext rows from typanalyze * form path expressions for array elements (maybe add new jsonpath   operator) * match various -> operator chains to stxexprs * jsonpath-like languages need simple API for searching by stxexprs



Per-path statistics should only be collected for scalars.  This can be
enabled by flag JsonAnalyzeContext.scalarsOnly.  But there are is a
problem: computed MCVs and histograms will be broken and we will not be
able to use them for queries like (jb > const) in general case.  Also
we will not be and use internally in scalarineqsel() and var_eq_const()
(see jsonSelectivity()).  Instead, we will have to implement own 
estimator functions for JSONB comparison operators that will correctly
use our hacked MCVs and histograms (and maybe not all cases will be
supported; for example, comparison to scalars only).

It's possible to replace objects and arrays with empty ones when 
scalarsOnly is set to keep correct frequencies of non-scalars.  
But there is an old bug in JSONB comparison: empty arrays are placed 
before other values in the JSONB sort order, although they should go 
after all scalars.  So we need also to distinguish empty and non-empty 
arrays here.



I tried to fix a major part of places marked as XXX and FIXME, the new
version of the patches is attached.  There are a lot of changes, you
can see them in a step-by-step form in the corresponding branch
jsonb_stats_20220122 in our GitHub repo [1]. 


Below is the explanation of fixed XXXs:

Patch 0001

src/backend/commands/operatorcmds.c:

> XXX Maybe not the right name, because "estimator" implies we're 
> calculating selectivity. But we're actually deriving statistics for 
> an expression.

Renamed to "Derivator".

> XXX perhaps full "statistics" wording would be better

Renamed to "statistics".

src/backend/utils/adt/selfuncs.c:

> examine_operator_expression():

> XXX Not sure why this returns bool - we ignore the return value
> anyway. We might as well return the calculated vardata (or NULL).

Oprstat was changed to return void.

> XXX Not sure what to do about recursion - there can be another OpExpr
> in one of the arguments, and we should call this recursively from the
> oprstat procedure. But that's not possible, because it's marked as
> static.

Oprstat call chain: get_restriction_variable() => examine_variable() =>
examine_operator_expression().

The main thing here is that OpExpr with oprstat acts like ordinary Var.

> examine_variable():
> XXX Shouldn't this put more restrictions on the OpExpr? E.g. that
> one of the arguments has to be a Const or something?

This is simply a responsibility of oprstat.


Patch 0004

> XXX Not sure we want to add these functions to jsonb.h, which is the
> public API. Maybe it belongs rather to jsonb_typanalyze.c or
> elsewhere, closer to how it's used?

Maybe it needs to be moved the new file jsonb_utils.h.  I think these
functions become very useful, when we start to build JSONBs with
predefined structure.


> pushJsonbValue():
> XXX I'm not quite sure why we actually do this? Why do we need to
> change how JsonbValue is converted to Jsonb for the statistics patch?

Scalars in JSONB are encoded as one-element preudo-array containers.
So when we are inserting binary JsonbValues, that was initialized
directly from JSONB datums, inside other non-empty JSONB in
pushJsonbValue(), all values except scalars are inserted as
expected but scalars become [scalar].  So we need to extract scalar
values in caller functions or in pushJsonbValue().  I think
autoextraction in pushJsonbValue() is better.  This case simply was not
used before JSONB stats introduction.


Patch 0006

src/backend/utils/adt/jsonb_selfuncs.c

> jsonPathStatsGetSpecialStats()
> XXX This does not really extract any stats, it merely allocates the
> struct?

Renamed with "Alloc" suffix.

> jsonPathStatsGetArrayLengthStats()
> XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to
> what jsonPathStatsGetLengthStats does?

"length" stats were collected inside parent paths, but "array_length"
and "avg_array_length" stats were collected inside child array paths.
This resulted in inconsistencies in TypeFreq checks.

I have removed "length" stats, moved "array_length" and
"avg_array_length" to the parent path, and added separate
"object_length" stats.  TypeFreq checks become consistent.

> XXX Seems to do essentially what jsonStatsFindPath, except that it
> also considers jsdata->prefix. Seems fairly easy to combine those
> into a single function.

I don't think that it would be better to combine those function into
one with considerPrefix flag, because jsonStatsFindPathStats() is a
some kind of low-level function which is called only in two places.
In all other places considerPrefix will be true. Also
jsonStatsFindPathStatsStr() is exported in jsonb_selfuncs.h for to give
external jsonpath-like query operators ability to use JSON statistics.

> jsonPathStatsCompare()
> XXX Seems a bit convoluted to first cast it to Datum, then Jsonb ...
> Datum const *pdatum = pv2;
> Jsonb	   *jsonb = DatumGetJsonbP(*pdatum);

The problem of simply using 'jsonb = *(Jsonb **) pv2' is that
DatumGetJsonbP() may deTOAST datums.

> XXX Not sure about this? Does empty path mean global stats?
Empty "path" is simply a sign of invalid stats data.


> jsonStatsConvertArray()
> FIXME Does this actually work on all 32/64-bit systems? What if typid
> is FLOAT8OID or something? Should look at TypeCache instead, probably.

Used get_typlenbyvalalign() instead of hardcoded values.

> jsonb_stats()
>   XXX It might be useful to allow recursion, i.e.
>   get_restriction_variable might derive statistics too.
>   I don't think it does that now, right?

get_restriction_variable() already might derive statistics: it calls
examine_variable() on its both operands, and examine_variable() calls
examine_operator_expression().

> XXX Could we also get varonleft=false in useful cases?

All supported JSONB operators have signature  jsonb OP arg.
If varonleft = false, then we need to derive stats for expression like  '{"foo": "bar"}' -> text_column
having stats for text_column.  It is possible to implement too, but it
is not related to JSONB stats and needs to be done in the separate
patch.


> jsonSelectivityContains():
>   XXX This really needs more comments explaining the logic.

I have refactored this function and added comments.


> jsonGetFloat4():
>   XXX Not sure assert is a sufficient protection against different
>   types of JSONB values to be passed in.

I have refactored this function by passing a default value for
non-numeric JSONB case.

> jsonPathAppendEntryWithLen()
> XXX Doesn't this need ecape_json too?

Comment is removed, because jsonPathAppendEntry() is called inside.

> jsonPathStatsGetTypeFreq()
> FIXME This is really hard to read/understand, with two nested ternary
> operators.

I have refactored this place.

> XXX Seems more like an error, no? Why ignore it?

It's not an error, it's request of non-numeric type.  Length values
are always numeric, and frequence of non-numeric values is 0.

The possible case when it could happen is jsonpath '$.size() == "foo"'.
Estimator for operator == will check frequence of strings in .size()
and it will be 0.


> jsonPathStatsFormTuple()
> FIXME What does this mean?

There is no need to transform ordinary root path stats tuple, it can be
simply copied.


jsonb_typanalyze.c

> XXX We need entry+lenth because JSON path elements may contain null
> bytes, I guess?

'entry' can be not zero-terminated when it is pointing to JSONB keys,
so 'len' is necessary field.  'len' is also used for faster entry comparison, to distinguish array entries ('len' == -1).

> XXX Sould be JsonPathEntryMatch as it deals with JsonPathEntry nodes
> not whole paths, no?
> XXX Again, maybe JsonPathEntryHash would be a better name?

Functions renamed using JsonPathEntry prefix, JsonPath typedef removed.


> JsonPathMatch()
> XXX Seems a bit silly to return int, when the return statement only
> really returns bool (because of how it compares paths). It's not really
> a comparator for sorting, for example.

This function is implementation of HashCompareFunc and it needs to
return int.

> jsonAnalyzeJson()
> XXX The name seems a bit weird, with the two json bits.

Renamed to jsonAnalyzeCollectPaths(). The two similar functions were
renamed too.

> jsonAnalyzeJson():
> XXX The mix of break/return statements in this block is really
> confusing.

I have refactored this place using only breaks.

> XXX not sure why we're doing this?

Manual recursion into containers by creating child iterator together
with skipNested=true flag is used to give jsonAnalyzeJsonValue()
ability to access jbvBinary containers.


> compute_json_stats()
> XXX Not sure what the first branch is doing (or supposed to)?

> XXX It's not immediately clear why this is (-1) and not simply
> NULL. It crashes, so presumably it's used to tweak the behavior,
> but it's not clear why/how, and it affects place that is pretty
> far away, and so not obvious. We should use some sort of flag
> with a descriptive name instead.

> XXX If I understand correctly, we simply collect all paths first,
> without accumulating any Values. And then in the next step we
> process each path independently, probably to save memory (we
> don't want to accumulate all values for all paths, with a lot
> of duplicities).

There are two variants of stats collection: * single-pass - collect all values for all paths * multi-pass  - collect only values for a one path at each pass

The first variant can consume too much memory (jsonb iteration produces
a lot of garbage etc.), but works faster than second.

The first 'if (false)' is used for manual selection of one of this
variants.  This selection should be controlled by some user-specified
option (maybe GUC), or the first variant can be simply removed.

jsonAnalyzeJson()'s parameter of type JsonPathAnlStats * determines
which paths we need to consider for the value collection: * NULL  - collect values for all paths * -1    - do not collect any values * stats - collect values only for a given path

The last variant really is unused because we already have another
function jsonAnalyzeJsonPath(), which is optimized for selective path
values collection (used object key accessor JSONB functions instead of
full JSONB iteration).  I have replaced this strange parameter with
simple boolean flag.

> XXX Could the parameters be different on other platforms?

Used get_typlenbyvalalign(JSONBOID) instead of hardcoded values.

> jsonAnalyzePathValues()
> XXX Do we need to initialize all slots?

I have copied here the following comment from extended_stats.c:
/* * The fields describing the stats->stavalues[n] element types default * to the type of the data being analyzed, but the type-specific * typanalyze function can change them if it wants to store something * else. */

> XXX Not sure why we divide it by the number of json values?

We divide counts of lengths by the total number of json values to
compute correct nullfrac. I.e. not all input jsons have lengths,
length of scalar jsons is NULL.



[1] https://github.com/postgrespro/postgres/tree/jsonb_stats_20220122

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Warning in geqo_main.c from clang 13
Next
From: Tomas Vondra
Date:
Subject: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas