Re: Collecting statistics about contents of JSONB columns - Mailing list pgsql-hackers
From | Nikita Glukhov |
---|---|
Subject | Re: Collecting statistics about contents of JSONB columns |
Date | |
Msg-id | c802791a-beb4-471a-3f63-4d40e3564aa8@postgrespro.ru Whole thread Raw |
In response to | Re: Collecting statistics about contents of JSONB columns (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: Collecting statistics about contents of JSONB columns
|
List | pgsql-hackers |
Hi! I am glad that you found my very old patch interesting and started to work on it. We failed to post it in 2016 mostly because we were not satisfied with JSONB storage. Also we decided to wait for completion of work on extended statistics as we thought that it could help us. But in early 2017 we switched to SQL/JSON and forgot about this patch. I think custom datatype is necessary for better performance. With a plain JSONB we need to do a lot of work for extraction of path stats: - iterate through MCV/Histogram JSONB arrays - cast numeric values to float, string to text etc. - build PG arrays from extracted datums - form pg_statistic tuple. With a custom data type we could store pg_statistic tuple unmodified and use it without any overhead. But then we need modify a bit VariableStatData and several functions to pass additional nullfrac corrections. Maybe simple record type (path text, data pg_statistic, ext_data jsonb) would be enough. Also there is an idea to store per-path separately in pg_statistic_ext rows using expression like (jb #> '{foo,bar}') as stxexprs. This could also help user to select which paths to analyze simply by using some sort of CREATE STATISTICS. But it is really unclear how to: * store pg_statistic_ext rows from typanalyze * form path expressions for array elements (maybe add new jsonpath operator) * match various -> operator chains to stxexprs * jsonpath-like languages need simple API for searching by stxexprs Per-path statistics should only be collected for scalars. This can be enabled by flag JsonAnalyzeContext.scalarsOnly. But there are is a problem: computed MCVs and histograms will be broken and we will not be able to use them for queries like (jb > const) in general case. Also we will not be and use internally in scalarineqsel() and var_eq_const() (see jsonSelectivity()). Instead, we will have to implement own estimator functions for JSONB comparison operators that will correctly use our hacked MCVs and histograms (and maybe not all cases will be supported; for example, comparison to scalars only). It's possible to replace objects and arrays with empty ones when scalarsOnly is set to keep correct frequencies of non-scalars. But there is an old bug in JSONB comparison: empty arrays are placed before other values in the JSONB sort order, although they should go after all scalars. So we need also to distinguish empty and non-empty arrays here. I tried to fix a major part of places marked as XXX and FIXME, the new version of the patches is attached. There are a lot of changes, you can see them in a step-by-step form in the corresponding branch jsonb_stats_20220122 in our GitHub repo [1]. Below is the explanation of fixed XXXs: Patch 0001 src/backend/commands/operatorcmds.c: > XXX Maybe not the right name, because "estimator" implies we're > calculating selectivity. But we're actually deriving statistics for > an expression. Renamed to "Derivator". > XXX perhaps full "statistics" wording would be better Renamed to "statistics". src/backend/utils/adt/selfuncs.c: > examine_operator_expression(): > XXX Not sure why this returns bool - we ignore the return value > anyway. We might as well return the calculated vardata (or NULL). Oprstat was changed to return void. > XXX Not sure what to do about recursion - there can be another OpExpr > in one of the arguments, and we should call this recursively from the > oprstat procedure. But that's not possible, because it's marked as > static. Oprstat call chain: get_restriction_variable() => examine_variable() => examine_operator_expression(). The main thing here is that OpExpr with oprstat acts like ordinary Var. > examine_variable(): > XXX Shouldn't this put more restrictions on the OpExpr? E.g. that > one of the arguments has to be a Const or something? This is simply a responsibility of oprstat. Patch 0004 > XXX Not sure we want to add these functions to jsonb.h, which is the > public API. Maybe it belongs rather to jsonb_typanalyze.c or > elsewhere, closer to how it's used? Maybe it needs to be moved the new file jsonb_utils.h. I think these functions become very useful, when we start to build JSONBs with predefined structure. > pushJsonbValue(): > XXX I'm not quite sure why we actually do this? Why do we need to > change how JsonbValue is converted to Jsonb for the statistics patch? Scalars in JSONB are encoded as one-element preudo-array containers. So when we are inserting binary JsonbValues, that was initialized directly from JSONB datums, inside other non-empty JSONB in pushJsonbValue(), all values except scalars are inserted as expected but scalars become [scalar]. So we need to extract scalar values in caller functions or in pushJsonbValue(). I think autoextraction in pushJsonbValue() is better. This case simply was not used before JSONB stats introduction. Patch 0006 src/backend/utils/adt/jsonb_selfuncs.c > jsonPathStatsGetSpecialStats() > XXX This does not really extract any stats, it merely allocates the > struct? Renamed with "Alloc" suffix. > jsonPathStatsGetArrayLengthStats() > XXX Why doesn't this do jsonPathStatsGetTypeFreq check similar to > what jsonPathStatsGetLengthStats does? "length" stats were collected inside parent paths, but "array_length" and "avg_array_length" stats were collected inside child array paths. This resulted in inconsistencies in TypeFreq checks. I have removed "length" stats, moved "array_length" and "avg_array_length" to the parent path, and added separate "object_length" stats. TypeFreq checks become consistent. > XXX Seems to do essentially what jsonStatsFindPath, except that it > also considers jsdata->prefix. Seems fairly easy to combine those > into a single function. I don't think that it would be better to combine those function into one with considerPrefix flag, because jsonStatsFindPathStats() is a some kind of low-level function which is called only in two places. In all other places considerPrefix will be true. Also jsonStatsFindPathStatsStr() is exported in jsonb_selfuncs.h for to give external jsonpath-like query operators ability to use JSON statistics. > jsonPathStatsCompare() > XXX Seems a bit convoluted to first cast it to Datum, then Jsonb ... > Datum const *pdatum = pv2; > Jsonb *jsonb = DatumGetJsonbP(*pdatum); The problem of simply using 'jsonb = *(Jsonb **) pv2' is that DatumGetJsonbP() may deTOAST datums. > XXX Not sure about this? Does empty path mean global stats? Empty "path" is simply a sign of invalid stats data. > jsonStatsConvertArray() > FIXME Does this actually work on all 32/64-bit systems? What if typid > is FLOAT8OID or something? Should look at TypeCache instead, probably. Used get_typlenbyvalalign() instead of hardcoded values. > jsonb_stats() > XXX It might be useful to allow recursion, i.e. > get_restriction_variable might derive statistics too. > I don't think it does that now, right? get_restriction_variable() already might derive statistics: it calls examine_variable() on its both operands, and examine_variable() calls examine_operator_expression(). > XXX Could we also get varonleft=false in useful cases? All supported JSONB operators have signature jsonb OP arg. If varonleft = false, then we need to derive stats for expression like '{"foo": "bar"}' -> text_column having stats for text_column. It is possible to implement too, but it is not related to JSONB stats and needs to be done in the separate patch. > jsonSelectivityContains(): > XXX This really needs more comments explaining the logic. I have refactored this function and added comments. > jsonGetFloat4(): > XXX Not sure assert is a sufficient protection against different > types of JSONB values to be passed in. I have refactored this function by passing a default value for non-numeric JSONB case. > jsonPathAppendEntryWithLen() > XXX Doesn't this need ecape_json too? Comment is removed, because jsonPathAppendEntry() is called inside. > jsonPathStatsGetTypeFreq() > FIXME This is really hard to read/understand, with two nested ternary > operators. I have refactored this place. > XXX Seems more like an error, no? Why ignore it? It's not an error, it's request of non-numeric type. Length values are always numeric, and frequence of non-numeric values is 0. The possible case when it could happen is jsonpath '$.size() == "foo"'. Estimator for operator == will check frequence of strings in .size() and it will be 0. > jsonPathStatsFormTuple() > FIXME What does this mean? There is no need to transform ordinary root path stats tuple, it can be simply copied. jsonb_typanalyze.c > XXX We need entry+lenth because JSON path elements may contain null > bytes, I guess? 'entry' can be not zero-terminated when it is pointing to JSONB keys, so 'len' is necessary field. 'len' is also used for faster entry comparison, to distinguish array entries ('len' == -1). > XXX Sould be JsonPathEntryMatch as it deals with JsonPathEntry nodes > not whole paths, no? > XXX Again, maybe JsonPathEntryHash would be a better name? Functions renamed using JsonPathEntry prefix, JsonPath typedef removed. > JsonPathMatch() > XXX Seems a bit silly to return int, when the return statement only > really returns bool (because of how it compares paths). It's not really > a comparator for sorting, for example. This function is implementation of HashCompareFunc and it needs to return int. > jsonAnalyzeJson() > XXX The name seems a bit weird, with the two json bits. Renamed to jsonAnalyzeCollectPaths(). The two similar functions were renamed too. > jsonAnalyzeJson(): > XXX The mix of break/return statements in this block is really > confusing. I have refactored this place using only breaks. > XXX not sure why we're doing this? Manual recursion into containers by creating child iterator together with skipNested=true flag is used to give jsonAnalyzeJsonValue() ability to access jbvBinary containers. > compute_json_stats() > XXX Not sure what the first branch is doing (or supposed to)? > XXX It's not immediately clear why this is (-1) and not simply > NULL. It crashes, so presumably it's used to tweak the behavior, > but it's not clear why/how, and it affects place that is pretty > far away, and so not obvious. We should use some sort of flag > with a descriptive name instead. > XXX If I understand correctly, we simply collect all paths first, > without accumulating any Values. And then in the next step we > process each path independently, probably to save memory (we > don't want to accumulate all values for all paths, with a lot > of duplicities). There are two variants of stats collection: * single-pass - collect all values for all paths * multi-pass - collect only values for a one path at each pass The first variant can consume too much memory (jsonb iteration produces a lot of garbage etc.), but works faster than second. The first 'if (false)' is used for manual selection of one of this variants. This selection should be controlled by some user-specified option (maybe GUC), or the first variant can be simply removed. jsonAnalyzeJson()'s parameter of type JsonPathAnlStats * determines which paths we need to consider for the value collection: * NULL - collect values for all paths * -1 - do not collect any values * stats - collect values only for a given path The last variant really is unused because we already have another function jsonAnalyzeJsonPath(), which is optimized for selective path values collection (used object key accessor JSONB functions instead of full JSONB iteration). I have replaced this strange parameter with simple boolean flag. > XXX Could the parameters be different on other platforms? Used get_typlenbyvalalign(JSONBOID) instead of hardcoded values. > jsonAnalyzePathValues() > XXX Do we need to initialize all slots? I have copied here the following comment from extended_stats.c: /* * The fields describing the stats->stavalues[n] element types default * to the type of the data being analyzed, but the type-specific * typanalyze function can change them if it wants to store something * else. */ > XXX Not sure why we divide it by the number of json values? We divide counts of lengths by the total number of json values to compute correct nullfrac. I.e. not all input jsons have lengths, length of scalar jsons is NULL. [1] https://github.com/postgrespro/postgres/tree/jsonb_stats_20220122 -- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
pgsql-hackers by date: