Collecting statistics about contents of JSONB columns - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Collecting statistics about contents of JSONB columns
Date:
Msg-id: c9c4bd20-996c-100e-25e7-27e27bb1da7c@enterprisedb.com
Responses:
  Re: Collecting statistics about contents of JSONB columns
  Re: Collecting statistics about contents of JSONB columns
  Re: Collecting statistics about contents of JSONB columns
List: pgsql-hackers
Hi,

One of the complaints I sometimes hear from users and customers using Postgres to store JSON documents (as JSONB, of course) is that the selectivity estimates are often pretty poor. Currently we only really have MCV lists and histograms built on whole documents, and we can deduce some stats from that. But that is somewhat bogus, because there are only ~100 documents in either the MCV list or the histogram (with the default statistics target). Moreover, we discard all "oversized" values (over 1kB) before even calculating those stats, which makes them even less representative.

A couple of weeks ago I started playing with this, and I experimented with improving extended statistics in this direction. After a while I noticed a forgotten development branch from 2016 which tried to do this by adding a custom typanalyze function, which seemed like a more natural idea (because it's really statistics for a single column). But then I went to PGConf NYC in early December and spoke to Oleg about various JSON-related things, and he mentioned they had been working on this topic some time ago too, but did not have time to pursue it. So he pointed me to a branch [1] developed by Nikita Glukhov.

I like Nikita's branch because it solved a couple of architectural issues that I'd been aware of, but had only solved in a rather hackish way myself. I had a discussion with Nikita about his approach and what we can do to move it forward. He's focusing on other JSON stuff, but he's OK with me taking over and moving it forward. So here we go ...

Nikita rebased his branch recently, and I've kept improving it in various ways (mostly a lot of comments and docs, plus some minor fixes and tweaks). I've pushed my version with a couple of extra commits to [2], but you can ignore that unless you want to see what I added/changed.

Attached are a couple of patches adding the main part of the feature. There are a couple more commits in the GitHub repositories, adding more advanced features - I'll briefly explain those later, but I'm not including them here because they are optional and it'd be distracting.

There are 6 patches in the series, but the magic mostly happens in parts 0001 and 0006. The other parts mostly just add infrastructure, which may be a sizeable amount of code, but the changes are fairly simple and obvious. So let's focus on 0001 and 0006.

To add JSON statistics we need to do two basic things - we need to build the statistics, and then we need to allow using them while estimating conditions.

1) building stats

Let's talk about building the stats first. The patch does one of the things I experimented with - 0006 adds a jsonb_typanalyze function and associates it with the data type. The function extracts paths and values from the JSONB documents, builds the statistics, and then stores the result in pg_statistic as a new stakind.

I had been planning to store the stats in pg_statistic too, but I was considering using a custom data type. The patch does something far more elegant - it simply uses stavalues to store an array of JSONB documents, each describing stats for one path extracted from the sampled documents.
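For illustration, here's how one might peek at the stored array once ANALYZE has run. This is just a sketch - the table "docs" and column "js" are hypothetical names, pg_statistic is only readable by superusers, and which stavaluesN slot the new stakind ends up in depends on what else ANALYZE stored for the column:

  -- hypothetical table with a JSONB column
  CREATE TABLE docs (id int, js jsonb);
  -- ... load some documents ...
  ANALYZE docs;  -- runs the typanalyze function for the jsonb column

  -- fetch the per-path stats array (assuming it landed in slot 1)
  SELECT s.stavalues1
    FROM pg_statistic s
    JOIN pg_class c ON c.oid = s.starelid
    JOIN pg_attribute a ON a.attrelid = s.starelid
                       AND a.attnum = s.staattnum
   WHERE c.relname = 'docs'
     AND a.attname = 'js';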
One (very simple) element of the array might look like this:

  {"freq": 1,
   "json": {"mcv": {"values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                    "numbers": [0.1, 0.1, 0.1, 0.1, 0.1,
                                0.1, 0.1, 0.1, 0.1, 0.1]},
            "width": 19,
            "distinct": 10,
            "nullfrac": 0,
            "correlation": 0.10449},
   "path": "$.\"a\"",
   "freq_null": 0,
   "freq_array": 0,
   "freq_object": 0,
   "freq_string": 0,
   "freq_boolean": 0,
   "freq_numeric": 0}

In this case there's only an MCV list (represented by two arrays, just like in pg_statistic), but there might be another part with a histogram. There are also the other columns we'd expect to see in pg_statistic. In principle, we need a pg_statistic entry for each path we extract from the JSON documents and decide is interesting enough for estimation.

There are probably other ways to serialize/represent this, but I find using JSONB pretty convenient, because we need to deal with a mix of data types (for the same path) and other JSON-specific stuff. Storing that in Postgres arrays would be problematic.

I'm sure there are plenty of open questions - for example I think we'll need some logic to decide which paths to keep, otherwise the statistics can get quite big if we're dealing with large / variable documents. We're already doing something similar for MCV lists. One of Nikita's patches not included in this thread allows "selective" statistics, where you can define in advance a "filter" restricting which parts are considered interesting by ANALYZE. That's interesting, but I think we need some simple MCV-like heuristics first anyway.

Another open question is how deep the stats should be. Imagine documents like this:

  {"a" : {"b" : {"c" : {"d" : ...}}}}

The current patch builds stats for all possible paths:

  "a"
  "a.b"
  "a.b.c"
  "a.b.c.d"

and of course many of the paths will often have JSONB documents as values, not simple scalar values. I wonder if we should limit the depth somehow, and maybe build stats only for scalar values.

2) applying the statistics

One of the problems is how to actually use the statistics. For the @> operator it's simple enough, because it's (jsonb @> jsonb), so we have direct access to the stats. But often the conditions look like this:

  jsonb_column ->> 'key' = 'value'

so the condition is actually on an expression, not on the JSONB column directly. My solutions were pretty ugly hacks, but Nikita had a neat idea - we can define a custom procedure for each operator, which is responsible for "calculating" the statistics for the expression. So in this case "->>" will have such an "oprstat" procedure, which fetches the stats for the JSONB column and extracts the stats for the "key" path. And then we can use that for estimation of the (text = text) condition.

This is what 0001 does, pretty much. We simply look for expression stats provided by an index or extended statistics, and then - if oprstat is defined for the operator - we try to derive the stats.

This opens other interesting opportunities for the future - one of the parts adds oprstat for basic arithmetic operators, which allows deducing statistics for expressions like (a+10) from statistics on column (a). That seems like a neat feature on its own, but it also interacts with extended statistics in somewhat non-obvious ways (especially when estimating GROUP BY cardinalities).
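To make both cases concrete, here's a hypothetical session (reusing the made-up "docs" table from above, plus another made-up table "t" with an integer column "a"; the conditions are what matters, the resulting estimates are illustrative only):

  -- estimated via stats for the $."key" path, derived through the ->> oprstat
  EXPLAIN SELECT * FROM docs WHERE js ->> 'key' = 'value';

  -- containment is (jsonb @> jsonb), so the column stats apply directly
  EXPLAIN SELECT * FROM docs WHERE js @> '{"key": "value"}';

  -- with oprstat on the integer "+" operator, stats for (a + 10)
  -- can be derived from the stats on column a
  EXPLAIN SELECT * FROM t WHERE (a + 10) < 100;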
Of course, there's a limit to what we can reasonably estimate - for example, there may be statistical dependencies between paths, and this patch does not even attempt to deal with that. In a way, this is similar to correlation between columns, except that here we have a dynamic set of columns, which makes it much, much harder. We'd need something like extended stats on steroids, pretty much.

I'm sure I've forgotten various important bits - many of them are mentioned or explained in comments, but I'm sure others are not. And I'd bet there are things I forgot about entirely or got wrong. So feel free to ask.

In any case, I think this seems like a good first step to improve our estimates for JSONB columns.

regards

[1] https://github.com/postgrespro/postgres/tree/jsonb_stats
[2] https://github.com/tvondra/postgres/tree/jsonb_stats_rework

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
- 0001-Add-pg_operator.oprstat-for-derived-operato-20211230.patch
- 0002-Add-stats_form_tuple-20211230.patch
- 0003-Add-symbolic-names-for-some-jsonb-operators-20211230.patch
- 0004-Add-helper-jsonb-functions-and-macros-20211230.patch
- 0005-Export-scalarineqsel-20211230.patch
- 0006-Add-jsonb-statistics-20211230.patch