From: Tomas Vondra
Subject: Collecting statistics about contents of JSONB columns
Msg-id: c9c4bd20-996c-100e-25e7-27e27bb1da7c@enterprisedb.com
List: pgsql-hackers
Hi,

One of the complaints I sometimes hear from users and customers using 
Postgres to store JSON documents (as JSONB type, of course) is that the 
selectivity estimates are often pretty poor.

Currently we only really have MCV lists and histograms built on whole 
documents, and we can deduce some stats from that. But that is somewhat 
bogus, because there are only ~100 documents in either the MCV list or 
the histogram (with the default statistics target). Moreover, we discard 
all "oversized" values (over 1kB) before even calculating those stats, 
which makes them even less representative.
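
For illustration, this is pretty much all the optimizer has to work 
with today (using a hypothetical table "docs" with a single JSONB 
column "body"):

   -- the only per-column stats are MCVs/histogram of whole documents
   SELECT most_common_vals, histogram_bounds
     FROM pg_stats
    WHERE tablename = 'docs' AND attname = 'body';

Any condition on a key buried inside the documents has to be estimated 
from those whole-document values, or falls back to default 
selectivities.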

A couple weeks ago I started playing with this, and I experimented with 
improving extended statistics in this direction. After a while I noticed 
a forgotten development branch from 2016 which tried to do this by 
adding a custom typanalyze function, which seemed like a more natural 
idea (because this really is per-column statistics).

But then I went to pgconf NYC in early December, where I spoke to Oleg 
about various JSON-related things. He mentioned they had been working 
on this topic some time ago too, but did not have time to pursue it, 
and he pointed me to a branch [1] developed by Nikita Glukhov.

I like Nikita's branch because it solves a couple of architectural 
issues that I had been aware of, but had only solved in a rather 
hackish way in my own experiments.

I had a discussion with Nikita about his approach and what we can do to 
move it forward. He's focusing on other JSON stuff, but he's OK with me 
taking over. So here we go ...

Nikita rebased his branch recently, and I've kept improving it in 
various ways (mostly a lot of comments and docs, plus some minor fixes 
and tweaks). I've pushed my version with a couple of extra commits to 
[2], but you can ignore those unless you want to see what I 
added/changed.

Attached are a couple of patches adding the main part of the feature. 
There are a couple more commits in the github repositories, adding more 
advanced features - I'll briefly explain those later, but I'm not 
including them here, because they are optional and it'd be distracting 
to discuss them now.

There are 6 patches in the series, but the magic mostly happens in parts 
0001 and 0006. The other parts just add infrastructure - that may be a 
sizeable amount of code, but the changes are fairly simple and obvious. 
So let's focus on 0001 and 0006.

To add JSON statistics we need to do two basic things - we need to build 
the statistics and then we need to allow using them while estimating 
conditions.


1) building stats

Let's talk about building the stats first. The patch does one of the 
things I experimented with - 0006 adds a jsonb_typanalyze function and 
associates it with the data type. The function extracts paths and 
values from the JSONB document, builds the statistics, and then stores 
the result in pg_statistic as a new stakind.
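
Just to make this concrete - the association is visible in pg_type (the 
catalog column is real, the function is what the patch adds):

   -- with the patch applied, ANALYZE on a jsonb column goes through
   -- the custom typanalyze function instead of the default one
   SELECT typname, typanalyze
     FROM pg_type
    WHERE typname = 'jsonb';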

I had been planning to store the stats in pg_statistic too, but I was 
considering using a custom data type. The patch does something far more 
elegant - it simply uses stavalues to store an array of JSONB documents, 
each describing the stats for one path extracted from the sampled 
documents.

One (very simple) element of the array might look like this:

   {"freq": 1,
    "json": {
      "mcv": {
         "values": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
         "numbers": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
      "width": 19,
      "distinct": 10,
      "nullfrac": 0,
      "correlation": 0.10449},
    "path": "$.\"a\"",
    "freq_null": 0, "freq_array": 0, "freq_object": 0,
    "freq_string": 0, "freq_boolean": 0, "freq_numeric": 0}

In this case there's only an MCV list (represented by two arrays, just 
like in pg_statistic), but there might be another part with a histogram. 
There are also the other columns we'd expect to see in pg_statistic.
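
The stored stats can be eyeballed directly in the catalog. A 
hypothetical inspection query (which stakindN/stavaluesN slot the JSON 
stats land in depends on how the new stakind gets assigned):

   SELECT a.attname, s.stakind1, s.stavalues1
     FROM pg_statistic s
     JOIN pg_attribute a ON (a.attrelid = s.starelid
                             AND a.attnum = s.staattnum)
    WHERE s.starelid = 'docs'::regclass AND a.attname = 'body';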

In principle, we need a pg_statistic entry for each path we extract 
from the JSON documents and decide is interesting enough for 
estimation. There are probably other ways to serialize/represent this, 
but I find using JSONB pretty convenient, because we need to deal with 
a mix of data types (for the same path) and other JSON-specific stuff. 
Storing that in Postgres arrays would be problematic.

I'm sure there are plenty of open questions - for example, I think 
we'll need some logic to decide which paths to keep, otherwise the 
statistics can get quite big if we're dealing with large / variable 
documents. We're already doing something similar for MCV lists.
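
Presumably the existing per-column statistics target could serve as 
that limit too, capping the number of retained paths just like it caps 
MCV/histogram sizes (an assumption on my part, not something the patch 
does):

   ALTER TABLE docs ALTER COLUMN body SET STATISTICS 1000;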

One of Nikita's patches not included in this thread allows "selective" 
statistics, where you can define a "filter" in advance, restricting 
which parts ANALYZE considers interesting. That's interesting, but I 
think we need some simple MCV-like heuristics first anyway.

Another open question is how deep the stats should be. Imagine documents 
like this:

   {"a" : {"b" : {"c" : {"d" : ...}}}}

The current patch builds stats for all possible paths:

  "a"
  "a.b"
  "a.b.c"
  "a.b.c.d"

and of course many of the paths will often have JSONB documents as 
values, not simple scalar values. I wonder if we should limit the depth 
somehow, and maybe build stats only for scalar values.


2) applying the statistics

One of the problems is how to actually use the statistics. For the @> 
operator it's simple enough, because the condition is (jsonb @> jsonb), 
so we have direct access to the stats. But often the conditions look 
like this:

     jsonb_column ->> 'key' = 'value'

so the condition is actually on an expression, not on the JSONB column 
directly. My solutions were pretty ugly hacks, but Nikita had a neat 
idea - we can define a custom procedure for each operator, which is 
responsible for "calculating" the statistics for the expression.

So in this case "->>" will have such an "oprstat" procedure, which 
fetches the stats for the JSONB column and extracts the stats for the 
"key" path. We can then use those to estimate the (text = text) 
condition.
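
A minimal end-to-end sketch of what this should achieve (the table and 
data are made up, and the exact estimate depends on the MCV/histogram 
built for the "a" path):

   CREATE TABLE docs (body jsonb);
   INSERT INTO docs
   SELECT jsonb_build_object('a', i % 10)
     FROM generate_series(1, 10000) i;
   ANALYZE docs;

   -- oprstat for ->> derives text stats for the "a" path, so the
   -- (text = text) condition should be estimated at ~1000 rows,
   -- not some default guess for an expression
   EXPLAIN SELECT * FROM docs WHERE body ->> 'a' = '5';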

This is what 0001 does, pretty much. We first look for expression stats 
provided by an index or extended statistics, and then - if oprstat is 
defined for the operator - we try to derive the stats.

This opens other interesting opportunities for the future - one of the 
parts adds oprstat for basic arithmetic operators, which allows deducing 
statistics for expressions like (a+10) from statistics on column (a).

Which seems like a neat feature on its own, but it also interacts with 
extended statistics in somewhat non-obvious ways (especially when 
estimating GROUP BY cardinalities).
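
A hypothetical example of the arithmetic case (assuming oprstat gets 
defined for the int4 "+" operator, as in those extra commits):

   CREATE TABLE t (a int);
   INSERT INTO t SELECT i FROM generate_series(1, 1000) i;
   ANALYZE t;

   -- stats for (a + 10) are derived by shifting the stats on "a",
   -- so this should be estimated at ~100 rows, not a default guess
   EXPLAIN SELECT * FROM t WHERE (a + 10) <= 110;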

Of course, there's a limit of what we can reasonably estimate - for 
example, there may be statistical dependencies between paths, and this 
patch does not even attempt to deal with that. In a way, this is similar 
to correlation between columns, except that here we have a dynamic set 
of columns, which makes it much much harder. We'd need something like 
extended stats on steroids, pretty much.


I'm sure I've forgotten various important bits - many of them are 
mentioned or explained in comments, but I'm sure others are not. And I'd 
bet there are things I forgot about entirely or got wrong. So feel free 
to ask.


In any case, I think this is a good first step towards improving our 
estimates for JSONB columns.

regards


[1] https://github.com/postgrespro/postgres/tree/jsonb_stats

[2] https://github.com/tvondra/postgres/tree/jsonb_stats_rework

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company