Re: working around JSONB's lack of stats? - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: working around JSONB's lack of stats?
Date
Msg-id 54C976ED.6030309@2ndquadrant.com
Whole thread Raw
In response to Re: working around JSONB's lack of stats?  (Josh Berkus <josh@agliodbs.com>)
List pgsql-performance
On 29.1.2015 00:03, Josh Berkus wrote:
> On 01/28/2015 11:48 AM, Tomas Vondra wrote:
>> On 27.1.2015 08:06, Josh Berkus wrote:
>>> Folks,
>>>
>> ...
>>>
>>> On a normal column, I'd raise n_distinct to reflect the higher
>>> selecivity of the search terms.  However, since @> uses contsel,
>>> n_distinct is ignored.  Anyone know a clever workaround I don't
>>> currently see?
>>
>> I don't see any reasonable workaround :-(
>>
>> ISTM we'll have to invent a way to collect useful stats about contents
>> of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly
>> relying on defaults that may be reasonable, but still misfire in many
>> cases. Do we have any ideas of how that might work?
>>
>> We're already collecting stats about contents of arrays, and maybe we
>> could do something similar for JSONB? The nested nature of JSON makes
>> that rather incompatible with the flat MCV/histogram stats, though.
>
> Well, I was thinking about this.
>
> We already have most_common_elem (MCE) for arrays and tsearch. What
> if we put JSONB's most common top-level keys (or array elements,
> depending) in the MCE array? Then we could still apply a simple rule
> for any path criteria below the top-level keys, say assuming that any
> sub-key criteria would match 10% of the time. While it wouldn't be
> perfect, it would be better than what we have now.

So how would that work with your 'tags' example? ISTM most of your
documents have 'tags' as top-level key, so that would end up in the MCV
list. But there's no info about the elements of the 'tags' array (thus
the 10% default, which may work in this particular case, but it's hardly
a general solution and I doubt it's somehow superior to the defaults
we're using right now).

I think a 'proper' solution to JSONB stats needs to somehow reflect the
nested structure. What I was thinking about is tracking MCV for
"complete paths", i.e. for a document:

  {
    "keyA" : {
      "keyB" : "x",
      "keyC" : "z",
    }
    "keyD" : [1, 2, 3, 4]
  }

We'd extract three paths

   "keyA.keyB"
   "keyA.keyC"
   "keyD"

and aggregate that over all the documents to select the MCV paths.
And then, for each of those MCV paths track the most common values.

ISTM this would allow better estimations, but it has issues too:

Firstly, it does not match the MCV structure, because it requires
storing (a) MCV paths and (b) MCV values for those paths. Moreover, (b)
probably stores different data types (some values are strings, some
integers, etc.). Arrays might be handled just like regular arrays, i.e.
tracking stats of elements, but it's still mixed data types.

Secondly, I think it's based on the assumption of independence (i.e.
that the occurence of one path does not depend on occurence of a
different path in the same document). Same for values x paths. Which may
or may not be be true - it's essentially the same as assumption of
independence for predicates on multiple columns. While I do have ideas
on how to approach this in the multi-column case, handling this for
JSONB is going to be much more complex I think.

But the first question (what stats to collect and how to store them) is
the most important at this point, I guess.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-performance by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: working around JSONB's lack of stats?
Next
From: Josh Berkus
Date:
Subject: Re: working around JSONB's lack of stats?