On 02/02/2015 05:48 PM, Jim Nasby wrote:
> On 2/1/15 3:08 PM, Josh Berkus wrote:
>> I'm not clear on what you're suggesting here. I'm discussing how the
>> stats for a JSONB field would be stored and accessed; I don't understand
>> what that has to do with indexing.
>
> The JSON problem is similar to the problem of doing multi-column
> statistics: there's no way to simply try to keep statistics on all
> possible combinations because that's something that's can be extremely
> large.
>
> Something that's been proposed is only trying to keep multi-column stats
> on column combinations that we have an index on (which in a way is
> really just keeping stats on the index itself).
The difficulty with column correlation (as with value correlation for
JSONB) is the combination of *values*, not the combination of *columns*.
Via the brute force method, imagine you have one column with cardinalty
100, and another with cardinality 100,000. This would require you do
keep 10 million different correlation coefficients in order to be able
to estimate correctly. Even correlating MCVs would add up to quite a
bit of stats in short order; these days people frequently set statistics
to 1000. The same goes for JSON keys.
> If we built that, we could use the same technique for JSON by simply
> defining the indexes you needed for whatever you were searching on.
>
> Obviously that's not as ideal is simply keeping full statistics on
> everything in a JSON document, but it might be a heck of a lot easier to
> accomplish.
Walk before trying to run, let alone fly, please. Right now we don't
have selectivity estimation for a *single* key; let's do that before we
start talking about better estimation for combinations of keys.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com