Re: Cross-column statistics revisited - Mailing list pgsql-hackers

From Richard Huxton
Subject Re: Cross-column statistics revisited
Msg-id 48F8745D.1050205@archonet.com
In response to Re: Cross-column statistics revisited  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers
Gregory Stark wrote:
> They're certainly very much not independent variables. There are lots of ways
> of measuring how much dependence there is between them. I don't know enough
> about the math to know if your maps are equivalent to any of them.

I think "dependency" captures the way I think about it better than
"correlation" does (although I can see there must be a function that
could map that dependency onto how we think of correlations).

> In any case as I described it's not enough information to know that the two
> data sets are heavily dependent. You need to know for which pairs (or ntuples)
> that dependency results in a higher density and for which it results in lower
> density and how much higher or lower. That seems like a lot of information to
> encode (and a lot to find in the sample).

Like Josh Berkus mentioned a few points back, it's the handful of
plan-changing values you're looking for.

So, it seems like we've got:
1. Implied dependencies: zip-code=>city
2. Implied+constraint: start-date < end-date and the difference between
the two is usually less than a week
3. "Top-heavy" foreign-key stats.

#1 and #2 obviously need new infrastructure.
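
To make #1 concrete, here is a rough Python sketch (nothing to do with
the actual planner code; the function names and sample data are made up
for illustration). It measures how strongly one column determines
another in a sample, and shows how multiplying the per-column
selectivities underestimates a zip-code/city pair.

```python
from collections import Counter

def dependency_degree(rows):
    """Fraction of rows whose first value maps to exactly one second value."""
    seen = {}
    for a, b in rows:
        seen.setdefault(a, Counter())[b] += 1
    determined = sum(sum(c.values()) for c in seen.values() if len(c) == 1)
    return determined / len(rows)

def selectivity_independent(rows, a, b):
    """Estimate P(col1 = a AND col2 = b) assuming the columns are independent."""
    n = len(rows)
    p_a = sum(1 for x, _ in rows if x == a) / n
    p_b = sum(1 for _, y in rows if y == b) / n
    return p_a * p_b

def selectivity_actual(rows, a, b):
    """Fraction of rows in the sample that really match both values."""
    return sum(1 for x, y in rows if x == a and y == b) / len(rows)

# Hypothetical sample: every zip code belongs to exactly one city.
sample = (
    [("10001", "New York")] * 3
    + [("10002", "New York")] * 2
    + [("94105", "San Francisco")] * 5
)

degree = dependency_degree(sample)
independent = selectivity_independent(sample, "10001", "New York")
actual = selectivity_actual(sample, "10001", "New York")
```

In this sample the independence assumption gives half the true
selectivity for ("10001", "New York"), because the city clause adds no
filtering once the zip code is fixed.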

From a non-dev point of view it looks like #3 could use the existing
stats on each side of the join. I'm not sure whether you could do
anything meaningful for joins that don't explicitly restrict one side
of the join, though.
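
What I have in mind for #3 is roughly the following Python sketch. It
assumes we already have a most-common-values list with frequencies for
the referencing column, plus a row count and a distinct-value count; the
function and parameter names are hypothetical, and I'm not claiming this
is how the planner does it.

```python
def estimate_join_rows(fk_mcv, fk_total_rows, fk_n_distinct, key):
    """Estimate how many referencing rows join to one specific key.

    fk_mcv maps the most common key values to their frequency (0..1).
    """
    if key in fk_mcv:
        # A "top-heavy" key: use its recorded frequency directly.
        return fk_mcv[key] * fk_total_rows
    # Otherwise spread the leftover frequency evenly over the other keys.
    remaining_freq = 1.0 - sum(fk_mcv.values())
    remaining_keys = fk_n_distinct - len(fk_mcv)
    if remaining_keys <= 0:
        return 0.0
    return remaining_freq / remaining_keys * fk_total_rows

# Hypothetical stats: two customers account for 75% of 1000 orders.
mcv = {1: 0.5, 2: 0.25}
heavy = estimate_join_rows(mcv, 1000, 52, 1)
light = estimate_join_rows(mcv, 1000, 52, 99)
```

This only works when the query names the key on one side. Without that
there is no single value to look up in the list.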

> Perhaps just knowing whether there's a dependence between two data sets
> might be somewhat useful if the planner kept a confidence value for all its
> estimates. It would know to have a lower confidence value for estimates coming
> from highly dependent clauses? It wouldn't be very easy for the planner to
> distinguish "safe" plans for low confidence estimates and "risky" plans which
> might blow up if the estimates are wrong though. And of course that's a lot
> less interesting than just getting better estimates :)

If we could abort a plan and restart then we could just try the
quick-but-risky plan and if we reach 50 rows rather than the expected 10
try a different approach. That way we'd not need to gather stats, just
react to the situation in individual queries.
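
A toy Python sketch of the idea, assuming plans can be modelled as
functions returning row iterators (all names here are invented, and a
real executor would have far more to worry about):

```python
class RowBudgetExceeded(Exception):
    """Raised when a plan produces more rows than we were prepared for."""

def run_with_budget(plan, budget):
    """Run a plan, aborting as soon as it reaches `budget` rows."""
    rows = []
    for row in plan():
        # Rows are buffered, not returned early, so an abort loses nothing
        # the client has already seen.
        rows.append(row)
        if len(rows) >= budget:
            raise RowBudgetExceeded
    return rows

def execute_adaptive(risky_plan, safe_plan, expected_rows, tolerance=5):
    """Try the quick-but-risky plan; fall back if the estimate was badly off."""
    try:
        return "risky", run_with_budget(risky_plan, expected_rows * tolerance)
    except RowBudgetExceeded:
        return "safe", list(safe_plan())

# Expected 10 rows: 8 rows is fine, 1000 rows triggers the restart at 50.
ok = execute_adaptive(lambda: iter(range(8)), lambda: iter([]), 10)
fallback = execute_adaptive(lambda: iter(range(1000)), lambda: iter(["a", "b"]), 10)
```

The obvious cost is that any work done before the abort is thrown away,
so the threshold would need to be low enough to keep that waste small.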

--  Richard Huxton Archonet Ltd