Re: Cross-column statistics revisited - Mailing list pgsql-hackers

From	Richard Huxton
Subject	Re: Cross-column statistics revisited
Date	October 17, 2008 08:18:34
Msg-id	48F8745D.1050205@archonet.com
In response to	Re: Cross-column statistics revisited (Gregory Stark <stark@enterprisedb.com>)
List	pgsql-hackers

Gregory Stark wrote:
> They're certainly very much not independent variables. There are lots of ways
> of measuring how much dependence there is between them. I don't know enough
> about the math to know if your maps are equivalent to any of them.

I think "dependency" captures the way I think about it better than
"correlation" does (although I can see there must be a function that
could map that dependency onto how we think of correlations).

> In any case as I described it's not enough information to know that the two
> data sets are heavily dependent. You need to know for which pairs (or ntuples)
> that dependency results in a higher density and for which it results in lower
> density and how much higher or lower. That seems like a lot of information to
> encode (and a lot to find in the sample).

Like Josh Berkus mentioned a few points back, it's the handful of
plan-changing values you're looking for.

So, it seems like we've got:
1. Implied dependencies: zip-code=>city
2. Implied+constraint: start-date < end-date and the difference between
the two is usually less than a week
3. "Top-heavy" foreign-key stats.
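
To make case 1 concrete, here is a minimal sketch of why an implied
dependency such as zip-code=>city throws off an estimate that treats the
two columns as independent. The table contents and the function names
(selectivity, independent_estimate, actual_count) are made up for
illustration; this is not how the planner is implemented.

```python
# Hypothetical toy table of (zip_code, city) rows.
# Each zip code belongs to exactly one city, so zip_code => city.
rows = (
    [("10001", "New York")] * 40
    + [("10002", "New York")] * 30
    + [("94105", "San Francisco")] * 20
    + [("60601", "Chicago")] * 10
)


def selectivity(rows, predicate):
    """Fraction of rows that satisfy a single-column predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


def independent_estimate(rows, zip_code, city):
    """Row estimate obtained by multiplying per-column selectivities,
    i.e. assuming the two columns are independent."""
    sel_zip = selectivity(rows, lambda r: r[0] == zip_code)
    sel_city = selectivity(rows, lambda r: r[1] == city)
    return sel_zip * sel_city * len(rows)


def actual_count(rows, zip_code, city):
    """True number of rows matching both predicates."""
    return sum(1 for r in rows if r == (zip_code, city))


if __name__ == "__main__":
    for zip_code, city in [("10001", "New York"), ("10001", "Chicago")]:
        est = independent_estimate(rows, zip_code, city)
        act = actual_count(rows, zip_code, city)
        print(f"{zip_code} / {city}: estimated {est:.1f}, actual {act}")
```

For a matching pair the independent estimate comes out too low, and for
a mismatched pair it comes out too high, which is the "higher density
for some pairs, lower for others" problem described above.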

#1 and #2 obviously need new infrastructure.

From a non-dev point of view it looks like #3 could use the existing
stats on each side of the join. I'm not sure whether you could do
anything meaningful for joins that don't explicitly specify one side of
the join though.
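
As a rough sketch of what I mean for #3, assuming the only inputs are
the per-column stats already gathered for the referencing column (a
most-common-values list with frequencies, and a distinct count, loosely
modelled on what pg_stats exposes). The function estimate_join_rows and
its parameters are hypothetical, not planner code.

```python
def estimate_join_rows(fk_total_rows, fk_mcv, fk_n_distinct, key=None):
    """Estimate how many referencing rows join to one referenced row.

    fk_total_rows -- row count of the referencing table
    fk_mcv        -- dict mapping most-common key values to their
                     frequency (fraction of rows) in the FK column
    fk_n_distinct -- number of distinct values in the FK column
    key           -- the referenced key, if the query pins one side
                     of the join to a known value; otherwise None
    """
    # The query names a key that is in the MCV list: use its real
    # frequency, which is what captures a "top-heavy" distribution.
    if key is not None and key in fk_mcv:
        return fk_mcv[key] * fk_total_rows

    # The query names a key outside the MCV list: spread the remaining
    # rows evenly over the remaining distinct values.
    remaining_distinct = fk_n_distinct - len(fk_mcv)
    if key is not None and remaining_distinct > 0:
        remaining_fraction = 1.0 - sum(fk_mcv.values())
        return remaining_fraction * fk_total_rows / remaining_distinct

    # Neither side is pinned down: all we have is the plain average.
    return fk_total_rows / fk_n_distinct


if __name__ == "__main__":
    mcv = {1: 0.6, 2: 0.2}  # customer 1 owns 60% of all orders
    print(estimate_join_rows(1000, mcv, 52, key=1))
    print(estimate_join_rows(1000, mcv, 52, key=99))
    print(estimate_join_rows(1000, mcv, 52))
```

The last branch is the case I'm unsure about: with no key specified,
the sketch can only fall back to the average, which hides the skew.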

> Perhaps just knowing whether there's a dependence between two data sets
> might be somewhat useful if the planner kept a confidence value for all its
> estimates. It would know to have a lower confidence value for estimates coming
> from highly dependent clauses? It wouldn't be very easy for the planner to
> distinguish "safe" plans for low confidence estimates and "risky" plans which
> might blow up if the estimates are wrong though. And of course that's a lot
> less interesting than just getting better estimates :)

If we could abort a plan and restart then we could just try the
quick-but-risky plan and if we reach 50 rows rather than the expected 10
try a different approach. That way we'd not need to gather stats, just
react to the situation in individual queries.
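
A minimal sketch of that abort-and-restart idea, assuming a plan can be
modelled as a callable that returns an iterator of rows. All the names
here (EstimateExceeded, run_with_row_limit, execute_adaptively, the
tolerance factor) are hypothetical; PostgreSQL's executor has no such
interface.

```python
class EstimateExceeded(Exception):
    """Raised when a plan produces far more rows than were estimated."""


def run_with_row_limit(plan, limit):
    """Run a plan, aborting as soon as it has produced `limit` rows.

    Rows are buffered rather than streamed, so nothing has been handed
    to the client if the plan has to be abandoned.
    """
    rows = []
    for row in plan():
        rows.append(row)
        if len(rows) >= limit:
            raise EstimateExceeded(f"reached {limit} rows")
    return rows


def execute_adaptively(risky_plan, safe_plan, expected_rows, tolerance=5):
    """Try the quick-but-risky plan first and fall back to the safe one.

    With expected_rows=10 and tolerance=5 the risky plan is abandoned
    once it reaches 50 rows, as in the example above.
    """
    try:
        rows = run_with_row_limit(risky_plan, expected_rows * tolerance)
        return rows, "risky"
    except EstimateExceeded:
        # Throw away the partial result and start over.
        return list(safe_plan()), "safe"


if __name__ == "__main__":
    def risky():
        return iter(range(100))  # stand-in for a plan that blows up

    def safe():
        return iter(range(100))  # stand-in for the conservative plan

    rows, used = execute_adaptively(risky, safe, expected_rows=10)
    print(used, len(rows))
```

The obvious cost is that the work done by the abandoned plan is wasted,
so this only pays off when the risky plan is much cheaper in the common
case.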

--
Richard Huxton
Archonet Ltd
