
From Tomas Vondra
Subject Re: proposal : cross-column stats
Msg-id 4D10EBFA.1080203@fuzzy.cz
In response to Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
List pgsql-hackers
On 21.12.2010 16:54, Florian Pflug wrote:
>> Hmmm, maybe we could give this possibility (to identify two separate
>> groups of columns) to the user. So instead of 'build stats for (A,B,C)' the
>> user would say 'build stats for (A,B) and (C)' - this actually represents
>> a priori knowledge of dependencies supplied by the user.
>>
>> In that case we could search for 'implicativeness' between those two
>> groups (and not within the groups), and we could restrict ourselves to 2D
>> (and thus use a more sophisticated formula).
> 
> Hm, I hated this idea at first, but I'm starting to like it more and more.
> It *does* seem rather unrealistic that a user would know that a bunch of
> columns are correlated, but have no idea in what way... 

Yes, that's true, although sometimes the dependency may be very
complicated. But let's restrict ourselves to 2D for now, build something
that solves this simplified case, and then we can discuss higher dimensions.

> Any examples where this'd be the case would be very much appreciated - maybe
> we should ask around on -general about this?

Well, I think the ZIP code example is a typical case of this - the users
know about the dependency between ZIP codes and cities. A natural
workaround would be to omit the dependent column from the query, but
that's not always possible (e.g. when an ORM is involved, building the
queries automatically).
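
Something like this (just a made-up example - the table, columns and
values are hypothetical):

  -- 'city' is implied by 'zip_code', yet the ORM emits both conditions.
  -- Treating them as independent, the planner estimates the combined
  -- selectivity as sel(zip_code) * sel(city), which is far too low.
  SELECT *
    FROM addresses
   WHERE zip_code = '12345'
     AND city = 'Springfield';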

>> But we should be able to do some basic analysis even when the user
>> supplies a list of columns without such a priori knowledge.
> 
> That, I think, overcomplicates things, at least for a first cut.
> 
> To summarize, I think you've shown quite nicely that the uniform bayesian
> approach is a very sensible first step towards better estimates in the case
> of correlated columns. It's statistically sound, and the dist(A,B) estimates
> it requires are probably a necessary ingredient of any solution to the problem.
> If we can make it degrade more gracefully if the columns are uncorrelated we
> should do that, but if we can't that's still no reason to drop the whole idea.

Agreed. IMHO the uncorrelated case is not a big concern, as the users
usually know when something is wrong with the columns. We should introduce
some kind of 'autodetect' eventually, but let's leave that for the future.

> So I guess we should turn our attention to how we'd obtain reasonably good estimates
> of dist(A,B), and return to the current discussion once the other pieces are in place.
> 
> I think that maybe it'd be acceptable to scan a large portion of the
> table to estimate dist(A,B), *if* we must do so only once in a while. But even with
> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
> and spilling them into, say, an on-disk hash table adds even more overhead to the already
> expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
> the spilling to disk, in exchange for a slightly less precise result...

I have no idea what a Bloom filter is (shame on me). I was not thinking
about collecting the stats; I was interested primarily in what data we
actually need. And my knowledge about the algorithms currently used
is very limited :-(

But I agree we should at least discuss the possible solutions. Until now
I've done something like this
  SELECT COUNT(DISTINCT a) AS dist_a,
         COUNT(DISTINCT b) AS dist_b,
         COUNT(DISTINCT a || ':' || b) AS dist_ab
    FROM my_table;

but that's not very efficient.
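
A slightly safer variant of the same query (just a sketch - comparing
composite rows instead of concatenated strings avoids separator
collisions and works for non-text columns, but it is still a full scan):

  -- e.g. a='x:', b='y' and a='x', b=':y' both concatenate to 'x::y',
  -- so the string version undercounts; the row comparison does not.
  SELECT COUNT(DISTINCT a)      AS dist_a,
         COUNT(DISTINCT b)      AS dist_b,
         COUNT(DISTINCT (a, b)) AS dist_ab
    FROM my_table;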

My plan for the near future (a few weeks) is to build a simple 'module'
with the ability to estimate the number of rows for a given condition.
This could actually be quite useful as a stand-alone contrib module, as
the users often ask how to get the number of rows quickly (usually for paging).

Counting the rows exactly may be quite slow when the query matches many
rows, even when there is an index - often much slower than the actual
query (which usually contains a small LIMIT).

An estimate is often sufficient, but 'pg_class.reltuples' does not
really work with conditions. So this might be an improvement ...
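
A minimal sketch of what I mean (the function name and the
EXPLAIN-parsing trick are just one possible implementation, not a
finished design) - ask the planner for its estimate instead of
executing the query:

  -- Hypothetical helper: returns the planner's row estimate for an
  -- arbitrary query by planning it (EXPLAIN, not EXPLAIN ANALYZE) and
  -- parsing 'rows=N' out of the top plan node. Cheap, but only as good
  -- as the statistics - which is exactly what this thread tries to improve.
  CREATE OR REPLACE FUNCTION estimate_row_count(query text)
  RETURNS bigint AS $$
  DECLARE
      plan_line text;
  BEGIN
      FOR plan_line IN EXECUTE 'EXPLAIN ' || query LOOP
          -- the first line describes the top node, e.g.
          -- Seq Scan on my_table (cost=0.00..35.50 rows=10 width=4)
          RETURN substring(plan_line FROM 'rows=([0-9]+)')::bigint;
      END LOOP;
  END;
  $$ LANGUAGE plpgsql;

  -- e.g. for paging:
  -- SELECT estimate_row_count('SELECT * FROM my_table WHERE a = 1');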

regards
Tomas

