WIP: cross column correlation, 2nd shot - Mailing list pgsql-hackers

From Hans-Jürgen Schönig
Subject WIP: cross column correlation, 2nd shot
Date
Msg-id D9D4375E-6B24-4BB2-916A-50208EF15B90@cybertec.at
Whole thread Raw
Responses Re: WIP: cross column correlation, 2nd shot  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
hello,

we (zoli, ants and myself) have put some more effort into our struggle to make cross column estimates better and more
reasonable.
this is supposed to fix problems related to columns which are not "independent" (such as "time of the year" and
temperatureand so on). 
this patch is far from ready but does some nice and useful stuff already:

Here's the cross-col patch against todays master branch. Also, a data
generator that creates a few data distributions for testing. Just run
it on an empty db, it will create 3 tables, expdist, normdist and
weigheddist. A good example would be:

EXPLAIN ANALYZE SELECT * FROM weigheddist WHERE a = 'a' AND b = 3; --
estimates 1 rows, actual 0
DROP CROSS COLUMN STATISTICS ON TABLE weigheddist (a, b);
EXPLAIN ANALYZE SELECT * FROM weigheddist WHERE a = 'a' AND b = 3; --
estimates 161k rows

Some notes:
* Currently only exact equality operators on one table are supported.
* The operators = ANY (ie blah IN (1,2,3)), != and IS NULL would be
quite simple to estimate with the current design.
* Range ops (var BETWEEN c1 AND c2) would require decent
multi-dimensional histogram construction. The previous MCV based
bucket selection mechanism created pretty useless histograms for cases
where there are more values than mcv buckets (should most cases where
range ops are useful). I found some papers with algorithms for good
multidimensional histogram construction.
* I'll look if there are any decent algorithms for "functional dependency"
stats (given tuples t1 and t2, calculate P(t1.b = t2.b|t1.a = t2.a)),
seems like a very useful stat to estimate joins.
* The stats catalog design seems top hacky, probably needs a good
rethink. Maybe a separate catalog for multi attribute stats. I'd like
to work out a decent roadmap where this feature is going before
redesigning it, otherwise multi table stats will require another
redesign.
* A decent set of real-worldish usecases would be really good to
double check if I'm on the right track.

one more thing:
ants has come up with the idea of making the stats collecting engine for more complicated multi-table-cross correlation
plugable;we should ship a default collector but allow people to put in some other functionality if needed (could make
sensefor some not-so-easy set of data and some nasty corner cases). any thoughts on that one? 

    many thanks,

        hans



--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Measuring relation free space
Next
From: "Kevin Grittner"
Date:
Subject: Re: patch for parallel pg_dump