Re: two dimensional statistics in Postgres - Mailing list pgsql-hackers

From Gavin Flower
Subject Re: two dimensional statistics in Postgres
Date
Msg-id 545B5291.6080807@archidevsys.co.nz
Whole thread Raw
In response to two dimensional statistics in Postgres  (Katharina Büchse<katharina.buechse@uni-jena.de>)
Responses Re: two dimensional statistics in Postgres  ("Tomas Vondra" <tv@fuzzy.cz>)
List pgsql-hackers
On 06/11/14 23:15, Katharina Büchse wrote:
> Hi,
>
> I'm a phd-student at the university of Jena, Thüringen, Germany, in 
> the field of data bases, more accurate query optimization.
> I want to implement a system in PostgreSQL that detects column 
> correlations and creates statistical data about correlated columns for 
> the optimizer. Therefore I need to store two dimensional statistics 
> (especially two dimensional histograms) in PostgreSQL.
> I had a look at the description of "WIP: multivariate statistics / 
> proof of concept", which looks really promising, I guess these 
> statistics are based on scans of the data? For my system I need both 
> -- statistical data based on table scans (actually, samples are 
> enough) and those based on query feedback. Query feedback (tuple 
> counts and, speaking a little inaccurately, the where-part of the 
> query itself) needs to be extracted and there needs to be a decision 
> for the optimizer, when to take multivariate statistics and when to 
> use the one dimensional ones. Oracle in this case just disables one 
> dimensional histograms if there is already a multidimensional 
> histogram, but this is not always useful, especially in the case of a 
> feedback based histogram (which might not cover the whole data space). 
> I want to use both kinds of histograms because correlations might 
> occur only in parts of the data. In this case a histogram based on a 
> sample of the whole table might not get the point and wouldn't help 
> for the part of the data the user seems to be interested in.
> There are special data structures for storing multidimensional 
> histograms based on feedback and I already tried to implement one of 
> these in C. In the case of two dimensions they are of course not "for 
> free" (one dimensional would be much cheaper), but based on the 
> principle of maximum entropy they deliver really good results. I 
> decided for only two dimensions because in this case we have the best 
> proportion of cost and benefit when searching for correlation (here 
> I'm relying on tests that were made in DB2 within a project called 
> CORDS which detects correlations even between different tables).
>
> I'd be grateful for any advices and discussions.
> Regards,
>
> Katharina
>
>
Could you store a 2 dimensional histogram in a one dimensional array: 
A[z] = value, where z = col * rowSize + row (zero starting index)?


Cheers,
Gavin





pgsql-hackers by date:

Previous
From: Etsuro Fujita
Date:
Subject: Typo in comment
Next
From: Andreas Karlsson
Date:
Subject: Re: B-Tree index builds, CLUSTER, and sortsupport