Re: two dimensional statistics in Postgres - Mailing list pgsql-hackers

From Katharina Büchse
Subject Re: two dimensional statistics in Postgres
Date
Msg-id 545CB8CA.8020908@uni-jena.de
Whole thread Raw
In response to Re: two dimensional statistics in Postgres  ("Tomas Vondra" <tv@fuzzy.cz>)
Responses Re: two dimensional statistics in Postgres  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
Ahoj ;-)

On 06.11.2014 11:56, Tomas Vondra wrote:
> Hi,
>
> Dne 6 Listopad 2014, 11:15, Katharina Büchse napsal(a):
>> Hi,
>>
>> I'm a phd-student at the university of Jena, Thüringen, Germany, in the
>> field of data bases, more accurate query optimization.
>> I want to implement a system in PostgreSQL that detects column
>> correlations and creates statistical data about correlated columns for
>> the optimizer. Therefore I need to store two dimensional statistics
>> (especially two dimensional histograms) in PostgreSQL.
> Cool!
>
>> I had a look at the description of "WIP: multivariate statistics / proof
>> of concept", which looks really promising, I guess these statistics are
>> based on scans of the data? For my system I need both -- statistical
> Yes, it's based on a sample of the data.
>
>> data based on table scans (actually, samples are enough) and those based
>> on query feedback. Query feedback (tuple counts and, speaking a little
>> inaccurately, the where-part of the query itself) needs to be extracted
>> and there needs to be a decision for the optimizer, when to take
>> multivariate statistics and when to use the one dimensional ones. Oracle
>> in this case just disables one dimensional histograms if there is
>> already a multidimensional histogram, but this is not always useful,
>> especially in the case of a feedback based histogram (which might not
>> cover the whole data space). I want to use both kinds of histograms
> What do you mean by not covering the whole data space? I assume that when
> building feedback-based histogram, parts of the data will be filtered out
> because of WHERE clauses etc. Is that what you mean? I don't see how this
> could happen for regular histograms, though.
Yes, you're right. Because of the where clauses, some parts of the data 
might be filtered out in feedback based histograms. This cannot happen 
in "regular" histograms, but as I mentioned -- I would like to use both 
kinds of histograms.
>
>> because correlations might occur only in parts of the data. In this case
>> a histogram based on a sample of the whole table might not get the point
>> and wouldn't help for the part of the data the user seems to be
>> interested in.
> Yeah, there may be dependencies that are difficult to spot in the whole
> dataset, but emerge once you filter to a specific subset.
>
> Now, how would that work in practice? Initially the query needs to be
> planned using regular stats (because there's no feedback yet), and then -
> when we decide the estimates are way off - may be re-planned using the
> feedback. The feedback is inherently query-specific, so I'm not sure if
> it's possible to reuse it for multiple queries (might be possible for
> "sufficiently similar" ones).
>
> Would this be done automatically for all queries / all conditions, or only
> when specifically enabled (on a table, columns, ...)?
The idea is the following: I want to find out correlations with two 
different algorithms, one scanning some samples of the data, the other 
analyzing query feedback. If both decide for a column combination that 
it's correlated, then there should be made a "regular histogram" for 
this combination. If only the "scanning"-algorithm says "correlated", 
then it means that there is some correlation, but this is not 
interesting for the user right now. So I would only "leave some note" 
for the optimizer that there is correlation and if the user interest 
changes and query results differ a lot from the estimates in the plan, 
then again -- "regular histogram". If only the "feedback"-algorithm 
decides that the columns are correlated, a histogram based on query 
feedback is the most useful choice to support the work of the optimizer.
>
>> There are special data structures for storing multidimensional
>> histograms based on feedback and I already tried to implement one of
>> these in C. In the case of two dimensions they are of course not "for
>> free" (one dimensional would be much cheaper), but based on the
>> principle of maximum entropy they deliver really good results. I decided
>> for only two dimensions because in this case we have the best proportion
>> of cost and benefit when searching for correlation (here I'm relying on
> I think hardcoding the two-dimensions limit is wrong. I understand higher
> number of dimensions means more expensive operation, but if the user can
> influence it, I believe it's OK.
I don't know whether the user has to decide whether the statistical data 
is based on feedback or on data scans. I guess it's enough if he gets 
his histograms in higher dimensions based on data scans.
>
> Also, is there any particular reason why not to support other kinds of
> stats (say, MCV lists)? In the end it's just a different way to
> approximate the distribution, and it may be way cheaper than histograms.
The reason actually is just that 1) I have only limited time and cannot 
cover every possibility to support the optimizer when there is 
correlation and 2) there are good papers about feedback based histograms :-D
>
>> tests that were made in DB2 within a project called CORDS which detects
>> correlations even between different tables).
> Is this somehow related to LEO? I'm not familiar with the details, but
> from the description it might be related.
actually, LEO is purely feedback based, while CORDS is using data scans. 
But some authors were involved in both projects I guess. LEO itself 
never made it to be fully integrated in DB2, but some parts of it did. 
What's interesting for me is the fact that in DB2 there's no possibility 
for multidimensional histograms.
>
> regards
> Tomas
>
Katharina



pgsql-hackers by date:

Previous
From: Etsuro Fujita
Date:
Subject: Re: inherit support for foreign tables
Next
From: Dimitri Fontaine
Date:
Subject: Re: New Event Trigger: table_rewrite