Re: two dimensional statistics in Postgres - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: two dimensional statistics in Postgres
Date
Msg-id 07b431d480db44208daa93cbaeb7a577.squirrel@2.emaily.eu
Whole thread Raw
In response to two dimensional statistics in Postgres  (Katharina Büchse<katharina.buechse@uni-jena.de>)
Responses Re: two dimensional statistics in Postgres  (Katharina Büchse<katharina.buechse@uni-jena.de>)
List pgsql-hackers
Hi,

Dne 6 Listopad 2014, 11:15, Katharina Büchse napsal(a):
> Hi,
>
> I'm a phd-student at the university of Jena, Thüringen, Germany, in the
> field of data bases, more accurate query optimization.
> I want to implement a system in PostgreSQL that detects column
> correlations and creates statistical data about correlated columns for
> the optimizer. Therefore I need to store two dimensional statistics
> (especially two dimensional histograms) in PostgreSQL.

Cool!

> I had a look at the description of "WIP: multivariate statistics / proof
> of concept", which looks really promising, I guess these statistics are
> based on scans of the data? For my system I need both -- statistical

Yes, it's based on a sample of the data.

> data based on table scans (actually, samples are enough) and those based
> on query feedback. Query feedback (tuple counts and, speaking a little
> inaccurately, the where-part of the query itself) needs to be extracted
> and there needs to be a decision for the optimizer, when to take
> multivariate statistics and when to use the one dimensional ones. Oracle
> in this case just disables one dimensional histograms if there is
> already a multidimensional histogram, but this is not always useful,
> especially in the case of a feedback based histogram (which might not
> cover the whole data space). I want to use both kinds of histograms

What do you mean by not covering the whole data space? I assume that when
building feedback-based histogram, parts of the data will be filtered out
because of WHERE clauses etc. Is that what you mean? I don't see how this
could happen for regular histograms, though.

> because correlations might occur only in parts of the data. In this case
> a histogram based on a sample of the whole table might not get the point
> and wouldn't help for the part of the data the user seems to be
> interested in.

Yeah, there may be dependencies that are difficult to spot in the whole
dataset, but emerge once you filter to a specific subset.

Now, how would that work in practice? Initially the query needs to be
planned using regular stats (because there's no feedback yet), and then -
when we decide the estimates are way off - may be re-planned using the
feedback. The feedback is inherently query-specific, so I'm not sure if
it's possible to reuse it for multiple queries (might be possible for
"sufficiently similar" ones).

Would this be done automatically for all queries / all conditions, or only
when specifically enabled (on a table, columns, ...)?

> There are special data structures for storing multidimensional
> histograms based on feedback and I already tried to implement one of
> these in C. In the case of two dimensions they are of course not "for
> free" (one dimensional would be much cheaper), but based on the
> principle of maximum entropy they deliver really good results. I decided
> for only two dimensions because in this case we have the best proportion
> of cost and benefit when searching for correlation (here I'm relying on

I think hardcoding the two-dimensions limit is wrong. I understand higher
number of dimensions means more expensive operation, but if the user can
influence it, I believe it's OK.

Also, is there any particular reason why not to support other kinds of
stats (say, MCV lists)? In the end it's just a different way to
approximate the distribution, and it may be way cheaper than histograms.

> tests that were made in DB2 within a project called CORDS which detects
> correlations even between different tables).

Is this somehow related to LEO? I'm not familiar with the details, but
from the description it might be related.

regards
Tomas




pgsql-hackers by date:

Previous
From: Andreas Karlsson
Date:
Subject: Re: B-Tree index builds, CLUSTER, and sortsupport
Next
From: "Tomas Vondra"
Date:
Subject: Re: two dimensional statistics in Postgres