Re: two dimensional statistics in Postgres - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: two dimensional statistics in Postgres
Date
Msg-id 545E75C7.40202@fuzzy.cz
Whole thread Raw
In response to two dimensional statistics in Postgres  (Katharina Büchse<katharina.buechse@uni-jena.de>)
List pgsql-hackers
On 8.11.2014 15:40, Katharina Büchse wrote:
> I'm sorry if I repeated myself too often, I somehow started answering
> at the end of your mail and then went up... I promise to do this
> better next time.

Nah, no problem. Better say something twice than not at all ;-)

However, I see you've responded to me directly (not through the
pgsql-hackers list). I assume that's not on purpose, so I'm adding the
list back into the loop ...

> On 07.11.2014 20:37, Tomas Vondra wrote:
>> On 7.11.2014 13:19, Katharina Büchse wrote:
>>> On 06.11.2014 11:56, Tomas Vondra wrote:
>>>> Dne 6 Listopad 2014, 11:15, Katharina Büchse napsal(a):
>>>>> because correlations might occur only in parts of the data.
>>>>> In this case a histogram based on a sample of the whole table
>>>>> might not get the point and wouldn't help for the part of the
>>>>> data the user seems to be interested in.
>>>> Yeah, there may be dependencies that are difficult to spot in the whole
>>>> dataset, but emerge once you filter to a specific subset.
>>>>
>>>> Now, how would that work in practice? Initially the query needs
>>>> to be planned using regular stats (because there's no feedback
>>>> yet), and then - when we decide the estimates are way off - may
>>>> be re-planned using the feedback. The feedback is inherently
>>>> query-specific, so I'm not sure if it's possible to reuse it
>>>> for multiple queries (might be possible for "sufficiently
>>>> similar" ones).
>>>>
>>>> Would this be done automatically for all queries / all conditions, or
>>>> only when specifically enabled (on a table, columns, ...)?
>>>
>>> The idea is the following: I want to find out correlations with
>>> two different algorithms, one scanning some samples of the data,
>>> the other analyzing query feedback.
>>>
>> So you're starting with the default (either single or
>> multivariate) statistics, computed by ANALYZE from a sample
>> (covering the whole table).
>
> yes
>
>> And then compute matching statistics while running the query (i.e.
>> a sample filtered by the WHERE clauses)?
>
> well, not really... I would like to emphasize that my systems
> consists of two parts:
>
> 1) finding correlations automatically
> 2) creating histograms for correlated columns.
>
> In (1) we use two different approaches, one of them feedback based, but
> even this approach has to collect several feedback data to be able to
> decide if there's correlation or not. So "while running the query" is
> not 100% correct. While running the query I would extract the feedback
> data the algorithm needs, which is the tuple count and the constraint on
> the data that was done in the query. Constraints should look like
> "columnA = 'a' and columnB = 'b'" and the more different queries we have
> with different constraints, the better it is. And yes, when this
> algorithm starts analyzing, there needs to be a check done for choosing
> consistent feedback data. So if there were several queries on two
> columns of a table, but the data changed while these queries took place,
> we cannot use all of the feedback we have.

OK, thanks for the explanation!

>>> If both decide for a column combination that it's correlated,
>>> then there should be made a "regular histogram" for this
>>> combination. If only the "scanning"-algorithm says "correlated",
>>> then it means that there is some correlation, but this is not
>>> interesting for the user right now.
>>>
>> Isn't it sufficient to simply compare the estimated and observed
>> number of rows?
>
> This sounds like the easiest way, but has many disadvantages. Just 
> imagine, that the statistical information is based on data that
> already changed. The estimate could be totally wrong (even if we
> automatically update statistics after a change of, let's say, 10%,
> this might happen, because the user could ask for exactly the part
> which changed, and if there was "only" a change of maybe 8%, the
> histogram would still be the old one), but this has nothing to do
> with correlation. If we then always decided to build a
> multidimensional histogram, it would mean to do unnecessary work,
> because creating and maintain multidimensional histograms is more
> expensive then one dimensional ones.

Good point. IMHO stale stats are a problem in general, and here it may
clearly cause "false positives" if the algorithm is not careful enough.

> But if an algorithm (which checked the data for correlations) says,
> that there really is correlation, the fact, that estimates and query
> results differ a lot from one another could be a good occasion to
> really create a histogram. Of course this decision could be based on
> the same "mistake" that I just described, but we already limited this
> wrong decision to the case that one algorithm "voted" for
> correlation.

Understood. I believe I understand the general goal, although I don't
have a clear idea how to implement that, or how complex it could get.
But I guess that's not a problem ... clearly you have a plan ;-)

>>> So I would only "leave some note" for the optimizer that there
>>> is correlation and if the user interest changes and query results
>>> differ a lot from the estimates in the plan, then again --
>>> "regular histogram". If only the "feedback"-algorithm decides
>>> that the columns are correlated, a histogram based on query
>>> feedback is the most useful choice to support the work of the
>>> optimizer.
>>
>> So the check is peformed only when the query completes, and if
>> there's a mismatch then you put somewhere a note that those columns
>> are correlated?
>
> No, I would want the algorithm (which is now the one based on
> samples) to store his note as soon as he finds out correlation (which
> he does by analyzing -- doing some magic mathematics -- samples of
> the data). The thing is -- if the other algorithm, which checks of
> correlation by analyzing (also magic mathematics...) feedback data,
> doesn't vote for correlation, we don't need the histogram right now
> because the user is not interested in the correlated part of the
> table. But there must be a note that histograms could be necessary in
> the future.

OK. I quickly skimmed through the ISOMER paper (from ICDE 2006), and
this seems to match the "Figure 1", with steps like
* add new feedback* detect inconsistent feedback* eliminate unimportant feedback* compute final histogram

Seems quite close to what's discussed in this thread, correct?

>> I think what exactly is stored in the "note" is crucial. If it's
>> only a list of columns and "iscorrelated" flag, then I'm not sure
>> how is the optimizer going to use that directly. I.e. without
>> actual histogram or at least corrected estimates.
>
> As I mentioned -- the optimizer won't use this note directly,
> because right now he doesn't need it. Of course we suppose that users
> are lazy and don't change their interest too often. But if they do
> and estimates are starting to get worse, then the "iscorrelated" flag
> tells us that we should create a histogram as soon as possible.

OK, understood.

>> Actually, I'm starting to wonder whether I understand the idea. I'm
>> aware of the following two kinds of "feedback histograms":
>>
>>   a) building the full histogram from query feedback (STGrid/STHoles)
>>
>>      The main goal is to build and refine a histogram, without
>>      examining the data set directly (not even sampling it).
>>
>>   b) optimizing the set of histograms wrt. to workload (SASH)
>>
>>      Decides what histograms to build, with respect to the observed
>>      workload (i.e. executed queries).
>>
>> Is the presented similar to one of these, or rather something different?
>>
>> Actually, the "Introduction" in the CORDS paper says this:
>>
>>     CORDS is a data-driven tool that automatically discovers
>>     correlations and soft functional dependencies (fds) between pairs
>>     of columns and, based on these relationships, recommends a set of
>>     statistics for the query optimizer to maintain.
>>
>> which seems like the option (b). I haven't read the rest of the paper,
>> though.
>>
> It's something in between, I would say. I would like to use ISOMER 
> (which is a further development of STHoles) to create feedback based 
> histograms, but the decision, which histograms to build, would rely
> on algorithms to find out correlations. One of them is
> feedback-based....

OK, thanks for the clarifications.

>>> I don't know whether the user has to decide whether the
>>> statistical data is based on feedback or on data scans. I guess
>>> it's enough if he gets his histograms in higher dimensions based
>>> on data scans.
>>
>> I wasn't talking about deciding whether to use regular or feedback
>> stats (although I believe features like this should be opt-in), but
>> about tweaking the number of dimensions.
>>
>> For example imagine there are three columns [a,b,c] and you know
>> the data is somehow correlated. Is it better to build three
>> 2-dimensional feedback histograms (a,b), (a,c) and (b,c), or a
>> single 3-dimensional histogram (a,b,c) or maybe something else?
>
> that's a good question. When the correlation really is in all three 
> columns (for example, if we have a look at an employee table where
> there is information about the car an employee is using, this might
> be depended of his income and his family status), of course it should
> be better to have a three dimensional histogram. And if the user was
> able to point out these dependencies, it would be a pitty if there
> wasn't any possibility in the database system to store this
> information and create a histogram for all these columns. But if we
> talk about finding out these dependencies automatically (and that's
> my aim), there will be so (or even too) many possibilities to combine
> 3 columns....

I think that 'too many possibilities' really depends on the context. For
example we're working with hundreds of gigabytes of data, and
misestimates are a big issue for us, occasionally causing queries to run
for hours instead of seconds. We're perfectly fine with spending a few
more minutes by analyzing the stats / planning, because the gains
outweight the expenses. But clearly that's not a universal truth, which
is why I was asking about making it possible to tweak this.

That being said, I'm perfectly fine with limiting the scope of the
effort by explicitly supporting just 2 dimensions.

>>>>> tests that were made in DB2 within a project called CORDS
>>>>> which detects correlations even between different tables).
>>>>
>>>> Is this somehow related to LEO? I'm not familiar with the
>>>> details, but from the description it might be related.
>>>
>>> actually, LEO is purely feedback based, while CORDS is using
>>> data scans. But some authors were involved in both projects I
>>> guess. LEO itself never made it to be fully integrated in DB2,
>>> but some parts of it did. What's interesting for me is the fact
>>> that in DB2 there's no possibility for multidimensional
>>> histograms.
>>
>> OK, so the point is to optimize the set of histograms, similar to
>> what CORDS does, but based on feedback. Correct?
>>
>> Tomas
>
> Actually, the point is to help the optimizer to decide for better
> plans and therefore first find correlations (in two different ways -
> with feedback and with Cords) and then create new histograms.

I think that's what I meant by "optimizing the set of histograms" (i.e.
creating new histograms, etc.).

regards
Tomas



pgsql-hackers by date:

Previous
From: Sehrope Sarkuni
Date:
Subject: Re: Fw: [GENERAL] PLV8 and JS exports / referencing
Next
From: Josh Berkus
Date:
Subject: Re: pg_multixact not getting truncated