Re: two dimensional statistics in Postgres - Mailing list pgsql-hackers
| From | Tomas Vondra | 
|---|---|
| Subject | Re: two dimensional statistics in Postgres | 
| Date | |
| Msg-id | 545E75C7.40202@fuzzy.cz Whole thread Raw | 
| In response to | two dimensional statistics in Postgres (Katharina Büchse<katharina.buechse@uni-jena.de>) | 
| List | pgsql-hackers | 
On 8.11.2014 15:40, Katharina Büchse wrote: > I'm sorry if I repeated myself too often, I somehow started answering > at the end of your mail and then went up... I promise to do this > better next time. Nah, no problem. Better say something twice than not at all ;-) However, I see you've responded to me directly (not through the pgsql-hackers list). I assume that's not on purpose, so I'm adding the list back into the loop ... > On 07.11.2014 20:37, Tomas Vondra wrote: >> On 7.11.2014 13:19, Katharina Büchse wrote: >>> On 06.11.2014 11:56, Tomas Vondra wrote: >>>> Dne 6 Listopad 2014, 11:15, Katharina Büchse napsal(a): >>>>> because correlations might occur only in parts of the data. >>>>> In this case a histogram based on a sample of the whole table >>>>> might not get the point and wouldn't help for the part of the >>>>> data the user seems to be interested in. >>>> Yeah, there may be dependencies that are difficult to spot in the whole >>>> dataset, but emerge once you filter to a specific subset. >>>> >>>> Now, how would that work in practice? Initially the query needs >>>> to be planned using regular stats (because there's no feedback >>>> yet), and then - when we decide the estimates are way off - may >>>> be re-planned using the feedback. The feedback is inherently >>>> query-specific, so I'm not sure if it's possible to reuse it >>>> for multiple queries (might be possible for "sufficiently >>>> similar" ones). >>>> >>>> Would this be done automatically for all queries / all conditions, or >>>> only when specifically enabled (on a table, columns, ...)? >>> >>> The idea is the following: I want to find out correlations with >>> two different algorithms, one scanning some samples of the data, >>> the other analyzing query feedback. >>> >> So you're starting with the default (either single or >> multivariate) statistics, computed by ANALYZE from a sample >> (covering the whole table). > > yes > >> And then compute matching statistics while running the query (i.e. >> a sample filtered by the WHERE clauses)? > > well, not really... I would like to emphasize that my systems > consists of two parts: > > 1) finding correlations automatically > 2) creating histograms for correlated columns. > > In (1) we use two different approaches, one of them feedback based, but > even this approach has to collect several feedback data to be able to > decide if there's correlation or not. So "while running the query" is > not 100% correct. While running the query I would extract the feedback > data the algorithm needs, which is the tuple count and the constraint on > the data that was done in the query. Constraints should look like > "columnA = 'a' and columnB = 'b'" and the more different queries we have > with different constraints, the better it is. And yes, when this > algorithm starts analyzing, there needs to be a check done for choosing > consistent feedback data. So if there were several queries on two > columns of a table, but the data changed while these queries took place, > we cannot use all of the feedback we have. OK, thanks for the explanation! >>> If both decide for a column combination that it's correlated, >>> then there should be made a "regular histogram" for this >>> combination. If only the "scanning"-algorithm says "correlated", >>> then it means that there is some correlation, but this is not >>> interesting for the user right now. >>> >> Isn't it sufficient to simply compare the estimated and observed >> number of rows? > > This sounds like the easiest way, but has many disadvantages. Just > imagine, that the statistical information is based on data that > already changed. The estimate could be totally wrong (even if we > automatically update statistics after a change of, let's say, 10%, > this might happen, because the user could ask for exactly the part > which changed, and if there was "only" a change of maybe 8%, the > histogram would still be the old one), but this has nothing to do > with correlation. If we then always decided to build a > multidimensional histogram, it would mean to do unnecessary work, > because creating and maintain multidimensional histograms is more > expensive then one dimensional ones. Good point. IMHO stale stats are a problem in general, and here it may clearly cause "false positives" if the algorithm is not careful enough. > But if an algorithm (which checked the data for correlations) says, > that there really is correlation, the fact, that estimates and query > results differ a lot from one another could be a good occasion to > really create a histogram. Of course this decision could be based on > the same "mistake" that I just described, but we already limited this > wrong decision to the case that one algorithm "voted" for > correlation. Understood. I believe I understand the general goal, although I don't have a clear idea how to implement that, or how complex it could get. But I guess that's not a problem ... clearly you have a plan ;-) >>> So I would only "leave some note" for the optimizer that there >>> is correlation and if the user interest changes and query results >>> differ a lot from the estimates in the plan, then again -- >>> "regular histogram". If only the "feedback"-algorithm decides >>> that the columns are correlated, a histogram based on query >>> feedback is the most useful choice to support the work of the >>> optimizer. >> >> So the check is peformed only when the query completes, and if >> there's a mismatch then you put somewhere a note that those columns >> are correlated? > > No, I would want the algorithm (which is now the one based on > samples) to store his note as soon as he finds out correlation (which > he does by analyzing -- doing some magic mathematics -- samples of > the data). The thing is -- if the other algorithm, which checks of > correlation by analyzing (also magic mathematics...) feedback data, > doesn't vote for correlation, we don't need the histogram right now > because the user is not interested in the correlated part of the > table. But there must be a note that histograms could be necessary in > the future. OK. I quickly skimmed through the ISOMER paper (from ICDE 2006), and this seems to match the "Figure 1", with steps like * add new feedback* detect inconsistent feedback* eliminate unimportant feedback* compute final histogram Seems quite close to what's discussed in this thread, correct? >> I think what exactly is stored in the "note" is crucial. If it's >> only a list of columns and "iscorrelated" flag, then I'm not sure >> how is the optimizer going to use that directly. I.e. without >> actual histogram or at least corrected estimates. > > As I mentioned -- the optimizer won't use this note directly, > because right now he doesn't need it. Of course we suppose that users > are lazy and don't change their interest too often. But if they do > and estimates are starting to get worse, then the "iscorrelated" flag > tells us that we should create a histogram as soon as possible. OK, understood. >> Actually, I'm starting to wonder whether I understand the idea. I'm >> aware of the following two kinds of "feedback histograms": >> >> a) building the full histogram from query feedback (STGrid/STHoles) >> >> The main goal is to build and refine a histogram, without >> examining the data set directly (not even sampling it). >> >> b) optimizing the set of histograms wrt. to workload (SASH) >> >> Decides what histograms to build, with respect to the observed >> workload (i.e. executed queries). >> >> Is the presented similar to one of these, or rather something different? >> >> Actually, the "Introduction" in the CORDS paper says this: >> >> CORDS is a data-driven tool that automatically discovers >> correlations and soft functional dependencies (fds) between pairs >> of columns and, based on these relationships, recommends a set of >> statistics for the query optimizer to maintain. >> >> which seems like the option (b). I haven't read the rest of the paper, >> though. >> > It's something in between, I would say. I would like to use ISOMER > (which is a further development of STHoles) to create feedback based > histograms, but the decision, which histograms to build, would rely > on algorithms to find out correlations. One of them is > feedback-based.... OK, thanks for the clarifications. >>> I don't know whether the user has to decide whether the >>> statistical data is based on feedback or on data scans. I guess >>> it's enough if he gets his histograms in higher dimensions based >>> on data scans. >> >> I wasn't talking about deciding whether to use regular or feedback >> stats (although I believe features like this should be opt-in), but >> about tweaking the number of dimensions. >> >> For example imagine there are three columns [a,b,c] and you know >> the data is somehow correlated. Is it better to build three >> 2-dimensional feedback histograms (a,b), (a,c) and (b,c), or a >> single 3-dimensional histogram (a,b,c) or maybe something else? > > that's a good question. When the correlation really is in all three > columns (for example, if we have a look at an employee table where > there is information about the car an employee is using, this might > be depended of his income and his family status), of course it should > be better to have a three dimensional histogram. And if the user was > able to point out these dependencies, it would be a pitty if there > wasn't any possibility in the database system to store this > information and create a histogram for all these columns. But if we > talk about finding out these dependencies automatically (and that's > my aim), there will be so (or even too) many possibilities to combine > 3 columns.... I think that 'too many possibilities' really depends on the context. For example we're working with hundreds of gigabytes of data, and misestimates are a big issue for us, occasionally causing queries to run for hours instead of seconds. We're perfectly fine with spending a few more minutes by analyzing the stats / planning, because the gains outweight the expenses. But clearly that's not a universal truth, which is why I was asking about making it possible to tweak this. That being said, I'm perfectly fine with limiting the scope of the effort by explicitly supporting just 2 dimensions. >>>>> tests that were made in DB2 within a project called CORDS >>>>> which detects correlations even between different tables). >>>> >>>> Is this somehow related to LEO? I'm not familiar with the >>>> details, but from the description it might be related. >>> >>> actually, LEO is purely feedback based, while CORDS is using >>> data scans. But some authors were involved in both projects I >>> guess. LEO itself never made it to be fully integrated in DB2, >>> but some parts of it did. What's interesting for me is the fact >>> that in DB2 there's no possibility for multidimensional >>> histograms. >> >> OK, so the point is to optimize the set of histograms, similar to >> what CORDS does, but based on feedback. Correct? >> >> Tomas > > Actually, the point is to help the optimizer to decide for better > plans and therefore first find correlations (in two different ways - > with feedback and with Cords) and then create new histograms. I think that's what I meant by "optimizing the set of histograms" (i.e. creating new histograms, etc.). regards Tomas
pgsql-hackers by date: