Re: Multidimensional Histograms - Mailing list pgsql-hackers
From: Alexander Cheshev
Subject: Re: Multidimensional Histograms
Msg-id: CAN_hQmvq-1XWZuxAEA+MHy0XGvmNxTc7002yFjmH2=CarrcQFg@mail.gmail.com
In response to: Re: Multidimensional Histograms (Alexander Cheshev <alex.cheshev@gmail.com>)
List: pgsql-hackers
Hi Tomas,

I am sorry I didn't look into the code carefully. Indeed Postgres uses
an Equi-Depth Histogram:

delta = (nvals - 1) / (num_hist - 1);
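
For reference, the bounds then come out of the sorted sample roughly
like this (a simplified sketch only; the actual code in analyze.c does
the same walk in integer arithmetic and carries the fractional
remainder of delta separately):

/*
 * Simplified sketch of equi-depth boundary selection.  "values" is
 * the sorted sample of nvals items; we pick num_hist boundaries so
 * that each adjacent pair of boundaries encloses roughly the same
 * number of sample values.
 */
static void
pick_histogram_bounds(const double *values, int nvals,
                      double *bounds, int num_hist)
{
    double delta = (double) (nvals - 1) / (num_hist - 1);
    int    i;

    for (i = 0; i < num_hist; i++)
        bounds[i] = values[(int) (i * delta)];
}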
Regards,
Alexander Cheshev

On Sat, 6 Jan 2024 at 01:00, Alexander Cheshev <alex.cheshev@gmail.com> wrote:
>
> Hi Tomas,
>
> > Another reason was that the algorithm described in the two papers you
> > reference (1988 paper by DeWitt and the evaluation by Carlson and
> > Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> > processes the columns one by one - first build bucket on column "a",
> > then splits each bucket into buckets on "b". So it's not symmetrical,
> > and it results in lower accuracy compared to an "ideal" histogram with
> > the same number of buckets (especially for the dimensions split early).
>
> As stated in [1], the Sum Squared Error (SSE) is one of the most
> natural error metrics. An Equi-Depth Histogram is not ideal because it
> doesn't minimize the SSE. A V-Optimal Histogram, on the other hand,
> does minimize the SSE and from this point of view can be considered an
> ideal solution.
>
> > This does indeed produce an equi-depth histogram, which seems nice, but
> > the mesh is regular in such a way that any value of the first dimension
> > intersects all buckets on the second dimension. So for example with a
> > histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> > 100 buckets on "b", each representing 1/10000 of tuples. But we don't
> > know how the tuples are distributed in the bucket - so we end up using
> > 0.5 of the bucket as unbiased estimate, but that can easily add-up in
> > the wrong dimension.
>
> Suppose that we have a query box a=X and we know how data is
> distributed within the buckets. Let's consider only the buckets which
> are intersected by the query box a=X. As we know how data is
> distributed within the buckets, we can exclude the buckets which
> contain no tuples intersecting the query box a=X.
>
> As we store buckets with no information about the data distribution,
> we have to make reasonable assumptions. If we keep the buckets
> relatively small then we can assume that data is uniformly distributed
> within each bucket.
>
> What I am trying to say is that the problem which you pointed out
> comes from the fact that we store buckets with no information about
> the data distribution. Even in the one-dimensional case we have to
> assume how data is distributed within the buckets. By the way,
> Postgres assumes that data is uniformly distributed within buckets.
>
> > However, this is not the only way to build an equi-depth histogram,
> > there are ways to make it more symmetrical. Also, it's not clear
> > equi-depth histograms are ideal with multiple columns. There are
> > papers proposing various other types of histograms (using different
> > criteria to build buckets optimizing a different metric). The most
> > interesting one seems to be V-Optimal histograms - see for example
> > papers [1], [2], [3], [4], [5] and [6]. I'm sure there are more. The
> > drawback of course is that it's more expensive to build such
> > histograms.
>
> Tomas, thank you for sharing your ideas regarding the V-Optimal
> Histogram with me. I read through the papers which you gave me and
> came to the following conclusion.
>
> The problem can be formulated like this. We have N tuples in an
> M-dimensional space. We need to partition the space into buckets
> iteratively until the SSE is less than E or we reach the limit of B
> buckets.
>
> In the M-dimensional case this appears to be an NP-hard problem. A
> greedy heuristic, MHIST-2, is proposed in [2]. First we sort the N
> tuples in ascending order. Then we iteratively select the bucket whose
> split leads to the largest SSE reduction and split it into two parts.
> We repeat the process until the SSE is less than E or we reach the
> limit of B buckets.
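
To make the greedy step above concrete, here is a rough one-dimensional
sketch of it in C (all names are illustrative, and the naive SSE scans
are for clarity only, so this does not meet the complexity bound
discussed below):

/*
 * Rough 1-D sketch of the greedy split loop: keep splitting the
 * bucket whose best split point yields the largest SSE reduction,
 * until the total SSE drops below E or we hold B buckets.
 */
typedef struct
{
    int lo;             /* first value index, inclusive */
    int hi;             /* last value index, exclusive */
} Bucket;

/* SSE of the sorted values vals[lo..hi) around the bucket mean */
static double
bucket_sse(const double *vals, int lo, int hi)
{
    double sum = 0.0, mean, sse = 0.0;
    int    i;

    for (i = lo; i < hi; i++)
        sum += vals[i];
    mean = sum / (hi - lo);
    for (i = lo; i < hi; i++)
        sse += (vals[i] - mean) * (vals[i] - mean);
    return sse;
}

/* Returns the number of buckets built; "buckets" must hold B entries. */
static int
greedy_voptimal(const double *vals, int n, Bucket *buckets, int B, double E)
{
    int nbuckets = 1;

    buckets[0].lo = 0;
    buckets[0].hi = n;

    while (nbuckets < B)
    {
        double total_sse = 0.0, best_gain = 0.0;
        int    best_b = -1, best_split = -1, b, s;

        for (b = 0; b < nbuckets; b++)
        {
            double before = bucket_sse(vals, buckets[b].lo, buckets[b].hi);

            total_sse += before;
            for (s = buckets[b].lo + 1; s < buckets[b].hi; s++)
            {
                double gain = before
                    - bucket_sse(vals, buckets[b].lo, s)
                    - bucket_sse(vals, s, buckets[b].hi);

                if (gain > best_gain)
                {
                    best_gain = gain;
                    best_b = b;
                    best_split = s;
                }
            }
        }

        if (total_sse <= E || best_b < 0)
            break;          /* SSE target met, or no useful split left */

        buckets[nbuckets].lo = best_split;  /* right half of the split */
        buckets[nbuckets].hi = buckets[best_b].hi;
        buckets[best_b].hi = best_split;    /* left half keeps its slot */
        nbuckets++;
    }
    return nbuckets;
}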
> If we assume that B is significantly less than N, then the time
> complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
> equals 3, B equals 1000 and N equals 300*B; then it will take slightly
> over 0.9*10^9 iterations to build a V-Optimal Histogram.
>
> You can see that we have to keep B as low as possible in order to
> build a V-Optimal Histogram in feasible time. And here is a paradox.
> On the one hand we use a V-Optimal Histogram in order to minimize the
> SSE, but on the other hand we have to keep B as low as possible, which
> eventually increases the SSE.
>
> In contrast, the time required to build an Equi-Depth Histogram
> doesn't depend on B and can be estimated as O(M*N*logN). The SSE can
> be arbitrarily reduced by increasing B, which in turn is only limited
> by the storage budget. Experimental results show a low error metric
> [3].
>
> In an Equi-Depth Histogram a bucket is represented by two vectors. The
> first vector points to the bottom-left corner of the bucket and the
> other one points to the top-right corner. Thus the space complexity of
> an Equi-Depth Histogram can be estimated as 2*integer_size*M*B. Assume
> that M equals 3, B equals 1000 and integer_size equals 4 bytes; then
> the Equi-Depth Histogram will occupy 24000 bytes.
>
> If a bucket is partially intersected by a query box then we assume
> that data is uniformly distributed inside the bucket. It is a
> reasonable assumption if B is relatively large.
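
As an illustration of that assumption, here is a sketch (illustrative
code, not from any existing patch) of reading an estimate off such a
histogram: each equi-depth bucket holds roughly 1/B of the tuples, and
a partially intersected bucket contributes the fraction of its volume
covered by the query box:

/*
 * Estimate the selectivity of a query box against an M-dimensional
 * equi-depth histogram under the uniformity assumption.
 */
#define M 3                     /* dimensions, matching the example above */

typedef struct
{
    double lo[M];               /* bottom-left corner */
    double hi[M];               /* top-right corner */
} Box;

/* Fraction of the bucket's volume covered by the query box */
static double
overlap_fraction(const Box *bucket, const Box *query)
{
    double frac = 1.0;
    int    d;

    for (d = 0; d < M; d++)
    {
        double lo = bucket->lo[d] > query->lo[d] ? bucket->lo[d] : query->lo[d];
        double hi = bucket->hi[d] < query->hi[d] ? bucket->hi[d] : query->hi[d];

        if (hi <= lo)
            return 0.0;         /* bucket and query box do not intersect */
        frac *= (hi - lo) / (bucket->hi[d] - bucket->lo[d]);
    }
    return frac;
}

/* Expected fraction of tuples matching the query box */
static double
estimate_selectivity(const Box *buckets, int nbuckets, const Box *query)
{
    double sel = 0.0;
    int    b;

    for (b = 0; b < nbuckets; b++)
        sel += overlap_fraction(&buckets[b], query) / nbuckets;
    return sel;
}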
> In order to efficiently find the buckets which intersect a query box
> we can store the Equi-Depth Histogram in an R-tree, as proposed in
> [3]. On average it takes O(logB) iterations to find the buckets which
> intersect a query box. As the storage requirements are dominated by
> the leaf nodes, we can assume that the R-tree takes slightly more than
> 2*integer_size*M*B bytes.
>
> > IIRC the patch tried to do something like V-optimal histograms, but
> > using a greedy algorithm and not the exhaustive stuff described in
> > the various papers.
>
> We should only consider computationally tractable solutions. In the
> one-dimensional case a V-Optimal Histogram is probably a good
> solution, but in the multidimensional case I would only consider
> Equi-Width or Equi-Depth Histograms. As stated in multiple papers, an
> Equi-Depth Histogram proves to be more accurate than an Equi-Width
> Histogram. By the way, Postgres uses something like an Equi-Width
> Histogram.
>
> > FWIW I did not intend to reject the idea of adding multi-dimensional
> > histograms, but rather to provide some insight into the history of
> > the past attempt, and also point some weaknesses of the algorithm
> > described in the 1988 paper. If you choose to work on this, I can do
> > a review etc.
>
> Thank you very much, Tomas. I am new to the community and I definitely
> didn't expect to have such a warm welcome.
>
> As I indicated above, an Equi-Depth Histogram proves to be more
> accurate than an Equi-Width Histogram, and both have the same time and
> space requirements. Postgres uses some sort of Equi-Width Histogram.
> Suppose that:
> * I create a patch which replaces the Equi-Width Histogram with an
>   Equi-Depth Histogram, but only in the 1-dimensional case.
> * I show experimental results which demonstrate an improvement in
>   selectivity estimation.
> Then will the patch be accepted by the community?
>
> If the above patch is accepted by the community then I will proceed
> further with the M-dimensional Equi-Depth Histogram...
>
> [1] https://www.vldb.org/conf/1998/p275.pdf
> [2] https://www.vldb.org/conf/1997/P486.PDF
> [3] https://dl.acm.org/doi/pdf/10.1145/50202.50205
>
> Regards,
> Alexander Cheshev
>
> On Thu, 28 Dec 2023 at 15:25, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 12/27/23 22:19, Tomas Vondra wrote:
> > > Hello Alexander,
> > >
> > > We actually had histograms in the early patches adding multivariate
> > > statistics [1]. We however ended up removing histograms and only
> > > kept the simpler types, for a couple reasons.
> > >
> > > It might be worth going through the discussion - I'm sure one of
> > > the reasons was we needed to limit the scope of the patch, and
> > > histograms were much more complex and possibly less useful compared
> > > to the other statistics types.
> > >
> > > ...
> >
> > FWIW I did not intend to reject the idea of adding multi-dimensional
> > histograms, but rather to provide some insight into the history of
> > the past attempt, and also point some weaknesses of the algorithm
> > described in the 1988 paper. If you choose to work on this, I can do
> > a review etc.
> >
> > regards
> >
> > --
> > Tomas Vondra
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company