Re: Multidimensional Histograms - Mailing list pgsql-hackers
From: Alexander Cheshev
Subject: Re: Multidimensional Histograms
Date:
Msg-id: CAN_hQmvOzA0CRjjDzyoOh_LZwHeQQNnZTK=whF4sKh7DRWnQjQ@mail.gmail.com
In response to: Re: Multidimensional Histograms (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Multidimensional Histograms; Re: Multidimensional Histograms
List: pgsql-hackers
Hi Tomas,

> Another reason was that the algorithm described in the two papers you
> reference (1988 paper by DeWitt and the evaluation by Carlson and
> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> processes the columns one by one - first build bucket on column "a",
> then splits each bucket into buckets on "b". So it's not symmetrical,
> and it results in lower accuracy compared to an "ideal" histogram with
> the same number of buckets (especially for the dimensions split early).

As stated in [1], Sum Squared Error (SSE) is one of the most natural error metrics. An Equi-Depth Histogram is not ideal because it doesn't minimize the SSE. On the other hand, a V-Optimal Histogram does minimize the SSE and from this point of view can be considered an ideal solution.

> This does indeed produce an equi-depth histogram, which seems nice, but
> the mesh is regular in such a way that any value of the first dimension
> intersects all buckets on the second dimension. So for example with a
> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> 100 buckets on "b", each representing 1/10000 of tuples. But we don't
> know how the tuples are distributed in the tuple - so we end up using
> 0.5 of the bucket as unbiased estimate, but that can easily add-up in
> the wrong dimension.

Suppose that we have a query box a=X and we know how the data is distributed inside the buckets. Let's consider only the buckets which are intersected by the query box a=X. As we know how the data is distributed in the buckets, we can exclude the buckets which contain no tuples intersecting the query box a=X. But since we store buckets with no information about the data distribution, we have to make reasonable assumptions instead. If we keep the buckets relatively small, then we can assume that each bucket has a uniform distribution. What I am trying to say is that the problem you pointed out comes from the fact that we store buckets with no information about the data distribution.
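To make the uniform-distribution assumption concrete: the estimate for a partially intersected bucket is the bucket's tuple fraction scaled by the fraction of the bucket's volume that the query box covers. A minimal Python sketch (all names are hypothetical, this is not the Postgres implementation):

```python
def overlap_fraction(b_lo, b_hi, q_lo, q_hi):
    """Fraction of the bucket's volume covered by the query box,
    assuming tuples are uniformly distributed inside the bucket."""
    frac = 1.0
    for bl, bh, ql, qh in zip(b_lo, b_hi, q_lo, q_hi):
        lo, hi = max(bl, ql), min(bh, qh)
        if hi <= lo:          # no overlap in this dimension
            return 0.0
        frac *= (hi - lo) / (bh - bl)
    return frac

def estimate_selectivity(buckets, q_lo, q_hi):
    # buckets: list of (low_corner, high_corner, tuple_fraction)
    return sum(f * overlap_fraction(lo, hi, q_lo, q_hi)
               for lo, hi, f in buckets)

# 2x2 equi-depth grid on the unit square, each bucket holding 1/4 of tuples
buckets = [((0, 0), (.5, .5), .25), ((.5, 0), (1, .5), .25),
           ((0, .5), (.5, 1), .25), ((.5, .5), (1, 1), .25)]
# query box covering the left half of the square
print(estimate_selectivity(buckets, (0, 0), (.5, 1)))  # → 0.5
```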
Even in the one-dimensional case we have to assume how the data is distributed inside a bucket. By the way, Postgres assumes a uniform distribution within buckets.

> However, this is not the only way to build an equi-depth histogram,
> there are ways to make it more symmetrical. Also, it's not clear
> equi-depth histograms are ideal with multiple columns. There are papers
> proposing various other types of histograms (using different criteria to
> build buckets optimizing a different metric). The most interesting one
> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> [4], [5] and [6]. I'm sure there are more. The drawback of course is
> that it's more expensive to build such histograms.

Tomas, thank you for sharing your ideas regarding the V-Optimal Histogram. I read through the papers you gave me and came to the following conclusion.

The problem can be formulated like this. We have N tuples in M-dimensional space. We need to partition the space into buckets iteratively until the SSE is less than E or we reach the limit of B buckets. In the M-dimensional case this appears to be an NP-hard problem. A greedy heuristic, MHIST-2, is proposed in [2]. First we sort the N tuples in ascending order. Then we iteratively select the bucket whose split leads to the largest SSE reduction and split it into two parts. We repeat the process until the SSE is less than E or we reach the limit of B buckets.

If we assume that B is significantly less than N, then the time complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M equals 3, B equals 1000 and N equals 300*B; then it will take slightly over 0.9*10^9 iterations to build a V-Optimal Histogram. You can see that we have to keep B as low as possible in order to build a V-Optimal Histogram in feasible time. And here is a paradox. On the one hand we use a V-Optimal Histogram in order to minimize the SSE, but on the other hand we have to keep B as low as possible, which eventually increases the SSE.
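The greedy idea can be sketched as follows. This is only an illustrative toy, not the actual MHIST-2 of [2]: MHIST-2 maintains incremental statistics to reach O(M*N*B), whereas this naive version recomputes the SSE for every candidate cut and is therefore much slower.

```python
def sse(pts):
    """Sum of squared deviations from the bucket mean, over all dimensions."""
    if not pts:
        return 0.0
    dims = len(pts[0])
    means = [sum(p[d] for p in pts) / len(pts) for d in range(dims)]
    return sum((p[d] - means[d]) ** 2 for p in pts for d in range(dims))

def best_split(pts):
    """Best (sse_reduction, dim, threshold) over every dimension and cut."""
    best = (0.0, None, None)
    base = sse(pts)
    for d in range(len(pts[0])):
        order = sorted(pts, key=lambda p: p[d])
        for i in range(1, len(order)):
            red = base - sse(order[:i]) - sse(order[i:])
            if red > best[0]:
                best = (red, d, order[i][d])
    return best

def greedy_histogram(pts, max_buckets):
    """Repeatedly split the bucket whose best cut reduces the SSE the most."""
    buckets = [pts]
    while len(buckets) < max_buckets:
        cands = [(best_split(b), j) for j, b in enumerate(buckets) if len(b) > 1]
        if not cands:
            break
        (red, d, thr), j = max(cands, key=lambda c: c[0][0])
        if d is None:          # no split reduces the SSE any further
            break
        b = buckets.pop(j)
        buckets.append([p for p in b if p[d] < thr])
        buckets.append([p for p in b if p[d] >= thr])
    return buckets

# two well-separated 1-D clusters: the first split lands between them
pts = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
print([len(b) for b in greedy_histogram(pts, 2)])  # → [3, 3]
```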
By contrast, the time complexity required to build an Equi-Depth Histogram doesn't depend on B and can be estimated as O(M*N*logN). The SSE can be arbitrarily reduced by increasing B, which in turn is limited only by the storage budget. Experimental results show a low error metric [3].

In an Equi-Depth Histogram a bucket is represented by two vectors. The first vector points to the bottom-left corner of the bucket and the other points to the top-right corner. Thus the space complexity of an Equi-Depth Histogram can be estimated as 2*integer_size*M*B. Assume that M equals 3, B equals 1000 and integer_size equals 4 bytes; then the Equi-Depth Histogram will occupy 24000 bytes.

If a bucket is only partially intersected by a query box, we assume that the data has a uniform distribution inside the bucket. This is a reasonable assumption if B is relatively large.

In order to efficiently find the buckets which intersect a query box we can store the Equi-Depth Histogram in an R-tree, as proposed in [3]. On average it takes O(logB) iterations to find the buckets which intersect a query box. As the storage requirements are dominated by the leaf nodes, we can assume that the R-tree takes slightly more than 2*integer_size*M*B.

> IIRC the patch tried to do something like V-optimal histograms, but
> using a greedy algorithm and not the exhaustive stuff described in the
> various papers.

We should only consider computationally tractable solutions. In the one-dimensional case a V-Optimal Histogram is probably a good solution, but in the multidimensional case I would only consider Equi-Width or Equi-Depth Histograms. As stated in multiple papers, an Equi-Depth Histogram proves to be more accurate than an Equi-Width Histogram. By the way, Postgres uses something like an Equi-Width Histogram.

> FWIW I did not intend to reject the idea of adding multi-dimensional
> histograms, but rather to provide some insight into the history of the
> past attempt, and also point some weaknesses of the algorithm described
> in the 1988 paper.
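One way to picture the equi-depth construction (a simplified sketch under my own assumptions, not the exact algorithm of [3]): recursively split the tuple set at the median, cycling through the dimensions, until the desired depth. The work is dominated by the sorting at each level, which is where the O(M*N*logN) estimate comes from.

```python
def split_equi_depth(pts, depth, n_total, dim=0):
    """Recursively split pts into up to 2**depth equal-count buckets,
    cycling through the dimensions. Returns (low, high, fraction) triples."""
    dims = len(pts[0])
    if depth == 0 or len(pts) <= 1:
        lo = tuple(min(p[d] for p in pts) for d in range(dims))
        hi = tuple(max(p[d] for p in pts) for d in range(dims))
        return [(lo, hi, len(pts) / n_total)]
    order = sorted(pts, key=lambda p: p[dim])   # sort on the current dimension
    mid = len(order) // 2                       # median cut: equal tuple counts
    nxt = (dim + 1) % dims                      # next split uses the next dimension
    return (split_equi_depth(order[:mid], depth - 1, n_total, nxt) +
            split_equi_depth(order[mid:], depth - 1, n_total, nxt))

def build_equi_depth(pts, depth):
    return split_equi_depth(pts, depth, len(pts))

# 8 points, depth 3 → 8 buckets, each holding 1/8 of the tuples
pts = [(float(i), float(i)) for i in range(8)]
print([f for _, _, f in build_equi_depth(pts, 3)])
```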
> If you choose to work on this, I can do a review etc.

Thank you very much, Tomas. I am new to the community and I definitely didn't expect such a warm welcome.

As I indicated above, an Equi-Depth Histogram proves to be more accurate than an Equi-Width Histogram, and both have the same time and space requirements. Postgres uses some sort of Equi-Width Histogram. Suppose that:

* I create a patch which replaces the Equi-Width Histogram with an Equi-Depth Histogram, but only in the 1-dimensional case.
* I show experimental results which demonstrate an improvement in selectivity estimation.

Would such a patch then be accepted by the community? If the above patch is accepted, I will proceed further with the M-dimensional Equi-Depth Histogram...

[1] https://www.vldb.org/conf/1998/p275.pdf
[2] https://www.vldb.org/conf/1997/P486.PDF
[3] https://dl.acm.org/doi/pdf/10.1145/50202.50205

Regards,
Alexander Cheshev

On Thu, 28 Dec 2023 at 15:25, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/27/23 22:19, Tomas Vondra wrote:
> > Hello Alexander,
> >
> > We actually had histograms in the early patches adding multivariate
> > statistics [1]. We however ended up removing histograms and only kept
> > the simpler types, for a couple reasons.
> >
> > It might be worth going through the discussion - I'm sure one of the
> > reasons was we needed to limit the scope of the patch, and histograms
> > were much more complex and possibly less useful compared to the other
> > statistics types.
> >
> > ...
>
> FWIW I did not intend to reject the idea of adding multi-dimensional
> histograms, but rather to provide some insight into the history of the
> past attempt, and also point some weaknesses of the algorithm described
> in the 1988 paper. If you choose to work on this, I can do a review etc.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company