Thread: Multidimensional Histograms
Hello Hackers,

To improve selectivity estimates for queries, I suggest adding support for multidimensional histograms, as described in paper [1]. To query multidimensional histograms efficiently we can use H-trees, as described in paper [2].

Postgres has limited support for multivariate statistics:
* MCV lists are only useful for columns with a small number of distinct values;
* functional dependencies only reflect dependencies among columns (not column values).

[1] http://www.cs.cmu.edu/~rcarlson/docs/RyanCarlson_databases.pdf
[2] https://dl.acm.org/doi/pdf/10.1145/50202.50205

--
Regards,
Alexander Cheshev
Hello Alexander,

We actually had histograms in the early patches adding multivariate statistics [1]. We however ended up removing histograms and only kept the simpler types, for a couple of reasons.

It might be worth going through the discussion - I'm sure one of the reasons was that we needed to limit the scope of the patch, and histograms were much more complex and possibly less useful compared to the other statistics types.

Another reason was that the algorithm described in the two papers you reference (the 1988 paper by DeWitt and the evaluation by Carlson and Sutherland from ~2010) is simple but has pretty obvious weaknesses. It processes the columns one by one - first build buckets on column "a", then split each bucket into buckets on "b". So it's not symmetrical, and it results in lower accuracy compared to an "ideal" histogram with the same number of buckets (especially for the dimensions split early).

This does indeed produce an equi-depth histogram, which seems nice, but the mesh is regular in such a way that any value of the first dimension intersects all buckets on the second dimension. So for example with a histogram of 100x100 buckets on [a,b], any value "a=X" intersects with 100 buckets on "b", each representing 1/10000 of tuples. But we don't know how the tuples are distributed within the bucket - so we end up using 0.5 of the bucket as an unbiased estimate, and that can easily add up in the wrong direction.

However, this is not the only way to build an equi-depth histogram; there are ways to make it more symmetrical. Also, it's not clear equi-depth histograms are ideal with multiple columns. There are papers proposing various other types of histograms (using different criteria to build buckets, optimizing a different metric). The most interesting one seems to be V-Optimal histograms - see for example papers [1], [2], [3], [4], [5] and [6]. I'm sure there are more. The drawback of course is that it's more expensive to build such histograms.
IIRC the patch tried to do something like V-optimal histograms, but using a greedy algorithm and not the exhaustive stuff described in the various papers.

[1] https://www.vldb.org/conf/1998/p275.pdf
[2] https://cs-people.bu.edu/mathan/reading-groups/papers-classics/histograms.pdf
[3] https://dl.acm.org/doi/pdf/10.1145/304182.304200
[4] http://www.cs.columbia.edu/~gravano/Papers/2001/sigmod01b.pdf
[5] https://cs.gmu.edu/~carlotta/publications/vldb090.pdf
[6] https://core.ac.uk/download/pdf/82158702.pdf

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/27/23 22:19, Tomas Vondra wrote:
> Hello Alexander,
>
> We actually had histograms in the early patches adding multivariate
> statistics [1]. We however ended up removing histograms and only kept
> the simpler types, for a couple reasons.
>
> It might be worth going through the discussion - I'm sure one of the
> reasons was we needed to limit the scope of the patch, and histograms
> were much more complex and possibly less useful compared to the other
> statistics types.
>
> ...

FWIW I did not intend to reject the idea of adding multi-dimensional histograms, but rather to provide some insight into the history of the past attempt, and also to point out some weaknesses of the algorithm described in the 1988 paper. If you choose to work on this, I can do a review etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Tomas,

> Another reason was that the algorithm described in the two papers you
> reference (1988 paper by DeWitt and the evaluation by Carlson and
> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> processes the columns one by one - first build buckets on column "a",
> then split each bucket into buckets on "b". So it's not symmetrical,
> and it results in lower accuracy compared to an "ideal" histogram with
> the same number of buckets (especially for the dimensions split early).

As stated in [1], Sum Squared Error (SSE) is one of the most natural error metrics. An Equi-Depth Histogram is not ideal because it doesn't minimize SSE. On the other hand, a V-Optimal Histogram does minimize SSE and from this point of view can be considered an ideal solution.

> This does indeed produce an equi-depth histogram, which seems nice, but
> the mesh is regular in such a way that any value of the first dimension
> intersects all buckets on the second dimension. So for example with a
> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> 100 buckets on "b", each representing 1/10000 of tuples. But we don't
> know how the tuples are distributed within the bucket - so we end up
> using 0.5 of the bucket as an unbiased estimate, and that can easily
> add up in the wrong direction.

Suppose that we have a query box a=X and we know how data is distributed in buckets. Let's consider only the buckets which are intersected by the query box a=X. As we know how data is distributed in buckets, we can exclude the buckets which contain no tuples intersecting the query box a=X.

As we store buckets with no information about data distribution, we have to make reasonable assumptions. If we keep buckets relatively small then we can assume that they have a uniform distribution.

What I am trying to say is that the problem which you pointed out comes from the fact that we store buckets with no information about data distribution.
Even in the one-dimensional case we have to assume how data is distributed in buckets. By the way, Postgres assumes that data has a uniform distribution within buckets.

> However, this is not the only way to build an equi-depth histogram,
> there are ways to make it more symmetrical. Also, it's not clear
> equi-depth histograms are ideal with multiple columns. There are papers
> proposing various other types of histograms (using different criteria to
> build buckets optimizing a different metric). The most interesting one
> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> [4], [5] and [6]. I'm sure there are more. The drawback of course is
> that it's more expensive to build such histograms.

Tomas, thank you for sharing your ideas regarding the V-Optimal Histogram. I read through the papers which you gave me and came to the following conclusion.

The problem can be formulated like this. We have N tuples in M-dimensional space. We need to partition the space into buckets iteratively until the SSE is less than E or we reach the limit of B buckets.

In the M-dimensional case this looks like an NP-hard problem. A greedy heuristic, MHIST-2, is proposed in [2]. First we sort the N tuples in ascending order. Then we iteratively select the bucket which leads to the largest SSE reduction and split it into two parts. We repeat the process until the SSE is less than E or we reach the limit of B buckets.

If we assume that B is significantly less than N, then the time complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M equals 3, B equals 1000 and N equals 300*B; then it will take slightly over 0.9*10^9 iterations to build a V-Optimal Histogram.

You can see that we have to keep B as low as possible in order to build a V-Optimal Histogram in feasible time. And here is a paradox. On one side we use a V-Optimal Histogram in order to minimize SSE, but on the other side we have to keep B as low as possible, which eventually leads to an increase in SSE.
On the other hand, the time complexity required to build an Equi-Depth Histogram doesn't depend on B and can be estimated as O(M*N*logN). The SSE can be arbitrarily reduced by increasing B, which in turn is only limited by the storage limit. Experimental results show a low error metric [3].

In an Equi-Depth Histogram a bucket is represented by two vectors. The first vector points to the bottom-left corner of the bucket and the other one points to the top-right corner. Thus the space complexity of an Equi-Depth Histogram can be estimated as 2*integer_size*M*B. Assume that M equals 3, B equals 1000 and integer_size equals 4 bytes; then the Equi-Depth Histogram will occupy 24000 bytes.

If a bucket is partially intersected by a query box then we assume that data has a uniform distribution inside the bucket. It is a reasonable assumption if B is relatively large.

In order to efficiently find the buckets which intersect a query box we can store the Equi-Depth Histogram in an R-tree, as proposed in [3]. On average it takes O(logB) iterations to find the buckets which intersect a query box. As storage requirements are dominated by leaf nodes, we can assume that it takes slightly more than 2*integer_size*M*B.

> IIRC the patch tried to do something like V-optimal histograms, but
> using a greedy algorithm and not the exhaustive stuff described in the
> various papers.

We should only consider computationally tractable solutions. In the one-dimensional case a V-Optimal Histogram is probably a good solution, but in the multi-dimensional case I would only consider Equi-Width or Equi-Depth Histograms. As stated in multiple papers, the Equi-Depth Histogram proves to be more accurate than the Equi-Width Histogram. By the way, Postgres uses something like an Equi-Width Histogram.

> FWIW I did not intend to reject the idea of adding multi-dimensional
> histograms, but rather to provide some insight into the history of the
> past attempt, and also point out some weaknesses of the algorithm
> described in the 1988 paper. If you choose to work on this, I can do a
> review etc.

Thank you very much, Tomas. I am new to the community and I definitely didn't expect to have such a warm welcome.

As I indicated above, the Equi-Depth Histogram proves to be more accurate than the Equi-Width Histogram, and both have the same time and space requirements. Postgres uses some sort of Equi-Width Histogram. Suppose that:
* I create a patch which replaces the Equi-Width Histogram with an Equi-Depth Histogram, but only in the 1-dimensional case.
* I show experimental results which demonstrate an improvement in selectivity estimation.
Then will the patch be accepted by the community?

If the above patch is accepted by the community then I will proceed further with the M-dimensional Equi-Depth Histogram...

[1] https://www.vldb.org/conf/1998/p275.pdf
[2] https://www.vldb.org/conf/1997/P486.PDF
[3] https://dl.acm.org/doi/pdf/10.1145/50202.50205

Regards,
Alexander Cheshev

On Thu, 28 Dec 2023 at 15:25, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/27/23 22:19, Tomas Vondra wrote:
> > Hello Alexander,
> >
> > We actually had histograms in the early patches adding multivariate
> > statistics [1]. We however ended up removing histograms and only kept
> > the simpler types, for a couple reasons.
> >
> > It might be worth going through the discussion - I'm sure one of the
> > reasons was we needed to limit the scope of the patch, and histograms
> > were much more complex and possibly less useful compared to the other
> > statistics types.
> >
> > ...
>
> FWIW I did not intend to reject the idea of adding multi-dimensional
> histograms, but rather to provide some insight into the history of the
> past attempt, and also point some weaknesses of the algorithm described
> in the 1988 paper. If you choose to work on this, I can do a review etc.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Hi Tomas,

I am sorry I didn't look into the code carefully. Indeed, Postgres uses an Equi-Depth Histogram:

    delta = (nvals - 1) / (num_hist - 1);

Regards,
Alexander Cheshev

On Sat, 6 Jan 2024 at 01:00, Alexander Cheshev <alex.cheshev@gmail.com> wrote:
>
> Hi Tomas,
>
> ...
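For context, the bound selection around the quoted delta computation (in compute_scalar_stats() in analyze.c) boils down to roughly the following Python sketch; the real code operates on the collected sample rows and carries the fractional part in integer arithmetic:

```python
def histogram_bounds(sorted_vals, num_hist):
    """Equi-depth bound selection in the spirit of the quoted analyze.c
    line: walk the sorted sample and emit a bound every delta values.
    (Sketch only - details differ from the real implementation.)"""
    nvals = len(sorted_vals)
    delta = (nvals - 1) / (num_hist - 1)
    return [sorted_vals[int(round(i * delta))] for i in range(num_hist)]
```

So each of the num_hist - 1 bins between consecutive bounds holds roughly the same number of sampled values, which is what makes the 1-D histogram equi-depth.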
On 1/6/24 01:00, Alexander Cheshev wrote:
> Hi Tomas,
>
>> Another reason was that the algorithm described in the two papers you
>> reference (1988 paper by DeWitt and the evaluation by Carlson and
>> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
>> processes the columns one by one - first build buckets on column "a",
>> then split each bucket into buckets on "b". So it's not symmetrical,
>> and it results in lower accuracy compared to an "ideal" histogram with
>> the same number of buckets (especially for the dimensions split early).
>
> As stated in [1], Sum Squared Error (SSE) is one of the most natural
> error metrics. An Equi-Depth Histogram is not ideal because it doesn't
> minimize SSE. On the other hand, a V-Optimal Histogram does minimize
> SSE and from this point of view can be considered an ideal solution.
>
>> This does indeed produce an equi-depth histogram, which seems nice, but
>> the mesh is regular in such a way that any value of the first dimension
>> intersects all buckets on the second dimension. So for example with a
>> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
>> 100 buckets on "b", each representing 1/10000 of tuples. But we don't
>> know how the tuples are distributed within the bucket - so we end up
>> using 0.5 of the bucket as an unbiased estimate, and that can easily
>> add up in the wrong direction.
>
> Suppose that we have a query box a=X and we know how data is
> distributed in buckets. Let's consider only the buckets which are
> intersected by the query box a=X. As we know how data is distributed
> in buckets, we can exclude the buckets which contain no tuples
> intersecting the query box a=X.
>
> As we store buckets with no information about data distribution, we
> have to make reasonable assumptions. If we keep buckets relatively
> small then we can assume that they have a uniform distribution.
>
> What I am trying to say is that the problem which you pointed out
> comes from the fact that we store buckets with no information about
> data distribution. Even in the one-dimensional case we have to assume
> how data is distributed in buckets. By the way, Postgres assumes that
> data has a uniform distribution within buckets.

It's not just what Postgres assumes; the assumption of bucket uniformity is somewhat inherent to the whole concept of a histogram. Yes, maybe we could keep some "distribution" info about each bucket, but then maybe we could simply build a histogram with more bins occupying the same space?

The thing I was proposing is that it should be possible to build histograms with bins adapted to the density in the given region, with smaller buckets in areas of high density, so that queries intersect fewer buckets in the low-density parts of the histogram.

>> However, this is not the only way to build an equi-depth histogram,
>> there are ways to make it more symmetrical. Also, it's not clear
>> equi-depth histograms are ideal with multiple columns. There are papers
>> proposing various other types of histograms (using different criteria to
>> build buckets optimizing a different metric). The most interesting one
>> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
>> [4], [5] and [6]. I'm sure there are more. The drawback of course is
>> that it's more expensive to build such histograms.
>
> Tomas, thank you for sharing your ideas regarding the V-Optimal
> Histogram. I read through the papers which you gave me and came to the
> following conclusion.
>
> The problem can be formulated like this. We have N tuples in
> M-dimensional space. We need to partition the space into buckets
> iteratively until the SSE is less than E or we reach the limit of B
> buckets.

Yes. Although V-optimal histograms minimize the variance of frequencies; not sure if that's what you mean by SSE.
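To be explicit about the metric: the objective I have in mind is, per bucket, the squared deviation of each distinct value's frequency from the bucket's average frequency. A small Python sketch:

```python
from collections import Counter

def voptimal_cost(buckets):
    """Variance-of-frequencies objective: for each bucket, sum the squared
    deviation of every distinct value's frequency from the bucket's
    average frequency, then total over buckets. `buckets` is a list of
    lists of (1-D) values."""
    cost = 0.0
    for b in buckets:
        freqs = list(Counter(b).values())
        avg = sum(freqs) / len(freqs)
        cost += sum((f - avg) ** 2 for f in freqs)
    return cost
```

A V-optimal histogram is the bucketing that minimizes this cost for a given number of buckets - e.g. grouping skewed values [1, 1, 1, 2] with uniform ones costs more than separating them.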
> In the M-dimensional case this looks like an NP-hard problem. A greedy
> heuristic, MHIST-2, is proposed in [2]. First we sort the N tuples in
> ascending order. Then we iteratively select the bucket which leads to
> the largest SSE reduction and split it into two parts. We repeat the
> process until the SSE is less than E or we reach the limit of B
> buckets.

I don't recall all the details of the MHIST-2 algorithm, but this sounds about right. Yes, building the optimal histogram would be NP-hard, so we'd have to use some approximate / greedy algorithm.

> If we assume that B is significantly less than N, then the time
> complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
> equals 3, B equals 1000 and N equals 300*B; then it will take slightly
> over 0.9*10^9 iterations to build a V-Optimal Histogram.
>
> You can see that we have to keep B as low as possible in order to
> build a V-Optimal Histogram in feasible time. And here is a paradox.
> On one side we use a V-Optimal Histogram in order to minimize SSE, but
> on the other side we have to keep B as low as possible, which
> eventually leads to an increase in SSE.

I don't recall the details of the MHIST-2 scheme, but it's true that calculating a "perfect" V-optimal histogram would be quadratic, O(N^2*B). But that's exactly why greedy/approximate algorithms exist. Yes, they're not going to produce the optimal V-optimal histogram, but so what?

> On the other hand, the time complexity required to build an Equi-Depth
> Histogram doesn't depend on B and can be estimated as O(M*N*logN). The
> SSE can be arbitrarily reduced by increasing B, which in turn is only
> limited by the storage limit. Experimental results show a low error
> metric [3].

And how does this compare to the approximate/greedy algorithms, both in terms of construction time and accuracy? The [3] paper does not compare those, ofc, and I'm somewhat skeptical about the results in that paper. Not that they'd be wrong, but AFAICS the assumptions are quite simplistic and well-suited for that particular type of histogram. It's far from clear how it would perform for less "smooth" data, non-numeric data, etc.

> In an Equi-Depth Histogram a bucket is represented by two vectors. The
> first vector points to the bottom-left corner of the bucket and the
> other one points to the top-right corner. Thus the space complexity of
> an Equi-Depth Histogram can be estimated as 2*integer_size*M*B. Assume
> that M equals 3, B equals 1000 and integer_size equals 4 bytes; then
> the Equi-Depth Histogram will occupy 24000 bytes.
>
> If a bucket is partially intersected by a query box then we assume
> that data has a uniform distribution inside the bucket. It is a
> reasonable assumption if B is relatively large.
>
> In order to efficiently find the buckets which intersect a query box
> we can store the Equi-Depth Histogram in an R-tree, as proposed in
> [3]. On average it takes O(logB) iterations to find the buckets which
> intersect a query box. As storage requirements are dominated by leaf
> nodes, we can assume that it takes slightly more than
> 2*integer_size*M*B.

But all of this can apply to histograms in general, no? It's not somehow special to equi-depth histograms; a V-optimal histogram could be stored as an R-tree etc.

>> IIRC the patch tried to do something like V-optimal histograms, but
>> using a greedy algorithm and not the exhaustive stuff described in the
>> various papers.
>
> We should only consider computationally tractable solutions. In the
> one-dimensional case a V-Optimal Histogram is probably a good solution,
> but in the multi-dimensional case I would only consider Equi-Width or
> Equi-Depth Histograms. As stated in multiple papers, the Equi-Depth
> Histogram proves to be more accurate than the Equi-Width Histogram. By
> the way, Postgres uses something like an Equi-Width Histogram.

Sure, but AFAIK the greedy/approximate algorithms are not intractable. And at some point it becomes a tradeoff between the accuracy of estimates and the resources needed to build the histogram. Papers from ~1988 are great, but maybe not sufficient to make such decisions now.

>> FWIW I did not intend to reject the idea of adding multi-dimensional
>> histograms, but rather to provide some insight into the history of the
>> past attempt, and also point out some weaknesses of the algorithm
>> described in the 1988 paper. If you choose to work on this, I can do a
>> review etc.
>
> Thank you very much, Tomas. I am new to the community and I definitely
> didn't expect to have such a warm welcome.
>
> As I indicated above, the Equi-Depth Histogram proves to be more
> accurate than the Equi-Width Histogram, and both have the same time and
> space requirements. Postgres uses some sort of Equi-Width Histogram.
> Suppose that:
> * I create a patch which replaces the Equi-Width Histogram with an
> Equi-Depth Histogram, but only in the 1-dimensional case.
> * I show experimental results which demonstrate an improvement in
> selectivity estimation.
> Then will the patch be accepted by the community?
>
> If the above patch is accepted by the community then I will proceed
> further with the M-dimensional Equi-Depth Histogram...

I find it very unlikely we want (or even need) to significantly rework the 1-D histograms we already have. AFAICS the current histograms work pretty well, and if better accuracy is needed, it's easy/efficient to just increase the statistics target. I'm sure there are various estimation issues, but I'm not aware of any that'd be solved simply by using a different 1-D histogram.

I'm not going to reject that outright - but I think the bar you'd need to clear to justify such a change is damn high. Pretty much everyone is using the current histograms, which makes it a very sensitive area. You'll need to show that it helps in practice, without causing harm ...

It's an interesting area for experiments, no doubt about it. And if you choose to explore it, that's fine.
But it's better to be aware it may not end with a commit.

For the multi-dimensional case, I propose we first try to experiment with the various algorithms and figure out what works etc. Maybe implementing them in Python or something would be easier than C.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
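A natural first building block for the Python experiments proposed above is the estimation step, which is shared by all the histogram types discussed in this thread (a sketch under the uniform-fill assumption discussed upthread; the bucket layout is hypothetical):

```python
def overlap_fraction(lo, hi, qlo, qhi):
    """Fraction of a bucket's volume covered by the query box, under the
    assumption that tuples are spread uniformly inside the bucket."""
    frac = 1.0
    for d in range(len(lo)):
        width = hi[d] - lo[d]
        if width == 0:
            # degenerate (zero-width) dimension: fully in or fully out
            frac *= 1.0 if qlo[d] <= lo[d] <= qhi[d] else 0.0
            continue
        inter = min(hi[d], qhi[d]) - max(lo[d], qlo[d])
        if inter <= 0:
            return 0.0  # bucket and query box are disjoint
        frac *= min(inter / width, 1.0)
    return frac

def estimate(buckets, qlo, qhi):
    """Cardinality estimate: each bucket's tuple count, prorated by its
    overlap with the query box. buckets = [(lo_corner, hi_corner, ntuples)]."""
    return sum(n * overlap_fraction(lo, hi, qlo, qhi) for lo, hi, n in buckets)
```

Plugging in bucket lists produced by the different construction algorithms, and comparing estimates against exact counts on generated data, would give a quick accuracy/cost comparison before writing any C.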
On 7/1/2024 06:54, Tomas Vondra wrote:
> It's an interesting area for experiments, no doubt about it. And if you
> choose to explore it, that's fine. But it's better to be aware it may
> not end with a commit.
> For the multi-dimensional case, I propose we first try to experiment
> with the various algorithms, and figure out what works etc. Maybe
> implementing them in python or something would be easier than C.

Curiously, trying to utilize extended statistics for some problematic cases, I am experimenting with auto-generating such statistics from the definitions of indexes [1]. Doing that, I wanted to add some hand-made statistics like a multidimensional histogram, or just a histogram which could help to perform estimation over a set of columns/expressions.

I realized that the current hooks get_relation_stats_hook and get_index_stats_hook are insufficient if I want to perform an estimation over a set of ANDed quals on different columns.

In your opinion, is it possible to add a hook into the extended statistics to allow an extension to propose an alternative estimation?

[1] https://github.com/danolivo/pg_index_stats

--
regards, Andrei Lepikhov
Postgres Professional
On 1/7/24 11:22, Andrei Lepikhov wrote:
> On 7/1/2024 06:54, Tomas Vondra wrote:
>> It's an interesting area for experiments, no doubt about it. And if you
>> choose to explore it, that's fine. But it's better to be aware it may
>> not end with a commit.
>> For the multi-dimensional case, I propose we first try to experiment
>> with the various algorithms, and figure out what works etc. Maybe
>> implementing them in python or something would be easier than C.
>
> Curiously, trying to utilize extended statistics for some problematic
> cases, I am experimenting with auto-generating such statistics from the
> definitions of indexes [1]. Doing that, I wanted to add some hand-made
> statistics like a multidimensional histogram, or just a histogram which
> could help to perform estimation over a set of columns/expressions.
> I realized that the current hooks get_relation_stats_hook and
> get_index_stats_hook are insufficient if I want to perform an estimation
> over a set of ANDed quals on different columns.
> In your opinion, is it possible to add a hook into the extended
> statistics to allow an extension to propose an alternative estimation?
>
> [1] https://github.com/danolivo/pg_index_stats

No idea, I haven't thought about that very much. Presumably the existing hooks are insufficient because they're per-attnum? I guess it would make sense to have a hook for all the attnums of the relation, but I'm not sure it'd be enough to introduce a new extended statistics kind ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/1/2024 17:51, Tomas Vondra wrote: > On 1/7/24 11:22, Andrei Lepikhov wrote: >> On 7/1/2024 06:54, Tomas Vondra wrote: >>> It's an interesting are for experiments, no doubt about it. And if you >>> choose to explore it, that's fine. But it's better to be aware it may >>> not end with a commit. >>> For the multi-dimensional case, I propose we first try to experiment >>> with the various algorithms, and figure out what works etc. Maybe >>> implementing them in python or something would be easier than C. >> >> Curiously, trying to utilize extended statistics for some problematic >> cases, I am experimenting with auto-generating such statistics by >> definition of indexes [1]. Doing that, I wanted to add some hand-made >> statistics like a multidimensional histogram or just a histogram which >> could help to perform estimation over a set of columns/expressions. >> I realized that current hooks get_relation_stats_hook and >> get_index_stats_hook are insufficient if I want to perform an estimation >> over a set of ANDed quals on different columns. >> In your opinion, is it possible to add a hook into the extended >> statistics to allow for an extension to propose alternative estimation? >> >> [1] https://github.com/danolivo/pg_index_stats >> > > No idea, I haven't thought about that very much. Presumably the existing > hooks are insufficient because they're per-attnum? I guess it would make > sense to have a hook for all the attnums of the relation, but I'm not > sure it'd be enough to introduce a new extended statistics kind ... I got stuck on the same problem Alexander mentioned: we usually have large tables with many uniformly distributed values. In this case, MCV doesn't help a lot. Usually, I face problems scanning a table with a filter containing 3-6 ANDed quals. Here, Postgres multiplies selectivities and ends up with a less than 1 tuple selectivity. But such scans, in reality, mostly have some physical sense and return a bunch of tuples. 
It looks like the set of columns represents some value of a composite type. Sometimes extended statistics on dependencies help well, but they are expensive for multiple columns. And sometimes I see that even a trivial histogram on a ROW(x1,x2,...) could predict a much more adequate value (a kind of conservative upper estimate) for a clause like "x1=N1 AND x2=N2 AND ..." if somewhere in an extension we'd transform it to ROW(x1,x2,...) = ROW(N1, N2,...). For such cases we don't have an in-core solution, and introducing a hook on clause list estimation (paired with maybe a hook on statistics generation) could help invent an extension that would deal with that problem. Also, it would open a way for experiments with different types of extended statistics ... -- regards, Andrei Lepikhov Postgres Professional
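The underestimate Andrei describes is easy to reproduce. A minimal sketch (the table and numbers are synthetic, purely for illustration): multiplying per-column selectivities assumes the columns are independent, which fails badly when they encode one composite value:

```python
# Synthetic table where x1 and x2 always carry the same value,
# mimicking a set of columns that represents one composite value.
rows = [(i % 100, i % 100) for i in range(100_000)]
n = len(rows)

c1 = sum(1 for a, _ in rows if a == 7)  # rows matching x1 = 7
c2 = sum(1 for _, b in rows if b == 7)  # rows matching x2 = 7

# Independence assumption: selectivities multiply, i.e. (c1/n) * (c2/n) * n.
estimated = c1 * c2 / n
actual = sum(1 for a, b in rows if a == 7 and b == 7)

print(estimated, actual)  # 10.0 1000 -- a 100x underestimate
```

With six such correlated quals instead of two, the same multiplication drives the estimate below one row, which matches the "less than 1 tuple selectivity" symptom described above.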
On 1/7/24 18:26, Andrei Lepikhov wrote: > On 7/1/2024 17:51, Tomas Vondra wrote: >> On 1/7/24 11:22, Andrei Lepikhov wrote: >>> On 7/1/2024 06:54, Tomas Vondra wrote: >>>> It's an interesting are for experiments, no doubt about it. And if you >>>> choose to explore it, that's fine. But it's better to be aware it may >>>> not end with a commit. >>>> For the multi-dimensional case, I propose we first try to experiment >>>> with the various algorithms, and figure out what works etc. Maybe >>>> implementing them in python or something would be easier than C. >>> >>> Curiously, trying to utilize extended statistics for some problematic >>> cases, I am experimenting with auto-generating such statistics by >>> definition of indexes [1]. Doing that, I wanted to add some hand-made >>> statistics like a multidimensional histogram or just a histogram which >>> could help to perform estimation over a set of columns/expressions. >>> I realized that current hooks get_relation_stats_hook and >>> get_index_stats_hook are insufficient if I want to perform an estimation >>> over a set of ANDed quals on different columns. >>> In your opinion, is it possible to add a hook into the extended >>> statistics to allow for an extension to propose alternative estimation? >>> >>> [1] https://github.com/danolivo/pg_index_stats >>> >> >> No idea, I haven't thought about that very much. Presumably the existing >> hooks are insufficient because they're per-attnum? I guess it would make >> sense to have a hook for all the attnums of the relation, but I'm not >> sure it'd be enough to introduce a new extended statistics kind ... > > I got stuck on the same problem Alexander mentioned: we usually have > large tables with many uniformly distributed values. In this case, MCV > doesn't help a lot. > > Usually, I face problems scanning a table with a filter containing 3-6 > ANDed quals. Here, Postgres multiplies selectivities and ends up with a > less than 1 tuple selectivity. 
But such scans, in reality, mostly have > some physical sense and return a bunch of tuples. It looks like the set > of columns represents some value of a composite type. Understood. That's a fairly common scenario, and it can easily end up with a rather terrible plan due to the underestimate. > Sometimes extended statistics on dependencies help well, but they are > expensive for multiple columns. Expensive in what sense? Also, how would a multidimensional histogram be any cheaper? > And sometimes I see that even a trivial histogram > on a ROW(x1,x2,...) could predict a much more adequate value (a kind of > conservative upper estimate) for a clause like "x1=N1 AND x2=N2 AND > ..." if somewhere in an extension we'd transform it to ROW(x1,x2,...) = > ROW(N1, N2,...). Are you guessing the histogram would help, or do you know it'd help? I have no problem believing that for range queries, but I think it's far less clear for simple equalities. I'm not sure a histogram would work for that ... Maybe it'd be possible to track more stuff for each bucket, not just the frequency, but also ndistinct for combinations of columns, and then use that to do better equality estimates. Or maybe we could see how the "expected" and "actual" bucket frequency compare, and use that to correct the equality estimate? Not sure. But perhaps you have some data to demonstrate it'd actually help? > For such cases we don't have an in-core solution, and introducing a hook > on clause list estimation (paired with maybe a hook on statistics > generation) could help invent an extension that would deal with that > problem. Also, it would open a way for experiments with different types > of extended statistics ... > I really don't know how to do that, or what it would take. AFAICS the statistics system is not very extensible from external code. Even if we could add new types of attribute/extended stats, I'm not sure the code calculating the estimates could use that.
That doesn't mean it's impossible or not worth exploring, but I don't have any thoughts on this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi Tomas, > The thing I was proposing is that it should be possible to build > histograms with bins adapted to density in the given region. With > smaller buckets in areas with high density. So that queries intersect > with fewer buckets in low-density parts of the histogram. This is how an Equi-Depth Histogram works. Postgres has smaller buckets in areas with high density: values[(i * (nvals - 1)) / (num_hist - 1)] > I don't recall the details of the MHIST-2 scheme, but it's true > calculating "perfect" V-optimal histogram would be quadratic O(N^2*B). In M-dimensional space building a "perfect" V-Optimal Histogram is an NP-hard problem. In other words, it cannot be built in polynomial time (unless P = NP). How did you come up with the estimate?! > But that's exactly why greedy/approximate algorithms exist. Yes, it's > not going to produce the optimal V-optimal histogram, but so what? A greedy/approximate algorithm has time complexity O(M*N*B), where M equals the number of dimensions. MHIST-2 is a greedy/approximate algorithm. > And how does this compare to the approximate/greedy algorithms, both in > terms of construction time and accuracy? The time complexity of an Equi-Depth Histogram has no dependence on B. > But all of this can apply to histograms in general, no? It's not somehow > special to equi-depth histograms, a v-optimal histogram could be stored > as an r-tree etc. I agree. Regards, Alexander Cheshev On Sun, 7 Jan 2024 at 00:54, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > > On 1/6/24 01:00, Alexander Cheshev wrote: > > Hi Tomas, > > > >> Another reason was that the algorithm described in the two papers you > >> reference (1988 paper by DeWitt and the evaluation by Carlson and > >> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It > >> processes the columns one by one - first build bucket on column "a", > >> then splits each bucket into buckets on "b".
So it's not symmetrical, > >> and it results in lower accuracy compared to an "ideal" histogram with > >> the same number of buckets (especially for the dimensions split early). > > > > As stated in [1] Sum Square Error (SSE) is one of the most natural > > error metrics. Equi-Depth Histogram is not ideal because it doesn't > > minimize SSE. On the other hand V-Optimal Histogram indeed minimizes > > SSE and from this point of view can be considered as an ideal > > solution. > > > >> This does indeed produce an equi-depth histogram, which seems nice, but > >> the mesh is regular in such a way that any value of the first dimension > >> intersects all buckets on the second dimension. So for example with a > >> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with > >> 100 buckets on "b", each representing 1/10000 of tuples. But we don't > >> know how the tuples are distributed in the tuple - so we end up using > >> 0.5 of the bucket as unbiased estimate, but that can easily add-up in > >> the wrong dimension. > > > > Suppose that we have a query box a=X and we know how data is > > distributed in buckets. Let's consider only the buckets which are > > intersected by the query box a=X. As we know how data is distributed > > in buckets we can exclude the buckets which have no tuples which > > intersect the query box a=X. > > > > As we store buckets with no information about data distribution we > > have to make reasonable assumptions. If we keep buckets relatively > > small then we can assume that buckets have uniform distribution. > > > > What I am trying to say is that the problem which you pointed out > > comes from the fact that we store buckets with no information about > > data distribution. Even in the one-dimensional case we have to assume how > > data is distributed in buckets. By the way Postgres assumes that data > > has uniform distribution in buckets.
> > > > It's not just what Postgres assumes, the assumption of bucket uniformity is > somewhat inherent to the whole concept of a histogram. Yes, maybe we > could keep some "distribution" info about each bucket, but then maybe we > could simply build histogram with more bins occupying the same space? > > The thing I was proposing is that it should be possible to build > histograms with bins adapted to density in the given region. With > smaller buckets in areas with high density. So that queries intersect > with fewer buckets in low-density parts of the histogram. > > >> However, this is not the only way to build an equi-depth histogram, > >> there are ways to make it more symmetrical. Also, it's not clear > >> equi-depth histograms are ideal with multiple columns. There are papers > >> proposing various other types of histograms (using different criteria to > >> build buckets optimizing a different metric). The most interesting one > >> seems to be V-Optimal histograms - see for example papers [1], [2], [3], > >> [4], [5] and [6]. I'm sure there are more. The drawback of course is > >> that it's more expensive to build such histograms. > > > > Tomas, thank you for sharing with me your ideas regarding V-Optimal > > Histogram. I read through the papers which you gave me and came up > > with the following conclusion. > > > > The problem can be formulated like this. We have N tuples in > > M-dimensional space. We need to partition space into buckets > > iteratively until SSE is less than E or we reach the limit of buckets > > B. > > > > Yes. Although v-optimal histograms minimize variance of frequencies. Not > sure if that's what you mean by SSE. > > > In the case of M-dimensional space it seems to me like an NP-hard > > problem. A greedy heuristic MHIST-2 is proposed in [2]. As a preliminary step we > > sort N tuples in ascending order. Then we iteratively select a bucket > > which leads to the largest SSE reduction and split it into two parts.
> > We repeat the process until SSE is less than E or we reach the limit > > of buckets B. > > > > I don't recall all the details of the MHIST-2 algorithm, but this sounds > about right. Yes, building the optimal histogram would be NP-hard, so > we'd have to use some approximate / greedy algorithm. > > > If we assume that B is significantly less than N then the time > > complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M > > equals 3, B equals 1000 and N equals 300*B then it will take slightly > > over 0.9*10^9 iterations to build a V-Optimal Histogram. > > > > You can see that we have to keep B as low as possible in order to > > build V-Optimal Histogram in feasible time. And here is a paradox. > > From one side we use V-Optimal Histogram in order to minimize SSE but > > from the other side we have to keep B as low as possible which > > eventually leads to increase in SSE. > > > > I don't recall the details of the MHIST-2 scheme, but it's true > calculating "perfect" V-optimal histogram would be quadratic O(N^2*B). > > But that's exactly why greedy/approximate algorithms exist. Yes, it's > not going to produce the optimal V-optimal histogram, but so what? > > > On the other hand time complexity required to build an Equi-Depth > > Histogram doesn't depend on B and can be estimated as O(M*N*logN). SSE > > can be arbitrarily reduced by increasing B which in turn is only > > limited by the storage limit. Experimental results show low error > > metric [3]. > > > > And how does this compare to the approximate/greedy algorithms, both in > terms of construction time and accuracy? > > The [3] paper does not compare those, ofc, and I'm somewhat skeptical > about the results in that paper. Not that they'd be wrong, but AFAICS > the assumptions are quite simplistic and well-suited for that particular > type of histogram. > > It's far from clear how would it perform for less "smooth" data, > non-numeric data, etc. 
> > > In Equi-Depth Histogram a bucket is represented by two vectors. The > > first vector points to the left bottom corner of the bucket and the > > other one points to the right top corner of the bucket. Thus space > > complexity of Equi-Depth Histogram can be estimated as > > 2*integer_size*M*B. Assume that M equals 3, B equals 1000 and > > integer_size equals 4 bytes then Equi-Depth Histogram will occupy 24000 > > bytes. > > > > If a bucket is partially intersected by a query box then we assume > > that data has uniform distribution inside of the bucket. It is a > > reasonable assumption if B is relatively large. > > > > In order to efficiently find buckets which intersect a query box we can > > store the Equi-Depth Histogram in an R-tree as proposed in [3]. On average it > > takes O(logB) iterations to find buckets which intersect a query box. > > As storage requirements are dominated by leaf nodes we can assume that > > it takes slightly more than 2*integer_size*M*B. > > > > But all of this can apply to histograms in general, no? It's not somehow > special to equi-depth histograms, a v-optimal histogram could be stored > as an r-tree etc. > > >> IIRC the patch tried to do something like V-optimal histograms, but > >> using a greedy algorithm and not the exhaustive stuff described in the > >> various papers. > > > > We should only consider computationally tractable solutions. In the one- > > dimensional case V-Optimal Histogram is probably a good solution but > > in the multi-dimensional case I would only consider Equi-Width or > > Equi-Depth Histograms. As stated in multiple papers Equi-Depth > > Histogram proves to be more accurate than Equi-Width Histogram. By the > > way Postgres uses something like Equi-Width Histogram. > > > > Sure, but AFAIK the greedy/approximate algorithms are not intractable. > > And at some point it becomes a tradeoff between accuracy of estimates > and resources to build the histogram.
Papers from ~1988 are great, but > maybe not sufficient to make such decisions now. > > >> FWIW I did not intend to reject the idea of adding multi-dimensional > >> histograms, but rather to provide some insight into the history of the > >> past attempt, and also point out some weaknesses of the algorithm described > >> in the 1988 paper. If you choose to work on this, I can do a review etc. > > > > Thank you very much Tomas. I am new to the community and I definitely > > didn't expect to have such a warm welcome. > > > > As I indicated above Equi-Depth Histogram proves to be more accurate > > than Equi-Width Histogram and both have the same time and space > > requirements. Postgres uses some sort of Equi-Width Histogram. Suppose > > that: > > * I will create a patch which will replace Equi-Width Histogram with > > Equi-Depth Histogram but only in the 1-dimensional case. > > * I will show experimental results which will demonstrate improvement > > of selectivity estimation. > > Then will the patch be accepted by the community? > > > > If the above patch is accepted by the community then I will proceed > > further with M-dimensional Equi-Depth Histogram... > > > > I find it very unlikely we want (or even need) to significantly rework > the 1-D histograms we already have. AFAICS the current histograms work > pretty well, and if better accuracy is needed, it's easy/efficient to > just increase the statistics target. I'm sure there are various > estimation issues, but I'm not aware of any that'd be solved simply by > using a different 1-D histogram. > > I'm not going to reject that outright - but I think the bar you'd need > to justify such a change is damn high. Pretty much everyone is using the > current histograms, which makes it a very sensitive area. You'll need to > show that it helps in practice, without causing harm ... > > It's an interesting are for experiments, no doubt about it. And if you > choose to explore it, that's fine.
But it's better to be aware it may > not end with a commit. > > > For the multi-dimensional case, I propose we first try to experiment > with the various algorithms, and figure out what works etc. Maybe > implementing them in python or something would be easier than C. > > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company
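The boundary rule Alexander quotes above, `values[(i * (nvals - 1)) / (num_hist - 1)]`, can be sketched in a few lines; on skewed data the dense region does get narrower buckets (the sample data here is invented for illustration):

```python
def equi_depth_bounds(values, num_hist):
    """Pick num_hist boundary values from the sorted input, mirroring the
    indexing rule quoted above. Each adjacent pair of boundaries then
    encloses roughly the same number of tuples (equi-depth buckets).
    num_hist must be at least 2."""
    vals = sorted(values)
    nvals = len(vals)
    return [vals[(i * (nvals - 1)) // (num_hist - 1)] for i in range(num_hist)]

# Skewed data: half of the 1000 tuples are packed into [0, 10).
data = [i % 10 for i in range(500)] + list(range(10, 510))
bounds = equi_depth_bounds(data, 6)
print(bounds)  # [0, 3, 7, 109, 309, 509]: narrow buckets in the dense region
```

Note that the construction is dominated by the sort, O(N*logN); picking the B boundaries afterwards is a single O(B) pass, which is the basis for the "no dependence on B" claim above.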
On 1/7/24 23:53, Alexander Cheshev wrote: > Hi Tomas, > >> The thing I was proposing is that it should be possible to build >> histograms with bins adapted to density in the given region. With >> smaller buckets in areas with high density. So that queries intersect >> with fewer buckets in low-density parts of the histogram. > > This is how an Equi-Depth Histogram works. Postgres has smaller buckets in > areas with high density: > > values[(i * (nvals - 1)) / (num_hist - 1)] > True, but the boundaries are somewhat random. Also, I was referring to the multi-dimensional case, it wasn't clear to me if the proposal is to do the same thing. >> I don't recall the details of the MHIST-2 scheme, but it's true >> calculating "perfect" V-optimal histogram would be quadratic O(N^2*B). > > In M-dimensional space building a "perfect" V-Optimal Histogram is an NP-hard > problem. In other words, it cannot be built in polynomial > time (unless P = NP). How did you come up with the estimate?! > See section 3.2 in this paper (the "view PDF" does not work for me, but the "source PDF" does download a postscript): https://citeseerx.ist.psu.edu/doc_view/pid/35e29cbc2bfe6662653bdae1fb89c091e2ece560 >> But that's exactly why greedy/approximate algorithms exist. Yes, it's >> not going to produce the optimal V-optimal histogram, but so what? > > A greedy/approximate algorithm has time complexity O(M*N*B), where M > equals the number of dimensions. MHIST-2 is a greedy/approximate > algorithm. > >> And how does this compare to the approximate/greedy algorithms, both in >> terms of construction time and accuracy? > > The time complexity of an Equi-Depth Histogram has no dependence on B. > Really? I'd expect that to build B buckets, the algorithm repeats some O(M*N) action B times, roughly. I mean, it needs to pick a dimension by which to split, then do some calculation on the N tuples, etc. Maybe I'm imagining that wrong, though. It's been ages since I looked at big-O complexity and/or the histograms.
I'd have to play with those algorithms for a bit again. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
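The loop Tomas describes (pick a bucket and a dimension, split, repeat until B buckets) might look roughly like the toy greedy splitter below. This is an illustration of the O(M*N*B) shape, in the spirit of the greedy schemes discussed in the thread, not the exact MHIST-2 algorithm from the paper:

```python
import statistics

def greedy_split(tuples, max_buckets):
    """Greedily split the (bucket, dimension) pair with the largest
    variance, cutting at the median, until max_buckets buckets exist.
    Each of the ~B iterations scans O(M*N) values in total."""
    ndims = len(tuples[0])
    buckets = [list(tuples)]
    while len(buckets) < max_buckets:
        best = None  # (variance, bucket index, dimension)
        for bi, bucket in enumerate(buckets):
            if len(bucket) < 2:
                continue  # nothing to split here
            for d in range(ndims):
                var = statistics.pvariance(t[d] for t in bucket)
                if best is None or var > best[0]:
                    best = (var, bi, d)
        if best is None or best[0] == 0:
            break  # every remaining bucket is a single point/value
        _, bi, d = best
        bucket = sorted(buckets.pop(bi), key=lambda t: t[d])
        mid = len(bucket) // 2  # median split keeps the buckets equi-depth
        buckets.extend([bucket[:mid], bucket[mid:]])
    return buckets

pts = [(x, y) for x in range(8) for y in range(4)]
parts = greedy_split(pts, 4)
print([len(p) for p in parts])  # four buckets of 8 points each
```

Unlike the one-dimension-at-a-time scheme from the 1988 paper, each split here is chosen globally across buckets and dimensions, so no dimension is systematically favoured.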
Hi Tomas, > See section 3.2 in this paper (the "view PDF" does not work for me, but > the "source PDF" does download a postscript): I believe that you are referring to a dynamic programming approach. That is the 1-dimensional case! Finding an optimal solution in the M-dimensional case is an NP-hard problem. Regards, Alexander Cheshev
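For context, the 1-D dynamic program being discussed here (the O(N^2*B) construction Tomas mentioned) is easy to sketch. This is an illustrative reconstruction of the standard recurrence, not code from the cited paper:

```python
def v_optimal_1d(freqs, num_buckets):
    """Exact 1-D V-optimal histogram by dynamic programming, O(N^2 * B):
    partition freqs into num_buckets contiguous buckets minimising the
    total within-bucket sum of squared errors (SSE) around the mean."""
    n = len(freqs)
    ps = [0.0] * (n + 1)   # prefix sums
    ps2 = [0.0] * (n + 1)  # prefix sums of squares
    for i, f in enumerate(freqs):
        ps[i + 1] = ps[i] + f
        ps2[i + 1] = ps2[i] + f * f

    def sse(i, j):  # SSE of the bucket freqs[i:j], computed in O(1)
        s, s2 = ps[j] - ps[i], ps2[j] - ps2[i]
        return s2 - s * s / (j - i)

    INF = float("inf")
    # dp[b][j] = minimal total SSE covering freqs[:j] with exactly b buckets
    dp = [[INF] * (n + 1) for _ in range(num_buckets + 1)]
    dp[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            dp[b][j] = min(dp[b - 1][i] + sse(i, j) for i in range(b - 1, j))
    return dp[num_buckets][n]

# Two flat plateaus: two buckets capture them exactly, so the SSE is 0.
print(v_optimal_1d([5, 5, 5, 9, 9, 9], 2))  # 0.0
```

The recurrence relies on buckets being contiguous ranges of the sorted domain, which is exactly what breaks down in M dimensions and makes the multidimensional problem NP-hard.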
On 8/1/2024 01:36, Tomas Vondra wrote: > On 1/7/24 18:26, Andrei Lepikhov wrote: >> On 7/1/2024 17:51, Tomas Vondra wrote: >>> On 1/7/24 11:22, Andrei Lepikhov wrote: >>>> On 7/1/2024 06:54, Tomas Vondra wrote: >>>>> It's an interesting are for experiments, no doubt about it. And if you >>>>> choose to explore it, that's fine. But it's better to be aware it may >>>>> not end with a commit. >>>>> For the multi-dimensional case, I propose we first try to experiment >>>>> with the various algorithms, and figure out what works etc. Maybe >>>>> implementing them in python or something would be easier than C. >>>> >>>> Curiously, trying to utilize extended statistics for some problematic >>>> cases, I am experimenting with auto-generating such statistics by >>>> definition of indexes [1]. Doing that, I wanted to add some hand-made >>>> statistics like a multidimensional histogram or just a histogram which >>>> could help to perform estimation over a set of columns/expressions. >>>> I realized that current hooks get_relation_stats_hook and >>>> get_index_stats_hook are insufficient if I want to perform an estimation >>>> over a set of ANDed quals on different columns. >>>> In your opinion, is it possible to add a hook into the extended >>>> statistics to allow for an extension to propose alternative estimation? >>>> >>>> [1] https://github.com/danolivo/pg_index_stats >>>> >>> >>> No idea, I haven't thought about that very much. Presumably the existing >>> hooks are insufficient because they're per-attnum? I guess it would make >>> sense to have a hook for all the attnums of the relation, but I'm not >>> sure it'd be enough to introduce a new extended statistics kind ... >> >> I got stuck on the same problem Alexander mentioned: we usually have >> large tables with many uniformly distributed values. In this case, MCV >> doesn't help a lot. >> >> Usually, I face problems scanning a table with a filter containing 3-6 >> ANDed quals. 
Here, Postgres multiplies selectivities and ends up with a >> less than 1 tuple selectivity. But such scans, in reality, mostly have >> some physical sense and return a bunch of tuples. It looks like the set >> of columns represents some value of a composite type. > > Understood. That's a fairly common scenario, and it can easily end up > with a rather terrible plan due to the underestimate. > >> Sometimes extended statistics on dependencies help well, but they are >> expensive for multiple columns. > > Expensive in what sense? Also, how would a multidimensional histogram be > any cheaper? Maybe my wording needed to be more precise. I haven't implemented multidimensional histograms before, so I don't know how expensive they are. I meant that for dependency statistics over about six columns, we have a lot of combinations to compute. > >> And sometimes I see that even a trivial histogram >> on a ROW(x1,x2,...) could predict a much more adequate value (a kind of >> conservative upper estimate) for a clause like "x1=N1 AND x2=N2 AND >> ..." if somewhere in an extension we'd transform it to ROW(x1,x2,...) = >> ROW(N1, N2,...). > > Are you guessing the histogram would help, or do you know it'd help? I > have no problem believing that for range queries, but I think it's far > less clear for simple equalities. I'm not sure a histogram would work > for that ... I added Teodor Sigaev to the CC of this email - he has much more user feedback on this technique. As I see it, having a histogram over a set of columns gives us an upper bound on the selectivity of the filter. I'm not sure how good a simple histogram is in that case either, but in my practice it works, preventing the use of overly optimistic query plans. > Maybe it'd be possible to track more stuff for each bucket, not just the > frequency, but also ndistinct for combinations of columns, and then use > that to do better equality estimates.
Or maybe we could see how the > "expected" and "actual" bucket frequency compare, and use that to > correct the equality estimate? Not sure. Yes, that is on my mind, too. Having such experimental stuff as extensions on GitHub, we could get some community feedback. > > But perhaps you have some data to demonstrate it'd actually help? It should be redirected to Teodor, but I will consider translating messy real-life reports into a clear example. > >> For such cases we don't have an in-core solution, and introducing a hook >> on clause list estimation (paired with maybe a hook on statistics >> generation) could help invent an extension that would deal with that >> problem. Also, it would open a way for experiments with different types >> of extended statistics ... > I really don't know how to do that, or what it would take. AFAICS the > statistics system is not very extensible from external code. Even if we > could add new types of attribute/extended stats, I'm not sure the code > calculating the estimates could use that. I imagine we can add an analysis routine and directly store statistics in an extension for demonstration and experimental purposes, no problem. -- regards, Andrei Lepikhov Postgres Professional
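Tomas's per-bucket ndistinct idea could, for an equality clause, boil down to something like the following. This is a hypothetical estimator written for illustration, not existing PostgreSQL code:

```python
def equality_rows_estimate(bucket_frequency, bucket_ndistinct, total_rows):
    """Estimate the rows matching an equality clause that falls into one
    histogram bucket, assuming the bucket's rows are spread evenly over
    its distinct value combinations: each combination then accounts for
    frequency/ndistinct of the bucket's rows."""
    return bucket_frequency * total_rows / bucket_ndistinct

# A bucket holding 1% of a 1M-row table, with 50 distinct (x1, x2)
# combinations observed in it during ANALYZE:
print(equality_rows_estimate(0.01, 50, 1_000_000))  # roughly 200 rows
```

Because the estimate never exceeds the bucket's own row count, it behaves like the conservative upper bound Andrei describes, rather than the near-zero result of multiplying per-column selectivities.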
Hi Andrei, > Maybe my wording needed to be more precise. I didn't implement > multidimensional histograms before, so I don't know how expensive they > are. I meant that for dependency statistics over about six columns, we > have a lot of combinations to compute. Equi-Depth Histogram in a 6 dimensional case requires 6 times more iterations. Postgres already uses Equi-Depth Histogram. Even if you increase the number of buckets from 100 to 1000 then there will be no overhead as the time complexity of Equi-Depth Histogram has no dependence on the number of buckets. So, no overhead at all! Regards, Alexander Cheshev On Mon, 8 Jan 2024 at 04:12, Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote: > > On 8/1/2024 01:36, Tomas Vondra wrote: > > On 1/7/24 18:26, Andrei Lepikhov wrote: > >> On 7/1/2024 17:51, Tomas Vondra wrote: > >>> On 1/7/24 11:22, Andrei Lepikhov wrote: > >>>> On 7/1/2024 06:54, Tomas Vondra wrote: > >>>>> It's an interesting are for experiments, no doubt about it. And if you > >>>>> choose to explore it, that's fine. But it's better to be aware it may > >>>>> not end with a commit. > >>>>> For the multi-dimensional case, I propose we first try to experiment > >>>>> with the various algorithms, and figure out what works etc. Maybe > >>>>> implementing them in python or something would be easier than C. > >>>> > >>>> Curiously, trying to utilize extended statistics for some problematic > >>>> cases, I am experimenting with auto-generating such statistics by > >>>> definition of indexes [1]. Doing that, I wanted to add some hand-made > >>>> statistics like a multidimensional histogram or just a histogram which > >>>> could help to perform estimation over a set of columns/expressions. > >>>> I realized that current hooks get_relation_stats_hook and > >>>> get_index_stats_hook are insufficient if I want to perform an estimation > >>>> over a set of ANDed quals on different columns. 
> >>>> In your opinion, is it possible to add a hook into the extended > >>>> statistics to allow for an extension to propose alternative estimation? > >>>> > >>>> [1] https://github.com/danolivo/pg_index_stats > >>>> > >>> > >>> No idea, I haven't thought about that very much. Presumably the existing > >>> hooks are insufficient because they're per-attnum? I guess it would make > >>> sense to have a hook for all the attnums of the relation, but I'm not > >>> sure it'd be enough to introduce a new extended statistics kind ... > >> > >> I got stuck on the same problem Alexander mentioned: we usually have > >> large tables with many uniformly distributed values. In this case, MCV > >> doesn't help a lot. > >> > >> Usually, I face problems scanning a table with a filter containing 3-6 > >> ANDed quals. Here, Postgres multiplies selectivities and ends up with a > >> less than 1 tuple selectivity. But such scans, in reality, mostly have > >> some physical sense and return a bunch of tuples. It looks like the set > >> of columns representing some value of composite type. > > > > Understood. That's a fairly common scenario, and it can easily end up > > with rather terrible plan due to the underestimate. > > > >> Sometimes extended statistics on dependency helps well, but it expensive > >> for multiple columns. > > > > Expensive in what sense? Also, how would a multidimensional histogram be > > any cheaper? > Maybe my wording needed to be more precise. I didn't implement > multidimensional histograms before, so I don't know how expensive they > are. I meant that for dependency statistics over about six columns, we > have a lot of combinations to compute. > > > >> And sometimes I see that even a trivial histogram > >> on a ROW(x1,x2,...) could predict a much more adequate value (kind of > >> conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND > >> ..." if somewhere in extension we'd transform it to ROW(x1,x2,...) = > >> ROW(N1, N2,...). 
> >
> > Are you guessing the histogram would help, or do you know it'd help? I
> > have no problem believing that for range queries, but I think it's far
> > less clear for simple equalities. I'm not sure a histogram would work
> > for that ...
>
> I added Teodor Sigaev to the CC of this email - he has much more user
> feedback on this technique. As I see it, having a histogram over a set of
> columns gives us an upper bound on the selectivity of the filter. I'm not
> sure how good a simple histogram is in that case, too, but in my
> practice, it works, disallowing the usage of too-optimistic query plans.
>
> > Maybe it'd be possible to track more stuff for each bucket, not just the
> > frequency, but also ndistinct for combinations of columns, and then use
> > that to do better equality estimates. Or maybe we could see how the
> > "expected" and "actual" bucket frequency compare, and use that to
> > correct the equality estimate? Not sure.
> Yes, it is in my mind, too. Having such experimental stuff as an
> extension (or extensions) on GitHub, we could get some community feedback.
>
> > But perhaps you have some data to demonstrate it'd actually help?
> That should be redirected to Teodor, but I will consider translating messy
> real-life reports into a clear example.
> >
> >> For such cases we don't have an in-core solution, and introducing a hook
> >> on clause list estimation (paired with maybe a hook on statistics
> >> generation) could help invent an extension that would deal with that
> >> problem. Also, it would open a way for experiments with different types
> >> of extended statistics ...
> > I really don't know how to do that, or what it would take. AFAICS the
> > statistics system is not very extensible from external code. Even if we
> > could add new types of attribute/extended stats, I'm not sure the code
> > calculating the estimates could use that.
> I imagine we can add an analysis routine and directly store statistics
> in an extension for demonstration and experimental purposes, no problem.
>
> --
> regards,
> Andrei Lepikhov
> Postgres Professional
>
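The per-column equi-depth construction from the DeWitt-style algorithm discussed in this thread is short enough to prototype in Python (as Tomas suggests above). The sketch below is a toy illustration under my own conventions - the function name, the nested-list bucket representation, and the equal-count slicing are mine, not anything from the Postgres source:

```python
# Toy sketch of a multidimensional equi-depth histogram built the
# "one column at a time" way: sort on the first column, cut into b
# equal-count slices, then recurse on each slice for the next column.
# Each returned bucket is simply the list of rows it contains.
def equi_depth(rows, dims, b):
    if dims == 0:
        return [rows]
    col = len(rows[0]) - dims            # dimension handled at this level
    rows = sorted(rows, key=lambda r: r[col])
    n = len(rows)
    buckets = []
    for i in range(b):
        lo, hi = i * n // b, (i + 1) * n // b   # equal-count slice
        if lo < hi:
            buckets.extend(equi_depth(rows[lo:hi], dims - 1, b))
    return buckets
```

Each added dimension adds one more sort-and-slice pass per slice, so the total work grows roughly linearly with the number of dimensions (modulo the log factor from sorting), which is consistent with Alexander's "6 times more iterations" claim; note, though, that the mesh is exactly the regular one Tomas criticizes above.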
On 8/1/2024 16:21, Alexander Cheshev wrote:
> Hi Andrei,
>
>> Maybe my wording needed to be more precise. I didn't implement
>> multidimensional histograms before, so I don't know how expensive they
>> are. I meant that for dependency statistics over about six columns, we
>> have a lot of combinations to compute.
>
> An equi-depth histogram in the 6-dimensional case requires only 6 times
> more iterations. Postgres already uses equi-depth histograms. Even if you
> increase the number of buckets from 100 to 1000, there will be no
> overhead, as the time complexity of building an equi-depth histogram does
> not depend on the number of buckets. So, no overhead at all!

Maybe. For three columns, we have 9 combinations (passes) for building
dependency statistics and 4 combinations for ndistinct coefficients; for
six columns, we have 186 and 57 combinations, respectively. Even though a
dependency is just one number per combination, building the dependency
statistics is still massive work. So, in the multicolumn case, having
something like a histogram may be more effective.

--
regards,
Andrei Lepikhov
Postgres Professional
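Andrei's combination counts follow directly from the shape of the two statistics kinds: a functional dependency pairs a determinant subset with one dependent attribute (C(n,k) subsets of size k, times k choices of dependent), while ndistinct stores one coefficient per attribute subset of size two or more. A quick sketch reproduces his numbers (plain Python; the function names are mine, and this is an illustration of the counting, not the Postgres implementation):

```python
from math import comb

def dependency_combinations(n):
    # For each attribute subset of size k >= 2, any of its k members
    # can be the dependent attribute: C(n, k) * k passes in total.
    return sum(comb(n, k) * k for k in range(2, n + 1))

def ndistinct_combinations(n):
    # One ndistinct coefficient per attribute subset of size >= 2.
    return sum(comb(n, k) for k in range(2, n + 1))

# Matches the counts quoted in the thread:
#   3 columns -> 9 dependency passes, 4 ndistinct coefficients
#   6 columns -> 186 dependency passes, 57 ndistinct coefficients
```

The dependency count grows much faster than linearly in the number of columns, which is the cost Andrei contrasts with the roughly linear-in-dimensions cost claimed for a multidimensional equi-depth histogram.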