Thread: Multidimensional Histograms

Multidimensional Histograms

From
Alexander Cheshev
Date:
Hello Hackers,

To improve the selectivity estimates of queries, I suggest adding
support for multidimensional histograms as described in paper [1].

To query multidimensional histograms efficiently, we can use H-trees as
described in paper [2].

Postgres has limited support for multivariate statistics:
 * MCV lists are only useful for columns with a small number of
distinct values;
 * functional dependencies only reflect dependencies among columns
(not among column values).

[1] http://www.cs.cmu.edu/~rcarlson/docs/RyanCarlson_databases.pdf
[2] https://dl.acm.org/doi/pdf/10.1145/50202.50205

-- 
Regards,
Alexander Cheshev



Re: Multidimensional Histograms

From
Tomas Vondra
Date:
Hello Alexander,

We actually had histograms in the early patches adding multivariate
statistics [1]. We however ended up removing histograms and only kept
the simpler types, for a couple reasons.

It might be worth going through the discussion - I'm sure one of the
reasons was we needed to limit the scope of the patch, and histograms
were much more complex and possibly less useful compared to the other
statistics types.

Another reason was that the algorithm described in the two papers you
reference (1988 paper by DeWitt and the evaluation by Carlson and
Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
processes the columns one by one - first it builds buckets on column
"a", then splits each bucket into buckets on "b". So it's not symmetrical,
and it results in lower accuracy compared to an "ideal" histogram with
the same number of buckets (especially for the dimensions split early).

This does indeed produce an equi-depth histogram, which seems nice, but
the mesh is regular in such a way that any value of the first dimension
intersects all buckets on the second dimension. So for example with a
histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
100 buckets on "b", each representing 1/10000 of the tuples. But we don't
know how the tuples are distributed within the bucket - so we end up using
0.5 of the bucket as an unbiased estimate, but that can easily add up in
the wrong direction.
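To make this concrete, here's a rough Python sketch of the 2-D
column-by-column construction (my own toy code, just to illustrate the
regular mesh; the names and the data are made up):

```python
import random

def equidepth_1988(points, k):
    """2-D sketch of the column-by-column construction: split the data
    into k equal-count slices on the first column, then split each
    slice into k equal-count buckets on the second column."""
    def chunks(seq, n):
        # n contiguous chunks of (nearly) equal size
        q, r = divmod(len(seq), n)
        out, i = [], 0
        for j in range(n):
            step = q + (1 if j < r else 0)
            out.append(seq[i:i + step])
            i += step
        return out

    buckets = []
    for a_slice in chunks(sorted(points, key=lambda p: p[0]), k):
        for cell in chunks(sorted(a_slice, key=lambda p: p[1]), k):
            lo = (min(p[0] for p in cell), min(p[1] for p in cell))
            hi = (max(p[0] for p in cell), max(p[1] for p in cell))
            buckets.append((lo, hi, len(cell)))
    return buckets

random.seed(1)
pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10_000)]
hist = equidepth_1988(pts, 10)   # 10 x 10 = 100 buckets, 100 tuples each
```

Every bucket ends up with exactly 1/100 of the tuples, but an equality
condition on "a" picks one slice and still intersects all 10 of that
slice's buckets on "b".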

However, this is not the only way to build an equi-depth histogram,
there are ways to make it more symmetrical. Also, it's not clear
equi-depth histograms are ideal with multiple columns. There are papers
proposing various other types of histograms (using different criteria to
build buckets optimizing a different metric). The most interesting one
seems to be V-Optimal histograms - see for example papers [1], [2], [3],
[4], [5] and [6]. I'm sure there are more. The drawback of course is
that it's more expensive to build such histograms.

IIRC the patch tried to do something like V-optimal histograms, but
using a greedy algorithm and not the exhaustive stuff described in the
various papers.

[1] https://www.vldb.org/conf/1998/p275.pdf
[2]
https://cs-people.bu.edu/mathan/reading-groups/papers-classics/histograms.pdf
[3] https://dl.acm.org/doi/pdf/10.1145/304182.304200
[4] http://www.cs.columbia.edu/~gravano/Papers/2001/sigmod01b.pdf
[5] https://cs.gmu.edu/~carlotta/publications/vldb090.pdf
[6] https://core.ac.uk/download/pdf/82158702.pdf


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multidimensional Histograms

From
Tomas Vondra
Date:
On 12/27/23 22:19, Tomas Vondra wrote:
> Hello Alexander,
> 
> We actually had histograms in the early patches adding multivariate
> statistics [1]. We however ended up removing histograms and only kept
> the simpler types, for a couple reasons.
> 
> It might be worth going through the discussion - I'm sure one of the
> reasons was we needed to limit the scope of the patch, and histograms
> were much more complex and possibly less useful compared to the other
> statistics types.
>
> ...

FWIW I did not intend to reject the idea of adding multi-dimensional
histograms, but rather to provide some insight into the history of the
past attempt, and also point out some weaknesses of the algorithm described
in the 1988 paper. If you choose to work on this, I can do a review etc.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multidimensional Histograms

From
Alexander Cheshev
Date:
Hi Tomas,

> Another reason was that the algorithm described in the two papers you
> reference (1988 paper by DeWitt and the evaluation by Carlson and
> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> processes the columns one by one - first build bucket on column "a",
> then splits each bucket into buckets on "b". So it's not symmetrical,
> and it results in lower accuracy compared to an "ideal" histogram with
> the same number of buckets (especially for the dimensions split early).

As stated in [1], the Sum Squared Error (SSE) is one of the most
natural error metrics. An Equi-Depth Histogram is not ideal because it
doesn't minimize SSE. A V-Optimal Histogram, on the other hand, does
minimize SSE and from this point of view can be considered an ideal
solution.

> This does indeed produce an equi-depth histogram, which seems nice, but
> the mesh is regular in such a way that any value of the first dimension
> intersects all buckets on the second dimension. So for example with a
> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> 100 buckets on "b", each representing 1/10000 of tuples. But we don't
> know how the tuples are distributed within the bucket - so we end up using
> 0.5 of the bucket as unbiased estimate, but that can easily add up in
> the wrong dimension.

Suppose that we have a query box a=X and we know how the data is
distributed within the buckets. Let's consider only the buckets which
are intersected by the query box a=X. As we know how the data is
distributed within the buckets, we can exclude the buckets which
contain no tuples intersecting the query box a=X.

As we store buckets with no information about the data distribution, we
have to make reasonable assumptions. If we keep the buckets relatively
small, then we can assume that the data is uniformly distributed within
each bucket.

What I am trying to say is that the problem which you pointed out comes
from the fact that we store buckets with no information about the data
distribution. Even in the one-dimensional case we have to assume how
the data is distributed within buckets. By the way, Postgres assumes a
uniform distribution within buckets.
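As an illustration, here is a small Python sketch of that uniformity
assumption (my own toy code, not anything from Postgres): a bucket
partially covered by a query box contributes the fraction of its volume
that lies inside the box.

```python
def bucket_overlap_fraction(lo, hi, qlo, qhi):
    """Fraction of a bucket's volume covered by a query box, assuming
    the tuples are uniformly distributed inside the bucket."""
    frac = 1.0
    for l, h, ql, qh in zip(lo, hi, qlo, qhi):
        if h <= l:                        # degenerate (zero-width) dimension
            frac *= 1.0 if ql <= l <= qh else 0.0
        else:
            overlap = max(0.0, min(h, qh) - max(l, ql))
            frac *= overlap / (h - l)
    return frac

# Bucket [0,10] x [0,10] holding 100 tuples, query box [0,5] x [0,2]:
est = 100 * bucket_overlap_fraction((0, 0), (10, 10), (0, 0), (5, 2))
# -> about 100 * 0.5 * 0.2 = 10 tuples estimated
```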

> However, this is not the only way to build an equi-depth histogram,
> there are ways to make it more symmetrical. Also, it's not clear
> equi-depth histograms are ideal with multiple columns. There are papers
> proposing various other types of histograms (using different criteria to
> build buckets optimizing a different metric). The most interesting one
> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> [4], [5] and [6]. I'm sure there are more. The drawback of course is
> that it's more expensive to build such histograms.

Tomas, thank you for sharing your ideas regarding V-Optimal Histograms
with me. I read through the papers you pointed me to and came to the
following conclusion.

The problem can be formulated like this. We have N tuples in an
M-dimensional space. We need to partition the space into buckets
iteratively until the SSE is less than E or we reach the limit of B
buckets.

In the case of an M-dimensional space this looks to me like an NP-hard
problem. A greedy heuristic called MHIST-2 is proposed in [2]. First we
sort the N tuples in ascending order. Then we iteratively select the
bucket whose split leads to the largest SSE reduction and split it into
two parts. We repeat the process until the SSE is less than E or we
reach the limit of B buckets.
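For intuition, here is a toy one-dimensional Python sketch of that
greedy loop (my own simplification, not the actual MHIST-2 algorithm
from the paper):

```python
def sse(vals):
    # sum of squared deviations from the bucket mean
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(vals):
    # best single split of a sorted bucket: (SSE reduction, split index)
    total = sse(vals)
    best = (0.0, 1)
    for i in range(1, len(vals)):
        red = total - sse(vals[:i]) - sse(vals[i:])
        if red > best[0]:
            best = (red, i)
    return best

def greedy_voptimal(vals, max_buckets):
    """Greedy V-Optimal-style partitioning, one-dimensional for
    clarity: repeatedly split the bucket offering the largest SSE
    reduction until the bucket limit is reached."""
    buckets = [sorted(vals)]
    while len(buckets) < max_buckets:
        splittable = [(bi, b) for bi, b in enumerate(buckets) if len(b) > 1]
        if not splittable:
            break
        red, i, bi = max(best_split(b) + (bi,) for bi, b in splittable)
        b = buckets.pop(bi)
        buckets += [b[:i], b[i:]]
    return buckets

parts = greedy_voptimal([10, 10, 1, 1, 10, 1], 2)
# splits cleanly between the two value clusters: [1, 1, 1] and [10, 10, 10]
```

The inner scan here recomputes the SSE for every split point, which is
quadratic per bucket; with running sums of v and v^2 it becomes linear
per pass, which is roughly where an O(M*N*B) kind of estimate comes
from.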

If we assume that B is significantly less than N, then the time
complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
equals 3, B equals 1000 and N equals 300*B; then it will take slightly
over 0.9*10^9 iterations to build a V-Optimal Histogram.

You can see that we have to keep B as low as possible in order to
build a V-Optimal Histogram in feasible time. And here is a paradox.
On the one hand we use a V-Optimal Histogram in order to minimize the
SSE, but on the other hand we have to keep B as low as possible, which
eventually leads to an increase in the SSE.

On the other hand, the time complexity required to build an Equi-Depth
Histogram doesn't depend on B and can be estimated as O(M*N*logN). The
SSE can be arbitrarily reduced by increasing B, which in turn is only
limited by the storage budget. Experimental results show a low error
metric [3].

In an Equi-Depth Histogram a bucket is represented by two vectors. The
first vector points to the bottom-left corner of the bucket and the
other one points to the top-right corner. Thus the space complexity of
an Equi-Depth Histogram can be estimated as 2*integer_size*M*B.
Assuming that M equals 3, B equals 1000 and integer_size equals 4
bytes, the Equi-Depth Histogram will occupy 24000 bytes.
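That arithmetic is easy to sanity-check (a trivial Python sketch; the
function name is just mine):

```python
def equidepth_storage_bytes(m_dims, n_buckets, value_size=4):
    # two corner vectors of m_dims values each, per bucket
    return 2 * value_size * m_dims * n_buckets

equidepth_storage_bytes(m_dims=3, n_buckets=1000)
# -> 24000 bytes, matching the figure above
```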

If a bucket is partially intersected by a query box, then we assume
that the data is uniformly distributed inside the bucket. That is a
reasonable assumption if B is relatively large.

In order to efficiently find the buckets which intersect a query box,
we can store the Equi-Depth Histogram in an R-tree as proposed in [3].
On average it takes O(logB) iterations to find the buckets which
intersect a query box. As the storage requirements are dominated by the
leaf nodes, we can assume that the R-tree takes slightly more than
2*integer_size*M*B bytes.

> IIRC the patch tried to do something like V-optimal histograms, but
> using a greedy algorithm and not the exhaustive stuff described in the
> various papers.

We should only consider computationally tractable solutions. In the
one-dimensional case a V-Optimal Histogram is probably a good solution,
but in the multi-dimensional case I would only consider Equi-Width or
Equi-Depth Histograms. As stated in multiple papers, an Equi-Depth
Histogram proves to be more accurate than an Equi-Width Histogram. By
the way, Postgres uses something like an Equi-Width Histogram.

> FWIW I did not intend to reject the idea of adding multi-dimensional
> histograms, but rather to provide some insight into the history of the
> past attempt, and also point some weaknesses of the algorithm described
> in the 1988 paper. If you choose to work on this, I can do a review etc.

Thank you very much, Tomas. I am new to the community and I definitely
didn't expect such a warm welcome.

As I indicated above, an Equi-Depth Histogram proves to be more
accurate than an Equi-Width Histogram, and both have the same time and
space requirements. Postgres uses some sort of Equi-Width Histogram.
Suppose that:
 * I create a patch which replaces the Equi-Width Histogram with an
Equi-Depth Histogram, but only in the 1-dimensional case.
 * I show experimental results which demonstrate an improvement in
selectivity estimation.
Then will the patch be accepted by the community?

If the above patch is accepted by the community then I will proceed
further with the M-dimensional Equi-Depth Histogram...


[1] https://www.vldb.org/conf/1998/p275.pdf
[2] https://www.vldb.org/conf/1997/P486.PDF
[3] https://dl.acm.org/doi/pdf/10.1145/50202.50205

Regards,
Alexander Cheshev

On Thu, 28 Dec 2023 at 15:25, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> ...



Re: Multidimensional Histograms

From
Alexander Cheshev
Date:
Hi Tomas,

I am sorry, I didn't look into the code carefully. Indeed, Postgres
uses an Equi-Depth Histogram:

delta = (nvals - 1) / (num_hist - 1);
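In spirit, that boundary selection looks roughly like this Python
sketch (my paraphrase of the idea, not the actual code from analyze.c):

```python
def histogram_bounds(sorted_vals, num_hist):
    """Choose num_hist boundary values from a sorted sample so that
    each of the num_hist - 1 bins covers roughly the same number of
    values - i.e. an equi-depth (equal-frequency) histogram."""
    nvals = len(sorted_vals)
    delta = (nvals - 1) / (num_hist - 1)
    return [sorted_vals[round(i * delta)] for i in range(num_hist)]

histogram_bounds(list(range(101)), 11)
# -> [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```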


Regards,
Alexander Cheshev

On Sat, 6 Jan 2024 at 01:00, Alexander Cheshev <alex.cheshev@gmail.com> wrote:
>
> ...



Re: Multidimensional Histograms

From
Tomas Vondra
Date:

On 1/6/24 01:00, Alexander Cheshev wrote:
> Hi Tomas,
> 
>> Another reason was that the algorithm described in the two papers you
>> reference (1988 paper by DeWitt and the evaluation by Carlson and
>> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
>> processes the columns one by one - first build bucket on column "a",
>> then splits each bucket into buckets on "b". So it's not symmetrical,
>> and it results in lower accuracy compared to an "ideal" histogram with
>> the same number of buckets (especially for the dimensions split early).
> 
> As stated in [1] Sum Square Error (SSE) is one of the most natural
> error metrics. Equi-Depth Histogram is not ideal because it doesn't
> minimize SSE. On the other hand V-Optimal Histogram indeed minimizes
> SSE and from this point of view can be considered as an ideal
> solution.
> 
>> This does indeed produce an equi-depth histogram, which seems nice, but
>> the mesh is regular in such a way that any value of the first dimension
>> intersects all buckets on the second dimension. So for example with a
>> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
>> 100 buckets on "b", each representing 1/10000 of tuples. But we don't
>> know how the tuples are distributed within the bucket - so we end up using
>> 0.5 of the bucket as unbiased estimate, but that can easily add up in
>> the wrong dimension.
> 
> Suppose that we have a query box a=X and we know how data is
> distributed in buckets. Let's consider only the buckets which are
> intersected by the query box a=X. As we know how data is distributed
> in buckets we can exclude the buckets which have no tuples which
> intersect the query box a=X.
> 
> As we store buckets with no information about data distribution we
> have to make reasonable assumptions. If we keep buckets relatively
> small then we can assume that buckets have uniform distribution.
> 
> What I am trying to say is that the problem which you pointed out
> comes from the fact that we store buckets with no information about
> data distribution. Even in one dimensional case we have to assume how
> data is distributed in buckets. By the way Postgres assumes that data
> has uniform distribution in buckets.
> 

It's not just what Postgres assumes - the assumption of bucket
uniformity is somewhat inherent to the whole concept of a histogram.
Yes, maybe we could keep some "distribution" info about each bucket,
but then maybe we could simply build a histogram with more bins
occupying the same space?

The thing I was proposing is that it should be possible to build
histograms with bins adapted to the density in the given region, with
smaller buckets in areas of high density, so that queries intersect
fewer buckets in the low-density parts of the histogram.
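For example, recursive median splits cycling through the dimensions,
kd-tree style, give exactly that kind of density-adapted partitioning
(a toy Python sketch of my own, not anything from the old patch):

```python
import random

def kd_equidepth(points, depth_limit):
    """Recursive median splits, cycling through the dimensions
    (kd-tree style): every leaf bucket holds the same number of
    tuples, so buckets come out smaller where the data is dense."""
    def split(pts, depth):
        if depth == depth_limit or len(pts) < 2:
            lo = tuple(min(p[d] for p in pts) for d in range(len(pts[0])))
            hi = tuple(max(p[d] for p in pts) for d in range(len(pts[0])))
            return [(lo, hi, len(pts))]
        d = depth % len(pts[0])          # split dimension cycles with depth
        pts = sorted(pts, key=lambda p: p[d])
        mid = len(pts) // 2              # median split: equal counts
        return split(pts[:mid], depth + 1) + split(pts[mid:], depth + 1)
    return split(points, 0)

random.seed(7)
pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1024)]
buckets = kd_equidepth(pts, 4)   # 2^4 = 16 buckets of 64 tuples each
```

Unlike the column-by-column mesh, a point predicate here only
intersects the buckets along one root-to-leaf path of splits.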

>> However, this is not the only way to build an equi-depth histogram,
>> there are ways to make it more symmetrical. Also, it's not clear
>> equi-depth histograms are ideal with multiple columns. There are papers
>> proposing various other types of histograms (using different criteria to
>> build buckets optimizing a different metric). The most interesting one
>> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
>> [4], [5] and [6]. I'm sure there are more. The drawback of course is
>> that it's more expensive to build such histograms.
> 
> Tomas thank you for sharing with me your ideas regarding V-Optimal
> Histogram. I read through the papers which you gave me and came up
> with the following conclusion.
> 
> The problem can be formulated like this. We have N tuples in
> M-dimensional space. We need to partition space into buckets
> iteratively until SSE is less than E or we reach the limit of buckets
> B.
> 

Yes. Although v-optimal histograms minimize variance of frequencies. Not
sure if that's what you mean by SSE.

> In the case of M-dimensional space it seems to me like an NP-hard
> problem. A greedy heuristic MHIST-2 is proposed in [2]. Preliminary we
> sort N tuples in ascending order. Then we iteratively select a bucket
> which leads to the largest SSE reduction and split it into two parts.
> We repeat the process until SSE is less than E or we reach the limit
> of buckets B.
> 

I don't recall all the details of the MHIST-2 algorithm, but this sounds
about right. Yes, building the optimal histogram would be NP-hard, so
we'd have to use some approximate / greedy algorithm.

> If we assume that B is significantly less than N then the time
> complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
> equals 3, B equals 1000 and N equals 300*B then it will take slightly
> over 0.9*10^9 iterations to build a V-Optimal Histogram.
> 
> You can see that we have to keep B as low as possible in order to
> build V-Optimal Histogram in feasible time. And here is a paradox.
> From one side we use V-Optimal Histogram in order to minimize SSE but
> from the other side we have to keep B as low as possible which
> eventually leads to increase in SSE.
> 

I don't recall the details of the MHIST-2 scheme, but it's true that
calculating a "perfect" V-optimal histogram would be quadratic, O(N^2*B).

But that's exactly why greedy/approximate algorithms exist. Yes, it's
not going to produce the optimal V-optimal histogram, but so what?

> On the other hand time complexity required to build an Equi-Depth
> Histogram doesn't depend on B and can be estimated as O(M*N*logN). SSE
> can be arbitrarily reduced by increasing B which in turn is only
> limited by the storage limit. Experimental results show low error
> metric [3].
> 

And how does this compare to the approximate/greedy algorithms, both in
terms of construction time and accuracy?

The [3] paper does not compare those, ofc, and I'm somewhat skeptical
about the results in that paper. Not that they'd be wrong, but AFAICS
the assumptions are quite simplistic and well-suited for that particular
type of histogram.

It's far from clear how would it perform for less "smooth" data,
non-numeric data, etc.

> In Equi-Depth Histogram a bucket is represented by two vectors. The
> first vector points to the left bottom corner of the bucket and the
> other one point to the right top corner of the bucket. Thus space
> complexity of Equi-Depth Histogram can be estimated as
> 2*integer_size*M*B. Assume that M equals 3, B equals 1000 and
> integer_size equals 4 bytes then Equi-Depth Histogram will occupy 24000
> bytes.
> 
> If a bucket is partially intersected by a query box then we assume
> that data has uniform distribution inside of the bucket. It is a
> reasonable assumption if B is relatively large.
> 
> In order to efficiently find buckets which intersect a query box we can
> store Equi-Depth Histogram in R-tree as proposed in [3]. On average it
> takes O(logB) iterations to find buckets which intersect a query box.
> As storage requirements are dominated by leaf nodes we can assume that
> it takes slightly more than 2*integer_size*M*B.
> 

But all of this can apply to histograms in general, no? It's not somehow
special to equi-depth histograms, a v-optimal histogram could be stored
as an r-tree etc.

>> IIRC the patch tried to do something like V-optimal histograms, but
>> using a greedy algorithm and not the exhaustive stuff described in the
>> various papers.
> 
> We should only consider computationally tractable solutions. In one
> dimensional case V-Optimal Histogram is probably a good solution but
> in multi-dimensional case I would only consider Equi-Width or
> Equi-Depth Histograms. As stated in multiple papers Equi-Depth
> Histogram proves to be more accurate than Equi-Width Histogram. By the
> way Postgres uses something like Equi-Width Histogram.
> 

Sure, but AFAIK the greedy/approximate algorithms are not intractable.

And at some point it becomes a tradeoff between accuracy of estimates
and resources to build the histogram. Papers from ~1988 are great, but
maybe not sufficient to make such decisions now.

>> FWIW I did not intend to reject the idea of adding multi-dimensional
>> histograms, but rather to provide some insight into the history of the
>> past attempt, and also point some weaknesses of the algorithm described
>> in the 1988 paper. If you choose to work on this, I can do a review etc.
> 
> Thank you very much Tomas. I am new in the community and I definitely
> didn't expect to have such a warm welcome.
> 
> As I indicated above Equi-Depth Histogram proves to be more accurate
> than Equi-Width Histogram and both have the same time and space
> requirements. Postgres uses some sort of Equi-Width Histogram. Suppose
> that:
>  * I will create a patch which will replace Equi-Width Histogram with
> Equi-Depth Histogram but only in 1-dimensional case.
>  * I will show experimental results which will demonstrate improvement
> of selectivity estimation.
> Then will the patch be accepted by the community?
> 
> If the above patch is accepted by the community then I will proceed
> further with M-dimensional Equi-Depth Histogram...
> 

I find it very unlikely we want (or even need) to significantly rework
the 1-D histograms we already have. AFAICS the current histograms work
pretty well, and if better accuracy is needed, it's easy/efficient to
just increase the statistics target. I'm sure there are various
estimation issues, but I'm not aware of any that'd be solved simply by
using a different 1-D histogram.

I'm not going to reject that outright - but I think the bar you'd need
to clear to justify such a change is damn high. Pretty much everyone is
using the current histograms, which makes it a very sensitive area.
You'll need to show that it helps in practice, without causing harm ...

It's an interesting area for experiments, no doubt about it. And if you
choose to explore it, that's fine. But it's better to be aware it may
not end with a commit.


For the multi-dimensional case, I propose we first try to experiment
with the various algorithms, and figure out what works etc. Maybe
implementing them in python or something would be easier than C.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multidimensional Histograms

From
Andrei Lepikhov
Date:
On 7/1/2024 06:54, Tomas Vondra wrote:
> It's an interesting area for experiments, no doubt about it. And if you
> choose to explore it, that's fine. But it's better to be aware it may
> not end with a commit.
> For the multi-dimensional case, I propose we first try to experiment
> with the various algorithms, and figure out what works etc. Maybe
> implementing them in python or something would be easier than C.

Curiously, while trying to utilize extended statistics for some
problematic cases, I am experimenting with auto-generating such
statistics from the definitions of indexes [1]. Doing that, I wanted to
add some hand-made statistics like a multidimensional histogram, or
just a histogram which could help to perform estimation over a set of
columns/expressions.
I realized that the current hooks get_relation_stats_hook and
get_index_stats_hook are insufficient if I want to perform an
estimation over a set of ANDed quals on different columns.
In your opinion, is it possible to add a hook into the extended 
statistics to allow for an extension to propose alternative estimation?

[1] https://github.com/danolivo/pg_index_stats

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: Multidimensional Histograms

From
Tomas Vondra
Date:

On 1/7/24 11:22, Andrei Lepikhov wrote:
> On 7/1/2024 06:54, Tomas Vondra wrote:
>> It's an interesting area for experiments, no doubt about it. And if you
>> choose to explore it, that's fine. But it's better to be aware it may
>> not end with a commit.
>> For the multi-dimensional case, I propose we first try to experiment
>> with the various algorithms, and figure out what works etc. Maybe
>> implementing them in python or something would be easier than C.
> 
> Curiously, trying to utilize extended statistics for some problematic
> cases, I am experimenting with auto-generating such statistics by
> definition of indexes [1]. Doing that, I wanted to add some hand-made
> statistics like a multidimensional histogram or just a histogram which
> could help to perform estimation over a set of columns/expressions.
> I realized that current hooks get_relation_stats_hook and
> get_index_stats_hook are insufficient if I want to perform an estimation
> over a set of ANDed quals on different columns.
> In your opinion, is it possible to add a hook into the extended
> statistics to allow for an extension to propose alternative estimation?
> 
> [1] https://github.com/danolivo/pg_index_stats
> 

No idea, I haven't thought about that very much. Presumably the existing
hooks are insufficient because they're per-attnum? I guess it would make
sense to have a hook for all the attnums of the relation, but I'm not
sure it'd be enough to introduce a new extended statistics kind ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multidimensional Histograms

From
Andrei Lepikhov
Date:
On 7/1/2024 17:51, Tomas Vondra wrote:
> On 1/7/24 11:22, Andrei Lepikhov wrote:
>> On 7/1/2024 06:54, Tomas Vondra wrote:
>>> It's an interesting area for experiments, no doubt about it. And if you
>>> choose to explore it, that's fine. But it's better to be aware it may
>>> not end with a commit.
>>> For the multi-dimensional case, I propose we first try to experiment
>>> with the various algorithms, and figure out what works etc. Maybe
>>> implementing them in python or something would be easier than C.
>>
>> Curiously, trying to utilize extended statistics for some problematic
>> cases, I am experimenting with auto-generating such statistics by
>> definition of indexes [1]. Doing that, I wanted to add some hand-made
>> statistics like a multidimensional histogram or just a histogram which
>> could help to perform estimation over a set of columns/expressions.
>> I realized that current hooks get_relation_stats_hook and
>> get_index_stats_hook are insufficient if I want to perform an estimation
>> over a set of ANDed quals on different columns.
>> In your opinion, is it possible to add a hook into the extended
>> statistics to allow for an extension to propose alternative estimation?
>>
>> [1] https://github.com/danolivo/pg_index_stats
>>
> 
> No idea, I haven't thought about that very much. Presumably the existing
> hooks are insufficient because they're per-attnum? I guess it would make
> sense to have a hook for all the attnums of the relation, but I'm not
> sure it'd be enough to introduce a new extended statistics kind ...

I got stuck on the same problem Alexander mentioned: we usually have 
large tables with many uniformly distributed values. In this case, MCV 
doesn't help a lot.
Usually, I face problems scanning a table with a filter containing 3-6 
ANDed quals. Here, Postgres multiplies selectivities and ends up with a 
selectivity of less than one tuple. But such scans, in reality, mostly 
have some physical meaning and return a bunch of tuples. It looks like 
the set of columns represents some value of a composite type.
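As a toy sketch of the effect (synthetic data, my own illustration, not
planner code):

```python
# Toy illustration: four columns derived from one hidden key, so they are
# perfectly correlated. Multiplying per-column selectivities (the
# independence assumption) underestimates the ANDed filter badly.
import random

random.seed(42)
N = 100_000
rows = [(k, k % 10, k % 7, k % 3)
        for k in (random.randrange(100) for _ in range(N))]

def sel(col, val):
    # per-column selectivity, as a 1-D statistic would estimate it
    return sum(1 for r in rows if r[col] == val) / N

# filter: x1 = 5 AND x2 = 5 AND x3 = 5 AND x4 = 2 (all consistent with k = 5)
est = N * sel(0, 5) * sel(1, 5) * sel(2, 5) * sel(3, 2)
actual = sum(1 for r in rows if r == (5, 5, 5, 2))

print(f"estimated rows: {est:.1f}, actual rows: {actual}")
# the product estimate ends up roughly two orders of magnitude too low
```

Here the estimate lands around a handful of rows while the scan really
returns about a thousand, which is exactly the scenario above.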
Sometimes extended statistics on dependencies help well, but they are 
expensive for multiple columns. And sometimes I see that even a trivial 
histogram 
on a ROW(x1,x2,...) could predict a much more adequate value (kind of 
conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND 
..." if somewhere in extension we'd transform it to ROW(x1,x2,...) = 
ROW(N1, N2,...).
For such cases we don't have an in-core solution, and introducing a hook 
on clause list estimation (paired with maybe a hook on statistics 
generation) could help invent an extension that would deal with that 
problem. Also, it would open a way for experiments with different types 
of extended statistics ...

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: Multidimensional Histograms

From
Tomas Vondra
Date:

On 1/7/24 18:26, Andrei Lepikhov wrote:
> On 7/1/2024 17:51, Tomas Vondra wrote:
>> On 1/7/24 11:22, Andrei Lepikhov wrote:
>>> On 7/1/2024 06:54, Tomas Vondra wrote:
>>>> It's an interesting area for experiments, no doubt about it. And if you
>>>> choose to explore it, that's fine. But it's better to be aware it may
>>>> not end with a commit.
>>>> For the multi-dimensional case, I propose we first try to experiment
>>>> with the various algorithms, and figure out what works etc. Maybe
>>>> implementing them in python or something would be easier than C.
>>>
>>> Curiously, trying to utilize extended statistics for some problematic
>>> cases, I am experimenting with auto-generating such statistics by
>>> definition of indexes [1]. Doing that, I wanted to add some hand-made
>>> statistics like a multidimensional histogram or just a histogram which
>>> could help to perform estimation over a set of columns/expressions.
>>> I realized that current hooks get_relation_stats_hook and
>>> get_index_stats_hook are insufficient if I want to perform an estimation
>>> over a set of ANDed quals on different columns.
>>> In your opinion, is it possible to add a hook into the extended
>>> statistics to allow for an extension to propose alternative estimation?
>>>
>>> [1] https://github.com/danolivo/pg_index_stats
>>>
>>
>> No idea, I haven't thought about that very much. Presumably the existing
>> hooks are insufficient because they're per-attnum? I guess it would make
>> sense to have a hook for all the attnums of the relation, but I'm not
>> sure it'd be enough to introduce a new extended statistics kind ...
> 
> I got stuck on the same problem Alexander mentioned: we usually have
> large tables with many uniformly distributed values. In this case, MCV
> doesn't help a lot.
>
> Usually, I face problems scanning a table with a filter containing 3-6
> ANDed quals. Here, Postgres multiplies selectivities and ends up with a
> less than 1 tuple selectivity. But such scans, in reality, mostly have
> some physical sense and return a bunch of tuples. It looks like the set
> of columns representing some value of composite type.

Understood. That's a fairly common scenario, and it can easily end up
with rather terrible plan due to the underestimate.

> Sometimes extended statistics on dependencies help well, but they are
> expensive for multiple columns.

Expensive in what sense? Also, how would a multidimensional histogram be
any cheaper?

> And sometimes I see that even a trivial histogram
> on a ROW(x1,x2,...) could predict a much more adequate value (kind of
> conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND
> ..." if somewhere in extension we'd transform it to ROW(x1,x2,...) =
> ROW(N1, N2,...).

Are you guessing the histogram would help, or do you know it'd help? I
have no problem believing that for range queries, but I think it's far
less clear for simple equalities. I'm not sure a histograms would work
for that ...

Maybe it'd be possible to track more stuff for each bucket, not just the
frequency, but also ndistinct for combinations of columns, and then use
that to do better equality estimates. Or maybe we could see how the
"expected" and "actual" bucket frequency compare, and use that to
correct the equality estimate? Not sure.

But perhaps you have some data to demonstrate it'd actually help?

> For such cases we don't have an in-core solution, and introducing a hook
> on clause list estimation (paired with maybe a hook on statistics
> generation) could help invent an extension that would deal with that
> problem. Also, it would open a way for experiments with different types
> of extended statistics ...
> 

I really don't know how to do that, or what would it take. AFAICS the
statistics system is not very extensible from external code. Even if we
could add new types of attribute/extended stats, I'm not sure the code
calculating the estimates could use that.

That doesn't mean it's impossible or not worth exploring, but I don't
have any thoughts on this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multidimensional Histograms

From
Alexander Cheshev
Date:
Hi Tomas,

> The thing I was proposing is that it should be possible to build
> histograms with bins adapted to density in the given region. With
> smaller buckets in areas with high density. So that queries intersect
> with fewer buckets in low-density parts of the histogram.

This is how an Equi-Depth Histogram works. Postgres has smaller buckets in
areas with high density:

values[(i * (nvals - 1)) / (num_hist - 1)]
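A quick sketch (synthetic data) of how that indexing expression adapts
bucket widths to density:

```python
# Sketch of the boundary selection above: with the values sorted, equal
# steps through the array give narrow buckets where the data is dense and
# wide buckets where it is sparse.
import random

random.seed(1)
vals = sorted([random.gauss(0, 1) for _ in range(9000)] +      # dense cluster
              [random.uniform(0, 1000) for _ in range(1000)])  # sparse tail

nvals, num_hist = len(vals), 11                                # 10 buckets
bounds = [vals[(i * (nvals - 1)) // (num_hist - 1)] for i in range(num_hist)]
widths = [b - a for a, b in zip(bounds, bounds[1:])]

# every bucket holds ~10% of the values, but the widths differ wildly:
# tiny around the dense cluster, huge over the sparse tail
```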

> I don't recall the details of the MHIST-2 scheme, but it's true
> calculating "perfect" V-optimal histogram would be quadratic O(N^2*B).

In M-dimensional space "perfect" V-Optimal Histogram is an NP-hard
problem. In other words it is not possible to build it in polynomial
time. How did you come up with the estimate?!

> But that's exactly why greedy/approximate algorithms exist. Yes, it's
> not going to produce the optimal V-optimal histogram, but so what?

A greedy/approximate algorithm has time complexity O(M*N*B), where M
equals the number of dimensions. MHIST-2 is a greedy/approximate
algorithm.
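To make the shape of such an algorithm concrete, here is a simplified
greedy partitioner in the spirit of MHIST-2 (a hedged sketch of the idea,
not the paper's exact split criterion):

```python
# Repeatedly split the bucket with the largest squared error along some
# axis, at its median. Each split scans the bucket's tuples once per
# dimension: O(M*N) work per split, O(M*N*B) overall for B buckets.
import random

random.seed(7)
M = 2
pts = [tuple(random.gauss(0, 1) for _ in range(M)) for _ in range(2000)]

def worst_axis(bucket):
    """Return (sse, axis) for the axis with the largest squared error."""
    best = (-1.0, 0)
    for ax in range(M):
        xs = [p[ax] for p in bucket]
        mean = sum(xs) / len(xs)
        sse = sum((x - mean) ** 2 for x in xs)
        if sse > best[0]:
            best = (sse, ax)
    return best

B = 16
buckets = [pts]
while len(buckets) < B:
    scores = [worst_axis(b) for b in buckets]
    i = max(range(len(buckets)), key=lambda j: scores[j][0])
    _, ax = scores[i]
    bucket = sorted(buckets.pop(i), key=lambda p: p[ax])
    mid = len(bucket) // 2                     # split at the median
    buckets.extend([bucket[:mid], bucket[mid:]])
```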

> And how does this compare to the approximate/greedy algorithms, both in
> terms of construction time and accuracy?

Time complexity of Equi-Depth Histogram has no dependence on B.

> But all of this can apply to histograms in general, no? It's not somehow
> special to equi-depth histograms, a v-optimal histogram could be stored
> as an r-tree etc.

I agree.

Regards,
Alexander Cheshev

On Sun, 7 Jan 2024 at 00:54, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 1/6/24 01:00, Alexander Cheshev wrote:
> > Hi Tomas,
> >
> >> Another reason was that the algorithm described in the two papers you
> >> reference (1988 paper by DeWitt and the evaluation by Carlson and
> >> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> >> processes the columns one by one - first build bucket on column "a",
> >> then splits each bucket into buckets on "b". So it's not symmetrical,
> >> and it results in lower accuracy compared to an "ideal" histogram with
> >> the same number of buckets (especially for the dimensions split early).
> >
> > As stated in [1] Sum Square Error (SSE) is one of the most natural
> > error metrics. Equi-Depth Histogram is not ideal because it doesn't
> > minimize SSE. On the other hand V-Optimal Histogram indeed minimizes
> > SSE and from this point of view can be considered as an ideal
> > solution.
> >
> >> This does indeed produce an equi-depth histogram, which seems nice, but
> >> the mesh is regular in such a way that any value of the first dimension
> >> intersects all buckets on the second dimension. So for example with a
> >> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> >> 100 buckets on "b", each representing 1/10000 of tuples. But we don't
> > know how the tuples are distributed within the bucket - so we end up using
> > 0.5 of the bucket as an unbiased estimate, but that can easily add up in
> >> the wrong dimension.
> >
> > Suppose that we have a query box a=X and we know how data is
> > distributed in buckets. Let's consider only the buckets which are
> > intersected by the query box a=X. As we know how data is distributed
> > in buckets we can exclude the buckets which have no tuples which
> > intersect the query box a=X.
> >
> > As we store buckets with no information about data distribution, we
> > have to make reasonable assumptions. If we keep buckets relatively
> > small then we can assume that buckets have uniform distribution.
> >
> > What I am trying to say is that the problem which you pointed out
> > comes from the fact that we store buckets with no information about
> > data distribution. Even in one dimensional case we have to assume how
> > data is distributed in buckets. By the way Postgres assumes that data
> > has uniform distribution in buckets.
> >
>
> It's not just what Postgres assumes; the assumption of bucket uniformity
> is somewhat inherent to the whole concept of a histogram. Yes, maybe we
> could keep some "distribution" info about each bucket, but then maybe we
> could simply build a histogram with more bins occupying the same space?
>
> The thing I was proposing is that it should be possible to build
> histograms with bins adapted to density in the given region. With
> smaller buckets in areas with high density. So that queries intersect
> with fewer buckets in low-density parts of the histogram.
>
> >> However, this is not the only way to build an equi-depth histogram,
> >> there are ways to make it more symmetrical. Also, it's not clear
> >> equi-depth histograms are ideal with multiple columns. There are papers
> >> proposing various other types of histograms (using different criteria to
> >> build buckets optimizing a different metric). The most interesting one
> >> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> >> [4], [5] and [6]. I'm sure there are more. The drawback of course is
> >> that it's more expensive to build such histograms.
> >
> > Tomas, thank you for sharing with me your ideas regarding V-Optimal
> > Histogram. I read through the papers which you gave me and came up
> > with the following conclusion.
> >
> > The problem can be formulated like this. We have N tuples in
> > M-dimensional space. We need to partition space into buckets
> > iteratively until SSE is less than E or we reach the limit of buckets
> > B.
> >
>
> Yes. Although v-optimal histograms minimize variance of frequencies. Not
> sure if that's what you mean by SSE.
>
> > In the case of M-dimensional space it seems to me like an NP-hard
> > problem. A greedy heuristic MHIST-2 is proposed in [2]. Preliminary we
> > sort N tuples in ascending order. Then we iteratively select a bucket
> > which leads to the largest SSE reduction and split it into two parts.
> > We repeat the process until SSE is less than E or we reach the limit
> > of buckets B.
> >
>
> I don't recall all the details of the MHIST-2 algorithm, but this sounds
> about right. Yes, building the optimal histogram would be NP-hard, so
> we'd have to use some approximate / greedy algorithm.
>
> > If we assume that B is significantly less than N then the time
> > complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
> > equals 3, B equals 1000 and N equals 300*B then it will take slightly
> > over 0.9*10^9 iterations to build a V-Optimal Histogram.
> >
> > You can see that we have to keep B as low as possible in order to
> > build V-Optimal Histogram in feasible time. And here is a paradox.
> > From one side we use V-Optimal Histogram in order to minimize SSE but
> > from the other side we have to keep B as low as possible which
> > eventually leads to increase in SSE.
> >
>
> I don't recall the details of the MHIST-2 scheme, but it's true
> calculating "perfect" V-optimal histogram would be quadratic O(N^2*B).
>
> But that's exactly why greedy/approximate algorithms exist. Yes, it's
> not going to produce the optimal V-optimal histogram, but so what?
>
> > On the other hand time complexity required to build an Equi-Depth
> > Histogram doesn't depend on B and can be estimated as O(M*N*logN). SSE
> > can be arbitrarily reduced by increasing B which in turn is only
> > limited by the storage limit. Experimental results show low error
> > metric [3].
> >
>
> And how does this compare to the approximate/greedy algorithms, both in
> terms of construction time and accuracy?
>
> The [3] paper does not compare those, ofc, and I'm somewhat skeptical
> about the results in that paper. Not that they'd be wrong, but AFAICS
> the assumptions are quite simplistic and well-suited for that particular
> type of histogram.
>
> It's far from clear how it would perform for less "smooth" data,
> non-numeric data, etc.
>
> > In Equi-Depth Histogram a bucket is represented by two vectors. The
> > first vector points to the left bottom corner of the bucket and the
> > other one points to the right top corner of the bucket. Thus space
> > complexity of Equi-Depth Histogram can be estimated as
> > 2*integer_size*M*B. Assume that M equals 3, B equals 1000 and
> > integer_size equals 4 bytes then Equi-Depth Histogram will occupy 24000
> > bytes.
> >
> > If a bucket is partially intersected by a query box then we assume
> > that data has uniform distribution inside of the bucket. It is a
> > reasonable assumption if B is relatively large.
> >
> > In order to efficiently find buckets which intersect a query box we can
> > store Equi-Depth Histogram in R-tree as proposed in [3]. On average it
> > takes O(logB) iterations to find buckets which intersect a query box.
> > As storage requirements are dominated by leaf nodes we can assume that
> > it takes slightly more than 2*integer_size*M*B.
> >
>
> But all of this can apply to histograms in general, no? It's not somehow
> special to equi-depth histograms, a v-optimal histogram could be stored
> as an r-tree etc.
>
> >> IIRC the patch tried to do something like V-optimal histograms, but
> >> using a greedy algorithm and not the exhaustive stuff described in the
> >> various papers.
> >
> > We should only consider computationally tractable solutions. In one
> > dimensional case V-Optimal Histogram is probably a good solution but
> > in multi-dimensional case I would only consider Equi-Width or
> > Equi-Depth Histograms. As stated in multiple papers Equi-Depth
> > Histogram proves to be more accurate than Equi-Width Histogram. By the
> > way Postgres uses something like Equi-Width Histogram.
> >
>
> Sure, but AFAIK the greedy/approximate algorithms are not intractable.
>
> And at some point it becomes a tradeoff between accuracy of estimates
> and resources to build the histogram. Papers from ~1988 are great, but
> maybe not sufficient to make such decisions now.
>
> >> FWIW I did not intend to reject the idea of adding multi-dimensional
> >> histograms, but rather to provide some insight into the history of the
> >> past attempt, and also point some weaknesses of the algorithm described
> >> in the 1988 paper. If you choose to work on this, I can do a review etc.
> >
> > Thank you very much Tomas. I am new in the community and I definitely
> > didn't expect to have such a warm welcome.
> >
> > As I indicated above Equi-Depth Histogram proves to be more accurate
> > than Equi-Width Histogram and both have the same time and space
> > requirements. Postgres uses some sort of Equi-Width Histogram. Suppose
> > that:
> >  * I will create a patch which will replace Equi-Width Histogram with
> > Equi-Depth Histogram but only in 1-dimensional case.
> >  * I will show experimental results which will demonstrate improvement
> > of selectivity estimation.
> > Then will the patch be accepted by the community?
> >
> > If the above patch is accepted by the community then I will proceed
> > further with M-dimensional Equi-Depth Histogram...
> >
>
> I find it very unlikely we want (or even need) to significantly rework
> the 1-D histograms we already have. AFAICS the current histograms work
> pretty well, and if better accuracy is needed, it's easy/efficient to
> just increase the statistics target. I'm sure there are various
> estimation issues, but I'm not aware of any that'd be solved simply by
> using a different 1-D histogram.
>
> I'm not going to reject that outright - but I think the bar you'd need
> to justify such change is damn high. Pretty much everyone is using the
> current histograms, which makes it a very sensitive area. You'll need to
> show that it helps in practice, without causing harm ...
>
> It's an interesting area for experiments, no doubt about it. And if you
> choose to explore it, that's fine. But it's better to be aware it may
> not end with a commit.
>
>
> For the multi-dimensional case, I propose we first try to experiment
> with the various algorithms, and figure out what works etc. Maybe
> implementing them in python or something would be easier than C.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Multidimensional Histograms

From
Tomas Vondra
Date:

On 1/7/24 23:53, Alexander Cheshev wrote:
> Hi Tomas,
>
>> The thing I was proposing is that it should be possible to build
>> histograms with bins adapted to density in the given region. With
>> smaller buckets in areas with high density. So that queries intersect
>> with fewer buckets in low-density parts of the histogram.
>
> This is how an Equi-Depth Histogram works. Postgres has smaller buckets in
> areas with high density:
>
> values[(i * (nvals - 1)) / (num_hist - 1)]
>
True, but the boundaries are somewhat random. Also, I was referring to
the multi-dimensional case; it wasn't clear to me if the proposal is to
do the same thing.

>> I don't recall the details of the MHIST-2 scheme, but it's true
>> calculating "perfect" V-optimal histogram would be quadratic O(N^2*B).
>
> In M-dimensional space "perfect" V-Optimal Histogram is an NP-hard
> problem. In other words it is not possible to build it in polynomial
> time. How did you come up with the estimate?!
>
See section 3.2 in this paper (the "view PDF" does not work for me, but
the "source PDF" does download a postscript):

https://citeseerx.ist.psu.edu/doc_view/pid/35e29cbc2bfe6662653bdae1fb89c091e2ece560

>> But that's exactly why greedy/approximate algorithms exist. Yes, it's
>> not going to produce the optimal V-optimal histogram, but so what?
>
> Greedy/approximate algorithm has time complexity O(M*N*B), where M
> equals the number of dimensions. MHIST-2 is a greedy/approximate
> algorithm.
>
>> And how does this compare to the approximate/greedy algorithms, both in
>> terms of construction time and accuracy?
>
> Time complexity of Equi-Depth Histogram has no dependence on B.
>
Really? I'd expect that to build B buckets, the algorithm repeats some
O(M*N) action B times, roughly. I mean, it needs to pick a dimension by
which to split, then do some calculation on the N tuples, etc.

Maybe I'm imagining that wrong, though. It's been ages since I looked at
big-O complexity and/or the histograms. I'd have to play with those
algorithms for a bit again.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Multidimensional Histograms

From
Alexander Cheshev
Date:
Hi Tomas,

> See section 3.2 in this paper (the "view PDF" does not work for me, but
> the "source PDF" does download a postscript):

I believe that you are referring to a dynamic programming approach. It
is a 1-dimensional case! To find an optimal solution in the
M-dimensional case is an NP-hard problem.
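For reference, the 1-dimensional dynamic program behind the "perfect"
V-Optimal Histogram is easy to sketch (my own toy code, O(N^2*B) as
discussed above):

```python
# Partition a sorted list of frequencies into B buckets minimizing the
# total squared error, via the classic O(N^2 * B) dynamic program.
def v_optimal(freqs, B):
    n = len(freqs)
    ps = [0.0] * (n + 1)   # prefix sums
    ps2 = [0.0] * (n + 1)  # prefix sums of squares
    for i, f in enumerate(freqs):
        ps[i + 1] = ps[i] + f
        ps2[i + 1] = ps2[i] + f * f

    def sse(a, b):
        # squared error of bucket freqs[a:b] around its own mean
        s, s2, m = ps[b] - ps[a], ps2[b] - ps2[a], b - a
        return s2 - s * s / m

    INF = float("inf")
    err = [[INF] * (B + 1) for _ in range(n + 1)]
    err[0][0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, min(i, B) + 1):
            err[i][k] = min(err[j][k - 1] + sse(j, i) for j in range(k - 1, i))
    return err[n][B]

# e.g. v_optimal([1, 1, 1, 9, 9, 9], 2) == 0.0  (two perfectly flat buckets)
```

The DP works because in 1-D a bucket is a contiguous run; in M dimensions
the buckets can tile space in exponentially many ways, which is where the
hardness comes from.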

Regards,
Alexander Cheshev

On Mon, 8 Jan 2024 at 00:29, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 1/7/24 23:53, Alexander Cheshev wrote:
> > Hi Tomas,
> >
> >> The thing I was proposing is that it should be possible to build
> >> histograms with bins adapted to density in the given region. With
> >> smaller buckets in areas with high density. So that queries intersect
> >> with fewer buckets in low-density parts of the histogram.
> >
> > This is how an Equi-Depth Histogram works. Postgres has smaller buckets in
> > areas with high density:
> >
> > values[(i * (nvals - 1)) / (num_hist - 1)]
> >
> True, but the boundaries are somewhat random. Also, I was referring to
> the multi-dimensional case, it wasn't clear to me if the proposal is to
> do the same thing.
>
> >> I don't recall the details of the MHIST-2 scheme, but it's true
> >> calculating "perfect" V-optimal histogram would be quadratic O(N^2*B).
> >
> > In M-dimensional space "perfect" V-Optimal Histogram is an NP-hard
> > problem. In other words it is not possible to build it in polynomial
> > time. How did you come up with the estimate?!
> >
> See section 3.2 in this paper (the "view PDF" does not work for me, but
> the "source PDF" does download a postscript):
>
> https://citeseerx.ist.psu.edu/doc_view/pid/35e29cbc2bfe6662653bdae1fb89c091e2ece560
>
> >> But that's exactly why greedy/approximate algorithms exist. Yes, it's
> >> not going to produce the optimal V-optimal histogram, but so what?
> >
> > Greedy/approximate algorithm has time complexity O(M*N*B), where M
> > equals the number of dimensions. MHIST-2 is a greedy/approximate
> > algorithm.
> >
> >> And how does this compare to the approximate/greedy algorithms, both in
> >> terms of construction time and accuracy?
> >
> > Time complexity of Equi-Depth Histogram has no dependence on B.
> >
> Really? I'd expect that to build B buckets, the algorithm repeats some
> O(M*N) action B times, roughly. I mean, it needs to pick a dimension by
> which to split, then do some calculation on the N tuples, etc.
>
> Maybe I'm imagining that wrong, though. It's been ages since I looked at
> big-O complexity and/or the histograms. I'd have to play with those
> algorithms for a bit again.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: Multidimensional Histograms

From
Andrei Lepikhov
Date:
On 8/1/2024 01:36, Tomas Vondra wrote:
> On 1/7/24 18:26, Andrei Lepikhov wrote:
>> On 7/1/2024 17:51, Tomas Vondra wrote:
>>> On 1/7/24 11:22, Andrei Lepikhov wrote:
>>>> On 7/1/2024 06:54, Tomas Vondra wrote:
>>>>> It's an interesting area for experiments, no doubt about it. And if you
>>>>> choose to explore it, that's fine. But it's better to be aware it may
>>>>> not end with a commit.
>>>>> For the multi-dimensional case, I propose we first try to experiment
>>>>> with the various algorithms, and figure out what works etc. Maybe
>>>>> implementing them in python or something would be easier than C.
>>>>
>>>> Curiously, trying to utilize extended statistics for some problematic
>>>> cases, I am experimenting with auto-generating such statistics by
>>>> definition of indexes [1]. Doing that, I wanted to add some hand-made
>>>> statistics like a multidimensional histogram or just a histogram which
>>>> could help to perform estimation over a set of columns/expressions.
>>>> I realized that current hooks get_relation_stats_hook and
>>>> get_index_stats_hook are insufficient if I want to perform an estimation
>>>> over a set of ANDed quals on different columns.
>>>> In your opinion, is it possible to add a hook into the extended
>>>> statistics to allow for an extension to propose alternative estimation?
>>>>
>>>> [1] https://github.com/danolivo/pg_index_stats
>>>>
>>>
>>> No idea, I haven't thought about that very much. Presumably the existing
>>> hooks are insufficient because they're per-attnum? I guess it would make
>>> sense to have a hook for all the attnums of the relation, but I'm not
>>> sure it'd be enough to introduce a new extended statistics kind ...
>>
>> I got stuck on the same problem Alexander mentioned: we usually have
>> large tables with many uniformly distributed values. In this case, MCV
>> doesn't help a lot.
>>
>> Usually, I face problems scanning a table with a filter containing 3-6
>> ANDed quals. Here, Postgres multiplies selectivities and ends up with a
>> less than 1 tuple selectivity. But such scans, in reality, mostly have
>> some physical sense and return a bunch of tuples. It looks like the set
>> of columns representing some value of composite type.
> 
> Understood. That's a fairly common scenario, and it can easily end up
> with rather terrible plan due to the underestimate.
> 
>> Sometimes extended statistics on dependencies help well, but they are
>> expensive for multiple columns.
> 
> Expensive in what sense? Also, how would a multidimensional histogram be
> any cheaper?
Maybe my wording needed to be more precise. I haven't implemented 
multidimensional histograms before, so I don't know how expensive they 
are. I meant that for dependency statistics over about six columns, we 
have a lot of combinations to compute.
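To illustrate the combinatorial growth (my own back-of-the-envelope
counting, not what the dependencies code literally does):

```python
# Rough count of candidate functional dependencies over n columns: every
# nonempty proper subset as a determinant, times each remaining column as
# the dependent attribute. Illustrative counting only.
from math import comb

def candidate_dependencies(ncols):
    return sum(comb(ncols, k) * (ncols - k) for k in range(1, ncols))

for n in (2, 3, 4, 6):
    print(n, "columns ->", candidate_dependencies(n), "candidate dependencies")
# grows combinatorially: 2, 9, 28, 186
```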
> 
>> And sometimes I see that even a trivial histogram
>> on a ROW(x1,x2,...) could predict a much more adequate value (kind of
>> conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND
>> ..." if somewhere in extension we'd transform it to ROW(x1,x2,...) =
>> ROW(N1, N2,...).
> 
> Are you guessing the histogram would help, or do you know it'd help? I
> have no problem believing that for range queries, but I think it's far
> less clear for simple equalities. I'm not sure a histograms would work
> for that ...

I added Teodor Sigaev to the CC of this email - he has much more user 
feedback on this technique. As I see it, having a histogram over a set of 
columns gives us an upper-bound selectivity estimate for the filter. I'm 
not sure how good a simple histogram is in that case either, but in my 
practice it works, disallowing usage of too-optimistic query plans.

> Maybe it'd be possible to track more stuff for each bucket, not just the
> frequency, but also ndistinct for combinations of columns, and then use
> that to do better equality estimates. Or maybe we could see how the
> "expected" and "actual" bucket frequency compare, and use that to
> correct the equality estimate? Not sure.
Yes, it is on my mind, too. Having such experimental stuff as 
extensions on GitHub, we could get some community feedback.
> 
> But perhaps you have some data to demonstrate it'd actually help?
It should be redirected to Teodor, but I will consider translating messy 
real-life reports into a clear example.
> 
>> For such cases we don't have an in-core solution, and introducing a hook
>> on clause list estimation (paired with maybe a hook on statistics
>> generation) could help invent an extension that would deal with that
>> problem. Also, it would open a way for experiments with different types
>> of extended statistics ...
> I really don't know how to do that, or what would it take. AFAICS the
> statistics system is not very extensible from external code. Even if we
> could add new types of attribute/extended stats, I'm not sure the code
> calculating the estimates could use that.
I imagine we can add an analysis routine and directly store statistics 
in an extension for demonstration and experimental purposes, no problem.

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: Multidimensional Histograms

From
Alexander Cheshev
Date:
Hi Andrei,

> Maybe my wording needed to be more precise. I didn't implement
> multidimensional histograms before, so I don't know how expensive they
> are. I meant that for dependency statistics over about six columns, we
> have a lot of combinations to compute.

An Equi-Depth Histogram in a 6-dimensional case requires 6 times more
iterations. Postgres already uses an Equi-Depth Histogram. Even if you
increase the number of buckets from 100 to 1000, there will be no
overhead, as the time complexity of an Equi-Depth Histogram has no
dependence on the number of buckets. So, no overhead at all!

Regards,
Alexander Cheshev

On Mon, 8 Jan 2024 at 04:12, Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>
> On 8/1/2024 01:36, Tomas Vondra wrote:
> > On 1/7/24 18:26, Andrei Lepikhov wrote:
> >> On 7/1/2024 17:51, Tomas Vondra wrote:
> >>> On 1/7/24 11:22, Andrei Lepikhov wrote:
> >>>> On 7/1/2024 06:54, Tomas Vondra wrote:
> >>>>> It's an interesting area for experiments, no doubt about it. And if you
> >>>>> choose to explore it, that's fine. But it's better to be aware it may
> >>>>> not end with a commit.
> >>>>> For the multi-dimensional case, I propose we first try to experiment
> >>>>> with the various algorithms, and figure out what works etc. Maybe
> >>>>> implementing them in python or something would be easier than C.
> >>>>
> >>>> Curiously, trying to utilize extended statistics for some problematic
> >>>> cases, I am experimenting with auto-generating such statistics by
> >>>> definition of indexes [1]. Doing that, I wanted to add some hand-made
> >>>> statistics like a multidimensional histogram or just a histogram which
> >>>> could help to perform estimation over a set of columns/expressions.
> >>>> I realized that current hooks get_relation_stats_hook and
> >>>> get_index_stats_hook are insufficient if I want to perform an estimation
> >>>> over a set of ANDed quals on different columns.
> >>>> In your opinion, is it possible to add a hook into the extended
> >>>> statistics to allow for an extension to propose alternative estimation?
> >>>>
> >>>> [1] https://github.com/danolivo/pg_index_stats
> >>>>
> >>>
> >>> No idea, I haven't thought about that very much. Presumably the existing
> >>> hooks are insufficient because they're per-attnum? I guess it would make
> >>> sense to have a hook for all the attnums of the relation, but I'm not
> >>> sure it'd be enough to introduce a new extended statistics kind ...
> >>
> >> I got stuck on the same problem Alexander mentioned: we usually have
> >> large tables with many uniformly distributed values. In this case, MCV
> >> doesn't help a lot.
> >>
> >> Usually, I face problems scanning a table with a filter containing 3-6
> >> ANDed quals. Here, Postgres multiplies selectivities and ends up with a
> >> selectivity of less than one tuple. But such scans, in reality, mostly
> >> have some physical meaning and return a bunch of tuples. It looks as if
> >> the set of columns represents some value of a composite type.
> >
> > Understood. That's a fairly common scenario, and it can easily end up
> > with a rather terrible plan due to the underestimate.
> >
> >> Sometimes extended statistics on dependencies help well, but they are
> >> expensive to build for multiple columns.
> >
> > Expensive in what sense? Also, how would a multidimensional histogram be
> > any cheaper?
> Maybe my wording needed to be more precise. I didn't implement
> multidimensional histograms before, so I don't know how expensive they
> are. I meant that for dependency statistics over about six columns, we
> have a lot of combinations to compute.
> >
> >> And sometimes I see that even a trivial histogram
> >> on a ROW(x1,x2,...) could predict a much more adequate value (kind of
> >> conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND
> >> ..." if somewhere in extension we'd transform it to ROW(x1,x2,...) =
> >> ROW(N1, N2,...).
> >
> > Are you guessing the histogram would help, or do you know it'd help? I
> > have no problem believing that for range queries, but I think it's far
> > less clear for simple equalities. I'm not sure a histogram would work
> > for that ...
>
> I added Teodor Sigaev to the CC of this email - he has much more user
> feedback on this technique. As I see it, having a histogram over a set
> of columns gives us an upper bound on the selectivity estimate for the
> filter. I'm not sure how good a simple histogram is in that case either,
> but in my practice it works, preventing the use of overly optimistic
> query plans.
>
> > Maybe it'd be possible to track more stuff for each bucket, not just the
> > frequency, but also ndistinct for combinations of columns, and then use
> > that to do better equality estimates. Or maybe we could see how the
> > "expected" and "actual" bucket frequency compare, and use that to
> > correct the equality estimate? Not sure.
> Yes, it is in my mind, too. Having such experimental stuff as
> extensions on GitHub, we could get some community feedback.
> >
> > But perhaps you have some data to demonstrate it'd actually help?
> That question is better redirected to Teodor, but I will consider
> distilling messy real-life reports into a clear example.
> >
> >> For such cases we don't have an in-core solution, and introducing a hook
> >> on clause list estimation (paired with maybe a hook on statistics
> >> generation) could help invent an extension that would deal with that
> >> problem. Also, it would open a way for experiments with different types
> >> of extended statistics ...
> > I really don't know how to do that, or what it would take. AFAICS the
> > statistics system is not very extensible from external code. Even if we
> > could add new types of attribute/extended stats, I'm not sure the code
> > calculating the estimates could use that.
> I imagine we can add an analysis routine and directly store statistics
> in an extension for demonstration and experimental purposes, no problem.
>
> --
> regards,
> Andrei Lepikhov
> Postgres Professional
>



Re: Multidimensional Histograms

From
Andrei Lepikhov
Date:
On 8/1/2024 16:21, Alexander Cheshev wrote:
> Hi Andrei,
> 
>> Maybe my wording needed to be more precise. I didn't implement
>> multidimensional histograms before, so I don't know how expensive they
>> are. I meant that for dependency statistics over about six columns, we
>> have a lot of combinations to compute.
> 
> An equi-depth histogram in the 6-dimensional case requires 6 times more
> iterations. Postgres already uses equi-depth histograms. Even if you
> increase the number of buckets from 100 to 1000, there will be no
> overhead, as the time complexity of building an equi-depth histogram
> does not depend on the number of buckets. So, no overhead at all!

Maybe. For three columns, we have 9 combinations (passes) to build 
dependency statistics and 4 combinations for ndistinct; for six 
columns, we have 186 and 57 combinations respectively.
Even though a dependency is just one number per combination, building 
the dependency statistics is still a massive amount of work. So, in the 
multicolumn case, having something like a histogram may be more effective.
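One way to arrive at the numbers above (a quick sanity check, not
Postgres code): a functional dependency pairs one dependent column with
each non-empty subset of the remaining columns, and an ndistinct
coefficient exists for every subset of two or more columns:

```python
def n_dependencies(n):
    # each of the n columns can depend on any non-empty
    # subset of the other n - 1 columns
    return n * (2 ** (n - 1) - 1)

def n_ndistinct(n):
    # one coefficient per subset of two or more columns
    return 2 ** n - n - 1

for n in (3, 6):
    print(n, n_dependencies(n), n_ndistinct(n))
# 3 -> 9 dependencies, 4 ndistinct coefficients
# 6 -> 186 dependencies, 57 ndistinct coefficients
```

The counts grow exponentially with the number of columns, which is what
makes a single multidimensional histogram attractive in comparison.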

-- 
regards,
Andrei Lepikhov
Postgres Professional