Re: Multidimensional Histograms - Mailing list pgsql-hackers

From Alexander Cheshev
Subject Re: Multidimensional Histograms
Date
Msg-id CAN_hQmvOzA0CRjjDzyoOh_LZwHeQQNnZTK=whF4sKh7DRWnQjQ@mail.gmail.com
Whole thread Raw
In response to Re: Multidimensional Histograms  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Multidimensional Histograms
Re: Multidimensional Histograms
List pgsql-hackers
Hi Tomas,

> Another reason was that the algorithm described in the two papers you
> reference (1988 paper by DeWitt and the evaluation by Carlson and
> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> processes the columns one by one - first build bucket on column "a",
> then splits each bucket into buckets on "b". So it's not symmetrical,
> and it results in lower accuracy compared to an "ideal" histogram with
> the same number of buckets (especially for the dimensions split early).

As stated in [1] Sum Square Error (SSE) is one of the most natural
error metrics. Equi-Depth Histogram is not ideal because it doesn't
minimize SSE. On the other hand V-Optimal Histogram indeed minimizes
SSE and from this point of view can be considered as an ideal
solution.

> This does indeed produce an equi-depth histogram, which seems nice, but
> the mesh is regular in such a way that any value of the first dimension
> intersects all buckets on the second dimension. So for example with a
> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> 100 buckets on "b", each representing 1/10000 of tuples. But we don't
> know how the tuples are distributed in the tuple - so we end up using
> 0.5 of the bucket as unbiased estimate, but that can easily add-up in
> the wrong dimension.

Suppose that we have a query box a=X and we know how data is
distributed in buckets. Lets consider only the buckets which are
intersected by the query box a=X. As we know how data is distributes
in buckets we can exclude the buckets which have no tuples which
intersect the query box a=X.

As we store buckets with no information about data distribution we
have to make reasonable assumptions. If we keep buckets relativly
small then we can assume that buckets have uniform distribution.

What I am trying to say is that the problem which you pointed out
comes from the fact that we store buckets with no information about
data distribution. Even in one dimensional case we have to assume how
data is distributed in buckets. By the way Postgres assumes that data
has uniform distribution in buckets.

> However, this is not the only way to build an equi-depth histogram,
> there are ways to make it more symmetrical. Also, it's not clear
> equi-depth histograms are ideal with multiple columns. There are papers
> proposing various other types of histograms (using different criteria to
> build buckets optimizing a different metric). The most interesting one
> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> [4], [5] and [6]. I'm sure there are more. The drawback of course is
> that it's more expensive to build such histograms.

Tomas thank you for shearing with me your ideas regarding V-Optimal
Histogram. I read through the papers which you gave me and came up
with the following conclusion.

The problem can be formulated like this. We have N tuples in
M-dimensional space. We need to partition space into buckets
iteratively until SSE is less than E or we reach the limit of buckets
B.

In the case of M-dimensional space it seems to me like an NP-hard
problem. A greedy heuristic MHIST-2 is proposed in [2]. Preliminary we
sort N tuples in ascending order. Then we iteratively select a bucket
which leads to the largest SSE reduction and split it into two parts.
We repeat the process until SSE is less than E or we reach the limit
of buckets B.

If we assume that B is significantly less than N then the time
complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
equals 3, B equals 1000 and N equals 300*B then it will take slightly
over 0.9*10^9 iterations to build a V-Optimal Histogram.

You can see that we have to keep B as low as possible in order to
build V-Optimal Histogram in feasible time. And here is a paradox.
From one side we use V-Optimal Histogram in order to minimize SSE but
from the other side we have to keep B as low as possible which
eventually leads to increase in SSE.

On the other hand time complexity required to build an Equi-Depth
Histogram doesn't depend on B and can be estimated as O(M*N*logN). SSE
can be arbitrarily reduced by increasing B which in turn is only
limited by the storage limit. Experimental results show low error
metric [3].

In Equi-Depth Histogram a bucket is represented by two vectors. The
first vector points to the left bottom corner of the bucket and the
other one point to the right top corner of the bucket. Thus space
complexity of Equi-Depth Histogram can be estimated as
2*integer_size*M*B. Assume that M equals 3, B equals 1000 and
integer_size equals 4 bytes then Equi-Depth Histogram will ocupy 24000
bytes.

If a bucket is partially intersected by a query box then we assume
that data has uniform distribution inside of the bucket. It is a
reasonable assumption if B is relativly large.

In order to efficianly find buckets which intersect a query box we can
store Equi-Depth Histogram in R-tree as proposed in [3]. On average it
takes O(logB) iterations to find buckets which intersect a query box.
As storage requirements are dominated by leaf nodes we can assume that
it takes slightly more than 2*integer_size*M*B.

> IIRC the patch tried to do something like V-optimal histograms, but
> using a greedy algorithm and not the exhaustive stuff described in the
> various papers.

We should only consider computationally tackable solutions. In one
dimensional case V-Optimal Histogram is probably a good solution but
in multi-dimensional case I would only consider Equi-Width or
Equi-Depth Histograms. As stated in multiple papers Equi-Depth
Histogram proves to be more accurate than Equi-Width Histogram. By the
way Postgres uses something like Equi-Width Histogram.

> FWIW I did not intend to reject the idea of adding multi-dimensional
> histograms, but rather to provide some insight into the history of the
> past attempt, and also point some weaknesses of the algorithm described
> in the 1988 paper. If you choose to work on this, I can do a review etc.

Thank you very much Tomas. I am new in the community and I definitely
didn't expect to have such a warm welcome.

As I indicated above Equi-Depth Histogram proves to be more accurate
than Equi-Width Histogram and both have the same time and space
requirements. Postgres uses some sort of Equi-Width Histogram. Suppose
that:
 * I will create a patch which will replace Equi-Width Histogram with
Equi-Depth Histogram but only in 1-dimensional case.
 * I will show experimental results which will demonstrate improvement
of selectivity estimation.
Then will the path be accepted by the community?

If the above path is accepted by the community then I will proceed
further with M-dimensional Equi-Depth Histogram...


[1] https://www.vldb.org/conf/1998/p275.pdf
[2] https://www.vldb.org/conf/1997/P486.PDF
[3] https://dl.acm.org/doi/pdf/10.1145/50202.50205

Regards,
Alexander Cheshev

On Thu, 28 Dec 2023 at 15:25, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 12/27/23 22:19, Tomas Vondra wrote:
> > Hello Alexander,
> >
> > We actually had histograms in the early patches adding multivariate
> > statistics [1]. We however ended up removing histograms and only kept
> > the simpler types, for a couple reasons.
> >
> > It might be worth going through the discussion - I'm sure one of the
> > reasons was we needed to limit the scope of the patch, and histograms
> > were much more complex and possibly less useful compared to the other
> > statistics types.
> >
> > ...
>
> FWIW I did not intend to reject the idea of adding multi-dimensional
> histograms, but rather to provide some insight into the history of the
> past attempt, and also point some weaknesses of the algorithm described
> in the 1988 paper. If you choose to work on this, I can do a review etc.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Melanie Plageman
Date:
Subject: Re: Emit fewer vacuum records by reaping removable tuples during pruning
Next
From: Robert Haas
Date:
Subject: Re: Fix bogus Asserts in calc_non_nestloop_required_outer