Home > mailing lists

Re: Multidimensional Histograms - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Multidimensional Histograms
Date	December 28, 2023 00:19:57
Msg-id	6e92450c-8136-11d4-8f6e-501c693af5c8@enterprisedb.com Whole thread Raw
In response to	Multidimensional Histograms (Alexander Cheshev <alex.cheshev@gmail.com>)
Responses	Re: Multidimensional Histograms
List	pgsql-hackers

Tree view

Hello Alexander,

We actually had histograms in the early patches adding multivariate
statistics [1]. We however ended up removing histograms and only kept
the simpler types, for a couple reasons.

It might be worth going through the discussion - I'm sure one of the
reasons was we needed to limit the scope of the patch, and histograms
were much more complex and possibly less useful compared to the other
statistics types.

Another reason was that the algorithm described in the two papers you
reference (1988 paper by DeWitt and the evaluation by Carlson and
Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
processes the columns one by one - first build bucket on column "a",
then splits each bucket into buckets on "b". So it's not symmetrical,
and it results in lower accuracy compared to an "ideal" histogram with
the same number of buckets (especially for the dimensions split early).

This does indeed produce an equi-depth histogram, which seems nice, but
the mesh is regular in such a way that any value of the first dimension
intersects all buckets on the second dimension. So for example with a
histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
100 buckets on "b", each representing 1/10000 of tuples. But we don't
know how the tuples are distributed in the tuple - so we end up using
0.5 of the bucket as unbiased estimate, but that can easily add-up in
the wrong dimension.

However, this is not the only way to build an equi-depth histogram,
there are ways to make it more symmetrical. Also, it's not clear
equi-depth histograms are ideal with multiple columns. There are papers
proposing various other types of histograms (using different criteria to
build buckets optimizing a different metric). The most interesting one
seems to be V-Optimal histograms - see for example papers [1], [2], [3],
[4], [5] and [6]. I'm sure there are more. The drawback of course is
that it's more expensive to build such histograms.

IIRC the patch tried to do something like V-optimal histograms, but
using a greedy algorithm and not the exhaustive stuff described in the
various papers.

[1] https://www.vldb.org/conf/1998/p275.pdf
[2]
https://cs-people.bu.edu/mathan/reading-groups/papers-classics/histograms.pdf
[3] https://dl.acm.org/doi/pdf/10.1145/304182.304200
[4] http://www.cs.columbia.edu/~gravano/Papers/2001/sigmod01b.pdf
[5] https://cs.gmu.edu/~carlotta/publications/vldb090.pdf
[6] https://core.ac.uk/download/pdf/82158702.pdf


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-hackers by date:

From: Dave Cramer
Date: 28 December 2023, 00:11:17
Subject: Re: Password leakage avoidance

From: Joe Conway
Date: 28 December 2023, 00:26:22
Subject: Re: Password leakage avoidance

Re: Multidimensional Histograms - Mailing list pgsql-hackers

Previous

Next