Re: Multidimensional Histograms - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Multidimensional Histograms |
Date | |
Msg-id | 6e92450c-8136-11d4-8f6e-501c693af5c8@enterprisedb.com Whole thread Raw |
In response to | Multidimensional Histograms (Alexander Cheshev <alex.cheshev@gmail.com>) |
Responses |
Re: Multidimensional Histograms
|
List | pgsql-hackers |
Hello Alexander, We actually had histograms in the early patches adding multivariate statistics [1]. We however ended up removing histograms and only kept the simpler types, for a couple reasons. It might be worth going through the discussion - I'm sure one of the reasons was we needed to limit the scope of the patch, and histograms were much more complex and possibly less useful compared to the other statistics types. Another reason was that the algorithm described in the two papers you reference (1988 paper by DeWitt and the evaluation by Carlson and Sutherland from ~2010) is simple but has pretty obvious weaknesses. It processes the columns one by one - first build bucket on column "a", then splits each bucket into buckets on "b". So it's not symmetrical, and it results in lower accuracy compared to an "ideal" histogram with the same number of buckets (especially for the dimensions split early). This does indeed produce an equi-depth histogram, which seems nice, but the mesh is regular in such a way that any value of the first dimension intersects all buckets on the second dimension. So for example with a histogram of 100x100 buckets on [a,b], any value "a=X" intersects with 100 buckets on "b", each representing 1/10000 of tuples. But we don't know how the tuples are distributed in the tuple - so we end up using 0.5 of the bucket as unbiased estimate, but that can easily add-up in the wrong dimension. However, this is not the only way to build an equi-depth histogram, there are ways to make it more symmetrical. Also, it's not clear equi-depth histograms are ideal with multiple columns. There are papers proposing various other types of histograms (using different criteria to build buckets optimizing a different metric). The most interesting one seems to be V-Optimal histograms - see for example papers [1], [2], [3], [4], [5] and [6]. I'm sure there are more. The drawback of course is that it's more expensive to build such histograms. IIRC the patch tried to do something like V-optimal histograms, but using a greedy algorithm and not the exhaustive stuff described in the various papers. [1] https://www.vldb.org/conf/1998/p275.pdf [2] https://cs-people.bu.edu/mathan/reading-groups/papers-classics/histograms.pdf [3] https://dl.acm.org/doi/pdf/10.1145/304182.304200 [4] http://www.cs.columbia.edu/~gravano/Papers/2001/sigmod01b.pdf [5] https://cs.gmu.edu/~carlotta/publications/vldb090.pdf [6] https://core.ac.uk/download/pdf/82158702.pdf regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: