Re: Multidimensional Histograms - Mailing list pgsql-hackers

From: Alexander Cheshev
Subject: Re: Multidimensional Histograms
Msg-id: CAN_hQms_KKKGqyhUopk6t=cDgti5p3UWx9ruKXHsTERoXRs2-w@mail.gmail.com
In response to: Re: Multidimensional Histograms (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
Responses: Re: Multidimensional Histograms
List: pgsql-hackers
Hi Andrei,

> Maybe my wording needed to be more precise. I didn't implement
> multidimensional histograms before, so I don't know how expensive they
> are. I meant that for dependency statistics over about six columns, we
> have a lot of combinations to compute.

An Equi-Depth Histogram in a 6-dimensional case requires 6 times more
iterations. Postgres already uses an Equi-Depth Histogram. Even if you
increase the number of buckets from 100 to 1000, there will be no
overhead, as the time complexity of building an Equi-Depth Histogram
does not depend on the number of buckets. So, no overhead at all!

Regards,
Alexander Cheshev

On Mon, 8 Jan 2024 at 04:12, Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>
> On 8/1/2024 01:36, Tomas Vondra wrote:
> > On 1/7/24 18:26, Andrei Lepikhov wrote:
> >> On 7/1/2024 17:51, Tomas Vondra wrote:
> >>> On 1/7/24 11:22, Andrei Lepikhov wrote:
> >>>> On 7/1/2024 06:54, Tomas Vondra wrote:
> >>>>> It's an interesting are for experiments, no doubt about it. And if you
> >>>>> choose to explore it, that's fine. But it's better to be aware it may
> >>>>> not end with a commit.
> >>>>> For the multi-dimensional case, I propose we first try to experiment
> >>>>> with the various algorithms, and figure out what works etc. Maybe
> >>>>> implementing them in python or something would be easier than C.
> >>>>
> >>>> Curiously, trying to utilize extended statistics for some problematic
> >>>> cases, I am experimenting with auto-generating such statistics by
> >>>> definition of indexes [1]. Doing that, I wanted to add some hand-made
> >>>> statistics like a multidimensional histogram or just a histogram which
> >>>> could help to perform estimation over a set of columns/expressions.
> >>>> I realized that current hooks get_relation_stats_hook and
> >>>> get_index_stats_hook are insufficient if I want to perform an estimation
> >>>> over a set of ANDed quals on different columns.
> >>>> In your opinion, is it possible to add a hook into the extended
> >>>> statistics to allow for an extension to propose alternative estimation?
> >>>>
> >>>> [1] https://github.com/danolivo/pg_index_stats
> >>>>
> >>>
> >>> No idea, I haven't thought about that very much. Presumably the existing
> >>> hooks are insufficient because they're per-attnum? I guess it would make
> >>> sense to have a hook for all the attnums of the relation, but I'm not
> >>> sure it'd be enough to introduce a new extended statistics kind ...
> >>
> >> I got stuck on the same problem Alexander mentioned: we usually have
> >> large tables with many uniformly distributed values. In this case, MCV
> >> doesn't help a lot.
> >>
> >> Usually, I face problems scanning a table with a filter containing 3-6
> >> ANDed quals. Here, Postgres multiplies selectivities and ends up with a
> >> less than 1 tuple selectivity. But such scans, in reality, mostly have
> >> some physical sense and return a bunch of tuples. It looks like the set
> >> of columns representing some value of composite type.
> >
> > Understood. That's a fairly common scenario, and it can easily end up
> > with rather terrible plan due to the underestimate.
> >
> >> Sometimes extended statistics on dependency helps well, but it expensive
> >> for multiple columns.
> >
> > Expensive in what sense? Also, how would a multidimensional histogram be
> > any cheaper?
> Maybe my wording needed to be more precise. I didn't implement
> multidimensional histograms before, so I don't know how expensive they
> are. I meant that for dependency statistics over about six columns, we
> have a lot of combinations to compute.
>
> >> And sometimes I see that even a trivial histogram
> >> on a ROW(x1,x2,...) could predict a much more adequate value (kind of
> >> conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND
> >> ..." if somewhere in extension we'd transform it to ROW(x1,x2,...) =
> >> ROW(N1, N2,...).
> >
> > Are you guessing the histogram would help, or do you know it'd help? I
> > have no problem believing that for range queries, but I think it's far
> > less clear for simple equalities. I'm not sure a histograms would work
> > for that ...
>
> I added Teodor Sigaev to the CC of this email - He has much more user
> feedback on this technique. As I see, having a histogram over a set of
> columns, we have top selectivity estimation for the filter. I'm not sure
> how good a simple histogram is in that case, too, but according to my
> practice, it works, disallowing usage of too-optimistic query plans.
>
> > Maybe it'd be possible to track more stuff for each bucket, not just the
> > frequency, but also ndistinct for combinations of columns, and then use
> > that to do better equality estimates. Or maybe we could see how the
> > "expected" and "actual" bucket frequency compare, and use that to
> > correct the equality estimate? Not sure.
> Yes, it is in my mind, too. Having such experimental stuff as an
> extension(s) in GitHub, we could get some community feedback.
>
> > But perhaps you have some data to demonstrate it'd actually help?
> It should be redirected to Teodor, but I will consider translating messy
> real-life reports into a clear example.
>
> >> For such cases we don't have an in-core solution, and introducing a hook
> >> on clause list estimation (paired with maybe a hook on statistics
> >> generation) could help invent an extension that would deal with that
> >> problem. Also, it would open a way for experiments with different types
> >> of extended statistics ...
> > I really don't know how to do that, or what would it take. AFAICS the
> > statistics system is not very extensible from external code. Even if we
> > could add new types of attribute/extended stats, I'm not sure the code
> > calculating the estimates could use that.
> I imagine we can add an analysis routine and directly store statistics
> in an extension for demonstration and experimental purposes, no problem.
>
> --
> regards,
> Andrei Lepikhov
> Postgres Professional
>
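The bucket-count argument in Alexander's reply can be illustrated with a minimal 1-D sketch (the function name and shape are illustrative only, not code from the Postgres source): construction cost is dominated by sorting the sample, and picking the k bucket boundaries afterwards is O(k), so going from 100 to 1000 buckets adds no meaningful work.

```python
def equi_depth_histogram(values, num_buckets):
    """Build a 1-D equi-depth histogram: boundaries at sample quantiles.

    Cost is dominated by the O(n log n) sort of the sample; selecting
    the num_buckets + 1 boundaries afterwards is a cheap O(k) pass, so
    the bucket count has almost no effect on build time.
    """
    vals = sorted(values)
    n = len(vals)
    # boundary i sits at (roughly) the i * n / num_buckets-th order statistic
    return [vals[min(i * n // num_buckets, n - 1)]
            for i in range(num_buckets + 1)]
```

For a uniform sample of 1000 values and 10 buckets this yields boundaries at every 100th value, with each bucket holding the same number of rows by construction.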
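The "conservative upper estimation" via ROW(x1,x2,...) = ROW(N1,N2,...) discussed in the quoted thread can also be sketched as a toy (everything here is hypothetical illustration, not an existing Postgres mechanism): an equi-depth histogram built on the composite key bounds the equality selectivity by the depth of the bucket(s) containing the target tuple, whereas multiplying per-column selectivities can badly underestimate correlated columns.

```python
import bisect

def equality_upper_bound(rows, num_buckets, target):
    """Upper-bound the selectivity of "x1=N1 AND x2=N2 AND ..." using an
    equi-depth histogram over the composite key ROW(x1, x2, ...): the
    target tuple cannot occur more often than the combined depth of the
    bucket(s) it falls into."""
    keys = sorted(rows)                    # composite keys, lexicographic order
    n = len(keys)
    lo = bisect.bisect_left(keys, target)  # first position of target
    hi = bisect.bisect_right(keys, target) # one past the last position
    if lo == hi:                           # target absent: charge one bucket
        hi = lo + 1
    first = lo * num_buckets // n          # bucket of the first occurrence
    last = min((hi - 1) * num_buckets // n, num_buckets - 1)
    return (last - first + 1) / num_buckets  # summed bucket fractions
```

With perfectly correlated columns, e.g. 1000 rows of (i, i) and the clause x1=500 AND x2=500, the independence assumption gives 1/1000 * 1/1000 = 1e-6 (a 1000x underestimate of the true 1/1000), while a 100-bucket composite histogram returns 1/100: conservative, but safely above the truth.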