Re: Multidimensional Histograms - Mailing list pgsql-hackers
From: Andrei Lepikhov
Subject: Re: Multidimensional Histograms
Msg-id: ce8d75eb-498c-4731-b025-9c7b7439a969@postgrespro.ru
In response to: Re: Multidimensional Histograms (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On 8/1/2024 01:36, Tomas Vondra wrote:
> On 1/7/24 18:26, Andrei Lepikhov wrote:
>> On 7/1/2024 17:51, Tomas Vondra wrote:
>>> On 1/7/24 11:22, Andrei Lepikhov wrote:
>>>> On 7/1/2024 06:54, Tomas Vondra wrote:
>>>>> It's an interesting area for experiments, no doubt about it. And if
>>>>> you choose to explore it, that's fine. But it's better to be aware
>>>>> it may not end with a commit.
>>>>> For the multi-dimensional case, I propose we first try to
>>>>> experiment with the various algorithms and figure out what works,
>>>>> etc. Maybe implementing them in Python or something would be easier
>>>>> than C.
>>>>
>>>> Curiously, while trying to utilize extended statistics for some
>>>> problematic cases, I am experimenting with auto-generating such
>>>> statistics from index definitions [1]. Doing that, I wanted to add
>>>> some hand-made statistics, like a multidimensional histogram or just
>>>> a histogram that could help to perform estimation over a set of
>>>> columns/expressions.
>>>> I realized that the current hooks get_relation_stats_hook and
>>>> get_index_stats_hook are insufficient if I want to perform an
>>>> estimation over a set of ANDed quals on different columns.
>>>> In your opinion, is it possible to add a hook into the extended
>>>> statistics to allow an extension to propose an alternative
>>>> estimation?
>>>>
>>>> [1] https://github.com/danolivo/pg_index_stats
>>>
>>> No idea, I haven't thought about that very much. Presumably the
>>> existing hooks are insufficient because they're per-attnum? I guess
>>> it would make sense to have a hook for all the attnums of the
>>> relation, but I'm not sure it'd be enough to introduce a new extended
>>> statistics kind ...
>>
>> I got stuck on the same problem Alexander mentioned: we usually have
>> large tables with many uniformly distributed values. In this case, MCV
>> doesn't help a lot.
>>
>> Usually, I face problems scanning a table with a filter containing
>> 3-6 ANDed quals.
>> Here, Postgres multiplies selectivities and ends up with a
>> less-than-one-tuple selectivity. But such scans, in reality, mostly
>> have some physical sense and return a bunch of tuples. It looks like
>> the set of columns represents some value of a composite type.
>
> Understood. That's a fairly common scenario, and it can easily end up
> with a rather terrible plan due to the underestimate.
>
>> Sometimes extended statistics on dependencies help well, but they are
>> expensive for multiple columns.
>
> Expensive in what sense? Also, how would a multidimensional histogram
> be any cheaper?

Maybe my wording needed to be more precise. I haven't implemented
multidimensional histograms before, so I don't know how expensive they
are. I meant that for dependency statistics over about six columns, we
have a lot of combinations to compute.

>> And sometimes I see that even a trivial histogram on a
>> ROW(x1,x2,...) could predict a much more adequate value (a kind of
>> conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND
>> ..." if somewhere in an extension we'd transform it to
>> ROW(x1,x2,...) = ROW(N1,N2,...).
>
> Are you guessing the histogram would help, or do you know it'd help? I
> have no problem believing that for range queries, but I think it's far
> less clear for simple equalities. I'm not sure a histogram would work
> for that ...

I added Teodor Sigaev to the CC of this email; he has much more user
feedback on this technique. As I see it, having a histogram over a set
of columns gives us an upper selectivity estimation for the filter. I'm
not sure how good a simple histogram is in that case either, but in my
practice it works, disallowing the usage of too-optimistic query plans.

> Maybe it'd be possible to track more stuff for each bucket, not just
> the frequency, but also ndistinct for combinations of columns, and
> then use that to do better equality estimates.
> Or maybe we could see how the "expected" and "actual" bucket
> frequencies compare, and use that to correct the equality estimate?
> Not sure.

Yes, that is on my mind, too. Having such experimental stuff as
extension(s) on GitHub, we could get some community feedback.

> But perhaps you have some data to demonstrate it'd actually help?

That should be redirected to Teodor, but I will consider translating
messy real-life reports into a clear example.

>> For such cases we don't have an in-core solution, and introducing a
>> hook on clause-list estimation (paired with maybe a hook on
>> statistics generation) could help invent an extension that would deal
>> with that problem. Also, it would open a way for experiments with
>> different types of extended statistics ...
>
> I really don't know how to do that, or what it would take. AFAICS the
> statistics system is not very extensible from external code. Even if
> we could add new types of attribute/extended stats, I'm not sure the
> code calculating the estimates could use that.

I imagine we can add an analysis routine and directly store statistics
in an extension for demonstration and experimental purposes, no problem.

--
regards, Andrei Lepikhov
Postgres Professional
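[Editor's illustration] Following the thread's own suggestion that such experiments are easier in Python than in C, here is a toy sketch of the underestimation being discussed: on correlated columns, multiplying per-column selectivities under the independence assumption predicts almost no rows, while even a crude histogram over the column combination recovers the real row count. All names and data here are illustrative, not PostgreSQL code; exact per-combination counts stand in for the limiting case of a very fine multidimensional histogram.

```python
import random
from collections import Counter

random.seed(42)
N = 100_000

# Correlated columns: x2 and x3 are functions of x1, so three ANDed
# equality quals on them are far from independent.
rows = []
for _ in range(N):
    x1 = random.randrange(100)
    x2 = x1 % 10
    x3 = x1 // 10
    rows.append((x1, x2, x3))

# Per-column selectivity, as a planner would estimate it from
# single-column statistics.
def col_selectivity(col, value):
    return sum(1 for r in rows if r[col] == value) / N

# Independence assumption: multiply the three selectivities.
# sel(x1=17) ~ 0.01, sel(x2=7) ~ 0.1, sel(x3=1) ~ 0.1, so the
# estimate is ~10 rows.
s = col_selectivity(0, 17) * col_selectivity(1, 7) * col_selectivity(2, 1)
independent_estimate = s * N

# Crude "multidimensional histogram": bucket counts over the
# (x1, x2, x3) combination. A real histogram would use far fewer,
# coarser buckets and give a conservative upper estimate instead.
hist = Counter(rows)
histogram_estimate = hist[(17, 7, 1)]

# Ground truth: every row with x1=17 also has x2=7 and x3=1,
# so roughly N/100 ~ 1000 rows match all three quals.
actual = sum(1 for r in rows if r == (17, 7, 1))
print(independent_estimate, histogram_estimate, actual)
```

The two-orders-of-magnitude gap between `independent_estimate` and `actual` is exactly the "less-than-one-tuple selectivity" pattern described above, scaled down to a toy table.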