Re: Multidimensional Histograms - Mailing list pgsql-hackers

From: Andrei Lepikhov
Subject: Re: Multidimensional Histograms
Date:
Msg-id: ce8d75eb-498c-4731-b025-9c7b7439a969@postgrespro.ru
In response to: Re: Multidimensional Histograms (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On 8/1/2024 01:36, Tomas Vondra wrote:
> On 1/7/24 18:26, Andrei Lepikhov wrote:
>> On 7/1/2024 17:51, Tomas Vondra wrote:
>>> On 1/7/24 11:22, Andrei Lepikhov wrote:
>>>> On 7/1/2024 06:54, Tomas Vondra wrote:
>>>>> It's an interesting area for experiments, no doubt about it. And if you
>>>>> choose to explore it, that's fine. But it's better to be aware it may
>>>>> not end with a commit.
>>>>> For the multi-dimensional case, I propose we first try to experiment
>>>>> with the various algorithms, and figure out what works etc. Maybe
>>>>> implementing them in python or something would be easier than C.
>>>>
>>>> Curiously, while trying to utilize extended statistics for some
>>>> problematic cases, I am experimenting with auto-generating such
>>>> statistics based on index definitions [1]. Doing that, I wanted to add
>>>> some hand-made statistics, like a multidimensional histogram or just a
>>>> histogram that could help to perform estimation over a set of
>>>> columns/expressions.
>>>> I realized that the current hooks get_relation_stats_hook and
>>>> get_index_stats_hook are insufficient if I want to perform an
>>>> estimation over a set of ANDed quals on different columns.
>>>> In your opinion, is it possible to add a hook into the extended
>>>> statistics to allow for an extension to propose alternative estimation?
>>>>
>>>> [1] https://github.com/danolivo/pg_index_stats
>>>>
>>>
>>> No idea, I haven't thought about that very much. Presumably the existing
>>> hooks are insufficient because they're per-attnum? I guess it would make
>>> sense to have a hook for all the attnums of the relation, but I'm not
>>> sure it'd be enough to introduce a new extended statistics kind ...
>>
>> I got stuck on the same problem Alexander mentioned: we usually have
>> large tables with many uniformly distributed values. In this case, MCV
>> doesn't help a lot.
>>
>> Usually, I face problems scanning a table with a filter containing 3-6
>> ANDed quals. Here, Postgres multiplies selectivities and ends up with a
>> selectivity of less than one tuple. But such scans, in reality, mostly
>> have some physical sense and return a bunch of tuples: the set of
>> columns represents some value of a composite type.
> 
> Understood. That's a fairly common scenario, and it can easily end up
> with a rather terrible plan due to the underestimate.
> 
>> Sometimes extended statistics on dependencies help well, but they are
>> expensive for multiple columns.
> 
> Expensive in what sense? Also, how would a multidimensional histogram be
> any cheaper?
Maybe my wording should have been more precise. I haven't implemented 
multidimensional histograms before, so I don't know how expensive they 
are. What I meant is that for dependency statistics over about six 
columns, there are a lot of column combinations to compute.
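
For illustration, here is a minimal sketch of the scenario (table, 
column names and constants are made up for this example):

    -- Six correlated columns. Without extended statistics, the planner
    -- estimates the filter below by multiplying six per-column
    -- selectivities, easily dropping the row estimate under one tuple
    -- even though 2000 rows match.
    CREATE TABLE parts (c1 int, c2 int, c3 int, c4 int, c5 int, c6 int);
    INSERT INTO parts
        SELECT i % 50, i % 50, i % 50, i % 50, i % 50, i % 50
        FROM generate_series(1, 100000) i;
    ANALYZE parts;
    EXPLAIN SELECT * FROM parts
     WHERE c1 = 1 AND c2 = 1 AND c3 = 1 AND c4 = 1 AND c5 = 1 AND c6 = 1;

    -- Dependency statistics can repair the estimate, but the number of
    -- attribute combinations to examine grows combinatorially with the
    -- number of columns, which is the cost referred to above.
    CREATE STATISTICS parts_deps (dependencies)
        ON c1, c2, c3, c4, c5, c6 FROM parts;
    ANALYZE parts;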
> 
>> And sometimes I see that even a trivial histogram
>> on a ROW(x1,x2,...) could predict a much more adequate value (a kind of
>> conservative upper estimate) for a clause like "x1=N1 AND x2=N2 AND
>> ..." if somewhere in an extension we transformed it to ROW(x1,x2,...) =
>> ROW(N1, N2,...).
> 
> Are you guessing the histogram would help, or do you know it'd help? I
> have no problem believing that for range queries, but I think it's far
> less clear for simple equalities. I'm not sure a histogram would work
> for that ...

I have added Teodor Sigaev to the CC of this email; he has much more 
user feedback on this technique. As I see it, having a histogram over a 
set of columns gives us an upper bound on the selectivity of the filter. 
I'm not sure how good a simple histogram is in that case either, but in 
my practice it works, preventing the use of overly optimistic query plans.
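
To make the transformation mentioned above concrete (hypothetical table 
and constants; this is the rewrite an extension would perform, not 
anything core does today):

    -- Original form: the selectivities of the two quals get multiplied.
    SELECT * FROM t WHERE x1 = 10 AND x2 = 20;

    -- Rewritten form: a single equality over a composite value, which a
    -- histogram built on ROW(x1, x2) could estimate as one clause.
    SELECT * FROM t WHERE ROW(x1, x2) = ROW(10, 20);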

> Maybe it'd be possible to track more stuff for each bucket, not just the
> frequency, but also ndistinct for combinations of columns, and then use
> that to do better equality estimates. Or maybe we could see how the
> "expected" and "actual" bucket frequency compare, and use that to
> correct the equality estimate? Not sure.
Yes, that is on my mind, too. With such experimental stuff published as 
extensions on GitHub, we could get some community feedback.
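
One way to read the per-bucket ndistinct idea (my interpretation, not a 
settled design): for an equality clause falling into bucket b, instead 
of assuming the whole bucket matches, estimate

    selectivity(x1 = N1 AND x2 = N2) ≈ freq(b) / ndistinct(b)

i.e. divide the bucket's observed frequency by the number of distinct 
column-value combinations seen inside it.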
> 
> But perhaps you have some data to demonstrate it'd actually help?
That question should be redirected to Teodor, but I will consider 
translating messy real-life reports into a clear example.
> 
>> For such cases we don't have an in-core solution, and introducing a hook
>> on clause list estimation (paired maybe with a hook on statistics
>> generation) would let an extension deal with that problem. Also, it
>> would open a way for experiments with different types of extended
>> statistics ...
> I really don't know how to do that, or what it would take. AFAICS the
> statistics system is not very extensible from external code. Even if we
> could add new types of attribute/extended stats, I'm not sure the code
> calculating the estimates could use that.
I imagine we can add an analysis routine and store the statistics 
directly in an extension for demonstration and experimental purposes, no 
problem.
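
As a purely hypothetical sketch of that idea (none of these names exist 
anywhere), such an extension could keep its statistics in an ordinary 
table of its own rather than in pg_statistic:

    -- Hypothetical storage for an experimental composite histogram,
    -- filled by the extension's own analysis routine.
    CREATE TABLE ext_composite_hist (
        relid   regclass,    -- table the histogram describes
        attnums int2vector,  -- column set the histogram covers
        bounds  jsonb,       -- serialized multidimensional bucket bounds
        freqs   float4[]     -- per-bucket frequencies
    );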

-- 
regards,
Andrei Lepikhov
Postgres Professional



