Home > mailing lists

Re: Multidimensional Histograms - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Multidimensional Histograms
Date	January 7, 2024 18:36:17
Msg-id	6ac2fad9-45e8-4d5a-9de0-a02666bf50d3@enterprisedb.com Whole thread Raw
In response to	Re: Multidimensional Histograms (Andrei Lepikhov <a.lepikhov@postgrespro.ru>)
Responses	Re: Multidimensional Histograms
List	pgsql-hackers

Tree view


On 1/7/24 18:26, Andrei Lepikhov wrote:
> On 7/1/2024 17:51, Tomas Vondra wrote:
>> On 1/7/24 11:22, Andrei Lepikhov wrote:
>>> On 7/1/2024 06:54, Tomas Vondra wrote:
>>>> It's an interesting are for experiments, no doubt about it. And if you
>>>> choose to explore it, that's fine. But it's better to be aware it may
>>>> not end with a commit.
>>>> For the multi-dimensional case, I propose we first try to experiment
>>>> with the various algorithms, and figure out what works etc. Maybe
>>>> implementing them in python or something would be easier than C.
>>>
>>> Curiously, trying to utilize extended statistics for some problematic
>>> cases, I am experimenting with auto-generating such statistics by
>>> definition of indexes [1]. Doing that, I wanted to add some hand-made
>>> statistics like a multidimensional histogram or just a histogram which
>>> could help to perform estimation over a set of columns/expressions.
>>> I realized that current hooks get_relation_stats_hook and
>>> get_index_stats_hook are insufficient if I want to perform an estimation
>>> over a set of ANDed quals on different columns.
>>> In your opinion, is it possible to add a hook into the extended
>>> statistics to allow for an extension to propose alternative estimation?
>>>
>>> [1] https://github.com/danolivo/pg_index_stats
>>>
>>
>> No idea, I haven't thought about that very much. Presumably the existing
>> hooks are insufficient because they're per-attnum? I guess it would make
>> sense to have a hook for all the attnums of the relation, but I'm not
>> sure it'd be enough to introduce a new extended statistics kind ...
> 
> I got stuck on the same problem Alexander mentioned: we usually have
> large tables with many uniformly distributed values. In this case, MCV
> doesn't help a lot.
>
> Usually, I face problems scanning a table with a filter containing 3-6
> ANDed quals. Here, Postgres multiplies selectivities and ends up with a
> less than 1 tuple selectivity. But such scans, in reality, mostly have
> some physical sense and return a bunch of tuples. It looks like the set
> of columns representing some value of composite type.

Understood. That's a fairly common scenario, and it can easily end up
with rather terrible plan due to the underestimate.

> Sometimes extended statistics on dependency helps well, but it expensive
> for multiple columns.

Expensive in what sense? Also, how would a multidimensional histogram be
any cheaper?

> And sometimes I see that even a trivial histogram
> on a ROW(x1,x2,...) could predict a much more adequate value (kind of
> conservative upper estimation) for a clause like "x1=N1 AND x2=N2 AND
> ..." if somewhere in extension we'd transform it to ROW(x1,x2,...) =
> ROW(N1, N2,...).

Are you guessing the histogram would help, or do you know it'd help? I
have no problem believing that for range queries, but I think it's far
less clear for simple equalities. I'm not sure a histograms would work
for that ...

Maybe it'd be possible to track more stuff for each bucket, not just the
frequency, but also ndistinct for combinations of columns, and then use
that to do better equality estimates. Or maybe we could see how the
"expected" and "actual" bucket frequency compare, and use that to
correct the equality estimate? Not sure.

But perhaps you have some data to demonstrate it'd actually help?

> For such cases we don't have an in-core solution, and introducing a hook
> on clause list estimation (paired with maybe a hook on statistics
> generation) could help invent an extension that would deal with that
> problem. Also, it would open a way for experiments with different types
> of extended statistics ...
> 

I really don't know how to do that, or what would it take. AFAICS the
statistics system is not very extensible from external code. Even if we
could add new types of attribute/extended stats, I'm not sure the code
calculating the estimates could use that.

That doesn't mean it's impossible or not worth exploring, but I don't
have any thoughts on this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-hackers by date:

From: Andrei Lepikhov
Date: 07 January 2024, 17:26:43
Subject: Re: Multidimensional Histograms

From: Joe Conway
Date: 07 January 2024, 18:51:50
Subject: Re: Password leakage avoidance

Re: Multidimensional Histograms - Mailing list pgsql-hackers

Previous

Next