Re: POC: GROUP BY optimization - Mailing list pgsql-hackers

From Andrey Lepikhov
Subject Re: POC: GROUP BY optimization
Date
Msg-id e81a07b2-53d9-e920-acfa-d68970b72d7a@postgrespro.ru
In response to Re: POC: GROUP BY optimization  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: POC: GROUP BY optimization
List pgsql-hackers
On 24/7/2023 16:56, Tomas Vondra wrote:
> On 7/24/23 04:10, Andrey Lepikhov wrote:
>> On 20/7/2023 18:46, Tomas Vondra wrote:
>>> On 7/20/23 08:37, Andrey Lepikhov wrote:
>>>> On 3/10/2022 21:56, Tom Lane wrote:
>>>>> Revert "Optimize order of GROUP BY keys".
>>>>>
>>>>> This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
>>>>> several follow-on fixes.
>>>>> ...
>>>>> Since we're hard up against the release deadline for v15, let's
>>>>> revert these changes for now.  We can always try again later.
>>>>
>>>> It may be time to restart the project. As a first step, I rebased the
>>>> patch on the current master. It wasn't trivial because of some latest
>>>> optimizations (a29eab, 1349d27 and 8d83a5d).
>>>> Now, let's repeat the review and rewrite the current patch according
>>>> to the reasons uttered in the revert commit.
>>>
>>> I think the fundamental task is to make the costing more reliable, and
>>> the commit message 443df6e2db points out a couple challenges in this
>>> area. Not sure how feasible it is to address enough of them ...
>>>
>>> 1) procost = 1.0 - I guess we could make this more realistic by doing
>>> some microbenchmarks and tuning the costs for the most expensive cases.
>>>
>>> 2) estimating quicksort comparisons - This relies on ndistinct
>>> estimates, and I'm not sure how much more reliable we can make those.
>>> Probably not much :-( Not sure what to do about this, the only thing I
>>> can think of is to track "reliability" of the estimates and only do the
>>> reordering if we have high confidence in the estimates. That means we'll
>>> miss some optimization opportunities, but it should limit the risk.
>> I read up on the history of this thread.
>> As I see it, all the problems mentioned above can be avoided by
>> excluding the new cost model entirely. We can sort GROUP BY columns
>> according to the 'ndistinct' value.
>> I see the reason for introducing the cost model in [1]. The main
>> supporting point here is that with this patch, people couldn't optimize
>> the query by themselves, organizing the order of the columns in a more
>> optimal way. But now we have at least the GUC to switch off the
>> behaviour introduced here. Also, some extensions, like the well-known
>> pg_hint_plan, can help with automation.
> 
> I think the main concern is that if we reorder the group keys and get it
> wrong, it's a regression. If that happens often (due to costing based on
> poor stats), it's a problem. Yes, there's a GUC, but that's a rather
> blunt instrument, unfortunately.
I see. My point here is that if the ndistinct of one column is much 
larger than the ndistinct of another, it is more probable that this 
ordering will remain favourable during the grouping phase. We may get 
some regression here, but it can be outweighed by the possibility below.
> 
>> So, how about committing of the undoubted part of the feature and
>> working on the cost model in a new thread?
>>
> 
> But Tom's commit message says this:
> 
>      Worse, to arrive at estimates of the number of calls made to the
>      lower-order-column comparison functions, the code needs to make
>      estimates of the numbers of distinct values of multiple columns,
>      which are necessarily even less trustworthy than per-column stats.
> 
> so I'm not sure this really counts as "undoubted".
Don't try to estimate multiple columns - just sort the columns by their 
ndistinct value as a heuristic.
I think we should estimate the number of distinct values over multiple 
columns only if we have extended statistics on those columns. That 
would also extend the applicability of extended statistics.
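To make the heuristic concrete, here is a minimal sketch (in Python, not 
the actual planner C code; the function and data are hypothetical) of 
ordering GROUP BY columns by descending single-column ndistinct, so the 
first sort key discriminates the most rows and lower-order comparison 
functions are called less often:

```python
def reorder_group_keys(columns, ndistinct):
    """Order GROUP BY columns by descending ndistinct.

    columns   -- list of column names in the query's original order
    ndistinct -- per-column distinct-value estimates (single-column
                 stats only; no multi-column estimation needed)
    """
    return sorted(columns, key=lambda c: ndistinct[c], reverse=True)

# Hypothetical example: user_id is far more selective, so it goes first;
# most row pairs are then distinguished by the first comparison alone.
keys = reorder_group_keys(
    ["region", "user_id", "status"],
    {"region": 12, "user_id": 500_000, "status": 3},
)
# -> ["user_id", "region", "status"]
```

The point is that this ordering needs no cost model and no multi-column 
ndistinct estimates, only the per-column statistics we already trust.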

The suggestion above is just an attempt to gather the low-hanging fruit 
right now. If it is not applicable, we will take the long way.

-- 
regards,
Andrey Lepikhov
Postgres Professional



