From: Tomas Vondra
Subject: Re: POC: GROUP BY optimization
Msg-id: 721b3ff9-f214-f5d4-86c7-bae515bbec18@enterprisedb.com
In response to: Re: POC: GROUP BY optimization (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)

On 1/21/22 12:09, Andrey Lepikhov wrote:
> On 7/22/21 3:58 AM, Tomas Vondra wrote:
>> 4) I'm not sure it's actually a good idea to pass tuplesPerPrevGroup to
>> estimate_num_groups_incremental. In principle yes, if we use "group
>> size" from the previous step, then the returned value is the number of
>> new groups after adding the "new" pathkey.
>> But even if we ignore the issues with amplification mentioned in (3),
>> there's an issue with non-linear behavior in estimate_num_groups,
>> because at some point it's calculating
>>
>>     D(N,n,p) = n * (1 - ((N-p)/N)^(N/n))
>>
>> where N is the total number of rows, p the sample size, and n the
>> number of distinct values. And if we have (N1,n1) and (N2,n2), then
>> the ratio of the calculated estimates (which is pretty much what
>> calculating the group size does) is
>>
>>     D(N2,n2,p2) / D(N1,n1,p1)
>>
>> which will differ depending on p1 and p2. And if we keep tweaking
>> tuplesPerPrevGroup all the time, it may make the groups smaller or
>> larger in ways that are unpredictable and annoying, and I wonder if
>> that might go against the idea of penalizing tuplesPerPrevGroup to
>> some extent.
>> We could simply use the input "tuples" value here, and then divide
>> the current estimate by the previous one to calculate the number of
>> new groups.
>
> tuplesPerPrevGroup is only an upper bound for the estimation. I think
> we should use the result of the previous estimation as a limit for the
> next one during incremental estimation. Maybe we can simply clamp the
> tuplesPerPrevGroup value to ensure it never grows from one step to the
> next? - see the attached patch on top of the previous fixes.
> 
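
Just to illustrate the non-linearity I was complaining about, here's a
tiny standalone sketch that merely evaluates the formula quoted above
for a couple of sample sizes (the constants are made up, and this is
not the actual selfuncs.c code):

    #include <stdio.h>
    #include <math.h>

    /* The estimator quoted above: expected number of distinct values
     * observed when sampling p rows out of N total rows containing n
     * distinct values overall. */
    static double
    D(double N, double n, double p)
    {
        return n * (1.0 - pow((N - p) / N, N / n));
    }

    int
    main(void)
    {
        double  N = 1000000;    /* total rows */
        double  n = 1000;       /* distinct values */

        /* The ratio between two such estimates depends on the sample
         * sizes, not just on (N, n) -- that's the non-linearity. */
        printf("D(p=1000)   = %.1f\n", D(N, n, 1000));
        printf("D(p=10000)  = %.1f\n", D(N, n, 10000));
        printf("D(p=100000) = %.1f\n", D(N, n, 100000));
        return 0;
    }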

Yes, I think that's a reasonable defense - it makes no sense to exceed 
the group size from the preceding step, which could happen with a small 
number of groups or an unexpected ndistinct estimate.
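
In code, the defense could be as simple as something like this (a
hypothetical sketch with made-up names, not the attached patch
verbatim):

    /*
     * Never let the estimated per-group tuple count grow from one
     * pathkey to the next -- adding a column to the grouping can only
     * split groups, never merge them.
     */
    static double
    clamp_group_size(double prev_group_size, double tuples,
                     double ngroups)
    {
        double  group_size = tuples / ngroups;

        return (group_size < prev_group_size) ? group_size
                                              : prev_group_size;
    }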

The other thing we could do is reduce the coefficient gradually - so 
it'd be 1.5 for the first pathkey, then 1.25 for the next one, and so 
on. But that seems somewhat arbitrary (I certainly don't have any sound 
theoretical justification for it ...).
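
For the record, here's roughly what I have in mind (halving the
surcharge at each step is just one possible reading of "and so on", and
the function name is made up):

    /* 1.5 for the first pathkey, 1.25 for the second, 1.125 for the
     * third, ... -- the surcharge over 1.0 halves at each step. */
    static double
    pathkey_coefficient(int idx)    /* 0-based position of the pathkey */
    {
        return 1.0 + 0.5 / (double) (1 << idx);
    }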

I've merged most of the fixes you reported. I've skipped the path_save 
removal in planner.c, because that seems incorrect - if there are 
multiple pathkeys, we must start each iteration from the original path, 
not the modified one built in the previous iteration. Or am I missing 
something?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
