Home > mailing lists

Re: POC: GROUP BY optimization - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: POC: GROUP BY optimization
Date	January 21, 2022 20:34:43
Msg-id	721b3ff9-f214-f5d4-86c7-bae515bbec18@enterprisedb.com Whole thread Raw
In response to	Re: POC: GROUP BY optimization (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Responses	Re: POC: GROUP BY optimization Re: POC: GROUP BY optimization
List	pgsql-hackers

Tree view


On 1/21/22 12:09, Andrey Lepikhov wrote:
> On 7/22/21 3:58 AM, Tomas Vondra wrote:
>> 4) I'm not sure it's actually a good idea to pass tuplesPerPrevGroup to
>> estimate_num_groups_incremental. In principle yes, if we use "group
>> size" from the previous step, then the returned value is the number of
>> new groups after adding the "new" pathkey.
>> But even if we ignore the issues with amplification mentioned in (3),
>> there's an issue with non-linear behavior in estimate_num_groups,
>> because at some point it's calculating
>>
>>     D(N,n,p) = n * (1 - ((N-p)/N)^(N/n))
>>
>> where N - total rows, p - sample size, n - number of distinct values.
>> And if we have (N1,n1) and (N2,n2) then the ratio of calculated
>> estimated (which is pretty much what calculating group size does)
>>
>>     D(N2,n2,p2) / D(N1,n1,p1)
>>
>> which will differ depending on p1 and p2. And if we're tweaking the
>> tuplesPerPrevGroup all the time, that's really annoying, as it may make
>> the groups smaller or larger, which is unpredictable and annoying, and I
>> wonder if it might go against the idea of penalizing tuplesPerPrevGroup
>> to some extent.
>> We could simply use the input "tuples" value here, and then divide the
>> current and previous estimate to calculate the number of new groups.
 >
> tuplesPerPrevGroup is only a top boundary for the estimation. I think, 
> we should use result of previous estimation as a limit for the next 
> during incremental estimation. Maybe we simply limit the 
> tuplesPerPrevGroup value to ensure of monotonic non-growth of this 
> value? - see in attachment patch to previous fixes.
> 

Yes, I think that's a reasonable defense - it makes no sense to exceed 
the group size in the preceding step, which could happen with small 
number of groups or some unexpected ndistinct estimate.

The other thing we could do is reduce the coefficient gradually - so 
it'd be 1.5 for the first pathkey, then 1.25 for the next one, and so 
on. But it seems somewhat arbitrary (I certainly don't have some sound 
theoretical justification ...).

I've merged most of the fixes you reported. I've skipped the path_save 
removal in planner.c, because that seems incorrect - if there are 
multiple pathkeys, we must start with the original path, not the 
modified one we built in the last iteration. Or am I missing something?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

v20220121-0001-GROUP-BY-reordering.patch

pgsql-hackers by date:

From: Stephen Frost
Date: 21 January 2022, 20:28:32
Subject: Re: How to get started with contribution

From: Tomas Vondra
Date: 21 January 2022, 20:41:55
Subject: Re: How to get started with contribution

Re: POC: GROUP BY optimization - Mailing list pgsql-hackers

Attachment

Previous

Next