On 1/21/22 12:09, Andrey Lepikhov wrote:
> On 7/22/21 3:58 AM, Tomas Vondra wrote:
>> 4) I'm not sure it's actually a good idea to pass tuplesPerPrevGroup to
>> estimate_num_groups_incremental. In principle yes, if we use "group
>> size" from the previous step, then the returned value is the number of
>> new groups after adding the "new" pathkey.
>> But even if we ignore the issues with amplification mentioned in (3),
>> there's an issue with non-linear behavior in estimate_num_groups,
>> because at some point it's calculating
>>
>> D(N,n,p) = n * (1 - ((N-p)/N)^(N/n))
>>
>> where N - total rows, p - sample size, n - number of distinct values.
>> And if we have (N1,n1) and (N2,n2) then the ratio of calculated
>> estimated (which is pretty much what calculating group size does)
>>
>> D(N2,n2,p2) / D(N1,n1,p1)
>>
>> which will differ depending on p1 and p2. And if we're tweaking the
>> tuplesPerPrevGroup all the time, that's really annoying, as it may make
>> the groups smaller or larger, which is unpredictable and annoying, and I
>> wonder if it might go against the idea of penalizing tuplesPerPrevGroup
>> to some extent.
>> We could simply use the input "tuples" value here, and then divide the
>> current and previous estimate to calculate the number of new groups.
>
> tuplesPerPrevGroup is only a top boundary for the estimation. I think,
> we should use result of previous estimation as a limit for the next
> during incremental estimation. Maybe we simply limit the
> tuplesPerPrevGroup value to ensure of monotonic non-growth of this
> value? - see in attachment patch to previous fixes.
>
Yes, I think that's a reasonable defense - it makes no sense to exceed
the group size in the preceding step, which could happen with small
number of groups or some unexpected ndistinct estimate.
The other thing we could do is reduce the coefficient gradually - so
it'd be 1.5 for the first pathkey, then 1.25 for the next one, and so
on. But it seems somewhat arbitrary (I certainly don't have some sound
theoretical justification ...).
I've merged most of the fixes you reported. I've skipped the path_save
removal in planner.c, because that seems incorrect - if there are
multiple pathkeys, we must start with the original path, not the
modified one we built in the last iteration. Or am I missing something?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company