
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
Subject: Re: POC: GROUP BY optimization
Msg-id: 006667af-fc50-0627-4be7-5d9cf665219f@enterprisedb.com
In response to: Re: POC: GROUP BY optimization (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers

Hi,

after a bit more time spent on this, I found that the issue is due to
this chunk of code:

    if (heapSort)
    {
        if (tuplesPerPrevGroup < output_tuples)
            /* comparing only inside output_tuples */
            correctedNGroups =
                ceil(2.0 * output_tuples /
                     (tuplesPerPrevGroup / nGroups));
        else
            /* two groups - in output and out */
            correctedNGroups = 2.0;
    }
    else
        correctedNGroups = nGroups;

    if (correctedNGroups <= 1.0)
        correctedNGroups = 2.0;
    else
        correctedNGroups = ceil(correctedNGroups);
    per_tuple_cost += totalFuncCost * LOG2(correctedNGroups);


There are a couple of issues, mostly due to differences in how the cases
with different heapSort flag values are handled. For a full sort (no
LIMIT clause) we have heapSort=false, and hence the execution simply
jumps to

    correctedNGroups = nGroups;

while for LIMIT we do heapSort=true, in which case we also start with

    tuplesPerPrevGroup = ntuples;

That is almost certainly greater than output_tuples (= limit + offset),
so the first if condition in the correctedNGroups calculation can't
possibly be true, and we simply end up with

    correctedNGroups = 2.0;

in the first loop iteration. That seems pretty bogus - why would there
be just two groups? When processing the first expression, it's as if
there was one big "prev group" with all the tuples, so why not just use
nGroups as it is?
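
To make the asymmetry concrete, here's a tiny standalone sketch that
replays the quoted branch with made-up numbers (1M input rows, 1000
groups on the first key, LIMIT 10) - the values are purely illustrative
and it doesn't call any planner code:

    /*
     * Toy replay of the quoted correctedNGroups logic, with made-up
     * numbers (1M input rows, 1000 groups on the first key, LIMIT 10).
     * This is just an illustration, not the planner code itself.
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double ntuples = 1000000.0;          /* rows fed into the sort */
        double output_tuples = 10.0;         /* limit + offset */
        double nGroups = 1000.0;             /* est. groups on 1st key */
        double tuplesPerPrevGroup = ntuples; /* first iteration */
        double correctedNGroups;

        /* heapSort = true branch (LIMIT present) */
        if (tuplesPerPrevGroup < output_tuples)
            correctedNGroups = ceil(2.0 * output_tuples /
                                    (tuplesPerPrevGroup / nGroups));
        else
            correctedNGroups = 2.0;  /* always taken on the first key */

        printf("LIMIT:    log2(correctedNGroups) = %g\n",
               log2(correctedNGroups));      /* 1.0 */
        printf("no LIMIT: log2(nGroups)          = %g\n",
               log2(nGroups));               /* ~9.97 */
        return 0;
    }

So with a LIMIT the per-function comparison cost on the first pathkey
uses log2(2) = 1, while without a LIMIT it uses log2(nGroups), which
seems to be where the cost inversion comes from.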

This seems to confuse the costing quite a bit - enough to produce the
"inverted" costs with/without LIMIT and to pick the other plan.

I've simplified the costing a bit, and the attached version actually
undoes all the "suspicious" plan changes in postgres_fdw. It changes one
new plan, but that seems somewhat reasonable, as it pushes the sort to the
remote side.

But after looking at the costing function, I have a bunch of additional
comments and questions:


1) I looked at the resources mentioned as the sources of the formulas,
but I've been unable to really match the algorithm to them. The
quicksort paper is particularly "dense", the notation seems very
different, and none of the theorems seem like an obvious fit. It would
be good to make the relationship clearer in the comments etc.

For the Sedgewick course it's even worse - it's way too extensive to
just point at it and say "We're using ideas from this!" because no one
is going to know which chapter/section to look at. We need to be a bit
more specific about the reference.


2) I'm a bit puzzled by the "heapsort" chunk, actually. How come we need
it now, when we didn't need it before? In a way, the difference in
behavior between heapsort and non-heapsort is what triggered the plan
changes ...

FWIW it's quite possible I tweaked the costing incorrectly and it ends
up choosing the right plans purely by accident.


3) I'm getting a bit skeptical about the various magic coefficients that
are meant to model higher costs with non-uniform distributions. For
example, consider that we do this:

   tuplesPerPrevGroup = ceil(1.5 * tuplesPerPrevGroup / nGroups);

but then in the next loop iteration we call
estimate_num_groups_incremental and pass this "tweaked"
tuplesPerPrevGroup value to it. I'm pretty sure this may have various
strange consequences - we'll calculate nGroups based on the inflated
value, and then calculate tuplesPerPrevGroup from that again - which
seems susceptible to "amplification".

We inflate tuplesPerPrevGroup by 50%, which means we'll get a higher
nGroups estimate in the next loop iteration - but not linearly. And then
we'll calculate the inflated tuplesPerPrevGroup and estimated nGroups
from that again ...

That seems pretty dubious, with hard-to-explain behavior, IMO.
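
Here's a toy sketch of the kind of drift I mean. It doesn't call
estimate_num_groups_incremental at all - it just assumes (purely for
illustration) that every added pathkey splits each group into 10 smaller
ones, and compares the plain group size with what we get when the 1.5
factor is folded into tuplesPerPrevGroup on each step:

    /*
     * Toy illustration of the compounding 1.5 factor.  Assumes a fixed
     * fanout of 10 groups per added pathkey, so it only shows the drift
     * from folding the factor into tuplesPerPrevGroup; the real code
     * additionally feeds the inflated value back into the ndistinct
     * estimate.
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double plain = 1000000.0;    /* group size without the factor */
        double tweaked = 1000000.0;  /* group size with 1.5x per step */
        double fanout = 10.0;        /* assumed groups per new pathkey */

        for (int key = 1; key <= 4; key++)
        {
            plain = ceil(plain / fanout);
            tweaked = ceil(1.5 * tweaked / fanout);
            printf("key %d: plain = %7.0f  tweaked = %7.0f  ratio = %.2f\n",
                   key, plain, tweaked, tweaked / plain);
        }
        /* the ratio grows roughly as 1.5^k, i.e. the penalty compounds */
        return 0;
    }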

If we want to keep applying these coefficients, we need to do that in a
way that does not affect the subsequent loop like this - we might tweak
the per_tuple_cost formula, for example, not tuplesPerPrevGroup.


4) I'm not sure it's actually a good idea to pass tuplesPerPrevGroup to
estimate_num_groups_incremental. In principle yes, if we use "group
size" from the previous step, then the returned value is the number of
new groups after adding the "new" pathkey.

But even if we ignore the issues with amplification mentioned in (3),
there's an issue with non-linear behavior in estimate_num_groups,
because at some point it's calculating

    D(N,n,p) = n * (1 - ((N-p)/N)^(N/n))

where N is the total number of rows, p is the sample size, and n is the
number of distinct values. And if we have (N1,n1,p1) and (N2,n2,p2),
then the ratio of the calculated estimates (which is pretty much what
calculating the group size does) is

    D(N2,n2,p2) / D(N1,n1,p1)

which will differ depending on p1 and p2. And if we're tweaking
tuplesPerPrevGroup all the time, that's really annoying, as it may make
the groups smaller or larger in ways that are hard to predict, and I
wonder if it might partly work against the idea of penalizing
tuplesPerPrevGroup.
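
Here's a quick standalone check of that nonlinearity, with made-up N and
n (1M rows, 1000 distinct values) - scaling p by 1.5 scales the estimate
by only about 1.23:

    /*
     * Quick check of the nonlinearity in D(N,n,p), with made-up numbers
     * (N = 1M rows, n = 1000 distinct values).
     */
    #include <math.h>
    #include <stdio.h>

    static double D(double N, double n, double p)
    {
        return n * (1.0 - pow((N - p) / N, N / n));
    }

    int main(void)
    {
        double N = 1000000.0;    /* total rows */
        double n = 1000.0;       /* number of distinct values */
        double p1 = 1000.0;      /* group size */
        double p2 = 1500.0;      /* group size inflated by 1.5 */

        printf("D(N,n,p1) = %.1f\n", D(N, n, p1));             /* ~632 */
        printf("D(N,n,p2) = %.1f\n", D(N, n, p2));             /* ~777 */
        printf("ratio     = %.2f\n", D(N, n, p2) / D(N, n, p1)); /* ~1.23 */
        return 0;
    }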

We could simply use the input "tuples" value here, and then divide the
current estimate by the previous one to calculate the number of new
groups.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

