Re: POC: GROUP BY optimization - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: POC: GROUP BY optimization |
Date | |
Msg-id | 006667af-fc50-0627-4be7-5d9cf665219f@enterprisedb.com |
In response to | Re: POC: GROUP BY optimization (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses | Re: POC: GROUP BY optimization |
List | pgsql-hackers |
Hi,

after a bit more time spent on this, I found that the issue is due to this chunk of code in the costing function:

    if (heapSort)
    {
        if (tuplesPerPrevGroup < output_tuples)
            /* comparing only inside output_tuples */
            correctedNGroups = ceil(2.0 * output_tuples /
                                    (tuplesPerPrevGroup / nGroups));
        else
            /* two groups - in output and out */
            correctedNGroups = 2.0;
    }
    else
        correctedNGroups = nGroups;

    if (correctedNGroups <= 1.0)
        correctedNGroups = 2.0;
    else
        correctedNGroups = ceil(correctedNGroups);

    per_tuple_cost += totalFuncCost * LOG2(correctedNGroups);

There are a couple of issues, mostly due to differences in how the cases with different heapSort flags are handled. For a full sort (no LIMIT clause) we have heapSort=false, so the execution simply jumps to

    correctedNGroups = nGroups;

while for LIMIT we have heapSort=true, in which case we also start with

    tuplesPerPrevGroup = ntuples;

That is almost certainly greater than output_tuples (= limit + offset), so the first if condition in calculating correctedNGroups can't possibly be true, and in the first loop we simply end up with

    correctedNGroups = 2.0;

That seems pretty bogus - why would there be just two groups? When processing the first expression, it's as if there was one big "prev group" with all the tuples, so why not just use nGroups as it is?

This seems to confuse the costing quite a bit - enough to produce the "inverted" costs with/without LIMIT, and to pick the other plan.

I've simplified the costing a bit, and the attached version actually undoes all the "suspicious" plan changes in postgres_fdw. It changes one new plan, but that seems somewhat reasonable, as it pushes the sort to the remote side.

But after looking at the costing function, I have a bunch of additional comments and questions:

1) I looked at the resources mentioned as sources of the formulas, but I've been unable to really match the algorithm to them. The quicksort paper is particularly "dense", the notation seems to be very different, and none of the theorems look like an obvious fit. It would be good to make the relationship clearer in comments etc.

For the Sedgewick course it's even worse - it's way too extensive to just point at it and say "We're using ideas from this!" because no one is going to know which chapter/section to look at. We need to be a bit more specific about the reference.

2) I'm a bit puzzled by the "heapsort" chunk, actually. How come we need it now, when we didn't need it before? In a way, the difference in behavior between heapsort and non-heapsort is what triggered the plan changes ...

FWIW it's quite possible I tweaked the costing incorrectly and it just ends up choosing the right plans purely by accident.

3) I'm getting a bit skeptical about the various magic coefficients that are meant to model higher costs with non-uniform distributions. Consider that we do this, for example:

    tuplesPerPrevGroup = ceil(1.5 * tuplesPerPrevGroup / nGroups);

but then in the next loop we call estimate_num_groups_incremental and pass this "tweaked" tuplesPerPrevGroup value to it. I'm pretty sure this may have various strange consequences - we'll calculate nGroups based on the inflated value, and then we'll calculate tuplesPerPrevGroup from that again - which seems susceptible to "amplification".

We inflate tuplesPerPrevGroup by 50%, which means we'll get a higher nGroups estimate in the next loop - but not linearly. And then we'll calculate the inflated tuplesPerPrevGroup and the estimated nGroups from that again ... That seems pretty dubious, with hard-to-explain behavior, IMO.
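To make the feedback loop concrete, here is a tiny self-contained sketch. It is not the actual costing code - the sqrt-based group estimate is entirely made up and just stands in for any estimator that reacts non-linearly to the group size - it only shows how feeding the inflated value back into the next estimate behaves:

    #include <math.h>
    #include <stdio.h>

    /*
     * Toy stand-in for estimate_num_groups_incremental(): assume the number
     * of groups produced by the next pathkey grows non-linearly with the
     * size of the previous group.  The sqrt() is made up purely for
     * illustration.
     */
    static double
    toy_num_groups(double tuplesPerPrevGroup)
    {
        return ceil(sqrt(tuplesPerPrevGroup));
    }

    int
    main(void)
    {
        double      with_coef = 1000000.0;  /* tuplesPerPrevGroup, with the 1.5 */
        double      plain = 1000000.0;      /* tuplesPerPrevGroup, without it */
        int         key;

        for (key = 1; key <= 4; key++)
        {
            /* the inflated value from this loop drives the next estimate */
            with_coef = ceil(1.5 * with_coef / toy_num_groups(with_coef));
            plain = ceil(plain / toy_num_groups(plain));

            printf("pathkey %d: group size %.0f vs. %.0f (ratio %.2f)\n",
                   key, with_coef, plain, with_coef / plain);
        }

        return 0;
    }

The ratio between the two trajectories starts at 1.5 and drifts up to about 2.5 by the fourth pathkey - it's neither a constant 1.5x nor a clean 1.5^k, but something in between that depends entirely on how the estimator happens to react. That's the hard-to-explain behavior I mean.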
If we want to keep applying these coefficients, we need to do that in a way that does not affect the subsequent loop like this - we might tweak the per_tuple_cost formula, for example, not tuplesPerPrevGroup.

4) I'm not sure it's actually a good idea to pass tuplesPerPrevGroup to estimate_num_groups_incremental. In principle yes, if we use the "group size" from the previous step, then the returned value is the number of new groups after adding the "new" pathkey.

But even if we ignore the amplification issues mentioned in (3), there's an issue with non-linear behavior in estimate_num_groups, because at some point it calculates

    D(N,n,p) = n * (1 - ((N-p)/N)^(N/n))

where N is the total number of rows, p the sample size and n the number of distinct values. So if we have (N1,n1) and (N2,n2), the ratio of the calculated estimates (which is pretty much what calculating the group size does) is

    D(N2,n2,p2) / D(N1,n1,p1)

and it will differ depending on p1 and p2. So if we keep tweaking tuplesPerPrevGroup all the time, it may make the groups smaller or larger in ways that are hard to predict, and I wonder if that might go against the idea of penalizing tuplesPerPrevGroup to some extent. We could simply use the input "tuples" value here, and then divide the current and previous estimates to calculate the number of new groups.
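Just to illustrate that non-linearity, here is a small standalone sketch plugging arbitrary numbers into the formula above - the distinct counts and sample sizes are made up, the only point is how the ratio of the two estimates moves with p:

    #include <math.h>
    #include <stdio.h>

    /*
     * D(N,n,p) = n * (1 - ((N - p) / N)^(N / n))
     * N = total rows, p = sample size, n = number of distinct values.
     */
    static double
    D(double N, double n, double p)
    {
        return n * (1.0 - pow((N - p) / N, N / n));
    }

    int
    main(void)
    {
        double      N = 1000000.0;
        double      n1 = 100.0;     /* distinct values, previous pathkeys */
        double      n2 = 1000.0;    /* distinct values, plus the new pathkey */
        double      samples[2] = {1000.0, 5000.0};
        int         i;

        for (i = 0; i < 2; i++)
        {
            double      p = samples[i];

            printf("p = %.0f: D2/D1 = %.2f\n", p, D(N, n2, p) / D(N, n1, p));
        }

        return 0;
    }

This prints a ratio of roughly 6.3 for p = 1000 and roughly 9.9 for p = 5000, so the "number of new groups" we derive this way really does depend on whatever we happened to do to tuplesPerPrevGroup in the previous loop.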
regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company