Re: POC: GROUP BY optimization - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: POC: GROUP BY optimization |
| Date | |
| Msg-id | 20190503215510.bcr5ycszntqg65tw@development |
| In response to | Re: POC: GROUP BY optimization (Dmitry Dolgov <9erthalion6@gmail.com>) |
| Responses | Re: POC: GROUP BY optimization |
| List | pgsql-hackers |
On Fri, May 03, 2019 at 10:28:21PM +0200, Dmitry Dolgov wrote:
>> On Tue, Apr 9, 2019 at 5:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> So I personally would suggest to treat those patches as independent until
>> the very last moment, develop the costing improvements needed by each
>> of them, and then decide which of them are committable / in what order.
>
>I had the same idea, but judging from the questions raised in this thread,
>it's quite hard to go with reordering based only on frequency of values. I
>hoped that the cost_sort improvement patch would be simple enough to
>incorporate it here, but of course it wasn't. Working from the assumption
>that the amount of work required for sorting depends only on the number of
>distinct groups and how costly it is to compare values of this data type,
>I've ended up extracting the get_width_multiplier and get_func_cost parts
>from the cost_sort patch and including them in 0003-Reorder-by-values-distribution.
>This allows taking into account situations where we compare e.g. long strings
>or a custom data type with a high procost for comparison (but I've used these
>values directly, without any adjusting coefficients yet).
>
>> On Wed, Jun 13, 2018 at 6:41 PM Teodor Sigaev <teodor@sigaev.ru> wrote:
>>
>> > So that's a nice improvement, although I think we should also consider
>> > non-uniform distributions (using the per-column MCV lists).
>>
>> Could you clarify how to do that?
>
>Since I'm not familiar with this topic, I would like to ask the same question:
>how to do that, and what are the advantages?
>

I don't recall the exact details of the discussion, since most of it happened
almost a year ago, but the main concern about the original costing approach
is that it very much assumes a uniform distribution.

For example, if you have N tuples with M groups (which are not computed using
estimate_num_groups IIRC, and we can hardly do much better), then the patch
assumed each group is ~N/M rows and used that to compute the number of
comparisons for a particular sequence of ORDER BY clauses. But that may
result in pretty significant errors in cases with a couple of very large
groups.

But hey - we can get some of that information from MCV lists, so maybe we can
compute a safer, less optimistic estimate. I mean, we can't compute a
"perfect" estimate of how many comparisons we'll have to do, because we only
have per-column MCV lists and maybe some multi-column MCV lists (but
definitely not on all combinations of columns in the ORDER BY clause).

But what I think we could do is use the largest possible group instead of the
average one. So for example, when estimating the number of comparisons for
columns (c1,...,cN), you could look at the MCV lists for those columns and
compute

    m(i) = Max(largest group in MCV list for i-th column)

and then use Min(m(1), ..., m(N)) when estimating the number of comparisons.

Of course, this is likely to be a worst-case estimate, and it's not quite
obvious that comparing worst-case estimates is necessarily safer than
comparing the regular ones. So I'm not sure about it. It might be better to
just compute the 'average group' in a different way - not as an arithmetic
mean, but say as a geometric mean from each MCV list. Not sure.

I guess this needs some research and experimentation - constructing cases
that are likely to cause problems (non-uniform distributions, columns with a
small number of exceptionally large groups, ...) and then showing how the
costing deals with those.
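To make the worst-case idea above a bit more concrete, here is a minimal standalone sketch (not planner code; the row count, group count, and per-column MCV frequencies are made up for illustration) of how Min(m(1), ..., m(N)) over per-column MCV maxima compares to the uniform N/M assumption:

```c
#include <stdio.h>

/*
 * Sketch of the worst-case group size idea: for each ORDER BY column we
 * know the largest MCV frequency, i.e. the fraction of rows in the most
 * common group for that column alone.  The largest group of the combined
 * column prefix can't be bigger than the smallest of those per-column
 * maxima, so Min(m(1), ..., m(N)) gives a pessimistic bound that can be
 * used instead of the uniform N/M average.
 */
static double
worst_case_group_rows(double ntuples, const double *max_mcv_freq, int ncols)
{
	double		minfreq = 1.0;

	for (int i = 0; i < ncols; i++)
	{
		if (max_mcv_freq[i] < minfreq)
			minfreq = max_mcv_freq[i];
	}

	return ntuples * minfreq;
}

int
main(void)
{
	double		ntuples = 1000000.0;	/* N, made up */
	double		ngroups = 1000.0;		/* M, made up */

	/* hypothetical largest MCV frequencies for columns (c1, c2, c3) */
	double		max_mcv_freq[] = {0.50, 0.02, 0.30};

	printf("uniform estimate:    %.0f rows/group\n", ntuples / ngroups);
	printf("worst-case estimate: %.0f rows/group\n",
		   worst_case_group_rows(ntuples, max_mcv_freq, 3));
	return 0;
}
```

With these made-up numbers the uniform assumption gives ~1000 rows per group, while the MCV-based bound admits groups of up to 20000 rows - the kind of gap that can flip which ORDER BY sequence looks cheaper.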
FWIW I've mentioned MCV lists in the incremental sort thread too, where it was
much clearer how to use them to improve the startup cost estimate - it's
enough to look at the first group in the per-column MCV lists (first in the
ORDER BY sense, not by frequency).

>> On Sat, Jun 16, 2018 at 5:59 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> I still think we need to be careful when introducing new optimizations
>> in this area - reordering the grouping keys by ndistinct, ORDER BY or
>> whatever. In particular I don't think we should commit these patches
>> that may quite easily cause regressions, and then hope some hypothetical
>> future patch fixes the costing.
>
>I'm a bit concerned about this part of the discussion. There is an idea running
>through the whole thread about avoiding the situation where a user knows which
>order is better and we generate a different one by mistake. From what I see
>right now, even if all the objections were addressed, there is a chance that
>some coefficients will not be good enough (e.g. the width multiplier is based
>on an average width, or it can suddenly happen that all the compared strings
>have some difference at the very beginning) and the chosen order will not be
>optimal. Does it mean that in any case the implementation of such an
>optimization should provide a way to override it?

I don't think we have to provide an override no matter what. Otherwise we'd
have to have an override for everything, because no optimizer decision is
perfect - it's all built on estimates, after all.

But I assume we agree that optimizations based on estimates that are known to
be unreliable are bound to be unreliable too. And optimizations that regularly
misfire for common data sets may easily do more harm than good (and maybe
should not be called optimizations at all).

For example, the first patch relied on ndistinct estimates quite a bit and
used them to compute multi-column estimates, which we already know is rather
poor for GROUP BY estimates. The v9 uses estimate_num_groups(), which works
better because it leverages extended stats, when available. That's good, but
we need to see how the rest of the costing works.

So I think it very much depends on how reliable the costing can be made, and
how bad the consequences of a poor choice can be.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
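A minimal sketch of that "first group" startup-cost idea, assuming the per-column MCV list is available as (value, frequency) pairs and that the smallest value of the leading ORDER BY column actually appears in it (all numbers made up):

```c
#include <stdio.h>

/* Hypothetical per-column MCV entry: a value and its frequency. */
typedef struct McvEntry
{
	int			value;
	double		frequency;
} McvEntry;

/*
 * Find the frequency of the smallest MCV value of the leading ORDER BY
 * column, i.e. the group a sort has to consume before it can emit its
 * first row.  If the true minimum is not in the MCV list at all, the
 * first group is presumably small and this is only a rough upper guess.
 */
static double
first_group_fraction(const McvEntry *mcv, int nentries)
{
	int			best = 0;

	for (int i = 1; i < nentries; i++)
	{
		if (mcv[i].value < mcv[best].value)
			best = i;
	}

	return mcv[best].frequency;
}

int
main(void)
{
	/* made-up MCV list for the first ORDER BY column */
	McvEntry	mcv[] = {{42, 0.30}, {7, 0.25}, {99, 0.10}};
	double		ntuples = 1000000.0;

	printf("estimated first-group size: %.0f rows\n",
		   ntuples * first_group_fraction(mcv, 3));
	return 0;
}
```

The point is only that the startup cost depends on the size of the group the sort must consume before emitting anything, which is a property of the first value in the ORDER BY direction rather than of the most common one.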