Re: cost_sort() improvements - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: cost_sort() improvements
Date:
Msg-id: b80be535-273f-b026-575b-c6fdc4f33e00@2ndquadrant.com
In response to: Re: cost_sort() improvements (Teodor Sigaev <teodor@sigaev.ru>)
List: pgsql-hackers
On 07/12/2018 05:00 PM, Teodor Sigaev wrote:
>
>> One more thought about estimating the worst case - I wonder if simply
>> multiplying the per-tuple cost by 1.5 is the right approach. It does
>> not seem particularly principled, and it's trivially simple to
>> construct counter-examples defeating it (imagine columns with 99% of
>> the rows having the same value, and then many values in exactly one
>> row - that will defeat the ndistinct-based group size estimates).
>
> Agree. But right now that is the best we have, and the constant 1.5
> could easily be changed to 1 or 10 - it is just an arbitrary value,
> chosen intuitively. As I mentioned in the v7 patch description, the
> new estimation function produces a ~10% larger estimate, and this
> constant should not be very large, because we could easily end up
> overestimating.
>
>> The reason why constructing such counter-examples is so simple is that
>> this relies entirely on the ndistinct stats, ignoring the other stats
>> we already have. I think this might leverage the per-column MCV lists
>> (and eventually multi-column MCV lists, if that ever gets committed).
>>
>> We're estimating the number of tuples in a group (after fixing values
>> in the preceding columns), because that's what determines the number
>> of comparisons we need to make at a given stage. The patch does this
>> by estimating the average group size, by estimating the ndistinct of
>> the preceding columns G(i-1), and computing ntuples/G(i-1).
>>
>> But consider that we also have MCV lists for each column - if there is
>> a very common value, it should be there. So let's say M(i) is the
>> frequency of the most common value in the i-th column, determined
>> either from the MCV list or as 1/ndistinct for that column. Then if
>> m(i) is the minimum of M(i) over the first i columns, the largest
>> possible group of values when comparing values in the i-th column is
>>
>>     N * m(i-1)
>>
>> This seems like a better way to estimate the worst case. I don't know
>> if this should be used instead of N/G(i), or if those two estimates
>> should be combined in some way.
>
> Agree, we definitely need an estimate of the largest group size to use
> in cost_sort. But I don't feel up to doing that - could you please
> attack the group-size computation issue? Thank you.

Attached is a simple patch illustrating this MCV-based approach. I don't
claim it's finished / RFC, but hopefully it's sufficient to show what I
meant. I'm not sure how to plug it into the v8 of your patch at this
point, so I've simply added an elog to estimate_num_groups to also print
the estimate of the largest group for GROUP BY queries.

It's largely a simplified copy of estimate_num_groups() and there are a
couple of comments on how it might be improved further.

>> I think estimating the largest group we need to sort should be helpful
>> for the incremental sort patch, so I'm adding Alexander to this
>> thread.
>
> Agree, I think so. The suggested estimation algorithm should be
> modified to support leading preordered keys, and in this form it could
> be used directly in the GROUP BY and incremental sort patches.

Right. What I think the function estimating the group size could do in
the case of incremental sort is produce two values - the maximum size of
the leading group, and the maximum group size overall. The first one
would be useful for the startup cost, the second one for the total cost.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
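To make the arithmetic above concrete, here is a minimal standalone
sketch of the MCV-based worst-case estimate. This is not the attached
patch: ColumnStats, max_group_frequency and worst_case_group_size are
hypothetical names, and the per-column stats are toy values that
reproduce the 99%-duplicates counter-example from the start of the
message.

#include <stdio.h>

typedef struct ColumnStats
{
    double      ndistinct;  /* estimated number of distinct values */
    double      mcv_freq;   /* frequency of the most common value
                             * (0..1), or 0 with no MCV list */
} ColumnStats;

/* M(i): frequency of the most common value, falling back to 1/ndistinct */
static double
max_group_frequency(const ColumnStats *col)
{
    if (col->mcv_freq > 0.0)
        return col->mcv_freq;
    return 1.0 / col->ndistinct;
}

/*
 * Largest possible group when comparing values of the i-th sort key
 * (1-based): N * m(i-1), where m(i) = min(M(1), ..., M(i)).  For i = 1
 * there are no preceding columns, so the whole input is one group.
 */
static double
worst_case_group_size(double ntuples, const ColumnStats *cols, int i)
{
    double      m = 1.0;

    for (int j = 0; j < i - 1; j++)
    {
        double      M = max_group_frequency(&cols[j]);

        if (M < m)
            m = M;
    }
    return ntuples * m;
}

int
main(void)
{
    /* hypothetical two-column sort key over 1M rows */
    ColumnStats cols[2] = {
        {100.0, 0.99},          /* 99% of rows share one value */
        {1000000.0, 0.0}        /* nearly unique, no MCV list */
    };
    double      ntuples = 1000000.0;

    for (int i = 1; i <= 2; i++)
        printf("column %d: avg group = %.0f, worst case = %.0f\n",
               i,
               (i == 1) ? ntuples : ntuples / cols[i - 2].ndistinct,
               worst_case_group_size(ntuples, cols, i));

    return 0;
}

With these inputs the average group size after fixing the first column
is 10,000 tuples (ntuples/G(1) = 1,000,000/100), while the MCV-based
worst case is 990,000 (1,000,000 * 0.99) - exactly the gap a fixed 1.5
multiplier on the average cannot capture.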