Re: [HACKERS] Removing useless DISTINCT clauses - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: [HACKERS] Removing useless DISTINCT clauses |
Date | |
Msg-id | CAKJS1f9Z6-_xFFNuA5z5c7im3CbH_pBukzxVgo0sX=8zZjKhVQ@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Removing useless DISTINCT clauses (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Removing useless DISTINCT clauses
(Melanie Plageman <melanieplageman@gmail.com>)
|
List | pgsql-hackers |
On 10 January 2018 at 09:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: > 1. Once you don't have all the tlist items shown in DISTINCT, it really is > more like DISTINCT ON, seems like. I am not sure it's a good idea to set > hasDistinctOn, because that engages some planner behaviors we probably > don't want, but I'm also not sure we can get away with just ignoring the > difference. As an example, in allpaths.c there are assorted assumptions > that having a distinctClause but !hasDistinctOn means all output columns > of a subquery are listed in the distinctClause. Looking through allpaths.c for usages of distinctClauses it seems they're all either interested in distinctClauses being non-NIL, which we'll have no effect on. There's one that only cares about distinctOn only, and then there's: /* * If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our * time: all its output columns must be used in the distinctClause. */ if (subquery->distinctClause && !subquery->hasDistinctOn) return; I don't think removing this fast-path would currently have any benefit as we're not setting the ressortgroupref for the targetlist items which we've removed the distinctClause for. If we did that then, you might think that the following code in remove_unused_subquery_outputs() might be able to remove the targetlist item for a useless distinct item we've removed: /* * If it has a sortgroupref number, it's used in some sort/group * clause so we'd better not remove it. Also, don't remove any * resjunk columns, since their reason for being has nothing to do * with anybody reading the subquery's output. (It's likely that * resjunk columns in a sub-SELECT would always have ressortgroupref * set, but even if they don't, it seems imprudent to remove them.) */ if (tle->ressortgroupref || tle->resjunk) continue; but, it can't because we've not removed them yet. remove_unused_subquery_outputs() is called before subquery_planner() in set_subquery_pathlist(), so remove_useless_distinct_columns() won't even have been called by the time we get to the above code. So it looks like zeroing the ressortgroupref's of the TargetEntry items belonging to the removed distinct items would be wasted effort. I also can't imagine why we'd ever try to remove unused outputs from a subquery that's already been planned since almost all of the benefits are from allowing the planner more flexibility to do things like join removal, so I don't think there's any danger of this code becoming broken later. I looked in other places in the planner which are looking at distinctClauses. Many are just testing distinctClauses == NIL, which are not affected here since we'll always leave at least 1 item in that List. Another place that looks interesting is query_is_distinct_for() in analyzejoins.c. However, the subquery is being analyzed here long before it's getting planned, so all the distinctClauses will be fully intact at that time. > 2. There's a comment in planner.c to the effect that > > * When we have DISTINCT ON, we must sort by the more rigorous of > * DISTINCT and ORDER BY, else it won't have the desired behavior. > * Also, if we do have to do an explicit sort, we might as well use > * the more rigorous ordering to avoid a second sort later. (Note > * that the parser will have ensured that one clause is a prefix of > * the other.) > > Removing random elements of the distinctClause will break its > correspondence with the sortClause, with probably bad results. > > I do not remember for sure at the moment, but it may be that this > correspondence is only important for the case of DISTINCT ON, in which > case we could dodge the problem by not applying the optimization unless > it's plain DISTINCT. That doesn't help us with point 1 though. Good point, although we could probably just apply the same optimization to the ORDER BY to sidestep that. I've left a note in that area as something to think about for another day, although I think the refactoring done in this patch would make it a pretty easy thing to fix. > BTW, my dictionary says it's "dependent" not "dependant". You're right, regardless if that dictionary says Webster's or Oxford on the front. Fixed. I've attached an updated patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
pgsql-hackers by date: