Re: updated join removal patch - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: updated join removal patch |
Date | |
Msg-id | 26443.1253064598@sss.pgh.pa.us Whole thread Raw |
In response to | updated join removal patch (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: updated join removal patch
|
List | pgsql-hackers |
Robert Haas <robertmhaas@gmail.com> writes: > Here we go again. Following Tom's advice to not insert crocks in > add_path(), I've instead introduced a noopjoin path type which ignores > its inner side. This could possibly be simplified to just a "noop" > path that doesn't even include an outer side, but the way I've done it > here doesn't really cost anything and might allow debug pprints to > print more useful information. If, in the future, we have some other > use for a noop path, we might want to reconsider, though. I took a closer look at this version of the patch. Some comments: * I'm not really happy with the "NoopJoinPath" representation. In the first place, it isn't a join and should not be carrying a useless inner path. The other nasty thing about it is that it violates the rule that path.pathtype is supposed to be the nodetag for the executable plan node that corresponds to the path. My original thought about this was to just use the left input's paths as-is. I am not totally sure that that works --- up to now, all the members of a reloptinfo's path list have had that same reloptinfo as their parent --- but it seems less ugly than this. A possible compromise is to use T_SubqueryScan as the pathtype, with the idea that we're pretending a SubqueryScan is to be inserted and then immediately optimized away. But I don't want the inner path in there in any case. * Speaking of the left input, it's not good enough to copy up only the cheapest startup and cheapest total paths. If there are any other paths there at all, they're probably there because they have interesting sort orders. With the patch as-is, it'd be possible for join removal to be a net loss because it forces an extra sort at an upper level. Now it's possible that some of those sort orderings are only interesting to use for mergejoining in the same join that we're removing; in which case they could safely be discarded. But I'm not sure that's worth checking for, and I am sure that throwing everything away shouldn't be the base behavior. * I'm quite unimpressed with the refactoring you did to make a single function deal with merge clauses, hash clauses, *and* join removal clauses. I think that makes it close to unmaintainable and is buying little if anything speedwise. We can afford another iteration over the join clauses, especially if there's a GUC to let people turn it off. (BTW, do we really need that GUC, or is it only there for testing the patch? I'm leaning towards not having it.) * I'm not sure about this, because surely you would have tested it, but isn't it looking at the wrong side of the join clauses? I thought the idea is to prove the nullable (inner) side of the join unique. * Not entirely sure where to put the code that does the hard work (relation_is_distinct_for). I see you put it in pathnode.c beside query_is_distinct_for, but the *only* reason the latter is where it is is that it has only one caller, namely create_unique_path, which is (and belongs) in that module. Opening up pathnode's API to include a function unrelated to its purpose doesn't seem right. So I'm inclined to think they should both go someplace else, just not sure where. There doesn't seem to be any planner file whose charter is to export this type of knowledge; maybe we need to add one. * It shouldn't be that hard to support expression indexes. I think the only reason you can't do it right now is that you chose an intermediate data structure that presumes simple Vars ... but if you refactor to have bespoke code examining the indexes and joinclauses together, I think it would work all right. * I wonder whether all the relevant clauses are really to be found as join clauses. Considertab1 left join tab2 on (tab1.a = tab2.x and tab2.y = 42) where tab2 has a unique index on (x,y). This would at least suggest examining the inner rel's baserestrictclauses along with the current join's clauses. * I wouldn't really bother with cost_noopjoin. It does not, and never will, encapsulate any interesting knowledge. In other places where we have similar issues (eg no-op UniquePaths in create_unique_path), we just copy up the child path's costs without any folderol. Do you want to have another go at this, or shall I? regards, tom lane
pgsql-hackers by date: