join removal - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | join removal |
Date | |
Msg-id | 603c8f070901181956j9541113xe68df5985d558a97@mail.gmail.com Whole thread Raw |
List | pgsql-hackers |
Apologies for submitting new patches while we're still in the midst of CommitFest:November, but I think I've more or less come to the end of the reviewing that I can usefully do for 8.4 (or at least, I haven't had any requests lately, feel free...) and I don't want to wait forever to start thinking about 8.5. This patch is based on Simon Riggs' earlier work on join removal, although I don't believe I've actually used any of his code. Tom Lane's review comments on Simon's code were invaluable in getting this to work. I also understand that Simon plans to do more work in this area, so if this ends up getting sucked into his work or visca versa (or this ends up getting superseded altogether), that is fine. http://archives.postgresql.org/pgsql-patches/2008-08/msg00035.php http://archives.postgresql.org/pgsql-patches/2008-09/msg00024.php The theoretical underpinning of this patch is as follows: Joins can affect the result of a query in one of two ways: either they change the set of rows returned (by either increasing it or decreasing it), or they provide attributes. To remove a join, we need to prove that neither of these things is possible. It's pretty easy to test that the join isn't providing any attributes above the level of the join, and you'll see that rel_attrs_needed_above_join() handles that here. Verifying that the number of rows returned isn't changed by the join is harder. For the sake of simplicity, this patch only attempts to handle LEFT JOINs. Specifically, it attempts to prove that the results of scanning the outer side of the joinrel will be identical to the results of performing a merge-join, up to sorting. Nothing that happens on the inner side of the join can reduce the number of output rows (except via a WHERE-clause qual that will fail the rel_attrs_needed_above_join test anyway), so the only thing we need to worry about is the possibility that the join might increase the output row count. To do that, we need to know that there will be a sufficiently strong unique-ness property on the rows emitted by the inner side of the join. I believe that there are two main cases in which we could potentially prove this: the inner side of the join is an index scan (without resort) on a unique index, or the inner side of the join is a subselect with a group by clause over the join columns. For now, I'm only attempting to handle the first case, though it might be nice to eventually handle the other as well. Simon's original patch starts by checking whether the RelOptInfo for the relation to be dropped is a base relation, but that approach seems somewhat limiting, because we certainly want to be able to remove multiple joins in succession. Setting the paths for the joinrel {B C} to the paths for B doesn't help us when we try to form {A B C}, because now the join of {A} to {B C} is not a join against a base rel and join removal fails. The other problem is that at that level we don't really understand what's up with the join clauses: is the join operator even an equality operator? There is finally the problem that while we can get at the relevant IndexOptInfo structures and figure out which combinations of attributes have unique indices, I can't figure out any sane way to relate those attribute numbers back to the join clauses. So I've taken the alternative approach of working on the level of paths. sort_outer_and_inner(), before forming a join path, checks whether there happens to be a unique index with the same pathkeys as the inner side of the path. If so, and if no attributes are needed from the inner side of the join, the it pulls up all the paths from the outerrel into the joinrel and exits (really, it should skip the rest of add_paths_to_joinrel() as well, but it didn't seem worth refactoring that without some validation of the basic approach). This approach is limiting because sort_outer_and_inner() doesn't try every possible combination of pathkeys that might be helpful to us, but it seems likely to be adequate here for the same reasons that sort_outer_and_inner() gets away with it generally: the list of mergeclauses tends to be real short. (I think that if we were trying to optimize away an INNER join, this approach would be inadequate, because any non-merge-joinable join clauses could change the number of output rows. Here that can't happen: the most they can do is force the inner side to NULL, but if the inner side isn't being used that doesn't matter anyway. Of course, for INNER JOINs, we'd also need to worry about the merge-joinable clauses themselves stripping rows from the output due to a failure to match; we'd need to check for foreign key constraints.) I have a feeling there are probably lingering bugs in here, but it passes regression tests and some sample queries I tried still return the correct answers even after nuking one or more joins. Any comments appreciated, ...Robert
Attachment
pgsql-hackers by date: