join ordering via Simulated Annealing - Mailing list pgsql-hackers
| From | Jan Urbański |
| --- | --- |
| Subject | join ordering via Simulated Annealing |
| Date | |
| Msg-id | 4B31712B.4040205@wulczer.org |
| Responses | Re: join ordering via Simulated Annealing (bin wang <comx0330@gmail.com>), Re: join ordering via Simulated Annealing (Tom Lane <tgl@sss.pgh.pa.us>), Re: join ordering via Simulated Annealing (Adriano Lange <alange0001@gmail.com>), Re: join ordering via Simulated Annealing (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
Hi,

I've been playing with using a Simulated Annealing-type algorithm for determining the join ordering of relations. To get into context, see http://archives.postgresql.org/pgsql-hackers/2009-05/msg00098.php (there's also a TODO in the wiki). There's a nice paper on that in http://reference.kfupm.edu.sa/content/h/e/heuristic_and_randomized_optimization_fo_87585.pdf (also linked from that thread), and someone even wrote a patch: http://archives.postgresql.org/pgsql-hackers/2009-05/msg00736.php

This generally aims at being able to replace GEQO.

I have some rough prototype code, but I'm not even asking people to look at it. It's stuffed with silly things and writes three Graphviz-style .dot files in /tmp for each algorithm step. But I'd like to at least present the idea.

There are three main problems that have to be solved to get a good Simulated Annealing implementation:

o) choosing the starting point for the algorithm
o) generating the next point
o) computing the cost at the current point

The code I have now creates the initial plan by doing something similar to what gimme_tree does in GEQO, but without any kind of heuristic to try to favour joins with join restrictions (the idea is that it doesn't matter, since we only want to get *any* plan and we only do it once). Ideally, though, the starting point would be chosen randomly from the space of all possible join orderings.

I keep a binary tree of relations in memory, where the leaves are RelOptInfos with one relid and the root is a RelOptInfo with all relids. Each iteration of the algorithm picks two nodes at random (with some restrictions, but that's not important) and tries to swap them around, meaning that a tree like (use a monospace font for best results):

          (1 2 3 4)
      *(1 2)     (3 4)
    (1)    (2) *(3)   (4)

where the parenthesised things are the two chosen nodes, would get transformed into:

          (1 2 3 4)
       (3)       (1 2 4)
              (1 2)     (4)
            (1)    (2)

that is, the (1 2) node and its subtree get swapped with the (3) node, and the upper-level nodes get changed accordingly.

Sometimes a swap like that will produce an invalid join ordering; the swap is then reverted. If the join order given by the tree after the swap is legal, the paths are recomputed, much like in geqo_eval, and if the root node's cheapest path is not cheaper than before the swap, the swap gets reverted. Simulated Annealing algorithms permit uphill moves in terms of cost, with a probability that decreases as time passes, but that's not implemented yet. After a fixed number of moves, the algorithm stops and returns the root node of the tree as the single RelOptInfo.
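To make that loop concrete, here is a minimal sketch of the move/revert cycle in C. All the names (QueryTree, random_node, swap_subtrees, tree_is_legal, tree_cost) are hypothetical stand-ins, not the prototype's actual functions, and the Metropolis acceptance test for uphill moves is included only to show what the not-yet-implemented part would look like:

```c
#include <math.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical binary tree of relations; leaves hold single relids. */
typedef struct QueryTree QueryTree;

extern QueryTree *random_node(QueryTree *root); /* picks a non-root node */
extern void swap_subtrees(QueryTree *a, QueryTree *b);
extern bool tree_is_legal(QueryTree *root);     /* is the join order valid? */
extern double tree_cost(QueryTree *root);       /* rebuild paths, as geqo_eval does */

static void
anneal(QueryTree *root, int cutoff, double temperature, double cooling)
{
    double cost = tree_cost(root);

    for (int i = 0; i < cutoff; i++)
    {
        /* Pick two nodes; assume neither is the other's ancestor
         * (the "restrictions" mentioned above). */
        QueryTree *a = random_node(root);
        QueryTree *b = random_node(root);

        swap_subtrees(a, b);

        if (!tree_is_legal(root))
        {
            swap_subtrees(a, b);        /* invalid join order: revert */
            continue;
        }

        double delta = tree_cost(root) - cost;

        /* Metropolis criterion: always take downhill moves; take uphill
         * moves with probability exp(-delta/T), shrinking as T drops. */
        if (delta <= 0.0 ||
            (double) rand() / RAND_MAX < exp(-delta / temperature))
            cost += delta;              /* accept the move */
        else
            swap_subtrees(a, b);        /* reject: revert the move */

        temperature *= cooling;         /* e.g. 0.95 per iteration */
    }
}
```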
The issues with the approach are:

o) the initial state is not really a random plan, it's usually a left-deep tree (and a very inefficient one), and this might skew the results

o) is swapping random nodes like that a good way of exploring the solution space? The SA paper suggests something much simpler, but some of the moves proposed there don't really make sense when using the make_join_rel machinery:
  *) changing the inner and outer relation of a join doesn't make sense, because make_join_rel is symmetrical
  *) changing the join execution strategy (nested loop, merge join, etc.) doesn't make sense, because make_join_rel considers all possible paths for a join

o) each swap needs a full recalculation of the whole solution (geqo_eval does that, for instance); maybe it's possible to leave the subtrees of the swapped nodes intact and only recalculate the nodes above them?

o) because make_join_rel scribbles on the PlannerInfo, the same hack as in geqo_eval is necessary: truncating join_rel_list after building all the joinrels and nulling out the join_rel_hash

o) I use make_join_rel to determine whether a join is legal, and it does lots of other things along the way. That looks like wasted effort, although in the end I need to build the final rel each time anyway, to assess the resulting path's cost.

To follow the spirit of the SA algorithm more closely, it would be necessary to consider only one random path for each relation, and to make using a different path a move that the algorithm can choose while exploring the solution space. A cheaper way of determining the current solution's cost would be nice, too; fully rebuilding the final RelOptInfo after each move is annoying.

Lastly, I'm lacking good test cases, or even a testing approach: I'm generating silly queries and looking at how they get optimised, but if someone has a real dataset and examples of joins that cannot be planned with the standard planner, I would be interested to compare the results my prototype gets with those produced by GEQO.

The code, a module that you can LOAD into a backend, is here (if you really want to see it): http://git.wulczer.org/?p=saio.git

Set the saio GUC to true and saio_cutoff to the number of iterations you want.
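A session using it would look roughly like this (assuming the GUCs are registered under exactly those names; the cutoff value is just an illustration):

```sql
LOAD 'saio';            -- load the module into this backend
SET saio TO true;       -- enable the SA join search
SET saio_cutoff TO 500; -- hypothetical value: number of moves before stopping
EXPLAIN SELECT ...;     -- then inspect the join order chosen for a many-join query
```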
Cheers,
Jan