Thread: Re:allowing extensions to control planner behavior

Re:allowing extensions to control planner behavior

From
"chungui.wcg"
Date:








At 2024-08-27 00:32:53, "Robert Haas" <robertmhaas@gmail.com> wrote: >I'm somewhat expecting to be flamed to a well-done crisp for saying >this, but I think we need better ways for extensions to control the >behavior of PostgreSQL's query planner. I know of two major reasons >why somebody might want to do this. First, you might want to do >something like what pg_hint_plan does, where it essentially implements >Oracle-style hints that can be either inline or stored in a side table >and automatically applied to queries.[1] In addition to supporting >Oracle-style hints, it also supports some other kinds of hints so that >you can, for example, try to fix broken cardinality estimates. Second, >you might want to convince the planner to keep producing the same kind >of plan that it produced previously. I believe this is what Amazon's >query plan management feature[2] does, although since it is closed >source and I don't work at Amazon maybe it's actually implemented >completely differently. Regardless of what Amazon did in this case, >plan stability is a feature people want. Just trying to keep using the >same plan data structure forever doesn't seem like a good strategy, >because for example it would be fragile in the case of any DDL >changes, like dropping and recreating an index, or dropping or adding >a column. But you might want conceptually the same plan. Although it's >not frequently admitted on this mailing list, unexpected plan changes >are a frequent cause of sudden database outages, and wanting to >prevent that is a legitimate thing for a user to try to do. Naturally, >there is a risk that you might in so doing also prevent plan changes >that would have dramatically improved performance, or stick with a >plan long after you've outgrown it, but that doesn't stop people from >wanting it, or other databases (or proprietary forks of this database) >from offering it, and I don't think it should. > >We have some hooks right now that offer a few options in this area, >but there are problems. The hook that I believe to be closest to the >right thing is this one: > > /* > * Allow a plugin to editorialize on the set of Paths for this base > * relation. It could add new paths (such as CustomPaths) by calling > * add_path(), or add_partial_path() if parallel aware. It could also > * delete or modify paths added by the core code. > */ > if (set_rel_pathlist_hook) > (*set_rel_pathlist_hook) (root, rel, rti, rte); > >Unfortunately, the part about the hook having the freedom to delete >paths isn't really true. Perhaps technically you can delete a path >that you don't want to be chosen, but any paths that were dominated by >the path you deleted have already been thrown away and it's too late >to get them back. You can modify paths if you don't want to change >their costs, but if you change their costs then you have the same >problem: the contents of the pathlist at the time that you see it are >determined by the costs that each path had when it was initially >added, and it's really too late to editorialize on that. So all you >can really do here in practice is add new paths. >set_join_pathlist_hook, which applies to joinrels, is similarly >limited. appendrels don't even have an equivalent of this hook. > >So, how could we do better? > >I think there are two basic approaches that are possible here. If >someone sees a third option, let me know. First, we could allow users >to hook add_path() and add_partial_path(). That certainly provides the >flexibility on paper to accept or reject whatever paths you do or do >not want. However, I don't find this approach very appealing. One >problem is that it's likely to be fairly expensive, because add_path() >gets called A LOT. A second problem is that you don't actually get an >awful lot of context: I think anybody writing a hook would have to >write code to basically analyze each proposed path and figure out why >it was getting added and then decide what to do. In some cases that >might be fine, because for example accepting or rejecting paths based >on path type seems fairly straightforward with this approach, but as >soon as you want to do anything more complicated than that it starts >to seem difficult. If, for example, you want relation R1 to be the >driving table for the whole query plan, you're going to have to >determine whether or not that is the case for every single candidate >(partial) path that someone hands you, so you're going to end up >making that decision a whole lot of times. It doesn't sound >particularly fun. Third, even if you are doing something really simple >like trying to reject mergejoins, you've already lost the opportunity >to skip a bunch of work. If you had known when you started planning >the joinrel that you didn't care about mergejoins, you could have >skipped looking for merge-joinable clauses. Overall, while I wouldn't >be completely against further exploration of this option, I suspect >it's pretty hard to do anything useful with it. > >The other possible approach is to allow extensions to feed some >information into the planner before path generation and let that >influence which paths are generated. This is essentially what >pg_hint_plan is doing: it implements plan type hints by arranging to >flip the various enable_* GUCs on and off during the planning of >various rels. That's clever but ugly, and it ends up duplicating >substantial chunks of planner code due to the inadequacy of the >existing hooks. With some refactoring and some additional hooks, we >could make this much less ugly. But that gets at what I believe to be >the core difficulty of this approach, which is that the core planner >code needs to be somewhat aware of and on board with what the user or >the extension is trying to do. If an extension wants to force the join >order, that is likely to require different scaffolding than if it >wants to force the join methods which is again different from if a >hook wants to bias the query planner towards or against particular >indexes. Putting in hooks or other infrastructure that allows an >extension to control a particular aspect of planner behavior is to >some extent an endorsement of controlling the planner behavior in that >particular way. Since any amount of allowing the user to control the >planner tends to be controversial around here, that opens up the >spectre of putting a whole lot of effort into arguing about which >things extensions should be allowed to do, getting most of the patches >rejected, and ending up with nothing that's actually useful. > >But on the other hand, it's not like we have to design everything in a >greenfield. Other database systems have provided in-core, user-facing >features to control the planner for decades, and we can look at those >offerings -- and existing offerings in the PG space -- as we try to >judge whether a particular use case is totally insane. I am not here >to argue that everything that every system has done is completely >perfect and without notable flaws, but our own system has its own >share of flaws, and the fact that you can do very little when a >previously unproblematic query starts suddenly producing a bad plan is >definitely one of them. I believe we are long past the point where we >can simply hold our breath and pretend like there's no issue here. At >the very least, allowing extensions to control scan methods (including >choice of indexes), join methods, and join order (including which >table ends up on which side of a given join) and similar things for >aggregates and appendrels seems to me like it ought to be table >stakes. And those extensions shouldn't have to duplicate large chunks >of code or resort to gross hacks to do it. Eventually, maybe we'll >even want to have directly user-facing features to do some of this >stuff (in query hints, out of query hints, or whatever) but I think >opening the door up to extensions doing it is a good first step, >because (1) that allows different extensions to do different things >without taking a position on what the One Right Thing To Do is and (2) >if it becomes clear that something improvident has been done, it is a >lot easier to back out a hook or some C API change than it is to >back-out a user-visible feature. Or maybe we'll never want to expose a >user-visible feature here, but it can still be useful to enable >extensions. > >The attached patch, briefly mentioned above, essentially converts the >enable_* GUCs into RelOptInfo properties where the defaults are set by >the corresponding GUCs. The idea is that a hook could then change this >on a per-RelOptInfo basis before path generation happens. For >baserels, I believe that could be done from get_relation_info_hook for >baserels, and we could introduce something similar for other kinds of >rels. I don't think this is in any way the perfect approach. On the >one hand, it doesn't give you all the kinds of control over path >generation that you might want. On the other hand, the more I look at >what our enable_* GUCs actually do, the less impressed I am. IMHO, >things like enable_hashjoin make a lot of sense, but enable_sort seems >like it just controls an absolutely random smattering of behaviors in >a way that seems to me to have very little to recommend it, and I've >complained elsewhere about how enable_indexscan and >enable_indexonlyscan are really quite odd when you look at how they're >implemented. Still, this seemed like a fairly easy thing to do as a >way of demonstrating the kind of thing that we could do to provide >extensions with more control over planner behavior, and I believe it >would be concretely useful to pg_hint_plan in particular. But all that >said, as much as anything, I want to get some feedback on what >approaches and trade-offs people think might be acceptable here, >because there's not much point in me spending a bunch of time writing >code that everyone (or a critical mass of people) are going to hate. > >Thanks, > >-- >Robert Haas >EDB: http://www.enterprisedb.com > >[1] https://github.com/ossc-db/pg_hint_plan


I really admire this idea. here is my confusion: Isn't the core of this idea whether to turn the planner into a framework? Personally, I think that under PostgreSQL's heap table storage, the optimizer might be better off focusing on optimizing the generation of execution plans. It’s possible that in some specific scenarios, developers might want to intervene in the generation of execution plans by extensions. I'm not sure if these scenarios usually occur when the storage structure is also extended by developers. If so, could existing solutions like "planner_hook" potentially solve the problem?

Re: allowing extensions to control planner behavior

From
Robert Haas
Date:
On Tue, Aug 27, 2024 at 2:44 AM chungui.wcg <wcg2008zl@126.com> wrote:
> I really admire this idea.

Thanks.

>  here is my confusion: Isn't the core of this idea whether to turn the planner into a framework? Personally, I think
thatunder PostgreSQL's heap table storage, the optimizer might be better off focusing on optimizing the generation of
executionplans. It’s possible that in some specific scenarios, developers might want to intervene in the generation of
executionplans by  extensions. I'm not sure if these scenarios usually occur when the storage structure is also
extendedby developers. If so, could existing solutions like "planner_hook"  potentially solve the problem? 

You could use planner_hook if you wanted to replace the entire planner
with your own planner. However, that doesn't seem like something
practical, as the planner code is very large. The real use of the hook
is to allow running some extra code when the planner is invoked, as
demonstrated by the pg_stat_statements contrib module. To get some
meaningful control over the planner, you need something more
fine-grained. You need to be able to run code at specific points in
the planner, as we already allow with, for example,
get_relation_info_hook or set_rel_pathlist_hook.

Whether or not that constitutes "turning the planner into a framework"
is, I suppose, a question of opinion. Perhaps a more positive way to
phrase it would be "allowing for some code reuse". Right now, if you
mostly like the behavior of the planner but want a few things to be
different, you've got to duplicate a lot of code and then hack it up.
That's not very nice. I think it's better to set things up so that you
can keep most of the planner behavior but override it in a few
specific cases without a lot of difficulty.

Cases where the data is stored in some different way are really a
separate issue from what I'm talking about here. In that case, you
don't want to override the planner behavior for all tables everywhere,
so planner_hook still isn't a good solution. You only want to change
the behavior for the specific table AM that implements the new
storage. You would probably want there to be an option where
cost_seqscan() calls a tableam-specific function instead of just doing
the same thing for every AM; and maybe something similar for indexes,
although that is less clear. The details aren't quite clear, which is
probably part of why we haven't done anything yet.

But this patch set is really more about enabling use cases where the
user wants an extension to take control of the plan more explicitly,
say to avoid some bad plan that they got before and that they don't
want to get again.

--
Robert Haas
EDB: http://www.enterprisedb.com