Thread: Re: [HACKERS] get_relation_stats_hook()
On Thu, 2008-06-26 at 12:50 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > >> Surely you didn't mean ALL calls. Please be more specific about what > >> you're proposing. > > > The statistics relation STATRELATT is accessed in a few places in the > > planner. Since it is in the syscache it is accessed directly from there. > > I would like to add hooks so that stats data can come from somewhere > > else other than the syscache for tables, just as we can already do with > > get_relation_stats_hook(). > > Well, defining the hooks as replacing STATRELATT lookups seems the wrong > level of abstraction. What if you want to insert stats that can't be > represented by the current pg_statistic definition? Even if there's no > functional limitation, cons'ing up a dummy pg_statistic tuple seems the > hard way to do it --- we didn't define get_relation_info_hook as working > by supplying made-up catalog rows. Furthermore, many of the interesting > cases that someone might want to hook into are code paths that don't > even try to look up a pg_statistic row because they know there won't be > one, such as two out of the three cases in examine_variable(). > > I think you need to move up a level, and perhaps refactor some of the > existing code to make it easier to inject made-up stats. I've looked at ways of refactoring this, but the hooks provided here complement the existing functionality fairly well. We could add per column info to RelOptInfo, though toasted stats data is always going to be difficult to handle. There are other ways of doing this 1. replace selfuncs directly via the catalog 2. put in a hook for each selfunc at top level 3. decide an abstract API we would like to support 1 is possible now, 2 can still be done whether or not this patch is accepted and I will likely submit a patch for that also, though 3 is too open ended to pursue right now, IMHO. Also, this patch can be backpatched more easily, to allow helping current issues. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Attachment
Simon Riggs <simon@2ndquadrant.com> writes: > On Thu, 2008-06-26 at 12:50 -0400, Tom Lane wrote: >> I think you need to move up a level, and perhaps refactor some of the >> existing code to make it easier to inject made-up stats. > I've looked at ways of refactoring this, but the hooks provided here > complement the existing functionality fairly well. The hooks proposed here are completely useless --- it's impossible to return anything except actual catalog data, because whatever is returned is going to be subjected to ReleaseSysCache() later on. I'd think you at least need to be able to turn that into a no-op, or perhaps a pfree() call. Also it's clear that some calls of examine_variable would still return regular syscache entries, so the cleanup behavior has to be determinable on a per-call basis. I also still think that you need to reconsider the level at which the hook gets control. It's impossible given these hook definitions to create "fake" data for an appendrel or sub-SELECT, which seems to me to be an important requirement, at least till such time as the core planner handles those cases better. My feeling is that the hooks ought to be at more or less the same semantic level as examine_variable and ReleaseVariableStats, and that rather than making special-case hooks for the other couple of direct accesses to STATRELATT, the right thing is to make sure that the examine_variable hook will work for them too. I do now concur that having the hook return a pg_statistic tuple is appropriate --- after looking at the code that uses this stuff, decoupling that representation would be more work than it's worth. I think it might be a good idea to prepare an actual working example of a plug-in that does something with the hook in order to verify that you've got a useful definition. regards, tom lane
On Thu, 2008-07-10 at 14:59 -0400, Tom Lane wrote: > I think it might be a good idea to prepare an actual working example > of a plug-in that does something with the hook in order to verify > that you've got a useful definition. Thanks, will do. I was waiting for the API, but I see a concrete example will help us test whether we got the API in the right place. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2008-07-10 at 14:59 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Thu, 2008-06-26 at 12:50 -0400, Tom Lane wrote: > >> I think you need to move up a level, and perhaps refactor some of the > >> existing code to make it easier to inject made-up stats. > > > I've looked at ways of refactoring this, but the hooks provided here > > complement the existing functionality fairly well. > > The hooks proposed here are completely useless --- it's impossible > to return anything except actual catalog data, because whatever is > returned is going to be subjected to ReleaseSysCache() later on. > I'd think you at least need to be able to turn that into a no-op, > or perhaps a pfree() call. Also it's clear that some calls of > examine_variable would still return regular syscache entries, so > the cleanup behavior has to be determinable on a per-call basis. I've introduced an additional plugin function typedef void (*release_relation_stats_hook_type) (HeapTuple statstup); Simply pfree-ing things would prevent the plugin from sometimes returning a syscache entry or keeping its own cache internally. > I also still think that you need to reconsider the level at which > the hook gets control. It's impossible given these hook definitions > to create "fake" data for an appendrel or sub-SELECT, which seems to > me to be an important requirement, at least till such time as the > core planner handles those cases better. I agree with this, especially with regard to join relations. However, I'm having difficulty seeing how to identify these structures externally to allow any meaningful lookups of data. So if we have a 7 table join, then how do we specifically identify the join between tables 4 and 7, say. We can say, its "foo = foo", but that can easily occur multiple times in a many way join, especially since join columns are frequently named the same. How should we tell them apart? AFAICS we can't and yet we need to if this is to be useful. Overriding that is already reasonably easily possible anyway using operators, so I'm not seeing this plugin as a way to introduce hints. This is purely to provide access to alternate stats. > My feeling is that the hooks ought to be at more or less the same > semantic level as examine_variable and ReleaseVariableStats, and that > rather than making special-case hooks for the other couple of direct > accesses to STATRELATT, the right thing is to make sure that the > examine_variable hook will work for them too. In general, I agree, though this does cause more work in the plugin and in other areas of selfuncs.c btcostestimate() searches syscache directly rather than going through examine_variable. I suspect refactoring that would cause changes to internals of examine_variable, possibly also changing the API to cater for index stats. I have tried to follow this through as you suggest, but the thought that examine_variable is the right level doesn't seem as clean as it sounded when you first suggested. With those problems in mind, can we stick to the original plan? It's reasonably clear how it works, and it does work correctly now with the new release call mentioned above. Or can you suggest the refactoring you would like to see in that area? Or better still commit something for me to use as a base. > I think it might be a good idea to prepare an actual working example > of a plug-in that does something with the hook in order to verify > that you've got a useful definition. I've tested this new patch with the plugin I showed you before, rewired so it makes syscache calls via patch, as a "null" test case. All works, no problems. I'll submit the fully working plugin once we've stabilised the API. It's designed as a contrib module, so it can go in pgfoundry or contrib. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Attachment
On Wed, 2008-08-06 at 16:37 +0100, Simon Riggs wrote: > I'll submit the fully working plugin once we've stabilised the API. It's > designed as a contrib module, so it can go in pgfoundry or contrib. OK, here's fully working plugin, plus API patch. I expect to open a pgfoundry project for the plugin, but will wait for the main patch review. I've tried the APIs three different ways and this seems cleanest and most widely applicable approach. It's possible to add calls in more places, but I haven't done this for reasons discussed previously. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Attachment
On Wed, 2008-08-06 at 23:38 +0100, Simon Riggs wrote: > On Wed, 2008-08-06 at 16:37 +0100, Simon Riggs wrote: > > > I'll submit the fully working plugin once we've stabilised the API. It's > > designed as a contrib module, so it can go in pgfoundry or contrib. > > OK, here's fully working plugin, plus API patch. > > I expect to open a pgfoundry project for the plugin, but will wait for > the main patch review. > > I've tried the APIs three different ways and this seems cleanest and > most widely applicable approach. It's possible to add calls in more > places, but I haven't done this for reasons discussed previously. New version of Postgres patch, v5. Implements suggested changes. Ready for review and apply. New version of stats plugin, v3. Works with v5. Corrected problems: * now loads using preload_shared_libraries as well as LOAD * example test script fix -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Attachment
Simon Riggs <simon@2ndQuadrant.com> writes: > New version of Postgres patch, v5. Implements suggested changes. > Ready for review and apply. Applied with some revisions. The method for passing back freefunc didn't work, so I made it pass the whole VariableStatsData struct instead; this might allow some additional flexibility by changing other fields besides the intended statsTuple and freefunc. Also, I was still unhappy about adding a hook in the midst of code that clearly needs improvement, without making it possible for the hook to override the adjacent broken code paths; so I refactored the API a bit for that too. The plugin function would now be something like this: static bool plugin_get_relation_stats(PlannerInfo *root, RangeTblEntry *rte, AttrNumber attnum, VariableStatData *vardata) { HeapTuple statstup = NULL; /* For now, we only cover the simple-relation case */ if (rte->rtekind != RTE_RELATION || rte->inh) return false; if (!get_tom_stats_tupletable(rte->relid, attnum)) return false; /* * Get stats if present. We asked for only one row, so no need for loops. */ if (SPI_processed > 0) statstup = SPI_copytuple(SPI_tuptable->vals[0]); SPI_freetuptable(SPI_tuptable); SPI_finish(); if (!statstup) return false; /* should this happen? */ vardata->statsTuple = statstup; /* define function to use when time to free the tuple */ vardata->freefunc = heap_freetuple; return true; } and if you want to insert stats for expression indexes then there's a separate get_index_stats_hook for that. regards, tom lane