Thread: ANALYZE patch for review
Hi everyone,

Here is a first attempt at a patch to allow a customised ANALYZE function to be specified for user-defined types, relating to the following two threads:

http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00236.php

This is my first attempt at a patch for Postgres, so I don't expect it to be applied without significant review. Included along with this email is a sample datatype that can be used to see that the ANALYZE function is being called, by inserting a fixed value into the pg_statistic table.

Some points:

1) This patch does nothing other than provide a mechanism for the user to place their own values in the pg_statistic table. Currently I am still trying to complete the loop by finding out how, and in what format, a GiST selectivity function has access to this data. The user-defined function uses a new STATISTICS_KIND_CUSTOM constant to hold the data, as I am not 100% sure how the planner interprets the STATISTIC_KIND_* values in the pg_statistic table. My aim is to be able to store a 2D histogram so that spatial row count estimates can be used by the planner in PostGIS (http://postgis.refractions.net) without having to maintain the statistics by manually running a stored procedure. The pg_statistic.h file doesn't seem clear to me as to what the values of the various columns should be when not dealing with single one-dimensional histograms...

2) All existing types are assigned the default analyze function, pg_analyze_alg_default().

3) The function is called within examine_attribute() to set up a valid VacAttrStats structure if the column is analyzable and to point to the algorithm function itself, which currently uses the existing API of compute_scalar_stats() and compute_minimal_stats().

4) The VacAttrStats structure may need to be moved out to a different .h file for access by the programmer (currently it is in commands/vacuum.h) - maybe a new analyze.h would be better?
5) Currently no thought has been given to providing statistics on functional indexes (see http://archives.postgresql.org/pgsql-general/2004-01/msg00978.php).

In other words, the basic work is done but there is still some way to go before it is complete. Hopefully by putting this out now, other pgsql-hackers will get the chance to review and refine the patch (particularly in the area of functional indexes) so this feature can make it out into 7.5.

Many thanks to the hackers who have spent their time helping me get this far (particularly Tom).

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446

This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
Attachment
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> Here is a first attempt at a patch to allow a customised ANALYZE
> function to be specified for user-defined types, relating to the
> following two threads:
> http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php and
> http://archives.postgresql.org/pgsql-hackers/2004-01/msg00236.php

This seems to be going in mostly the right direction, but I think you've made a few errors.

One is that I think that the VacAttrStats structure should be palloc'd by the type-specific typanalyze function, and not by general code. The reason for this is that a type-specific function could allocate a larger struct of which VacAttrStats is just the first field, leaving room for additional fields that can communicate data to the later type-specific analysis routine (which would know that it is receiving this type and not just generic VacAttrStats). This is already useful for the standard analysis code --- it would want to keep eqopr, eqfunc, and ltopr there, so as to avoid the extra lookups required in your patch. Other datatypes would likely need comparable functionality.

This would mean that pretty much the whole of examine_attribute is treated as type-specific code. In consequence there would be a certain amount of duplication of code across different type-specific setup routines, but that does not bother me.

Also, I'd suggest that it would be better to define a zero in pg_type.typanalyze as selecting the default analysis routine. This would put the special case in just one place rather than having that knowledge all through pg_type.h plus CreateType. (If you do this, I think the default analysis routine wouldn't even need a pg_proc entry.)

> The user-defined function uses a new STATISTICS_KIND_CUSTOM
> constant to hold the data as I am not 100% sure how the planner
> interprets the STATISTIC_KIND_* values in the pg_statistic table.
> My aim is to be able to store a 2D histogram so that spatial row count
> estimates can be used by the planner in PostGIS
> (http://postgis.refractions.net) without having to maintain the
> statistics by manually running a stored procedure. The pg_statistic.h
> file doesn't seem clear to me as to what the values of the various
> columns should be when not dealing with single one-dimensional
> histograms...

Obviously we'd need to define more STATISTIC_KIND_xxx values to represent any additional kinds of statistics that are getting stored. "STATISTICS_KIND_CUSTOM" is exactly the way *not* to do that, as it would lead people to use the same ID code for different things. What we probably need to do is think of a suitable scheme for reserving STATISTIC_KIND codes for different uses. A really off-the-cuff suggestion is:

1-99: reserved for assignment by the core Postgres project
100-199: reserved for assignment by PostGIS
200-9999: reserved for other globally-known stats kinds
10000-32767: reserved for private site-local use

People writing private datatypes would select random values above 10000 and have little chance of conflict. Anybody who wanted to write code for public distribution would want to come to us and get an assignment of space below 10000.

> 4) The VacAttrStats structure may need to be moved out to a different
> .h file for access by the programmer (currently it is in
> commands/vacuum.h) - maybe a new analyze.h would be better?

vacuum.h seems fine.

regards, tom lane
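[Editor's note: the reservation scheme above is easy to encode as constants plus a range check. The names below are purely illustrative - nothing like them exists in the patch or in pg_statistic.h - but they show how a site-local kind code would be tested against the proposed ranges.]

```c
#include <stdint.h>

/* Hypothetical range boundaries for STATISTIC_KIND codes, following
 * the off-the-cuff scheme above; all of these names are made up. */
#define STAKIND_CORE_MIN        1   /* core Postgres project */
#define STAKIND_CORE_MAX       99
#define STAKIND_POSTGIS_MIN   100   /* assigned to PostGIS */
#define STAKIND_POSTGIS_MAX   199
#define STAKIND_PUBLIC_MIN    200   /* other globally-known stats kinds */
#define STAKIND_PUBLIC_MAX   9999
#define STAKIND_PRIVATE_MIN 10000   /* private site-local use */
#define STAKIND_PRIVATE_MAX 32767   /* top of the int2 stakind space */

/* Returns 1 if a kind code falls in the private site-local range. */
static int stakind_is_private(int16_t kind)
{
    return kind >= STAKIND_PRIVATE_MIN && kind <= STAKIND_PRIVATE_MAX;
}
```

A type author planning public distribution would then ask for a sub-10000 assignment instead of picking a random private value.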
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 25 January 2004 23:08
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > Here is a first attempt at a patch to allow a customised ANALYZE
> > function to be specified for user-defined types, relating to the
> > following two threads:
> > http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php and
> > http://archives.postgresql.org/pgsql-hackers/2004-01/msg00236.php
>
> This seems to be going in mostly the right direction, but I
> think you've made a few errors.
>
> One is that I think that the VacAttrStats structure should be
> palloc'd by the type-specific typanalyze function, and not by
> general code. The reason for this is that a type-specific
> function could allocate a larger struct of which VacAttrStats
> is just the first field, leaving room for additional fields
> that can communicate data to the later type-specific analysis
> routine (which would know that it is receiving this type and
> not just generic VacAttrStats). This is already useful for
> the standard analysis code --- it would want to keep eqopr,
> eqfunc, and ltopr there, so as to avoid the extra lookups
> required in your patch. Other datatypes would likely need
> comparable functionality.
>
> This would mean that pretty much the whole of
> examine_attribute is treated as type-specific code. In
> consequence there would be a certain amount of duplication of
> code across different type-specific setup routines, but that
> does not bother me.

My thinking behind this was that examine_attribute decides whether a column of a given type is analyzable, so it made sense to provide the user with the type and column information.
The one thing I do want to ensure is that the user function doesn't have to do things like check whether the column is dropped or whether the stats target is zero - that should be done by Postgres IMHO. I do like the idea of having a custom structure passed between functions as, like you say, it makes things a lot more flexible... Will see what I can do about it.

> Also, I'd suggest that it would be better to define a zero in
> pg_type.typanalyze as selecting the default analysis routine.
> This would put the special case in just one place rather than
> having that knowledge all through pg_type.h plus CreateType.
> (If you do this, I think the default analysis routine
> wouldn't even need a pg_proc entry.)

I'll have a look at this too. The thing I really want to avoid is having to create lots of 'special case' code depending upon whether the default analysis routine is used or an external function. At the moment, the logic just calls the typanalyze entry using OidFunctionCall1 regardless, which seems quite a neat solution.

> > The user-defined function uses a new STATISTICS_KIND_CUSTOM
> > constant to hold the data as I am not 100% sure how the planner
> > interprets the STATISTIC_KIND_* values in the pg_statistic table.
> > My aim is to be able to store a 2D histogram so that spatial row
> > count estimates can be used by the planner in PostGIS
> > (http://postgis.refractions.net) without having to maintain the
> > statistics by manually running a stored procedure. The
> > pg_statistic.h file doesn't seem clear to me as to what the values
> > of the various columns should be when not dealing with single
> > one-dimensional histograms...
>
> Obviously we'd need to define more STATISTIC_KIND_xxx values
> to represent any additional kinds of statistics that are
> getting stored. "STATISTICS_KIND_CUSTOM" is exactly the way
> *not* to do that, as it would lead people to use the same ID
> code for different things.
> What we probably need to do is
> think of a suitable scheme for reserving STATISTIC_KIND codes
> for different uses. A really off-the-cuff suggestion is:
>
> 1-99: reserved for assignment by the core Postgres project
> 100-199: reserved for assignment by PostGIS
> 200-9999: reserved for other globally-known stats kinds
> 10000-32767: reserved for private site-local use
>
> People writing private datatypes would select random values
> above 10000 and have little chance of conflict. Anybody who
> wanted to write code for public distribution would want to
> come to us and get an assignment of space below 10000.

Would it really be necessary to do this? My understanding would be that using STATISTICS_KIND_CUSTOM merely marks the pg_statistic entry as being valid, i.e. it is valid data but the row sta* contents can only be interpreted by the specified type and operator combination, and should not be interpreted by any of the standard estimation functions.

Also, in a selectivity estimator function, would it be reasonable to call get_attstatsslot() (i.e. is it a relatively stable API for external as well as internal use)? This function does a lot of the work that would be required by an estimation function.

> > 4) The VacAttrStats structure may need to be moved out to a
> > different .h file for access by the programmer (currently it is
> > in commands/vacuum.h) - maybe a new analyze.h would be better?
>
> vacuum.h seems fine.

OK, I'll leave it where it is.

Many thanks,

Mark.
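[Editor's note: the dispatch question in this exchange - call through the catalog regardless, or special-case zero - can be sketched with stubs. None of the names below are the real fmgr or analyze code; this only illustrates the control flow Tom proposes, where a typanalyze of 0 selects the built-in routine directly and so needs no pg_proc entry.]

```c
#include <stdint.h>

typedef uint32_t Oid;              /* stand-in for the real Oid type */
#define InvalidOid ((Oid) 0)

/* Stub standing in for the built-in default analysis setup routine. */
static int std_typanalyze(void) { return 1; }

/* Stub standing in for dispatching a user typanalyze via OidFunctionCall1. */
static int call_user_typanalyze(Oid fnoid)
{
    (void) fnoid;                  /* would go through fmgr in reality */
    return 2;
}

/* Sketch: zero in pg_type.typanalyze means "use the default", so the
 * special case lives in exactly one place. */
static int dispatch_typanalyze(Oid typanalyze)
{
    if (typanalyze == InvalidOid)
        return std_typanalyze();
    return call_user_typanalyze(typanalyze);
}
```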
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
>> This would mean that pretty much the whole of
>> examine_attribute is treated as type-specific code. In
>> consequence there would be a certain amount of duplication of
>> code across different type-specific setup routines, but that
>> does not bother me.

> My thinking behind this was that examine_attribute decides whether a
> column of a given type is analyzable so it made sense to provide the
> user with the type and column information. The one thing I do want to
> ensure is that the user function doesn't have to do things like check if
> the column is dropped or if the stats target is zero - that should be
> done by Postgres IMHO.

[ shrug ... ] It's only a couple of lines either way, and you could argue that the present behavior for attstattarget = 0 should be subject to customization anyway. But if you prefer to pass in the pg_attribute and pg_type rows already fetched, I won't object.

> I'll have a look at this too. The thing I really want to avoid is having
> to create lots of 'special case' code depending upon whether the default
> analysis routine is used or an external function. At the moment, the
> logic just calls the typanalyze entry using OidFunctionCall1 regardless,
> which seems quite a neat solution.

My point was that there is going to be a 'special case' to provide the default somewhere --- if not here, then in the CREATE TYPE code. It seems to me that localizing knowledge of what the default behavior is wouldn't be a bad idea.

>> What we probably need to do is
>> think of a suitable scheme for reserving STATISTIC_KIND codes
>> for different uses.

> Would it really be necessary to do this? My understanding would
> be that using STATISTICS_KIND_CUSTOM merely marks the pg_statistic entry
> as being valid, i.e.
> it is valid data but the row sta* contents can only
> be interpreted by the specified type and operator combination, and
> should not be interpreted by any of the standard estimation
> functions.

I don't much like that. It amounts to assuming that you will always be able to guess from context what a STATISTICS_KIND_CUSTOM entry actually is. I think that's shortsighted. As soon as there are a few different custom estimation functions out there, they will be tripping over each others' data. You may be thinking that only one kind of custom data would actually exist for any one datatype, but I'd turn that around: what of custom analysis functions that are supposed to work for a range of datatypes? It'd be very unsafe for such functions to all use the same STATISTICS_KIND_CUSTOM label. The operator selectivity functions that need to use the data couldn't be sure what they were looking at.

> Also in a selectivity estimator function would it be reasonable to call
> get_attstatsslot (i.e. is it a relatively stable API for external as well
> as internal use)

Yes, it's as stable as any other part of Postgres ... I don't see any reason that an external selectivity estimator should be coded any differently from the built-in ones.

regards, tom lane
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 27 January 2004 17:16
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
> > My thinking behind this was that examine_attribute decides whether a
> > column of a given type is analyzable so it made sense to provide the
> > user with the type and column information. The one thing I do want to
> > ensure is that the user function doesn't have to do things like check
> > if the column is dropped or if the stats target is zero - that should
> > be done by Postgres IMHO.
>
> [ shrug ... ] It's only a couple lines either way, and you
> could argue that the present behavior for attstattarget = 0
> should be subject to customization anyway. But if you prefer
> to pass in the pg_attribute and pg_type rows already fetched,
> I won't object.

OK, you've won me over on the stats target. I can see where someone could have a case where an operation could fail if no statistics are available, and hence they would want to intercept the case where the target is zero to place in some initial data.

I have a question about something you wrote in your previous email (this is more a C question than anything else):

> One is that I think that the VacAttrStats structure should be
> palloc'd by the type-specific typanalyze function, and not by
> general code. The reason for this is that a type-specific
> function could allocate a larger struct of which VacAttrStats
> is just the first field, leaving room for additional fields
> that can communicate data to the later type-specific analysis
> routine (which would know that it is receiving this type and
> not just generic VacAttrStats). This is already useful for
> the standard analysis code --- it would want to keep eqopr,
> eqfunc, and ltopr there, so as to avoid the extra lookups
> required in your patch. Other datatypes would likely need
> comparable functionality.
Is what you're saying that a custom function would do something like this:

typedef struct
{
    VacAttrStats v;
    int mydata1;
    int mydata2;
    int mydata3;
} mystruct;

myanalyzeinfo = palloc0(sizeof(mystruct));
..
..
return myanalyzeinfo;

This means that a structure of kind mystruct is returned back to analyze_rel(), but analyze_rel() still wants to access some information from VacAttrStats. Does that mean that the reference to VacAttrStats in analyze_rel() would be derived something like this?

VacAttrStats *vacattrptr;

vacattrptr = (VacAttrStats *) myanalyzeinfo;
vacattrptr->minrows = 2;

Firstly, is this right? And secondly, if so, is there a way of arranging the structures to ensure that the compiler throws a warning if VacAttrStats is not the first element of the structure?

> > I'll have a look at this too. The thing I really want to avoid is
> > having to create lots of 'special case' code depending upon whether
> > the default analysis routine is used or an external function. At the
> > moment, the logic just calls the typanalyze entry using
> > OidFunctionCall1 regardless which seems quite a neat solution.
>
> My point was that there is going to be a 'special case' to
> provide the default somewhere --- if not here, then in the
> CREATE TYPE code. It seems to me that localizing knowledge
> of what the default behavior is wouldn't be a bad idea.

I suppose the question here would be: does having a default case in this way make things harder for developers to understand/follow the code? If it makes sense that a value of 0 would indicate the default statistics routines (which would be called directly) and the localization still maintains readability, then I'm happy to go with this.

> >> What we probably need to do is
> >> think of a suitable scheme for reserving STATISTIC_KIND codes
> >> for different uses.
>
> > Would it really be necessary to do this?
> > My understanding
> > would be that using STATISTICS_KIND_CUSTOM merely marks the
> > pg_statistic entry as being valid, i.e. it is valid data but the row
> > sta* contents can only be interpreted by the specified type and
> > operator combination, and should not be interpreted by any
> > of the standard estimation functions.
>
> I don't much like that. It amounts to assuming that you will
> always be able to guess from context what a
> STATISTICS_KIND_CUSTOM entry actually is. I think that's
> shortsighted. As soon as there are a few different custom
> estimation functions out there, they will be tripping over
> each others' data. You may be thinking that only one kind of
> custom data would actually exist for any one datatype, but
> I'd turn that around: what of custom analysis functions that
> are supposed to work for a range of datatypes? It'd be very
> unsafe for such functions to all use the same
> STATISTICS_KIND_CUSTOM label. The operator selectivity
> functions that need to use the data couldn't be sure what
> they were looking at.

I'm beginning to think that perhaps we're looking at this in the wrong way, and that a more elegant version of what you're suggesting could be implemented using a major/minor method of identifying a statistics type. As an example, for existing types, the major would identify the owner of the data type as being STATSOWNER_PGDG, while the minor would indicate the algorithm, e.g. ALG_INDEX_SCALAR or ALG_INDEX_MCV. In the case of PostGIS this would mean that the PGDG would only have to assign them a STATSOWNER_POSTGIS, and then they would be free to divide the minor to indicate whatever type of algorithm they need, without having to allocate chunks of potentially unused number space.

This also then brings up the idea that different types of statistics can be kept on a given relation; for example, index selectivity stats could be one, whereas another could be the number of times the letter 'e' appears.
Your example where a custom analysis function works on a range of datatypes can be handled by the stats owners themselves, by allocating a block of minor numbers for an algorithm where each entry can correspond to a different data type as they require.

I suppose this also means that there might need to be a mechanism where a custom analysis function would want to add more than one row to the pg_statistic table (i.e. it collects more than one set of statistics per column); I can see this being used to implement statistics keeping on functional indexes, for example, where for a column X, a copy is made and f(X) calculated for each index and added to the pg_statistic table.

> > Also in a selectivity estimator function would it be reasonable to
> > call get_attstatsslot (i.e. is it a relatively stable API for
> > external as well as internal use)
>
> Yes, it's as stable as any other part of Postgres ... I don't
> see any reason that an external selectivity estimator should
> be coded any differently from the built-in ones.

OK, great. That makes life slightly easier.

Many thanks,

Mark.
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> Is what you're saying that a custom function would do something like
> this:

> typedef struct
> {
>     VacAttrStats v;
>     int mydata1;
>     int mydata2;
>     int mydata3;
> } mystruct;

> myanalyzeinfo = palloc0(sizeof(mystruct));
> ..
> ..
> return myanalyzeinfo;

Exactly, except that the return would look like

    return (VacAttrStats *) myanalyzeinfo;

(or at least it would if you weren't converting to Datum anyway). The main analyze code would think it had a plain pointer to VacAttrStats, but the data analysis subroutine associated with this setup routine would know it could cast the pointer back to (mystruct *).

> VacAttrStats *vacattrptr;
> vacattrptr = (VacAttrStats *) myanalyzeinfo;
> vacattrptr->minrows = 2;

Seems like the hard way; you can just access myanalyzeinfo->v.minrows when you're working with a pointer declared as mystruct *.

> is there a way of arranging
> the structures to ensure that the compiler throws a warning if
> VacAttrStats is not the first element of the structure?

Not that I know of, but we use techniques like this in very many places in the PG code and have not had troubles of that sort. I'm not too concerned about that.

> I'm beginning to think that perhaps we're looking at this in the wrong
> way, and that a more elegant version of what you're suggesting could be
> implemented using a major/minor method of identifying a statistics type.

If you suppose that the "major" field is the upper bits of the statistics ID value, then this is just a slightly different way of thinking about the range-based allocation method I suggested before. However, the range-based method can adapt to allocating different amounts of identifier space to different owners, whereas a major/minor approach can't easily do that, since you've defined it to be 2^N minor IDs for each major code.
> I suppose this also means that there might need to be a mechanism where
> a custom analysis function would want to add more than one row to the
> pg_statistic table (i.e. it collects more than one set of statistics per
> column);

As long as the custom analysis function is tied to a datatype, I don't see the need for that. It already has the ability to define up to four different "kinds" of stats for its column, and if someone makes a case for wanting more, I'd be inclined to widen pg_statistic rather than get into multiple rows per column. (To point out just one problem, such an arrangement would break the unique index scheme presently used.)

I'm not sure yet about whether we need special hooks for keeping stats on functional indexes, but if we do they'd surely not be looked up by datatype, or at least not by datatype alone.

regards, tom lane
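[Editor's note: the first-member trick discussed in this exchange can be demonstrated outside Postgres with a couple of mock structs. C guarantees that a pointer to a struct, suitably converted, points to its first member, which is what makes the cast in both directions safe. The VacAttrStats here is a stripped-down stand-in, not the real definition.]

```c
#include <stdlib.h>

/* Stripped-down stand-in for the real VacAttrStats. */
typedef struct
{
    int minrows;
} VacAttrStats;

/* Type-specific extension: VacAttrStats must be the first member so
 * that a mystruct pointer is also usable as a VacAttrStats pointer. */
typedef struct
{
    VacAttrStats v;
    int mydata1;
} mystruct;

/* Plays the role of a type-specific typanalyze function: it allocates
 * the larger struct but hands back the generic pointer type. */
static VacAttrStats *my_typanalyze(void)
{
    mystruct *myanalyzeinfo = calloc(1, sizeof(mystruct));

    myanalyzeinfo->v.minrows = 2;   /* field the generic code reads */
    myanalyzeinfo->mydata1 = 42;    /* private to the compute routine */
    return (VacAttrStats *) myanalyzeinfo;
}
```

The generic analyze code reads minrows through the VacAttrStats pointer; only the matching compute routine casts back to mystruct * to reach mydata1.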
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 29 January 2004 15:31
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
> <lots cut about pointers>

OK, I've had another attempt at writing the code as you suggested, but the more I work on it the less I like it :(. What I would like to do is make the VacAttrStats structure so that it just contains the information that is updated in the pg_statistic table; however, this fell apart when I realised that update_attstats() suddenly requires the attr and attrtype fields to be present. Doh.

So I'd like to propose a slightly different solution. I think that examine_attribute() should return a pointer to a custom structure containing any information that needs to be passed to the datatype-specific routine (not the entire VacAttrStats structure), or NULL if the column should not be analyzed.

I'm also considering changing the examine_attribute() input parameters to be Relation, Attribute, Type for the current column, along with a pointer to a bool to indicate whether or not the column should be analyzed. If examine_attribute() sets the bool to false then the column is ignored. If the bool is set to true then a VacAttrStats structure is created in memory, and the Attribute and Type tuple information is copied into the VacAttrStats structure. A new field in VacAttrStats will contain the pointer to the custom structure returned by examine_attribute(), which can then be passed into the compute_*_stats() functions as an extra parameter.

This seems to achieve the aims of abstracting the statistics data from the intermediate information required by the statistics routines, allowing extra/custom data to be passed between the typanalyze function and the statistics algorithm, and allowing the user to have the attr and attrtype structures given to them.
The only thing I don't really like about this is providing a pointer to a bool in examine_attribute() - however, this is needed to distinguish between a NULL meaning 'I have no custom data but the analyze function should still be called' and 'This column should not be analyzed'. I can't think of a better solution at the moment.

> > I'm beginning to think that perhaps we're looking at this in the
> > wrong way, and that a more elegant version of what you're suggesting
> > could be implemented using a major/minor method of identifying a
> > statistics type.
>
> If you suppose that the "major" field is the upper bits of
> the statistics ID value, then this is just a slightly
> different way of thinking about the range-based allocation
> method I suggested before. However, the range-based method
> can adapt to allocating different amounts of identifier space
> to different owners, whereas a major/minor approach can't
> easily do that since you've defined it to be 2^N minor IDs
> for each major code.

I was thinking perhaps in terms of an extra staowner int2 field in pg_statistic where the IDs are allocated by the PGDG. Then each group/project would only require one owner id to be allocated to them, and would then have the existing 2^16 stakind space to organise themselves. The advantage of this is that projects can allocate their own stakind fields, implementing new or improved statistic algorithms without having to wait on a new allocation from the PGDG.

Many thanks,

Mark.
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> So I'd like to propose a slightly different solution. I think that
> examine_attribute() should return a pointer to a custom structure
> containing any information that needs to be passed to the datatype
> specific routine (not the entire VacAttrStats structure), or NULL if the
> column should not be analyzed.

Just a void* you mean? Sure, we could do that, although it might result in some duplicated effort. Another possibility is that analyze.c goes ahead and creates a VacAttrStats struct (including a void* extension pointer that it initializes to NULL) and then passes the struct to examine_attribute, which returns a bool and optionally modifies fields in the VacAttrStats struct --- in particular the requested-row-count and extension pointer. If false is returned then analyze.c just throws away the VacAttrStats struct instead of including it in its to-do list.

>> If you suppose that the "major" field is the upper bits of
>> the statistics ID value, then this is just a slightly
>> different way of thinking about the range-based allocation
>> method I suggested before.

> I was thinking perhaps in terms of an extra staowner int2 field in
> pg_statistic where the IDs are allocated by the PGDG.

I do not really want to add a field to pg_statistic. That complicates matters more than we have a justification to do. Nor do we have any reason at this point to think that we need a 2^32 namespace for statistics kinds. (If 2^16 ever starts to look cramped, we can just widen the fields to int4 without introducing any backwards compatibility issues --- existing code assignments will remain valid. But I find it hard to foresee that happening.)

regards, tom lane
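[Editor's note: the alternative shape sketched above - analyze.c owns the VacAttrStats and the per-type hook merely fills it in or declines - might look roughly like the following. The field and function names are illustrative, not the committed API.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of a VacAttrStats carrying a void* extension slot that
 * analyze.c would initialize to NULL (names illustrative). */
typedef struct
{
    int   minrows;      /* requested row count for the sample */
    void *extra_data;   /* type-specific extension, NULL by default */
} VacAttrStats;

/* Plays the role of examine_attribute: return false to have
 * analyze.c throw the struct away, true after filling it in. */
static bool examine_attribute_sketch(VacAttrStats *stats, bool analyzable)
{
    if (!analyzable)
        return false;
    stats->minrows = 300;   /* e.g. 300 * the statistics target */
    return true;
}
```

A type-specific setup routine would additionally point extra_data at its own palloc'd state for the later compute pass.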
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 02 February 2004 16:54
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > So I'd like to propose a slightly different solution. I think that
> > examine_attribute() should return a pointer to a custom structure
> > containing any information that needs to be passed to the datatype
> > specific routine (not the entire VacAttrStats structure), or NULL if
> > the column should not be analyzed.
>
> Just a void* you mean? Sure, we could do that, although it
> might result in some duplicated effort. Another possibility
> is that analyze.c goes ahead and creates a VacAttrStats
> struct (including a void* extension pointer that it
> initializes to NULL) and then passes the struct to
> examine_attribute, which returns a bool and optionally
> modifies fields in the VacAttrStats struct --- in particular
> the requested-row-count and extension pointer. If false is
> returned then analyze.c just throws away the VacAttrStats
> struct instead of including it in its to-do list.

Yup, indeed. Please find enclosed the latest version of the analyze patch, taking into account all the things we have discussed in the thread. It feels a lot cleaner than the other - the advantage of having analyze_rel() create the VacAttrStats structure is that you can put some vaguely sensible defaults in for the user-defined function, so it shouldn't bomb too badly if someone gets it wrong, plus the attribute and type information can be fed in automagically.

This rewrite has the benefit that it gets rid of the duplication of routines having to get their own copies of the attribute and type rows based on attnum, and the (void *) pointer also gives a clear separation between the data required for the pg_statistic table entry and the data used by the current default routine.
I also altered the examine_attribute() to allow the user-defined function to check attstattarget for the column manually (in case they want to do something special with minus or zero values), and also as discussed the internal stats routine is called directly without requiring an entry in the catalog. > >> If you suppose that the "major" field is the upper bits of > >> the statistics ID value, then this is just a slightly > >> different way of thinking about the range-based allocation > >> method I suggested before. > > > I was thinking perhaps in terms of an extra staowner int2 field in > > pg_statistic where the IDs are allocated by the PGDG. > > I do not really want to add a field to pg_statistic. That > complicates matters more than we have a justification to do. > Nor do we have any reason at this point to think that we need > a 2^32 namespace for statistics kinds. (If 2^16 ever starts > to look cramped, we can just widen the fields to int4 without > introducing any backwards compatibility issues --- existing > code assignments will remain valid. But I find it hard to > foresee that happening.) OK, I've left this as it is at the moment. Cheers, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
Hi Tom,

Did you manage to find a spare moment over the past week to check the
revised ANALYZE patch I posted to the list? Is there any more work that
needs to be done to it before it can be applied to CVS?

TIA,

Mark.
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > Yup indeed. Please find enclosed the latest version of the analyze patch > taking into account all the things we have discussed in the thread. I've reviewed and applied this with some small changes. You did a good job --- the only things you missed AFAICS were pg_dump support and documentation. I changed the API slightly for the typanalyze function and the compute_stats subroutine. I don't think it's necessary or appropriate to pass in the Relation pointer to typanalyze; that simplifies its signature to a single INTERNAL parameter. (If we did want to pass the Relation, I'd be inclined to make it a field in VacAttrStats.) Also I tweaked compute_stats so that the attribute number of the column is passed explicitly. I foresee needing this for functional-index stats gathering --- it's likely that the accumulated sample rows will be built on the fly and will have a tupdesc that includes both the main table rows and the functional index columns. So compute_stats shouldn't assume that the nominal column number is the right thing to use to extract the column it wants. > I also altered the examine_attribute() to allow the user-defined > function to check attstattarget for the column manually (in case they > want to do something special with minus or zero values), As I've committed it, the convention that zero means "no stats" is enforced by examine_attribute(), but the typanalyze function is responsible for deciding what a negative value means. Seem reasonable? Also, I put the following documentation about "kind" values into pg_statistic.h. 
/*
 * The present allocation of "kind" codes is:
 *
 *	1-99:		reserved for assignment by the core PostgreSQL project
 *			(values in this range will be documented in this file)
 *	100-199:	reserved for assignment by the PostGIS project
 *			(values to be documented in PostGIS documentation)
 *	200-9999:	reserved for future public assignments
 *
 * For private use you may choose a "kind" code at random in the range
 * 10000-30000.  However, for code that is to be widely disseminated it
 * is better to obtain a publicly defined "kind" code by request from
 * the PostgreSQL Global Development Group.
 */

			regards, tom lane
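[Editor's note: to make the committed API concrete, here is a rough
sketch of what a user-defined typanalyze function might look like. It is
illustrative only, not code from the patch: my_typanalyze and
my_compute_stats are hypothetical names, and the exact VacAttrStats
field names should be checked against commands/vacuum.h.]

```c
/*
 * Sketch of a custom typanalyze function under the committed API: it
 * receives a prefilled VacAttrStats struct via a single INTERNAL
 * parameter, and either returns false (skip this column) or fills in
 * the analysis hooks and returns true.
 */
Datum
my_typanalyze(PG_FUNCTION_ARGS)
{
	VacAttrStats *stats = (VacAttrStats *) PG_GETARG_POINTER(0);

	/*
	 * A zero attstattarget ("no stats") is already filtered out by
	 * examine_attribute(); a negative value is ours to interpret.
	 */
	if (stats->attr->attstattarget < 0)
		stats->attr->attstattarget = 100;	/* pick a default */

	stats->compute_stats = my_compute_stats;	/* our analysis routine */
	stats->minrows = 300 * stats->attr->attstattarget;	/* sample size */
	stats->extra_data = NULL;	/* the void* extension pointer is ours */

	PG_RETURN_BOOL(true);		/* false would mean "skip this column" */
}
```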
Applied by Tom.  Thanks.

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 12 February 2004 23:52
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
> I've reviewed and applied this with some small changes.  You did a
> good job --- the only things you missed AFAICS were pg_dump support
> and documentation.

Ahhh pg_dump. Knew I forgot something :) I was going to ask about
patches for the documentation but it looks like you beat me to it.

> I changed the API slightly for the typanalyze function and the
> compute_stats subroutine.  I don't think it's necessary or appropriate
> to pass in the Relation pointer to typanalyze; that simplifies its
> signature to a single INTERNAL parameter.
>
> As I've committed it, the convention that zero means "no stats" is
> enforced by examine_attribute(), but the typanalyze function is
> responsible for deciding what a negative value means.  Seem
> reasonable?

I've had a quick browse of CVS on developer.postgresql.org and the work
you've done looks great. The only reason I kept the Relation parameter
was because I wasn't sure if there was a historical reason why someone
would need the relation information as well as the attribute
information. No problem with the zero value enforcing no stats. I also
noticed your later commit with the added fetch_func parameter, which
brings stats on functional indexes even closer :)

> Also, I put the following documentation about "kind" values into
> pg_statistic.h.

Fantastic stuff. I'll email Dave and Paul at Refractions and let them
know the changes for analyzing custom types have been made - not sure
whether it will be them or myself that gets to implement the changes in
PostGIS, but it will be a good test that everything works as it should :)

Thanks again for all your work on this,

Mark.
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > The only reason I kept the Relation parameter > was because I wasn't sure if there was a historical reason why someone > would need the relation information as well as the attribute > information. I can't think of one, but if someone did, they could extract the relation OID from the pg_attribute row and re-open it for themselves. So AFAICS this API does not omit any critical info. I forgot to email you about the fetch_function revision, but I trust it meets with your approval. Right now it's just a wrapper around heap_fetch, but I thought we might conceivably want something different when we do functional-index stats. The fetch function will give us wiggle room if we need it. regards, tom lane
> -----Original Message-----
> From: pgsql-patches-owner@postgresql.org
> [mailto:pgsql-patches-owner@postgresql.org] On Behalf Of Tom Lane
> Sent: 13 February 2004 14:41
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
> I can't think of one, but if someone did, they could extract the
> relation OID from the pg_attribute row and re-open it for themselves.
> So AFAICS this API does not omit any critical info.

Great.

> I forgot to email you about the fetch_function revision, but I trust
> it meets with your approval.  Right now it's just a wrapper around
> heap_fetch, but I thought we might conceivably want something
> different when we do functional-index stats.  The fetch function will
> give us wiggle room if we need it.

Yep, no problems as long as the functionality is there. I think I can
also see where you're going with functional indexes - during an analyze
phase, for a column X, a pseudo-column f(X) is generated from the sample
data before the stats calculation is performed and the results are
dropped into pg_statistic. Hopefully this now shouldn't be too difficult
for someone to pick up and run with.

Cheers,

Mark.
Hi Tom,

Having been working with the PostGIS team to implement a custom analyze
routine for R-Tree selectivity, we have a question regarding the new
vacuum_delay_point() which is present in analyze.c. Is it the
responsibility of the programmer to remember to call
vacuum_delay_point() before calling the fetch_func(), or would it be
better to move the vacuum_delay_point() into std_fetch_func()? It would
seem to make more sense for the throttling to be done by PostgreSQL
itself rather than requiring programmers to remember to include the
extra function call before calling fetch_func() themselves.

Many thanks,

Mark.
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > Having been working with the PostGIS team to implement a custom analyze > routine for R-Tree selectivity, we have a question regarding the new > vacuum_delay_point() which is present in analyze.c. Is it the > responsibility of the programmers to remember to do a > vacuum_delay_point() before calling the fetch_func(), or would it be > better to move the vacuum_delay_point() into std_fetch_func()? It's probably not really necessary to call vacuum_delay_point in the analysis routine, unless you are contemplating extremely expensive analysis. If you did have an expensive loop, would std_fetch_func necessarily be called inside it? It seems inappropriate to me to put vacuum_delay_point inside std_fetch_func --- it's the analysis programmer's business to understand whether it must be called, and if so where's an appropriate place. regards, tom lane