Re: Suggestions for analyze patch required... - Mailing list pgsql-hackers
From | Mark Cave-Ayland |
---|---|
Subject | Re: Suggestions for analyze patch required... |
Date | |
Msg-id | 8F4A22E017460A458DB7BBAB65CA6AE512D0E4@openmanage Whole thread Raw |
In response to | Suggestions for analyze patch required... ("Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk>) |
Responses |
Re: Suggestions for analyze patch required...
|
List | pgsql-hackers |
Hi Tom, Thanks for your criticism on this. > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: 11 January 2004 17:04 > To: Mark Cave-Ayland > Cc: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Suggestions for analyze patch required... > > > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes: > > 2) Should the statistical data for custom types be stored in the > > pg_statistic table, or should it be the responsibility of > the custom > > type to store this in separate tables itself? > > You had better have a very, very good reason for proposing > the latter, as it will break many things. (For example, the > next thing you'll be wanting is a hook in DROP TABLE so you > can clean out your other tables.) > > pg_statistic is designed to store multiple kinds of data > using the "kind" flag. If that doesn't seem flexible enough > to you, let's talk about why not, rather than proposing more tables. That's OK. My gut feeling was that pg_statistic would be able to do the job, but of course it never hurts to explore other possibilities, especially when you don't hack on PostgreSQL too often ;) > > 3) If pg_statistic is the best place to store the data, > what would be > > the most appropriate structure to send/receive from the > custom analyze > > function? It looks as if VacAttrStats would do the job, > although I'm > > not convinced that passing a non-datum/standard C types to a user > > function is a good idea? > > The only reason you're bothering with this is so you can feed > the data to custom selectivity routines, no? And those have > always used internal C structs as arguments. So avoiding > them for analyze routines seems pretty pointless. You might > as well make the API be exactly that of > compute_minimal_stats/compute_scalar_stats/etc. > > Note that I don't think you will be able to get away with > just one routine in the API. You'll need at least two, one > that gets invoked at the "examine_attribute" stage and one at > the "compute_stats" stage, so that you can have some input > into how many sample rows get collected. Everything in > examine_attribute past the first couple of tests is really > just as analysis-method-specific as the actual computation. > > Actually, looking at this, I recall that the original > intention was for VacAttrStats to be a datatype-independent > struct, but the examine routine could allocate a larger > struct of which VacAttrStats is just the first component; so > if you need some private communication between the two > routines, you can have it. You could replace "algcode" with > a function pointer to the stats-computation routine, so that > only one function OID reference is needed in pg_type > (pointing to the correct "examine" routine instead of > "compute_stats"). I agree that the custom function needs an input as to the number of rows used for analysis, but I think that this is determined by the application in question. It may be that while the existing algorithm is fine for the existing data types, it may not give accurate enough statistics for some custom type that someone will need to create in the future (e.g. the 300 * atstattarget estimate for minrows may not be valid in some cases). Since this then becomes part of the implementation, a benefit would be that we would only need to add a single OID to pg_type. On the other hand, I think that it is not worth making extra work for the programmer so I am keen that as well as being flexible, it shouldn't require too much low-level PostgreSQL knowledge. In view of your feedback, would the following be a better compromise? 1) Modify examine_attribute() so it will return a VacAttrStats structure if the column has a valid ANALYZE function OID, and has not been dropped. Move all the specific functionality into a new function, assign it an OID, and make this the default for existing pg_types. This function would take a pointer to a VacAttrStats structure as its only argument that will be filled in by the function. VacAttrStats will be extended to include the OID relid (to allow the function to determine which relation so it may obtain sample rows from it). analyze_rel() will then call the custom analysis function and then call update_attstats() as normal. Custom analysis functions should be initialised with an algcode of ALG_CUSTOM, although this can be overridden by the function itself. 2) Make the acquire_sample_rows function a documented interface which can be used by programmers who do not want to delve into other algorithms. I am hoping that this would allow a simplistic version of the existing functions to look similar to below: Datum my_scalar_analyze_function (VacAttrStats *v) { // Obtain the rowsint targrows = 300 * v->attr->attstattarget;double totalrows;HeapTuple *rows;int numrows; rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));numrows = pg_analyze_acquire_sample_rows(v->relid, HeapTuple *rows, targrows, &totalrows)); // Generate statistics from the rowscompute_scalar_stats(v, v->relid->rd_att, totalrows, rows, numrows); } This also means that people could replace pg_analyze_acquire_sample_rows() with their own sampling function if that were deemed necessary, but it makes life slightly easier for those that don't. Finally the way VacAttrStats is defined means that the float * arrays are fixed at STATISTIC_NUM_SLOTS elements. For example, what if I want a histogram with more than 1000 buckets??? Perhaps a better solution would be for the functions to return cstrings which are inserted directly into the pg_statistic fields (using atstattarget as a guideline) although I think the solution here would do what we need it to. Many thanks, Mark. --- Mark Cave-Ayland Webbased Ltd. Tamar Science Park Derriford Plymouth PL6 8BX England Tel: +44 (0)1752 764445 Fax: +44 (0)1752 764446 This email and any attachments are confidential to the intended recipient and may also be privileged. If you are not the intended recipient please delete it from your system and notify the sender. You should not copy it or use it for any purpose nor disclose or distribute its contents to any other person.
pgsql-hackers by date: