Thread: Suggestions for analyze patch required...

Suggestions for analyze patch required...

From: "Mark Cave-Ayland"
Hi everyone,

In response to an email last year (see
http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php for
more information), I'm beginning to write a patch for PostgreSQL to
enable people to write their own ANALYZE routines for user-defined
types. In terms of the PostgreSQL "itch and scratch" philosophy, the
main reason to do this is to allow PostGIS to update its geometric
histogram as part of the analyze process, instead of having to execute a
separate stored procedure on a regular basis. This information is then
used by the GiST selectivity function to give PostgreSQL the details it
needs to make the right decisions about index usage.

The basic plan is as follows: move the existing analysis functionality
out of analyze_rel() and assign it an OID, then assign that OID by
default to each type in the pg_type table. Finally, analyze_rel() will
be modified to call the ANALYZE routine indicated by the OID in the
type's pg_type entry.
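
As a rough sketch (the "typanalyze" column name and the call sequence
below are just my working assumptions, nothing final), the dispatch
inside analyze_rel() might look something like this:

/*
 * Hypothetical sketch only: look up the per-type analyze routine
 * ("typanalyze" being my working name for the new pg_type column)
 * and invoke it through the fmgr interface.
 */
HeapTuple   typtup;
Oid         analyzefn;

typtup = SearchSysCache(TYPEOID,
                        ObjectIdGetDatum(attr->atttypid),
                        0, 0, 0);
if (!HeapTupleIsValid(typtup))
    elog(ERROR, "cache lookup failed for type %u", attr->atttypid);
analyzefn = ((Form_pg_type) GETSTRUCT(typtup))->typanalyze;
ReleaseSysCache(typtup);

if (OidIsValid(analyzefn))
    OidFunctionCall1(analyzefn, PointerGetDatum(stats));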

I've spent a few hours looking over the code but I have some questions
that I hope can be answered by members on this list (particularly core
team members!):


1) Is the basic plan above thinking along the right lines?

2) Should the statistical data for custom types be stored in the
pg_statistic table, or should it be the responsibility of the custom
type to store this in separate tables itself?

3) If pg_statistic is the best place to store the data, what would be
the most appropriate structure to send/receive from the custom analyze
function? It looks as if VacAttrStats would do the job, although I'm not
convinced that passing non-Datum/standard C types to a user function
is a good idea. And what would be appropriate values for the other
fields, in particular stakindX, staopX and stanumbersX?


Many thanks,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446






Re: Suggestions for analyze patch required...

From: Tom Lane
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> 2) Should the statistical data for custom types be stored in the
> pg_statistic table, or should it be the responsibility of the custom
> type to store this in separate tables itself?

You had better have a very, very good reason for proposing the latter,
as it will break many things.  (For example, the next thing you'll be
wanting is a hook in DROP TABLE so you can clean out your other tables.)

pg_statistic is designed to store multiple kinds of data using the
"kind" flag.  If that doesn't seem flexible enough to you, let's talk
about why not, rather than proposing more tables.
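
For illustration (the kind code below is arbitrary, and the slot
assignments are only a sketch), a custom analyze routine might fill one
statistics slot along these lines:

/* Illustrative sketch only; pick a kind code that doesn't collide
 * with the built-in STATISTIC_KIND_* values. */
#define STATISTIC_KIND_GEOM_HISTOGRAM  100

stats->stakind[0] = STATISTIC_KIND_GEOM_HISTOGRAM;
stats->staop[0] = InvalidOid;       /* no ordering operator involved */
stats->stanumbers[0] = histdata;    /* float4 array the routine built */
stats->numnumbers[0] = nbuckets;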

> 3) If pg_statistic is the best place to store the data, what would be
> the most appropriate structure to send/receive from the custom analyze
> function? It looks as if VacAttrStats would do the job, although I'm not
> convinced that passing non-Datum/standard C types to a user function
> is a good idea.

The only reason you're bothering with this is so you can feed the data
to custom selectivity routines, no?  And those have always used internal
C structs as arguments.  So avoiding them for analyze routines seems
pretty pointless.  You might as well make the API be exactly that
of compute_minimal_stats/compute_scalar_stats/etc.
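
For reference, that signature is currently along these lines (the
sample rows come in from outside and the routine fills in the stats
structure):

static void
compute_scalar_stats(VacAttrStats *stats, TupleDesc tupDesc,
                     double totalrows, HeapTuple *rows, int numrows);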

Note that I don't think you will be able to get away with just one
routine in the API.  You'll need at least two, one that gets invoked at
the "examine_attribute" stage and one at the "compute_stats" stage,
so that you can have some input into how many sample rows get collected.
Everything in examine_attribute past the first couple of tests is really
just as analysis-method-specific as the actual computation.

Actually, looking at this, I recall that the original intention was for
VacAttrStats to be a datatype-independent struct, but the examine
routine could allocate a larger struct of which VacAttrStats is just the
first component; so if you need some private communication between the
two routines, you can have it.  You could replace "algcode" with a
function pointer to the stats-computation routine, so that only one
function OID reference is needed in pg_type (pointing to the correct
"examine" routine instead of "compute_stats").
        regards, tom lane


Re: Suggestions for analyze patch required...

From: "Mark Cave-Ayland"
Hi Tom,

Thanks for the feedback on this.

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] 
> Sent: 11 January 2004 17:04
> To: Mark Cave-Ayland
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Suggestions for analyze patch required... 
> 
> 
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > 2) Should the statistical data for custom types be stored in the 
> > pg_statistic table, or should it be the responsibility of 
> the custom 
> > type to store this in separate tables itself?
> 
> You had better have a very, very good reason for proposing 
> the latter, as it will break many things.  (For example, the 
> next thing you'll be wanting is a hook in DROP TABLE so you 
> can clean out your other tables.)
> 
> pg_statistic is designed to store multiple kinds of data 
> using the "kind" flag.  If that doesn't seem flexible enough 
> to you, let's talk about why not, rather than proposing more tables.

That's OK. My gut feeling was that pg_statistic would be able to do the
job, but of course it never hurts to explore other possibilities,
especially when you don't hack on PostgreSQL too often ;)

> > 3) If pg_statistic is the best place to store the data, what would
> > be the most appropriate structure to send/receive from the custom
> > analyze function? It looks as if VacAttrStats would do the job,
> > although I'm not convinced that passing non-Datum/standard C types
> > to a user function is a good idea.
>
> The only reason you're bothering with this is so you can feed the data
> to custom selectivity routines, no?  And those have always used
> internal C structs as arguments.  So avoiding them for analyze
> routines seems pretty pointless.  You might as well make the API be
> exactly that of compute_minimal_stats/compute_scalar_stats/etc.
>
> Note that I don't think you will be able to get away with just one
> routine in the API.  You'll need at least two, one that gets invoked
> at the "examine_attribute" stage and one at the "compute_stats" stage,
> so that you can have some input into how many sample rows get
> collected. Everything in examine_attribute past the first couple of
> tests is really just as analysis-method-specific as the actual
> computation.
>
> Actually, looking at this, I recall that the original intention was
> for VacAttrStats to be a datatype-independent struct, but the examine
> routine could allocate a larger struct of which VacAttrStats is just
> the first component; so if you need some private communication between
> the two routines, you can have it.  You could replace "algcode" with a
> function pointer to the stats-computation routine, so that only one
> function OID reference is needed in pg_type (pointing to the correct
> "examine" routine instead of "compute_stats").

I agree that the custom function needs some input as to the number of
rows used for analysis, but I think that this is determined by the
application in question. While the existing algorithm is fine for the
existing data types, it may not give accurate enough statistics for
some custom type that someone creates in the future (e.g. the 300 *
attstattarget estimate for minrows may not be valid in all cases).
Since the row estimate then becomes part of the type's implementation,
a benefit would be that we would only need to add a single OID to
pg_type.

On the other hand, I think that it is not worth making extra work for
the programmer so I am keen that as well as being flexible, it shouldn't
require too much low-level PostgreSQL knowledge.

In view of your feedback, would the following be a better compromise?

1) Modify examine_attribute() so that it returns a VacAttrStats
structure if the column has a valid ANALYZE function OID and has not
been dropped. Move all the type-specific functionality into a new
function, assign it an OID, and make this the default for the existing
entries in pg_type. This function would take a pointer to a
VacAttrStats structure as its only argument, which it would fill in.
VacAttrStats would be extended to include a relid field identifying the
relation being analyzed (so the function can determine which relation
to obtain sample rows from).

analyze_rel() will then call the custom analysis function, followed by
update_attstats() as normal. Custom analysis functions should be
initialised with an algcode of ALG_CUSTOM, although this can be
overridden by the function itself.
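
As a sketch, the extension I have in mind is a single extra field (I
have made relid a Relation rather than a bare OID so that the function
can get at rd_att; the other fields are as they stand today, and the
constant names are from memory):

typedef struct VacAttrStats
{
    Relation    relid;      /* NEW: relation being analyzed, so the
                             * function knows where to fetch sample
                             * rows from */
    Form_pg_attribute attr; /* column being analyzed */
    int         algcode;    /* ALG_MINIMAL, ALG_SCALAR or the new
                             * ALG_CUSTOM */
    /* ... existing statistics slots unchanged ... */
} VacAttrStats;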


2) Make the acquire_sample_rows function a documented interface that
can be used by programmers who do not want to write their own sampling
algorithm. I am hoping this would allow a simplified version of the
existing functions to look something like the following:


Datum my_scalar_analyze_function(VacAttrStats *v)
{
    /* Obtain the sample rows */
    int         targrows = 300 * v->attr->attstattarget;
    double      totalrows;
    HeapTuple  *rows;
    int         numrows;

    rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
    numrows = pg_analyze_acquire_sample_rows(v->relid, rows, targrows,
                                             &totalrows);

    /* Generate statistics from the sample */
    compute_scalar_stats(v, v->relid->rd_att, totalrows, rows, numrows);

    PG_RETURN_VOID();
}


This also means that people could replace
pg_analyze_acquire_sample_rows() with their own sampling function if
that were deemed necessary, while keeping life slightly easier for
those who don't need to.



Finally, the way VacAttrStats is defined means that the float4 * arrays
are fixed at STATISTIC_NUM_SLOTS elements. For example, what if I want
a histogram with more than 1000 buckets? Perhaps a better solution
would be for the functions to return cstrings that are inserted
directly into the pg_statistic fields (using attstattarget as a
guideline), although I think the solution described here would do what
we need it to.


Many thanks,

Mark.





Re: Suggestions for analyze patch required...

From: Tom Lane
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> I agree that the custom function needs some input as to the number of
> rows used for analysis, but I think that this is determined by the
> application in question. While the existing algorithm is fine for the
> existing data types, it may not give accurate enough statistics for
> some custom type that someone creates in the future (e.g. the 300 *
> attstattarget estimate for minrows may not be valid in all cases).

Exactly.  That equation has to be part of the potentially-datatype-specific
code.  I believe the sampling code is already set up to take the max()
across all the requested sample size values.  The notion here is that if
we need to sample (say) 10000 rows instead of 3000 to satisfy some
particular analysis requirement, we might as well make use of the larger
sample size for all the columns.  You seem to be envisioning fetching a
new sample for each column of the table --- that seems like N times the
work for an N-column table, with little benefit that I can see.
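
In other words the driver can do something like this (paraphrasing, not
verbatim code):

/* take the max of all per-column sample-size requests ... */
targrows = 0;
for (i = 0; i < attr_cnt; i++)
{
    if (targrows < vacattrstats[i]->minrows)
        targrows = vacattrstats[i]->minrows;
}
/* ... then a single acquire_sample_rows() call serves every column */
numrows = acquire_sample_rows(onerel, rows, targrows, &totalrows);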

> 1) Modify examine_attribute() so that it returns a VacAttrStats
> structure if the column has a valid ANALYZE function OID and has not
> been dropped. Move all the type-specific functionality into a new
> function, assign it an OID, and make this the default for the existing
> entries in pg_type.

I was envisioning that the existing examine_attribute() would become the
default datatype-specific routine referenced in pg_type.  Either it, or
a substitute routine written by a datatype author, would be called and
would return a VacAttrStats structure (or NULL to skip analysis).  The
stats structure would indicate the requested sample size and contain a
function pointer to a second function to call back after the sample has
been collected.  The existing compute_xxx_stats functions would become
two examples of this second function.  (The second functions would thus
not require pg_proc entries nor a pg_type column to reference them: the
examine_attribute routine would know which function it wanted called.)
The second functions would return data to be stored into pg_statistic,
using the VacAttrStats structures.

IMHO neither acquisition of the sample rows nor storing of the final
results in pg_statistic should be under the control of the per-datatype
routine, because those are best done in parallel for all columns at
once.
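
Putting it together, the overall shape of analyze_rel() would be
roughly this (illustrative only, using the names from above):

/* per-type examine step, dispatched through pg_type */
for (i = 0; i < attr_cnt; i++)
    stats[i] = examine_attribute(onerel, i);

/* take the max of stats[i]->minrows as above, sample once ... */
numrows = acquire_sample_rows(onerel, rows, targrows, &totalrows);

/* ... run each column's callback over the shared sample ... */
for (i = 0; i < attr_cnt; i++)
    if (stats[i] != NULL)
        (*stats[i]->compute_stats) (stats[i], onerel->rd_att,
                                    totalrows, rows, numrows);

/* ... and store everything into pg_statistic in one pass */
update_attstats(RelationGetRelid(onerel), attr_cnt, stats);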

> Finally, the way VacAttrStats is defined means that the float4 *
> arrays are fixed at STATISTIC_NUM_SLOTS elements. For example, what
> if I want a histogram with more than 1000 buckets?

The histogram is still just one array, no?  NUM_SLOTS defines the
maximum number of different arrays you can put into pg_statistic, but
not their dimensions or contents.  I don't see your point.
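
That is, a single slot's array can hold however many entries you need
(mylt_opr below stands in for whatever sort operator applies):

/* one slot, arbitrarily many buckets; NUM_SLOTS is not the limit */
stats->stakind[1] = STATISTIC_KIND_HISTOGRAM;
stats->staop[1] = mylt_opr;
stats->numvalues[1] = 5000;         /* a 5000-bucket histogram */
stats->stavalues[1] = (Datum *) palloc(5000 * sizeof(Datum));
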
        regards, tom lane


Re: Suggestions for analyze patch required...

From: "Mark Cave-Ayland"
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] 
> Sent: 13 January 2004 18:08
> To: Mark Cave-Ayland
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Suggestions for analyze patch required... 
> 
> 
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > I agree that the custom function needs an input as to the number of 
> > rows used for analysis, but I think that this is determined by the 
> > application in question. It may be that while the existing 
> algorithm 
> > is fine for the existing data types, it may not give 
> accurate enough 
> > statistics for some custom type that someone will need to create in 
> > the future (e.g. the 300 * atstattarget estimate for 
> minrows may not 
> > be valid in some cases).
> 
> Exactly.  That equation has to be part of the 
> potentially-datatype-specific code.  I believe the sampling 
> code is already set up to take the max() across all the 
> requested sample size values.  The notion here is that if we 
> need to sample (say) 10000 rows instead of 3000 to satisfy 
> some particular analysis requirement, we might as well make 
> use of the larger sample size for all the columns.  You seem 
> to be envisioning fetching a new sample for each column of 
> the table --- that seems like N times the work for an 
> N-column table, with little benefit that I can see.

*lightbulb*

Now I understand. I was assuming that each type acquires its own sample
rows, whereas what actually happens is that the attstattarget for each
column is used to calculate the number of rows required, the max()
across all columns of the relation is taken, and that number of sample
rows is loaded from the relation. The analysis routine for each column
then runs on this one sample and drops its results in the stats
structure. Finally, for each column, the stats are converted into
arrays and the relevant entry is created/updated in pg_statistic. I
hope that sounds about right.

> > 1) Modify examine_attribute() so that it returns a VacAttrStats
> > structure if the column has a valid ANALYZE function OID and has not
> > been dropped. Move all the type-specific functionality into a new
> > function, assign it an OID, and make this the default for the
> > existing entries in pg_type.
>
> I was envisioning that the existing examine_attribute() would become
> the default datatype-specific routine referenced in pg_type.  Either
> it, or a substitute routine written by a datatype author, would be
> called and would return a VacAttrStats structure (or NULL to skip
> analysis).  The stats structure would indicate the requested sample
> size and contain a function pointer to a second function to call back
> after the sample has been collected.  The existing compute_xxx_stats
> functions would become two examples of this second function.  (The
> second functions would thus not require pg_proc entries nor a pg_type
> column to reference them: the examine_attribute routine would know
> which function it wanted called.) The second functions would return
> data to be stored into pg_statistic, using the VacAttrStats
> structures.
>
> IMHO neither acquisition of the sample rows nor storing of the final
> results in pg_statistic should be under the control of the
> per-datatype routine, because those are best done in parallel for all
> columns at once.
>
> > Finally, the way VacAttrStats is defined means that the float4 *
> > arrays are fixed at STATISTIC_NUM_SLOTS elements. For example, what
> > if I want a histogram with more than 1000 buckets?
>
> The histogram is still just one array, no?  NUM_SLOTS defines the
> maximum number of different arrays you can put into pg_statistic, but
> not their dimensions or contents.  I don't see your point.

Again, this was probably a result of my misunderstanding of how the
above process works, and in that context the approach you suggest
makes perfect sense. There's enough material here for me to start
coding up the patch - thanks again, Tom, for taking the time to explain
the innards of this to me. When I get something working, I'll post it
for evaluation (along with the custom type I used to test it) to
pgsql-patches.


Many thanks,

Mark.
