Thread: ANALYZE patch for review

ANALYZE patch for review

From
"Mark Cave-Ayland"
Date:
Hi everyone,


Here is a first attempt at a patch to allow a customised ANALYZE
function to be specified for user-defined types, relating to the
following two threads:

http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php and
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00236.php


This is my first attempt at a patch for Postgres so I don't expect it to
get applied without significant review. Included along with this email
is a sample datatype that can be used to see that the ANALYZE function
is being called by inserting a fixed value into the pg_statistic table.

Some points:

    1) This patch does nothing other than provide a mechanism for
the user to place
    their own values in the pg_statistic table. Currently I am still
trying to complete
    the loop by finding out how and in what format a GiST
selectivity function has
    access to this data.

    The user-defined function uses a new STATISTICS_KIND_CUSTOM
    constant to hold the data as I am not 100% sure how the planner
interprets the
    STATISTIC_KIND_* values in the pg_statistic table. My aim is to
be able to store
    a 2D histogram so that spatial row count estimates can be used
by the planner
    in PostGIS (http://postgis.refractions.net) without having to
maintain the
    statistics by manually running a stored procedure. The
pg_statistic.h file doesn't
    seem clear to me as to what the values of the various columns
should be when not
    dealing with single one-dimensional histograms.....


    2) All existing types are assigned the default analyze function
called
    pg_analyze_alg_default().


    3) The function is called within examine_attribute() to setup a
valid VacAttrStats
    function if the column is analyzable and point to the algorithm
function itself
    which currently uses the existing API of compute_scalar_stats()
and     compute_minimal_stats().


    4) The VacAttrStats structure may need to be moved out to a
different .h file
    for access by the programmer (currently it is in
commands/vacuum.h) - maybe a
    new analyze.h would be better?


    5) Currently no thought has been given to providing statistics
on functional
    indexes (see
http://archives.postgresql.org/pgsql-general/2004-01/msg00978.php)


In other words, the basic work is done but there is still some way to go
before it is complete. Hopefully by putting this out now other
pgsql-hackers will get the chance to review and refine the patch
(particularly in the area of functional indexes) so this
feature can make it out into 7.5.


Many thanks to the hackers who have spent their time helping me get this
far (particularly Tom).


Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.

Attachment

Re: ANALYZE patch for review

From
Tom Lane
Date:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> Here is a first attempt at a patch to allow a customised ANALYZE
> function to be specified for user-defined types, relating to the
> following two threads:
> http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php and
> http://archives.postgresql.org/pgsql-hackers/2004-01/msg00236.php

This seems to be going in mostly the right direction, but I think you've
made a few errors.

One is that I think that the VacAttrStats structure should be palloc'd
by the type-specific typanalyze function, and not by general code.  The
reason for this is that a type-specific function could allocate a larger
struct of which VacAttrStats is just the first field, leaving room for
additional fields that can communicate data to the later type-specific
analysis routine (which would know that it is receiving this type and
not just generic VacAttrStats).  This is already useful for the standard
analysis code --- it would want to keep eqopr, eqfunc, and ltopr there,
so as to avoid the extra lookups required in your patch.  Other
datatypes would likely need comparable functionality.

This would mean that pretty much the whole of examine_attribute is
treated as type-specific code.  In consequence there would be a certain
amount of duplication of code across different type-specific setup
routines, but that does not bother me.

Also, I'd suggest that it would be better to define a zero in
pg_type.typanalyze as selecting the default analysis routine.
This would put the special case in just one place rather than
having that knowledge all through pg_type.h plus CreateType.
(If you do this, I think the default analysis routine wouldn't
even need a pg_proc entry.)

>     The user-defined function uses a new STATISTICS_KIND_CUSTOM
>     constant to hold the data as I am not 100% sure how the planner
> interprets the
>     STATISTIC_KIND_* values in the pg_statistic table. My aim is to
> be able to store
>     a 2D histogram so that spatial row count estimates can be used
> by the planner
>     in PostGIS (http://postgis.refractions.net) without having to
> maintain the
>     statistics by manually running a stored procedure. The
> pg_statistic.h file doesn't
>     seem clear to me as to what the values of the various columns
> should be when not
>     dealing with single one-dimensional histograms.....

Obviously we'd need to define more STATISTIC_KIND_xxx values to
represent any additional kinds of statistics that are getting stored.
"STATISTICS_KIND_CUSTOM" is exactly the way *not* to do that, as it
would lead people to use the same ID code for different things.
What we probably need to do is think of a suitable scheme for reserving
STATISTIC_KIND codes for different uses.  A really off-the-cuff
suggestion is:

    1-100:    reserved for assignment by the core Postgres project
    100-199: reserved for assignment by PostGIS
    200-9999: reserved for other globally-known stats kinds
    10000-32767: reserved for private site-local use

People writing private datatypes would select random values above 10000
and have little chance of conflict.  Anybody who wanted to write code
for public distribution would want to come to us and get an assignment
of space below 10000.

>     4) The VacAttrStats structure may need to be moved out to a
> different .h file
>     for access by the programmer (currently it is in
> commands/vacuum.h) - maybe a
>     new analyze.h would be better?

vacuum.h seems fine.

            regards, tom lane

Re: ANALYZE patch for review

From
"Mark Cave-Ayland"
Date:
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 25 January 2004 23:08
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > Here is a first attempt at a patch to allow a customised ANALYZE
> > function to be specified for user-defined types, relating to the
> > following two threads:
> >
> http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php and
> > http://archives.postgresql.org/pgsql-hackers/2004-01/msg00236.php
>
> This seems to be going in mostly the right direction, but I
> think you've made a few errors.
>
> One is that I think that the VacAttrStats structure should be
> palloc'd by the type-specific typanalyze function, and not by
> general code.  The reason for this is that a type-specific
> function could allocate a larger struct of which VacAttrStats
> is just the first field, leaving room for additional fields
> that can communicate data to the later type-specific analysis
> routine (which would know that it is receiving this type and
> not just generic VacAttrStats).  This is already useful for
> the standard analysis code --- it would want to keep eqopr,
> eqfunc, and ltopr there, so as to avoid the extra lookups
> required in your patch.  Other datatypes would likely need
> comparable functionality.
>
> This would mean that pretty much the whole of
> examine_attribute is treated as type-specific code.  In
> consequence there would be a certain amount of duplication of
> code across different type-specific setup routines, but that
> does not bother me.

My thinking behind this was that examine_attribute decides whether a
column of a given type is analyzable so it made sense to provide the
user with the type and column information. The one thing I do want to
ensure is that the user function doesn't have to do things like check if
the column is dropped or if the stats target is zero - that should be
done by Postgres IMHO. I do like the idea of having a custom structure
passed between functions as like you say it makes things a lot more
flexible.... Will see what I can do about it.

> Also, I'd suggest that it would be better to define a zero in
> pg_type.typanalyze as selecting the default analysis routine.
> This would put the special case in just one place rather than
> having that knowledge all through pg_type.h plus CreateType.
> (If you do this, I think the default analysis routine
> wouldn't even need a pg_proc entry.)

I'll have a look at this too. The thing I really want to avoid is having
to create lots of 'special case' code depending upon whether the default
analysis routine is used or an external function. At the moment, the
logic just calls the typanalyze entry using OidFunctionCall1 regardless
which seems quite a neat solution.

> >     The user-defined function uses a new STATISTICS_KIND_CUSTOM
> >     constant to hold the data as I am not 100% sure how the planner
> > interprets the
> >     STATISTIC_KIND_* values in the pg_statistic table. My
> aim is to be
> > able to store
> >     a 2D histogram so that spatial row count estimates can
> be used by the
> > planner
> >     in PostGIS (http://postgis.refractions.net) without having to
> > maintain the
> >     statistics by manually running a stored procedure. The
> pg_statistic.h
> > file doesn't
> >     seem clear to me as to what the values of the various
> columns should
> > be when not
> >     dealing with single one-dimensional histograms.....
>
> Obviously we'd need to define more STATISTIC_KIND_xxx values
> to represent any additional kinds of statistics that are
> getting stored. "STATISTICS_KIND_CUSTOM" is exactly the way
> *not* to do that, as it would lead people to use the same ID
> code for different things. What we probably need to do is
> think of a suitable scheme for reserving STATISTIC_KIND codes
> for different uses.  A really off-the-cuff suggestion is:
>
>     1-100:    reserved for assignment by the core Postgres project
>     100-199: reserved for assignment by PostGIS
>     200-9999: reserved for other globally-known stats kinds
>     10000-32767: reserved for private site-local use
>
> People writing private datatypes would select random values
> above 10000 and have little chance of conflict.  Anybody who
> wanted to write code for public distribution would want to
> come to us and get an assignment of space below 10000.

Would it really be necessary to need to do this? My understanding would
be that using STATISTICS_KIND_CUSTOM merely marks the pg_statistic entry
as being valid, i.e. it is valid data but the row sta* contents can only
be interpreted by the specified type and operator combination, and
should not attempt to be interpreted by any of the standard estimation
functions.

Also in a selectivity estimator function would it be reasonable to call
get_attstatsslot(i.e. is it a relatively stable API for external as well
as internal use) since this function does a lot of the work that would
be required by an estimation function?

> >     4) The VacAttrStats structure may need to be moved out
> to a different
> > .h file
> >     for access by the programmer (currently it is in
> > commands/vacuum.h) - maybe a
> >     new analyze.h would be better?
>
> vacuum.h seems fine.

OK, I'll leave it where it is.


Many thanks,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.



Re: ANALYZE patch for review

From
Tom Lane
Date:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
>> This would mean that pretty much the whole of
>> examine_attribute is treated as type-specific code.  In
>> consequence there would be a certain amount of duplication of
>> code across different type-specific setup routines, but that
>> does not bother me.

> My thinking behind this was that examine_attribute decides whether a
> column of a given type is analyzable so it made sense to provide the
> user with the type and column information. The one thing I do want to
> ensure is that the user function doesn't have to do things like check if
> the column is dropped or if the stats target is zero - that should be
> done by Postgres IMHO.

[ shrug ... ]  It's only a couple lines either way, and you could argue
that the present behavior for attstattarget = 0 should be subject to
customization anyway.  But if you prefer to pass in the pg_attribute and
pg_type rows already fetched, I won't object.

> I'll have a look at this too. The thing I really want to avoid is having
> to create lots of 'special case' code depending upon whether the default
> analysis routine is used or an external function. At the moment, the
> logic just calls the typanalyze entry using OidFunctionCall1 regardless
> which seems quite a neat solution.

My point was that there is going to be a 'special case' to provide the
default somewhere --- if not here, then in the CREATE TYPE code.  It
seems to me that localizing knowledge of what the default behavior is
wouldn't be a bad idea.

>> What we probably need to do is
>> think of a suitable scheme for reserving STATISTIC_KIND codes
>> for different uses.

> Would it really be necessary to need to do this? My understanding would
> be that using STATISTICS_KIND_CUSTOM merely marks the pg_statistic entry
> as being valid, i.e. it is valid data but the row sta* contents can only
> be interpreted by the specified type and operator combination, and
> should not attempt to be interpreted by any of the standard estimation
> functions.

I don't much like that.  It amounts to assuming that you will always be
able to guess from context what a STATISTICS_KIND_CUSTOM entry actually
is.  I think that's shortsighted.  As soon as there are a few different
custom estimation functions out there, they will be tripping over each
others' data.  You may be thinking that only one kind of custom data
would actually exist for any one datatype, but I'd turn that around:
what of custom analysis functions that are supposed to work for a range
of datatypes?  It'd be very unsafe for such functions to all use the
same STATISTICS_KIND_CUSTOM label.  The operator selectivity functions
that need to use the data couldn't be sure what they were looking at.

> Also in a selectivity estimator function would it be reasonable to call
> get_attstatsslot(i.e. is it a relatively stable API for external as well
> as internal use)

Yes, it's as stable as any other part of Postgres ... I don't see any
reason that an external selectivity estimator should be coded any
differently from the built-in ones.

            regards, tom lane

Re: ANALYZE patch for review

From
"Mark Cave-Ayland"
Date:
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 27 January 2004 17:16
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
> > My thinking behind this was that examine_attribute decides
> whether a
> > column of a given type is analyzable so it made sense to
> provide the
> > user with the type and column information. The one thing I
> do want to
> > ensure is that the user function doesn't have to do things
> like check
> > if the column is dropped or if the stats target is zero -
> that should
> > be done by Postgres IMHO.
>
> [ shrug ... ]  It's only a couple lines either way, and you
> could argue that the present behavior for attstattarget = 0
> should be subject to customization anyway.  But if you prefer
> to pass in the pg_attribute and pg_type rows already fetched,
> I won't object.

OK you've won me over on the stats target. I can see where someone could
have a case where an operation could fail if no statistics are available
and hence they would want to intercept the case where the target is zero
to place in some initial data.

I have a question about something you wrote in your previous email (this
is more a C question than anything else):

> One is that I think that the VacAttrStats structure should be
> palloc'd by the type-specific typanalyze function, and not by
> general code.  The reason for this is that a type-specific
> function could allocate a larger struct of which VacAttrStats
> is just the first field, leaving room for additional fields
> that can communicate data to the later type-specific analysis
> routine (which would know that it is receiving this type and
> not just generic VacAttrStats).  This is already useful for
> the standard analysis code --- it would want to keep eqopr,
> eqfunc, and ltopr there, so as to avoid the extra lookups
> required in your patch.  Other datatypes would likely need
> comparable functionality.

Is what you're saying that a custom function would do something like
this:

typedef struct
{
    VacAttrStats *v;
    int    mydata1;
    int    mydata2;
    int    mydata3;
} mystruct;

myanalyzeinfo = palloc0(sizeof(mystruct));
..
..
return myanalyzeinfo;


This means that a structure of kind mystruct is returned back to
analyze_rel(), but analyze_rel() still wants to access some information
from VacAttrStats. Does that mean that the reference to VacAttrStats in
analyze_rel() would be derived something like this?

VacAttrStats *vacattrptr;
vacattrptr = (VacAttrStats *)myanalyzeinfo;
vacattrptr->minrows = 2;

Firstly, is this right? And secondly, if so, is there a way of arranging
the structures to ensure that the compiler throws a warning if
VacAttrStats is not the first element of the structure?

> > I'll have a look at this too. The thing I really want to avoid is
> > having to create lots of 'special case' code depending upon whether
> > the default analysis routine is used or an external
> function. At the
> > moment, the logic just calls the typanalyze entry using
> > OidFunctionCall1 regardless which seems quite a neat solution.
>
> My point was that there is going to be a 'special case' to
> provide the default somewhere --- if not here, then in the
> CREATE TYPE code.  It seems to me that localizing knowledge
> of what the default behavior is wouldn't be a bad idea.

I suppose the question here would be does having a default case in this
way make things harder for developers to understand/follow the code? If
it makes sense that a value of 0 would indicate the default statistics
routines (which would be called directly) and the localization still
maintains readability then I'm happy to go with this.

> >> What we probably need to do is
> >> think of a suitable scheme for reserving STATISTIC_KIND codes
> >> for different uses.
>
> > Would it really be necessary to need to do this? My understanding
> > would be that using STATISTICS_KIND_CUSTOM merely marks the
> > pg_statistic entry as being valid, i.e. it is valid data
> but the row
> > sta* contents can only be interpreted by the specified type and
> > operator combination, and should not attempt to be
> interpreted by any
> > of the standard estimation functions.
>
> I don't much like that.  It amounts to assuming that you will
> always be able to guess from context what a
> STATISTICS_KIND_CUSTOM entry actually is.  I think that's
> shortsighted.  As soon as there are a few different custom
> estimation functions out there, they will be tripping over
> each others' data.  You may be thinking that only one kind of
> custom data would actually exist for any one datatype, but
> I'd turn that around: what of custom analysis functions that
> are supposed to work for a range of datatypes?  It'd be very
> unsafe for such functions to all use the same
> STATISTICS_KIND_CUSTOM label.  The operator selectivity
> functions that need to use the data couldn't be sure what
> they were looking at.

I'm beginning to think that perhaps we're looking at this in the wrong
way, and that a more elegant version of what you're suggesting could be
implemented using a major/minor method of identifying a statistics type.


As an example, for existing types, the major would identify the owner of
the data type as being STATSOWNER_PGDG, while the minor would indicate
the algorithm, e.g. ALG_INDEX_SCALAR or ALG_INDEX_MCV. In the case of
PostGIS this would mean that the PGDG would only have to assign them a
STATSOWNER_POSTGIS and then they would be free to divide the minor to
indicate whatever type of algorithm they need, without having to
allocate chunks of potentially unused number space.

This also then brings up the idea that different types of statistics can
be kept on a given relation; for example index selectivity stats could
be one, where as another could be the number of times the letter 'e'
appears. Your example where a custom analysis function works on a range
of datatypes can be handled by the stats owners themselves, by
allocating a block of minor numbers for an algorithm where each entry
can correspond to a different data type as they require.

I suppose this also means that there might need to be a mechanism where
a custom analysis function would want to add more than one row to the
pg_statistic table (ie it collects more than one set of statistics per
column); I can see this being used to implement statistics keeping on
functional indexes for example, where for a column X, a copy is made and
f(X) calculated for each index and added to the pg_statistic table.

> > Also in a selectivity estimator function would it be reasonable to
> > call get_attstatsslot(i.e. is it a relatively stable API
> for external
> > as well as internal use)
>
> Yes, it's as stable as any other part of Postgres ... I don't
> see any reason that an external selectivity estimator should
> be coded any differently from the built-in ones.

OK, great. That makes life slightly easier.


Many thanks,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.



Re: ANALYZE patch for review

From
Tom Lane
Date:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> Is what you're saying that a custom function would do something like
> this:

> typedef struct
> {
>     VacAttrStats *v;
>     int    mydata1;
>     int    mydata2;
>     int    mydata3;
> } mystruct;

> myanalyzeinfo = palloc0(sizeof(mystruct));
> ..
> ..
> return myanalyzeinfo;

Exactly, except that the return would look like

return (VacAttrStruct *) myanalyzeinfo;

(or at least it would if you weren't converting to Datum anyway).  The
main analyze code would think it had a plain pointer to VacAttrStats,
but the data analysis subroutine associated with this setup routine
would know it could cast the pointer back to (mystruct *).

> VacAttrStats *vacattrptr;
> vacattrptr = (VacAttrStats *)myanalyzeinfo;
> vacattrptr->minrows = 2;

Seems like the hard way; you can just access myanalyzeinfo->v.minrows
when you're working with a pointer declared as mystruct *.

> is there a way of arranging
> the structures to ensure that the compiler throws a warning if
> VacAttrStats is not the first element of the structure?

Not that I know of, but we use techniques like this in very many places
in the PG code and have not had troubles of that sort.  I'm not too
concerned about that.

> I'm beginning to think that perhaps we're looking at this in the wrong
> way, and that a more elegant version of what you're suggesting could be
> implemented using a major/minor method of identifying a statistics type.

If you suppose that the "major" field is the upper bits of the
statistics ID value, then this is just a slightly different way of
thinking about the range-based allocation method I suggested before.
However, the range-based method can adapt to allocating different
amounts of identifier space to different owners, whereas a major/minor
approach can't easily do that since you've defined it to be 2^N minor
IDs for each major code.

> I suppose this also means that there might need to be a mechanism where
> a custom analysis function would want to add more than one row to the
> pg_statistic table (ie it collects more than one set of statistics per
> column);

As long as the custom analysis function is tied to a datatype, I don't
see the need for that.  It already has the ability to define up to four
different "kinds" of stats for its column, and if someone makes a case
for wanting more, I'd be inclined to widen pg_statistic rather than get
into multiple rows per column.  (To point out just one problem, such an
arrangement would break the unique index scheme presently used.)  I'm
not sure yet about whether we need special hooks for keeping stats on
functional indexes, but if we do they'd surely not be looked up by
datatype, or at least not datatype alone.

            regards, tom lane

Re: ANALYZE patch for review

From
"Mark Cave-Ayland"
Date:
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 29 January 2004 15:31
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
>

<lots cut about pointers>

OK, I've had another attempt at writing the code as you suggested but
the more I work on it the less I like it :(. What I would like to do is
make the VacAttrStats structure so that it just contains the information
that is updated in the pg_statistic table, however this fell apart when
I realised that update_attstats() suddenly requires the attr and
attrtype fields to be present. Doh.

So I'd like to propose a slightly different solution. I think that
examine_attribute() should return a pointer to a custom structure
containing any information that needs to be passed to the datatype
specific routine (not the entire VacAttrStats structure), or NULL if the
column should not be analyzed. I'm also considering changing the
examine_attribute() input parameters to be Relation, Attribute, Type for
the current column along with a pointer to a bool to indicate whether or
not the column should be analyzed or not.

If examine_attribute() sets the bool to false then the column is
ignored. If the bool is set to true then a VacAttrStats structure is
created in memory, and then the Attribute and Type tuple information is
copied into the VacAttrStats structure. A new field for VacAttrStats
will contain the pointer to the custom structure returned by
examine_attribute() which can then be passed into the compute_*_stats()
functions as an extra parameter.

This seems to achieve the aims of abstracting the statistics data from
the intermediate information required by the statistics routines,
allowing extra/custom data to be passed between the typanalyze function
and the statistics algorithm, and allowing the user to have the attr and
attrtype structures given to them. The only thing I don't really like
about this is providing a pointer to a bool in examine_attribute() -
however this is needed to distinguish from a NULL meaning 'I have no
custom data but the analyze function should still be called' and 'This
column should not be analyzed'. I can't think of a better solution at
the moment.

> > I'm beginning to think that perhaps we're looking at this
> in the wrong
> > way, and that a more elegant version of what you're
> suggesting could
> > be implemented using a major/minor method of identifying a
> statistics
> > type.
>
> If you suppose that the "major" field is the upper bits of
> the statistics ID value, then this is just a slightly
> different way of thinking about the range-based allocation
> method I suggested before. However, the range-based method
> can adapt to allocating different amounts of identifier space
> to different owners, whereas a major/minor approach can't
> easily do that since you've defined it to be 2^N minor IDs
> for each major code.

I was thinking perhaps in terms of an extra staowner int2 field in
pg_statistic where the IDs are allocated by the PGDG. Then each
group/project would only require one owner id to be allocated to them
and then have the existing 2^16 stakind space to organise themselves.
The advantage of this is that projects can allocate their own stakind
fields, implementing new or improved statistic algorithms without having
to wait on the new allocation from the PGDG.


Many thanks,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.



Re: ANALYZE patch for review

From
Tom Lane
Date:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> So I'd like to propose a slightly different solution. I think that
> examine_attribute() should return a pointer to a custom structure
> containing any information that needs to be passed to the datatype
> specific routine (not the entire VacAttrStats structure), or NULL if the
> column should not be analyzed.

Just a void* you mean?  Sure, we could do that, although it might result
in some duplicated effort.  Another possibility is that analyze.c goes
ahead and creates a VacAttrStats struct (including a void* extension
pointer that it initializes to NULL) and then passes the struct to
examine_attribute, which returns a bool and optionally modifies fields
in the VacAttrStats struct --- in particular the requested-row-count and
extension pointer.  If false is returned then analyze.c just throws away
the VacAttrStats struct instead of including it in its to-do list.

>> If you suppose that the "major" field is the upper bits of
>> the statistics ID value, then this is just a slightly
>> different way of thinking about the range-based allocation
>> method I suggested before.

> I was thinking perhaps in terms of an extra staowner int2 field in
> pg_statistic where the IDs are allocated by the PGDG.

I do not really want to add a field to pg_statistic.  That complicates
matters more than we have a justification to do.  Nor do we have any
reason at this point to think that we need a 2^32 namespace for
statistics kinds.  (If 2^16 ever starts to look cramped, we can just
widen the fields to int4 without introducing any backwards compatibility
issues --- existing code assignments will remain valid.  But I find it
hard to foresee that happening.)

            regards, tom lane

Re: ANALYZE patch for review

From
"Mark Cave-Ayland"
Date:
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 02 February 2004 16:54
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > So I'd like to propose a slightly different solution. I think that
> > examine_attribute() should return a pointer to a custom structure
> > containing any information that needs to be passed to the datatype
> > specific routine (not the entire VacAttrStats structure),
> or NULL if
> > the column should not be analyzed.
>
> Just a void* you mean?  Sure, we could do that, although it
> might result in some duplicated effort.  Another possibility
> is that analyze.c goes ahead and creates a VacAttrStats
> struct (including a void* extension pointer that it
> initializes to NULL) and then passes the struct to
> examine_attribute, which returns a bool and optionally
> modifies fields in the VacAttrStats struct --- in particular
> the requested-row-count and extension pointer.  If false is
> returned then analyze.c just throws away the VacAttrStats
> struct instead of including it in its to-do list.

Yup indeed. Please find enclosed the latest version of the analyze patch
taking into account all the things we have discussed in the thread. It
feels a lot cleaner than the other - the advantage of having
analyze_rel() create the VacAttrStats structure is that you can put some
vaguely sensible defaults in the user-defined function so it shouldn't
bomb too badly if someone gets it wrong, plus the attribute and type
information can be fed in automagically.

This rewrite has the benefit that it gets rid of the duplication of
routines having to get their own copies of the attribute and type rows
based on attnum, and also the (void *) pointer gives a clear abstraction
of the data required for the pg_statistic table entry and those that are
used by the current default routine. I also altered the
examine_attribute() to allow the user-defined function to check
attstattarget for the column manually (in case they want to do something
special with minus or zero values), and also as discussed the internal
stats routine is called directly without requiring an entry in the
catalog.

> >> If you suppose that the "major" field is the upper bits of
> >> the statistics ID value, then this is just a slightly
> >> different way of thinking about the range-based allocation
> >> method I suggested before.
>
> > I was thinking perhaps in terms of an extra staowner int2 field in
> > pg_statistic where the IDs are allocated by the PGDG.
>
> I do not really want to add a field to pg_statistic.  That
> complicates matters more than we have a justification to do.
> Nor do we have any reason at this point to think that we need
> a 2^32 namespace for statistics kinds.  (If 2^16 ever starts
> to look cramped, we can just widen the fields to int4 without
> introducing any backwards compatibility issues --- existing
> code assignments will remain valid.  But I find it hard to
> foresee that happening.)

OK, I've left this as it is at the moment.


Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.

Attachment

Re: ANALYZE patch for review

From
"Mark Cave-Ayland"
Date:
Hi Tom,

Did you manage to find a spare moment over the past week to check the
revised ANALYZE patch I posted to the list? Is there any more work that
needs to be done to it before it can applied to CVS?


TIA,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.



Re: ANALYZE patch for review

From
Tom Lane
Date:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> Yup indeed. Please find enclosed the latest version of the analyze patch
> taking into account all the things we have discussed in the thread.

I've reviewed and applied this with some small changes.  You did a good
job --- the only things you missed AFAICS were pg_dump support and
documentation.

I changed the API slightly for the typanalyze function and the
compute_stats subroutine.  I don't think it's necessary or appropriate
to pass in the Relation pointer to typanalyze; that simplifies its
signature to a single INTERNAL parameter.  (If we did want to pass the
Relation, I'd be inclined to make it a field in VacAttrStats.)  Also
I tweaked compute_stats so that the attribute number of the column is
passed explicitly.  I foresee needing this for functional-index stats
gathering --- it's likely that the accumulated sample rows will be built
on the fly and will have a tupdesc that includes both the main table
rows and the functional index columns.  So compute_stats shouldn't
assume that the nominal column number is the right thing to use to
extract the column it wants.

> I also altered the examine_attribute() to allow the user-defined
> function to check attstattarget for the column manually (in case they
> want to do something special with minus or zero values),

As I've committed it, the convention that zero means "no stats" is
enforced by examine_attribute(), but the typanalyze function is
responsible for deciding what a negative value means.  Seem reasonable?

Also, I put the following documentation about "kind" values into
pg_statistic.h.

/*
 * The present allocation of "kind" codes is:
 *
 *    1-99:        reserved for assignment by the core PostgreSQL project
 *                 (values in this range will be documented in this file)
 *    100-199:     reserved for assignment by the PostGIS project
 *                 (values to be documented in PostGIS documentation)
 *    200-9999:    reserved for future public assignments
 *
 * For private use you may choose a "kind" code at random in the range
 * 10000-30000.  However, for code that is to be widely disseminated it is
 * better to obtain a publicly defined "kind" code by request from the
 * PostgreSQL Global Development Group.
 */

            regards, tom lane

Re: ANALYZE patch for review

From
Bruce Momjian
Date:
Applied by Tom.  Thanks.

---------------------------------------------------------------------------

Mark Cave-Ayland wrote:
> Hi Tom,
>
> > -----Original Message-----
> > From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> > Sent: 02 February 2004 16:54
> > To: Mark Cave-Ayland
> > Cc: pgsql-patches@postgresql.org
> > Subject: Re: [PATCHES] ANALYZE patch for review
> >
> >
> > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > > So I'd like to propose a slightly different solution. I think that
> > > examine_attribute() should return a pointer to a custom structure
> > > containing any information that needs to be passed to the datatype
> > > specific routine (not the entire VacAttrStats structure),
> > or NULL if
> > > the column should not be analyzed.
> >
> > Just a void* you mean?  Sure, we could do that, although it
> > might result in some duplicated effort.  Another possibility
> > is that analyze.c goes ahead and creates a VacAttrStats
> > struct (including a void* extension pointer that it
> > initializes to NULL) and then passes the struct to
> > examine_attribute, which returns a bool and optionally
> > modifies fields in the VacAttrStats struct --- in particular
> > the requested-row-count and extension pointer.  If false is
> > returned then analyze.c just throws away the VacAttrStats
> > struct instead of including it in its to-do list.
>
> Yup indeed. Please find enclosed the latest version of the analyze patch
> taking into account all the things we have discussed in the thread. It
> feels a lot cleaner than the other - the advantage of having
> analyze_rel() create the VacAttrStats structure is that you can put some
> vaguely sensible defaults in the user-defined function so it shouldn't
> bomb too badly if someone gets it wrong, plus the attribute and type
> information can be fed in automagically.
>
> This rewrite has the benefit that it gets rid of the duplication of
> routines having to get their own copies of the attribute and type rows
> based on attnum, and also the (void *) pointer gives a clear abstraction
> of the data required for the pg_statistic table entry and those that are
> used by the current default routine. I also altered the
> examine_attribute() to allow the user-defined function to check
> attstattarget for the column manually (in case they want to do something
> special with minus or zero values), and also as discussed the internal
> stats routine is called directly without requiring an entry in the
> catalog.
>
> > >> If you suppose that the "major" field is the upper bits of
> > >> the statistics ID value, then this is just a slightly
> > >> different way of thinking about the range-based allocation
> > >> method I suggested before.
> >
> > > I was thinking perhaps in terms of an extra staowner int2 field in
> > > pg_statistic where the IDs are allocated by the PGDG.
> >
> > I do not really want to add a field to pg_statistic.  That
> > complicates matters more than we have a justification to do.
> > Nor do we have any reason at this point to think that we need
> > a 2^32 namespace for statistics kinds.  (If 2^16 ever starts
> > to look cramped, we can just widen the fields to int4 without
> > introducing any backwards compatibility issues --- existing
> > code assignments will remain valid.  But I find it hard to
> > foresee that happening.)
>
> OK, I've left this as it is at the moment.
>
>
> Cheers,
>
> Mark.
>
> ---
>
> Mark Cave-Ayland
> Webbased Ltd.
> Tamar Science Park
> Derriford
> Plymouth
> PL6 8BX
> England
>
> Tel: +44 (0)1752 764445
> Fax: +44 (0)1752 764446
>
>
> This email and any attachments are confidential to the intended
> recipient and may also be privileged. If you are not the intended
> recipient please delete it from your system and notify the sender. You
> should not copy it or use it for any purpose nor disclose or distribute
> its contents to any other person.

[ Attachment, skipping... ]

[ Attachment, skipping... ]

[ Attachment, skipping... ]

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: ANALYZE patch for review

From
"Mark Cave-Ayland"
Date:
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 12 February 2004 23:52
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > Yup indeed. Please find enclosed the latest version of the analyze
> > patch taking into account all the things we have discussed in the
> > thread.
>
> I've reviewed and applied this with some small changes.  You
> did a good job --- the only things you missed AFAICS were
> pg_dump support and documentation.

Ahhh pg_dump. Knew I forgot something :) I was going to ask about
patches for the documentation but it looks like you beat me to it.

> I changed the API slightly for the typanalyze function and
> the compute_stats subroutine.  I don't think it's necessary
> or appropriate to pass in the Relation pointer to typanalyze;
> that simplifies its signature to a single INTERNAL parameter.
>  (If we did want to pass the Relation, I'd be inclined to
> make it a field in VacAttrStats.)  Also I tweaked
> compute_stats so that the attribute number of the column is
> passed explicitly.  I foresee needing this for
> functional-index stats gathering --- it's likely that the
> accumulated sample rows will be built on the fly and will
> have a tupdesc that includes both the main table rows and the
> functional index columns.  So compute_stats shouldn't assume
> that the nominal column number is the right thing to use to
> extract the column it wants.
>
> > I also altered the examine_attribute() to allow the user-defined
> > function to check attstattarget for the column manually (in
> case they
> > want to do something special with minus or zero values),
>
> As I've committed it, the convention that zero means "no
> stats" is enforced by examine_attribute(), but the typanalyze
> function is responsible for deciding what a negative value
> means.  Seem reasonable?

I've had a quick browse of CVS on developer.postgresql.org and the work
you've done looks great. The only reason I kept the Relation parameter
was because I wasn't sure if there was a historical reason why someone
would need the relation information as well as the attribute
information. No problem with the zero value enforcing no stats. I also
noticed your later commit with the added fetch_func parameter which
brings stats on functional indexes even closer :)

> Also, I put the following documentation about "kind" values
> into pg_statistic.h.
>
> /*
>  * The present allocation of "kind" codes is:
>  *
>  *    1-99:        reserved for assignment by the core
> PostgreSQL project
>  *                 (values in this range will be documented
> in this file)
>  *    100-199:     reserved for assignment by the PostGIS project
>  *                 (values to be documented in PostGIS documentation)
>  *    200-9999:    reserved for future public assignments
>  *
>  * For private use you may choose a "kind" code at random in the range
>  * 10000-30000.  However, for code that is to be widely
> disseminated it is
>  * better to obtain a publicly defined "kind" code by request from the
>  * PostgreSQL Global Development Group.
>  */

Fantastic stuff. I'll email Dave and Paul at Refractions and let them
know the changes for analyzing custom types have been made - not sure
whether it will be them or myself that get to implement the changes in
PostGIS but it will be a good test that everything works as it should :)


Thanks again for all your work on this,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.



Re: ANALYZE patch for review

From
Tom Lane
Date:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> The only reason I kept the Relation parameter
> was because I wasn't sure if there was a historical reason why someone
> would need the relation information as well as the attribute
> information.

I can't think of one, but if someone did, they could extract the
relation OID from the pg_attribute row and re-open it for themselves.
So AFAICS this API does not omit any critical info.

I forgot to email you about the fetch_function revision, but I trust
it meets with your approval.  Right now it's just a wrapper around
heap_fetch, but I thought we might conceivably want something different
when we do functional-index stats.  The fetch function will give us
wiggle room if we need it.

            regards, tom lane

Re: ANALYZE patch for review

From
"Mark Cave-Ayland"
Date:
> -----Original Message-----
> From: pgsql-patches-owner@postgresql.org
> [mailto:pgsql-patches-owner@postgresql.org] On Behalf Of Tom Lane
> Sent: 13 February 2004 14:41
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > The only reason I kept the Relation parameter
> > was because I wasn't sure if there was a historical reason
> why someone
> > would need the relation information as well as the attribute
> > information.
>
> I can't think of one, but if someone did, they could extract
> the relation OID from the pg_attribute row and re-open it for
> themselves. So AFAICS this API does not omit any critical info.

Great.

> I forgot to email you about the fetch_function revision, but
> I trust it meets with your approval.  Right now it's just a
> wrapper around heap_fetch, but I thought we might conceivably
> want something different when we do functional-index stats.
> The fetch function will give us wiggle room if we need it.

Yep no problems as long as the functionality is there. I think I can
also see where you're going with functional indexes - during an analyze
phase, for a column X, a pseudo-column f(X) is generated from the sample
data before the stats calculation is performed and the results dropped
into pg_statistic. Hopefully this now shouldn't be too difficult for
someone to pick it up and run with it.


Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.



Re: ANALYZE patch for review

From
"Mark Cave-Ayland"
Date:
Hi Tom,

Having been working with the PostGIS team to implement a custom analyze
routine for R-Tree selectivity, we have a question regarding the new
vacuum_delay_point() which is present in analyze.c. Is it the
responsibility of the programmers to remember to do a
vacuum_delay_point() before calling the fetch_func(), or would it be
better to move the vacuum_delay_point() into std_fetch_func()? It would
seem to make more sense that the throttling is done by PostgreSQL rather
than requiring programmers to have to remember to include the extra
function call before calling the fetch_func() themselves.


Many thanks,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.




Re: ANALYZE patch for review

From
Tom Lane
Date:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> Having been working with the PostGIS team to implement a custom analyze
> routine for R-Tree selectivity, we have a question regarding the new
> vacuum_delay_point() which is present in analyze.c. Is it the
> responsibility of the programmers to remember to do a
> vacuum_delay_point() before calling the fetch_func(), or would it be
> better to move the vacuum_delay_point() into std_fetch_func()?

It's probably not really necessary to call vacuum_delay_point in the
analysis routine, unless you are contemplating extremely expensive
analysis.  If you did have an expensive loop, would std_fetch_func
necessarily be called inside it?  It seems inappropriate to me to put
vacuum_delay_point inside std_fetch_func --- it's the analysis
programmer's business to understand whether it must be called, and if
so where's an appropriate place.

            regards, tom lane