Re: ANALYZE patch for review - Mailing list pgsql-patches

From Mark Cave-Ayland
Subject Re: ANALYZE patch for review
Date
Msg-id 8F4A22E017460A458DB7BBAB65CA6AE502654B@openmanage
Whole thread Raw
In response to ANALYZE patch for review  ("Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk>)
Responses Re: ANALYZE patch for review
List pgsql-patches
Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 27 January 2004 17:16
> To: Mark Cave-Ayland
> Cc: pgsql-patches@postgresql.org
> Subject: Re: [PATCHES] ANALYZE patch for review
>
> > My thinking behind this was that examine_attribute decides
> whether a
> > column of a given type is analyzable so it made sense to
> provide the
> > user with the type and column information. The one thing I
> do want to
> > ensure is that the user function doesn't have to do things
> like check
> > if the column is dropped or if the stats target is zero -
> that should
> > be done by Postgres IMHO.
>
> [ shrug ... ]  It's only a couple lines either way, and you
> could argue that the present behavior for attstattarget = 0
> should be subject to customization anyway.  But if you prefer
> to pass in the pg_attribute and pg_type rows already fetched,
> I won't object.

OK you've won me over on the stats target. I can see where someone could
have a case where an operation could fail if no statistics are available
and hence they would want to intercept the case where the target is zero
to place in some initial data.

I have a question about something you wrote in your previous email (this
is more a C question than anything else):

> One is that I think that the VacAttrStats structure should be
> palloc'd by the type-specific typanalyze function, and not by
> general code.  The reason for this is that a type-specific
> function could allocate a larger struct of which VacAttrStats
> is just the first field, leaving room for additional fields
> that can communicate data to the later type-specific analysis
> routine (which would know that it is receiving this type and
> not just generic VacAttrStats).  This is already useful for
> the standard analysis code --- it would want to keep eqopr,
> eqfunc, and ltopr there, so as to avoid the extra lookups
> required in your patch.  Other datatypes would likely need
> comparable functionality.

Is what you're saying that a custom function would do something like
this:

typedef struct
{
    VacAttrStats *v;
    int    mydata1;
    int    mydata2;
    int    mydata3;
} mystruct;

myanalyzeinfo = palloc0(sizeof(mystruct));
..
..
return myanalyzeinfo;


This means that a structure of kind mystruct is returned back to
analyze_rel(), but analyze_rel() still wants to access some information
from VacAttrStats. Does that mean that the reference to VacAttrStats in
analyze_rel() would be derived something like this?

VacAttrStats *vacattrptr;
vacattrptr = (VacAttrStats *)myanalyzeinfo;
vacattrptr->minrows = 2;

Firstly, is this right? And secondly, if so, is there a way of arranging
the structures to ensure that the compiler throws a warning if
VacAttrStats is not the first element of the structure?

> > I'll have a look at this too. The thing I really want to avoid is
> > having to create lots of 'special case' code depending upon whether
> > the default analysis routine is used or an external
> function. At the
> > moment, the logic just calls the typanalyze entry using
> > OidFunctionCall1 regardless which seems quite a neat solution.
>
> My point was that there is going to be a 'special case' to
> provide the default somewhere --- if not here, then in the
> CREATE TYPE code.  It seems to me that localizing knowledge
> of what the default behavior is wouldn't be a bad idea.

I suppose the question here would be does having a default case in this
way make things harder for developers to understand/follow the code? If
it makes sense that a value of 0 would indicate the default statistics
routines (which would be called directly) and the localization still
maintains readability then I'm happy to go with this.

> >> What we probably need to do is
> >> think of a suitable scheme for reserving STATISTIC_KIND codes
> >> for different uses.
>
> > Would it really be necessary to need to do this? My understanding
> > would be that using STATISTICS_KIND_CUSTOM merely marks the
> > pg_statistic entry as being valid, i.e. it is valid data
> but the row
> > sta* contents can only be interpreted by the specified type and
> > operator combination, and should not attempt to be
> interpreted by any
> > of the standard estimation functions.
>
> I don't much like that.  It amounts to assuming that you will
> always be able to guess from context what a
> STATISTICS_KIND_CUSTOM entry actually is.  I think that's
> shortsighted.  As soon as there are a few different custom
> estimation functions out there, they will be tripping over
> each others' data.  You may be thinking that only one kind of
> custom data would actually exist for any one datatype, but
> I'd turn that around: what of custom analysis functions that
> are supposed to work for a range of datatypes?  It'd be very
> unsafe for such functions to all use the same
> STATISTICS_KIND_CUSTOM label.  The operator selectivity
> functions that need to use the data couldn't be sure what
> they were looking at.

I'm beginning to think that perhaps we're looking at this in the wrong
way, and that a more elegant version of what you're suggesting could be
implemented using a major/minor method of identifying a statistics type.


As an example, for existing types, the major would identify the owner of
the data type as being STATSOWNER_PGDG, while the minor would indicate
the algorithm, e.g. ALG_INDEX_SCALAR or ALG_INDEX_MCV. In the case of
PostGIS this would mean that the PGDG would only have to assign them a
STATSOWNER_POSTGIS and then they would be free to divide the minor to
indicate whatever type of algorithm they need, without having to
allocate chunks of potentially unused number space.

This also then brings up the idea that different types of statistics can
be kept on a given relation; for example index selectivity stats could
be one, where as another could be the number of times the letter 'e'
appears. Your example where a custom analysis function works on a range
of datatypes can be handled by the stats owners themselves, by
allocating a block of minor numbers for an algorithm where each entry
can correspond to a different data type as they require.

I suppose this also means that there might need to be a mechanism where
a custom analysis function would want to add more than one row to the
pg_statistic table (ie it collects more than one set of statistics per
column); I can see this being used to implement statistics keeping on
functional indexes for example, where for a column X, a copy is made and
f(X) calculated for each index and added to the pg_statistic table.

> > Also in a selectivity estimator function would it be reasonable to
> > call get_attstatsslot(i.e. is it a relatively stable API
> for external
> > as well as internal use)
>
> Yes, it's as stable as any other part of Postgres ... I don't
> see any reason that an external selectivity estimator should
> be coded any differently from the built-in ones.

OK, great. That makes life slightly easier.


Many thanks,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.



pgsql-patches by date:

Previous
From: Neil Conway
Date:
Subject: Re: support for printing/exporting xml
Next
From: Tom Lane
Date:
Subject: Re: ANALYZE patch for review