Home > mailing lists

Re: Suggestions for analyze patch required... - Mailing list pgsql-hackers

From	Mark Cave-Ayland
Subject	Re: Suggestions for analyze patch required...
Date	January 14, 2004 05:55:35
Msg-id	8F4A22E017460A458DB7BBAB65CA6AE5026540@openmanage Whole thread
In response to	Suggestions for analyze patch required... ("Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk>)
List	pgsql-hackers

Tree view

Hi Tom,

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] 
> Sent: 13 January 2004 18:08
> To: Mark Cave-Ayland
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Suggestions for analyze patch required... 
> 
> 
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > I agree that the custom function needs an input as to the number of 
> > rows used for analysis, but I think that this is determined by the 
> > application in question. It may be that while the existing 
> algorithm 
> > is fine for the existing data types, it may not give 
> accurate enough 
> > statistics for some custom type that someone will need to create in 
> > the future (e.g. the 300 * atstattarget estimate for 
> minrows may not 
> > be valid in some cases).
> 
> Exactly.  That equation has to be part of the 
> potentially-datatype-specific code.  I believe the sampling 
> code is already set up to take the max() across all the 
> requested sample size values.  The notion here is that if we 
> need to sample (say) 10000 rows instead of 3000 to satisfy 
> some particular analysis requirement, we might as well make 
> use of the larger sample size for all the columns.  You seem 
> to be envisioning fetching a new sample for each column of 
> the table --- that seems like N times the work for an 
> N-column table, with little benefit that I can see.

*lightbulb*

Now I understand this. I was assuming that each type acquires its own
sample rows, but now what I understand happens is that the atstattarget
is used from each column to calculate the number of rows required, the
max() across all columns for a relation is taken, and then this number
of sample rows are loaded from the relation. The analysis routines for
each column then run on this sample and drop the results in the stats
structure. Finally, for each column, the stats are converted into an
array, and the relevant entry created/updated in pg_statistic. Hope that
sounds about right.

> > 1) Modify examine_attribute() so it will return a VacAttrStats 
> > structure if the column has a valid ANALYZE function OID, 
> and has not 
> > been dropped. Move all the specific functionality into a 
> new function, 
> > assign it an OID, and make this the default for existing pg_types.
> 
> I was envisioning that the existing examine_attribute() would 
> become the default datatype-specific routine referenced in 
> pg_type.  Either it, or a substitute routine written by a 
> datatype author, would be called and would return a 
> VacAttrStats structure (or NULL to skip analysis).  The stats 
> structure would indicate the requested sample size and 
> contain a function pointer to a second function to call back 
> after the sample has been collected.  The existing 
> compute_xxx_stats functions would become two examples of this 
> second function.  (The second functions would thus not 
> require pg_proc entries nor a pg_type column to reference 
> them: the examine_attribute routine would know which function 
> it wanted called.) The second functions would return data to 
> be stored into pg_statistic, using the VacAttrStats structures.
> 
> IMHO neither acquisition of the sample rows nor storing of 
> the final results in pg_statistic should be under the control 
> of the per-datatype routine, because those are best done in 
> parallel for all columns at once.
>
> > Finally the way VacAttrStats is defined means that the 
> float * arrays 
> > are fixed at STATISTIC_NUM_SLOTS elements. For example, 
> what if I want 
> > a histogram with more than 1000 buckets???
> 
> The histogram is still just one array, no?  NUM_SLOTS defines 
> the maximum number of different arrays you can put into 
> pg_statistic, but not their dimensions or contents.  I don't 
> see your point.

Again, this was probably a result of me misunderstanding of how the
above process works, and in that context the approach you suggest above
would make perfect sense. There's enough material here for me to start
coding up the patch - thanks again Tom for taking the time to explain
the innards of this to me. When I get something working, I'll post an
evaluation (along with my custom type used to test it) to pgsql-patches.

Many thanks,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446

This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.

pgsql-hackers by date:

From: Steve Atkins
Date: 14 January 2004, 02:46:34
Subject: Mixing threaded and non-threaded

From: Dave Cramer
Date: 14 January 2004, 08:49:14
Subject: Re: failed to re-find parent key

Re: Suggestions for analyze patch required... - Mailing list pgsql-hackers

Previous

Next