Thread: Operator Classes and ANALYZE
Hi All, It doesn't appear that ANALYZE uses the specified operator class for producing statistics on an index when that operator class is not the default for the data type. This appears to be leading to poor query planning. I'm using: # select version(); -[ RECORD 1 ]--------------------------------------------------------------------- ------------------------- version | PostgreSQL 8.1.5 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.0 20060304 (Red Hat 4.1.0-3) For speed of indexing a few million urls I'm indexing them with a custom data type. The data type aptly named "urlhash" has: A 2 byte host id (just the hostname, not many hosts are involved so this is fine) A 2 byte locality id (parts of the path) A 16 byte MD5 of the full url all in a custom data type called aptly. For a total length of 20 bytes. By default the text output of the data type is of the form of ([data in hex]). Everything is nice and fast (due to good locality) for the default operator class that respects the entire data type. Everything is unique so there is an n_distinct of -1 in pg_stats for the index. Now, I created an operator class for urlhash that just does comparisons on the host id "urlhash_host_ops" and I created the index using: # create index url_uh2_idx on url using btree (texttourlhash(url) urlhash_host_ops); After running vacuum analyze that statistics for the new index using the second operator class still say that every value is unique. It appears it's using the default operator class to produce the statistics. By doing so, the planner doesn't work great since it assumes that only one row will be returned. Note: I prefixed all default operators with @ to denote the host class. # explain select * from url where url @= 'http:// www.prolitegear.com/'::urlhash ; QUERY PLAN ------------------------------------------------------------------------ ----------------- Index Scan using url_uh2_idx on url (cost=0.00..6.02 rows=1 width=105) Index Cond: ((url)::urlhash @= '(93d48c2ab505280c99322630a24f61c6533bc368)'::urlhash) # select count(*) from url where url @= 'http:// www.prolitegear.com/'::urlhash; count ------- 6992 There are hosts with orders of magnitude difference in total number of rows (100,000 or more), this was just chosen as an example. Here are the stats for the index: # select * from pg_stats where tablename = 'url_uh2_idx'; -[ RECORD 1 ]----- +----------------------------------------------------------------------- ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------------------------------------------------------------------ -------------------------------------------- schemaname | public tablename | url_uh2_idx attname | pg_expression_1 null_frac | 0 avg_width | 20 n_distinct | -1 most_common_vals | most_common_freqs | histogram_bounds | {(11890a55087088877a5a01ea5b8a8459e62491c9), (3c983e95c87b3ba8669ade6fe1be797495be671f), (4ac4698bc9a61b94eef527c6599217a0bbcd1ad6), (4ac4b7c874b9b16ceab038e93646510a8fb756ca), (60d74740f0a49e68e89e01232373ed2da0ca4c2f), (67fe336eb73e327491a4e8bf224bce5857879188), (7975336e56ad4ee211ebee9ff2a6f1e4c9e43ca5), (88a9336e7a8e5419ff63ed1cec6985d91cd15688), (909b76d12cf6c84255a33a008bf91f0cf0f711c4), (d687336e90c41fc92776dc35896797ef873c38f5), (f4aaba84e7e86a4c7d7c4d5930128cb3a5889cbf)} correlation | -0.109073 Should I just create a cast function rather then an entire operator class that just respects the host part? Is this bad style? If you have questions I'd be happy to explain more. Thanks, Rusty -- Rusty Conover InfoGears Inc. Web: http://www.infogears.com
On Fri, 2006-11-03 at 15:16 -0700, Rusty Conover wrote: > It doesn't appear that ANALYZE uses the specified operator class for > producing statistics on an index when that operator class is not the > default for the data type. This appears to be leading to poor query > planning. > For speed of indexing a few million urls I'm indexing them with a > custom data type. The data type aptly named "urlhash" has: Have you read the CREATE TYPE man page, specifically with regard to the analyze_function clause? Basically, if you want anything different, you have to write an analysis function yourself. This is what PostGIS (www.postgis.org) does, if you want to look for specific code examples. ANALYZE collects stats for tables, not indexes, using the default operator class for the datatype. So even though you've clearly specified an opclass for the index, no stats will be collected using it. Alternatively, perhaps you have fallen foul of this situation? /* * Can't analyze if the opclass uses a storage type * different from the expression result type. We'd get * confused because the type shown in pg_attribute for * the index column doesn't match what we are getting * from the expression. Perhaps this can be fixed * someday, but for now, punt. */ It's in the analyze.c code, but not in the docs. Be interested in a full report of your research, once you're done. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Nov 6, 2006, at 1:53 PM, Simon Riggs wrote: > On Fri, 2006-11-03 at 15:16 -0700, Rusty Conover wrote: > >> It doesn't appear that ANALYZE uses the specified operator class for >> producing statistics on an index when that operator class is not the >> default for the data type. This appears to be leading to poor query >> planning. > >> For speed of indexing a few million urls I'm indexing them with a >> custom data type. The data type aptly named "urlhash" has: > > Have you read the CREATE TYPE man page, specifically with regard to > the > analyze_function clause? Basically, if you want anything different, > you > have to write an analysis function yourself. This is what PostGIS > (www.postgis.org) does, if you want to look for specific code > examples. I have read it and works great, because it says it uses the default operators = and < to build the statistics which should work great for the default operator class. I don't think I need to write my own analyze function because the default behavior works so well. I just want the ANALYZE call to use the index's opclass definitions of = and < if the index is created with a custom operator class that is not the default for the data type. > > ANALYZE collects stats for tables, not indexes, using the default > operator class for the datatype. So even though you've clearly > specified > an opclass for the index, no stats will be collected using it. > I don't think thats correct because, with indexes based on functions there are statistics that clearly can't be gathered just directly from the table. Analyze does look at indexes to build up the histogram statistics and correlation. See compute_index_stats in analyze.c. > Alternatively, perhaps you have fallen foul of this situation? > > /* > * Can't analyze if the opclass uses a storage type > * different from the expression result type. We'd get > * confused because the type shown in pg_attribute for > * the index column doesn't match what we are getting > * from the expression. Perhaps this can be fixed > * someday, but for now, punt. > */ > > It's in the analyze.c code, but not in the docs. > > Be interested in a full report of your research, once you're done. > I don't think this applies because the storage type is the same its just a separate operator class being used. I think the problem may be with the "ordering_oper" and "equality_oper" only being passed the type rather then the type and the operator class that's being used for the index. Looking more it goes back into typcache.c which only caches the default operators from the default btree or hash operator classes. Changing this looks like it would require a bit of effort if it is the right path to pursue. Thanks, Rusty -- Rusty Conover InfoGears Inc. Web: http://www.infogears.com
On Mon, 2006-11-06 at 14:47 -0700, Rusty Conover wrote: > I just > want the ANALYZE call to use the index's opclass definitions of = and > < if the index is created with a custom operator class that is not > the default for the data type. Which is exactly what the manual specifically says it doesn't do, therefore you'll need the analyze_function. That capability was put there deliberately to help you out, in this situation. > I don't think I need to write my own analyze function That is the solution, not a workaround. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Nov 6, 2006, at 3:20 PM, Simon Riggs wrote: > On Mon, 2006-11-06 at 14:47 -0700, Rusty Conover wrote: > >> I just >> want the ANALYZE call to use the index's opclass definitions of = and >> < if the index is created with a custom operator class that is not >> the default for the data type. > > Which is exactly what the manual specifically says it doesn't do, > therefore you'll need the analyze_function. That capability was put > there deliberately to help you out, in this situation. Since the analyze function is part of the type not the operator class (making it type specific not operator class specific), it doesn't appear that it will be able to do what I ask since you can only have one analyze function per type and its not passed the operator context for the particular run. from analyze.h: /*---------- * ANALYZE builds one of these structs for each attribute (column) that is * to be analyzed. The struct and subsidiary data are in anl_context, * so they live until the end of the ANALYZE operation. * * The type-specific typanalyze function is passed a pointer to this struct * and must return TRUE to continue analysis, FALSE to skip analysis of this * column. In the TRUE case it must set the compute_stats and minrows fields, * and can optionally set extra_data to pass additional info to compute_stats. * minrows is its request for the minimum number of sample rows to be gathered * (but note this request might not be honored, eg if there are fewer rows * than that in the table). * */ I still think this is a deficiency in the analyze function to not use the operator_class that the index uses when producing statistics for that index. Thanks, Rusty -- Rusty Conover InfoGears Inc. Web: http://www.infogears.com
On Mon, Nov 06, 2006 at 08:53:28PM +0000, Simon Riggs wrote: > On Fri, 2006-11-03 at 15:16 -0700, Rusty Conover wrote: > > > It doesn't appear that ANALYZE uses the specified operator class for > > producing statistics on an index when that operator class is not the > > default for the data type. This appears to be leading to poor query > > planning. > > > For speed of indexing a few million urls I'm indexing them with a > > custom data type. The data type aptly named "urlhash" has: > > Have you read the CREATE TYPE man page, specifically with regard to the > analyze_function clause? Basically, if you want anything different, you > have to write an analysis function yourself. This is what PostGIS > (www.postgis.org) does, if you want to look for specific code examples. > > ANALYZE collects stats for tables, not indexes, using the default > operator class for the datatype. So even though you've clearly specified > an opclass for the index, no stats will be collected using it. > > Alternatively, perhaps you have fallen foul of this situation? > > /* > * Can't analyze if the opclass uses a storage type > * different from the expression result type. We'd get > * confused because the type shown in pg_attribute for > * the index column doesn't match what we are getting > * from the expression. Perhaps this can be fixed > * someday, but for now, punt. > */ > > It's in the analyze.c code, but not in the docs. Should a doc patch be in the offing here? Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
On Mon, 2006-11-06 at 15:54 -0700, Rusty Conover wrote: > I still think this is a deficiency in the analyze function to not use > the operator_class that the index uses when producing statistics for > that index. Agreed, but that isn't the way it works right now, AFAICS. TODO... -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Mon, 2006-11-06 at 15:54 -0700, Rusty Conover wrote: > > > I still think this is a deficiency in the analyze function to not use > > the operator_class that the index uses when producing statistics for > > that index. > > Agreed, but that isn't the way it works right now, AFAICS. TODO... Do people want this as a TODO? What would the wording be? -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +