Operator Classes and ANALYZE - Mailing list pgsql-bugs
From | Rusty Conover |
---|---|
Subject | Operator Classes and ANALYZE |
Date | |
Msg-id | 46F770BA-EDFD-49B5-BF62-1B6F6BB1C185@infogears.com Whole thread Raw |
Responses |
Re: Operator Classes and ANALYZE
|
List | pgsql-bugs |
Hi All, It doesn't appear that ANALYZE uses the specified operator class for producing statistics on an index when that operator class is not the default for the data type. This appears to be leading to poor query planning. I'm using: # select version(); -[ RECORD 1 ]--------------------------------------------------------------------- ------------------------- version | PostgreSQL 8.1.5 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.0 20060304 (Red Hat 4.1.0-3) For speed of indexing a few million urls I'm indexing them with a custom data type. The data type aptly named "urlhash" has: A 2 byte host id (just the hostname, not many hosts are involved so this is fine) A 2 byte locality id (parts of the path) A 16 byte MD5 of the full url all in a custom data type called aptly. For a total length of 20 bytes. By default the text output of the data type is of the form of ([data in hex]). Everything is nice and fast (due to good locality) for the default operator class that respects the entire data type. Everything is unique so there is an n_distinct of -1 in pg_stats for the index. Now, I created an operator class for urlhash that just does comparisons on the host id "urlhash_host_ops" and I created the index using: # create index url_uh2_idx on url using btree (texttourlhash(url) urlhash_host_ops); After running vacuum analyze that statistics for the new index using the second operator class still say that every value is unique. It appears it's using the default operator class to produce the statistics. By doing so, the planner doesn't work great since it assumes that only one row will be returned. Note: I prefixed all default operators with @ to denote the host class. # explain select * from url where url @= 'http:// www.prolitegear.com/'::urlhash ; QUERY PLAN ------------------------------------------------------------------------ ----------------- Index Scan using url_uh2_idx on url (cost=0.00..6.02 rows=1 width=105) Index Cond: ((url)::urlhash @= '(93d48c2ab505280c99322630a24f61c6533bc368)'::urlhash) # select count(*) from url where url @= 'http:// www.prolitegear.com/'::urlhash; count ------- 6992 There are hosts with orders of magnitude difference in total number of rows (100,000 or more), this was just chosen as an example. Here are the stats for the index: # select * from pg_stats where tablename = 'url_uh2_idx'; -[ RECORD 1 ]----- +----------------------------------------------------------------------- ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------------------------------------------------------------------ -------------------------------------------- schemaname | public tablename | url_uh2_idx attname | pg_expression_1 null_frac | 0 avg_width | 20 n_distinct | -1 most_common_vals | most_common_freqs | histogram_bounds | {(11890a55087088877a5a01ea5b8a8459e62491c9), (3c983e95c87b3ba8669ade6fe1be797495be671f), (4ac4698bc9a61b94eef527c6599217a0bbcd1ad6), (4ac4b7c874b9b16ceab038e93646510a8fb756ca), (60d74740f0a49e68e89e01232373ed2da0ca4c2f), (67fe336eb73e327491a4e8bf224bce5857879188), (7975336e56ad4ee211ebee9ff2a6f1e4c9e43ca5), (88a9336e7a8e5419ff63ed1cec6985d91cd15688), (909b76d12cf6c84255a33a008bf91f0cf0f711c4), (d687336e90c41fc92776dc35896797ef873c38f5), (f4aaba84e7e86a4c7d7c4d5930128cb3a5889cbf)} correlation | -0.109073 Should I just create a cast function rather then an entire operator class that just respects the host part? Is this bad style? If you have questions I'd be happy to explain more. Thanks, Rusty -- Rusty Conover InfoGears Inc. Web: http://www.infogears.com
pgsql-bugs by date: