Operator Classes and ANALYZE - Mailing list pgsql-bugs

From Rusty Conover
Subject Operator Classes and ANALYZE
Date
Msg-id 46F770BA-EDFD-49B5-BF62-1B6F6BB1C185@infogears.com
Whole thread Raw
Responses Re: Operator Classes and ANALYZE
List pgsql-bugs
Hi All,

It doesn't appear that ANALYZE uses the specified operator class for
producing statistics on an index when that operator class is not the
default for the data type.  This appears to be leading to poor query
planning.

I'm using:

# select version();
-[ RECORD
1 ]---------------------------------------------------------------------
-------------------------
version | PostgreSQL 8.1.5 on i686-pc-linux-gnu, compiled by GCC gcc
(GCC) 4.1.0 20060304 (Red Hat 4.1.0-3)

For speed of indexing a few million urls I'm indexing them with a
custom data type. The data type aptly named "urlhash" has:

A 2 byte host id (just the hostname, not many hosts are involved so
this is fine)
A 2 byte locality id (parts of the path)
A 16 byte MD5 of the full url all in a custom data type called aptly.

For a total length of 20 bytes.  By default the text output of the
data type is of the form of ([data in hex]).

Everything is nice and fast (due to good locality) for the default
operator class that respects the entire data type.  Everything is
unique so there is an n_distinct of -1 in pg_stats for the index.

Now, I created an operator class for urlhash that just does
comparisons on the host id "urlhash_host_ops" and I created the index
using:

# create index url_uh2_idx on url using btree  (texttourlhash(url)
urlhash_host_ops);

After running vacuum analyze that statistics for the new index using
the second operator class still say that every value is unique.  It
appears it's using the default operator class to produce the
statistics.  By doing so, the planner doesn't work great since it
assumes that only one row will be returned.

Note: I prefixed all default operators with @ to denote the host class.

# explain select * from url where url @= 'http://
www.prolitegear.com/'::urlhash ;
                                        QUERY PLAN
------------------------------------------------------------------------
-----------------
Index Scan using url_uh2_idx on url  (cost=0.00..6.02 rows=1 width=105)
    Index Cond: ((url)::urlhash @=
'(93d48c2ab505280c99322630a24f61c6533bc368)'::urlhash)


# select count(*) from url where url @= 'http://
www.prolitegear.com/'::urlhash;
count
-------
   6992

There are hosts with orders of magnitude difference in total number
of rows (100,000 or more), this was just chosen as an example.

Here are the stats for the index:

# select * from pg_stats where tablename = 'url_uh2_idx';
-[ RECORD 1 ]-----
+-----------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------------------------
schemaname        | public
tablename         | url_uh2_idx
attname           | pg_expression_1
null_frac         | 0
avg_width         | 20
n_distinct        | -1
most_common_vals  |
most_common_freqs |
histogram_bounds  | {(11890a55087088877a5a01ea5b8a8459e62491c9),
(3c983e95c87b3ba8669ade6fe1be797495be671f),
(4ac4698bc9a61b94eef527c6599217a0bbcd1ad6),
(4ac4b7c874b9b16ceab038e93646510a8fb756ca),
(60d74740f0a49e68e89e01232373ed2da0ca4c2f),
(67fe336eb73e327491a4e8bf224bce5857879188),
(7975336e56ad4ee211ebee9ff2a6f1e4c9e43ca5),
(88a9336e7a8e5419ff63ed1cec6985d91cd15688),
(909b76d12cf6c84255a33a008bf91f0cf0f711c4),
(d687336e90c41fc92776dc35896797ef873c38f5),
(f4aaba84e7e86a4c7d7c4d5930128cb3a5889cbf)}
correlation       | -0.109073

Should I just create a cast function rather then an entire operator
class that just respects the host part?  Is this bad style?  If you
have questions I'd be happy to explain more.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com

pgsql-bugs by date:

Previous
From: Jeff Davis
Date:
Subject: Re: COPY fails on 8.1 with invalid byte sequences in text
Next
From: "Harald Armin Massa"
Date:
Subject: Re: BUG #2731: Cannot install PostgreSQL server on WinXP Media Center Edition