Thread: Operator Classes and ANALYZE

Operator Classes and ANALYZE

From
Rusty Conover
Date:
Hi All,

It doesn't appear that ANALYZE uses the specified operator class for
producing statistics on an index when that operator class is not the
default for the data type.  This appears to be leading to poor query
planning.

I'm using:

# select version();
-[ RECORD
1 ]---------------------------------------------------------------------
-------------------------
version | PostgreSQL 8.1.5 on i686-pc-linux-gnu, compiled by GCC gcc
(GCC) 4.1.0 20060304 (Red Hat 4.1.0-3)

For speed of indexing a few million urls I'm indexing them with a
custom data type. The data type aptly named "urlhash" has:

A 2 byte host id (just the hostname, not many hosts are involved so
this is fine)
A 2 byte locality id (parts of the path)
A 16 byte MD5 of the full url all in a custom data type called aptly.

For a total length of 20 bytes.  By default the text output of the
data type is of the form of ([data in hex]).

Everything is nice and fast (due to good locality) for the default
operator class that respects the entire data type.  Everything is
unique so there is an n_distinct of -1 in pg_stats for the index.

Now, I created an operator class for urlhash that just does
comparisons on the host id "urlhash_host_ops" and I created the index
using:

# create index url_uh2_idx on url using btree  (texttourlhash(url)
urlhash_host_ops);

After running vacuum analyze that statistics for the new index using
the second operator class still say that every value is unique.  It
appears it's using the default operator class to produce the
statistics.  By doing so, the planner doesn't work great since it
assumes that only one row will be returned.

Note: I prefixed all default operators with @ to denote the host class.

# explain select * from url where url @= 'http://
www.prolitegear.com/'::urlhash ;
                                        QUERY PLAN
------------------------------------------------------------------------
-----------------
Index Scan using url_uh2_idx on url  (cost=0.00..6.02 rows=1 width=105)
    Index Cond: ((url)::urlhash @=
'(93d48c2ab505280c99322630a24f61c6533bc368)'::urlhash)


# select count(*) from url where url @= 'http://
www.prolitegear.com/'::urlhash;
count
-------
   6992

There are hosts with orders of magnitude difference in total number
of rows (100,000 or more), this was just chosen as an example.

Here are the stats for the index:

# select * from pg_stats where tablename = 'url_uh2_idx';
-[ RECORD 1 ]-----
+-----------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------------------------------
schemaname        | public
tablename         | url_uh2_idx
attname           | pg_expression_1
null_frac         | 0
avg_width         | 20
n_distinct        | -1
most_common_vals  |
most_common_freqs |
histogram_bounds  | {(11890a55087088877a5a01ea5b8a8459e62491c9),
(3c983e95c87b3ba8669ade6fe1be797495be671f),
(4ac4698bc9a61b94eef527c6599217a0bbcd1ad6),
(4ac4b7c874b9b16ceab038e93646510a8fb756ca),
(60d74740f0a49e68e89e01232373ed2da0ca4c2f),
(67fe336eb73e327491a4e8bf224bce5857879188),
(7975336e56ad4ee211ebee9ff2a6f1e4c9e43ca5),
(88a9336e7a8e5419ff63ed1cec6985d91cd15688),
(909b76d12cf6c84255a33a008bf91f0cf0f711c4),
(d687336e90c41fc92776dc35896797ef873c38f5),
(f4aaba84e7e86a4c7d7c4d5930128cb3a5889cbf)}
correlation       | -0.109073

Should I just create a cast function rather then an entire operator
class that just respects the host part?  Is this bad style?  If you
have questions I'd be happy to explain more.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com

Re: Operator Classes and ANALYZE

From
"Simon Riggs"
Date:
On Fri, 2006-11-03 at 15:16 -0700, Rusty Conover wrote:

> It doesn't appear that ANALYZE uses the specified operator class for
> producing statistics on an index when that operator class is not the
> default for the data type.  This appears to be leading to poor query
> planning.

> For speed of indexing a few million urls I'm indexing them with a
> custom data type. The data type aptly named "urlhash" has:

Have you read the CREATE TYPE man page, specifically with regard to the
analyze_function clause? Basically, if you want anything different, you
have to write an analysis function yourself. This is what PostGIS
(www.postgis.org) does, if you want to look for specific code examples.

ANALYZE collects stats for tables, not indexes, using the default
operator class for the datatype. So even though you've clearly specified
an opclass for the index, no stats will be collected using it.

Alternatively, perhaps you have fallen foul of this situation?

/*
 * Can't analyze if the opclass uses a storage type
 * different from the expression result type. We'd get
 * confused because the type shown in pg_attribute for
 * the index column doesn't match what we are getting
 * from the expression. Perhaps this can be fixed
 * someday, but for now, punt.
 */

It's in the analyze.c code, but not in the docs.

Be interested in a full report of your research, once you're done.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: Operator Classes and ANALYZE

From
Rusty Conover
Date:
On Nov 6, 2006, at 1:53 PM, Simon Riggs wrote:

> On Fri, 2006-11-03 at 15:16 -0700, Rusty Conover wrote:
>
>> It doesn't appear that ANALYZE uses the specified operator class for
>> producing statistics on an index when that operator class is not the
>> default for the data type.  This appears to be leading to poor query
>> planning.
>
>> For speed of indexing a few million urls I'm indexing them with a
>> custom data type. The data type aptly named "urlhash" has:
>
> Have you read the CREATE TYPE man page, specifically with regard to
> the
> analyze_function clause? Basically, if you want anything different,
> you
> have to write an analysis function yourself. This is what PostGIS
> (www.postgis.org) does, if you want to look for specific code
> examples.


I have read it and works great, because it says it uses the default
operators = and < to build the statistics which should work great for
the default operator class.    I don't think I need to write my own
analyze function because the default behavior works so well.  I just
want the ANALYZE call to use the index's opclass definitions of = and
< if the index is created with a custom operator class that is not
the default for the data type.


>
> ANALYZE collects stats for tables, not indexes, using the default
> operator class for the datatype. So even though you've clearly
> specified
> an opclass for the index, no stats will be collected using it.
>

I don't think thats correct because, with indexes based on functions
there are statistics that clearly can't be gathered just directly
from the table.  Analyze does look at indexes to build up the
histogram statistics and correlation.  See compute_index_stats in
analyze.c.


> Alternatively, perhaps you have fallen foul of this situation?
>

> /*
>  * Can't analyze if the opclass uses a storage type
>  * different from the expression result type. We'd get
>  * confused because the type shown in pg_attribute for
>  * the index column doesn't match what we are getting
>  * from the expression. Perhaps this can be fixed
>  * someday, but for now, punt.
>  */
>
> It's in the analyze.c code, but not in the docs.
>
> Be interested in a full report of your research, once you're done.
>

I don't think this applies because the storage type is the same its
just a separate operator class being used.

I think the problem may be with the "ordering_oper" and
"equality_oper" only being passed the type rather then the type and
the operator class that's being used for the index.  Looking more it
goes back into typcache.c which only caches the default operators
from the default btree or hash operator classes.  Changing this looks
like it would require a bit of effort if it is the right path to pursue.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com

Re: Operator Classes and ANALYZE

From
"Simon Riggs"
Date:
On Mon, 2006-11-06 at 14:47 -0700, Rusty Conover wrote:

> I just
> want the ANALYZE call to use the index's opclass definitions of = and
> < if the index is created with a custom operator class that is not
> the default for the data type.

Which is exactly what the manual specifically says it doesn't do,
therefore you'll need the analyze_function. That capability was put
there deliberately to help you out, in this situation.

> I don't think I need to write my own analyze function

That is the solution, not a workaround.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: Operator Classes and ANALYZE

From
Rusty Conover
Date:
On Nov 6, 2006, at 3:20 PM, Simon Riggs wrote:

> On Mon, 2006-11-06 at 14:47 -0700, Rusty Conover wrote:
>
>> I just
>> want the ANALYZE call to use the index's opclass definitions of = and
>> < if the index is created with a custom operator class that is not
>> the default for the data type.
>
> Which is exactly what the manual specifically says it doesn't do,
> therefore you'll need the analyze_function. That capability was put
> there deliberately to help you out, in this situation.


Since the analyze function is part of the type not the operator class
(making it type specific not operator class specific), it doesn't
appear that it will be able to do what I ask since you can only have
one analyze function per type and its not passed the operator context
for the particular run.

from analyze.h:

/*----------
* ANALYZE builds one of these structs for each attribute (column)
that is
* to be analyzed.      The struct and subsidiary data are in
anl_context,
* so they live until the end of the ANALYZE operation.
*
* The type-specific typanalyze function is passed a pointer to this
struct
* and must return TRUE to continue analysis, FALSE to skip analysis
of this
* column.      In the TRUE case it must set the compute_stats and
minrows fields,
* and can optionally set extra_data to pass additional info to
compute_stats.
* minrows is its request for the minimum number of sample rows to be
gathered
* (but note this request might not be honored, eg if there are fewer
rows
* than that in the table).
*
*/

I still think this is a deficiency in the analyze function to not use
the operator_class that the index uses when producing statistics for
that index.

Thanks,

Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com

Re: Operator Classes and ANALYZE

From
David Fetter
Date:
On Mon, Nov 06, 2006 at 08:53:28PM +0000, Simon Riggs wrote:
> On Fri, 2006-11-03 at 15:16 -0700, Rusty Conover wrote:
>
> > It doesn't appear that ANALYZE uses the specified operator class for
> > producing statistics on an index when that operator class is not the
> > default for the data type.  This appears to be leading to poor query
> > planning.
>
> > For speed of indexing a few million urls I'm indexing them with a
> > custom data type. The data type aptly named "urlhash" has:
>
> Have you read the CREATE TYPE man page, specifically with regard to the
> analyze_function clause? Basically, if you want anything different, you
> have to write an analysis function yourself. This is what PostGIS
> (www.postgis.org) does, if you want to look for specific code examples.
>
> ANALYZE collects stats for tables, not indexes, using the default
> operator class for the datatype. So even though you've clearly specified
> an opclass for the index, no stats will be collected using it.
>
> Alternatively, perhaps you have fallen foul of this situation?
>
> /*
>  * Can't analyze if the opclass uses a storage type
>  * different from the expression result type. We'd get
>  * confused because the type shown in pg_attribute for
>  * the index column doesn't match what we are getting
>  * from the expression. Perhaps this can be fixed
>  * someday, but for now, punt.
>  */
>
> It's in the analyze.c code, but not in the docs.

Should a doc patch be in the offing here?

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: Operator Classes and ANALYZE

From
"Simon Riggs"
Date:
On Mon, 2006-11-06 at 15:54 -0700, Rusty Conover wrote:

> I still think this is a deficiency in the analyze function to not use
> the operator_class that the index uses when producing statistics for
> that index.

Agreed, but that isn't the way it works right now, AFAICS. TODO...

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: Operator Classes and ANALYZE

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Mon, 2006-11-06 at 15:54 -0700, Rusty Conover wrote:
>
> > I still think this is a deficiency in the analyze function to not use
> > the operator_class that the index uses when producing statistics for
> > that index.
>
> Agreed, but that isn't the way it works right now, AFAICS. TODO...

Do people want this as a TODO?  What would the wording be?

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +