Thread: Implied Functional Index use

Implied Functional Index use

From

Simon Riggs

Date:

11 July 2006, 17:55:04

Currently, functional indexes can be used by a query that explicitly
mentions the exact phrasing of the functional index within the WHERE
clause.

IMHO it is feasible to extend the range of WHERE clauses for which
functional indexes can be used by using implication, much in the same
way that we use transitive closure in join queries now. This has some
powerful and important uses, just one of which is index prefix
compression (so read on).

If we have a subclause within a WHERE clause of the form
col1 = k1 AND col2 = k2 AND ... colN = kN

this implies that the following clause is also true:
col1 = k1 AND col2 = k2 AND ... colN = kN
AND    F(col1, col2 ... colN) = F(k1, k2 ... kN)

iff:
- the function F is IMMUTABLE, since that definition implies that the
output is always the same for any set of constant inputs.
- the operator that connects each col1,k1 pair must be defined as true
equality (note that in the above example what looks like the same
equality operator is in fact context dependent on the datatypes)
- the datatypes of each col1, k1 pair must match

If we have a Functional Index F(col1, col2...colN) then we can use the
second implied form of the subclause to recognise that the index can 
be useful for the query. This would then allow a query plan that uses an
IndexScan on the functional index with a re-check filter of the original
query clause.

An example might be a query using the clausesurname = 'RIGGS' 
would allow us to use an index defined asmetaphone(surname)
even though the original application was written before we added the
index. Note that the index would be significantly smaller than an index
on the full surname, as well as avoiding some of the hotspots caused by
certain most-frequent-values in the data.

An alternative example might be to define an index onsubstr(text_col, 1, 100)
which still allows searching for longer strings without the need to
store the complete string in the index. This is effectively index prefix
compression, which is not directly possible with pgsql because of the
requirements of datatype encapsulation. (Note that this avoids the need
to have very long index keys also).

One difficulty to this is defining which operators represent
true-equality. There isn't a definition of this currently for operators
and we cannot assume that an operator called "=" has the required
properties, even if that is true for most built-in types. An example of
true-equality and its difficulties is for the FLOAT type minus-zero (-0)
and plus-zero (+0) have different byte representations yet when compared
with the standard FLOAT comparison operator +0 and -0 would be
considered equal. If we were to put each value through a hash-like
function such as md5() then we would clearly get different answers.

To take advantage of this, I propose to
- add code to be called from allpaths.c: when we check functional
indexes, if they exist and yet are not usable because of a lack of
explicit clauses we will check to see if there any clauses that can be
used to imply the correct use of the functional index. This is in many
ways similar to existing constraint exclusion code.
- add a new boolean to pg_operator to allow us to define which operators
offer true equality
- add a new keyword EQUALITY to CREATE/ALTER OPERATOR (which I think
implies HASHES and MERGES if they are not mentioned explicitly).

No promises for 8.2, but does anyone have further input?

--  Simon Riggs EnterpriseDB          http://www.enterprisedb.com

Re: Implied Functional Index use

From

Tom Lane

Date:

11 July 2006, 20:26:42

Simon Riggs <simon@2ndquadrant.com> writes:
> ...
> - add a new boolean to pg_operator to allow us to define which operators
> offer true equality
> ...

This would be useful for other purposes too, as we keep coming up
against "what's the equality operator for this datatype" problems.
However, the restriction to "true" equality, such that we can assume
x = y implies f(x) = f(y) for every immutable function f on the datatype
(note this need not necessarily mean bitwise equality --- it depends on
what operations the datatype provides), seems like a problem.  For
instance, the ordinary = operators on float and numeric are NOT true
equality, nor do we provide any true equality in this sense for these
common datatypes.  We could hardly get away with using this concept
to drive foreign-key comparisons, if it doesn't work for float or
numeric.

We could invent some more-complex concept involving "well, this is
equality, but there are some functions for which f(x) might differ
from f(y) anyway" and then mark the presumably-few functions that
could produce divergent results --- examples are sgn() for float8
and anything dependent on dscale for numeric.  This seems ugly and
error prone however.

Anyone have a better idea?
        regards, tom lane

Re: Implied Functional Index use

From

Peter Eisentraut

Date:

12 July 2006, 10:09:42

Am Dienstag, 11. Juli 2006 23:31 schrieb Tom Lane:
> We could invent some more-complex concept involving "well, this is
> equality, but there are some functions for which f(x) might differ
> from f(y) anyway" and then mark the presumably-few functions that
> could produce divergent results --- examples are sgn() for float8
> and anything dependent on dscale for numeric.  This seems ugly and
> error prone however.

From a mathematical point of view, it seems cleaner to attach this property to
functions, not operators, namely, "this function preserves the following
relations".  This would allow extending Simon's idea to other operators such
as > and < and possibly more mind-boggling cases with geometric operators and
such.

Admittedly, this would put a lot of additional typing load on function
authors, but perhaps it's safer that way.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Implied Functional Index use

From

Simon Riggs

Date:

12 July 2006, 19:13:46

On Tue, 2006-07-11 at 17:31 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > ...
> > - add a new boolean to pg_operator to allow us to define which operators
> > offer true equality
> > ...
> 
> This would be useful for other purposes too, as we keep coming up
> against "what's the equality operator for this datatype" problems.
> However, the restriction to "true" equality, such that we can assume
> x = y implies f(x) = f(y) for every immutable function f on the datatype
> (note this need not necessarily mean bitwise equality --- it depends on
> what operations the datatype provides), seems like a problem.  For
> instance, the ordinary = operators on float and numeric are NOT true
> equality, nor do we provide any true equality in this sense for these
> common datatypes.  We could hardly get away with using this concept
> to drive foreign-key comparisons, if it doesn't work for float or
> numeric.

Normally, I would not suggest that we do things only for certain data
types only. In this case however, it seems that the reason it would work
only for INTEGER and TEXT data types is that they are simple atomic
datatypes that have the required properties. So doing this for those
datatypes only seems permissable on a theoretical basis, rather than
just because we can't be bothered to do it for more complex types.

--  Simon Riggs EnterpriseDB          http://www.enterprisedb.com

Re: Implied Functional Index use

From

Tom Lane

Date:

12 July 2006, 23:35:13

Simon Riggs <simon@2ndquadrant.com> writes:
> Normally, I would not suggest that we do things only for certain data
> types only. In this case however, it seems that the reason it would work
> only for INTEGER and TEXT data types is that they are simple atomic
> datatypes that have the required properties. So doing this for those
> datatypes only seems permissable on a theoretical basis, rather than
> just because we can't be bothered to do it for more complex types.

There's nothing simple nor atomic about TEXT, and in fact until very
recently text_eq was NOT true equality by this definition.  See
discussions about hu_HU locale back in December.  A number of people
thought that fix was an ugly kluge, and so we may someday go back to
a behavior in which text_eq is again not true equality --- in particular
I'm dubious that such a restriction can survive once we support multiple
encodings/collations in the same database.

More generally, I don't believe in hacks that only work for a small
number of built-in types: to me, that's prima facie evidence that you
haven't thought the problem through.
        regards, tom lane

Re: Implied Functional Index use

From

"Zeugswetter Andreas DCP SD"

Date:

13 July 2006, 04:24:10

> > > - add a new boolean to pg_operator to allow us to define which
> > > operators offer true equality ...
> >
> > This would be useful for other purposes too, as we keep coming up
> > against "what's the equality operator for this datatype" problems.
> > However, the restriction to "true" equality, such that we can assume
x
> > = y implies f(x) = f(y) for every immutable function f on the
datatype

Maybe we could have a tri (or more) state flag for the equality
operators.

' ' .. not an equality op
'e' .. equality
's' .. strict equality (op only true iff the binary representation is
equal)

Andreas

Re: Implied Functional Index use

From

Simon Riggs

Date:

13 July 2006, 05:11:32

On Wed, 2006-07-12 at 22:34 -0400, Tom Lane wrote:

> More generally, I don't believe in hacks that only work for a small
> number of built-in types: to me, that's prima facie evidence that you
> haven't thought the problem through.

I agree an attempt at a simple definition of the required functional
properties hasn't yielded a great solution, so we must go deeper.

On Wed, 2006-07-12 at 15:09 +0200, Peter Eisentraut wrote:

> From a mathematical point of view, it seems cleaner to attach this property to 
> functions, not operators, namely, "this function preserves the following 
> relations".  This would allow extending Simon's idea to other operators such 
> as > and < and possibly more mind-boggling cases with geometric operators and 
> such.

Well, in PG, operators are functions that define various special
properties of their inputs/outputs, so it seems the correct place to put
these definitions. I agree it seems more normalised to place this
information at the function itself, but that is not current precedent.

On Thu, 2006-07-13 at 09:14 +0200, Zeugswetter Andreas DCP SD wrote:

> Maybe we could have a tri (or more) state flag for the equality
> operators.
> 
> ' ' .. not an equality op
> 'e' .. equality
> 's' .. strict equality (op only true iff the binary representation is
> equal)

Tom has pointed out that the required functional properties are actually
a fourth state between equality and full binary equality.

I was trying to avoid introducing new single-use properties, but I think
that is the only way here. My concern was not the complexity of
specifying this for function authors, but the problem of making an
incorrect selection leading to incorrect query results. 

It seems we need this in the catalog:

> ' ' .. not an equality op
> 'e' .. equality (sufficient for FKs)'f' .. functional equality (sufficient for this proposed optimisation)
> 's' .. strict equality (op only true iff the binary representation is
> equal)

We're breaking new ground here, but that's a good thing. I'd much rather
have an optimisable and extensible type system than a hard-wired one.

There is a problem of implication here, AFAICS:
When a user SQL asks WHERE col1 = 7
which equality level is meant when several exist?

We currently already assume that they mean e-equality, since it is the
most useful and intuitive meaning. That is not as strict as f-equality,
so we would not be able to make implications from that.

I'll think on this some more, but not for 8.2.

--  Simon Riggs EnterpriseDB          http://www.enterprisedb.com

Re: Implied Functional Index use

From

"Zeugswetter Andreas DCP SD"

Date:

13 July 2006, 07:49:42

> There is a problem of implication here, AFAICS:
> When a user SQL asks
>     WHERE col1 = 7
> which equality level is meant when several exist?

Well, the operator must be unique, so there is no problem.
Unique in the sense that an operator with the same name ('=' in this
case)
and argument types cannot exist for more than one level of equality.
(and the level should not have an effect on the resolution)

So, when we see col1 = 7 we lookup the equality level of the operator
and decide whether it is strict enough for the particular optimization.

Andreas