Re: New approach to ye olde cross-datatype indexing problem - Mailing list pgsql-hackers

From Tom Lane
Subject Re: New approach to ye olde cross-datatype indexing problem
Date
Msg-id 29832.1068682253@sss.pgh.pa.us
Whole thread Raw
In response to New approach to ye olde cross-datatype indexing problem  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: New approach to ye olde cross-datatype indexing problem
List pgsql-hackers
I wrote:
> What if we attack the problem head on --- make indexes able to handle
> cross-datatype comparison operators directly?  That would solve the
> problem without *any* semantic side-effects.  AFAIR we've never seriously
> considered doing that; I guess anyone who thought of it dismissed it as
> too hard.  But I've been studying the idea for the last day or so and I'm
> starting to think that it's eminently practical.

In fact, it works --- as of CVS tip, you can do:

regression=# create table foo (f1 bigint primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
CREATE TABLE

regression=# explain select * from foo where f1 = 44;                            QUERY PLAN
--------------------------------------------------------------------Index Scan using foo_pkey on foo  (cost=0.00..4.82
rows=1width=8)  Index Cond: (f1 = 44)
 
(2 rows)

or even

regression=# explain select * from foo where f1 = 44::int2;                            QUERY PLAN
--------------------------------------------------------------------Index Scan using foo_pkey on foo  (cost=0.00..4.82
rows=1width=8)  Index Cond: (f1 = 44::smallint)
 
(2 rows)


As seems to happen every time I touch the btree code, I learned a few
formerly-undocumented details about how it works :-(.  It turns out that
allowing an operator class to include operators between the indexed
datatype and other datatypes isn't really sufficient for everything the
backend wants to do.  In particular, there are places where the system
wants to determine which of two inequalities is a tighter constraint.
For example, consider a query like
SELECT ... WHERE x > 4 AND x >= 5;

The x > 4 condition is redundant and can be thrown away.  The btree
subsystem actually has code to do this, and in fact was *depending on*
the assumption that it could do so, because there were bits of logic
that assumed they'd not have to deal with more than one inequality
constraint at a time.  Now in order to do this, the code has to compare
the constants (4 and 5).  This is no problem as long as an operator
class deals with just a single datatype --- you can pick the needed
comparison operator out of the opclass.  However, in a cross-type
situation you could have something like
SELECT ... WHERE int8var > 4::int4 AND int8var >= 5::int2;

where the operators appearing in the query are int84lt and int82le.
To determine which condition is a tighter constraint, you'd need to
apply an int4-vs-int2 comparison --- and there are no such operators
available in the int8 opclass.

Since the whole point of an opclass is to encapsulate a particular
comparison semantics, it's not reasonable for the system to guess at
which comparison operator on the other type(s) is compatible with the
opclass it's working with.  And coercing the comparison values to
the opclass' own datatype is no good (that opens up all the problems
we wanted to avoid, such as possible overflow).

For the moment, I've just punted this problem; the btree code will not
optimize away redundant cross-type comparisons.  (I went and fixed the
places that failed to cope with extra conditions.)  It would be nice
to do better, but it's surely a much lower priority than getting
cross-type comparisons to be indexed at all.

Looking to the long term of how we might deal with this if we wanted
to invest more effort, I can think of a couple of ways to proceed.

One is to allow an opclass to contain operators that aren't directly
usable with the index at all, but do apply to datatypes that are used
within the opclass --- for instance, we'd have to add all the
int4-vs-int4 operators, *and* the int4-vs-int2 and int2-vs-int2
operators, into the int8 opclass; likewise for the int2 and int4
opclasses.  This seems like a recipe for bloating pg_amop quite a lot;
there'd be a lot of redundancy among these opclasses.

The other idea I had was to somehow create associations of opclasses,
so that you could say that "int2_ops, int4_ops, and int8_ops all sort
values compatibly, so it's okay to go find the operator you need in
int4_ops or int2_ops".  I don't have any concrete idea about how to
represent that though.  Any thoughts about that, or other ideas
altogether?

For the moment I'm satisfied with what we have; this is already a big
step forward on a problem that's been there forever.  But there's always
something else that could be done ...
        regards, tom lane


pgsql-hackers by date:

Previous
From: Andrew Sullivan
Date:
Subject: Re: [GENERAL] Proposal for a cascaded master-slave replication system
Next
From: Jan Wieck
Date:
Subject: ARC buffer strategy committed