Thread: New thoughts about indexing cross-type comparisons

New thoughts about indexing cross-type comparisons

From
Tom Lane
Date:
We've spent much effort trying to solve the "int8col = 42 doesn't use
an index" class of problems.  AFAIR, all the recent tries have focused
on trying to get the parser to choose an index-compatible operator
initially.  (In this example, that would mean promoting 42 to int8 so
that int8 = int8 would be chosen at parse time.)  While I have not
completely given up hope on that approach, it's clearly fraught with
potential for unwanted side-effects.

The other place we could try to fix it is to improve either the planner
or the index AMs themselves to cope with cross-type comparisons for
themselves.  I have shied away from that thought because it seemed
unreasonably difficult, but with the parser-side solution looking harder
and harder, maybe it's time to re-evaluate.

There actually was code in the planner at one time to substitute
index-compatible operators for non-index-compatible operators.
The algorithm was basically "if I have indexcolumn OP somevalue, where
OP isn't a member of the index opclass but also somevalue is of a
different datatype than indexcolumn, then look to see if somevalue can
be binary-coerced to the datatype of indexcolumn.  If so, look for an
operator of the same name as OP and accepting the indexcolumn datatype
on both sides.  If found, and it's a member of the index opclass, then
use that operator instead of the original."

This algorithm was wrong on both practical and theoretical levels;
in the first place it's not very helpful to only be able to handle
binary-compatible transformations, and in the second place there isn't
any good guarantee that it's not changing the semantics when it replaces
the operator.  For instance int4 < and oid < do not act the same.
Depending on equality of operator names was a bad idea even then, and
would be quite unworkable now in the world of schema search paths.

However, those objections really stem from the planner not having enough
information to know when the transformation was safe to make.  What if
we gave it that information?

After some thought I think the most practical approach is to make a new
system catalog for "secondary members of index opclasses".  A secondary
member is not one of the operators that the index can handle directly,
but it can be transformed into one of the primary members.  The new
catalog would be called, say, pg_amsecop, and would have columns like   opclass    operatorid    replacementop
ltransform   rtransform
 
where the first two are the primary key.  The idea is when we have a
WHERE expression "indexcol OP somevalue", and we can't find OP in the
index's operator class (ie, there's no pg_amop entry for that operator
and opclass), we next look to see if the opclass/operator combination
appears in pg_amsecop.  If so, then we are allowed to replace the
"indexcol OP somevalue" expression by "ltransform(indexcol)
REPLACEMENTOP rtransform(somevalue)".  ltransform and rtransform are
pg_proc OIDs of cast functions, or zero where no run-time cast is
needed.  replacementop is the OID of the substitute operator, which
presumably is one of the primary members of the index opclass.  Now we
have an indexable expression.  The presence of the entry in pg_amsecop
represents the opclass designer's promise to us that this is a valid,
semantics-preserving transformation.

This design might be overly general --- for example, I doubt there can
be any need to apply a cast function to the indexable column.  We could
eliminate the ltransform column and probably also halve the number of
entries in the table if we insist that the indexable column be on the
left before we start looking (in other words, "somevalue OP indexcol"
must be commuted before we look for the operator in pg_amsecop, not after).

This is not a 100% solution to our problems.  I don't think we could use
it to solve the problem for int2 columns ("int2col = 42") because it'd
be unsafe to promise that an int4-to-int2 cast could be inserted into
an expression without changing the behavior.  So we'd still want to look
at having small integer constants be initially typed as int2, which
leaves us with a number of unsolved issues, as noted here:
http://archives.postgresql.org/pgsql-hackers/2002-11/msg00468.php
So maybe the "secondary operator" idea will help, or maybe it won't do
much for us.

Comments?  Does this spur any better ideas?
        regards, tom lane


Re: New thoughts about indexing cross-type comparisons

From
Dave Smith
Date:
If this is only dealing with constants, why not just explicitly add a 
cast to the constant of the column type at the planner level. It would 
solve this problem as well ...

create table test (f int2);
select * from test where f=cast('1981928928921' as int2);
ERROR:  pg_atoi: error reading "1981928928921": Numerical result out of 
range
 select * from test where f=1981928928921; f
---
(0 rows)



Tom Lane wrote:
> We've spent much effort trying to solve the "int8col = 42 doesn't use
> an index" class of problems.  AFAIR, all the recent tries have focused
> on trying to get the parser to choose an index-compatible operator
> initially.  (In this example, that would mean promoting 42 to int8 so
> that int8 = int8 would be chosen at parse time.)  While I have not
> completely given up hope on that approach, it's clearly fraught with
> potential for unwanted side-effects.
> 
> The other place we could try to fix it is to improve either the planner
> or the index AMs themselves to cope with cross-type comparisons for
> themselves.  I have shied away from that thought because it seemed
> unreasonably difficult, but with the parser-side solution looking harder
> and harder, maybe it's time to re-evaluate.
> 
> There actually was code in the planner at one time to substitute
> index-compatible operators for non-index-compatible operators.
> The algorithm was basically "if I have indexcolumn OP somevalue, where
> OP isn't a member of the index opclass but also somevalue is of a
> different datatype than indexcolumn, then look to see if somevalue can
> be binary-coerced to the datatype of indexcolumn.  If so, look for an
> operator of the same name as OP and accepting the indexcolumn datatype
> on both sides.  If found, and it's a member of the index opclass, then
> use that operator instead of the original."
> 
> This algorithm was wrong on both practical and theoretical levels;
> in the first place it's not very helpful to only be able to handle
> binary-compatible transformations, and in the second place there isn't
> any good guarantee that it's not changing the semantics when it replaces
> the operator.  For instance int4 < and oid < do not act the same.
> Depending on equality of operator names was a bad idea even then, and
> would be quite unworkable now in the world of schema search paths.
> 
> However, those objections really stem from the planner not having enough
> information to know when the transformation was safe to make.  What if
> we gave it that information?
> 
> After some thought I think the most practical approach is to make a new
> system catalog for "secondary members of index opclasses".  A secondary
> member is not one of the operators that the index can handle directly,
> but it can be transformed into one of the primary members.  The new
> catalog would be called, say, pg_amsecop, and would have columns like
>     opclass    operatorid    replacementop    ltransform    rtransform
> where the first two are the primary key.  The idea is when we have a
> WHERE expression "indexcol OP somevalue", and we can't find OP in the
> index's operator class (ie, there's no pg_amop entry for that operator
> and opclass), we next look to see if the opclass/operator combination
> appears in pg_amsecop.  If so, then we are allowed to replace the
> "indexcol OP somevalue" expression by "ltransform(indexcol)
> REPLACEMENTOP rtransform(somevalue)".  ltransform and rtransform are
> pg_proc OIDs of cast functions, or zero where no run-time cast is
> needed.  replacementop is the OID of the substitute operator, which
> presumably is one of the primary members of the index opclass.  Now we
> have an indexable expression.  The presence of the entry in pg_amsecop
> represents the opclass designer's promise to us that this is a valid,
> semantics-preserving transformation.
> 
> This design might be overly general --- for example, I doubt there can
> be any need to apply a cast function to the indexable column.  We could
> eliminate the ltransform column and probably also halve the number of
> entries in the table if we insist that the indexable column be on the
> left before we start looking (in other words, "somevalue OP indexcol"
> must be commuted before we look for the operator in pg_amsecop, not after).
> 
> This is not a 100% solution to our problems.  I don't think we could use
> it to solve the problem for int2 columns ("int2col = 42") because it'd
> be unsafe to promise that an int4-to-int2 cast could be inserted into
> an expression without changing the behavior.  So we'd still want to look
> at having small integer constants be initially typed as int2, which
> leaves us with a number of unsolved issues, as noted here:
> http://archives.postgresql.org/pgsql-hackers/2002-11/msg00468.php
> So maybe the "secondary operator" idea will help, or maybe it won't do
> much for us.
> 
> Comments?  Does this spur any better ideas?
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faqs/FAQ.html
> 



Re: New thoughts about indexing cross-type comparisons

From
Tom Lane
Date:
Dave Smith <dave.smith@candata.com> writes:
> If this is only dealing with constants, why not just explicitly add a 
> cast to the constant of the column type at the planner level. It would 
> solve this problem as well ...

> create table test (f int2);
> select * from test where f=cast('1981928928921' as int2);
> ERROR:  pg_atoi: error reading "1981928928921": Numerical result out of 
> range

>   select * from test where f=1981928928921;
>   f
> ---
> (0 rows)

Uh, your own example points out why we can't just insert a cast to int2
--- it changes the results.  Or am I missing your point?

(In any case, we aren't necessarily dealing with constants.)
        regards, tom lane


Re: New thoughts about indexing cross-type comparisons

From
Dave Smith
Date:
My point was that it was inconstant behavour. What exactly are you 
comparing with int2? To me the case without the cast should should throw 
the same error as the statement with the cast.



Tom Lane wrote:
> Dave Smith <dave.smith@candata.com> writes:
> 
>>If this is only dealing with constants, why not just explicitly add a 
>>cast to the constant of the column type at the planner level. It would 
>>solve this problem as well ...
> 
> 
>>create table test (f int2);
>>select * from test where f=cast('1981928928921' as int2);
>>ERROR:  pg_atoi: error reading "1981928928921": Numerical result out of 
>>range
> 
> 
>>  select * from test where f=1981928928921;
>>  f
>>---
>>(0 rows)
> 
> 
> Uh, your own example points out why we can't just insert a cast to int2
> --- it changes the results.  Or am I missing your point?
> 
> (In any case, we aren't necessarily dealing with constants.)
> 
>             regards, tom lane
> 



Re: New thoughts about indexing cross-type comparisons

From
Tom Lane
Date:
Dave Smith <dave.smith@candata.com> writes:
> My point was that it was inconstant behavour. What exactly are you 
> comparing with int2? To me the case without the cast should should throw 
> the same error as the statement with the cast.

> select * from test where f=1981928928921;

I contend not.  The above is perfectly well defined.  It will always
return false if f is int2, but that does not mean it should throw an
error instead.  In any standard programming language, you'd resolve
the operator by up-converting f to the type of the constant, not by
trying to down-convert the wider value.  PG happens to have
implementation reasons to wish to use the variable's datatype instead
of the constant's, but that doesn't excuse us from obeying the ordinary
laws of arithmetic.
        regards, tom lane


Re: New thoughts about indexing cross-type comparisons

From
Hiroshi Inoue
Date:
Tom Lane wrote:
> 
> Dave Smith <dave.smith@candata.com> writes:
> > If this is only dealing with constants, why not just explicitly add a
> > cast to the constant of the column type at the planner level. It would
> > solve this problem as well ...
> 
> > create table test (f int2);
> > select * from test where f=cast('1981928928921' as int2);
> > ERROR:  pg_atoi: error reading "1981928928921": Numerical result out of
> > range
> 
> >   select * from test where f=1981928928921;
> >   f
> > ---
> > (0 rows)
> 
> Uh, your own example points out why we can't just insert a cast to int2
> --- it changes the results.  Or am I missing your point?

Because PG doesn't have the functionality to catch exceptions
until now, we can't call cast functions tentatively. But for
example, why should the pg_atoi error cause an unrecoverable
error immediately ? IMHO we should be able to call cast functions
internally without an anxiety to cause an unrecoverable error. 

regards,
Hiroshi Inouehttp://www.geocities.jp/inocchichichi/psqlodbc/


Re: New thoughts about indexing cross-type comparisons

From
Dennis Bjorklund
Date:
On Tue, 16 Sep 2003, Tom Lane wrote:

> This is not a 100% solution to our problems.  I don't think we could use
> it to solve the problem for int2 columns ("int2col = 42") because it'd
> be unsafe to promise that an int4-to-int2 cast could be inserted into
> an expression without changing the behavior.
> 
> Comments?  Does this spur any better ideas?

I still think that this is a type inference problem and nothing else. The
problem with an expression like int2col = 42 is that the parser tries
directly to put a type on the constant 42 while it shouldn't. 42 is not a
int4 value, it's also not a int8 value. By introducing a type variables
and say that the 42 :: alpha and some constraint saying "Int alpha" then
then you after parsing look at all your type variables and try to assign
real types to them that is consistent (the type inference phase). Another 
way is to say that 42 :: int2 and then having some subtyping rules saying 
that int2 < int4 and int4 < int8 so the type system can insert coercions 
in order to make a concistent typing of the term.

Things to look at is Hindley-Miller type inference and subtyping.

What I'm afraid of is that there are more and more ad hock solutions added
to postgresql but no overall strategy. For example in 7.4 there are
polymorphism that is something that hindley miller type systems have
always been good at but I don't know exactly what pg uses and how it's
checking types. I'm not saying that there is anything wrong with
polymorphism as added. I don't know enough of it to say anything about it.

Another thing I don't understand is why '42' works but not 42. Is it the
case that you can always write '42' in places where pg expects an integer?  
That suggest to me that '42' is treated as something with an unknown type
(let's call it alpha), where the type is infered later on depending on how
it is used (alpha is assigned to int2 or something). Sounds very much like
what I talk about above. Why is not 42 treated in the same way? If someone
want 42 to be an integer to for example select a specific overloaded
function they would have to write 42::integer instead (or some defaulting 
could be added).

I feal bad to talk about the pg type system without knowing all details
about it. But I do know a little about type systems in general, so I still
want to share some thoughts. I wrote a similar message 6 months ago or
something. If you get a deja vu feeling you know why :-)

I wish I had lots of money so I could take 6 months off and just spend on
pg. Then I would start by writing down how the current type system works
and try to create a new nice type system that works without a lot of ad
hock solutions. Unfortunatly there is a reality where you have to pay 
the rent and so on... 

-- 
/Dennis



Re: New thoughts about indexing cross-type comparisons

From
Tom Lane
Date:
Dennis Bjorklund <db@zigo.dhs.org> writes:
> I still think that this is a type inference problem and nothing else.

Well, perhaps ripping out the entire type resolution algorithm and
replacing it with something less ad-hoc is indeed the right long-term
answer.  I don't have time to do that though.  Do you?  It seems like
a fairly large project with possibly zero payback, if we find that too
many behavioral incompatibilities are introduced to make it adoptable.

> Another thing I don't understand is why '42' works but not 42.

Because in the '42' case the determination of the constant's type is
indeed postponed ... though no doubt in a way that's too ad-hoc ;-)

Another thing to keep in mind is that it's not always the case that
assigning the right type to a literal constant would solve the problem.
We have the same issues with variables; for example, a join with
"WHERE a.int8col = b.int4col" may fail to take advantage of available
indexes.

Something else that's been troubling me about the issue is that the most
efficient operator to use is really context-dependent.  In the example
of "WHERE a.int8col = b.int4col", we basically have two possible
choices; use a cross-type operator (int8 = int4), or introduce a
coercion, so that the clause becomes "WHERE a.int8col = b.int4col::int8"
using an int8 = int8 operator.  The latter is a win when considering
indexscans on the int8 column, and also a win for hash joins (cross-type
operators in general aren't hashable).  *But* the former is a win for
merge joins, and particularly for implied equality deduction.  If we
force the coercion-based representation to be used then the planner will
be able to deduce that a.int8col and b.int4col::int8 are equal, but it
would have to make a leap of faith to relate that knowledge to the bare
b.int4col column.  This would eg. prevent the use of a mergejoin with
use of an index on the int4 column.

So I'm beginning to think that avoiding cross-type operators is not the
right route to a solution anyway.  It may be better to leave the parser
alone and teach the planner how to switch to the alternate
representation when and where appropriate.  (The connection to hash
joins suggests that tying this closely to index opclasses is wrong,
though.)
        regards, tom lane


Re: New thoughts about indexing cross-type comparisons

From
Dennis Bjorklund
Date:
On Wed, 17 Sep 2003, Tom Lane wrote:

> Another thing to keep in mind is that it's not always the case that
> assigning the right type to a literal constant would solve the problem.
> We have the same issues with variables; for example, a join with
> "WHERE a.int8col = b.int4col" may fail to take advantage of available
> indexes.

Sure, which is why subtyping is needed to insert the correct coercions
(relations like int4 < int8 and such). Generating all constraints and then
have a planner phase/inference phase in the end that makes the chioces. By
not forcing anything too early you give the inference/planner the
possibility to make good choices (which might depend on what indexes are
available and so on).

> Something else that's been troubling me about the issue is that the most
> efficient operator to use is really context-dependent.  In the example
> of "WHERE a.int8col = b.int4col", we basically have two possible
> choices; use a cross-type operator (int8 = int4), or introduce a
> coercion, so that the clause becomes "WHERE a.int8col = b.int4col::int8"
> using an int8 = int8 operator.  The latter is a win when considering
> indexscans on the int8 column, and also a win for hash joins (cross-type
> operators in general aren't hashable).  *But* the former is a win for
> merge joins, and particularly for implied equality deduction.

Yes, by not making the decission directly but just setting constrants,
then you in the end have a set of possible solutions. And then the planner
have to choose among those.

> If we force the coercion-based representation to be used then the
> planner will be able to deduce that a.int8col and b.int4col::int8 are
> equal, but it would have to make a leap of faith to relate that
> knowledge to the bare b.int4col column.  This would eg. prevent the use
> of a mergejoin with use of an index on the int4 column.

If I understand what you say correctly, then yes. If you have an inference
phase before the planner phase that decides to insert a coercion from int4
to int8 then you have made a decission that forces you to use one plan
over another. And that should be made by the planner and not the type
inference.

> So I'm beginning to think that avoiding cross-type operators is not the
> right route to a solution anyway.  It may be better to leave the parser
> alone and teach the planner how to switch to the alternate
> representation when and where appropriate.

Yes, and the planner needs information about what alternative
representations there are (which is the same as letting the planner insert
coercions and selecting what == operator to use, or are you thinking of
something else).

Another problem is of course if one let the planner do too much work and 
have to many possible plans to choose from, it has to be fast.

-- 
/Dennis



Re: New thoughts about indexing cross-type comparisons

From
Tom Lane
Date:
Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Wed, 17 Sep 2003, Tom Lane wrote:
>> So I'm beginning to think that avoiding cross-type operators is not the
>> right route to a solution anyway.  It may be better to leave the parser
>> alone and teach the planner how to switch to the alternate
>> representation when and where appropriate.

> Yes, and the planner needs information about what alternative
> representations there are (which is the same as letting the planner insert
> coercions and selecting what == operator to use, or are you thinking of
> something else).

Right.  My pg_amsecop catalog proposal of yesterday could be seen as
providing knowledge about valid/useful transformations of this form
(ignoring the linkage to index opclasses, which I now see to be
possibly irrelevant).

> Another problem is of course if one let the planner do too much work and 
> have to many possible plans to choose from, it has to be fast.

I don't think it's a big problem; this would add at most one catalog
lookup in each situation where an indexed column is compared to
something of a different type.  And the plans involved would usually
be ones we'd want to find, I think.
        regards, tom lane


Re: New thoughts about indexing cross-type comparisons

From
Kevin Brown
Date:
Tom Lane wrote:
> Dave Smith <dave.smith@candata.com> writes:
> > My point was that it was inconstant behavour. What exactly are you 
> > comparing with int2? To me the case without the cast should should throw 
> > the same error as the statement with the cast.
> 
> > select * from test where f=1981928928921;
> 
> I contend not.  The above is perfectly well defined.  It will always
> return false if f is int2, but that does not mean it should throw an
> error instead.  In any standard programming language, you'd resolve
> the operator by up-converting f to the type of the constant, not by
> trying to down-convert the wider value.  PG happens to have
> implementation reasons to wish to use the variable's datatype instead
> of the constant's, but that doesn't excuse us from obeying the ordinary
> laws of arithmetic.

Hmm...but what if the cast were to return NULL in the event that the cast
fails or cannot be done?  Would that even be reasonable?  I don't know
what the standard says about this so my suggestion may be unreasonable
(and it may break a lot of things as well).

In a way, this would be consistent with the meaning of NULL: "no value",
and would also yield the desired effect in the example select (no
matches).


Of course, I could always be off my rocker here.  :-)

-- 
Kevin Brown                          kevin@sysexperts.com


Re: New thoughts about indexing cross-type comparisons

From
Tom Lane
Date:
Kevin Brown <kevin@sysexperts.com> writes:
> Hmm...but what if the cast were to return NULL in the event that the cast
> fails or cannot be done?  Would that even be reasonable?

Yeah, I was wondering about that myself.  I'd not want to try to use
such an idea in general, but if we find that int2 indexes are the only
sore spot in an otherwise-useful solution, some klugery for int2 might
be the way to go.  What I was visualizing was that for an int2 index,
we might transform "int2col op int4-or-int8-comparison-value" into
"int2col int2op special_cast_fn(int4-or-int8-comparison-value)"
where the trick is to make up a good special_cast_fn (possibly one
specific to the comparison op being used).

Returning NULL might be an acceptable substitute when the cast function
wants to force an always-false answer, but what about cases where it
needs to force an always-true answer?  For instanceint2col < 1000000
should yield true always.  There's no int2 value the cast function
could output to make that happen.  I thought maybe we could hack it
by changing the operator to "<=" and introducing an offset of -1 in the
cast function to compensate.  I haven't worked out all the combinations
though, and I'm not real sure that it's acceptable to substitute NULL
for always-false cases.  It'd work at the top level of WHERE but
possibly not in other cases where indexscanning is desirable.
        regards, tom lane


Re: New thoughts about indexing cross-type comparisons

From
Dave Smith
Date:
For the int2col op value I have this table for when the cast returns NULL

value <0
<, <= ,= int2col=null>,>= in2col is not null

value > 0
<,<= in2col is not null
=,>,>= int2col=null

Im not sure why pg allows me to do a int2col=null and returns nothing 
so I am assuming that internally pg just resolves this to false.




Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> 
>>Hmm...but what if the cast were to return NULL in the event that the cast
>>fails or cannot be done?  Would that even be reasonable?
> 
> 
> Yeah, I was wondering about that myself.  I'd not want to try to use
> such an idea in general, but if we find that int2 indexes are the only
> sore spot in an otherwise-useful solution, some klugery for int2 might
> be the way to go.  What I was visualizing was that for an int2 index,
> we might transform "int2col op int4-or-int8-comparison-value" into
> "int2col int2op special_cast_fn(int4-or-int8-comparison-value)"
> where the trick is to make up a good special_cast_fn (possibly one
> specific to the comparison op being used).
> 
> Returning NULL might be an acceptable substitute when the cast function
> wants to force an always-false answer, but what about cases where it
> needs to force an always-true answer?  For instance
>     int2col < 1000000
> should yield true always.  There's no int2 value the cast function
> could output to make that happen.  I thought maybe we could hack it
> by changing the operator to "<=" and introducing an offset of -1 in the
> cast function to compensate.  I haven't worked out all the combinations
> though, and I'm not real sure that it's acceptable to substitute NULL
> for always-false cases.  It'd work at the top level of WHERE but
> possibly not in other cases where indexscanning is desirable.
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faqs/FAQ.html
>