Thread: New thoughts about indexing cross-type comparisons
We've spent much effort trying to solve the "int8col = 42 doesn't use an index" class of problems. AFAIR, all the recent tries have focused on trying to get the parser to choose an index-compatible operator initially. (In this example, that would mean promoting 42 to int8 so that int8 = int8 would be chosen at parse time.) While I have not completely given up hope on that approach, it's clearly fraught with potential for unwanted side-effects.

The other place we could try to fix it is to improve either the planner or the index AMs themselves to cope with cross-type comparisons for themselves. I have shied away from that thought because it seemed unreasonably difficult, but with the parser-side solution looking harder and harder, maybe it's time to re-evaluate.

There actually was code in the planner at one time to substitute index-compatible operators for non-index-compatible operators. The algorithm was basically "if I have indexcolumn OP somevalue, where OP isn't a member of the index opclass but also somevalue is of a different datatype than indexcolumn, then look to see if somevalue can be binary-coerced to the datatype of indexcolumn. If so, look for an operator of the same name as OP and accepting the indexcolumn datatype on both sides. If found, and it's a member of the index opclass, then use that operator instead of the original."

This algorithm was wrong on both practical and theoretical levels; in the first place it's not very helpful to only be able to handle binary-compatible transformations, and in the second place there isn't any good guarantee that it's not changing the semantics when it replaces the operator. For instance int4 < and oid < do not act the same. Depending on equality of operator names was a bad idea even then, and would be quite unworkable now in the world of schema search paths.

However, those objections really stem from the planner not having enough information to know when the transformation was safe to make. What if we gave it that information?

After some thought I think the most practical approach is to make a new system catalog for "secondary members of index opclasses". A secondary member is not one of the operators that the index can handle directly, but it can be transformed into one of the primary members. The new catalog would be called, say, pg_amsecop, and would have columns like

    opclass  operatorid  replacementop  ltransform  rtransform

where the first two are the primary key. The idea is when we have a WHERE expression "indexcol OP somevalue", and we can't find OP in the index's operator class (ie, there's no pg_amop entry for that operator and opclass), we next look to see if the opclass/operator combination appears in pg_amsecop. If so, then we are allowed to replace the "indexcol OP somevalue" expression by "ltransform(indexcol) REPLACEMENTOP rtransform(somevalue)". ltransform and rtransform are pg_proc OIDs of cast functions, or zero where no run-time cast is needed. replacementop is the OID of the substitute operator, which presumably is one of the primary members of the index opclass. Now we have an indexable expression. The presence of the entry in pg_amsecop represents the opclass designer's promise to us that this is a valid, semantics-preserving transformation.

This design might be overly general --- for example, I doubt there can be any need to apply a cast function to the indexable column. We could eliminate the ltransform column and probably also halve the number of entries in the table if we insist that the indexable column be on the left before we start looking (in other words, "somevalue OP indexcol" must be commuted before we look for the operator in pg_amsecop, not after).

This is not a 100% solution to our problems. I don't think we could use it to solve the problem for int2 columns ("int2col = 42") because it'd be unsafe to promise that an int4-to-int2 cast could be inserted into an expression without changing the behavior. So we'd still want to look at having small integer constants be initially typed as int2, which leaves us with a number of unsolved issues, as noted here:
http://archives.postgresql.org/pgsql-hackers/2002-11/msg00468.php

So maybe the "secondary operator" idea will help, or maybe it won't do much for us.

Comments? Does this spur any better ideas?

regards, tom lane
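A minimal sketch of how the proposed catalog and the planner's lookup could be shaped. pg_amsecop was never implemented; the column names (secopclass, secoperator, and so on) and the prepared-statement lookup are invented here purely for illustration:

    -- Hypothetical sketch only: pg_amsecop does not exist, and the column
    -- names below are invented to illustrate the proposed shape.
    CREATE TABLE pg_amsecop (
        secopclass    oid NOT NULL,  -- index opclass the entry belongs to
        secoperator   oid NOT NULL,  -- cross-type operator found in the query
        secreplop     oid NOT NULL,  -- substitute operator (a primary opclass member)
        secltransform oid NOT NULL,  -- pg_proc OID of cast for the indexed column, or 0
        secrtransform oid NOT NULL,  -- pg_proc OID of cast for the comparison value, or 0
        PRIMARY KEY (secopclass, secoperator)
    );

    -- Conceptual planner lookup: given the index's opclass and the operator
    -- in "indexcol OP somevalue", fetch the registered rewrite, if any.
    PREPARE amsecop_lookup(oid, oid) AS
        SELECT secreplop, secltransform, secrtransform
          FROM pg_amsecop
         WHERE secopclass = $1
           AND secoperator = $2;

A hit would allow rewriting "indexcol OP somevalue" as "indexcol REPLACEMENTOP secrtransform(somevalue)", with secltransform always zero if the column-side cast is dropped as suggested above.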
If this is only dealing with constants, why not just explicitly add a cast to the constant of the column type at the planner level? It would solve this problem as well ...

    create table test (f int2);

    select * from test where f=cast('1981928928921' as int2);
    ERROR:  pg_atoi: error reading "1981928928921": Numerical result out of range

    select * from test where f=1981928928921;
     f
    ---
    (0 rows)
Dave Smith <dave.smith@candata.com> writes:
> If this is only dealing with constants, why not just explicitly add a
> cast to the constant of the column type at the planner level? It would
> solve this problem as well ...
>
> create table test (f int2);
> select * from test where f=cast('1981928928921' as int2);
> ERROR:  pg_atoi: error reading "1981928928921": Numerical result out of range
>
> select * from test where f=1981928928921;
>  f
> ---
> (0 rows)

Uh, your own example points out why we can't just insert a cast to int2 --- it changes the results. Or am I missing your point?

(In any case, we aren't necessarily dealing with constants.)

regards, tom lane
My point was that it was inconsistent behaviour. What exactly are you comparing with int2? To me, the case without the cast should throw the same error as the statement with the cast.
Dave Smith <dave.smith@candata.com> writes:
> My point was that it was inconsistent behaviour. What exactly are you
> comparing with int2? To me, the case without the cast should throw
> the same error as the statement with the cast.
>
> select * from test where f=1981928928921;

I contend not. The above is perfectly well defined. It will always return false if f is int2, but that does not mean it should throw an error instead. In any standard programming language, you'd resolve the operator by up-converting f to the type of the constant, not by trying to down-convert the wider value. PG happens to have implementation reasons to wish to use the variable's datatype instead of the constant's, but that doesn't excuse us from obeying the ordinary laws of arithmetic.

regards, tom lane
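Spelling out the up-conversion Tom describes against the test table from Dave's example (a sketch of the semantic point only, using present-day syntax; not captured output from the time):

    -- Widening the int2 column to the constant's type preserves the ordinary
    -- arithmetic meaning: the comparison is simply false for every possible
    -- int2 value, and no error can occur.
    SELECT * FROM test WHERE f::int8 = 1981928928921;   -- returns no rows

    -- Narrowing the constant instead changes the semantics: the cast itself
    -- fails because the value does not fit in int2 (exact message varies).
    SELECT * FROM test WHERE f = CAST(1981928928921 AS int2);   -- error: out of range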
Tom Lane wrote:
> Dave Smith <dave.smith@candata.com> writes:
> > If this is only dealing with constants, why not just explicitly add a
> > cast to the constant of the column type at the planner level? It would
> > solve this problem as well ...
> >
> > create table test (f int2);
> > select * from test where f=cast('1981928928921' as int2);
> > ERROR:  pg_atoi: error reading "1981928928921": Numerical result out of range
> >
> > select * from test where f=1981928928921;
> >  f
> > ---
> > (0 rows)
>
> Uh, your own example points out why we can't just insert a cast to int2
> --- it changes the results. Or am I missing your point?

Because PG so far doesn't have the functionality to catch exceptions, we can't call cast functions tentatively. But, for example, why should the pg_atoi error cause an unrecoverable error immediately? IMHO we should be able to call cast functions internally without the risk of causing an unrecoverable error.

regards,
Hiroshi Inoue
    http://www.geocities.jp/inocchichichi/psqlodbc/
On Tue, 16 Sep 2003, Tom Lane wrote:

> This is not a 100% solution to our problems. I don't think we could use
> it to solve the problem for int2 columns ("int2col = 42") because it'd
> be unsafe to promise that an int4-to-int2 cast could be inserted into
> an expression without changing the behavior.
>
> Comments? Does this spur any better ideas?

I still think that this is a type inference problem and nothing else. The problem with an expression like int2col = 42 is that the parser tries directly to put a type on the constant 42 while it shouldn't. 42 is not an int4 value, and it's also not an int8 value. By introducing a type variable and saying that 42 :: alpha, together with a constraint saying "Int alpha", you can after parsing look at all your type variables and try to assign real types to them in a way that is consistent (the type inference phase).

Another way is to say that 42 :: int2 and then have some subtyping rules saying that int2 < int4 and int4 < int8, so the type system can insert coercions in order to make a consistent typing of the term. Things to look at are Hindley-Milner type inference and subtyping.

What I'm afraid of is that there are more and more ad hoc solutions added to postgresql but no overall strategy. For example, in 7.4 there is polymorphism, which is something that Hindley-Milner type systems have always been good at, but I don't know exactly what pg uses and how it checks types. I'm not saying that there is anything wrong with polymorphism as added; I don't know enough about it to say anything about it.

Another thing I don't understand is why '42' works but not 42. Is it the case that you can always write '42' in places where pg expects an integer? That suggests to me that '42' is treated as something with an unknown type (let's call it alpha), where the type is inferred later on depending on how it is used (alpha is assigned to int2 or something). Sounds very much like what I talk about above. Why is 42 not treated in the same way? If someone wants 42 to be an integer, to for example select a specific overloaded function, they would have to write 42::integer instead (or some defaulting could be added).

I feel bad talking about the pg type system without knowing all the details about it. But I do know a little about type systems in general, so I still want to share some thoughts. I wrote a similar message 6 months ago or something; if you get a deja vu feeling, you know why :-)

I wish I had lots of money so I could take 6 months off and just spend them on pg. Then I would start by writing down how the current type system works and try to create a new, nice type system that works without a lot of ad hoc solutions. Unfortunately there is a reality where you have to pay the rent and so on...

-- 
/Dennis
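Dennis's '42'-versus-42 observation can be made concrete. Assuming the behavior described earlier in this thread (an indexed int2 column on a 7.3/7.4-era server; the table and index names here are invented, and the plans in the comments reflect the thread's description, not captured output):

    CREATE TABLE t2 (int2col int2);
    CREATE INDEX t2_int2col_idx ON t2 (int2col);

    EXPLAIN SELECT * FROM t2 WHERE int2col = '42';  -- unknown-type literal resolved to int2: index scan possible
    EXPLAIN SELECT * FROM t2 WHERE int2col = 42;    -- literal typed int4 at parse time: cross-type '=', no index use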
Dennis Bjorklund <db@zigo.dhs.org> writes:
> I still think that this is a type inference problem and nothing else.

Well, perhaps ripping out the entire type resolution algorithm and replacing it with something less ad-hoc is indeed the right long-term answer. I don't have time to do that though. Do you? It seems like a fairly large project with possibly zero payback, if we find that too many behavioral incompatibilities are introduced to make it adoptable.

> Another thing I don't understand is why '42' works but not 42.

Because in the '42' case the determination of the constant's type is indeed postponed ... though no doubt in a way that's too ad-hoc ;-)

Another thing to keep in mind is that it's not always the case that assigning the right type to a literal constant would solve the problem. We have the same issues with variables; for example, a join with "WHERE a.int8col = b.int4col" may fail to take advantage of available indexes.

Something else that's been troubling me about the issue is that the most efficient operator to use is really context-dependent. In the example of "WHERE a.int8col = b.int4col", we basically have two possible choices; use a cross-type operator (int8 = int4), or introduce a coercion, so that the clause becomes "WHERE a.int8col = b.int4col::int8" using an int8 = int8 operator. The latter is a win when considering indexscans on the int8 column, and also a win for hash joins (cross-type operators in general aren't hashable). *But* the former is a win for merge joins, and particularly for implied equality deduction. If we force the coercion-based representation to be used then the planner will be able to deduce that a.int8col and b.int4col::int8 are equal, but it would have to make a leap of faith to relate that knowledge to the bare b.int4col column. This would eg. prevent the use of a mergejoin with use of an index on the int4 column.

So I'm beginning to think that avoiding cross-type operators is not the right route to a solution anyway. It may be better to leave the parser alone and teach the planner how to switch to the alternate representation when and where appropriate. (The connection to hash joins suggests that tying this closely to index opclasses is wrong, though.)

regards, tom lane
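Written out as SQL against hypothetical tables a(int8col int8) and b(int4col int4), the two representations Tom contrasts look like this:

    -- Cross-type operator form: good for merge joins and for implied-equality
    -- deduction (the b.int4col = 42 clause lets the planner infer a.int8col = 42).
    SELECT * FROM a, b WHERE a.int8col = b.int4col AND b.int4col = 42;

    -- Coercion form: the int8 = int8 operator is directly indexable on
    -- a.int8col and hashable, but the derived equality now involves
    -- b.int4col::int8 rather than the bare b.int4col.
    SELECT * FROM a, b WHERE a.int8col = b.int4col::int8 AND b.int4col = 42;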
On Wed, 17 Sep 2003, Tom Lane wrote:

> Another thing to keep in mind is that it's not always the case that
> assigning the right type to a literal constant would solve the problem.
> We have the same issues with variables; for example, a join with
> "WHERE a.int8col = b.int4col" may fail to take advantage of available
> indexes.

Sure, which is why subtyping is needed to insert the correct coercions (relations like int4 < int8 and such): generate all constraints, and then have a planner/inference phase at the end that makes the choices. By not forcing anything too early, you give the inference/planner the possibility to make good choices (which might depend on what indexes are available and so on).

> Something else that's been troubling me about the issue is that the most
> efficient operator to use is really context-dependent. In the example
> of "WHERE a.int8col = b.int4col", we basically have two possible
> choices; use a cross-type operator (int8 = int4), or introduce a
> coercion, so that the clause becomes "WHERE a.int8col = b.int4col::int8"
> using an int8 = int8 operator. The latter is a win when considering
> indexscans on the int8 column, and also a win for hash joins (cross-type
> operators in general aren't hashable). *But* the former is a win for
> merge joins, and particularly for implied equality deduction.

Yes; by not making the decision directly but just setting constraints, you end up with a set of possible solutions, and then the planner has to choose among those.

> If we force the coercion-based representation to be used then the
> planner will be able to deduce that a.int8col and b.int4col::int8 are
> equal, but it would have to make a leap of faith to relate that
> knowledge to the bare b.int4col column. This would eg. prevent the use
> of a mergejoin with use of an index on the int4 column.

If I understand what you say correctly, then yes. If you have an inference phase before the planner phase that decides to insert a coercion from int4 to int8, then you have made a decision that forces you to use one plan over another. And that decision should be made by the planner, not by the type inference.

> So I'm beginning to think that avoiding cross-type operators is not the
> right route to a solution anyway. It may be better to leave the parser
> alone and teach the planner how to switch to the alternate
> representation when and where appropriate.

Yes, and the planner needs information about what alternative representations there are (which is the same as letting the planner insert coercions and selecting what = operator to use, or are you thinking of something else?).

Another problem, of course, is if one lets the planner do too much work and it has too many possible plans to choose from; it has to be fast.

-- 
/Dennis
Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Wed, 17 Sep 2003, Tom Lane wrote:
>> So I'm beginning to think that avoiding cross-type operators is not the
>> right route to a solution anyway. It may be better to leave the parser
>> alone and teach the planner how to switch to the alternate
>> representation when and where appropriate.

> Yes, and the planner needs information about what alternative
> representations there are (which is the same as letting the planner insert
> coercions and selecting what = operator to use, or are you thinking of
> something else?).

Right. My pg_amsecop catalog proposal of yesterday could be seen as providing knowledge about valid/useful transformations of this form (ignoring the linkage to index opclasses, which I now see to be possibly irrelevant).

> Another problem, of course, is if one lets the planner do too much work and
> it has too many possible plans to choose from; it has to be fast.

I don't think it's a big problem; this would add at most one catalog lookup in each situation where an indexed column is compared to something of a different type. And the plans involved would usually be ones we'd want to find, I think.

regards, tom lane
Tom Lane wrote:
> Dave Smith <dave.smith@candata.com> writes:
> > My point was that it was inconsistent behaviour. What exactly are you
> > comparing with int2? To me, the case without the cast should throw
> > the same error as the statement with the cast.
>
> > select * from test where f=1981928928921;
>
> I contend not. The above is perfectly well defined. It will always
> return false if f is int2, but that does not mean it should throw an
> error instead. In any standard programming language, you'd resolve
> the operator by up-converting f to the type of the constant, not by
> trying to down-convert the wider value. PG happens to have
> implementation reasons to wish to use the variable's datatype instead
> of the constant's, but that doesn't excuse us from obeying the ordinary
> laws of arithmetic.

Hmm... but what if the cast were to return NULL in the event that the cast fails or cannot be done? Would that even be reasonable? I don't know what the standard says about this, so my suggestion may be unreasonable (and it may break a lot of things as well). In a way, this would be consistent with the meaning of NULL: "no value", and would also yield the desired effect in the example select (no matches).

Of course, I could always be off my rocker here. :-)

-- 
Kevin Brown    kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> Hmm... but what if the cast were to return NULL in the event that the cast
> fails or cannot be done? Would that even be reasonable?

Yeah, I was wondering about that myself. I'd not want to try to use such an idea in general, but if we find that int2 indexes are the only sore spot in an otherwise-useful solution, some klugery for int2 might be the way to go. What I was visualizing was that for an int2 index, we might transform "int2col op int4-or-int8-comparison-value" into

    int2col int2op special_cast_fn(int4-or-int8-comparison-value)

where the trick is to make up a good special_cast_fn (possibly one specific to the comparison op being used).

Returning NULL might be an acceptable substitute when the cast function wants to force an always-false answer, but what about cases where it needs to force an always-true answer? For instance

    int2col < 1000000

should yield true always. There's no int2 value the cast function could output to make that happen. I thought maybe we could hack it by changing the operator to "<=" and introducing an offset of -1 in the cast function to compensate. I haven't worked out all the combinations though, and I'm not real sure that it's acceptable to substitute NULL for always-false cases. It'd work at the top level of WHERE but possibly not in other cases where indexscanning is desirable.

regards, tom lane
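A minimal sketch of what such a special_cast_fn could look like for the "<" case, combining the offset-of-minus-one trick with the NULL idea. The function name, its implementation as a SQL-level function, and the present-day syntax are all assumptions made only for illustration:

    -- Hypothetical helper: rewrites "int2col < int4value" as
    -- "int2col <= special_cast_lt(int4value)". NULL encodes the
    -- always-false case (fine at the top level of WHERE, as noted
    -- above, but not necessarily elsewhere).
    CREATE FUNCTION special_cast_lt(v int4) RETURNS int2 AS $$
        SELECT CASE
            WHEN v > 32767   THEN 32767::int2   -- "< huge value": always true
            WHEN v <= -32768 THEN NULL::int2    -- "< tiny value": never true
            ELSE (v - 1)::int2                  -- in range: "< v" becomes "<= v - 1"
        END;
    $$ LANGUAGE sql IMMUTABLE;

    -- Illustrative rewrites:
    --   int2col < 1000000   =>  int2col <= 32767   (always true)
    --   int2col < 100       =>  int2col <= 99
    --   int2col < -5000000  =>  int2col <= NULL    (never true)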
For the "int2col op value" case, I have this table for when the cast returns NULL:

    value < 0 (out of range below):
        <, <=, =    =>  int2col = null
        >, >=       =>  int2col is not null

    value > 0 (out of range above):
        <, <=       =>  int2col is not null
        =, >, >=    =>  int2col = null

I'm not sure why pg allows me to do int2col = null and returns nothing, so I am assuming that internally pg just resolves this to false.
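For reference, the reason "int2col = null" returns nothing is that the comparison yields NULL rather than false, and WHERE only keeps rows whose condition is true; a quick check of the standard SQL behavior:

    -- "= NULL" yields NULL, not false ...
    SELECT 1 = NULL;                    -- result: NULL
    -- ... and WHERE discards rows whose condition is not true,
    -- so the net effect is the same as false here.
    SELECT * FROM test WHERE f = NULL;  -- no rows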