Thread: TODO: Fix CREATE CAST on DOMAINs
I would like to work on the domain casting problem. I have spent sometime in order to understand how this whole domain handling works when it comes to casting and I think I understand why this cannot be fixed in isolation as Tom has described in: http://archives.postgresql.org/pgsql-hackers/2006-05/msg00190.php Perhaps I am way off but I am starting to think we need to handle domains in more organized and consistent way in order to avoid bugs like this. First I would like to know how PG's code looked like without the domains. I went searching in the release notes and I found the following regarding the domains: 7.3 : Add domain support (Rod) 7.3.3 : Fix planner's selectivity estimation functions to handle domains properly 7.4 : Add check constraints for domains (Rod): Improve automatic type casting for domains (Rod, Tom) 7.4.12 : Properly check DOMAIN constraints for UNKNOWN parameters in prepared statements (Neil) 8.0.1 : Make ALTER TABLE ADD COLUMN enforce domain constraints in all cases 8.0.7 : Properly check DOMAIN constraints for UNKNOWN parameters in prepared statements (Neil) 8.1.4 : Properly check DOMAIN constraints for UNKNOWN parameters in prepared statements (Neil) I need to go back and see where and how the domains are handled in a global sense. Then I hope I can gather enough information to be able to submit a coherent proposal. If you have any thoughts you would like to share about this, please let me know. Regards, Gevik.
Gevik Babakhani <pgdev@xs4all.nl> writes: > First I would like to know how PG's code looked like without the > domains. IIRC, as far as the datatype coercion and operator/function resolution code were concerned, the domain patch basically consisted of dropping getBaseType() calls in at a bunch of choke points, so that these routines all treat a domain the same as its base type. I was never real happy about this, mainly because it adds extra syscache lookups that buy you nothing when you're not using domains. What would be nice is to revert all that, and instead have domain creation insert explicit cast entries between a domain and its base type into pg_cast, so that domains don't require any special-case code in this part of the system. The problem with that glib answer is that maintaining anything like the current behavior would seem to require that we abandon the current principle that we consider only one-step cast pathways when trying to match arguments to functions. (For instance, a varchar value can be passed to a text-taking function because varchar->text is an implicit cast according to pg_cast. What of a domain over varchar?) And allowing multi-step casts to be chosen automatically is unbelievably scary --- I think it would basically destroy the system's ability to resolve overloaded functions at all, because you can get from any type to any other if you allow enough cast steps. (The unreasonably large number of implicit casts to text aren't helping any here :-() So the hard part of this doesn't really require any understanding of code at all. What we need is a proposal for an algorithm that loosens the casting rules "just enough" to make explicit pg_cast entries for domains work the way we would like them to, without wholesale breakage of situations that have nothing to do with domains. See http://developer.postgresql.org/pgdocs/postgres/typeconv.html Rereading what I just wrote, it might be as simple as allowing a two-step cast in certain cases, only if the first step is a domain to base type coercion (which we assume would be specially marked in pg_cast). But the devil is in the details ... and anyway there might be a cleaner approach than that. regards, tom lane
Tom Lane wrote: > So the hard part of this doesn't really require any understanding of > code at all. What we need is a proposal for an algorithm that loosens > the casting rules "just enough" to make explicit pg_cast entries for > domains work the way we would like them to, without wholesale breakage > of situations that have nothing to do with domains. See > http://developer.postgresql.org/pgdocs/postgres/typeconv.html > > Rereading what I just wrote, it might be as simple as allowing a > two-step cast in certain cases, only if the first step is a domain to > base type coercion (which we assume would be specially marked in > pg_cast). FWIW, before I got to this paragraph that was the thought that immediately occurred to me. cheers andrew
Tom Lane wrote: > Rereading what I just wrote, it might be as simple as allowing a > two-step cast in certain cases, only if the first step is a domain to > base type coercion (which we assume would be specially marked in > pg_cast). But the devil is in the details ... and anyway there might > be a cleaner approach than that. ISTM casts from a domain to their base type are fundamentally different from casts between types. In general, casting TYPE_X to TYPE_Y requires malloc'ing memory for TYPE_Y, and converting the data of TYPE_X into TYPE_Y, possibly with loss of accuracy or correctness, etc. (4-byte or less types are handled on the stack, not the heap, but that seems irrelevant to me and I'm only mentioning it here to head off any replies along those lines.) Certainly, having the system chain together lots of implicit casts of this sort is scary. But casting a domain to its base type never involves loss of accuracy or correctness, right? (Casting from the base type to the domain might not work, on account of the domain restrictions forbidding the particular value stored in the base.) Perhaps we need to be able to register casts with more information than just IMPLICIT vs. EXPLICIT. Perhaps we also need something like SAFE or some other term, and then have a rule that no chain of casts chosen by the system (as opposed to specified by the user) can contain more than one IMPLICIT cast, but can contain unlimited many SAFE casts. When a domain is created, a SAFE cast from the domain to its base type could automatically be generated. Casts between the existing varchar(n) to text could be marked as SAFE, given that the underlying storage scheme for varchar(n) is the same as text. (Casts from text to varchar(n) are not SAFE, because the text might be too long to fit.) Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be SAFE, I think, because they are not lossy. But perhaps I have not thought enough about this and these should be IMPLICIT rather than SAFE. Casts from non-text types to text would remain IMPLICIT, I expect. If a user created their own type, such as the recent discussion of an int3 type, they could also create an int3 -> int4cast marked as SAFE, and from int2 -> int3 marked as SAFE, and from int3 -> int2 marked as EXPLICIT, and from int4 -> int3 marked as EXPLICIT, and could avoid writing all the casts to other integral types. (I've pretty much abandoned the idea of an int3 type because my testing convinced me there were no performance advantages to it. But it serves ok as an example.) mark
Mark Dilger wrote: > Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be > SAFE, I think, because they are not lossy. But perhaps I have not > thought enough about this and these should be IMPLICIT rather than SAFE. I have thought about this some more. I think these are indeed SAFE. The distinction between SAFE and IMPLICIT should not, I think, be whether the storage type is identical, but rather whether there is any possible loss of precision, range, accuracy, etc., or whether there is any change in the fundamental interpretation of the data when cast from the source to destination type. The built-in cast from int2 -> int4, which is currently IMPLICIT, cannot lose any information, nor can it cause the data to be interpreted differently. Therefore it is SAFE. The built-in cast from int8 -> float8 is currently marked as IMPLICIT, but since a large integer value which is cast inthis fashion might be somewhat altered, it is not SAFE. It is also interpreted differently, since floating point numbers are typically interpreted as approximations, whereas integers are interpreted as exact. (Hence the tendency to compare integers for equality, but not floats.) Can anyone think of examples where chaining together SAFE casts would cause problems, using the guidelines for what is SAFE outlined above? mark
Mark Dilger <pgsql@markdilger.com> writes: > Mark Dilger wrote: >> Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be >> SAFE, I think, because they are not lossy. But perhaps I have not >> thought enough about this and these should be IMPLICIT rather than SAFE. > I have thought about this some more. I think these are indeed SAFE. The > distinction between SAFE and IMPLICIT should not, I think, be whether the > storage type is identical, but rather whether there is any possible loss of > precision, range, accuracy, etc., or whether there is any change in the > fundamental interpretation of the data when cast from the source to destination > type. You are going in exactly the wrong direction --- this line of thought is aiming to make *more* casts possible by default, which is not what we need, at least not among the collection of base types. regards, tom lane
Tom Lane wrote: > Mark Dilger <pgsql@markdilger.com> writes: >> Mark Dilger wrote: >>> Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be >>> SAFE, I think, because they are not lossy. But perhaps I have not >>> thought enough about this and these should be IMPLICIT rather than SAFE. > >> I have thought about this some more. I think these are indeed SAFE. The >> distinction between SAFE and IMPLICIT should not, I think, be whether the >> storage type is identical, but rather whether there is any possible loss of >> precision, range, accuracy, etc., or whether there is any change in the >> fundamental interpretation of the data when cast from the source to destination >> type. > > You are going in exactly the wrong direction --- this line of thought is > aiming to make *more* casts possible by default, which is not what we > need, at least not among the collection of base types. > If I understand correctly, you are worried about two issues: ambiguity and performance. You don't want the system to be slower from the extra searching needed to find possible multiple step casts, and you don't want any new ambiguity where the system can't deterministically decide which choice of cast(s) should be used. Is that right? If the system chooses cast chains based on a breadth-first search, then the existing int2 -> int8 cast would be chosen over an int2 -> int4 -> int8 chain, or an int2 -> int3 -> int4 -> int8 chain, or in fact any chain at all, because the int2 -> int8 cast is the shortest. So the code to search chains should only be invoked in what would currently be an *error condition*, that being that the SQL includes a request for a cast that cannot be resolved without chaining. Since the chaining code would be new, and the rules for it would be new, we can still design them however we like (within reason.) I would propose: 1) Shorter chains trump longer chains. 2) When comparing two equal length chains, one made entirely of SAFE casts trumps one which contains an IMPLICIT cast. 3) When two or more chains remain that cannot be resolved under the above two rules, the SQL is considered ambiguous and an error condition is raised. I don't see how this would break any existing valid SQL. But it seems like it would solve both the DOMAIN problem you mentioned and the oft lamented problem that adding a new datatype requires quadratically many casts to the system. mark
Mark Dilger <pgsql@markdilger.com> writes: > If the system chooses cast chains based on a breadth-first search, > then the existing int2 -> int8 cast would be chosen over an int2 -> > int4 -> int8 chain, or an int2 -> int3 -> int4 -> int8 chain, or in > fact any chain at all, because the int2 -> int8 cast is the shortest. Well, this is the sort of thing that has to be thought about pretty carefully. Is "length of chain" the most appropriate metric? What are you going to do when resolving a multi-input operator or function, and there are different ways to match different candidates with different sets of path lengths? There's been some prior discussion of attaching a "measure of goodness" to different potential cast pathways. I'm too lazy to look it up at the moment but I strongly suggest whoever wants to produce a concrete proposal in this area should go check the archives. regards, tom lane
On Wed, Sep 20, 2006 at 09:31:48AM -0700, Mark Dilger wrote: > Perhaps we need to be able to register casts with more information than > just IMPLICIT vs. EXPLICIT. Perhaps we also need something like SAFE or > some other term, and then have a rule that no chain of casts chosen by the > system (as opposed to specified by the user) can contain more than one > IMPLICIT cast, but can contain unlimited many SAFE casts. Currently cast have three types, implicit, explicit and assignment. You don't want to have an unlimited number of anything. Cast lookups are expensive enough as it is. > Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be SAFE, > I think, because they are not lossy. But perhaps I have not thought enough > about this and these should be IMPLICIT rather than SAFE. Yeah, but the trick is you want, with the above casts to only ever produce the *shortest* path. That's what makes it expensive. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Wed, Sep 20, 2006 at 10:26:55AM -0700, Mark Dilger wrote: > I have thought about this some more. I think these are indeed SAFE. The > distinction between SAFE and IMPLICIT should not, I think, be whether the > storage type is identical, but rather whether there is any possible loss of > precision, range, accuracy, etc., or whether there is any change in the > fundamental interpretation of the data when cast from the source to > destination type. My question is whether there should be any implicit casts that are not safe. Your example int8 -> float8 being implicit is I think an error and we should wonder why that cast implicit now anyway. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes: > My question is whether there should be any implicit casts that are not > safe. Your example int8 -> float8 being implicit is I think an error > and we should wonder why that cast implicit now anyway. Because the SQL spec requires it. You are not required to write a cast to add an exact and an approximate quantity, and the spec says the result is approximate. Trying to design this stuff purely according to abstract notions of elegance of the cast rules isn't going to work out well --- we have both spec requirements and backwards compatibility to worry about. Now we do have the flexibility to alter the default contents of pg_cast --- there could be more or fewer entries in there than there are now, if the type coercion rules are altered to do less or more automatically than they do now. But the end-result behavior needs to wind up being pretty darn near the same thing, at least within the numeric type category (I'm not as certain that we have the other ones right, but the numeric category has been *very* heavily scrutinized and beat upon). The only thing I really want to see changing is the behavior for domain types --- and even there, the "default" behavior when there are no user-created domain-specific operators or casts has to stay the same. regards, tom lane
Tom Lane wrote: > Now we do have the flexibility to alter the default contents of pg_cast > --- there could be more or fewer entries in there than there are now, > if the type coercion rules are altered to do less or more automatically > than they do now. But the end-result behavior needs to wind up being > pretty darn near the same thing, at least within the numeric type > category (I'm not as certain that we have the other ones right, but the > numeric category has been *very* heavily scrutinized and beat upon). > The only thing I really want to see changing is the behavior for domain > types --- and even there, the "default" behavior when there are no > user-created domain-specific operators or casts has to stay the same. Your suggestion upthread that domains have two-step casts (from domain to base, then from base to whatever) is what got my attention. I don't like the idea of having an interim solution to that subset of the problem if it might get in the way of solving the general problem later. But perhaps it can be argued that no cruftiness would result from the special case code for casting domains to their base types? mark
> Trying to design this stuff purely according to abstract notions of > elegance of the cast rules isn't going to work out well --- we have > both spec requirements and backwards compatibility to worry about. > > Now we do have the flexibility to alter the default contents of pg_cast > --- there could be more or fewer entries in there than there are now, > if the type coercion rules are altered to do less or more automatically > than they do now. But the end-result behavior needs to wind up being > pretty darn near the same thing, at least within the numeric type > category (I'm not as certain that we have the other ones right, but the > numeric category has been *very* heavily scrutinized and beat upon). > The only thing I really want to see changing is the behavior for domain > types --- and even there, the "default" behavior when there are no > user-created domain-specific operators or casts has to stay the same. Trying to solve this problem requires more investigation having spec requirements and backwards compatibility etc.. etc.. in mind. After reading the thread, I think there are some interesting similarities, ideas (or even techniques) used in OO languages like JAVA and C# regarding internal handling when type boxing and type casting. (I would like to think domains as inherited classes of their super or the base class.) I will come back with more thoughts after I have investigated a thing or two. Say tuned....
On Wed, Sep 20, 2006 at 10:56:08AM -0700, Mark Dilger wrote: > If the system chooses cast chains based on a breadth-first search, then the > existing int2 -> int8 cast would be chosen over an int2 -> int4 -> int8 > chain, or an int2 -> int3 -> int4 -> int8 chain, or in fact any chain at > all, because the int2 -> int8 cast is the shortest. But we're not talking about a search here, we don't always know where the endpoint is. Imagine you have the following three functions: abs(int8) abs(float4) abs(numeric) And you have an int2. Which is the best cast to use? What's the answer if you have a float8? What if it's an unknown type text string? Now, consider that functions can have up to 32 arguments and that this resolution might have to be applied to each argument and you find that searching is going to get very expensive very quickly. The current system of requiring only a single step is at least predictable. If you have the choice between: - first argument matches, second needs three "safe" conversions, and - first argument need one "unsafe" conversion, second matches exactly Which is cheaper? To make this manageable you have to keep the number of types you can cast to small, or you'll get lost in the possibilites. Adding just a single step domain to base type conversion seems pretty safe, but anything more is going to be hard. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote: > On Wed, Sep 20, 2006 at 10:56:08AM -0700, Mark Dilger wrote: >> If the system chooses cast chains based on a breadth-first search, then the >> existing int2 -> int8 cast would be chosen over an int2 -> int4 -> int8 >> chain, or an int2 -> int3 -> int4 -> int8 chain, or in fact any chain at >> all, because the int2 -> int8 cast is the shortest. > > But we're not talking about a search here, we don't always know where > the endpoint is. Imagine you have the following three functions: > > abs(int8) > abs(float4) > abs(numeric) > > And you have an int2. Which is the best cast to use? What's the answer > if you have a float8? What if it's an unknown type text string? > > Now, consider that functions can have up to 32 arguments and that this > resolution might have to be applied to each argument and you find that > searching is going to get very expensive very quickly. > > The current system of requiring only a single step is at least > predictable. If you have the choice between: > > - first argument matches, second needs three "safe" conversions, and > - first argument need one "unsafe" conversion, second matches exactly > > Which is cheaper? > > To make this manageable you have to keep the number of types you can > cast to small, or you'll get lost in the possibilites. Adding just a > single step domain to base type conversion seems pretty safe, but > anything more is going to be hard. > > Have a nice day, The searching never needs to be done at runtime. It should be computable at cast creation time. A new cast creates a potential bridge between any two types in the system. Using a shortest path algorithm, the best chain (if any exists) from one type to another can be computed and pre-compiled, right? So, assume the following already exists: Types A,B,C, fully connected with casts A->B, B->A, A->C, C->A, B->C, C->B, with some marked IMPLICIT, some marked EXPLICIT, and some marked SAFE. Types X,Y,Z, also fully connected with casts, as above. Then assume someone comes along and creates a new type M with conversions A->M, M->A, X->M, and M->X. At the time that type and those casts are added to the system, the system could calculate any additional casts to/from B, C, Y, and Z. A simple implementation (but maybe not optimal)would be for the system to autogenerate code like: CREATE FUNCTION cast_M_Y (arg M) RETURNS Y AS $$SELECT arg::X::Y; $$ LANGUAGE SQL; CREATE CAST (M AS Y) WITH FUNCTION cast_M_Y(M) [ AS ASSIGNMENT | AS IMPLICIT ] And then load that function and cast. The only real trick seems to be determining the rules for which cast chain gets used within that autogenerated function, and whether the generated cast is IMPLICIT, EXPLICIT, or ASSIGNMENT. Looking over what I have just written, another idea pops up. To avoid having the system decide which casts are reasonable, you could extend the syntax and allow an easy shorthand for the user. Something like: CREATE CAST (M AS A)WITH FUNCTION cast_M_AAS ASSIGNMENTPROPOGATES TO B AS ASSIGNMENT,PROPOGATES TO C AS ASSIGNMENT; CREATE CAST (A AS M)WITH FUNCTION cast_A_MAS ASSIGNMENTPROPOGATES FROM B,PROPOGATES FROM C; And then the casts from M->B, M->C, B->M, and C->M would all be added to the system. Thoughts? mark