Thread: TODO: Fix CREATE CAST on DOMAINs

TODO: Fix CREATE CAST on DOMAINs

From

Gevik Babakhani

Date:

20 September 2006, 13:08:11

I would like to work on the domain casting problem. I have spent
sometime in order to understand how this whole domain handling works
when it comes to casting and I think I understand why this cannot be
fixed in isolation as Tom has described in:

http://archives.postgresql.org/pgsql-hackers/2006-05/msg00190.php

Perhaps I am way off but I am starting to think we need to handle
domains in more organized and consistent way in order to avoid bugs like
this.


First I would like to know how PG's code looked like without the
domains. I went searching in the release notes and I found the following
regarding the domains:

7.3     : Add domain support (Rod)
7.3.3    : Fix planner's selectivity estimation functions to handle domains
properly
7.4    : Add check constraints for domains (Rod): Improve automatic type casting for domains (Rod, Tom)
7.4.12  : Properly check DOMAIN constraints for UNKNOWN parameters in
prepared statements (Neil)
8.0.1    : Make ALTER TABLE ADD COLUMN enforce domain constraints in all
cases
8.0.7    : Properly check DOMAIN constraints for UNKNOWN parameters in
prepared statements (Neil)
8.1.4    : Properly check DOMAIN constraints for UNKNOWN parameters in
prepared statements (Neil)


I need to go back and see where and how the domains are handled in a
global sense. Then I hope I can gather enough information to be able to
submit a coherent proposal.

If you have any thoughts you would like to share about this, please let
me know.

Regards,
Gevik.

Re: TODO: Fix CREATE CAST on DOMAINs

From

Tom Lane

Date:

20 September 2006, 14:56:21

Gevik Babakhani <pgdev@xs4all.nl> writes:
> First I would like to know how PG's code looked like without the
> domains.

IIRC, as far as the datatype coercion and operator/function resolution
code were concerned, the domain patch basically consisted of dropping
getBaseType() calls in at a bunch of choke points, so that these
routines all treat a domain the same as its base type.

I was never real happy about this, mainly because it adds extra syscache
lookups that buy you nothing when you're not using domains.

What would be nice is to revert all that, and instead have domain
creation insert explicit cast entries between a domain and its base type
into pg_cast, so that domains don't require any special-case code in
this part of the system.  The problem with that glib answer is that
maintaining anything like the current behavior would seem to require
that we abandon the current principle that we consider only one-step
cast pathways when trying to match arguments to functions.  (For
instance, a varchar value can be passed to a text-taking function
because varchar->text is an implicit cast according to pg_cast.  What
of a domain over varchar?)  And allowing multi-step casts to be chosen
automatically is unbelievably scary --- I think it would basically
destroy the system's ability to resolve overloaded functions at all,
because you can get from any type to any other if you allow enough cast
steps.  (The unreasonably large number of implicit casts to text aren't
helping any here :-()

So the hard part of this doesn't really require any understanding of
code at all.  What we need is a proposal for an algorithm that loosens
the casting rules "just enough" to make explicit pg_cast entries for
domains work the way we would like them to, without wholesale breakage
of situations that have nothing to do with domains.  See
http://developer.postgresql.org/pgdocs/postgres/typeconv.html

Rereading what I just wrote, it might be as simple as allowing a
two-step cast in certain cases, only if the first step is a domain to
base type coercion (which we assume would be specially marked in
pg_cast).  But the devil is in the details ... and anyway there might
be a cleaner approach than that.
        regards, tom lane

Re: TODO: Fix CREATE CAST on DOMAINs

From

Andrew Dunstan

Date:

20 September 2006, 15:03:30

Tom Lane wrote:
> So the hard part of this doesn't really require any understanding of
> code at all.  What we need is a proposal for an algorithm that loosens
> the casting rules "just enough" to make explicit pg_cast entries for
> domains work the way we would like them to, without wholesale breakage
> of situations that have nothing to do with domains.  See
> http://developer.postgresql.org/pgdocs/postgres/typeconv.html
>
> Rereading what I just wrote, it might be as simple as allowing a
> two-step cast in certain cases, only if the first step is a domain to
> base type coercion (which we assume would be specially marked in
> pg_cast).  

FWIW, before I got to this paragraph that was the thought that 
immediately occurred to me.

cheers

andrew

Re: TODO: Fix CREATE CAST on DOMAINs

From

Mark Dilger

Date:

20 September 2006, 16:32:08

Tom Lane wrote:
> Rereading what I just wrote, it might be as simple as allowing a
> two-step cast in certain cases, only if the first step is a domain to
> base type coercion (which we assume would be specially marked in
> pg_cast).  But the devil is in the details ... and anyway there might
> be a cleaner approach than that.

ISTM casts from a domain to their base type are fundamentally different from 
casts between types.  In general, casting TYPE_X to TYPE_Y requires malloc'ing 
memory for TYPE_Y, and converting the data of TYPE_X into TYPE_Y, possibly with 
loss of accuracy or correctness, etc.  (4-byte or less types are handled on the 
stack, not the heap, but that seems irrelevant to me and I'm only mentioning it 
here to head off any replies along those lines.)  Certainly, having the system 
chain together lots of implicit casts of this sort is scary.  But casting a 
domain to its base type never involves loss of accuracy or correctness, right? 
(Casting from the base type to the domain might not work, on account of the 
domain restrictions forbidding the particular value stored in the base.)

Perhaps we need to be able to register casts with more information than just 
IMPLICIT vs. EXPLICIT.  Perhaps we also need something like SAFE or some other 
term, and then have a rule that no chain of casts chosen by the system (as 
opposed to specified by the user) can contain more than one IMPLICIT cast, but 
can contain unlimited many SAFE casts.

When a domain is created, a SAFE cast from the domain to its base type could 
automatically be generated.

Casts between the existing varchar(n) to text could be marked as SAFE, given 
that the underlying storage scheme for varchar(n) is the same as text.  (Casts 
from text to varchar(n) are not SAFE, because the text might be too long to fit.)

Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be SAFE, I 
think, because they are not lossy.  But perhaps I have not thought enough about 
this and these should be IMPLICIT rather than SAFE.

Casts from non-text types to text would remain IMPLICIT, I expect.

If a user created their own type, such as the recent discussion of an int3 type,  they could also create an int3 ->
int4cast marked as SAFE, and from int2 -> 

int3 marked as SAFE, and from int3 -> int2 marked as EXPLICIT, and from int4 -> 
int3 marked as EXPLICIT, and could avoid writing all the casts to other integral 
types.

(I've pretty much abandoned the idea of an int3 type because my testing 
convinced me there were no performance advantages to it.  But it serves ok as an  example.)

mark

Re: TODO: Fix CREATE CAST on DOMAINs

From

Mark Dilger

Date:

20 September 2006, 17:27:19

Mark Dilger wrote:
> Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be 
> SAFE, I think, because they are not lossy.  But perhaps I have not 
> thought enough about this and these should be IMPLICIT rather than SAFE.

I have thought about this some more.  I think these are indeed SAFE.  The 
distinction between SAFE and IMPLICIT should not, I think, be whether the 
storage type is identical, but rather whether there is any possible loss of 
precision, range, accuracy, etc., or whether there is any change in the 
fundamental interpretation of the data when cast from the source to destination 
type.

The built-in cast from int2 -> int4, which is currently IMPLICIT, cannot lose 
any information, nor can it cause the data to be interpreted differently. 
Therefore it is SAFE.

The built-in cast from int8 -> float8 is currently marked as IMPLICIT, but since  a large integer value which is cast
inthis fashion might be somewhat altered, 

it is not SAFE.  It is also interpreted differently, since floating point 
numbers are typically interpreted as approximations, whereas integers are 
interpreted as exact.  (Hence the tendency to compare integers for equality, but 
not floats.)

Can anyone think of examples where chaining together SAFE casts would cause 
problems, using the guidelines for what is SAFE outlined above?

mark

Re: TODO: Fix CREATE CAST on DOMAINs

From

Tom Lane

Date:

20 September 2006, 17:32:37

Mark Dilger <pgsql@markdilger.com> writes:
> Mark Dilger wrote:
>> Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be 
>> SAFE, I think, because they are not lossy.  But perhaps I have not 
>> thought enough about this and these should be IMPLICIT rather than SAFE.

> I have thought about this some more.  I think these are indeed SAFE.  The 
> distinction between SAFE and IMPLICIT should not, I think, be whether the 
> storage type is identical, but rather whether there is any possible loss of 
> precision, range, accuracy, etc., or whether there is any change in the 
> fundamental interpretation of the data when cast from the source to destination 
> type.

You are going in exactly the wrong direction --- this line of thought is
aiming to make *more* casts possible by default, which is not what we
need, at least not among the collection of base types.
        regards, tom lane

Re: TODO: Fix CREATE CAST on DOMAINs

From

Mark Dilger

Date:

20 September 2006, 17:56:24

Tom Lane wrote:
> Mark Dilger <pgsql@markdilger.com> writes:
>> Mark Dilger wrote:
>>> Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be 
>>> SAFE, I think, because they are not lossy.  But perhaps I have not 
>>> thought enough about this and these should be IMPLICIT rather than SAFE.
> 
>> I have thought about this some more.  I think these are indeed SAFE.  The 
>> distinction between SAFE and IMPLICIT should not, I think, be whether the 
>> storage type is identical, but rather whether there is any possible loss of 
>> precision, range, accuracy, etc., or whether there is any change in the 
>> fundamental interpretation of the data when cast from the source to destination 
>> type.
> 
> You are going in exactly the wrong direction --- this line of thought is
> aiming to make *more* casts possible by default, which is not what we
> need, at least not among the collection of base types.
> 

If I understand correctly, you are worried about two issues:  ambiguity and 
performance.  You don't want the system to be slower from the extra searching 
needed to find possible multiple step casts, and you don't want any new 
ambiguity where the system can't deterministically decide which choice of 
cast(s) should be used.  Is that right?

If the system chooses cast chains based on a breadth-first search, then the 
existing int2 -> int8 cast would be chosen over an int2 -> int4 -> int8 chain, 
or an int2 -> int3 -> int4 -> int8 chain, or in fact any chain at all, because 
the int2 -> int8 cast is the shortest.

So the code to search chains should only be invoked in what would currently be 
an *error condition*, that being that the SQL includes a request for a cast that 
cannot be resolved without chaining.

Since the chaining code would be new, and the rules for it would be new, we can 
still design them however we like (within reason.)  I would propose:

1) Shorter chains trump longer chains.

2) When comparing two equal length chains, one made entirely of SAFE casts 
trumps one which contains an IMPLICIT cast.

3) When two or more chains remain that cannot be resolved under the above two 
rules, the SQL is considered ambiguous and an error condition is raised.

I don't see how this would break any existing valid SQL.  But it seems like it 
would solve both the DOMAIN problem you mentioned and the oft lamented problem 
that adding a new datatype requires quadratically many casts to the system.

mark

Re: TODO: Fix CREATE CAST on DOMAINs

From

Tom Lane

Date:

20 September 2006, 20:09:44

Mark Dilger <pgsql@markdilger.com> writes:
> If the system chooses cast chains based on a breadth-first search,
> then the existing int2 -> int8 cast would be chosen over an int2 ->
> int4 -> int8 chain, or an int2 -> int3 -> int4 -> int8 chain, or in
> fact any chain at all, because the int2 -> int8 cast is the shortest.

Well, this is the sort of thing that has to be thought about pretty
carefully.  Is "length of chain" the most appropriate metric?  What are
you going to do when resolving a multi-input operator or function, and
there are different ways to match different candidates with different
sets of path lengths?

There's been some prior discussion of attaching a "measure of goodness"
to different potential cast pathways.  I'm too lazy to look it up at the
moment but I strongly suggest whoever wants to produce a concrete
proposal in this area should go check the archives.
        regards, tom lane

Re: TODO: Fix CREATE CAST on DOMAINs

From

Martijn van Oosterhout

Date:

20 September 2006, 20:14:57

On Wed, Sep 20, 2006 at 09:31:48AM -0700, Mark Dilger wrote:
> Perhaps we need to be able to register casts with more information than
> just IMPLICIT vs. EXPLICIT.  Perhaps we also need something like SAFE or
> some other term, and then have a rule that no chain of casts chosen by the
> system (as opposed to specified by the user) can contain more than one
> IMPLICIT cast, but can contain unlimited many SAFE casts.

Currently cast have three types, implicit, explicit and assignment.

You don't want to have an unlimited number of anything. Cast lookups
are expensive enough as it is.

> Casts from int2 -> int4, int2 -> int8, and int4 -> int8 would all be SAFE,
> I think, because they are not lossy.  But perhaps I have not thought enough
> about this and these should be IMPLICIT rather than SAFE.

Yeah, but the trick is you want, with the above casts to only ever
produce the *shortest* path. That's what makes it expensive.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: TODO: Fix CREATE CAST on DOMAINs

From

Martijn van Oosterhout

Date:

20 September 2006, 20:48:52

On Wed, Sep 20, 2006 at 10:26:55AM -0700, Mark Dilger wrote:
> I have thought about this some more.  I think these are indeed SAFE.  The
> distinction between SAFE and IMPLICIT should not, I think, be whether the
> storage type is identical, but rather whether there is any possible loss of
> precision, range, accuracy, etc., or whether there is any change in the
> fundamental interpretation of the data when cast from the source to
> destination type.

My question is whether there should be any implicit casts that are not
safe. Your example int8 -> float8 being implicit is I think an error
and we should wonder why that cast implicit now anyway.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: TODO: Fix CREATE CAST on DOMAINs

From

Tom Lane

Date:

20 September 2006, 21:02:16

Martijn van Oosterhout <kleptog@svana.org> writes:
> My question is whether there should be any implicit casts that are not
> safe. Your example int8 -> float8 being implicit is I think an error
> and we should wonder why that cast implicit now anyway.

Because the SQL spec requires it.  You are not required to write a cast
to add an exact and an approximate quantity, and the spec says the
result is approximate.

Trying to design this stuff purely according to abstract notions of
elegance of the cast rules isn't going to work out well --- we have
both spec requirements and backwards compatibility to worry about.

Now we do have the flexibility to alter the default contents of pg_cast
--- there could be more or fewer entries in there than there are now,
if the type coercion rules are altered to do less or more automatically
than they do now.  But the end-result behavior needs to wind up being
pretty darn near the same thing, at least within the numeric type
category (I'm not as certain that we have the other ones right, but the
numeric category has been *very* heavily scrutinized and beat upon).
The only thing I really want to see changing is the behavior for domain
types --- and even there, the "default" behavior when there are no
user-created domain-specific operators or casts has to stay the same.
        regards, tom lane

Re: TODO: Fix CREATE CAST on DOMAINs

From

Mark Dilger

Date:

20 September 2006, 22:36:20

Tom Lane wrote:
> Now we do have the flexibility to alter the default contents of pg_cast
> --- there could be more or fewer entries in there than there are now,
> if the type coercion rules are altered to do less or more automatically
> than they do now.  But the end-result behavior needs to wind up being
> pretty darn near the same thing, at least within the numeric type
> category (I'm not as certain that we have the other ones right, but the
> numeric category has been *very* heavily scrutinized and beat upon).
> The only thing I really want to see changing is the behavior for domain
> types --- and even there, the "default" behavior when there are no
> user-created domain-specific operators or casts has to stay the same.

Your suggestion upthread that domains have two-step casts (from domain to base, 
then from base to whatever) is what got my attention.  I don't like the idea of 
having an interim solution to that subset of the problem if it might get in the 
way of solving the general problem later.  But perhaps it can be argued that no 
cruftiness would result from the special case code for casting domains to their 
base types?

mark

Re: TODO: Fix CREATE CAST on DOMAINs

From

Gevik Babakhani

Date:

20 September 2006, 23:08:23

> Trying to design this stuff purely according to abstract notions of
> elegance of the cast rules isn't going to work out well --- we have
> both spec requirements and backwards compatibility to worry about.
> 
> Now we do have the flexibility to alter the default contents of pg_cast
> --- there could be more or fewer entries in there than there are now,
> if the type coercion rules are altered to do less or more automatically
> than they do now.  But the end-result behavior needs to wind up being
> pretty darn near the same thing, at least within the numeric type
> category (I'm not as certain that we have the other ones right, but the
> numeric category has been *very* heavily scrutinized and beat upon).
> The only thing I really want to see changing is the behavior for domain
> types --- and even there, the "default" behavior when there are no
> user-created domain-specific operators or casts has to stay the same.

Trying to solve this problem requires more investigation having spec
requirements and backwards compatibility etc.. etc.. in mind.

After reading the thread, I think there are some interesting
similarities, ideas (or even techniques) used in OO languages like JAVA
and C# regarding internal handling when type boxing and type casting. (I
would like to think domains as inherited classes of their super or the
base class.)

I will come back with more thoughts after I have investigated a thing or
two.  Say tuned....

Re: TODO: Fix CREATE CAST on DOMAINs

From

Martijn van Oosterhout

Date:

21 September 2006, 07:06:56

On Wed, Sep 20, 2006 at 10:56:08AM -0700, Mark Dilger wrote:
> If the system chooses cast chains based on a breadth-first search, then the
> existing int2 -> int8 cast would be chosen over an int2 -> int4 -> int8
> chain, or an int2 -> int3 -> int4 -> int8 chain, or in fact any chain at
> all, because the int2 -> int8 cast is the shortest.

But we're not talking about a search here, we don't always know where
the endpoint is. Imagine you have the following three functions:

abs(int8)
abs(float4)
abs(numeric)

And you have an int2. Which is the best cast to use? What's the answer
if you have a float8? What if it's an unknown type text string?

Now, consider that functions can have up to 32 arguments and that this
resolution might have to be applied to each argument and you find that
searching is going to get very expensive very quickly.

The current system of requiring only a single step is at least
predictable. If you have the choice between:

- first argument matches, second needs three "safe" conversions, and
- first argument need one "unsafe" conversion, second matches exactly

Which is cheaper?

To make this manageable you have to keep the number of types you can
cast to small, or you'll get lost in the possibilites. Adding just a
single step domain to base type conversion seems pretty safe, but
anything more is going to be hard.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: TODO: Fix CREATE CAST on DOMAINs

From

Mark Dilger

Date:

21 September 2006, 16:38:11

Martijn van Oosterhout wrote:
> On Wed, Sep 20, 2006 at 10:56:08AM -0700, Mark Dilger wrote:
>> If the system chooses cast chains based on a breadth-first search, then the 
>> existing int2 -> int8 cast would be chosen over an int2 -> int4 -> int8 
>> chain, or an int2 -> int3 -> int4 -> int8 chain, or in fact any chain at 
>> all, because the int2 -> int8 cast is the shortest.
> 
> But we're not talking about a search here, we don't always know where
> the endpoint is. Imagine you have the following three functions:
> 
> abs(int8)
> abs(float4)
> abs(numeric)
> 
> And you have an int2. Which is the best cast to use? What's the answer
> if you have a float8? What if it's an unknown type text string?
> 
> Now, consider that functions can have up to 32 arguments and that this
> resolution might have to be applied to each argument and you find that
> searching is going to get very expensive very quickly.
> 
> The current system of requiring only a single step is at least
> predictable. If you have the choice between:
> 
> - first argument matches, second needs three "safe" conversions, and
> - first argument need one "unsafe" conversion, second matches exactly
> 
> Which is cheaper?
> 
> To make this manageable you have to keep the number of types you can
> cast to small, or you'll get lost in the possibilites. Adding just a
> single step domain to base type conversion seems pretty safe, but
> anything more is going to be hard.
> 
> Have a nice day,

The searching never needs to be done at runtime.  It should be computable at 
cast creation time.  A new cast creates a potential bridge between any two types 
in the system.  Using a shortest path algorithm, the best chain (if any exists) 
from one type to another can be computed and pre-compiled, right?

So, assume the following already exists:

Types A,B,C, fully connected with casts A->B, B->A, A->C, C->A, B->C, C->B, with 
some marked IMPLICIT, some marked EXPLICIT, and some marked SAFE.

Types X,Y,Z, also fully connected with casts, as above.

Then assume someone comes along and creates a new type M with conversions A->M, 
M->A, X->M, and M->X.  At the time that type and those casts are added to the 
system, the system could calculate any additional casts to/from B, C, Y, and Z.  A simple implementation (but maybe not
optimal)would be for the system to 

autogenerate code like:

CREATE FUNCTION cast_M_Y (arg M) RETURNS Y AS $$SELECT arg::X::Y;
$$ LANGUAGE SQL;
CREATE CAST (M AS Y) WITH FUNCTION cast_M_Y(M) [ AS ASSIGNMENT | AS IMPLICIT ]

And then load that function and cast.  The only real trick seems to be 
determining the rules for which cast chain gets used within that autogenerated 
function, and whether the generated cast is IMPLICIT, EXPLICIT, or ASSIGNMENT.

Looking over what I have just written, another idea pops up.  To avoid having 
the system decide which casts are reasonable, you could extend the syntax and 
allow an easy shorthand for the user.  Something like:

CREATE CAST (M AS A)WITH FUNCTION cast_M_AAS ASSIGNMENTPROPOGATES TO B AS ASSIGNMENT,PROPOGATES TO C AS ASSIGNMENT;

CREATE CAST (A AS M)WITH FUNCTION cast_A_MAS ASSIGNMENTPROPOGATES FROM B,PROPOGATES FROM C;

And then the casts from M->B, M->C, B->M, and C->M would all be added to the system.

Thoughts?

mark