Thread: Charset/collate support and function parameters

Charset/collate support and function parameters

From
Dennis Bjorklund
Date:
I have a long-term plan to implement charset support in pg, and now that I
have dropped the work on the timestamps, I've been looking into this
subject.

Today we store the max length of a string in the typmod field, but that
has to be extended so we also store the charset and the collation of the
string. That's simple, but we also need functions that take a string of a
specific charset and collation as input and produce such a string as a result.
Currently the only information we have about function arguments is the OID
of the type. The argument type OIDs are stored in an array in pg_proc, and I
suggest that instead of this array we have a table pg_parameters that is
much like

http://www.postgresql.org/docs/7.4/static/infoschema-parameters.html

Notice how there are a lot of columns describing the dynamic parts of a
type, like character_maximum_length, character_set_name,
datetime_precision. We would of course not store the name of a charset,
but the oid (and so on).

Most of these are NULL since they only apply to a specific type, but
that's okay since NULL values are stored in a bitmap so the row width will
still be small.
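
To make this concrete, here is a rough sketch of what such a catalog could
look like (the column names are only illustrative, not a proposal for the
final layout):

  -- sketch of a hypothetical pg_parameters catalog, one row per parameter,
  -- replacing the proargtypes array in pg_proc
  CREATE TABLE pg_parameters (
      proid                     oid   NOT NULL,  -- function the row belongs to
      ordinal_position          int2  NOT NULL,  -- 1 .. pronargs
      typid                     oid   NOT NULL,  -- base type, as proargtypes today
      character_maximum_length  int4,            -- NULL unless a character type
      character_set_id          oid,             -- NULL unless a character type
      collation_id              oid,             -- NULL unless a character type
      datetime_precision        int2             -- NULL unless a datetime type
  );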

Before one starts to work on charset/collation support, I think it would be
good if one can make the above change with just the old properties. As a
result we could write functions like

  foo (bar varchar(5))

We probably won't write functions like that very often, but as a first
step this is what we want.

Changing this is a lot of work, especially when one looks in pg_proc.h and
realizes that one needs to alter 3000 lines of

  DATA(insert OID = 2238 ( bit_and PGNSP PGUID 12 t f f f i 1 23 "23" _null_ aggregate_dummy - _null_));
  DESCR("bitwise-and integer aggregate");

into another form. The "23" should be pulled out and would become a row
in the pg_parameters table. Maybe a job for a script :-)
Sometimes I wish that (at least part of) the bootstrap was at a higher
level and that the above was just normal SQL statements:

  CREATE FUNCTION bit_and ( .... ) AS ...

In addition to the function arguments we also need to treat the function
return value in a similar way. The natural solution is to extend pg_proc
with many of the same columns as in the pg_parameters table. One could
also reuse the pg_parameters table and store a parameter with ordinal
number 0 to be the return value. But then there would be some columns that
do not apply to return values.
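
For example, with the ordinal-0 convention the bit_and aggregate shown above
could be described by rows roughly like this (column names as in the sketch
earlier; they are made up):

  -- return value and argument of bit_and (OID 2238), both int4 (OID 23)
  INSERT INTO pg_parameters (proid, ordinal_position, typid)
      VALUES (2238, 0, 23),  -- ordinal 0: the return value
             (2238, 1, 23);  -- ordinal 1: the single argument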

My current plan is:

A) Implement a pg_parameters table and let everything else work as today.
   Also, the return values have to be taken care of in a similar way.

B) Change function overloading so we can have functions with the same
   name but different properties. For example, for strings that means
   different max lengths are used to resolve overloading.

C) Work on charset/collation.

All of these will probably not happen for 8.1 but I hope to finish A and
B. It all depends on how much trouble I run into and how much time I can
put into it. The function overload parts in pg are far from trivial, but I
will not worry about that until I get that far.

Any comments about this plan?

-- 
/Dennis Björklund



Re: Charset/collate support and function parameters

From
Tom Lane
Date:
Dennis Bjorklund <db@zigo.dhs.org> writes:
> Today we store the max length of a string in the typmod field, but that
> has to be extended so we also store the charset and the collation of the
> string.

Why would we not keep this information right in the string values?

> [ unworkable proposal snipped ]
        regards, tom lane


Re: Charset/collate support and function parameters

From
Dennis Bjorklund
Date:
On Sat, 30 Oct 2004, Tom Lane wrote:

> Why would we not keep this information right in the string values?

We could, but then we would need to parse it every time. Storing it in a
structured way seems like the database solution, and at least as a user
on the client side it makes sense.

Are you worried about performance or is it the smaller change that you
want?

I can't tell how much, if any, is gained in speed by having an array of
strings instead of another system table.

-- 
/Dennis Björklund



Re: Charset/collate support and function parameters

From
Tom Lane
Date:
Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Sat, 30 Oct 2004, Tom Lane wrote:
>> Why would we not keep this information right in the string values?

> We could, but then we would need to parse it every time.

Huh?  We'd store it in the most compact pre-parsed form we could think
of; probably some sort of index into a list of supported character sets
and collations.  (This is not so different from representing timezones
inside timestamptz values, instead of using a global setting.)

> Are you worried about performance or is it the smaller change that you
> want?

I'm worried about the fact that instead of, say, one length(text)
function, we would now have to have a different one for every
characterset/collation.  Not to mention one for every possible N in
varchar(N).  Making those properties part of a function's parameter
signature is unworkable on its face --- it'll cause an exponential
explosion in the number of pg_proc entries, and probably make it
impossible to resolve a unique candidate function in many situations.
        regards, tom lane


Re: Charset/collate support and function parameters

From
Dennis Bjorklund
Date:
On Sat, 30 Oct 2004, Tom Lane wrote:

> > Are you worried about performance or is it the smaller change that you
> > want?
> 
> I'm worried about the fact that instead of, say, one length(text)
> function, we would now have to have a different one for every
> characterset/collation.

This is not about how the parameter information is stored, but let's
discuss that anyway. These are important issues.

I was hoping that we could implement functions where one didn't have to 
specify the charset and collation (but could if we want to).

For some functions one really wants different ones depending on the
charset. For example the length function: we will need to calculate
the length differently for each charset. We can never have one length
function that works for every possible charset. We could have one pg
function that does N different things inside depending on the charset, but
that's not really a simplification.

For functions where one has not specified the charset of an argument, we
need to be able to pass on that type information to wherever we use
that argument. Variables already have a type, and if we have a (pseudo
code) function like

  foo (a varchar) returns int
  {
      select length(a);
  }

and call it with

  foo ('foo' charset latin1)

then we need to make sure that variable a inside the function body of foo
gets the type from the caller, and then the call to length(a) will
work out since it would select the length function for latin1. I think it
should work, but an implementation is the only way to know.

Every string does in the end need to know what charset and what collation it
is in. Otherwise it cannot be used for anything, not even to compare it
with another string.

I could even imagine having different functions for each
charset/collation. There are not that many built-in functions that are
affected, and not all of them need to work with every collation. The user
just needs to call them with the correct one. I don't expect any functions
like

  foo (a varchar collation sv_SE, b varchar collation en_US)

or any other combination of a and b. If anything, a and b will be the same
type. So there would not be arbitrarily many combinations (but still a lot).

The alternative is storing the charset and collation inside each string.
That seems like too big a price to pay; it belongs in the type.

> Not to mention one for every possible N in varchar(N).

This doesn't matter since one can always implement functions to take
varchar arguments without any limit, and then any shorter string can be
implicitly cast up to that type. Or one can treat the length exactly like
the charset above.

Of course you do not want one length function for each length.
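
To illustrate with today's behaviour (the function name is just made up for
the example), a single function declared over plain varchar already accepts
values of any declared length without an explicit cast:

  -- one function over unconstrained varchar serves every varchar(N) input
  CREATE FUNCTION howlong(varchar) RETURNS int
      AS 'SELECT length($1)' LANGUAGE SQL;

  SELECT howlong(x) FROM (SELECT 'hi'::varchar(5) AS x) AS t;  -- no cast needed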

-- 
/Dennis Björklund



Re: Charset/collate support and function parameters

From
Tom Lane
Date:
Dennis Bjorklund <db@zigo.dhs.org> writes:
> On Sat, 30 Oct 2004, Tom Lane wrote:
>> I'm worried about the fact that instead of, say, one length(text)
>> function, we would now have to have a different one for every
>> characterset/collation.

> For some functions one really wants different ones depending on the
> charset. For example the length function: we will need to calculate
> the length differently for each charset. We can never have one length
> function that works for every possible charset.

Strange ... we seem to have one now ...

> We could have one pg function that does N different things inside
> depending on the charset, but that's not really a simplification.

Yes it is.  It's about the same amount of actual code either way, and it
gets the job done without introducing huge headaches for the overloaded-
function resolution rules.  That stuff is complicated and fragile enough
without introducing issues like whether to match optional subtype
information.

Basically, you're talking about a substantial increase in complexity in
the type resolution rules, and breaking a great number of internal APIs
(because they don't pass the information that would need to be passed),
in order to have a slightly different layout of the code that will
actually implement character-set-specific operations.  I can't see that
that's a sensible use of development effort.

> The alternative is storing the charset and collation inside each string.
> That seems like too big a price to pay; it belongs in the type.

No, the alternative you're proposing is too big a price to pay.  It's
too much work for too little gain, and it's also too inflexible (I think
there will be cases where we don't want to nail down the character set
at datatype-resolution time).
        regards, tom lane


Re: Charset/collate support and function parameters

From
Dennis Bjorklund
Date:
On Sat, 30 Oct 2004, Tom Lane wrote:

> > The alternative is storing the charset and collation inside each string.
> > That seems like too big a price to pay; it belongs in the type.
> 
> No, the alternative you're proposing is too big a price to pay.

So you want to expand every string with 8 bytes (two oid's)?

Or special case the storing in tables so that you only store the text and
not the charset/collation info?

-- 
/Dennis Björklund



Re: Charset/collate support and function parameters

From
Tatsuo Ishii
Date:
> On Sat, 30 Oct 2004, Tom Lane wrote:
> 
> > > The alternative is storing the charset and collation inside each string.
> > > That seems like too big a price to pay; it belongs in the type.
> > 
> > No, the alternative you're proposing is too big a price to pay.
> 
> So you want to expand every string with 8 bytes (two oid's)?

For me that seems to be the right way. I'm not sure if two oids are
the right solution, but we need to store extra info in the varlena
structure to support charset/collation anyway. In my understanding
TOAST already does something similar.

Other than charset/collation we also need coercibility info to meet
the SQL standard. This can only be represented in each text value,
not by function parameters.

> Or special case the storing in tables so that you only store the text and
> not the charset/collation info?

In the on-disk representation we don't need charset/collation since they
will probably be stored in pg_attribute. Coercibility info is not necessary
on disk either.
--
Tatsuo Ishii


Re: Charset/collate support and function parameters

From
Dennis Bjorklund
Date:
On Sun, 31 Oct 2004, Tatsuo Ishii wrote:

> > So you want to expand every string with 8 bytes (two oid's)?
> 
> For me that seems to be the right way. I'm not sure if two oids are
> the right solution, but we need to store extra info in the varlena
> structure to support charset/collation anyway. In my understanding
> TOAST already does something similar.
> 
> Other than charset/collation we also need coercibility info to meet
> the SQL standard. This can only be represented in each text value,
> not by function parameters.

Aren't we limiting ourselves in how we can use charsets when we remove it
from the type? The reason why I started to look at the function parameters
is because in the standard one can do this:

  CREATE CHARACTER SET foo GET Latin1;

and now we could add functions that work with this charset

  CREATE FUNCTION bar (x VARCHAR(255) CHARACTER SET foo) ....

What we are saying is that we don't want to be able to do this? I just
want to understand all the implications of simplifying the types.

Same thing if the user wants to create new collations using CREATE
COLLATION. How can we override functions for these new charsets and
collations if all we can define are functions like foo(x VARCHAR)?

Maybe one wants the information in both places.

-- 
/Dennis Björklund



Re: Charset/collate support and function parameters

From
Tatsuo Ishii
Date:
> > For me that seems to be the right way. I'm not sure if two oids are
> > the right solution, but we need to store extra info in the varlena
> > structure to support charset/collation anyway. In my understanding
> > TOAST already does something similar.
> > 
> > Other than charset/collation we also need coercibility info to meet
> > the SQL standard. This can only be represented in each text value,
> > not by function parameters.
> 
> Aren't we limiting ourselves in how we can use charsets when we remove it
> from the type?

? That is news to me. In my understanding nobody removed charsets
from the type; the charset info has never been in the text type since
PostgreSQL was born.

> The reason why I started to look at the function parameters
> is because in the standard one can do this:
> 
>   CREATE CHARACTER SET foo GET Latin1;

Surely we want to add this to PostgreSQL.

> and now we could add functions that work with this charset
> 
>   CREATE FUNCTION bar (x VARCHAR(255) CHARACTER SET foo) ....
> 
> What we are saying is that we don't want to be able to do this?

Not sure we want to add above. Is it something defined in the
standard?

Though I think we want to add something like:

SELECT bar(x CHARACTER SET foo COLLATE buz) ....

>  I just 
> want to understand all the implications of simplifying the types.
> 
> Same thing if the user wants to create new collations using CREATE
> COLLATION. How can we override functions for these new charsets and
> collations if all we can define are functions like foo(x VARCHAR)?
> 
> Maybe one wants the information in both places.

I think we could make our text type richer so that it could include
charset, collate info and so on.

At this point my plan to implement CREATE CHARSET/COLLATE includes:

1) add charset/collation info to pg_attribute
2) make shared tables such as pg_database and pg_shadow able to handle
   multiple charsets. This is necessary because database names and
   user names could be represented in different charsets
3) create new system catalogs to keep defined charsets and collations
   (a rough sketch follows below)
4) make the text type's in-memory representation richer so that it can keep
   charset/collate/coercibility etc. info
5) enhance pg_am and friends
6) create our own locale database
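
For 3), the new catalogs might look roughly like this (only a sketch; the
names and columns are not decided at all):

  -- hypothetical catalogs, names and columns only for illustration
  CREATE TABLE pg_charset (
      csname       name NOT NULL,  -- e.g. 'latin1', 'utf8', 'sjis'
      csform       name,           -- encoding / form-of-use it is based on
      cslenproc    regproc         -- function that counts characters
  );

  CREATE TABLE pg_collation (
      collname     name NOT NULL,  -- e.g. 'sv_SE', 'en_US'
      collcharset  oid  NOT NULL,  -- charset this collation applies to
      collcmpproc  regproc         -- comparison function for this collation
  );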
--
Tatsuo Ishii


Re: Charset/collate support and function parameters

From
Dennis Bjorklund
Date:
On Sun, 31 Oct 2004, Tatsuo Ishii wrote:

> > and now we could add functions that work with this charset
> > 
> >   CREATE FUNCTION bar (x VARCHAR(255) CHARACTER SET foo) ....
> > 
> > What we are saying is that we don't want to be able to do this?
> 
> Not sure we want to add above. Is it something defined in the
> standard?

The syntax in sql2003 does allow it. The exact semantics are difficult to
get a clear picture of from the spec (as always).

I thought this question was the whole argument. We can't have something 
like the above with the pg overloading since then the resolving process 
will be too hard according to Tom.

> 2) make shared tables such as pg_database and pg_shadow able to handle
>    multiple charsets. This is necessary because database names and
>    user names could be represented in different charsets

Shouldn't we just define the charset for user names and database names?
Either one fixed or one that's set during initdb.  You don't mean that we
want different user names to be defined using different charsets?

The rest of the points looks good to me. The main problem is still what to
do with the function definitions as above. Is it something we want or not?

Is the charset something that makes two text types different or not?

-- 
/Dennis Björklund



Re: Charset/collate support and function parameters

From
Tatsuo Ishii
Date:
> On Sun, 31 Oct 2004, Tatsuo Ishii wrote:
> 
> > > and now we could add functions that work with this charset
> > > 
> > >   CREATE FUNCTION bar (x VARCHAR(255) CHARACTER SET foo) ....
> > > 
> > > What we are saying is that we don't want to be able to do this?
> > 
> > Not sure we want to add above. Is it something defined in the
> > standard?
> 
> The syntax in sql2003 does allow it. The exact semantics are difficult to
> get a clear picture of from the spec (as always).

I wonder what the intention is in allowing such syntax. It seems
it's just useless since we could make a function bar() which accepts
any charset.

> I thought this question was the whole argument. We can't have something 
> like the above with the pg overloading since then the resolving process 
> will be too hard according to Tom.
> 
> > 2) make shared tables such as pg_database and pg_shadow able to handle
> >    multiple charsets. This is necessary because database names and
> >    user names could be represented in different charsets
> 
> Shouldn't we just define the charset for user names and database names?
> Either one fixed or one that's set during initdb.  You don't mean that we
> want different user names to be defined using different charsets?

What I had in mind was:

CREATE DATABASE foo CHARSET bar;

If we allow this, we need 2).

> The rest of the points looks good to me. The main problem is still what to
> do with the function definitions as above. Is it something we want or not?
> 
> Is the charset something that makes two text types different or not?

I understand your point. If we are going to allow that, we probably
have to regard text types with different charsets or collations as
distinct from each other. This will lead to the
number-of-types-explosion problem which I really want to avoid...
--
Tatsuo Ishii


Re: Charset/collate support and function parameters

From
Dennis Bjorklund
Date:
On Sun, 31 Oct 2004, Tatsuo Ishii wrote:

> I wonder what the intention is in allowing such syntax. It seems
> it's just useless since we could make a function bar() which accepts
> any charset.

One could override the behaviour of functions by adding a charset and
adding a new definition of an old function name for that charset. Like
adding a new collation and defining a new cmp() function for that
collation that works differently than some old definitions of cmp().
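
As a sketch of what I mean (the syntax below is purely hypothetical, just
modelled on the spec, and the collation name is made up):

  -- hypothetical: register a new collation for an existing charset
  CREATE COLLATION nocase FOR CHARACTER SET latin1 FROM DEFAULT;

  -- a cmp() that applies only to the new collation and compares
  -- case-insensitively, unlike the ordinary cmp() for latin1
  CREATE FUNCTION cmp(a varchar COLLATE nocase, b varchar COLLATE nocase)
      RETURNS int AS
      'SELECT CASE WHEN lower($1) < lower($2) THEN -1
                   WHEN lower($1) > lower($2) THEN 1
                   ELSE 0 END'
      LANGUAGE SQL;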

The whole discussion came about because I start by looking at the problem
from what is in the specification and trying to fit that into pg. Not
everything will fit; it's just my starting point when discussing. Tom starts
at the other end, and then it looks like a big controversy.

About the explosion of the number of functions needed: it's not obvious to
me that there will be an explosion if one manages to allow both full types
that include charset and more generic functions that work on any text
type. It seems to me that there are not that many interesting combinations
anyway. Most applications will use one charset and define functions that
work with just that charset.

Anyway, the only way to see what problems would arise is to try. I was
hoping that steps A and B in the plan were something that we wanted no
matter how the locale problem was later solved. With those in place it
would be easier to experiment.

-- 
/Dennis Björklund



Re: Charset/collate support and function parameters

From
Tatsuo Ishii
Date:
> > I wonder what the intention is in allowing such syntax. It seems
> > it's just useless since we could make a function bar() which accepts
> > any charset.
> 
> One could override the behaviour of functions by adding a charset and
> adding a new definition of an old function name for that charset. Like
> adding a new collation and defining a new cmp() function for that
> collation that works differently than some old definitions of cmp().

How could that be useful? For example, length() returns the character
length no matter what the charset/collation is. I can hardly imagine a
function which changes its behavior according to the charset.

> The whole discussion came about because I start by looking at the problem
> from what is in the specification and trying to fit that into pg. Not
> everything will fit; it's just my starting point when discussing. Tom starts
> at the other end, and then it looks like a big controversy.
> 
> About the explosion of the number of functions needed: it's not obvious to
> me that there will be an explosion if one manages to allow both full types
> that include charset and more generic functions that work on any text
> type.

I don't understand your point. Today we already use one length()
function for any charsets as Tom has already pointed out.

> It seems to me that there are not that many interesting combinations
> anyway. Most applications will use one charset and define functions that
> work with just that charset.

Really? One of the objectives of i18n is that an application can handle
multiple charsets. I don't want to write two applications just because of
a charset difference, for example English versus Japanese.

> Anyway, the only way to see what problems would arise is to try. I was
> hoping that steps A and B in the plan were something that we wanted no
> matter how the locale problem was later solved. With those in place it
> would be easier to experiment.

The question in your approach is how you could handle the coercibility
property. It's a transient, in-memory property and thus will not fit
into the function declaration. No?
--
Tatsuo Ishii


Re: Charset/collate support and function parameters

From
Dennis Bjorklund
Date:
On Sun, 31 Oct 2004, Tatsuo Ishii wrote:

> I don't understand your point. Today we already use one length()
> function for any charsets as Tom has already pointed out.

We have one length function that internally does different things depending
on the charset. If you want to add a charset and implement the length
function for that charset, how do you do that?

The length of a utf-8 string is not calculated the same way as the length
of a latin1 string. Each charset (encoding) has its own way of
calculating the length.

And by the way, today our databases just work with one charset at a time,
and what length() does is decided by a global variable. The difference we
are talking about here is the one between

length(latin1) ...
length(utf-8) ...
length(ascii) ...

and

length(x)
{
    if charset(x) == latin1 then
        ...
    else if charset(x) == utf-8 then
        ...
}

> The question in your approach is how you could handle the coercibility
> property. It's a transient, in-memory property and thus will not fit
> into the function declaration. No?

No, it's not part of the function signature. Coercibility is a way to
decide what collation to use. Depending on where a value comes from it
can have different coercibility, and when one does operations that involve
different collations the coercibility decides how ambiguities are resolved
(which value will be coerced).
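
A small example of what I mean (this is how the spec resolves it; the table
and collation names are of course just made up, and pg does not accept this
today):

  CREATE TABLE t (a text COLLATE "en_US", b text COLLATE "sv_SE");

  -- implicit vs. implicit with different collations: ambiguous,
  -- so the standard says this comparison is an error
  SELECT a < b FROM t;

  -- an explicit COLLATE clause has higher coercibility than an implicit
  -- column collation, so it resolves the ambiguity
  SELECT a < b COLLATE "sv_SE" FROM t;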

Wanting function signatures with charsets in them, and choosing where the
charset information is stored, don't have to be opposites of each
other.

I've currently been thinking that one can avoid storing the charset in the
value by handling the types like that. I even thought that there was no way
that anyone in the pg project would ever accept enlarging the string
values, obviously a wrong assumption :-)

Even when one does store the charset in the value, one might want to have
function overloading depend on the charset of the string (when
specified).

In the same vein, if I declare a function

  foo (x varchar(5))
  begin
      ...
  end

then I expect to get strings that are at most 5 chars long. Why do we allow
the (5) if it's just dropped? If I define a column as varchar(5) then the
column values really are at most 5 chars long, but it does not work like
that for functions.
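
To show what I mean with today's behaviour (the function name is just an
example), the declared limit is silently ignored for arguments:

  -- the (5) is accepted in the declaration but not enforced for arguments
  CREATE FUNCTION atmostfive(x varchar(5)) RETURNS int
      AS 'SELECT length($1)' LANGUAGE SQL;

  SELECT atmostfive('longer than five');  -- returns 16; the limit was dropped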

Let us simply agree that we do store the charset/collation/... in the
(in-memory) values. On disk we don't want that, since the column type
decides it completely; do we agree on that?

-- 
/Dennis Björklund



Re: Charset/collate support and function parameters

From
Tatsuo Ishii
Date:
> On Sun, 31 Oct 2004, Tatsuo Ishii wrote:
> 
> > I don't understand your point. Today we already use one length()
> > function for any charsets as Tom has already pointed out.
> 
> We have one length function that internally does different things depending
> on the charset. If you want to add a charset and implement the length
> function for that charset, how do you do that?

That's exactly the job of CREATE CHARSET. It will define a set of
functions that handle various tasks, including counting the length of a
string. One can find the character-length-counting function by looking up
the charset system catalog.
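
Something along these lines (the syntax and catalog/column names are not
decided at all; this is just to show the idea, using a catalog like the
pg_charset sketched earlier):

  -- hypothetical: defining a charset registers its helper functions
  CREATE CHARACTER SET sjis GET SJIS
      ( LENGTH FUNCTION sjis_length );  -- invented clause, not SQL-spec

  -- length() would then dispatch by looking the function up in the catalog
  SELECT cslenproc FROM pg_charset WHERE csname = 'sjis';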

> > The question in your approach is how you could handle the coercibility
> > property. It's a transient, in-memory property and thus will not fit
> > into the function declaration. No?
> 
> No, it's not part of the function signature. Coercibility is a way to
> decide what collation to use. Depending on where a value comes from it
> can have different coercibility, and when one does operations that involve
> different collations the coercibility decides how ambiguities are resolved
> (which value will be coerced).

I see.

> Wanting function signatures with charsets in them, and choosing where the
> charset information is stored, don't have to be opposites of each
> other.
> 
> I've currently been thinking that one can avoid storing the charset in the
> value by handling the types like that. I even thought that there was no way
> that anyone in the pg project would ever accept enlarging the string
> values, obviously a wrong assumption :-)
> 
> Even when one does store the charset in the value, one might want to have
> function overloading depend on the charset of the string (when
> specified).
> 
> In the same vein, if I declare a function
> 
>   foo (x varchar(5))
>   begin
>       ...
>   end
> 
> then I expect to get strings that are at most 5 chars long. Why do we allow
> the (5) if it's just dropped? If I define a column as varchar(5) then the
> column values really are at most 5 chars long, but it does not work like
> that for functions.
> 
> Let us simply agree that we do store the charset/collation/... in the
> (in-memory) values. On disk we don't want that, since the column type
> decides it completely; do we agree on that?

I agree, except that shared system catalogs (and probably some non-shared
system catalogs such as pg_class) need a charset on disk.

I personally don't see any value in using non-English user names, database
names, table names and so on, though. However, some users love to use
them :-)
--
Tatsuo Ishii