Thread: Vague idea for allowing per-column locale

Vague idea for allowing per-column locale

From: Peter Eisentraut
Date: Fri, 10 Aug 2001
We have realized that allowing per-column locale would be difficult with
the existing C library interface, because setlocale() is a pretty
inefficient operation.  But I think what we could allow, and what would be
fairly useful, is the choice between the plain C locale and one "real"
locale of choice (as determined by initdb) on a column or datum basis.

One possible way to implement this is to set or clear a bit somewhere in
the header of each text (char, varchar) type datum, depending on what you
want.  Basically, this bit is going to be part of the atttypmod.  Then the
comparison operators would use strcoll or strcmp, depending on the choice,
and similarly for other functions that are locale-aware.
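
In other words, something like this minimal sketch (the flag argument
standing in for the proposed atttypmod bit; none of these names are actual
backend functions):

    #include <stdbool.h>
    #include <string.h>

    /* Compare two NUL-terminated text datums.  The flag, derived from the
     * proposed atttypmod bit, selects between the one "real" locale fixed
     * at initdb time (strcoll) and plain C-locale byte order (strcmp). */
    static int
    text_cmp(const char *a, const char *b, bool use_locale)
    {
        if (use_locale)
            return strcoll(a, b);
        return strcmp(a, b);
    }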

Does anyone see a problem with this, aside from the fact that this breaks
the internal representation of the character types (which might have to
happen anyway if we ever want to do something in this direction)?

(If this is an acceptable plan then we could tie this in with the proposed
work of making the LIKE optimization work.  We wouldn't have to make up
new ugly-named operators, we'd just have to do a bit of plain old type
casting.)

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Vague idea for allowing per-column locale

From: Tim Allen
Date: Fri, 10 Aug 2001
On Fri, 10 Aug 2001, Peter Eisentraut wrote:

> We have realized that allowing per-column locale would be difficult with
> the existing C library interface, because setlocale() is a pretty
> inefficient operation.  But I think what we could allow, and what would be
> fairly useful, is the choice between the plain C locale and one "real"
> locale of choice (as determined by initdb) on a column or datum basis.

Yes, the C library locale notion is somewhat broken, or at best limited,
imho. It doesn't fit at all well with the needs of a server that can have
clients in different locales, or even clients in the same place who have
different locale preferences. I guess it's a pre-ubiquitous-internet
concept. If you keep stretching this idea beyond what the model
comfortably supports, life will become steadily more difficult, and it may
be better to give up on that model altogether.

A different idea related to this is to treat different text
representations as different data types. In the case of different
multi-byte text representations, this definitely makes sense; in the case
of just different locales for essentially the same character set it might
not be as obviously beneficial, but still merits some consideration, imho.

For converting, say, utf8 to euc-jp, it would be nice to be able to make
use of all the existing infrastructure that PostgreSQL has for type
conversion and type identification. It'd be even nicer if you could make a
table that has, say, one column utf8 (or utf32 even), one column euc-jp
and one column shift-jis, so that you could cache format conversions.

> One possible way to implement this is to set or clear a bit somewhere in
> the header of each text (char, varchar) type datum, depending on what you
> want.  Basically, this bit is going to be part of the atttypmod.  Then the
> comparison operators would use strcoll or strcmp, depending on the choice,
> and similarly for other functions that are locale-aware.

Under my grand plan one would have to implement comparison operators for
each data type (as well as all the other things one has to implement for a
data type); then it should Just Work, because postgres would know what
comparison to use for each column.
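
For instance (purely illustrative, not actual backend code), an euc-jp
type could supply its own comparison routine, and the type system would
dispatch to it for any column of that type:

    #include <string.h>

    /* Hypothetical comparison for an "euc_jp" text type; byte-wise order
     * is taken as good enough for EUC-JP here. */
    static int
    eucjp_cmp(const unsigned char *a, int alen,
              const unsigned char *b, int blen)
    {
        int minlen = (alen < blen) ? alen : blen;
        int r = memcmp(a, b, minlen);

        return (r != 0) ? r : (alen - blen);    /* shorter sorts first */
    }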

> Does anyone see a problem with this, aside from the fact that this breaks
> the internal representation of the character types (which might have to
> happen anyway if we ever want to do something in this direction)?

> (If this is an acceptable plan then we could tie this in with the proposed
> work of making the LIKE optimization work.  We wouldn't have to make up
> new ugly-named operators, we'd just have to do a bit of plain old type
> casting.)

The separate data types notion would work here also, since one could
declare a column to be of plain vanilla ascii data type, with all
comparisons just a simple comparison of numerical values.

BTW, how does postgres store multibyte text? As char * with a multibyte
encoding? As 16 bit or 32 bit code points? I should of course just look at
the code and find out...:) I guess the former, from Peter's earlier
comments. It does seem to me that using an explicit 32 bit representation
(or at least providing that as an option) would make life easier in many
ways.

Tim

-- 
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/ http://www4.tpg.com.au/users/rita_tim/



Re: Vague idea for allowing per-column locale

From: Thomas Lockhart
> (If this is an acceptable plan then we could tie this in with the proposed
> work of making the LIKE optimization work.  We wouldn't have to make up
> new ugly-named operators, we'd just have to do a bit of plain old type
> casting.)

If we are thinking about improvements at this level, why not go ahead
and reopen the discussion of how to do SQL9x national character sets,
collations, etc etc. istm that these will offer a solution for which the
current issues are a (hopefully large) subset.

We could use the type system to support this (my current preference);
others have suggested that this might be too heavy to be usable and had
alternate suggestions.

Issues with SQL9x include:

o character set/collation syntax for string literals

o internal representation

o appropriate operators and functions for these sets/collations

o I/O conventions between client and server (may use the current
scheme?)

o allowing these alternate character sets for table names (or wherever
allowed by SQL9x). How to expose, for example, equality operators to
allow internal PostgreSQL operation: is our current use of strcmp()
enough?

Comments?
                       - Thomas


Re: Vague idea for allowing per-column locale

From: Matthew Kirkwood
On Fri, 10 Aug 2001, Tim Allen wrote:

> > We have realized that allowing per-column locale would be difficult with
> > the existing C library interface, because setlocale() is a pretty
> > inefficient operation.  But I think what we could allow, and what would be

> Yes, the C library locale notion is somewhat broken, or at best limited,
> imho. It doesn't fit at all well with the needs of a server that can have
> clients in different locales, or even clients in the same place who have
> different locale preferences.

This may be of interest:

http://www.cygnus.com/~drepper/tllocale.ps.bz2

It's not clear whether glibc implements it yet, nor whether people like
the interface, but it seems like a nice, easily wrapped and stubbed
option, which might come for free on at least one OS.
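
As a sketch of the idea (using the newlocale()/strcoll_l() names that
glibc eventually adopted -- treat them as an assumption here, not
necessarily what the paper specifies):

    #define _GNU_SOURCE
    #include <locale.h>
    #include <string.h>

    /* Compare two strings under a named collation without touching the
     * process-global locale, i.e. no setlocale() round trip. */
    int
    compare_in(const char *locname, const char *a, const char *b)
    {
        locale_t loc = newlocale(LC_COLLATE_MASK, locname, (locale_t) 0);
        int r;

        if (loc == (locale_t) 0)
            return strcmp(a, b);        /* fall back to byte order */
        r = strcoll_l(a, b, loc);
        freelocale(loc);
        return r;
    }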

Matthew.



Re: Vague idea for allowing per-column locale

From: Peter Eisentraut
Thomas Lockhart writes:

> o character set/collation syntax for string literals

I think it's <character value expression> COLLATE <collation name>.  It's
another one of these things that break our nicely organized system. :-(

> o internal representation
>
> o appropriate operators and functions for these sets/collations

Hmm, there aren't a lot of things you can do with them, no?

> o I/O conventions between client and server (may use the current
> scheme?)

As long as it works I don't see a problem.  I have said before that we
should allow using the iconv interface because it's more powerful.

> o allowing these alternate character sets for table names (or wherever
> allowed by SQL9x). How to expose, for example, equality operators to
> allow internal PostgreSQL operation: is our current use of strcmp()
> enough?

No.  This could be tricky to do, though, without sacrificing the designed
efficiency of the "name" type.  For instance, how could we store the
which-locale-is-it information?

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Vague idea for allowing per-column locale

From: Peter Eisentraut
Tim Allen writes:

> For converting, say, utf8 to euc-jp, it would be nice to be able to make
> use of all the existing infrastructure that PostgreSQL has for type
> conversion and type identification.

Unfortunately, that infrastructure is rather poorly suited for handling
arbitrary types.

> It'd be even nicer if you could make a table that has, say, one column
> utf8 (or utf32 even), one column euc-jp and one column shift-jis, so
> that you could cache format conversions.

This might be a nice thing to show off but I'm not sure about the
practical use.  There's Unicode that you can use if you want to mix and
match on the server, and the ability to convert the character set between
client and server so the right thing shows up on everyone's screen.

> Under my grand plan one would have to implement comparison operators for
> each data type (as well as all the other things one has to implement for a
> data type);

You do realize that this would be hundreds, if not thousands, of things?

> BTW, how does postgres store multibyte text? As char * with a multibyte
> encoding?

Yes.

> As 16 bit or 32 bit code points? I should of course just look at
> the code and find out...:) I guess the former, from Peter's earlier
> comments. It does seem to me that using an explicit 32 bit representation
> (or at least providing that as an option) would make life easier in many
> ways.

I think the storage size penalty would be prohibitive.
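
The arithmetic is stark for ASCII-dominated data: a fixed 32-bit
representation quadruples it.  A trivial illustration:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *s = "hello";        /* ASCII: 1 byte per character */
        size_t n = strlen(s);

        printf("UTF-8/ASCII: %u bytes\n", (unsigned) n);        /* 5  */
        printf("UCS-4:       %u bytes\n", (unsigned) (n * 4));  /* 20 */
        return 0;
    }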

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Vague idea for allowing per-column locale

From: Tatsuo Ishii
Date: Tue, 14 Aug 2001
> > It'd be even nicer if you could make a table that has, say, one column
> > utf8 (or utf32 even), one column euc-jp and one column shift-jis, so
> > that you could cache format conversions.
> 
> This might be a nice thing to show off but I'm not sure about the
> practical use.  There's Unicode that you can use if you want to mix and
> match on the server, and the ability to convert the character set between
> client and server so the right thing shows up on everyone's screen.

Storing everything as Unicode is not a good idea, actually. First,
Unicode tends to consume more storage space than other character
sets. For example, UTF-8, one of the most commonly used encodings for
Unicode, consumes 3 bytes for Japanese characters, while SJIS only
consumes 2 bytes. Second, a round-trip conversion between Unicode and
other character sets is not always possible. Third, there is the
sorting issue: there is no convenient way to sort Unicode correctly.
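
To make the storage point concrete with one character (HIRAGANA LETTER A,
U+3042, which is 0xE3 0x81 0x82 in UTF-8 but 0x82 0xA0 in Shift-JIS):

    #include <stdio.h>

    /* Bytes needed to encode a Unicode code point in UTF-8. */
    static int
    utf8_len(unsigned int cp)
    {
        if (cp < 0x80)    return 1;
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    int
    main(void)
    {
        printf("U+3042 in UTF-8:     %d bytes\n", utf8_len(0x3042)); /* 3 */
        printf("U+3042 in Shift-JIS: 2 bytes (0x82 0xA0)\n");
        return 0;
    }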
--
Tatsuo Ishii


Re: Vague idea for allowing per-column locale

From: Tim Allen
On Tue, 14 Aug 2001, Tatsuo Ishii wrote:

> Storing everything as Unicode is not a good idea, actually. First,
> Unicode tends to consume more storage space than other character
> sets. For example, UTF-8, one of the most commonly used encodings for
> Unicode, consumes 3 bytes for Japanese characters, while SJIS only
> consumes 2 bytes. Second, a round-trip conversion between Unicode and
> other character sets is not always possible. Third, there is the
> sorting issue: there is no convenient way to sort Unicode correctly.

UTF-16 can handle most Japanese characters in two bytes, afaict. Generally
it seems that utf8 encodes European text more efficiently on average,
whereas utf16 is better for most Asian languages. I may be mistaken, but I
was under the impression that sorting of unicode characters was a solved
problem. The IBM ICU class library (which does have a C interface), for
example, claims to provide everything you need to sort unicode text in
various locales, and uses utf16 internally:

http://oss.software.ibm.com/developerworks/opensource/icu/project/index.html
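
A minimal sketch of what that C interface looks like (ucol.h; the locale
name and strings here are just placeholders):

    #include <stdio.h>
    #include <unicode/ucol.h>
    #include <unicode/ustring.h>

    int
    main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        UCollator *coll = ucol_open("ja_JP", &status);  /* Japanese rules */
        UChar a[8], b[8];
        UCollationResult r;

        if (U_FAILURE(status))
            return 1;

        /* ICU collates UTF-16 (UChar) strings; -1 = NUL-terminated. */
        u_uastrcpy(a, "abc");
        u_uastrcpy(b, "abd");
        r = ucol_strcoll(coll, a, -1, b, -1);
        printf("%s\n", r == UCOL_LESS    ? "less"    :
                       r == UCOL_GREATER ? "greater" : "equal");

        ucol_close(coll);
        return 0;
    }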

The licence is, I gather, the X licence, which presumably is compatible
enough with BSD; not that I would necessarily advocate building this into
postgres at a fundamental level, but it demonstrates that it can be done.

Note that I'm not speaking from experience here; I've just read the docs
and a book on Unicode, and have never actually performed a Japanese-language
(or any other non-English-language) sort, so no need to take me too
seriously :).

> Tatsuo Ishii

Tim

-- 
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/ http://www4.tpg.com.au/users/rita_tim/



Re: Vague idea for allowing per-column locale

From: Tatsuo Ishii
> > Storing everything as Unicode is not a good idea, actually. First,
> > Unicode tends to consume more storage space than other character
> > sets. For example, UTF-8, one of the most commonly used encodings for
> > Unicode, consumes 3 bytes for Japanese characters, while SJIS only
> > consumes 2 bytes. Second, a round-trip conversion between Unicode and
> > other character sets is not always possible. Third, there is the
> > sorting issue: there is no convenient way to sort Unicode correctly.
> 
> UTF-16 can handle most Japanese characters in two bytes, afaict. Generally
> it seems that utf8 encodes European text more efficiently on average,
> whereas utf16 is better for most Asian languages.

The same can be said of UCS-2: most multibyte characters fit in two
bytes in UCS-2. The problem with both UTF-16 and UCS-4 is that the
data may contain NULL bytes.
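
That problem is easy to demonstrate; every piece of code that assumes
NUL-terminated strings breaks:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* "A" (U+0041) in UTF-16BE is the byte pair 0x00 0x41, so C
         * string functions see an empty string. */
        const char utf16be_A[] = { 0x00, 0x41, 0x00, 0x00 };

        printf("strlen() reports %u bytes\n",
               (unsigned) strlen(utf16be_A));           /* prints 0 */
        return 0;
    }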

> I may be mistaken, but I
> was under the impression that sorting of unicode characters was a solved
> problem. The IBM ICU class library (which does have a C interface), for
> example, claims to provide everything you need to sort unicode text in
> various locales, and uses utf16 internally:

Interesting. Thanks for the info. I will look into this.

BTW, the "round-trip conversion" problem still needs to be addressed.
--
Tatsuo Ishii