
From Tim Allen
Subject Re: Vague idea for allowing per-column locale
Date
Msg-id Pine.LNX.4.21.0108100903010.6384-100000@bee.proximity.com.au
In response to Vague idea for allowing per-column locale  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-hackers
On Fri, 10 Aug 2001, Peter Eisentraut wrote:

> We have realized that allowing per-column locale would be difficult with
> the existing C library interface, because setlocale() is a pretty
> inefficient operation.  But I think what we could allow, and what would be
> fairly useful, is the choice between the plain C locale and one "real"
> locale of choice (as determined by initdb) on a column or datum basis.

Yes, the C library locale notion is somewhat broken, or at best limited,
imho. It doesn't fit at all well with the needs of a server that can have
clients in different locales, or even clients in the same place with
different locale preferences. I guess it's a pre-ubiquitous-internet
concept. If you keep stretching the model beyond what it comfortably
supports, life will become steadily more difficult, and it may be better
to give up on that model altogether.
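
To put it concretely, the only way the plain C interface lets you collate
under a locale other than the process-wide one is something like the
sketch below (the function and the locale argument are just placeholders).
The setlocale() round trips are the slow, process-global part Peter is
referring to:

#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch only: comparing two strings under a chosen locale with the
 * plain C library means switching the global locale, comparing, and
 * switching back.  setlocale() is slow and affects the whole backend,
 * which is why doing this per column (or per datum) is unattractive.
 */
static int
compare_in_locale(const char *a, const char *b, const char *locname)
{
    char   *saved;
    int     result;

    /* remember the current collation locale so it can be restored */
    saved = strdup(setlocale(LC_COLLATE, NULL));

    setlocale(LC_COLLATE, locname);     /* expensive */
    result = strcoll(a, b);
    setlocale(LC_COLLATE, saved);       /* expensive again */

    free(saved);
    return result;
}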

A different but related idea is to treat different text
representations as different data types. In the case of different
multi-byte text representations, this definitely makes sense; in the case
of just different locales for essentially the same character set it might
not be as obviously beneficial, but still merits some consideration, imho.

For converting, say, utf8 to euc-jp, it would be nice to be able to make
use of all the existing infrastructure that PostgreSQL has for type
conversion and type identification. It'd be even nicer if you could make a
table that has, say, one column utf8 (or utf32 even), one column euc-jp
and one column shift-jis, so that you could cache format conversions.

> One possible way to implement this is to set or clear a bit somewhere in
> the header of each text (char, varchar) type datum, depending on what you
> want.  Basically, this bit is going to be part of the atttypmod.  Then the
> comparison operators would use strcoll or strcmp, depending on the choice,
> and similarly for other functions that are locale-aware.

Under my grand plan one would have to implement comparison operators for
each data type (as well as all the other things one has to implement for a
data type); then it should Just Work, because postgres would know what
comparison to use for each column.
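
The two-way choice Peter describes is trivial at the point of comparison;
the interesting question is where the flag lives. A minimal sketch,
assuming a use_locale flag standing in for whatever bit ends up in the
datum header or atttypmod (real code would of course deal with varlena
text, not C strings):

#include <stdbool.h>
#include <string.h>

/* Sketch: one comparison routine serving both behaviours. */
static int
text_compare(const char *a, const char *b, bool use_locale)
{
    if (use_locale)
        return strcoll(a, b);   /* the one "real" locale chosen at initdb */
    else
        return strcmp(a, b);    /* plain C locale: byte-value ordering */
}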

> Does anyone see a problem with this, aside from the fact that this breaks
> the internal representation of the character types (which might have to
> happen anyway if we ever want to do something in this direction)?

> (If this is an acceptable plan then we could tie this in with the proposed
> work of making the LIKE optimization work.  We wouldn't have to make up
> new ugly-named operators, we'd just have to do a bit of plain old type
> casting.)

The separate data types notion would work here also, since one could
declare a column to be of a plain-vanilla ascii data type, with all
comparisons reducing to simple comparisons of numerical character values.
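
That byte-value ordering is also exactly what makes the LIKE optimization
safe: under strcmp()-style ordering, col LIKE 'abc%' can be rewritten as a
range scan, which isn't valid in general under strcoll(). Roughly like
this (make_upper_bound is a made-up helper, and it punts on the 0xFF edge
case):

#include <stdlib.h>
#include <string.h>

/*
 * Under byte-value ordering, "col LIKE 'abc%'" is equivalent to
 * "col >= 'abc' AND col < 'abd'", so an ordinary index can be used.
 * This helper just bumps the last byte of the prefix to get the
 * exclusive upper bound.
 */
static char *
make_upper_bound(const char *prefix)
{
    size_t  len = strlen(prefix);
    char   *bound;

    if (len == 0 || (unsigned char) prefix[len - 1] == 0xFF)
        return NULL;            /* no simple upper bound */

    bound = strdup(prefix);
    bound[len - 1]++;           /* "abc" -> "abd" */
    return bound;
}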

BTW, how does postgres store multibyte text? As char * with a multibyte
encoding? As 16-bit or 32-bit code points? I should of course just look at
the code and find out... :) I guess the former, from Peter's earlier
comments. It does seem to me that using an explicit 32-bit representation
(or at least providing that as an option) would make life easier in many
ways.
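
Just to show what I mean by an explicit 32-bit representation: decoding
the stored multibyte form (UTF-8, say) into 32-bit code points is easy
enough, and having the code points directly available would save doing it
over and over. A rough sketch with only token validation:

#include <stdint.h>

/*
 * Decode one UTF-8 character into a 32-bit code point; returns the
 * number of bytes consumed, or -1 on an invalid lead byte.  Trailing
 * byte validation is omitted for brevity.
 */
static int
utf8_decode_one(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80)
    {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)
    {
        *cp = ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0)
    {
        *cp = ((uint32_t) (s[0] & 0x0F) << 12)
            | ((uint32_t) (s[1] & 0x3F) << 6)
            | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0)
    {
        *cp = ((uint32_t) (s[0] & 0x07) << 18)
            | ((uint32_t) (s[1] & 0x3F) << 12)
            | ((uint32_t) (s[2] & 0x3F) << 6)
            | (s[3] & 0x3F);
        return 4;
    }
    return -1;
}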

Tim

-- 
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/ http://www4.tpg.com.au/users/rita_tim/


