Thread: Vague idea for allowing per-column locale
We have realized that allowing per-column locale would be difficult with the existing C library interface, because setlocale() is a pretty inefficient operation.  But I think what we could allow, and what would be fairly useful, is the choice between the plain C locale and one "real" locale of choice (as determined by initdb) on a column or datum basis.

One possible way to implement this is to set or clear a bit somewhere in the header of each text (char, varchar) type datum, depending on what you want.  Basically, this bit is going to be part of the atttypmod.  Then the comparison operators would use strcoll or strcmp, depending on the choice, and similarly for other functions that are locale-aware.

Does anyone see a problem with this, aside from the fact that this breaks the internal representation of the character types (which might have to happen anyway if we ever want to do something in this direction)?

(If this is an acceptable plan then we could tie this in with the proposed work of making the LIKE optimization work.  We wouldn't have to make up new ugly-named operators, we'd just have to do a bit of plain old type casting.)

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
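A minimal sketch of the comparison the proposal above describes, assuming a hypothetical per-datum flag; the TEXT_USE_LOCALE bit, the struct layout, and the NUL-terminated data field are all invented for illustration, since the real bit would live in the datum header or be derived from atttypmod:

    #include <string.h>

    /*
     * Hypothetical per-datum flag: locale-aware comparison or plain C locale.
     * This layout is invented for the sketch only.
     */
    typedef struct
    {
        unsigned int flags;     /* imagined header bit(s) */
        char         data[1];   /* NUL-terminated here to keep the sketch short */
    } text_datum;

    #define TEXT_USE_LOCALE 0x01

    static int
    text_cmp(const text_datum *a, const text_datum *b)
    {
        if (a->flags & TEXT_USE_LOCALE)
            return strcoll(a->data, b->data);   /* "real" locale chosen at initdb */

        return strcmp(a->data, b->data);        /* plain C locale: byte order */
    }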
On Fri, 10 Aug 2001, Peter Eisentraut wrote:
> We have realized that allowing per-column locale would be difficult with
> the existing C library interface, because setlocale() is a pretty
> inefficient operation.  But I think what we could allow, and what would be
> fairly useful, is the choice between the plain C locale and one "real"
> locale of choice (as determined by initdb) on a column or datum basis.

Yes, the C library locale notion is somewhat broken, or at best limited, imho.  It doesn't fit at all well with the needs of a server that can have clients in different locales, or even clients in the same place who have different locale preferences.  I guess it's a pre-ubiquitous-internet concept.  If you keep stretching this idea beyond the model that it comfortably supports, life will become steadily more difficult, and it may be better to give up on that model altogether.

A different idea related to this is to treat different text representations as different data types.  In the case of different multi-byte text representations, this definitely makes sense; in the case of just different locales for essentially the same character set it might not be as obviously beneficial, but still merits some consideration, imho.

For converting, say, utf8 to euc-jp, it would be nice to be able to make use of all the existing infrastructure that PostgreSQL has for type conversion and type identification.  It'd be even nicer if you could make a table that has, say, one column utf8 (or utf32 even), one column euc-jp and one column shift-jis, so that you could cache format conversions.

> One possible way to implement this is to set or clear a bit somewhere in
> the header of each text (char, varchar) type datum, depending on what you
> want.  Basically, this bit is going to be part of the atttypmod.  Then the
> comparison operators would use strcoll or strcmp, depending on the choice,
> and similarly for other functions that are locale-aware.

Under my grand plan one would have to implement comparison operators for each data type (as well as all the other things one has to implement for a data type); then it should Just Work, because postgres would know what comparison to use for each column.

> Does anyone see a problem with this, aside from the fact that this breaks
> the internal representation of the character types (which might have to
> happen anyway if we ever want to do something in this direction)?
>
> (If this is an acceptable plan then we could tie this in with the proposed
> work of making the LIKE optimization work.  We wouldn't have to make up
> new ugly-named operators, we'd just have to do a bit of plain old type
> casting.)

The separate data types notion would work here also, since one could declare a column to be of plain vanilla ascii data type, with all comparisons just a simple comparison of numerical values.

BTW, how does postgres store multibyte text?  As char * with a multibyte encoding?  As 16 bit or 32 bit code points?  I should of course just look at the code and find out... :)  I guess the former, from Peter's earlier comments.  It does seem to me that using an explicit 32 bit representation (or at least providing that as an option) would make life easier in many ways.

Tim

--
-----------------------------------------------
Tim Allen                  tim@proximity.com.au
Proximity Pty Ltd    http://www.proximity.com.au/
                     http://www4.tpg.com.au/users/rita_tim/
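A sketch of the separate-data-types idea above, assuming one comparison routine per text representation; the type names and layouts are made up, and the point is only that a plain ASCII type can compare raw byte values while, say, an EUC-JP type would supply its own routine:

    #include <string.h>

    /* Invented layouts: one type per on-disk text representation. */
    typedef struct { int len; char bytes[1]; } ascii_text;
    typedef struct { int len; char bytes[1]; } eucjp_text;

    /* Plain vanilla ASCII: just a simple comparison of numerical byte values. */
    static int
    ascii_text_cmp(const ascii_text *a, const ascii_text *b)
    {
        int n = a->len < b->len ? a->len : b->len;
        int c = memcmp(a->bytes, b->bytes, n);

        return c != 0 ? c : a->len - b->len;
    }

    /* An EUC-JP type would register its own, encoding-aware routine instead. */
    static int eucjp_text_cmp(const eucjp_text *a, const eucjp_text *b);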
> (If this is an acceptable plan then we could tie this in with the proposed
> work of making the LIKE optimization work.  We wouldn't have to make up
> new ugly-named operators, we'd just have to do a bit of plain old type
> casting.)

If we are thinking about improvements at this level, why not go ahead and reopen the discussion of how to do SQL9x national character sets, collations, etc etc.  istm that these will offer a solution for which the current issues are a (hopefully large) subset.

We could use the type system to support this (my current preference); others have suggested that this might be too heavy to be usable and had alternate suggestions.

Issues with SQL9x include:

o character set/collation syntax for string literals

o internal representation

o appropriate operators and functions for these sets/collations

o I/O conventions between client and server (may use the current scheme?)

o allowing these alternate character sets for table names (or wherever allowed by SQL9x).  How to expose, for example, equality operators to allow internal PostgreSQL operation: is our current use of strcmp() enough?

Comments?

                    - Thomas
On Fri, 10 Aug 2001, Tim Allen wrote:
> > We have realized that allowing per-column locale would be difficult with
> > the existing C library interface, because setlocale() is a pretty
> > inefficient operation.  But I think what we could allow, and what would be
>
> Yes, the C library locale notion is somewhat broken, or at best limited,
> imho.  It doesn't fit at all well with the needs of a server that can have
> clients in different locales, or even clients in the same place who have
> different locale preferences.

This may be of interest:

http://www.cygnus.com/~drepper/tllocale.ps.bz2

It's not clear if glibc implements it yet, nor if people like the interface, but it seems like a nice, easily-wrapped and stubbed option, which might come for free on at least one OS.

Matthew.
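For reference, the interface that paper argues for looks roughly like the newlocale()/strcoll_l() family that glibc and POSIX later standardized: locales become objects passed to _l variants of the string functions, so no global setlocale() switch is needed per comparison.  A minimal sketch, assuming such an implementation is available (the locale name is an example and may not be installed):

    #define _POSIX_C_SOURCE 200809L

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* a locale object, independent of the process-global locale */
        locale_t de = newlocale(LC_COLLATE_MASK, "de_DE.ISO-8859-1", (locale_t) 0);

        if (de == (locale_t) 0)
            return 1;               /* locale not installed on this system */

        /* compare under that locale without ever calling setlocale() */
        printf("%d\n", strcoll_l("Apfel", "Zebra", de));

        freelocale(de);
        return 0;
    }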
Thomas Lockhart writes:

> o character set/collation syntax for string literals

I think it's <character value expression> COLLATE <collation name>.  It's another one of these things that break our nicely organized system. :-(

> o internal representation
>
> o appropriate operators and functions for these sets/collations

Hmm, there aren't a lot of things you can do with them, no?

> o I/O conventions between client and server (may use the current
> scheme?)

As long as it works I don't see a problem.  I have said before that we should allow using the iconv interface because it's more powerful.

> o allowing these alternate character sets for table names (or wherever
> allowed by SQL9x).  How to expose, for example, equality operators to
> allow internal PostgreSQL operation: is our current use of strcmp()
> enough?

No.  This could be tricky to do, though, without sacrificing the designed efficiency of the "name" type.  For instance, how could we store the which-locale-is-it information?

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
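A minimal sketch of the iconv interface mentioned above, converting EUC-JP to UTF-8; the encoding names, buffer sizes, and the lack of a retry loop are simplifications, and some platforms declare iconv's input pointer as const char ** rather than char **:

    #include <iconv.h>
    #include <stdio.h>

    int
    main(void)
    {
        iconv_t cd = iconv_open("UTF-8", "EUC-JP");
        char in[] = "\xa4\xa2";                 /* HIRAGANA LETTER A in EUC-JP */
        char out[16];
        char *inp = in, *outp = out;
        size_t inleft = sizeof(in) - 1, outleft = sizeof(out);

        if (cd == (iconv_t) -1)
            return 1;                           /* conversion pair not supported */

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
            perror("iconv");

        printf("converted to %zu UTF-8 bytes\n", sizeof(out) - outleft);
        iconv_close(cd);
        return 0;
    }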
Tim Allen writes:

> For converting, say, utf8 to euc-jp, it would be nice to be able to make
> use of all the existing infrastructure that PostgreSQL has for type
> conversion and type identification.

Unfortunately, that infrastructure is rather poorly suited for handling arbitrary types.

> It'd be even nicer if you could make a table that has, say, one column
> utf8 (or utf32 even), one column euc-jp and one column shift-jis, so
> that you could cache format conversions.

This might be a nice thing to show off, but I'm not sure about the practical use.  There's Unicode that you can use if you want to mix and match on the server, and the ability to convert the character set between client and server so the right thing shows up on everyone's screen.

> Under my grand plan one would have to implement comparison operators for
> each data type (as well as all the other things one has to implement for a
> data type);

You do realize that this would be hundreds, if not thousands, of things?

> BTW, how does postgres store multibyte text?  As char * with a multibyte
> encoding?

Yes.

> As 16 bit or 32 bit code points?  I should of course just look at
> the code and find out... :)  I guess the former, from Peter's earlier
> comments.  It does seem to me that using an explicit 32 bit representation
> (or at least providing that as an option) would make life easier in many
> ways.

I think the storage size penalty would be prohibitive.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
> > It'd be even nicer if you could make a table that has, say, one column
> > utf8 (or utf32 even), one column euc-jp and one column shift-jis, so
> > that you could cache format conversions.
>
> This might be a nice thing to show off, but I'm not sure about the
> practical use.  There's Unicode that you can use if you want to mix and
> match on the server, and the ability to convert the character set between
> client and server so the right thing shows up on everyone's screen.

Storing everything as Unicode is not a good idea, actually.  First, Unicode tends to consume more storage space than other character sets.  For example, UTF-8, one of the most commonly used encodings for Unicode, consumes 3 bytes for Japanese characters, while SJIS only consumes 2 bytes.  Second, a round trip conversion between Unicode and other character sets is not always possible.  Third, there is the sorting issue: there is no convenient way to sort Unicode correctly.

--
Tatsuo Ishii
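A quick way to check the storage-size claim above: the number of bytes UTF-8 needs for a code point depends only on its range, and the kana blocks (U+3040..U+30FF) and common CJK ideographs (U+4E00..U+9FFF) all fall in the three-byte range, where Shift JIS uses two bytes.  The helper below is just for illustration:

    /* bytes needed to encode one Unicode code point in UTF-8 */
    static int
    utf8_len(unsigned int cp)
    {
        if (cp < 0x80)      return 1;   /* ASCII */
        if (cp < 0x800)     return 2;   /* most Latin/Greek/Cyrillic, etc. */
        if (cp < 0x10000)   return 3;   /* kana, CJK ideographs, ... */
        return 4;                       /* beyond the BMP */
    }

    /* e.g. utf8_len(0x3042) == 3 for HIRAGANA LETTER A */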
On Tue, 14 Aug 2001, Tatsuo Ishii wrote:
> Storing everything as Unicode is not a good idea, actually.  First,
> Unicode tends to consume more storage space than other character
> sets.  For example, UTF-8, one of the most commonly used encodings for
> Unicode, consumes 3 bytes for Japanese characters, while SJIS only
> consumes 2 bytes.  Second, a round trip conversion between Unicode and
> other character sets is not always possible.  Third, there is the sorting
> issue: there is no convenient way to sort Unicode correctly.

UTF-16 can handle most Japanese characters in two bytes, afaict.  Generally it seems that utf8 encodes European text more efficiently on average, whereas utf16 is better for most Asian languages.

I may be mistaken, but I was under the impression that sorting of unicode characters was a solved problem.  The IBM ICU class library (which does have a C interface), for example, claims to provide everything you need to sort unicode text in various locales, and uses utf16 internally:

http://oss.software.ibm.com/developerworks/opensource/icu/project/index.html

The licence is, I gather, the X licence, which presumably is compatible enough with BSD; not that I would necessarily advocate building this into postgres at a fundamental level, but it demonstrates that it can be done.

Note that I'm not speaking from experience here; I've just read the docs and a book on unicode, and never actually performed a Japanese-language (or any other non-English language) sort, so no need to take me too seriously :).

> Tatsuo Ishii

Tim

--
-----------------------------------------------
Tim Allen                  tim@proximity.com.au
Proximity Pty Ltd    http://www.proximity.com.au/
                     http://www4.tpg.com.au/users/rita_tim/
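A minimal sketch of what locale-aware collation through ICU's C interface looks like, using ucol_open()/ucol_strcoll(); the "ja_JP" locale name and the abbreviated error handling are illustrative only:

    #include <unicode/ucol.h>
    #include <unicode/ustring.h>
    #include <stdio.h>

    int
    main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        UCollator *coll = ucol_open("ja_JP", &status);
        UChar a[32], b[32];

        if (U_FAILURE(status))
            return 1;

        /* convert two UTF-8 strings to ICU's UTF-16 UChar representation */
        u_strFromUTF8(a, 32, NULL, "\xe3\x81\x82", -1, &status);  /* HIRAGANA A */
        u_strFromUTF8(b, 32, NULL, "\xe3\x81\x84", -1, &status);  /* HIRAGANA I */

        /* locale-sensitive comparison, no setlocale() involved */
        printf("%d\n", ucol_strcoll(coll, a, -1, b, -1));

        ucol_close(coll);
        return 0;
    }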
> > Storing everything as Unicode is not a good idea, actually.  First,
> > Unicode tends to consume more storage space than other character
> > sets.  For example, UTF-8, one of the most commonly used encodings for
> > Unicode, consumes 3 bytes for Japanese characters, while SJIS only
> > consumes 2 bytes.  Second, a round trip conversion between Unicode and
> > other character sets is not always possible.  Third, there is the sorting
> > issue: there is no convenient way to sort Unicode correctly.
>
> UTF-16 can handle most Japanese characters in two bytes, afaict.  Generally
> it seems that utf8 encodes European text more efficiently on average,
> whereas utf16 is better for most Asian languages.

The same can be said about UCS-2: most multibyte characters would fit in two bytes with UCS-2.  The problem with both UTF-16 and UCS-4 is that the data may contain NUL bytes.

> I may be mistaken, but I
> was under the impression that sorting of unicode characters was a solved
> problem.  The IBM ICU class library (which does have a C interface), for
> example, claims to provide everything you need to sort unicode text in
> various locales, and uses utf16 internally:

Interesting.  Thanks for the info.  I will look into this.  BTW, the "round trip conversion problem" still needs to be addressed.

--
Tatsuo Ishii
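A small illustration of the NUL-byte problem mentioned above: in UTF-16 (or UCS-4), plain ASCII characters carry zero bytes, so anything that treats the data as a NUL-terminated C string, strlen() included, truncates it:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* "AB" in UTF-16 little-endian: each ASCII character carries a 0x00 */
        const char utf16le_ab[] = { 0x41, 0x00, 0x42, 0x00 };

        /* strlen() stops at the first zero byte and reports 1, not 4 */
        printf("%zu\n", strlen(utf16le_ab));
        return 0;
    }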