Thread: Multibyte encoding vs. SQL_ASCII and European languages
Call me stupid - but I am trying to understand what multibyte encoding (aka Latin1) would buy me with English/German/French etc. languages (the app I am currently slapping together will be mostly used by people writing in German). SQL_ASCII seems to work just fine with German umlauts and other funny characters that I am stuffing into, or pulling out of, the database (that is, after having survived the nightmare of importing Filemaker data so as to have umlauts etc. correctly represented).

And what exactly does the server vs. client encoding do?

Thanks, Frank

On Tue, 2002-01-29 at 13:03, Frank Joerdens wrote:
> Call me stupid - but I am trying to understand what multibyte encoding
> (aka Latin1) ...

!!!!!!???????????!!!...!!!!!!...???????????

so Latin1 is MULTIBYTE ?????????!!!!!!!..!!!...?????????????

Regards
Frank ( too ;o)

On Tue, Jan 29, 2002 at 01:41:16PM +0100, Frank Schafer wrote:
> On Tue, 2002-01-29 at 13:03, Frank Joerdens wrote:
> > Call me stupid - but I am trying to understand what multibyte encoding
> > (aka Latin1) ...
>
> !!!!!!???????????!!!...!!!!!!...???????????
>
> so Latin1 is MULTIBYTE ?????????!!!!!!!..!!!...?????????????
>
> Regards
> Frank ( too ;o)
              ^^ and what is that emoticon?

??? What did you mean??? (did your mailer screw things up so I am only seeing exclamation and question marks or did you try to tell me something that way?).

By way of explaining myself a little better maybe: Looking at the relevant section in the admin guide, which is entitled 'Localization', you get the impression that either locale support or multibyte support are good things to have if you are not in an English environment.

Multibyte support is mainly recommended for character sets that don't fit into a single byte (Chinese, Japanese, Korean), and locale support is said to be mostly sufficient for European languages . . . what escapes me is why I should bother with either of these when SQL_ASCII works just fine with my mostly German users. I must be missing something, right?

Regards, Frank

Frank Joerdens <frank@joerdens.de> writes:
> Multibyte support is mainly recommended for character sets that don't
> fit into a single byte (Chinese, Japanese, Korean), and locale support
> is said to be mostly sufficient for European languages . . . what escapes
> me is why I should bother with either of these when SQL_ASCII works just
> fine with my mostly German users. I must be missing something, right?

Sort ordering of non-7-bit-ASCII characters? upper/lower case conversions that work as expected? locale-aware formatting options in to_char and friends?

If you don't need any of that, then you won't need locale support.

I agree that you have no use for multibyte support.

regards, tom lane

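To make the first two points concrete, here is a minimal Python sketch (not PostgreSQL code) of what byte-wise comparison and ASCII-only case conversion do to Latin-1 data. The word list is made up for illustration, and Python's `bytes` methods merely stand in for a server that knows nothing about locales.

```python
# Hypothetical sample data: German words stored as Latin-1 bytes.
words = ["Zebra", "Ärger", "Apfel"]
stored = [w.encode("latin-1") for w in words]

# Byte-wise ordering: 'Ä' is byte 0xC4, which is greater than 'Z' (0x5A),
# so the umlaut entry lands after everything in A-Z.
bytewise = [b.decode("latin-1") for b in sorted(stored)]
print(bytewise)

# ASCII-only upper-casing: bytes.upper() only touches a-z,
# so the 'ä' byte (0xE4) is left as it is.
ascii_upper = "ärger".encode("latin-1").upper()
print(ascii_upper)
```

Whether this matters depends on the application: data goes in and comes out unchanged, but ordering and case conversion ignore everything outside 7-bit ASCII.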
On Tue, Jan 29, 2002 at 04:31:39PM +0100, Frank Joerdens <frank@joerdens.de> wrote:
> On Tue, Jan 29, 2002 at 01:41:16PM +0100, Frank Schafer wrote:
> > On Tue, 2002-01-29 at 13:03, Frank Joerdens wrote:
> > > Call me stupid - but I am trying to understand what multibyte encoding
> > > (aka Latin1) ...
>
> ??? What did you mean??? (did your mailer screw things up so I am only
> seeing exclamation and question marks or did you try to tell me
> something that way?).

Latin 1 is not a multibyte code, so I think he was commenting on your example.

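A quick way to see the difference is to count encoded bytes per character. The following is a small Python sketch; the helper name `encoded_length` and the sample characters are chosen purely for illustration.

```python
def encoded_length(ch: str, encoding: str) -> int:
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))

# Latin-1 (ISO 8859-1): every character is exactly one byte.
latin1_len = encoded_length("ä", "latin-1")

# EUC-JP: a kanji needs more than one byte, hence "multibyte".
eucjp_len = encoded_length("日", "euc_jp")

# All 256 byte values are valid Latin-1, each mapping to one character.
all_latin1 = bytes(range(256)).decode("latin-1")

print(latin1_len, eucjp_len, len(all_latin1))
```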
On Tue, Jan 29, 2002 at 10:56:37AM -0500, Tom Lane wrote:
> Frank Joerdens <frank@joerdens.de> writes:
> > Multibyte support is mainly recommended for character sets that don't
> > fit into a single byte (Chinese, Japanese, Korean), and locale support
> > is said to be mostly sufficient for European languages . . . what escapes
> > me is why I should bother with either of these when SQL_ASCII works just
> > fine with my mostly German users. I must be missing something, right?
>
> Sort ordering of non-7-bit-ASCII characters? upper/lower case
> conversions that work as expected? locale-aware formatting options
> in to_char and friends?

Hm, yes. I overlooked that - and it *would* be useful (though no one's complained so far that their entries beginning with an umlaut don't appear in the list at the appropriate place, which is presumably partly due to the fact that not that many German words or names have an umlaut as their first character).

>
> If you don't need any of that, then you won't need locale support.
>
> I agree that you have no use for multibyte support.

What about the performance penalty that you're warned about with locales (in the admin guide)? Does multibyte support entail the same penalty? If not, then multibyte encoding, providing a superset of the locale functionality (correct?), would be better than locales, right?

Regards, Frank

On 29.01.02 18:00 +0100(+0000), Frank Joerdens wrote:
> On Tue, Jan 29, 2002 at 10:56:37AM -0500, Tom Lane wrote:
> > Frank Joerdens <frank@joerdens.de> writes:
> > > Multibyte support is mainly recommended for character sets that don't
> > > fit into a single byte (Chinese, Japanese, Korean), and locale support
> > > is said to be mostly sufficient for European languages . . . what escapes
> > > me is why I should bother with either of these when SQL_ASCII works just
> > > fine with my mostly German users. I must be missing something, right?
> >
> > Sort ordering of non-7-bit-ASCII characters? upper/lower case
> > conversions that work as expected? locale-aware formatting options
> > in to_char and friends?
>
> Hm, yes. I overlooked that - and it *would* be useful (though no one's
> complained so far that their entries beginning with an umlaut don't
> appear in the list at the appropriate place, which is presumably partly
> due to the fact that not that many German words or names have an umlaut
> as their first character).
>

And how do we know how the umlauts are supposed to be alphabetically ordered without locales? Should Ä be between A and B as in Germany or between Å and Ö at the end of the alphabet as in Scandinavia?

So the solution would be to have tables for each unibyte locale specifying the sort order...

- Einar Karttunen

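The per-locale table idea can be sketched in a few lines of Python. Both tables below are simplified assumptions made up for this illustration (German treating Ä/Ö/Ü like A/O/U, Swedish putting Å Ä Ö after Z); they are not the rules any real locale implementation ships.

```python
# Hypothetical, simplified collation tables for illustration only.
GERMAN_FOLD = str.maketrans("ÄÖÜ", "AOU")           # umlauts sort with base letter
SWEDISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"  # Å Ä Ö come after Z

def german_key(word: str):
    upper = word.upper()
    # Primary key: umlauts folded to base letters; tie-break on the original.
    return (upper.translate(GERMAN_FOLD), upper)

def swedish_key(word: str):
    # Rank each letter by its position in the Swedish alphabet;
    # anything unknown sorts after all listed letters.
    return [SWEDISH_ALPHABET.index(c) if c in SWEDISH_ALPHABET
            else len(SWEDISH_ALPHABET) for c in word.upper()]

words = ["Zorn", "Ärger", "Apfel", "Bier"]
german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)
print(german_order)
print(swedish_order)
```

The same four strings come out in two different orders, which is why the byte values alone cannot settle the question.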
On Tue, Jan 29, 2002 at 11:01:25AM -0600, Bruno Wolff III wrote:
> On Tue, Jan 29, 2002 at 04:31:39PM +0100,
> Frank Joerdens <frank@joerdens.de> wrote:
> > On Tue, Jan 29, 2002 at 01:41:16PM +0100, Frank Schafer wrote:
> > > On Tue, 2002-01-29 at 13:03, Frank Joerdens wrote:
> > > > Call me stupid - but I am trying to understand what multibyte encoding
> > > > (aka Latin1) ...
> >
> > ??? What did you mean??? (did your mailer screw things up so I am only
> > seeing exclamation and question marks or did you try to tell me
> > something that way?).
>
> Latin 1 is not a multibyte code, so I think he was commenting on your
> example.

True. What I meant was that you can't specify the encoding LATIN1 with PostgreSQL if you didn't compile in multibyte support (I know it's generally a bad plan to be so elliptical in list postings . . . ). Although technically presumably you can fit Latin1 characters into a single byte.

Hence my question was not "What do I gain from multibyte support when I don't need multibyte support?" but "what do I get from specifying Latin1 encoding (which is only available when compiling with --enable-multibyte) and what do I lose when using locales or sql_ascii?".

The advantage when using locale support over no locale support is that I can e.g. rely on ORDER BY dealing correctly with my German umlauts (to_char and friends plus like and ~ are also affected). However, you incur a performance penalty with the LIKE operator . . .

Regards, Frank

On Tue, Jan 29, 2002 at 07:29:25PM +0200, Einar Karttunen wrote:
> On 29.01.02 18:00 +0100(+0000), Frank Joerdens wrote:
> > On Tue, Jan 29, 2002 at 10:56:37AM -0500, Tom Lane wrote:
> > > Frank Joerdens <frank@joerdens.de> writes:
> > > > Multibyte support is mainly recommended for character sets that don't
> > > > fit into a single byte (Chinese, Japanese, Korean), and locale support
> > > > is said to be mostly sufficient for European languages . . . what escapes
> > > > me is why I should bother with either of these when SQL_ASCII works just
> > > > fine with my mostly German users. I must be missing something, right?
> > >
> > > Sort ordering of non-7-bit-ASCII characters? upper/lower case
> > > conversions that work as expected? locale-aware formatting options
> > > in to_char and friends?
> >
> > Hm, yes. I overlooked that - and it *would* be useful (though no one's
> > complained so far that their entries beginning with an umlaut don't
> > appear in the list at the appropriate place, which is presumably partly
> > due to the fact that not that many German words or names have an umlaut
> > as their first character).
> >
> And how do we know, how the umlauts are supposed to be alphabetically
> ordered without locales?

That's what I meant. Getting the sort order right would require you to use locales (Or some latin encoding? Does any one of the latin 1-5 imply the difference between Scandinavian and German umlaut ordering?). I didn't say I wanted to do without locales and still get the sort order right (did it sound that way?).

Regards, Frank

Frank Joerdens <frank@joerdens.de> writes:
> What about the performance penalty that you're warned about with
> locales (in the admin guide)?

You pay it if you don't select C locale at initdb time, true.

> Does multibyte support entail the same penalty?

AFAIR, MULTIBYTE doesn't kill LIKE optimization, but it does incur a generalized performance penalty on all string-mashing operators. Never tried to measure the size of the penalty; perhaps Tatsuo or Hiroshi would know.

> If not, then multibyte encoding, providing a superset of the
> locale functionality (correct?), would be better than locales, right?

MULTIBYTE is *not* a superset of LOCALE; they are independently enablable features. Offhand I don't think they are both interesting for the same character sets.

regards, tom lane

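For readers wondering what a LIKE optimization could look like: the Python sketch below assumes it amounts to turning a prefix pattern such as 'Bi%' into a range scan over byte-ordered keys. That is an illustration of the general technique, not a description of PostgreSQL's planner, and `prefix_matches` is a hypothetical helper.

```python
from bisect import bisect_left

def prefix_matches(sorted_keys, prefix: bytes):
    """Return all keys starting with `prefix` via two binary searches.

    Assumes `sorted_keys` is sorted byte-wise (as in the C locale) and that
    `prefix` is non-empty with a last byte below 0xFF.
    """
    # Smallest byte string greater than every string with this prefix.
    upper = prefix[:-1] + bytes([prefix[-1] + 1])
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]

# Hypothetical index contents, Latin-1 encoded and sorted byte-wise.
keys = sorted(w.encode("latin-1")
              for w in ["Bier", "Apfel", "Ärger", "Birne", "Zorn"])
hits = prefix_matches(keys, b"Bi")
print(hits)
```

The shortcut relies on sort order and byte order being the same thing. Under a locale-specific ordering, strings sharing a prefix are no longer guaranteed to sit between those two byte-wise bounds, which would explain why the trick is only safe in the C locale.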
Frank Joerdens <frank@joerdens.de> writes:
> Hence my question was not "What do I gain from multibyte
> support when I don't need multibyte support?" but "what do I get from
> specifying Latin1 encoding (which is only available when compiling
> with --enable-multibyte) and what do I lose when using locales or
> sql_ascii?".

You need LOCALE support if you want smarts about sort order, case conversion, etc. This is orthogonal to MULTIBYTE.

I was about to say that MULTIBYTE offers no value whatsoever if you aren't using any multibyte character sets, but that's an overstatement. One part of the MULTIBYTE feature is the ability to perform character set conversions between what's physically stored in the server and what's sent/received by clients. This could be of use even in a purely European environment if you have clients who would like to use different encodings, viz the different ISO 8859-n character sets. Or if you want translation to/from UNICODE. But if your clients all agree on the same single-byte character set, I can't see that MULTIBYTE helps you.

Also, if you need client character set conversion but all the interesting character sets are single-byte, there's a simpler feature called CYR_RECODE that just does recoding during client I/O without any of the internal-processing penalties that MULTIBYTE carries. I don't think the CYR_RECODE code is as well-tested as the MULTIBYTE code, but it'll never get there unless people use it...

regards, tom lane

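The server-vs-client conversion described above can be pictured with a short Python sketch. It only mimics the idea of recoding at the I/O boundary; the function name `recode` and the sample string are assumptions for illustration, and nothing here reflects how PostgreSQL implements it.

```python
def recode(data: bytes, server_encoding: str, client_encoding: str) -> bytes:
    """Convert bytes as stored on the 'server' into the 'client' encoding."""
    return data.decode(server_encoding).encode(client_encoding)

# Hypothetical stored value: German text kept as Latin-1 on the server.
stored = "Grüße".encode("latin-1")

# A client that wants Unicode (UTF-8) gets different bytes for ü and ß ...
as_utf8 = recode(stored, "latin-1", "utf-8")

# ... while a client on ISO 8859-15 happens to get identical bytes here.
as_latin9 = recode(stored, "latin-1", "iso8859-15")

print(stored, as_utf8, as_latin9)
```

If every client already speaks the server's single-byte encoding, the conversion is an identity and buys nothing, which matches the point made above. A character with no slot in the target set (say, Cyrillic text sent to a Latin-1 client) cannot be converted at all; in this sketch Python raises `UnicodeEncodeError`.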
On Tue, Jan 29, 2002 at 01:54:04PM -0500, Tom Lane wrote:
> Frank Joerdens <frank@joerdens.de> writes:
> > What about the performance penalty that you're warned about with
> > locales (in the admin guide)?
>
> You pay it if you don't select C locale at initdb time, true.
>
> > Does multibyte support entail the same penalty?
>
> AFAIR, MULTIBYTE doesn't kill LIKE optimization, but it does incur
> a generalized performance penalty on all string-mashing operators.
> Never tried to measure the size of the penalty; perhaps Tatsuo or
> Hiroshi would know.
>
> > If not, then multibyte encoding, providing a superset of the
> > locale functionality (correct?), would be better than locales, right?
>
> MULTIBYTE is *not* a superset of LOCALE; they are independently
> enablable features. Offhand I don't think they are both interesting
> for the same character sets.

Ok. But a big advantage then of multibyte vs. locales would be that with locales I get the performance hit for *all* databases that are hosted under a particular Pg installation (because initdb settings affect all databases), whereas with multibyte I get to choose, on a per-database basis (via createdb or set server_encoding), when I want the locale support, and when performance is more important.

This line of reasoning obviously only makes any sense if, functionality-wise, I don't lose anything by using multibyte instead of locales (which is what I was trying to say by X provides a superset, in terms of functionality, of Y . . . not that locale support and multibyte support are related otherwise, e.g. by sharing bits of code).

Regards, Frank

On Tue, Jan 29, 2002 at 02:14:33PM -0500, Tom Lane wrote:
> Frank Joerdens <frank@joerdens.de> writes:
> > Hence my question was not "What do I gain from multibyte
> > support when I don't need multibyte support?" but "what do I get from
> > specifying Latin1 encoding (which is only available when compiling
> > with --enable-multibyte) and what do I lose when using locales or
> > sql_ascii?".
>
> You need LOCALE support if you want smarts about sort order, case
> conversion, etc. This is orthogonal to MULTIBYTE.

OK! That answers my question (didn't see your mail a few minutes ago when I posted my last).

Actually, just out of curiosity, then how do you sort Chinese, for instance . . . ? I happen to know that Chinese dictionaries are usually ordered by so-called radicals: combinations of strokes, of which there are about 250, that appear in any of the 4000 or so (simplified mainland Chinese) characters.

Regards, Frank