Thread: Multibyte encoding vs. SQL_ASCII and European languages

Multibyte encoding vs. SQL_ASCII and European languages

From
Frank Joerdens
Date:
Call me stupid - but I am trying to understand what multibyte encoding
(aka Latin1) would buy me with English/German/French etc. languages (the
app I am currently slapping together will be mostly used by people
writing in German). SQL_ASCII seems to work just fine with German
umlauts and other funny characters that I am stuffing into, or pulling
out of the database (that is, after having survived the nightmare of
importing Filemaker data so as to have umlauts etc. correctly
represented). And what exactly does the server vs. client encoding do?

Thanks, Frank

Re: Multibyte encoding vs. SQL_ASCII and European

From
Frank Schafer
Date:
On Tue, 2002-01-29 at 13:03, Frank Joerdens wrote:
> Call me stupid - but I am trying to understand what multibyte encoding
> (aka Latin1) ...

!!!!!!???????????!!!...!!!!!!...???????????

so Latin1 is MULTIBYTE ?????????!!!!!!!..!!!...?????????????

Regards
Frank ( too ;o)




Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Frank Joerdens
Date:
On Tue, Jan 29, 2002 at 01:41:16PM +0100, Frank Schafer wrote:
> On Tue, 2002-01-29 at 13:03, Frank Joerdens wrote:
> > Call me stupid - but I am trying to understand what multibyte encoding
> > (aka Latin1) ...
>
> !!!!!!???????????!!!...!!!!!!...???????????
>
> so Latin1 is MULTIBYTE ?????????!!!!!!!..!!!...?????????????
>
> Regards
> Frank ( too ;o)
              ^^
              and what is that emoticon?

??? What did you mean??? (did your mailer screw things up so I am only
seeing exclamation and question marks or did you try to tell me
something that way?).

By way of explaining myself a little better maybe: Looking at the
relevant section in the admin guide, which is entitled 'Localization',
you get the impression that either locale support or multibyte support
are good things to have if you are not in an English environment.
Multibyte support is mainly recommended for character sets that don't
fit into a single byte (Chinese, Japanese, Korean), and locale support
is said to be mostly sufficient for European languages . . . what escapes
me is why I should bother with either of these when SQL_ASCII works just
fine with my mostly German users. I must be missing something, right?

Regards, Frank

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Tom Lane
Date:
Frank Joerdens <frank@joerdens.de> writes:
> Multibyte support is mainly recommended for character sets that don't
> fit into a single byte (Chinese, Japanese, Korean), and locale support
> is said to be mostly sufficient for European languages . . . what escapes
> me is why I should bother with either of these when SQL_ASCII works just
> fine with my mostly German users. I must be missing something, right?

Sort ordering of non-7-bit-ASCII characters?  upper/lower case
conversions that work as expected?  locale-aware formatting options
in to_char and friends?

If you don't need any of that, then you won't need locale support.

I agree that you have no use for multibyte support.

            regards, tom lane
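
An illustration of the sort-order and case-conversion points above, in Python rather than PostgreSQL. This is only a rough analogy, assuming that without locale support the strings are compared byte by byte; the sample names are made up.

```python
# Hypothetical sample data, not taken from the thread.
names = ["Zebra", "Ärger", "Apfel", "Öl", "Ofen"]

# Encode as Latin-1 so that every character occupies exactly one byte.
raw = [name.encode("latin-1") for name in names]

# Plain byte comparison: every umlaut byte (>= 0x80) sorts after 'Z' (0x5A).
byte_sorted = [b.decode("latin-1") for b in sorted(raw)]
print(byte_sorted)  # ['Apfel', 'Ofen', 'Zebra', 'Ärger', 'Öl']

# bytes.upper() only touches ASCII letters, so the umlaut byte is left alone.
upper_raw = "ärger".encode("latin-1").upper()
print(upper_raw)  # b'\xe4RGER'
```

A German reader would expect "Ärger" near "Apfel" and an uppercase "Ä" in the second result; neither happens when the bytes are not interpreted.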

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Bruno Wolff III
Date:
On Tue, Jan 29, 2002 at 04:31:39PM +0100,
  Frank Joerdens <frank@joerdens.de> wrote:
> On Tue, Jan 29, 2002 at 01:41:16PM +0100, Frank Schafer wrote:
> > On Tue, 2002-01-29 at 13:03, Frank Joerdens wrote:
> > > Call me stupid - but I am trying to understand what multibyte encoding
> > > (aka Latin1) ...
>
> ??? What did you mean??? (did your mailer screw things up so I am only
> seeing exclamation and question marks or did you try to tell me
> something that way?).

Latin 1 is not a multibyte code, so I think he was commenting on your
example.

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Frank Joerdens
Date:
On Tue, Jan 29, 2002 at 10:56:37AM -0500, Tom Lane wrote:
> Frank Joerdens <frank@joerdens.de> writes:
> > Multibyte support is mainly recommended for character sets that don't
> > fit into a single byte (Chinese, Japanese, Korean), and locale support
> > is said to be mostly sufficient for European languages . . . what escapes
> > me is why I should bother with either of these when SQL_ASCII works just
> > fine with my mostly German users. I must be missing something, right?
>
> Sort ordering of non-7-bit-ASCII characters?  upper/lower case
> conversions that work as expected?  locale-aware formatting options
> in to_char and friends?

Hm, yes. I overlooked that - and it *would* be useful (though no one's
complained so far that their entries beginning with an umlaut don't
appear in the list at the appropriate place, which is presumably partly
due to the fact that not that many German words or names have an umlaut
as their first character).

>
> If you don't need any of that, then you won't need locale support.
>
> I agree that you have no use for multibyte support.

What about the performance penalty that you're warned about with
locales (in the admin guide)? Does multibyte support entail the same
penalty? If not, then multibyte encoding, providing a superset of the
locale functionality (correct?), would be better than locales, right?

Regards, Frank

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Einar Karttunen
Date:
On 29.01.02 18:00 +0100(+0000), Frank Joerdens wrote:
> On Tue, Jan 29, 2002 at 10:56:37AM -0500, Tom Lane wrote:
> > Frank Joerdens <frank@joerdens.de> writes:
> > > Multibyte support is mainly recommended for character sets that don't
> > > fit into a single byte (Chinese, Japanese, Korean), and locale support
> > > is said to be mostly sufficient for European languages . . . what escapes
> > > me is why I should bother with either of these when SQL_ASCII works just
> > > fine with my mostly German users. I must be missing something, right?
> >
> > Sort ordering of non-7-bit-ASCII characters?  upper/lower case
> > conversions that work as expected?  locale-aware formatting options
> > in to_char and friends?
>
> Hm, yes. I overlooked that - and it *would* be useful (though no one's
> complained so far that their entries beginning with an umlaut don't
> appear in the list at the appropriate place, which is presumably partly
> due to the fact that not that many German words or names have an umlaut
> as their first character).
>
And how do we know how the umlauts are supposed to be alphabetically
ordered without locales? Should Ä be between A and B as in Germany
or between Å and Ö at the end of the alphabet as in Scandinavia?

So the solution would be to have tables for each unibyte locale
specifying the sort order...

- Einar Karttunen
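
The idea of one sort-order table per locale can be sketched like this (Python, illustration only). The two alphabet strings and the `make_key` helper are simplified assumptions for the example, not real locale data.

```python
# Simplified, hypothetical sort tables: one string per locale, in sort order.
GERMAN = "aäbcdefghijklmnoöpqrsßtuüvwxyz"    # umlaut right after its base letter
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"    # å, ä, ö at the end of the alphabet

def make_key(alphabet):
    """Build a sort key function that ranks characters by their table position."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)  # characters missing from the table sort last
    def key(word):
        return [rank.get(ch, unknown) for ch in word.lower()]
    return key

words = ["Zebra", "Ärger", "Apfel", "Bach"]
german_order = sorted(words, key=make_key(GERMAN))
swedish_order = sorted(words, key=make_key(SWEDISH))
print(german_order)   # ['Apfel', 'Ärger', 'Bach', 'Zebra']
print(swedish_order)  # ['Apfel', 'Bach', 'Zebra', 'Ärger']
```

The same four words come out in a different order depending on which table is used, which is why the encoding alone cannot decide the sort order.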

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Frank Joerdens
Date:
On Tue, Jan 29, 2002 at 11:01:25AM -0600, Bruno Wolff III wrote:
> On Tue, Jan 29, 2002 at 04:31:39PM +0100,
>   Frank Joerdens <frank@joerdens.de> wrote:
> > On Tue, Jan 29, 2002 at 01:41:16PM +0100, Frank Schafer wrote:
> > > On Tue, 2002-01-29 at 13:03, Frank Joerdens wrote:
> > > > Call me stupid - but I am trying to understand what multibyte encoding
> > > > (aka Latin1) ...
> >
> > ??? What did you mean??? (did your mailer screw things up so I am only
> > seeing exclamation and question marks or did you try to tell me
> > something that way?).
>
> Latin 1 is not a multibyte code, so I think he was commenting on your
> example.

True. What I meant was that you can't specify the encoding LATIN1 with
PostgreSQL if you didn't compile in multibyte support (I know it's
generally a bad plan to be so elliptical in list postings . . . ).
Although technically presumably you can fit Latin1 characters into a
single byte. Hence my question was not "What do I gain from multibyte
support when I don't need multibyte support?" but "what do I get from
specifying Latin1 encoding (which is only available when compiling
with --enable-multibyte) and what do I lose when using locales or
sql_ascii?". The advantage when using locale support over no locale
support is that I can e.g. rely on ORDER BY dealing correctly with my
German umlauts (to_char and friends plus like and ~ are also affected).
However, you incur a performance penalty with the LIKE operator . . .

Regards, Frank

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Frank Joerdens
Date:
On Tue, Jan 29, 2002 at 07:29:25PM +0200, Einar Karttunen wrote:
> On 29.01.02 18:00 +0100(+0000), Frank Joerdens wrote:
> > On Tue, Jan 29, 2002 at 10:56:37AM -0500, Tom Lane wrote:
> > > Frank Joerdens <frank@joerdens.de> writes:
> > > > Multibyte support is mainly recommended for character sets that don't
> > > > fit into a single byte (Chinese, Japanese, Korean), and locale support
> > > > is said to be mostly sufficient for European languages . . . what escapes
> > > > me is why I should bother with either of these when SQL_ASCII works just
> > > > fine with my mostly German users. I must be missing something, right?
> > >
> > > Sort ordering of non-7-bit-ASCII characters?  upper/lower case
> > > conversions that work as expected?  locale-aware formatting options
> > > in to_char and friends?
> >
> > Hm, yes. I overlooked that - and it *would* be useful (though no one's
> > complained so far that their entries beginning with an umlaut don't
> > > appear in the list at the appropriate place, which is presumably partly
> > due to the fact that not that many German words or names have an umlaut
> > as their first character).
> >
> And how do we know how the umlauts are supposed to be alphabetically
> ordered without locales?

That's what I meant. Getting the sort order right would require you to
use locales (Or some latin encoding? Does any one of the latin 1-5 imply
the difference between Scandinavian and German umlaut ordering?). I
didn't say I wanted to do without locales and still get the sort order
right (did it sound that way?).

Regards, Frank

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Tom Lane
Date:
Frank Joerdens <frank@joerdens.de> writes:
> What about the performance penalty that you're warned about with
> locales (in the admin guide)?

You pay it if you don't select C locale at initdb time, true.

> Does multibyte support entail the same penalty?

AFAIR, MULTIBYTE doesn't kill LIKE optimization, but it does incur
a generalized performance penalty on all string-mashing operators.
Never tried to measure the size of the penalty; perhaps Tatsuo or
Hiroshi would know.

> If not, then multibyte encoding, providing a superset of the
> locale functionality (correct?), would be better than locales, right?

MULTIBYTE is *not* a superset of LOCALE; they are independently
enablable features.  Offhand I don't think they are both interesting
for the same character sets.

            regards, tom lane
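
A rough sketch of why the LIKE optimization depends on the C locale (Python, illustration only). This reflects my understanding rather than anything stated in the thread: a pattern with a fixed prefix can be answered as a range lookup in a sorted index, but only if the index order matches plain byte (here: code point) order. The `prefix_range` helper and the word list are hypothetical.

```python
import bisect

def prefix_range(sorted_words, prefix):
    """Return all entries starting with a non-empty prefix via two binary searches.

    Only correct if sorted_words is in plain code point order, the analogue
    of the C locale; a locale-specific order would break the range trick.
    """
    # Smallest string that sorts after every string beginning with the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, upper)
    return sorted_words[lo:hi]

index = sorted(["Apfel", "Arzt", "Bach", "Baum", "Zebra", "Ärger"])
matches = prefix_range(index, "Ba")
print(matches)  # ['Bach', 'Baum']
```

With a locale-aware ordering the entries matching a prefix are not guaranteed to form one contiguous byte range, so a shortcut like this no longer applies and every row has to be checked.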

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Tom Lane
Date:
Frank Joerdens <frank@joerdens.de> writes:
> Hence my question was not "What do I gain from multibyte
> support when I don't need multibyte support?" but "what do I get from
> specifying Latin1 encoding (which is only available when compiling
> with --enable-multibyte) and what do I lose when using locales or
> sql_ascii?".

You need LOCALE support if you want smarts about sort order, case
conversion, etc.  This is orthogonal to MULTIBYTE.

I was about to say that MULTIBYTE offers no value whatsoever if you
aren't using any multibyte character sets, but that's an overstatement.
One part of the MULTIBYTE feature is the ability to perform character
set conversions between what's physically stored in the server and
what's sent/received by clients.  This could be of use even in a
purely European environment if you have clients who would like to use
different encodings, viz the different ISO 8859-n character sets.
Or if you want translation to/from UNICODE.

But if your clients all agree on the same single-byte character set,
I can't see that MULTIBYTE helps you.

Also, if you need client character set conversion but all the
interesting character sets are single-byte, there's a simpler feature
called CYR_RECODE that just does recoding during client I/O without
any of the internal-processing penalties that MULTIBYTE carries.
I don't think the CYR_RECODE code is as well-tested as the MULTIBYTE
code, but it'll never get there unless people use it...

            regards, tom lane
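
The server-to-client character set conversion described above can be pictured with a small sketch (Python, illustration only). The `recode` helper is hypothetical and uses Python's own codecs; it says nothing about how PostgreSQL implements the conversion.

```python
def recode(data: bytes, server_encoding: str, client_encoding: str) -> bytes:
    """Hypothetical helper: convert stored bytes into the client's encoding."""
    return data.decode(server_encoding).encode(client_encoding)

# Stored as Latin-1 (one byte per character), delivered to a UTF-8 client.
stored = "Müller".encode("latin-1")
as_utf8 = recode(stored, "latin-1", "utf-8")
print(stored)   # b'M\xfcller'
print(as_utf8)  # b'M\xc3\xbcller'

# Not every ISO 8859-n character exists in every other set: Polish text
# stored as ISO 8859-2 cannot be delivered to a Latin-1 client.
try:
    recode("Łódź".encode("iso8859-2"), "iso8859-2", "latin-1")
    conversion_failed = False
except UnicodeEncodeError:
    conversion_failed = True
print(conversion_failed)  # True
```

If every client already uses the same single-byte set as the server, the conversion is the identity and buys nothing, which matches the point made in the message.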

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Frank Joerdens
Date:
On Tue, Jan 29, 2002 at 01:54:04PM -0500, Tom Lane wrote:
> Frank Joerdens <frank@joerdens.de> writes:
> > What about the performance penalty that you're warned about with
> > locales (in the admin guide)?
>
> You pay it if you don't select C locale at initdb time, true.
>
> > Does multibyte support entail the same penalty?
>
> AFAIR, MULTIBYTE doesn't kill LIKE optimization, but it does incur
> a generalized performance penalty on all string-mashing operators.
> Never tried to measure the size of the penalty; perhaps Tatsuo or
> Hiroshi would know.
>
> > If not, then multibyte encoding, providing a superset of the
> > locale functionality (correct?), would be better than locales, right?
>
> MULTIBYTE is *not* a superset of LOCALE; they are independently
> enablable features.  Offhand I don't think they are both interesting
> for the same character sets.

Ok. But a big advantage then of multibyte vs. locales would be that with
locales I get the performance hit for *all* databases that are hosted
under a particular Pg installation (because initdb settings affect all
databases), whereas with multibyte I get to choose, on a per-database
basis (via createdb or set server_encoding), when I want the locale
support, and when performance is more important.

This line of reasoning obviously only makes any sense if,
functionality-wise, I don't lose anything by using multibyte instead of
locales (which is what I was trying to say by X provides a superset, in
terms of functionality, of Y . . . not that locale support and multibyte
support are related otherwise, e.g. by sharing bits of code).

Regards, Frank

Re: Multibyte encoding vs. SQL_ASCII vs. locales and European languages

From
Frank Joerdens
Date:
On Tue, Jan 29, 2002 at 02:14:33PM -0500, Tom Lane wrote:
> Frank Joerdens <frank@joerdens.de> writes:
> > Hence my question was not "What do I gain from multibyte
> > support when I don't need multibyte support?" but "what do I get from
> > specifying Latin1 encoding (which is only available when compiling
> > with --enable-multibyte) and what do I lose when using locales or
> > sql_ascii?".
>
> You need LOCALE support if you want smarts about sort order, case
> conversion, etc.  This is orthogonal to MULTIBYTE.

OK! That answers my question (didn't see your mail a few minutes ago
when I posted my last).

Actually, just out of curiosity, then how do you sort Chinese, for
instance . . . ? I happen to know that Chinese dictionaries are usually
ordered by so-called radicals: combinations of strokes, of which there
are about 250, that appear in any of the 4000 or so (simplified mainland
Chinese) characters.

Regards, Frank