Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields - Mailing list pgsql-odbc

From Dave Page
Subject Re: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
Date
Msg-id E7F85A1B5FF8D44C8A1AF6885BC9A0E4011C9946@ratbert.vale-housing.co.uk
In response to Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields  (Johann Zuschlag <zuschlag2@online.de>)
List pgsql-odbc

> -----Original Message-----
> From: Johann Zuschlag [mailto:zuschlag2@online.de]
> Sent: 30 March 2006 20:41
> To: Dave Page
> Cc: Hiroshi Inoue; pgsql-odbc@postgresql.org
> Subject: Unicode is not UTF-8. was :psqlODBC-Driver Test / text fields
>
> Dave Page schrieb:
> > If 'Ã¶' is 'ö', then isn't the query above mixing a single-byte
> and a multibyte encoding? I.e. it should all be single byte - e.g.
> >
> > select name from kunde where name >= 'ö' order by name asc;
> >
> > Or all multibyte (displayed byte by byte) whatever that results in:
> >
> > s*e*l*e*c*t* *n*a*m*e* *f*r*o*m* *k*u*n*d*e* *w*h*e*r*e* *n*a*m*e*
> > *>*=* *'*ö'*;*
> >
> > Of course, we all know how well I grok encoding issues :-)
> >
> Hi Dave,
>
> I can understand you. These encoding issues sometimes drive me
> crazy, too. :-)
>
> The thing about UTF-8 is that all ASCII characters are
> represented by one byte, while non-ASCII characters take two
> to four bytes (German umlauts, for example, take two). That's why
> UTF-8 is called a "variable-length multibyte encoding". In a
> fixed-width two-byte encoding of the U+xxxx code points (UCS-2),
> every character is represented by two bytes (fixed-length
> multibyte encoding). So Unicode is not the same as UTF-8, even
> though the PostgreSQL documentation states that.
>
> If you like, see: http://www.utf8-chartable.de/ or some
> explanation at http://czyborra.com/utf/

Ahh, thanks for the explanation.
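
As an illustration of the byte counts discussed above (this is not from the original thread), here is a minimal Python sketch. The sample characters and the helper name byte_length are assumptions chosen for the example. It compares the encoded length of a few characters in UTF-8, UTF-16 and Latin-1, and then shows what UTF-8 bytes look like when they are misread as Latin-1.

```python
# Compare how many bytes each character needs in several encodings.
samples = ["a", "ö", "€"]
encodings = ["utf-8", "utf-16-le", "latin-1"]


def byte_length(ch, encoding):
    """Return the encoded length in bytes, or None if the
    character cannot be represented in that encoding."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


# Table of lengths: lengths[character][encoding] -> byte count or None.
lengths = {
    ch: {enc: byte_length(ch, enc) for enc in encodings}
    for ch in samples
}

for ch, row in lengths.items():
    print(repr(ch), row)

# Decoding UTF-8 bytes as Latin-1 yields one character per byte,
# which is how a two-byte 'ö' ends up displayed as two characters.
mojibake = "ö".encode("utf-8").decode("latin-1")
print(mojibake)
```

In UTF-8 the length varies per character, while in UTF-16 all three samples take two bytes. Latin-1 needs one byte for 'ö' but cannot represent '€' at all.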

> Windows XP supports ANSI, UTF-8, Unicode and Unicode Big Endian.
> Unfortunately (or fortunately?) Windows seems to use UTF-8
> for European languages. Hiroshi can you explain that? I guess
> the Japanese edition of Windows XP is using pure 2 byte Unicode.

Ahh, now I do know that Windows does not fully support UTF-8. That's the very reason why it is not supported in
PostgreSQL 8.0 on Windows, and why 8.1 and above require conversion routines, added to the server by Magnus
Hagander, to convert to UCS2(?) before doing any sorting etc.
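
To show why sorting and comparison depend on how the text is interpreted, here is a small Python sketch. It is an illustration only and does not reproduce what the server or the driver actually does; the sample names and the helper utf8_key are made up for the example. It orders strings by their raw UTF-8 bytes and applies a >= 'ö' filter like the one in the test query.

```python
# Hypothetical sample data; not taken from the "kunde" table in the thread.
names = ["Ober", "Zander", "Öl", "ölig", "otto"]


def utf8_key(s):
    """Sort key that compares strings by their raw UTF-8 bytes."""
    return s.encode("utf-8")


# A plain byte-wise ordering, with no locale-aware collation applied.
byte_sorted = sorted(names, key=utf8_key)
print(byte_sorted)

# Byte-wise equivalent of: where name >= 'ö'
matches = [n for n in byte_sorted if utf8_key(n) >= utf8_key("ö")]
print(matches)
```

In a byte-wise comparison every ASCII letter sorts before both 'Ö' and 'ö', so the umlaut names end up after "Zander" and "otto". A locale-aware collation would typically place them differently, which is why the comparison routine has to know the encoding.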

> I can't say anything about psql. But the new psqlodbc driver
> 7.03.26X seems to handle that situation very well.
>
> So I suppose the test was valid to a certain extent, since
> the characters are handled in this mixed way in Win XP. I
> still have some funny behaviour with Unicode in psql (even
> after setting LC_COLLATE correctly :-) ).
>
> For my production machines I will use ISO-8859-1 (or
> ISO-8859-15) anyway. Then the driver will convert all characters to
> single byte, avoiding all kinds of problems.
>
> But feel free to ask me for tests... ;-)

I'll need to leave that to Hiroshi - we already know we're past my knowledge on the subject :-)

Regards, Dave.
