Thread: Unicode problem ???

Unicode problem ???

From

"Priem, Alexander"

Date:

21 April 2004, 09:38:32

Hi everyone,

I have a PostgreSQL 7.4 database running, which was initdb-ed using standard
(SQL_ASCII) encoding, with -lc-collate=C option set.

Everything is running fine, but I just discovered something funny. If text
containing characters like ë is inserted into the database (text field), the
inserting goes fine. When I view this data using phpPgAdmin (v3.2.1) it
looks fine, including these special characters.

HOWEVER, when I get this data from within a Delphi application, using
psqlODBC (07.03.0200), the special characters get corrupted (ë becomes
something like Ã«). It seems to be a unicode problem.

What do you think, should I re-create the database using UNICODE encoding?
I'd like to know beforehand if this will solve the problem, since I do not
want to recreate the database just to find out that it does not fix things.

Or does the problem involve psqlODBC. I looked into some of its settings,
but I can't find anything that involves character encoding...

I hope someone can give me a pointer here.

Thanks in advance,
Alexander Priem
The Netherlands.

Re: Unicode problem ???

From

"Stijn Vanroye"

Date:

21 April 2004, 10:17:36

> Hi everyone,
>
> I have a PostgreSQL 7.4 database running, which was initdb-ed
> using standard
> (SQL_ASCII) encoding, with -lc-collate=C option set.
>
> Everything is running fine, but I just discovered something
> funny. If text
> containing characters like ë is inserted into the database
> (text field), the
> inserting goes fine. When I view this data using phpPgAdmin
> (v3.2.1) it
> looks fine, including these special characters.
>
> HOWEVER, when I get this data from within a Delphi application, using
> psqlODBC (07.03.0200), the special characters get corrupted (ë becomes
> something like Ã«). It seems to be a unicode problem.
>
> What do you think, should I re-create the database using
> UNICODE encoding?
> I'd like to know beforehand if this will solve the problem,
> since I do not
> want to recreate the database just to find out that it does
> not fix things.
>
> Or does the problem involve psqlODBC. I looked into some of
> its settings,
> but I can't find anything that involves character encoding...
>
> I hope someone can give me a pointer here.
>
> Thanks in advance,
> Alexander Priem
> The Netherlands.
>
I'm a Delphi (7 Enterprise) user myself. I've ran into that sort of encoding problems to. On the postgres backend I use
UNICODE. 

I'll tell you what I know, I hope it helps a bit.
First off: there seems to be a difference between the datatypes string and widestring in Delphi. widestring is
recommendedfor use with 'wide byte strings' (or something like that), for example UNICODE. string variables however
can'tbe trusted for the full 100% in that case. I don't know exactly how it worked, but I'm sure some browsing in the
DelphiHelp and other rescources will explain things. 

Second: I know that in my case I have to set the CLIENT_ENCODING TO 'LATIN1' in my odbc connect string to make sure
thatall the characters are translated correct.  

I'm sorry that I can't be more specific about these things since I don't remember exactly WHY these things are done,
butI do know they help :-)  This was all set-up and figured out by one of my colleagues a while ago. I was not excactly
intothe database itself, but restricted my work to delphi coding en general development. 
Therefore I hope that someone else can shed more light onto the matter so I can learn a bit too :-)

Kind Regards,

Stijn Vanroye.

Re: Unicode problem ???

From

Peter Eisentraut

Date:

21 April 2004, 10:23:16

Am Mittwoch, 21. April 2004 14:37 schrieb Priem, Alexander:
> I have a PostgreSQL 7.4 database running, which was initdb-ed using
> standard (SQL_ASCII) encoding, with -lc-collate=C option set.
>
> Everything is running fine, but I just discovered something funny. If text
> containing characters like ë is inserted into the database (text field),
> the inserting goes fine. When I view this data using phpPgAdmin (v3.2.1) it
> looks fine, including these special characters.

You can try to update pg_database to change the database encoding to LATIN1.
SQL_ASCII is only totally supported for, well, ASCII characters.

Re: Unicode problem ???

From

"Priem, Alexander"

Date:

21 April 2004, 10:46:51

> > Hi everyone,
> >
> > I have a PostgreSQL 7.4 database running, which was initdb-ed
> > using standard
> > (SQL_ASCII) encoding, with -lc-collate=C option set.
> >
> > Everything is running fine, but I just discovered something
> > funny. If text
> > containing characters like ë is inserted into the database
> > (text field), the
> > inserting goes fine. When I view this data using phpPgAdmin
> > (v3.2.1) it
> > looks fine, including these special characters.
> >
> > HOWEVER, when I get this data from within a Delphi application, using
> > psqlODBC (07.03.0200), the special characters get corrupted (ë becomes
> > something like Ã«). It seems to be a unicode problem.
> >
> > What do you think, should I re-create the database using
> > UNICODE encoding?
> > I'd like to know beforehand if this will solve the problem,
> > since I do not
> > want to recreate the database just to find out that it does
> > not fix things.
> >
> > Or does the problem involve psqlODBC. I looked into some of
> > its settings,
> > but I can't find anything that involves character encoding...
> >
> > I hope someone can give me a pointer here.
> >
> > Thanks in advance,
> > Alexander Priem
> > The Netherlands.
> >
> I'm a Delphi (7 Enterprise) user myself. I've ran into that sort of
> encoding problems to. On the postgres backend I use UNICODE.
>
> I'll tell you what I know, I hope it helps a bit.
> First off: there seems to be a difference between the datatypes string and

> widestring in Delphi. widestring is recommended for use with 'wide byte
> strings' (or something like that), for example UNICODE. string variables
> however can't be trusted for the full 100% in that case. I don't know
> exactly how it worked, but I'm sure some browsing in the Delphi Help and
> other rescources will explain things.
>
> Second: I know that in my case I have to set the CLIENT_ENCODING TO
> 'LATIN1' in my odbc connect string to make sure that all the characters
> are translated correct.
>
> I'm sorry that I can't be more specific about these things since I don't
> remember exactly WHY these things are done, but I do know they help :-)
> This was all set-up and figured out by one of my colleagues a while ago. I

> was not excactly into the database itself, but restricted my work to
> delphi coding en general development.
> Therefore I hope that someone else can shed more light onto the matter so
> I can learn a bit too :-)
>
> Kind Regards,
>
> Stijn Vanroye.

I just tried something else. If I use SQL Explorer (A Borland tool for
manipulating/viewing databases through DSN's) to look at this database,
using the same (psqlODBC) DSN, I also get the wrong characters. So it seems
Delphi doesn't have anything to do with it.

On the other hand, SQL Explorer is also a Borland product...

But I also just tried to view this field as a widestring. I queried the
database and used ShowMessage(VarToWideStr(rs.Fields[1].Value)) to view the
text. Didn't help...

I also tried to change the client_encoding via SQL. I just gave a "set
client_encoding to 'unicode'" command, followed by the command to retrieve
the text. Also didn't help. "set client_encoding to 'latin1'" didn't help
either.

Is this the correct way to change the encoding or do I needd to rebuild the
database? Maybe I am not really changing the encoding this way???

Alexander Priem.

Re: Unicode problem ???

From

"Priem, Alexander"

Date:

21 April 2004, 11:37:48

Hi Stijn,

I tried adding "set client_encoding to 'LATIN1'" to the Connect Settings of
psqlODBC. It didn't help. I still get the same weird characters when I look
at the data using SQL explorer.

Could this be due to the fact that the database was CREATED using SQL_ASCII
encoding? Maybe your solution only works when the database was created using
LATIN1 or UNICODE encoding.

The strange thing is that everything is FINE when I use phpPgAdmin to look
at the data, so there must be a conversion problem somewhere in the ODBC
layer...

Maybe I'll just try recreating the database using UNICODE or LATIN1 encoding
tomorrow. UNICODE seems the best, right? Or does LATIN1 have more
possibilities than UNICODE?

Groeten vanuit Veenendaal,

Alexander.

Re: Unicode problem ???

From

"Stijn Vanroye"

Date:

21 April 2004, 11:52:32

> Hi Stijn,
Hy Alexander,

> I tried adding "set client_encoding to 'LATIN1'" to the
> Connect Settings of
> psqlODBC. It didn't help. I still get the same weird
> characters when I look
> at the data using SQL explorer.
Damn (pardon me). I realy hoped that would do the trick.

> Could this be due to the fact that the database was CREATED
> using SQL_ASCII
> encoding? Maybe your solution only works when the database
> was created using
> LATIN1 or UNICODE encoding.
Could be, as I said I use Unicode. According to the email from Peter Eisentraut on the same topic:
//Quote
SQL_ASCII is only totally supported for, well, ASCII characters.
//End Quote
Sounds to me that SQL_ASCII is more limited then UNICODE.

> The strange thing is that everything is FINE when I use
> phpPgAdmin to look
> at the data, so there must be a conversion problem somewhere
> in the ODBC
> layer...
Guess so to. I think that's a question for the developpers.
In any case there's a difference between what the ODBC does and wath phpPgAdmin does.
I've always used pgAdmin (currently III). No problems there eigther.

> Maybe I'll just try recreating the database using UNICODE or
> LATIN1 encoding
> tomorrow. UNICODE seems the best, right? Or does LATIN1 have more
> possibilities than UNICODE?
Of what I hear, UNICODE indeed seems the best option. But then again, that encoding stuff is still a bit of a mistery
tome. 
What I personally don't understand is: if all my databases are UNICODE, why do I have to set the Client encoding to
latin1to get a correct result? 

> Groeten vanuit Veenendaal,
>
> Alexander.

Stijn.

Re: Unicode problem ???

From

"John Sidney-Woollett"

Date:

21 April 2004, 12:05:12

Priem, Alexander said:
> Could this be due to the fact that the database was CREATED using
> SQL_ASCII
> encoding? Maybe your solution only works when the database was created
> using
> LATIN1 or UNICODE encoding.

Yes, I suspect

> Maybe I'll just try recreating the database using UNICODE or LATIN1
> encoding
> tomorrow. UNICODE seems the best, right? Or does LATIN1 have more
> possibilities than UNICODE?

Unicode has a larger character set than latin-1. But if you only need to
support latin-1 then use that... Of course, if you have to upgrade to
support unicode later, you'll wish you had started off using latin-1!

I'm no expert on this, so I hope this info is correct.

John Sidney-Woollett

Re: Unicode problem ???

From

Alexander Antonakakis

Date:

21 April 2004, 12:26:32

I am also using postgres database with delphi. Therefor I don't use ODBC
but I use the Zeos Database components for delphi.
I had similar problems though. My database is in iso8859-7 and I needed
Greek chars to appear corectly in my application. What I did is I am
running an sql statement when my application is starting up (after my
datasource connection to the db etc) and I am setting my client's
encoding to iso8859-7. No problems since then ... Greeks show ok .. they
are stored ok and the search on the data works file.
Hope I helped a little

Alexander Antonakakis

John Sidney-Woollett wrote:

> Priem, Alexander said:
>
>>Could this be due to the fact that the database was CREATED using
>>SQL_ASCII
>>encoding? Maybe your solution only works when the database was created
>>using
>>LATIN1 or UNICODE encoding.
>
>
> Yes, I suspect
>
>
>>Maybe I'll just try recreating the database using UNICODE or LATIN1
>>encoding
>>tomorrow. UNICODE seems the best, right? Or does LATIN1 have more
>>possibilities than UNICODE?
>
>
> Unicode has a larger character set than latin-1. But if you only need to
> support latin-1 then use that... Of course, if you have to upgrade to
> support unicode later, you'll wish you had started off using latin-1!
>
> I'm no expert on this, so I hope this info is correct.
>
> John Sidney-Woollett
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faqs/FAQ.html
>
>

Re: Unicode problem ???

From

Harald Fuchs

Date:

21 April 2004, 13:33:21

In article <200404211522.39036.peter_e@gmx.net>,
Peter Eisentraut <peter_e@gmx.net> writes:

> Am Mittwoch, 21. April 2004 14:37 schrieb Priem, Alexander:
>> I have a PostgreSQL 7.4 database running, which was initdb-ed using
>> standard (SQL_ASCII) encoding, with -lc-collate=C option set.
>>
>> Everything is running fine, but I just discovered something funny. If text
>> containing characters like ë is inserted into the database (text field),
>> the inserting goes fine. When I view this data using phpPgAdmin (v3.2.1) it
>> looks fine, including these special characters.

> You can try to update pg_database to change the database encoding to LATIN1.
> SQL_ASCII is only totally supported for, well, ASCII characters.

Or even better to LATIN9.  This also includes the Euro symbol and the
French OE ligature forgotten in LATIN1.

Re: Unicode problem ???

From

Karsten Hilbert

Date:

21 April 2004, 17:02:17

> What I personally don't understand is: if all my databases
> are UNICODE, why do I have to set the Client encoding to latin1
> to get a correct result?
Because LATIN1 isn't just a subset of UNICODE. If the data
coming out of the database *is* UNICODE *and* your client
*does* handle UNICODE directly you might get away with not
setting a client_encoding, same goes for DB=latin1 +
client=latin1. If, however, the DB delivers UNICODE but your
client really wants LATIN1 you need to tell the database to
convert the stored UNICODE to LATIN1 before delivery. That's
what the client_encoding is for.

Or so is my understanding of it.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: Unicode problem ???

From

"Priem, Alexander"

Date:

22 April 2004, 04:36:35

So if I understand correctly, I could best try the following :

* Create a new database, using UNICODE encoding, since this supports the
largest character set (inc. Euro-sign etc.)

* Then, set the client-encoding to LATIN9, so that clients get support for
special dutch characters. The client-encoding can also be changed
on-the-fly, so it won't matter if I don't get this right the first time.
Most important is the database (cluster) encoding...

Am I right in this?

Thanks for all the help.
Alexander Priem.

Re: Unicode problem ???

From

"Stijn Vanroye"

Date:

22 April 2004, 04:48:54

> > What I personally don't understand is: if all my databases
> > are UNICODE, why do I have to set the Client encoding to latin1
> > to get a correct result?

Karsten Hilbert wrote:
> Because LATIN1 isn't just a subset of UNICODE. If the data
> coming out of the database *is* UNICODE *and* your client
> *does* handle UNICODE directly you might get away with not
> setting a client_encoding, same goes for DB=latin1 +
> client=latin1. If, however, the DB delivers UNICODE but your
> client really wants LATIN1 you need to tell the database to
> convert the stored UNICODE to LATIN1 before delivery. That's
> what the client_encoding is for.
>
> Or so is my understanding of it.
>
> Karsten

Thanks, that's a pretty clear explication.
Only one question:
Wouldn't it be better if I just set my client encoding to UNICODE in stead of LATIN1? I suppose the UNICODE encoding
setis understood by windows (and delphi), since I write progs for a win enviroment. 

Stijn.

Re: Unicode problem ???

From

Peter Eisentraut

Date:

22 April 2004, 05:29:36

Am Donnerstag, 22. April 2004 09:48 schrieb Stijn Vanroye:
> Wouldn't it be better if I just set my client encoding to UNICODE in stead
> of LATIN1? I suppose the UNICODE encoding set is understood by windows (and
> delphi), since I write progs for a win enviroment.

That really depends exclusively on what your client program/environment can
tolerate.  You may have to set your console, widget set or desktop to Unicode
or something like that.

Re: Unicode problem ???

From

"Stijn Vanroye"

Date:

22 April 2004, 05:41:47

> Am Donnerstag, 22. April 2004 09:48 schrieb Stijn Vanroye:
> > Wouldn't it be better if I just set my client encoding to
> UNICODE in stead
> > of LATIN1? I suppose the UNICODE encoding set is understood
> by windows (and
> > delphi), since I write progs for a win enviroment.

Peter Eisentraut wrote:
> That really depends exclusively on what your client
> program/environment can
> tolerate.  You may have to set your console, widget set or
> desktop to Unicode
> or something like that.
Ok thanks. All I've got to do now is find out if windows and/or delphi understand UNICODE.

Stijn.

Re: Unicode problem ???

From

"Priem, Alexander"

Date:

22 April 2004, 07:14:41

Hi everyone,

I solved the problem with the special characters (thanks to you all).

I recreated my entire database (luckily I keep scripts for table/index/view
creation) and initdb-ed it using --lc-collate=C --encoding=UNICODE. In my
psqlODBC DSN settings I added "set client_encoding='LATIN9';" to the Connect
Settings and that solved all my problems regarding the special characters.

Thanks a lot, everyone!

Alexander Priem.

Re: Unicode problem ???

From

Harald Fuchs

Date:

22 April 2004, 09:26:14

In article <71E201BE5E881C46811BA160694C5FCB0FA934@fs1000.farcourier.com>,
"Stijn Vanroye" <s.vanroye@farcourier.com> writes:

>> Am Donnerstag, 22. April 2004 09:48 schrieb Stijn Vanroye:
>> > Wouldn't it be better if I just set my client encoding to
>> UNICODE in stead
>> > of LATIN1? I suppose the UNICODE encoding set is understood
>> by windows (and
>> > delphi), since I write progs for a win enviroment.

> Peter Eisentraut wrote:
>> That really depends exclusively on what your client
>> program/environment can
>> tolerate.  You may have to set your console, widget set or
>> desktop to Unicode
>> or something like that.
> Ok thanks. All I've got to do now is find out if windows and/or delphi understand UNICODE.

I'd say you should first decide if you really need Unicode.  If you're
dealing exclusively with English/French/Spanish/German or so and if
you're pretty sure you'll never touch Polish/Russian/Chinese, you can
stick to Latin-1 or Latin-9 and happily ignore Unicode.

Re: Unicode problem ???

From

"Priem, Alexander"

Date:

22 April 2004, 11:22:21

> I'd say you should first decide if you really need Unicode.  If you're
> dealing exclusively with English/French/Spanish/German or so and if
> you're pretty sure you'll never touch Polish/Russian/Chinese, you can
> stick to Latin-1 or Latin-9 and happily ignore Unicode.

But wouldn't it be better to pick unicode just in case? Or would Latin-1 /
Latin-9 perform better, compared to Unicode?

Alexander Priem

Re: Unicode problem ???

From

Björn Lundin

Date:

26 April 2004, 12:40:00

Stijn Vanroye wrote:

> Of what I hear, UNICODE indeed seems the best option. But then again, that
> encoding stuff is still a bit of a mistery to me. What I personally don't

The following is cut from the documentation of XML Ada
(xmlada-1.0/docs/xml_2.html#SEC6)
( which is available at http://libre.act-europe.fr/xmlada/)

<quote>
We now know how each encoded character can be represented by an integer
value (code point) depending on the character set.

Character encoding schemes deal with the representation of a sequence of
integers to a sequence of code units. A code unit is a sequence of bytes on
a computer architecture.

There exists a number of possible encoding schemes. Some of them encode all
integers on the same number of bytes. They are called fixed-width encoding
forms, and include the standard encoding for Internet emails (7bits, but it
can't encode all characters), as well as the simple 8bits scheme, or the
EBCDIC scheme. Among them is also the UTF-32 scheme which is defined in the
Unicode standard.

Another set of encoding schemes encode integers on a variable number of
bytes. These include two schemes that are also defined in the Unicode
standard, namely Utf-8 and Utf-16.

Unicode doesn't impose any specific encoding. However, it is most often
associated with one of the Utf encodings. They each have their own
properties and advantages:

Utf32
This is the simplest of all these encodings. It simply encodes all the
characters on 32 bits (4 bytes). This encodes all the possible characters
in Unicode, and is obviously straightforward to manipulate. However, given
that the first 65535 characters in Unicode are enough to encode all known
languages currently in use, Utf32 is also a waste of space in most cases.

Utf16
For the above reason, Utf16 was defined. Most characters are only encoded on
two bytes (which is enough for the first 65535 and most current
characters). In addition, a number of special code points have been
defined, known as surrogate pairs, that make the encoding of integers
greater than 65535 possible. The integers are then encoded on four bytes.
As a result, Utf16 is thus much more memory-efficient and requires less
space than Utf32 to encode sequences of characters. However, it is also
more complex to decode.

Utf8
This is an even more space-efficient encoding, but is also more complex to
decode. More important, it is compatible with the most currently used
simple 8bit encoding.

Utf8 has the following properties:

Characters 0 to 127 (ASCII) are encoded simply as a single byte. This means
that files and strings which contain only 7-bit ASCII characters have the
same encoding under both ASCII and UTF-8.

Characters greater than 127 are encoded as a sequence of several bytes, each
of which has the most significant bit set. Therefore, no ASCII byte can
appear as part of any other character.

The first byte of a multibyte sequence that represents a non-ASCII character
is always in the range 0xC0 to 0xFD and it indicates how many bytes follow
for this character. All further bytes in a multibyte sequence are in the
range 0x80 to 0xBF. This allows easy resynchronization and makes the
encoding stateless and robust against missing bytes.

UTF-8 encoded characters may theoretically be up to six bytes long, however
the first 16-bit characters are only up to three bytes long.

Note that the encodings above, except for Utf8, have two versions, depending
on the chosen byte order on the machine.

</quote>

So yes, Unicode in Utf8 is tricky to handle
/Björn