Thread: Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.
From: "J. Michael Crawford"
I am posting this to the General and JDBC groups so that it's in the archive -- I know many people have had problems with Latin1 characters, and I'd like to share what has worked for us. If anyone can add this information to a more permanent FAQ, I'd be much obliged.

---

Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.

This is a general primer for using Postgres with alternate character sets. For those who have done so successfully, the process is obvious in hindsight. For those who haven't gotten it to work yet, it can be very complex, especially with web applications, where four separate encodings can come into play. Postgres is quite intelligent in dealing with character sets, but even so, it's possible to get into a situation where your application returns garbage for Latin1 or other extended characters (n with a tilde, etc.). This is particularly true in Java-based web applications.

There are four steps to ensuring proper character retrieval, with the fourth applying to Java web applications. They are explained briefly here, and then in more detail further down in the document:

1. Use a database encoding that will store the appropriate characters.
2. Set the client encoding first, before importing or adding data to the database.
3. Set the client encoding for each session where a user or program accesses the database.
4. Set the proper character encoding for the web page.

Do these four steps, and things should work out nicely. Miss even one of them, and your application will mysteriously return goofy extended characters.

For anyone who has already done this, the solution is probably obvious. However, it took us months to figure this out through trial and error, even with the help of the Java and Postgres communities. After all, we had five different potential encodings to deal with, in four separate steps, and that yields a LOT of non-working combinations.
We found dozens of people with the same problem, but most of them never fully resolved it. We tried a heck of a lot of different strategies, but the only thing that worked was breaking the situation down into these steps and finding the right encoding combination for each step, one at a time. Now that our application works, we want to help minimize the number of people who have to reinvent this wheel from scratch. The rest of this document explains the steps in more detail.

---

1. Create the database with the proper encoding.

In our experience, the best thing to do is simply choose Unicode as the database's character encoding if you think there's a chance of storing Latin1 or other characters. You could choose Latin1, and this should work in most cases. However, there are times when normal-looking characters refuse to be stored in a Latin1 database, such as character 239, which is the same code point in Latin1 as it is in Utf-8. Rather than attempt to beat our heads against this wall, we went with Unicode because it will hold whatever we need to hold.

---

2. Set the client encoding before importing or adding data to the database.

If you do a mass import from another database, or a data restore from Postgres, make sure the encoding is set first. If you try to restore unicode characters to a Latin1 database, the data likely won't look right. Likewise, if you try to restore Latin1 characters into a Unicode database without first telling Postgres that it's getting Latin1 characters, the extended characters will come back as garbage, *even though* a unicode database can hold them.

The command to set the client encoding for a session is "SET CLIENT_ENCODING TO Unicode", "SET CLIENT_ENCODING TO Latin1", and so forth. If you find out what kind of characters are coming into the Postgres database and tell Postgres ahead of time, it should be able to translate them automatically, as it is designed to do.
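The character-239 oddity mentioned above comes down to the fact that the same code point is stored as different bytes in the two encodings. A small standalone Java illustration (not tied to Postgres; the class name is ours):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        // Code point 239 is U+00EF ("i" with diaeresis) in both Latin1 and Unicode,
        // but the on-the-wire bytes differ between the two encodings.
        String iWithDiaeresis = "\u00EF";
        byte[] latin1 = iWithDiaeresis.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = iWithDiaeresis.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(latin1)); // [-17]      -> one byte,  0xEF
        System.out.println(Arrays.toString(utf8));   // [-61, -81] -> two bytes, 0xC3 0xAF
    }
}
```

This is exactly why the client encoding must be declared up front: the server can only translate the bytes correctly if it knows which of these representations it is receiving.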
Just remember that unicode is a HUGE character set, and there will be unicode characters that can't be translated into the much smaller Latin1 character set.

You may have to experiment to find out what the right client encoding is. It will likely depend upon the database from which you're getting data, such as a backup of a Latin1 Postgres database, or data retrieved from a SQL Server database, which will have its own encoding. It could also depend upon the character set used by your operating system, such as the character set of data someone entered into a Microsoft Access database (which, if they used US English Windows, will be cp1252). Fortunately, a Unicode or Latin1 client encoding will handle most Windows-related encoding issues.

---

3. Set the client encoding for each session where a user or program accesses the database.

If you have users entering data from a Java application, the encoding will vary from JVM to JVM. A Windows-based JVM might have an encoding of cp1252, while a Linux JVM might be utf-8. A Windows database application will likely have a cp1252 or Latin1 character set, but that could depend upon what application you're using or what version of Windows you're using. Postgres needs to know what kind of characters to store and return. If you tell Postgres to return unicode characters to an application that's using Latin1, extended characters will look like gibberish.

Even in Java, where you can do all sorts of character-encoding translation, it can be impossible to translate data retrieved from Postgres if it's in the wrong encoding. We've tried changing the JVM encoding, altering the JDBC driver, translating encodings on the database read, and translating encodings after the read while building a new string, to no avail. We tried 25 combinations of each strategy (five different possible read encodings and five different possible string encodings), and nothing worked.
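To see which encoding a particular JVM is actually using (the cp1252-vs-utf-8 difference described above), you can ask the JVM directly. A quick check along these lines, using the standard java.nio API:

```java
import java.nio.charset.Charset;

public class JvmEncodingCheck {
    public static void main(String[] args) {
        // The platform default charset: cp1252 on US English Windows,
        // typically UTF-8 under a modern Linux locale.
        Charset platformDefault = Charset.defaultCharset();
        System.out.println("JVM default charset: " + platformDefault.name());
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
    }
}
```

Running this on each deployment target tells you up front which encodings your application will have to reconcile, instead of discovering the mismatch through garbled output.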
We could get an application working in one JVM with one encoding, but another JVM would break, and no amount of translation would help. But when we finally told Postgres what to return, everything worked like a charm.

Just as with step two, the key is to use the "SET CLIENT_ENCODING TO (encoding)" sql command. If you're using an application where you can send SQL to the server, this is all you need. In something like MS Access, you'll have to move to a passthrough query. For Java, you'll need to send a command through JDBC:

String DBEncoding = "Unicode"; // use a real encoding, either returned from the JVM or explicitly stated
PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING TO '" + DBEncoding + "'");
statement.execute();

---

4. Set the proper character encoding for the web page.

Web browsers also need to know what kind of characters they are displaying. If you send them Unicode characters and the browser thinks it's getting Latin1, extended characters will again come up as nonsense. If you're writing a web-based application, it's a good idea to include the character set as part of the html, such as:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For example, we have an application that serves up pages perfectly from a Linux JVM, which has a utf-8 encoding, and serves up pages with goofy extended characters from a Windows JVM, which has a cp1252 encoding. The solution is either to translate the strings in the Windows JVM into unicode before serving the page, or, alternately, to tell the web page to use a different character set.

---

Those are all the basics. Many times you may not have to worry about one or more of the steps because your application server or JSP generator may already handle the character set conversion. But if it doesn't, or if you're writing Servlets and not JSPs, these steps are good ones to consider if you work with Latin1 characters.
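The step-4 mismatch is easy to reproduce without a browser: bytes sent as utf-8 but interpreted as Latin1 turn every extended character into a two-character artifact. A minimal sketch (the sample string is ours, chosen for illustration):

```java
import java.nio.charset.StandardCharsets;

public class BrowserMismatchDemo {
    public static void main(String[] args) {
        String page = "ma\u00F1ana"; // "manana" with n-tilde, as the server intends it
        // The server encodes the page as UTF-8...
        byte[] sentAsUtf8 = page.getBytes(StandardCharsets.UTF_8);
        // ...but a browser that assumes Latin1 decodes those same bytes like this:
        String whatLatin1BrowserShows = new String(sentAsUtf8, StandardCharsets.ISO_8859_1);
        System.out.println(whatLatin1BrowserShows); // prints "maÃ±ana"
    }
}
```

The two-byte UTF-8 sequence for n-tilde (0xC3 0xB1) is read as two separate Latin1 characters, which is precisely the "goofy extended characters" symptom described above; the meta tag fixes it by telling the browser which decoding to use.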
On Mon, 8 Nov 2004, J. Michael Crawford wrote:

> Even in Java, where you can do all sorts of character-encoding
> translation, it can be impossible to translate data retrieved from Postgres
> if it's in the wrong encoding. We've tried changing the JVM encoding,
> altering the jdbc driver, translating encodings on the database read, and
> translating encodings after the read while building a new string, to no
> avail. We tried 25 combinations of each strategy (five different possible
> read encodings and five different possible string encodings), and nothing
> worked. We could get an application working in one JVM with one encoding,
> but another JVM would break, and no amount of translation would help.
>
> But when we finally told Postgres what to return, everything worked like
> a charm.
>
> Just as with step two, the key is to use the "SET CLIENT_ENCODING TO
> (encoding)" sql command. If you're using an application where you can send
> SQL to the server, this is all you need. In something like MS Access,
> you'll have to move to a passthrough query. For Java, you'll need to send
> a command through JDBC:
>
> String DBEncoding = "Unicode"; // use a real encoding, either returned from
> the jvm or explicitly stated
> PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING
> TO '" + DBEncoding + "'");
> statement.execute();

This is bad advice for a Java client and does not work. The JDBC driver always expects data in unicode and issues a SET client_encoding of its own at connection startup to make sure it gets unicode data. Changing this to another encoding will break the driver, and in the cvs version a check has been added to error out if it detects you doing this.

Kris Jurka
<<This is bad advice for a Java client and does not work.>>

Well then, perhaps we shouldn't share the procedure with other folks. I apologize if I'm introducing some misinformation. However, this has been the only way to get our system to work on more than one JVM. People from this group provided many suggestions, people from other groups did the same, and nothing helped. Taking bytes and translating encodings (examples follow my signature below) had no effect. Changing the url of the postgres connection to include an encoding also had no effect. Setting the encoding for the entire JVM didn't work either. Either the data worked in a Linux JVM, or a Windows JVM, but not both.

So, if you're going to correct us for the wrong solution (which I'm glad you have done), do you have any suggestions as to what the right solution might be?

- Mike

Encoding translations that didn't work:

a) Getting encoded bytes from the result set. We tried the following block five times, once for each different encoding we were trying to test with the database:

dataRead = new String(result.getBytes(longName), "utf-8");
dataLatin_a = new String(dataRead.getBytes("ISO-8859-1"));
dataLatin_b = new String(dataRead.getBytes("Latin1"));
dataUnicode_a = new String(dataRead.getBytes("utf-8"));
dataUnicode_b = new String(dataRead.getBytes("UTF8"));
dataWin = new String(dataRead.getBytes("Cp1252"));

b) Getting a string, turning it into bytes, and then translating. Same process as above, but we use result.getString...

No matter what, strings showed up as gibberish in one JVM or another, depending upon the native encoding of the database. A Latin1 database worked in the windows JVM, a Unicode in the Linux JVM, but not the other way around.

At 12:15 PM 11/8/2004, Kris Jurka wrote:

>On Mon, 8 Nov 2004, J. Michael Crawford wrote:
>> Even in Java, where you can do all sorts of character-encoding
>> translation, it can be impossible to translate data retrieved from Postgres
>> if it's in the wrong encoding. We've tried changing the JVM encoding,
>> altering the jdbc driver, translating encodings on the database read, and
>> translating encodings after the read while building a new string, to no
>> avail. We tried 25 combinations of each strategy (five different possible
>> read encodings and five different possible string encodings), and nothing
>> worked. We could get an application working in one JVM with one encoding,
>> but another JVM would break, and no amount of translation would help.
>>
>> But when we finally told Postgres what to return, everything worked like
>> a charm.
>>
>> Just as with step two, the key is to use the "SET CLIENT_ENCODING TO
>> (encoding)" sql command. If you're using an application where you can send
>> SQL to the server, this is all you need. In something like MS Access,
>> you'll have to move to a passthrough query. For Java, you'll need to send
>> a command through JDBC:
>>
>> String DBEncoding = "Unicode"; // use a real encoding, either returned from
>> the jvm or explicitly stated
>> PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING
>> TO '" + DBEncoding + "'");
>> statement.execute();
>
>This is bad advice for a Java client and does not work. The JDBC driver
>always expects data in unicode and issues a SET client_encoding of its
>own at connection startup to make sure it gets unicode data. Changing
>this to another encoding will break the driver, and in the cvs version a
>check has been added to error out if it detects you doing this.
>
>Kris Jurka
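One plausible reason none of the after-the-fact translations above could ever work: once bytes have been decoded with the wrong charset, invalid sequences are replaced with U+FFFD and the original byte values are discarded, so no later getBytes()/new String() round trip can restore them. A sketch with made-up byte values:

```java
import java.nio.charset.StandardCharsets;

public class LostBytesDemo {
    public static void main(String[] args) {
        byte[] latin1FromServer = { 'n', (byte) 0xF1 }; // "n" plus n-tilde, encoded as Latin1
        // Decoding with the wrong charset: 0xF1 is not a valid lone UTF-8 byte,
        // so the String constructor substitutes U+FFFD (the replacement character).
        String wrong = new String(latin1FromServer, StandardCharsets.UTF_8);
        System.out.println(wrong.charAt(1) == '\uFFFD'); // prints true: 0xF1 is gone for good
    }
}
```

Information is destroyed at the first bad decode, which is consistent with the observation that no amount of re-encoding the resulting String helped; the fix has to happen at (or before) the initial byte-to-String conversion.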
J. Michael Crawford wrote:

> Encoding translations that didn't work:
>
> a) Getting encoded bytes from the result set. We tried the following
> block five times, once for each different encoding we were trying to
> test with the database:
>
> b) Getting a string, turning it into bytes, and then translating. Same
> process as above, but we use result.getString...
>
> No matter what, strings showed up as gibberish in one JVM or another,
> depending upon the native encoding of the database. A Latin1 database
> worked in the windows JVM, a Unicode in the Linux JVM, but not the other
> way around.

The "right way" is to just use getString() and not do any translation yourself. The driver has already done the transcoding from whatever the DB encoding is, to the internal UTF-16 string representation. You don't need to mess with byte-based representations.

When you then display that string, you will need to use an appropriate encoding, obviously..

Can you provide a self-contained testcase that demonstrates getString() doing the wrong thing?

-O
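Oliver's point can be illustrated without a live database: the driver decodes the server's bytes exactly once, with the connection's declared encoding, and from then on the Java String is encoding-independent. A rough sketch of that one-time decode (the byte values are our own example, not driver internals):

```java
import java.nio.charset.StandardCharsets;

public class DecodeOnceDemo {
    public static void main(String[] args) {
        byte[] fromServer = { 'n', (byte) 0xF1 }; // "n" plus n-tilde in the database's Latin1 encoding
        // This single decode is roughly what getString() has already done for you:
        String value = new String(fromServer, StandardCharsets.ISO_8859_1);
        System.out.println(value.equals("n\u00F1")); // prints true: no further translation needed
        // Re-encoding is purely an output concern, e.g. when writing a utf-8 web page:
        byte[] forUtf8Page = value.getBytes(StandardCharsets.UTF_8);
        System.out.println(forUtf8Page.length); // prints 3: 'n' plus the two-byte UTF-8 n-tilde
    }
}
```

The design consequence is that byte-level fiddling after getString() can only add a second, redundant (and usually wrong) conversion on top of one the driver already performed correctly.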
<<The "right way" is to just use getString() and not do any translation yourself.>>

That was the first thing we tried. Sorry I didn't mention it. The next step was getting a string, turning it into bytes, and translating the bytes. The third step was getting bytes. Nothing worked in our Java GUI, the console, or the web page returned.

Maybe it was just something weird in our system. Perhaps Postgres just wanted to send Latin1 characters out of our database, which at the time was Latin1. I don't know. Whatever it was sending, we couldn't use it, regardless of the translation attempted. Forcing it to Unicode (improperly) did the trick for us. Now that our database is unicode, I'm not sure we'd get the same problem. If there's a better way to make sure Postgres is sending out Unicode in our JVM session, while simultaneously sending out other encodings to other apps that access the database using different encodings, I'm all ears.

<<When you then display that string, you will need to use an appropriate encoding, obviously..>>

Definitely.

<<Can you provide a self-contained testcase that demonstrates getString() doing the wrong thing?>>

I can try. If I can find a copy of the old Latin1 database and replicate the error, I'll let you know via email.

---

Thanks for your attention on this, both this time and back in the summer when I was asking questions before we arrived at our "fix". You and Kris really carry this discussion group, and we're all lucky to have you do it. My goal was to add something to the group so that I could do more than just receive help. I also wanted something in the archives that would help "newbie" searchers who haven't yet had to deal with the encoding process in a java servlet -- we searched for weeks without finding anything that covered all the bases. If there's a way to remove/correct/comment my posting, I'm fine with that.

- Mike

At 04:55 PM 11/8/2004, Oliver Jowett wrote:

>J. Michael Crawford wrote:
>
>> Encoding translations that didn't work:
>>
>> a) Getting encoded bytes from the result set. We tried the following
>> block five times, once for each different encoding we were trying to test
>> with the database:
>>
>> b) Getting a string, turning it into bytes, and then translating. Same
>> process as above, but we use result.getString...
>>
>> No matter what, strings showed up as gibberish in one JVM or another,
>> depending upon the native encoding of the database. A Latin1 database
>> worked in the windows JVM, a Unicode in the Linux JVM, but not the other
>> way around.
>
>The "right way" is to just use getString() and not do any translation
>yourself. The driver has already done the transcoding from whatever the DB
>encoding is, to the internal UTF-16 string representation. You don't need
>to mess with byte-based representations.
>
>When you then display that string, you will need to use an appropriate
>encoding, obviously..
>
>Can you provide a self-contained testcase that demonstrates getString()
>doing the wrong thing?
>
>-O