Thread: Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.
From: "J. Michael Crawford"
I am posting this to the General and JDBC groups so that it's in the archive -- I know many people have had problems with Latin1 characters, and I'd like to share what has worked for us. If anyone can add this information to a more permanent FAQ, I'd be much obliged.

---

Using Postgres with Latin1 (ISO8859-1) and Unicode (utf-8) character sets.

This is a general primer for using Postgres with alternate character sets. For those who have done so successfully, the process is obvious in hindsight. For those who haven't gotten it to work yet, it can be very complex, especially with web applications, where four separate encodings can come into play. Postgres is quite intelligent in dealing with character sets, but even so, it's possible to get into a situation where your application returns garbage for Latin1 or other extended characters (n with a tilde, etc.). This is particularly true in Java-based web applications.

There are four steps to ensuring proper character retrieval, with the fourth applying to Java web applications. They are explained briefly here, and then in more detail further down in the document:

1. Use a database encoding that will store the appropriate characters.
2. Set the client encoding first, before importing or adding data to the database.
3. Set the client encoding for each session where a user or program accesses the database.
4. Set the proper character encoding for the web page.

Do these four steps, and things should work out nicely. Miss even one of them, and your application will mysteriously return goofy extended characters.

For anyone who has already done this, the solution is probably obvious. However, it took us months to figure this out through trial and error, even with the help of the Java and Postgres communities. After all, we had five different potential encodings to deal with, in four separate steps, and that yields a LOT of non-working combinations.
We found dozens of people with the same problem, but most of them never fully resolved it. We tried a heck of a lot of different strategies, but the only thing that worked was breaking the situation down into these steps and finding the right encoding combination for each step, one at a time. Now that our application works, we want to help minimize the number of people who have to reinvent this wheel from scratch. The rest of this document explains the steps in more detail.

---

1. Create the database with the proper encoding.

In our experience, the best thing to do is simply choose Unicode as the database's character encoding if you think there's a chance of storing Latin1 or other characters. You could choose Latin1, and this should work in most cases. However, there are times when normal-looking characters refuse to be stored in a Latin1 database, such as character 239, which is the same code point in Latin1 as it is in Utf-8. Rather than attempt to beat our heads against this wall, we went with Unicode because it will hold whatever we need to hold.

---

2. Set the client encoding before importing or adding data to the database.

If you do a mass import from another database, or a data restore from Postgres, make sure the encoding is set first. If you try to restore unicode characters to a Latin1 database, the data likely won't look right. Likewise, if you try to restore Latin1 characters into a Unicode database without first telling Postgres that it's getting Latin1 characters, the extended characters will come back as garbage, *even though* a unicode database can hold them.

The command to set the client encoding for a session is "SET CLIENT_ENCODING TO Unicode", "SET CLIENT_ENCODING TO Latin1", and so forth. If you find out what kind of characters are coming into the Postgres database and tell Postgres ahead of time, it should be able to translate them automatically, as it is designed to do.
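The character-239 oddity mentioned above comes down to the fact that the same code point is stored as different bytes in the two encodings. A small standalone Java illustration (not tied to Postgres; the class name is ours):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        // Code point 239 is U+00EF ("i" with diaeresis) in both Latin1 and Unicode,
        // but the on-the-wire bytes differ between the two encodings.
        String iWithDiaeresis = "\u00EF";
        byte[] latin1 = iWithDiaeresis.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = iWithDiaeresis.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(latin1)); // [-17]      -> one byte,  0xEF
        System.out.println(Arrays.toString(utf8));   // [-61, -81] -> two bytes, 0xC3 0xAF
    }
}
```

This is exactly why the client encoding must be declared up front: the server can only translate the bytes correctly if it knows which of these representations it is receiving.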
Just remember that unicode is a HUGE character set, and there will be unicode characters that can't be translated into the much smaller Latin1 character set.

You may have to experiment to find out what the right client encoding is. It will likely depend upon the database from which you're getting data, such as a backup of a Latin1 Postgres database, or data retrieved from a SQL Server database, which will have its own encoding. It could also depend upon the character set used by your operating system, such as the character set of data someone entered into a Microsoft Access database (which, if they used US English Windows, will be cp1252). Fortunately, a Unicode or Latin1 client encoding will handle most Windows-related encoding issues.

---

3. Set the client encoding for each session where a user or program accesses the database.

If you have users entering data from a Java application, the encoding will vary from JVM to JVM. A Windows-based JVM might have an encoding of cp1252, while a Linux JVM might be utf-8. A Windows database application will likely have a cp1252 or Latin1 character set, but that could depend upon what application you're using or what version of Windows you're using. Postgres needs to know what kind of characters to store and return. If you tell Postgres to return unicode characters to an application that's using Latin1, extended characters will look like gibberish.

Even in Java, where you can do all sorts of character-encoding translation, it can be impossible to translate data retrieved from Postgres if it's in the wrong encoding. We've tried changing the JVM encoding, altering the JDBC driver, translating encodings on the database read, and translating encodings after the read while building a new string, to no avail. We tried 25 combinations of each strategy (five different possible read encodings and five different possible string encodings), and nothing worked.
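To see which encoding a particular JVM is actually using (the cp1252-vs-utf-8 difference described above), you can ask the JVM directly. A quick check along these lines, using the standard java.nio API:

```java
import java.nio.charset.Charset;

public class JvmEncodingCheck {
    public static void main(String[] args) {
        // The platform default charset: cp1252 on US English Windows,
        // typically UTF-8 under a modern Linux locale.
        Charset platformDefault = Charset.defaultCharset();
        System.out.println("JVM default charset: " + platformDefault.name());
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
    }
}
```

Running this on each deployment target tells you up front which encodings your application will have to reconcile, instead of discovering the mismatch through garbled output.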
We could get an application working in one JVM with one encoding, but another JVM would break, and no amount of translation would help. But when we finally told Postgres what to return, everything worked like a charm.

Just as with step two, the key is to use the "SET CLIENT_ENCODING TO (encoding)" sql command. If you're using an application where you can send SQL to the server, this is all you need. In something like MS Access, you'll have to move to a passthrough query. For Java, you'll need to send a command through JDBC:

String DBEncoding = "Unicode"; // use a real encoding, either returned from the JVM or explicitly stated
PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING TO '" + DBEncoding + "'");
statement.execute();

---

4. Set the proper character encoding for the web page.

Web browsers also need to know what kind of characters they are displaying. If you send them Unicode characters and the browser thinks it's getting Latin1, extended characters will again come up as nonsense. If you're writing a web-based application, it's a good idea to include the character set as part of the html, such as:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For example, we have an application that serves up pages perfectly from a Linux JVM, which has a utf-8 encoding, and serves up pages with goofy extended characters from a Windows JVM, which has a cp1252 encoding. The solution is either to translate the strings in the Windows JVM into unicode before serving the page, or, alternately, to tell the web page to use a different character set.

---

Those are all the basics. Many times you may not have to worry about one or more of the steps because your application server or JSP generator may already handle the character set conversion. But if it doesn't, or if you're writing Servlets and not JSPs, these steps are good ones to consider if you work with Latin1 characters.
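The step-4 mismatch is easy to reproduce without a browser: bytes sent as utf-8 but interpreted as Latin1 turn every extended character into a two-character artifact. A minimal sketch (the sample string is ours, chosen for illustration):

```java
import java.nio.charset.StandardCharsets;

public class BrowserMismatchDemo {
    public static void main(String[] args) {
        String page = "ma\u00F1ana"; // "manana" with n-tilde, as the server intends it
        // The server encodes the page as UTF-8...
        byte[] sentAsUtf8 = page.getBytes(StandardCharsets.UTF_8);
        // ...but a browser that assumes Latin1 decodes those same bytes like this:
        String whatLatin1BrowserShows = new String(sentAsUtf8, StandardCharsets.ISO_8859_1);
        System.out.println(whatLatin1BrowserShows); // prints "maÃ±ana"
    }
}
```

The two-byte UTF-8 sequence for n-tilde (0xC3 0xB1) is read as two separate Latin1 characters, which is precisely the "goofy extended characters" symptom described above; the meta tag fixes it by telling the browser which decoding to use.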
On Mon, 8 Nov 2004, J. Michael Crawford wrote:

> Even in Java, where you can do all sorts of character-encoding
> translation, it can be impossible to translate data retrieved from Postgres
> if it's in the wrong encoding. We've tried changing the JVM encoding,
> altering the jdbc driver, translating encodings on the database read, and
> translating encodings after the read while building a new string, to no
> avail. We tried 25 combinations of each strategy (five different possible
> read encodings and five different possible string encodings), and nothing
> worked. We could get an application working in one JVM with one encoding,
> but another JVM would break, and no amount of translation would help.
>
> But when we finally told Postgres what to return, everything worked like
> a charm.
>
> Just as with step two, the key is to use the "SET CLIENT_ENCODING TO
> (encoding)" sql command. If you're using an application where you can send
> SQL to the server, this is all you need. In something like MS Access,
> you'll have to move to a passthrough query. For Java, you'll need to send
> a command through JDBC:
>
> String DBEncoding = "Unicode"; // use a real encoding, either returned from
> the jvm or explicitly stated
> PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING
> TO '" + DBEncoding + "'");
> statement.execute();

This is bad advice for a Java client and does not work. The JDBC driver always expects data in unicode and issues a SET client_encoding of its own at connection startup to make sure it gets unicode data. Changing this to another encoding will break the driver, and in the cvs version a check has been added to error out if it detects you doing this.

Kris Jurka
<<This is bad advice for a Java client and does not work.>>

Well then, perhaps we shouldn't share the procedure with other folks. I apologize if I'm introducing some misinformation. However, this has been the only way to get our system to work on more than one JVM. People from this group provided many suggestions, people from other groups did the same, and nothing helped. Taking bytes and translating encodings (examples follow my signature below) had no effect. Changing the url of the postgres connection to include an encoding also had no effect. Setting the encoding for the entire JVM didn't work either. Either the data worked in a Linux JVM, or a Windows JVM, but not both.

So, if you're going to correct us for the wrong solution (which I'm glad you have done), do you have any suggestions as to what the right solution might be?

- Mike

Encoding translations that didn't work:

a) Getting encoded bytes from the result set. We tried the following block five times, once for each different encoding we were trying to test with the database:

dataRead = new String(result.getBytes(longName), "utf-8");
dataLatin_a = new String(dataRead.getBytes("ISO-8859-1"));
dataLatin_b = new String(dataRead.getBytes("Latin1"));
dataUnicode_a = new String(dataRead.getBytes("utf-8"));
dataUnicode_b = new String(dataRead.getBytes("UTF8"));
dataWin = new String(dataRead.getBytes("Cp1252"));

b) Getting a string, turning it into bytes, and then translating. Same process as above, but we use result.getString...

No matter what, strings showed up as gibberish in one JVM or another, depending upon the native encoding of the database. A Latin1 database worked in the windows JVM, a Unicode in the Linux JVM, but not the other way around.

At 12:15 PM 11/8/2004, Kris Jurka wrote:

>On Mon, 8 Nov 2004, J. Michael Crawford wrote:
>> Even in Java, where you can do all sorts of character-encoding
>> translation, it can be impossible to translate data retrieved from Postgres
>> if it's in the wrong encoding. We've tried changing the JVM encoding,
>> altering the jdbc driver, translating encodings on the database read, and
>> translating encodings after the read while building a new string, to no
>> avail. We tried 25 combinations of each strategy (five different possible
>> read encodings and five different possible string encodings), and nothing
>> worked. We could get an application working in one JVM with one encoding,
>> but another JVM would break, and no amount of translation would help.
>>
>> But when we finally told Postgres what to return, everything worked like
>> a charm.
>>
>> Just as with step two, the key is to use the "SET CLIENT_ENCODING TO
>> (encoding)" sql command. If you're using an application where you can send
>> SQL to the server, this is all you need. In something like MS Access,
>> you'll have to move to a passthrough query. For Java, you'll need to send
>> a command through JDBC:
>>
>> String DBEncoding = "Unicode"; // use a real encoding, either returned from
>> the jvm or explicitly stated
>> PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING
>> TO '" + DBEncoding + "'");
>> statement.execute();
>
>This is bad advice for a Java client and does not work. The JDBC driver
>always expects data in unicode and issues a SET client_encoding of its
>own at connection startup to make sure it gets unicode data. Changing
>this to another encoding will break the driver, and in the cvs version a
>check has been added to error out if it detects you doing this.
>
>Kris Jurka
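One plausible reason none of the after-the-fact translations above could ever work: once bytes have been decoded with the wrong charset, invalid sequences are replaced with U+FFFD and the original byte values are discarded, so no later getBytes()/new String() round trip can restore them. A sketch with made-up byte values:

```java
import java.nio.charset.StandardCharsets;

public class LostBytesDemo {
    public static void main(String[] args) {
        byte[] latin1FromServer = { 'n', (byte) 0xF1 }; // "n" plus n-tilde, encoded as Latin1
        // Decoding with the wrong charset: 0xF1 is not a valid lone UTF-8 byte,
        // so the String constructor substitutes U+FFFD (the replacement character).
        String wrong = new String(latin1FromServer, StandardCharsets.UTF_8);
        System.out.println(wrong.charAt(1) == '\uFFFD'); // prints true: 0xF1 is gone for good
    }
}
```

Information is destroyed at the first bad decode, which is consistent with the observation that no amount of re-encoding the resulting String helped; the fix has to happen at (or before) the initial byte-to-String conversion.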
J. Michael Crawford wrote:

> Encoding translations that didn't work:
>
> a) Getting encoded bytes from the result set. We tried the following
> block five times, once for each different encoding we were trying to
> test with the database:
>
> b) Getting a string, turning it into bytes, and then translating. Same
> process as above, but we use result.getString...
>
> No matter what, strings showed up as gibberish in one JVM or another,
> depending upon the native encoding of the database. A Latin1 database
> worked in the windows JVM, a Unicode in the Linux JVM, but not the other
> way around.

The "right way" is to just use getString() and not do any translation yourself. The driver has already done the transcoding from whatever the DB encoding is, to the internal UTF-16 string representation. You don't need to mess with byte-based representations.

When you then display that string, you will need to use an appropriate encoding, obviously..

Can you provide a self-contained testcase that demonstrates getString() doing the wrong thing?

-O
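Oliver's point can be illustrated without a live database: the driver decodes the server's bytes exactly once, with the connection's declared encoding, and from then on the Java String is encoding-independent. A rough sketch of that one-time decode (the byte values are our own example, not driver internals):

```java
import java.nio.charset.StandardCharsets;

public class DecodeOnceDemo {
    public static void main(String[] args) {
        byte[] fromServer = { 'n', (byte) 0xF1 }; // "n" plus n-tilde in the database's Latin1 encoding
        // This single decode is roughly what getString() has already done for you:
        String value = new String(fromServer, StandardCharsets.ISO_8859_1);
        System.out.println(value.equals("n\u00F1")); // prints true: no further translation needed
        // Re-encoding is purely an output concern, e.g. when writing a utf-8 web page:
        byte[] forUtf8Page = value.getBytes(StandardCharsets.UTF_8);
        System.out.println(forUtf8Page.length); // prints 3: 'n' plus the two-byte UTF-8 n-tilde
    }
}
```

The design consequence is that byte-level fiddling after getString() can only add a second, redundant (and usually wrong) conversion on top of one the driver already performed correctly.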
<<The "right way" is to just use getString() and not do any translation yourself.>>

That was the first thing we tried. Sorry I didn't mention it. The next step was getting a string, turning it into bytes, and translating the bytes. The third step was getting bytes. Nothing worked in our Java GUI, the console, or the web page returned.

Maybe it was just something weird in our system. Perhaps Postgres just wanted to send Latin1 characters out of our database, which at the time was Latin1. I don't know. Whatever it was sending, we couldn't use it, regardless of the translation attempted. Forcing it to Unicode (improperly) did the trick for us. Now that our database is unicode, I'm not sure we'd get the same problem. If there's a better way to make sure Postgres is sending out Unicode in our JVM session, while simultaneously sending out other encodings to other apps that access the database using different encodings, I'm all ears.

<<When you then display that string, you will need to use an appropriate encoding, obviously..>>

Definitely.

<<Can you provide a self-contained testcase that demonstrates getString() doing the wrong thing?>>

I can try. If I can find a copy of the old Latin1 database and replicate the error, I'll let you know via email.

---

Thanks for your attention on this, both this time and back in the summer when I was asking questions before we arrived at our "fix". You and Kris really carry this discussion group, and we're all lucky to have you do it. My goal was to add something to the group so that I could do more than just receive help. I also wanted something in the archives that would help "newbie" searchers who haven't yet had to deal with the encoding process in a java servlet -- we searched for weeks without finding anything that covered all the bases. If there's a way to remove/correct/comment my posting, I'm fine with that.

- Mike

At 04:55 PM 11/8/2004, Oliver Jowett wrote:

>J. Michael Crawford wrote:
>
>> Encoding translations that didn't work:
>>
>> a) Getting encoded bytes from the result set. We tried the following
>> block five times, once for each different encoding we were trying to test
>> with the database:
>>
>> b) Getting a string, turning it into bytes, and then translating. Same
>> process as above, but we use result.getString...
>>
>> No matter what, strings showed up as gibberish in one JVM or another,
>> depending upon the native encoding of the database. A Latin1 database
>> worked in the windows JVM, a Unicode in the Linux JVM, but not the other
>> way around.
>
>The "right way" is to just use getString() and not do any translation
>yourself. The driver has already done the transcoding from whatever the DB
>encoding is, to the internal UTF-16 string representation. You don't need
>to mess with byte-based representations.
>
>When you then display that string, you will need to use an appropriate
>encoding, obviously..
>
>Can you provide a self-contained testcase that demonstrates getString()
>doing the wrong thing?
>
>-O