Thread: UTF-8 -> ISO8859-1 conversion problem
ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 Running 7.4.5, I frequently get this error, and ONLY on this particular character despite seeing quite a bit of 8 bit. I don't really follow why it can't be converted, it's the same character (239) in both character sets. Databases are in ISO8859-1, JDBC driver is defaulting to UTF-8. Am I flubbing something up? I'm probably going to (reluctantly) convert to UTF-8 in the database at some point, but it'd sure be nice if this worked without that. :) thanks!
In my experience, there are just some characters that don't want to be converted, even if they appear to be part of the normal 8-bit character system. We went to Unicode databases to hold our Latin1 characters because of this. There was even a case where the client was cutting and pasting ascii text into our database, and it just wouldn't take some of the letters, giving the same error you reported. I'm going to send a more detailed post on the topic, but in general, we've found that there are four things that need to be done (four, if you're not serving up web pages) for Latin1 characters to work on multiple platforms. 1. Create the database in Unicode so that it will hold anything you throw at it. 2. When importing data, set the encoding in the script that loads the data, or if there's no script, use the "SET CLIENT_ENCODING TO (encoding)" command. Setting the encoding in a tool like pgManager is not always enough. Use this to be sure. 3. When retrieving data in a java application, the JVM encoding will vary from JVM to JVM, and no attempt on our part to change the JVM encoding or translate the encoding of the database strings has worked, either to or from the database. We spent weeks going through every permutation getBytes("ISO-8859-1") and related calls we could find, but to no avail. The JVM will tell you it has a new encoding, but postgres will return gibberish. You can translate the bytes, or get a translated string, but it's all the same garbage. The solution: set the client encoding manually through a jdbc prepared statement. Once you set the client encoding properly, all seems to be fine: String DBEncoding = "anEncoding" //use a real encoding, either returned from the jvm or explicitly stated PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING TO '" + DBEncoding + "'"); statement.execute(); 4. If writing html for a web page, make sure the encoding of the web page matches the encoding of the strings you're throwing at it. So if you have a Linux JVM that has a "UTF-8" encoding, the web page will need the html equivalent: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> --- This is likely far more information than you require, but I thought I'd add it anyway so that the information is in the archives. It took us months to solve our problem, even with help from the postgres community, so I at least want the basics to be posted while I get my act together and write something with more detail. - Mike At 12:12 PM 10/29/2004, Cott Lang wrote: >ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 > >Running 7.4.5, I frequently get this error, and ONLY on this particular >character despite seeing quite a bit of 8 bit. I don't really follow why >it can't be converted, it's the same character (239) in both character >sets. Databases are in ISO8859-1, JDBC driver is defaulting to UTF-8. > >Am I flubbing something up? I'm probably going to (reluctantly) convert >to UTF-8 in the database at some point, but it'd sure be nice if this >worked without that. :) > >thanks! > > > > > > > >---------------------------(end of broadcast)--------------------------- >TIP 8: explain analyze is your friend
Correction: Four things that need to be done, THREE if you're not serving up html. Sorry for the editing error. - Mike At 01:19 PM 10/29/2004, J. Michael Crawford wrote: > In my experience, there are just some characters that don't want to be > converted, even if they appear to be part of the normal 8-bit character > system. We went to Unicode databases to hold our Latin1 characters > because of this. There was even a case where the client was cutting and > pasting ascii text into our database, and it just wouldn't take some of > the letters, giving the same error you reported. > > I'm going to send a more detailed post on the topic, but in general, > we've found that there are four things that need to be done (four, if > you're not serving up web pages) for Latin1 characters to work on > multiple platforms. > > 1. Create the database in Unicode so that it will hold anything you > throw at it. > > 2. When importing data, set the encoding in the script that loads the > data, or if there's no script, use the "SET CLIENT_ENCODING TO > (encoding)" command. Setting the encoding in a tool like pgManager is > not always enough. Use this to be sure. > > 3. When retrieving data in a java application, the JVM encoding will > vary from JVM to JVM, and no attempt on our part to change the JVM > encoding or translate the encoding of the database strings has worked, > either to or from the database. We spent weeks going through every > permutation getBytes("ISO-8859-1") and related calls we could find, but > to no avail. The JVM will tell you it has a new encoding, but postgres > will return gibberish. You can translate the bytes, or get a translated > string, but it's all the same garbage. The solution: set the client > encoding manually through a jdbc prepared statement. Once you set the > client encoding properly, all seems to be fine: > >String DBEncoding = "anEncoding" //use a real encoding, either returned >from the jvm or explicitly stated >PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING >TO '" + DBEncoding + "'"); >statement.execute(); > > 4. If writing html for a web page, make sure the encoding of the web > page matches the encoding of the strings you're throwing at it. So if > you have a Linux JVM that has a "UTF-8" encoding, the web page will need > the html equivalent: > ><meta http-equiv="Content-Type" content="text/html; charset=utf-8"> > >--- > > This is likely far more information than you require, but I thought I'd > add it anyway so that the information is in the archives. It took us > months to solve our problem, even with help from the postgres community, > so I at least want the basics to be posted while I get my act together > and write something with more detail. > > - Mike > > >At 12:12 PM 10/29/2004, Cott Lang wrote: > >ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 > > > >Running 7.4.5, I frequently get this error, and ONLY on this particular > >character despite seeing quite a bit of 8 bit. I don't really follow why > >it can't be converted, it's the same character (239) in both character > >sets. Databases are in ISO8859-1, JDBC driver is defaulting to UTF-8. > > > >Am I flubbing something up? I'm probably going to (reluctantly) convert > >to UTF-8 in the database at some point, but it'd sure be nice if this > >worked without that. :) > > > >thanks! > > > > > > > > > > > > > > > >---------------------------(end of broadcast)--------------------------- > >TIP 8: explain analyze is your friend > > >---------------------------(end of broadcast)--------------------------- >TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html
Cott Lang wrote: > ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 > > Running 7.4.5, I frequently get this error, and ONLY on this particular > character despite seeing quite a bit of 8 bit. I don't really follow why > it can't be converted, it's the same character (239) in both character > sets. Databases are in ISO8859-1, JDBC driver is defaulting to UTF-8. > > Am I flubbing something up? I'm probably going to (reluctantly) convert > to UTF-8 in the database at some point, but it'd sure be nice if this > worked without that. :) > Can you post a code snippet? There's really no such thing as a "UTF-8 character". Java chars and Strings are UTF-16 (or maybe UCS-2 in JVMs prior to 1.5), not UTF-8. Note that 0xEF should not appear by itself in a UTF-8 bytestream. The UTF-8 representation of U+00EF is 0xC3 0xAF. -- ======================================================================== Clearly, there is no political benefit to expediting the admission of legal immigrants into the United States. Nevertheless, I believe that our elected officials have an obligation to do more than simply pander to the thinly veiled racism of their constituents. Ian Pilcher ========================================================================
Thanks for the detailed reply, you've confirmed what I suspected. :) I guess I have some work to do! On Fri, 2004-10-29 at 10:19, J. Michael Crawford wrote: > In my experience, there are just some characters that don't want to be > converted, even if they appear to be part of the normal 8-bit character > system. We went to Unicode databases to hold our Latin1 characters because > of this. There was even a case where the client was cutting and pasting > ascii text into our database, and it just wouldn't take some of the > letters, giving the same error you reported. > > I'm going to send a more detailed post on the topic, but in general, > we've found that there are four things that need to be done (four, if > you're not serving up web pages) for Latin1 characters to work on multiple > platforms. > > 1. Create the database in Unicode so that it will hold anything you > throw at it. > > 2. When importing data, set the encoding in the script that loads the > data, or if there's no script, use the "SET CLIENT_ENCODING TO (encoding)" > command. Setting the encoding in a tool like pgManager is not always > enough. Use this to be sure. > > 3. When retrieving data in a java application, the JVM encoding will > vary from JVM to JVM, and no attempt on our part to change the JVM encoding > or translate the encoding of the database strings has worked, either to or > from the database. We spent weeks going through every permutation > getBytes("ISO-8859-1") and related calls we could find, but to no > avail. The JVM will tell you it has a new encoding, but postgres will > return gibberish. You can translate the bytes, or get a translated string, > but it's all the same garbage. The solution: set the client encoding > manually through a jdbc prepared statement. Once you set the client > encoding properly, all seems to be fine: > > String DBEncoding = "anEncoding" //use a real encoding, either returned > from the jvm or explicitly stated > PreparedStatement statement = dbCon.prepareStatement("SET CLIENT_ENCODING > TO '" + DBEncoding + "'"); > statement.execute(); > > 4. If writing html for a web page, make sure the encoding of the web > page matches the encoding of the strings you're throwing at it. So if you > have a Linux JVM that has a "UTF-8" encoding, the web page will need the > html equivalent: > > <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> > > --- > > This is likely far more information than you require, but I thought I'd > add it anyway so that the information is in the archives. It took us > months to solve our problem, even with help from the postgres community, so > I at least want the basics to be posted while I get my act together and > write something with more detail. > > - Mike > > > At 12:12 PM 10/29/2004, Cott Lang wrote: > >ERROR: could not convert UTF-8 character 0x00ef to ISO8859-1 > > > >Running 7.4.5, I frequently get this error, and ONLY on this particular > >character despite seeing quite a bit of 8 bit. I don't really follow why > >it can't be converted, it's the same character (239) in both character > >sets. Databases are in ISO8859-1, JDBC driver is defaulting to UTF-8. > > > >Am I flubbing something up? I'm probably going to (reluctantly) convert > >to UTF-8 in the database at some point, but it'd sure be nice if this > >worked without that. :) > > > >thanks! > > > > > > > > > > > > > > > >---------------------------(end of broadcast)--------------------------- > >TIP 8: explain analyze is your friend > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html