Thread: A JDBC bug or problem relating to string length in Java vs. PG (long)
String lengths aren't the same in Java and Postgres. _All_ Strings in Java are stored in UTF-16 (1 char = 16 bits), so a string of 100 Chinese characters has a Java String length of 100. But Postgres by default does not store strings as UTF-16. (I couldn't find in the docs what the default is, except that it is taken from the machine it is installed on; in my case that is plain-vanilla Red Hat 9, which I assume defaults to a Latin encoding, i.e. 1 char = 1 byte, right?) If it has 8-bit chars (is this correct?), then the string of 100 Chinese chars would have a length greater than 100 in Postgres-world.

Where this causes problems is in VARCHAR types. Say I have a table with a VARCHAR(100). If I try to store a string longer than 100 chars into it, I get an SQLException, so, being the defensive programmer that I am, I wrote a little static method called truncate:

    static String truncate(String s, int l) {
        if (s.length() < l) return s;
        else return s.substring(0, l);
    }

and then I do this kind of thing:

    preparedStatement.setString(1, truncate(theString, 100));

which should never throw an SQLException, because I know the String length is safe... except that this is a web application, used by people all over the world, including people with non-Latin characters.

What is the solution to this? I think the real solution would be to amend the JDBC spec to add an SQL string length method somewhere that lets us test what the database thinks the length is going to be. Another solution would be to convert the string to a byte[] using UTF-8 (I assume this is what Postgres does) and then look at the length of that byte[]. It would be possible to take the be-very-conservative approach and, for a VARCHAR(100), truncate the Java String to 50, but even that will not work: some characters in UTF-16 encode to more than 2 bytes in UTF-8.
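One way to make the defensive truncation above actually defensive is to truncate by encoded byte length rather than by char count. A minimal sketch, assuming the server side counts UTF-8 bytes; the class and method names (ByteTruncate, truncateUtf8) are illustrative, not part of any API:

```java
import java.nio.charset.StandardCharsets;

public class ByteTruncate {

    /**
     * Truncate s so that its UTF-8 encoding is at most maxBytes long,
     * never splitting a multi-byte character. (A production version
     * should also avoid splitting surrogate pairs.)
     */
    public static String truncateUtf8(String s, int maxBytes) {
        if (s.getBytes(StandardCharsets.UTF_8).length <= maxBytes) {
            return s;
        }
        int end = s.length();
        // Walk back one char at a time until the encoded form fits.
        while (end > 0
               && s.substring(0, end).getBytes(StandardCharsets.UTF_8).length > maxBytes) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'é' (U+00E9) takes two bytes in UTF-8, so only "ab" fits in 3 bytes.
        System.out.println(truncateUtf8("ab\u00e9", 3)); // prints "ab"
    }
}
```

The walk-back loop is quadratic in the worst case, but for VARCHAR-sized strings that hardly matters; the point is that the limit is checked in the encoding the database actually counts.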
Another approach would be for me to use TEXT instead of VARCHAR. I have various reasons for not wanting to do that if I can find some other solution.

Another option would be to tell Postgres that I want to use UTF-16 for its encoding. I don't really care about the disk space, but the problem is that a given page of disk would then only hold half as much data, which means that the OS, which caches disk pages in RAM, can only do half the effective caching, which means I need to buy twice as much RAM to get the same performance, and that would be a major hardware investment.

Finally, it would be good if there were a Postgres option that says, "Silently truncate strings which are too long." That's all I really want to do anyway. It would also be handy if java.lang.String had a length(String encoding) method for this purpose.

Any thoughts on this? All this is with PG 7.3.4, btw.

__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com
On Sun, 31 Aug 2003, joe user wrote:
[long discussion of string length in characters vs. bytes]

Actually varchar(N) in postgresql means the number of characters, not the number of bytes, so you should not have to worry about how it is actually encoded.

Kris Jurka
--- Kris Jurka <books@ejurka.com> wrote:
> Actually varchar(N) in postgresql means the number
> of characters, not the number of bytes, so you
> should not have to worry about how it is actually
> encoded.

I still think there is something else going on here. I definitely take all of my input through a truncate method which truncates strings to 100 chars, and I definitely get a "value too long for type character varying(100)" error every once in a while. The logs show the input string to be some kind of multibyte string whose encoding I don't know. This is a log of the "Referer" header in HTTP requests. There is no specification of the encoding of strings in HTTP headers, so these strings could be anything. I have tried to take those strings out of the logs and use them to make the error happen again, but I can't get it to reproduce.

Could it be that Java takes this input as binary from the net and constructs UTF-16 (using some incorrect default encoding guess) which is not really a valid encoding of anything, gets the length of that, and then PG transforms that invalid UTF-16 into its best-effort UTF-8, which may have a different number of chars, because what Java thinks is a single UTF-16 char is converted (incorrectly) into two UTF-8 chars? I think this is what is happening, and I'm not sure how to handle it. The fundamental problem is that the "get request header" method of the Servlet API returns a String, when it should be returning a byte[], because there is NO way of knowing which encoding the client thinks it is using.

Btw, this, and the null-byte problem, could probably enable various low-bandwidth DoS attacks against any site that uses PG/JDBC. Imagine a typical JDBC use like this:

    try {
        [ .... ]
        preparedStatement.setString(...);
        db.close();
    }
    catch (SQLException sqe) {
        [log it...]
    }

If enough of these multi-byte or null-byte problems are thrown at the app, it will throw an exception in the try block before it can reach the db.close() statement, quickly exhausting connection resources. This is in fact happening on our web application right now. It seems it would be possible to bring down a service with at most a few hundred requests like this.

Any ideas?
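The misdecoding theory above is easy to demonstrate: decode raw bytes with the wrong charset, and both the char count and the re-encoded byte count differ from what was on the wire. A small self-contained sketch (the two bytes 0xC2 0xB8 are one character in UTF-8 but two characters when read as Latin-1):

```java
import java.nio.charset.StandardCharsets;

public class EncodingLength {
    public static void main(String[] args) {
        // Two raw bytes that are a single valid UTF-8 character, '¸' (U+00B8)...
        byte[] raw = { (byte) 0xC2, (byte) 0xB8 };

        // ...but if a default/guessed charset decodes them as Latin-1,
        // Java sees TWO characters: 'Â' (U+00C2) and '¸' (U+00B8).
        String misdecoded = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(misdecoded.length());  // prints 2

        // Re-encoding those two chars as UTF-8 doubles the byte count,
        // since each is a two-byte sequence in UTF-8.
        byte[] utf8 = misdecoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);          // prints 4
    }
}
```

So a header that arrived as N bytes can grow past N bytes by the time it reaches the server, even though its Java length() passed the truncation check.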
On Mon, Sep 01, 2003 at 06:30:24PM -0700, joe user wrote:
> Btw, this, and the null-byte problem, could probably
> cause various low-bandwidth DoS attacks against any
> site that uses PG/JDBC. [...]
> If enough of these multi-byte problems or null
> problems are thrown at the app, it will throw an
> exception in the try block before it can get to the
> db.close() statement, quickly exhausting link
> resources. [...]
>
> Any ideas?

Do the close() in a finally block. It's good practice anyway.

-O
On Mon, 1 Sep 2003, joe user wrote:
> I still think there is something else going on here.
> I definitely take all of my input through a truncate
> method which truncates strings to 100 chars, and I
> definitely get a "value too long for type character
> varying(100)" error every once in a while. [...]

You did not mention what your server encoding is. I assumed it was unicode, but could you confirm this?

Kris Jurka
--- Oliver Jowett <oliver@opencloud.com> wrote:
> Do the close() in a finally block. It's good
> practice anyway.

That's a good idea, but unfortunately Connection.close() throws SQLException, so I would have to do something like this:

    try { ... }
    catch (...) { ... }
    finally {
        try { db.close(); }
        catch (SQLException e) { log(...); }
    }

This works... but isn't Java/JDBC supposed to make our lives easier and let us focus on the problem we're solving, instead of making us write extra hard-to-read boilerplate (a try/catch nested inside a finally block)? I think I'm going to convert everything to JDO so the JDO implementor can handle ALL of this stuff.
--- Kris Jurka <books@ejurka.com> wrote:
> You did not mention what your server encoding is. I
> assumed it was unicode, but could you confirm this?

You mean the Postgres server? I don't know; I never set it. This is a plain-vanilla installation of Red Hat 9, and I downloaded and compiled plain-vanilla PG 7.3.4, so whatever the default encoding is, that is what I'm using. For the Java runtime, I'm also using the plain-vanilla defaults under Red Hat 9. Is there a way to find out?
On Mon, Sep 01, 2003 at 07:03:33PM -0700, joe user wrote:
> That's a good idea, but unfortunately
> Connection.close() throws SQLException, so I would
> have to do something like this:
>
>     try { ... }
>     catch (...) { ... }
>     finally {
>         try { db.close(); }
>         catch (SQLException e) { log(...); }
>     }

Yes. Note that real code is likely to be doing a decent amount of work in the try block (directly or indirectly), so the added complexity of the finally block isn't usually an issue.

The other way of doing it, which has slightly different error-handling behaviour, is:

    try {
        try {
            // ...
        } finally {
            db.close();
        }
    } catch (SQLException e) {
        // ...
    }

but that can be just as hard to read, depending on the exact code.

> This works... but isn't Java/JDBC supposed to make our
> lives easier, and focus on the problem we're solving
> instead of putting in extra hard-to-read boilerplate
> code (a try/catch nested inside a finally block)?

As I see it, JDBC is more about providing a standardized API to many database implementations, which can then be used to build higher-level tools. As it is more of a system-level thing, it makes sense for it to explicitly report everything that goes wrong, since different applications may need different error-handling policies. Think "system library", not "application framework".

> I think I'm going to convert everything to JDO so the
> JDO implementor can handle ALL of this stuff.

More generally, you want an adaptation layer that presents an API that is useful to your application and hides the JDBC details that are irrelevant or can be dealt with in a standardized way. This is a flexibility-vs-convenience design tradeoff.

-O
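Such an adaptation layer can start very small. A hedged sketch of a close-quietly helper that hides the nested try/catch once and for all (the class name Jdbc and method closeQuietly are illustrative, not part of the JDBC API):

```java
import java.sql.Connection;
import java.sql.SQLException;

public final class Jdbc {

    private Jdbc() {}

    /**
     * Close a connection, logging rather than propagating any
     * SQLException. Safe to call with null, so it can sit in a
     * finally block unconditionally.
     */
    public static void closeQuietly(Connection db) {
        if (db == null) {
            return;
        }
        try {
            db.close();
        } catch (SQLException e) {
            // Swallow and log: by the time we are closing, there is
            // usually nothing useful the caller can do about it.
            System.err.println("connection close failed: " + e.getMessage());
        }
    }
}
```

With this helper, application code becomes `finally { Jdbc.closeQuietly(db); }` and the boilerplate lives in exactly one place.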
joe user wrote:
> You mean the Postgres server? I don't know. I never
> set it. This is a plain-vanilla installation of
> RedHat 9, and I downloaded and compiled plain-vanilla
> PG 7.3.4, so whatever the default encoding would be,
> that is what I'm using.

If you didn't set an encoding, you're using whatever locale (?) RH9 installed, probably en_US.UTF-8. This means that your database is probably configured for this also. So if you wanted to insert Japanese in EUC-JP format, your current database would *not* support it. All you can insert into your current database is UTF-8.

If you want to insert strings in anything other than UTF-8, you would need to create a new database and specify the encoding to use for that database (I think it is the -E flag to createdb?).

I think that's right ...

Jean-Christian Imbeault
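For the record, recreating a database with an explicit encoding looks roughly like this at the shell (a sketch for PG 7.3-era tools; the database name "testing" is illustrative, and createdb's -E switch selects the encoding):

```shell
# Drop the old database and recreate it with a Unicode (UTF-8) encoding.
dropdb testing
createdb -E UNICODE testing
```

Existing data must be dumped first and reloaded afterwards, since the encoding is fixed at database creation time.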
Hello,

I have created a reproducible test case for the behavior I was describing, where Java thinks the string length is one thing but PG thinks it is something else (longer), throwing an SQLException. I'm not sure where the bug is, but it's a real bug and it's somewhere. I have attached the test case as a file. I would appreciate it if someone else could verify that he sees this behavior and that it is a bug.

Test details:

    Tested on both Red Hat 9 and SuSE 8.2
    Sun JDK 1.4.2
    Postgres 7.3.4, with the JDBC driver that comes with it

Linux was installed "plain vanilla". If anyone has questions about my Linux environment I can answer them. Postgres was also compiled "plain vanilla", like this:

    ./configure --with-java
    make

and that is all. The output that I get from my test program is this:

    % java SqlClientTest
    The length of the evil string is: 99
    Exception in thread "main" java.sql.SQLException: ERROR: value too long for type character varying(100)

Clearly, Java thinks the string is 99 chars, and Postgres thinks it is longer than 100 chars. Any feedback or information about this would be most welcome, especially any suggestions of fixes or workarounds. Thanks!

import java.sql.*;

public class SqlClientTest {

    /** Truncates a String s to length i or less */
    public static String truncate(String s, int i) {
        if (s.length() > i) return s.substring(0, i);
        else return s;
    }

    public static byte[] evilBytes = {
        (byte) 0150, (byte) 0164, (byte) 0164, (byte) 0160, (byte) 0072,
        (byte) 0057, (byte) 0057, (byte) 0163, (byte) 0145, (byte) 0145,
        (byte) 0153, (byte) 0056, (byte) 0063, (byte) 0067, (byte) 0062,
        (byte) 0061, (byte) 0056, (byte) 0143, (byte) 0157, (byte) 0155,
        (byte) 0057, (byte) 0063, (byte) 0067, (byte) 0062, (byte) 0061,
        (byte) 0057, (byte) 0141, (byte) 0144, (byte) 0166, (byte) 0137,
        (byte) 0162, (byte) 0145, (byte) 0156, (byte) 0144, (byte) 0145,
        (byte) 0162, (byte) 0056, (byte) 0150, (byte) 0164, (byte) 0155,
        (byte) 0077, (byte) 0141, (byte) 0144, (byte) 0166, (byte) 0151,
        (byte) 0075, (byte) 0141, (byte) 0144, (byte) 0166, (byte) 0137,
        (byte) 0162, (byte) 0151, (byte) 0147, (byte) 0150, (byte) 0164,
        (byte) 0046, (byte) 0156, (byte) 0141, (byte) 0155, (byte) 0145,
        (byte) 0075, (byte) 0302, (byte) 0270, (byte) 0303, (byte) 0237,
        (byte) 0303, (byte) 0220, (byte) 0302, (byte) 0275, (byte) 0303,
        (byte) 0226, (byte) 0302, (byte) 0260, (byte) 0303, (byte) 0216,
        (byte) 0302, (byte) 0273, (byte) 0303, (byte) 0225, (byte) 0303,
        (byte) 0220, (byte) 0303, (byte) 0206, (byte) 0302, (byte) 0270,
        (byte) 0046, (byte) 0160, (byte) 0141, (byte) 0147, (byte) 0145,
        (byte) 0075, (byte) 0061, (byte) 0046, (byte) 0167, (byte) 0151,
        (byte) 0144, (byte) 0164, (byte) 0150, (byte) 0075, (byte) 0061,
        (byte) 0063, (byte) 0060, (byte) 0012
    };

    /** This exposes a bug in the Postgres JDBC. It relies on a database with
     *  this table:
     *  <pre>
     *  CREATE TABLE foo (bar VARCHAR(100));
     *  </pre>
     */
    public static void main(String args[]) throws Exception {
        String driver = "org.postgresql.Driver";
        String connectionUrl = "jdbc:postgresql://localhost:5432/testing";
        Class.forName(driver).newInstance();
        Connection db = DriverManager.getConnection(connectionUrl, "me", "");

        String qstring = "INSERT INTO foo (bar) VALUES (?)";
        PreparedStatement ps = db.prepareStatement(qstring);

        String evilString = new String(evilBytes);
        evilString = truncate(evilString, 99);
        System.out.println("The length of the evil string is: " + evilString.length());

        ps.setString(1, evilString);
        ps.executeUpdate();
        ps.close();
        db.close();
        System.out.println("Done.");
    }
}
On Mon, 1 Sep 2003, joe user wrote:
> You mean the Postgres server? I don't know. I never
> set it. This is a plain-vanilla installation of
> RedHat 9, and I downloaded and compiled plain-vanilla
> PG 7.3.4, so whatever the default encoding would be,
> that is what I'm using. [...]
> Is there a way to find out?

SELECT datname, pg_encoding_to_char(encoding) FROM pg_database;

Kris Jurka
> SELECT datname,pg_encoding_to_char(encoding) FROM
> pg_database;

Thanks for the suggestion, Kris. I also got some help from Barry. The problem is that the encoding in the db was SQL_ASCII, which causes a discrepancy between Java string lengths and the string lengths in the db. The solution is to drop and recreate the db with Unicode encoding.

One conclusion from this is that, as a Java developer, unless you are absolutely sure that all the strings you are dealing with really are ASCII, you should make sure the db uses Unicode encoding. Java does not internally use ASCII for anything: everything in Java is UTF-16, so any external storage Java uses should be able to represent the full range of UTF characters. I think this would make a good entry in the JDBC FAQ: "Always use Unicode encoding with JDBC unless you are sure that you are only dealing with ASCII."

Thanks