Thread: A JDBC bug or problem relating to string length in Java vs. PG (long)

A JDBC bug or problem relating to string length in Java vs. PG (long)

From
joe user
Date:
String lengths aren't the same in Java and Postgres.
_All_ Strings in Java are stored in UTF-16 (1 char =
16 bits).  That means that a string of 100 Chinese
characters will have a Java String length of 100.
But, Postgres by default does not store all strings as
UTF-16.  (I couldn't find in the docs what the default
is, except that it takes it from the computer it is
installed on, which in my case is plain-vanilla Redhat
9, which I assume uses a Latin encoding as the
default, which is 1 char = 1 byte, right?)  It has
8-bit chars (is this correct?), so that the string of
100 Chinese chars would have a length greater than 100
in Postgres-world.

Where this causes problems is in VARCHAR types.  Let's
say I have a table with a VARCHAR(100).  If I try to
store a string longer than 100 chars into that, I will
get an SQLException, so, being the defensive
programmer that I am, I write a little static method
called truncate that looks like this:

    static String truncate(String s, int l) {
        if(s.length() < l) return s;
        else return s.substring(0, l);
    }

and then I do this kind of thing:

    preparedStatement.setString(1,
        truncate(theString, 100));

which should never throw an SQLException because I
know the String length is safe... except that this is
a web application, being used by people all over the
world, including people with non-Latin characters.

What is the solution to this?  I think that the real
solution would be to ammend the JDBC spec to say that
there should be an SQL string length method somewhere
that lets us test what the database thinks is going to
be the length.

Another solution would be to convert the string to a
byte[] using UTF-8 (I assume this is what Postgres
does) and then look at the length of that byte[].

It would be possible to take the be-very-conservative
approach, and, for a VARCHAR(100), truncate the Java
String to 50, but even this will not work.  I think
there are some characters in UTF-16 that encode to
more than 2 bytes in UTF-8.

Another approach would be for me to use TEXT instead
of VARCHAR.  I have various reasons for not wanting to
do that if I could find some other solution.

Another option would be to tell Postgres that I want
to use UTF-16 for its encoding.  I don't really care
about the disk space, but the problem is that this
means that a given page of disk only holds half as
much data, which means that the OS, which caches disk
pages in RAM, will only be able to do half the
effective caching, which means that I need to buy
twice as much RAM to get the same performance, and
that would be a major hardware investment.

Finally, it would be good if there were a Postgres
option that says, "Silently truncate strings which are
too long."  That's all I really want to do anyway.

It would actually be cool if java.lang.String had a
length(String encoding) method for this purpose.

Any thoughts on this?  All this is with PG 7.3.4, btw.


__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com

Re: A JDBC bug or problem relating to string length in Java

From
Kris Jurka
Date:

On Sun, 31 Aug 2003, joe user wrote:

[long discussion of string length in characters vs. bytes]

Actually varchar(N) in postgresql means the number of characters, not the
number of bytes, so you should not have to worry about how it is actually
encoded.

Kris Jurka


Re: A JDBC bug or problem relating to string length in Java

From
joe user
Date:
--- Kris Jurka <books@ejurka.com> wrote:
> Actually varchar(N) in postgresql means the number
> of characters, not the
> number of bytes, so you should not have to worry
> about how it is actually
> encoded.

I still think there is something else going on here.
I definitely take all of my input through a truncate
method which truncates strings to 100 chars, and I
definitely get a "value too long for type character
varying(100)" error every once in a while.  The logs
show the input string to be some kind of multibyte
string, which I don't know the encoding of.  This is a
log of the "referrer" header in http requests.  There
is no specification of encoding of strings in http
headers, so these strings could be anything.  I have
tried to take those strings out of the logs and use
them to make the error happen again but I'm not
getting it to reduplicate.

Could it be possible that Java is taking this input as
binary form from the net, and constructing UTF-16
(using some incorrect default encoding guess) which is
not really valid encoding of anything, and getting the
length of that, and then PG transforms that invalid
UTF-16 into its best effort of UTF-8, which may have a
different number of chars because what Java thinks is
a single UTF-16 char is converted (incorrectly) into
two UTF-8 chars?

I think this is what is happening, and I'm not sure
how to handle this.

The fundamental problem is that the "get request
header" method of the Servlet API returns a String,
when it should be returning a byte[], because there is
NO way of knowing which encoding the client thinks it
is using.

Btw, this, and the null-byte problem, could probably
cause various low-bandwidth DoS attacks against any
site that uses PG/JDBC.  Imagine a typical JDBC use
like this:

    try {
        [ .... ]
        preparedStatement.setString(...);
        db.close();
    }
    catch(SQLException sqe) { [log it...] }

If enough of these multi-byte problems or null
problems are thrown at the app, it will throw an
exception in the try block before it can get to the
db.close() statement, quickly exhausting link
resources.  This is in fact happening on our web
application right now.  It seems that it would be
possible to bring down a service with at most a few
hundred requests like this.

Any ideas?


__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com

Re: A JDBC bug or problem relating to string length in Java

From
Oliver Jowett
Date:
On Mon, Sep 01, 2003 at 06:30:24PM -0700, joe user wrote:

> Btw, this, and the null-byte problem, could probably
> cause various low-bandwidth DoS attacks against any
> site that uses PG/JDBC.  Imagine a typical JDBC use
> like this:
>
>     try {
>         [ .... ]
>         preparedStatement.setString(...);
>         db.close();
>     }
>     catch(SQLException sqe) { [log it...] }
>
> If enough of these multi-byte problems or null
> problems are thrown at the app, it will throw an
> exception in the try block before it can get to the
> db.close() statement, quickly exhausting link
> resources.  This is in fact happening on our web
> application right now.  It seems that it would be
> possible to bring down a service with at most a few
> hundred requests like this.
>
> Any ideas?

Do the close() in a finally block. It's good practice anyway.

-O

Re: A JDBC bug or problem relating to string length in Java

From
Kris Jurka
Date:

On Mon, 1 Sep 2003, joe user wrote:

>
> --- Kris Jurka <books@ejurka.com> wrote:
> > Actually varchar(N) in postgresql means the number
> > of characters, not the
> > number of bytes, so you should not have to worry
> > about how it is actually
> > encoded.
>
> I still think there is something else going on here.
> I definitely take all of my input through a truncate
> method which truncates strings to 100 chars, and I
> definitely get a "value too long for type character
> varying(100)" error every once in a while.  The logs
> show the input string to be some kind of multibyte
> string, which I don't know the encoding of.  This is a
> log of the "referrer" header in http requests.  There
> is no specification of encoding of strings in http
> headers, so these strings could be anything.  I have
> tried to take those strings out of the logs and use
> them to make the error happen again but I'm not
> getting it to reduplicate.

You did not mention what your server encoding is.  I assumed it was
unicode, but could you confirm this?

Kris Jurka



Re: A JDBC bug or problem relating to string length in Java

From
joe user
Date:
--- Oliver Jowett <oliver@opencloud.com> wrote:
> Do the close() in a finally block. It's good
> practice anyway.

That's a good idea, but unfortunately
Connection.close() throws SQLException, so I would
have to do something like this:

try { }
catch { }
finally {
    try { db.close(); }
    catch(SQLException e) { log(...); }
}

This works... but isn't Java/JDBC supposed to make our
lives easier, and focus on the problem we're solving
instead of putting in extra hard-to-read boilerplate
code (a try/catch nested inside a finally block)?

I think I'm going to convert everything to JDO so the
JDO implementor can handle ALL of this stuff.


__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com

Re: A JDBC bug or problem relating to string length in Java

From
joe user
Date:
--- Kris Jurka <books@ejurka.com> wrote:
> You did not mention what your server encoding is.  I
> assumed it was
> unicode, but could you confirm this?

You mean the Postgres server?  I don't know.  I never
set it.  This is a plain-vanilla installation of
RedHat 9, and I downloaded and compiled plain-vanilla
PG 7.3.4, so whatever the default encoding would be,
that is what I'm using.  For the Java runtime, I'm
also using the plain-vanilla defaults under Redhat 9.
Is there a way to find out?



__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com

Re: A JDBC bug or problem relating to string length in Java

From
Oliver Jowett
Date:
On Mon, Sep 01, 2003 at 07:03:33PM -0700, joe user wrote:
>
> --- Oliver Jowett <oliver@opencloud.com> wrote:
> > Do the close() in a finally block. It's good
> > practice anyway.
>
> That's a good idea, but unfortunately
> Connection.close() throws SQLException, so I would
> have to do something like this:
>
> try { }
> catch { }
> finally {
>     try { db.close(); }
>     catch(SQLException e) { log(...); }
> }

Yes. Note that real code is likely to be doing a decent amount of work in
the try block (directly or indirectly), so the added complexity of the
finally block isn't usually an issue.

The other way of doing it, which has slightly different error-handling
behaviour, is:

  try {
     try {
       // ...
     } finally {
       db.close();
     }
  } catch (SQLException e) {
    // ...
  }

but that can be just as hard to read, depending on the exact code.

> This works... but isn't Java/JDBC supposed to make our
> lives easier, and focus on the problem we're solving
> instead of putting in extra hard-to-read boilerplate
> code (a try/catch nested inside a finally block)?

As I see it, JDBC is more about providing a standardized API to many
database implementations, that can then be used to build higher-level tools.
As it's more of a system-level thing it does make sense to explicitly report
everything that goes wrong, as different applications may need different
error-handling policies. Think "system library" not "application framework".

> I think I'm going to convert everything to JDO so the
> JDO implementor can handle ALL of this stuff.

More generally, you want an adaption layer that presents an API that is
useful to your application and hides the JDBC details that are irrelevant or
can be dealt with in a standardized way. This is a flexibility-vs-convenience
design tradeoff.

-O

Re: A JDBC bug or problem relating to string length in Java

From
Jean-Christian Imbeault
Date:
joe user wrote:
>
> You mean the Postgres server?  I don't know.  I never
> set it.  This is a plain-vanilla installation of
> RedHat 9, and I downloaded and compiled plain-vanilla
> PG 7.3.4, so whatever the default encoding would be,
> that is what I'm using.

If you didn't set an encoding you're using whatever locale (?) RH9
installed, probably en_US.UTF-8. This means that your database is
probably configured for this also.

So if you wanted to insert japanese in EUC-JP format, your current
database would *not* support it. All you can insert into your current
database is UTF-8. If you want to insert strings in anything else then
UTF-8 you would need to create a new database and specify the encoding
to use for that database (I think it is the -D flag?).

I think that's right ...

Jean-Christian Imbeault


Test case (re: A JDBC bug or problem relating to string length in Java)

From
joe user
Date:
Hello, I have created a reproducible test case for the
behavior I was describing, where Java thinks the
string length is one thing but PG thinks it is
something else (longer) throwing an SQLException.  I'm
not sure where the bug is, but it's a real bug and
it's somewhere.  I have attached the test case as a
file.  I would appreciate it if someone else could
verify that he sees this behavior and that it is a
bug.

Test details:

Tested on both Redhat 9, and SuSE 8.2
Sun JDK 1.4.2
Postgres 7.3.4, with the JDBC that comes with it

Linux was installed "plain vanilla".  If anyone has
questions about my Linux environment I can answer
them.  Postgres was compiled also "plain vanilla",
like this:

./configure --with-java
make

and that is all.

The output that I get from my test program is this:

% java SqlClientTest
The length of the evil string is: 99
Exception in thread "main" java.sql.SQLException:
ERROR:  value too long for type character varying(100)

Clearly, Java thinks the string is 99 chars, and
Postgres thinks it is longer than 100 chars.  Any
feedback or information about this would be most
welcome, especially any suggestions of fixes or
workarounds.

Thanks!


__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.comimport java.sql.*;

public class SqlClientTest {

    /** Truncates a String s to length i or less */
    public static String truncate(String s, int i) {
    if(s.length() > i) return s.substring(0, i);
    else return s;
    }

    public static byte[] evilBytes = { (byte) 0150, (byte) 0164, (byte) 0164, (byte) 0160,
(byte) 0072, (byte) 0057, (byte) 0057, (byte) 0163, (byte) 0145,
(byte) 0145, (byte) 0153, (byte) 0056, (byte) 0063, (byte) 0067,
(byte) 0062, (byte) 0061, (byte) 0056, (byte) 0143, (byte) 0157, (byte) 0155,
(byte) 0057, (byte) 0063, (byte) 0067, (byte) 0062, (byte) 0061,
(byte) 0057, (byte) 0141, (byte) 0144, (byte) 0166, (byte) 0137,
(byte) 0162, (byte) 0145, (byte) 0156, (byte) 0144, (byte) 0145, (byte) 0162,
(byte) 0056, (byte) 0150, (byte) 0164, (byte) 0155,
(byte) 0077, (byte) 0141, (byte) 0144, (byte) 0166,
(byte) 0151, (byte) 0075, (byte) 0141, (byte) 0144,
(byte) 0166, (byte) 0137, (byte) 0162, (byte) 0151, (byte) 0147,
(byte) 0150, (byte) 0164, (byte) 0046, (byte) 0156, (byte) 0141,
(byte) 0155, (byte) 0145, (byte) 0075, (byte) 0302, (byte) 0270, (byte) 0303,
(byte) 0237, (byte) 0303, (byte) 0220, (byte) 0302, (byte) 0275, (byte) 0303,
(byte) 0226, (byte) 0302, (byte) 0260, (byte) 0303, (byte) 0216, (byte) 0302,
(byte) 0273, (byte) 0303, (byte) 0225, (byte) 0303, (byte) 0220, (byte) 0303,
(byte) 0206, (byte) 0302, (byte) 0270, (byte) 0046, (byte) 0160, (byte) 0141,
(byte) 0147, (byte) 0145, (byte) 0075, (byte) 0061, (byte) 0046, (byte) 0167,
(byte) 0151, (byte) 0144, (byte) 0164, (byte) 0150, (byte) 0075, (byte) 0061,
(byte) 0063, (byte) 0060, (byte) 0012 };

    /** This exposes a bug in the Postgres JDBC.  It relies on a database with
     * this table:
     * <pre>
     * CREATE TABLE foo (bar VARCHAR(100));
     * </pre> */
    public static void main(String args[]) throws Exception {

    String driver = "org.postgresql.Driver";
    String connectionUrl = "jdbc:postgresql://localhost:5432/testing";
    Class.forName(driver).newInstance();
    Connection db = DriverManager.getConnection(connectionUrl, "me", "");

    String qstring = "INSERT INTO foo (bar) VALUES (?)";

    PreparedStatement ps = db.prepareStatement(qstring);

    String evilString = new String(evilBytes);

    evilString = truncate(evilString, 99);

    System.out.println("The length of the evil string is: " + evilString.length());

    ps.setString(1, evilString);
    ps.executeUpdate();
    ps.close();
    db.close();
    System.out.println("Done.");
    }
}

Re: A JDBC bug or problem relating to string length in Java

From
Kris Jurka
Date:

On Mon, 1 Sep 2003, joe user wrote:

>
> --- Kris Jurka <books@ejurka.com> wrote:
> > You did not mention what your server encoding is.  I
> > assumed it was
> > unicode, but could you confirm this?
>
> You mean the Postgres server?  I don't know.  I never
> set it.  This is a plain-vanilla installation of
> RedHat 9, and I downloaded and compiled plain-vanilla
> PG 7.3.4, so whatever the default encoding would be,
> that is what I'm using.  For the Java runtime, I'm
> also using the plain-vanilla defaults under Redhat 9.
> Is there a way to find out?

SELECT datname,pg_encoding_to_char(encoding) FROM pg_database;

Kris Jurka



Re: A JDBC bug or problem relating to string length in Java

From
joe user
Date:
> SELECT datname,pg_encoding_to_char(encoding) FROM
> pg_database;

Thanks for the suggestion Kris.  I also got some help
from Barry.  The problem is that the encoding in the
db was ascii, which causes a discrepency between Java
string lengths and the string lengths in the db.  The
solution is to drop and recreate the db using unicode
encoding.

One conclusion from this is that, as a Java developer,
unless you are absolutely sure that all the strings
you are dealing with really are ascii, you should be
sure that the db has unicode encoding.  Java does not
internally use ascii for anything.  Everything in Java
is UTF-16, so external storage that Java uses should
be able to represent the full range of UTF characters.
 I think this would make a good entry in the JDBC FAQ:
"Always use Unicode encoding when using the JDBC
unless you are sure that you are only dealing with
ASCII."

Thanks


__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com