Thread: Selecting on non ASCII varchars

Selecting on non ASCII varchars

From
Jeremy LaCivita
Date:
Hi,

I have a unicode database.  Inserting unicode strings works fine.
Selecting data based on int columns works fine too.

However, I am unable to select based on varchar columns when the
select contains non ascii characters.

the same select will work in Aqua Data Studio, just not from java.
Am i setting up my connections or prepared statements wrong?

/* begin example code */
javax.naming.InitialContext ctx = new javax.naming.InitialContext();
javax.sql.DataSource ref1 = (javax.sql.DataSource)ctx.lookup("java:/
PostgresDS");
Connection conn = ref1.getConnection();
PreparedStatement pst = conn.prepareStatement("SELECT * from mytable
m where m.title ~* ?");
pst.setString(1, myString);
ResultSet rs = pst.executeQuery();
/* end example code */

mytable.title is a varchar(300)
myString is a java.lang.String which was loaded from a unicode xml
stream.

whenever myString contains accented or chinese characters, for
example, the result set will be empty even though there are records
in the database that should match.  doing the same query manually in
aqua data studio works fine.

I'm using postgres 8.0.3

Any ideas?

-Jeremy

Re: Selecting on non ASCII varchars

From
Oliver Jowett
Date:
Jeremy LaCivita wrote:

> PreparedStatement pst = conn.prepareStatement("SELECT * from mytable  m
> where m.title ~* ?");

If you use direct equality (=), does it work?

There have been comments on pgsql-bugs recently that some areas of the
backend code (case insensitive comparison and regexp) do not work
correctly in all cases when multibyte encodings are used. You might want
to repost to -bugs if basic equality works correctly.

Do you have a selfcontained testcase we can try? In particular we need
to know the actual column values and regexp patterns you have problems with.

-O

Re: Selecting on non ASCII varchars

From
Jeremy LaCivita
Date:
Hmmm

so it turns out if i take all my Strings and do this:

str = new String(str.getBytes(), "utf-8");

then it works.

Correct me if i'm wrong, but that says to me that the Strings were in
UTF-8 already, but Java didn't know it, so it couldn't send them to
postgres properly.

because str.getBytes() will return the same bytes that were used to
create the string, and new String(bytes, "utf-8") will repackage them
into a string using utf-8, so nothing has really changed at the byte
level,  java has just explicitly marked it as UTF-8.

Anyway, problem solved.  As to why my strings aren't flagged as
UTF-8, thats not a postgres problem.

Thanks!

-jl

On Oct 2, 2005, at 9:41 PM, Oliver Jowett wrote:

> Jeremy LaCivita wrote:
>
>
>> PreparedStatement pst = conn.prepareStatement("SELECT * from
>> mytable  m
>> where m.title ~* ?");
>>
>
> If you use direct equality (=), does it work?
>
> There have been comments on pgsql-bugs recently that some areas of the
> backend code (case insensitive comparison and regexp) do not work
> correctly in all cases when multibyte encodings are used. You might
> want
> to repost to -bugs if basic equality works correctly.
>
> Do you have a selfcontained testcase we can try? In particular we need
> to know the actual column values and regexp patterns you have
> problems with.
>
> -O
>


Attachment

Re: Selecting on non ASCII varchars

From
"Kevin Grittner"
Date:
A String object doesn't contain an array of bytes; it contains an
array of characters.  Somehow you created String objects from
bytes using the wrong character encoding technique (not to be
confused with a character set).  Your str.getBytes() is using the
default encoding scheme to convert the characters to bytes.  In
this case, it seems that all the characters are mapping back to
the original bytes, although I don't think that's always necessarily
going to happen.  By specifying the "utf-8" in the String
constructor, you're telling it to use a specific encoding technique
to convert those bytes to characters.

There is nothing in a String object to "flag" it for any particular
encoding.  The encoding only comes into play when turning
bytes into characters or vice versa.

-Kevin


>>> Jeremy LaCivita <jlacivita@broadrelay.com> 10/04/05 3:16 PM >>>
Hmmm

so it turns out if i take all my Strings and do this:

str = new String(str.getBytes(), "utf-8");

then it works.

Correct me if i'm wrong, but that says to me that the Strings were in
UTF-8 already, but Java didn't know it, so it couldn't send them to
postgres properly.

because str.getBytes() will return the same bytes that were used to
create the string, and new String(bytes, "utf-8") will repackage them
into a string using utf-8, so nothing has really changed at the byte
level,  java has just explicitly marked it as UTF-8.

Anyway, problem solved.  As to why my strings aren't flagged as
UTF-8, thats not a postgres problem.

Thanks!

-jl

On Oct 2, 2005, at 9:41 PM, Oliver Jowett wrote:

> Jeremy LaCivita wrote:
>
>
>> PreparedStatement pst = conn.prepareStatement("SELECT * from
>> mytable  m
>> where m.title ~* ?");
>>
>
> If you use direct equality (=), does it work?
>
> There have been comments on pgsql-bugs recently that some areas of the
> backend code (case insensitive comparison and regexp) do not work
> correctly in all cases when multibyte encodings are used. You might
> want
> to repost to -bugs if basic equality works correctly.
>
> Do you have a selfcontained testcase we can try? In particular we need
> to know the actual column values and regexp patterns you have
> problems with.
>
> -O
>



Re: Selecting on non ASCII varchars

From
Jeremy LaCivita
Date:
Makes sense.  So i guess when my original UTF-8 byte stream comes to
flash, its creating a new string with the default encoding out of
those bytes.

makes sense that it would fix it by getting the bytes again using the
default encoding and then recreating the string as UTF-8.

I agree its risky though, since its not all under my control.

I'll have to find out where the strings are getting created initially.

thanks!

-jl

On Oct 4, 2005, at 4:37 PM, Kevin Grittner wrote:

> A String object doesn't contain an array of bytes; it contains an
> array of characters.  Somehow you created String objects from
> bytes using the wrong character encoding technique (not to be
> confused with a character set).  Your str.getBytes() is using the
> default encoding scheme to convert the characters to bytes.  In
> this case, it seems that all the characters are mapping back to
> the original bytes, although I don't think that's always necessarily
> going to happen.  By specifying the "utf-8" in the String
> constructor, you're telling it to use a specific encoding technique
> to convert those bytes to characters.
>
> There is nothing in a String object to "flag" it for any particular
> encoding.  The encoding only comes into play when turning
> bytes into characters or vice versa.
>
> -Kevin
>
>
>
>>>> Jeremy LaCivita <jlacivita@broadrelay.com> 10/04/05 3:16 PM >>>
>>>>
> Hmmm
>
> so it turns out if i take all my Strings and do this:
>
> str = new String(str.getBytes(), "utf-8");
>
> then it works.
>
> Correct me if i'm wrong, but that says to me that the Strings were in
> UTF-8 already, but Java didn't know it, so it couldn't send them to
> postgres properly.
>
> because str.getBytes() will return the same bytes that were used to
> create the string, and new String(bytes, "utf-8") will repackage them
> into a string using utf-8, so nothing has really changed at the byte
> level,  java has just explicitly marked it as UTF-8.
>
> Anyway, problem solved.  As to why my strings aren't flagged as
> UTF-8, thats not a postgres problem.
>
> Thanks!
>
> -jl
>
> On Oct 2, 2005, at 9:41 PM, Oliver Jowett wrote:
>
>
>> Jeremy LaCivita wrote:
>>
>>
>>
>>> PreparedStatement pst = conn.prepareStatement("SELECT * from
>>> mytable  m
>>> where m.title ~* ?");
>>>
>>>
>>
>> If you use direct equality (=), does it work?
>>
>> There have been comments on pgsql-bugs recently that some areas of
>> the
>> backend code (case insensitive comparison and regexp) do not work
>> correctly in all cases when multibyte encodings are used. You might
>> want
>> to repost to -bugs if basic equality works correctly.
>>
>> Do you have a selfcontained testcase we can try? In particular we
>> need
>> to know the actual column values and regexp patterns you have
>> problems with.
>>
>> -O
>>
>>
>
>
>


Attachment

Re: Selecting on non ASCII varchars

From
Vadim Nasardinov
Date:
On Tuesday 04 October 2005 16:16, Jeremy LaCivita wrote:
> Hmmm
>
> so it turns out if i take all my Strings and do this:
>
> str = new String(str.getBytes(), "utf-8");
>
> then it works.
>
> Correct me if i'm wrong, but that says to me that the Strings were
> in UTF-8 already, but Java didn't know it, so it couldn't send them
> to postgres properly.

It's meaningless to ask what encoding a String has.  String are
sequence of chars -- they don't have an encoding.  The notion of
"encoding" comes into play only when you have to represent a String as
a sequence of bytes.

So, if this returns true for you:

   str.equals(new String(str.getBytes(), "utf-8"));

that means your default encoding is either utf-8 or a subset of utf-8,
at least for the characters found in str.

String#getBytes() uses the default encoding which may be specified via
the environment variable LANG on on Unix-like systems.

So, if my default encoding is UTF-8, I get this:

| $ echo $LANG
| en_US.UTF-8
| $ bsh2
| BeanShell 2.0-0.b1.7jpp - by Pat Niemeyer (pat@pat.net)
| bsh % print(System.getProperty("file.encoding"));
| UTF-8
| bsh % str = "Funny char: \u00e8";
| bsh % print(str);
| Funny char: è
| bsh % print(str.equals(new String(str.getBytes(), "utf-8")));
| true
| bsh %

If I change the default encoding to ISO-8859-1, I get this:

| $ env LANG=en_US.iso88591 bsh2
| BeanShell 2.0-0.b1.7jpp - by Pat Niemeyer (pat@pat.net)
| bsh % print(System.getProperty("file.encoding"));
| ISO-8859-1
| bsh % str = "Funny char: \u00e8";
| bsh % print(str);
| Funny char: è
| bsh % print(str.equals(new String(str.getBytes(), "utf-8")));
| false
| bsh %

Re: Selecting on non ASCII varchars

From
Marc Herbert
Date:
Vadim Nasardinov <vadimn@redhat.com> writes:

> On Tuesday 04 October 2005 16:16, Jeremy LaCivita wrote:

>> Correct me if i'm wrong, but that says to me that the Strings were
>> in UTF-8 already, but Java didn't know it, so it couldn't send them
>> to postgres properly.
>
> It's meaningless to ask what encoding a String has.  String are
> sequence of chars -- they don't have an encoding.

Actually they are encoded using UTF-16

 <http://java.sun.com/developer/technicalArticles/Intl/Supplementary/>

Granted, this is the no-brainer "same value" encoding... as long as
codepoint < U+FFFF