Thread: Encoding nightmare! Pls help!

Encoding nightmare! Pls help!

From
"John Sidney-Woollett"
Date:
I've had a discussion on the general list about the implications of
storing accented characters in a postgres (7.4.1) db.

As a result, I have created a database with no locale, i.e. the C locale
(using initdb {other params} --no-locale).

Here's the database (test) with UNICODE encoding

         List of databases
     Name     |  Owner   | Encoding
--------------+----------+----------
 test         | postgres | UNICODE
 template0    | postgres | UNICODE
 template1    | postgres | UNICODE

The problem I'm having is that I CANNOT write accented characters into the
database or get them out correctly using my java code and the
pg74.1jdbc3.jar file.

Using psql, I select the data (with client encoding = UNICODE), and I get

tést.jpg

With client encoding = LATIN1, I get

tést.jpg

But in my little java test app, I get:

tést.jpg, tést.jpg, tést.jpg,

I want tést.jpg!!!!!

Here is the offending section of code:

String filename = rset.getString(2);
System.out.print(filename);
System.out.print(", ");

if (filename != null)
{
  try
  {
    filename = new String(rset.getBytes(2), "UTF-8");
  }
  catch (UnsupportedEncodingException e)
  {
    System.out.println("Cannot decode string?");
  }

  System.out.print(filename);
  System.out.print(", ");

  try
  {
    filename = new String(rset.getBytes(2), "ISO-8859-1");
  }
  catch (UnsupportedEncodingException e)
  {
    System.out.println("Cannot decode string?");
  }

  System.out.print(filename);
  System.out.print(", ");
}

Can anyone explain what I am doing wrong? I have become so confused by all
this that I don't think I can see the problem straight anymore.

How can I read and write Unicode chars into the db? Is there some magic
parameter that needs to be passed when setting up the connection/driver?

Thanks for any/all help!!

John Sidney-Woollett



Re: Encoding nightmare! Pls help!

From
Barry Lind
Date:
John,

My first guess is that your code is working fine, and it is just your
System.out.print() calls that are the problem.  You haven't specified
what character set to use when printing, so it will use the default
character set for your JVM.  This conversion is necessary since Java
strings are stored internally as UCS-2 and Java needs to convert to some
other character set when printing them out.
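
A minimal sketch of what specifying the character set when printing could look like. The class and method names here are hypothetical and not from the original code; the only library call relied on is the `PrintStream(OutputStream, boolean, String)` constructor. It assumes the terminal or log viewer actually expects the encoding you pick; if it does not, the output will still look garbled even though the bytes are correct.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitOutputEncoding {

    // Wraps a stream in a PrintStream that encodes text with the named
    // charset rather than the JVM default.
    static PrintStream withEncoding(OutputStream target, String charsetName) {
        try {
            return new PrintStream(target, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Returns the bytes that printing `text` in the named charset would emit.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = withEncoding(sink, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // The test filename with e-acute (U+00E9), written as an escape so
        // the encoding of this source file is irrelevant.
        String filename = "t\u00e9st.jpg";

        // Displays correctly only if the terminal itself expects UTF-8.
        withEncoding(System.out, "UTF-8").println(filename);

        // The same string becomes different bytes depending on the charset:
        // e-acute is two bytes (C3 A9) in UTF-8 and one byte (E9) in ISO-8859-1.
        System.out.println(printedBytes(filename, "UTF-8").length);
        System.out.println(printedBytes(filename, "ISO-8859-1").length);
    }
}
```

If the string is correct inside the JVM, the byte counts differ by exactly one between the two encodings, which is a quick way to tell a display problem from a storage problem.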

thanks,
--Barry



Re: Encoding nightmare! Pls help!

From
Paul Thomas
Date:
On 02/02/2004 18:44 John Sidney-Woollett wrote:
> [snip]
> Can anyone explain what I am doing wrong? I have become so confused by all
> this that I don't think I can see the problem straight anymore.
>
> How can I read and write Unicode chars into the db? Is there some magic
> parameter that needs to be passed when setting up the connection/driver?
>
> Thanks for any/all help!!
>
> John Sidney-Woollett

A total shot in the dark but what have you got your JVM Locale/encoding
set to? I've not had the need to delve too deeply into this area but this
link might be of some help:

http://java.sun.com/developer/JDCTechTips/2003/tt0110.html
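
One way to answer the "what is the JVM set to" question is to print the relevant settings from inside the application. This is a sketch only: the class name is hypothetical, and `Charset.defaultCharset()` needs Java 5 or later, which is newer than the JVMs that were current when this thread was written. On older JVMs the `file.encoding` system property alone would have to do.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class ShowJvmEncoding {

    // Collects the settings that decide how the JVM converts strings to
    // bytes when no charset is given explicitly.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name()
            + ", locale=" + Locale.getDefault();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Launching with `java -Dfile.encoding=UTF-8 ...` is a commonly used way to change the default, though whether it is honoured can vary by JVM version, so it is worth re-running the check after changing it.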

--
Paul Thomas
+------------------------------+---------------------------------------------+
| Thomas Micro Systems Limited | Software Solutions for the Smaller Business |
| Computer Consultants         | http://www.thomas-micro-systems-ltd.co.uk   |
+------------------------------+---------------------------------------------+

Re: Encoding nightmare! Pls help!

From
"John Sidney-Woollett"
Date:
Paul, thanks for the link - I'll have a read.

I've tried (probably incorrectly) setting various JVM locales/encodings,
and I still get the same rubbish output.

Perhaps a good night's sleep will help!

John



Re: Encoding nightmare! Pls help!

From
"John Sidney-Woollett"
Date:
Thanks for your reply - but I'm still totally baffled and confused...

I've got JSP pages printing tést.jpg as tést.jpg even though I have set
the content type to "text/html", 'UTF-8'.

Also, I've got servlets writing garbage into the database even though I
have explicitly set the encoding on the request object to UTF-8 - but this
may be because I don't really know what encoding scheme the browser has used.
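
The garbled form seen throughout this thread is what UTF-8 bytes look like when they are decoded as ISO-8859-1, so a mismatch between what the browser sent and what the server assumed would produce exactly this. The sketch below reproduces that mismatch in isolation; the class and method names are hypothetical, and `StandardCharsets` requires Java 7 or later. It does not touch the servlet API, so it only illustrates the mechanism rather than the fix for any particular container.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Encodes text in one charset and decodes the bytes in another, which
    // is what happens when sender and receiver disagree on the encoding.
    static String transcodeWrongly(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    // Hex dump of the bytes, handy for checking what actually arrived.
    static String hex(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The test filename with e-acute (U+00E9), written as an escape.
        String original = "t\u00e9st.jpg";

        // UTF-8 bytes of the filename; e-acute shows up as "c3 a9".
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8)));

        // Reading those bytes as ISO-8859-1 turns one character into two.
        String garbled = transcodeWrongly(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(original.length() + " chars became " + garbled.length());

        // Reversing the same mistake recovers the original string.
        String repaired = transcodeWrongly(garbled,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original));
    }
}
```

Dumping the raw bytes of a request parameter or a database column in this way shows which side of the pipeline introduced the extra byte pair, independent of how the console renders it.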

It's been a long day...

Thanks for your help.

John

Barry Lind said:
> John,
>
> My first guess is that your code is working fine, and it is just your
> System.out.print() calls that are the problem.  You haven't specified
> what character set to use when printing, so it will use the default
> character set for your JVM.  This conversion is necessary since Java
> strings are stored internally as UCS-2 and Java needs to convert to some
> other character set when printing them out.
>
> thanks,
> --Barry


OFF TOPIC: email postage should be of interest to those who use OSS newslists

From
"David Wall"
Date:
Sorry for this off topic posting, but it should be of interest to those in
the OSS communities since it threatens us.

The following story appeared in the New York Times as well as various local
papers (like the one here in Seattle).

http://www.nytimes.com/2004/02/02/technology/02spam.html

The gist is that Microsoft, Yahoo and others are trying to create a scam in
which they charge postage for email.  Note that this would be used against
all open source projects that rely heavily on free emails going out to
developers and users.  Note that spam filters in the big ISPs will only be
made more restrictive in order to increase utilization of the postage scam.
After all, nearly every email sent arrives at its destination today, so
nobody will pay.  But as they tighten the rules, more legit email will get
blocked as spam, thus forcing us into paying for postage that provides no
added services, and of course would cripple OSS projects that rely on email.

Below is my letter to the editor of the NYT and a few quotes from the
article for those who aren't registered on the NYT site.

Thanks,
David

++++

Re: "Gates Backs E-Mail Stamp in War on Spam,"
http://www.nytimes.com/2004/02/02/technology/02spam.html

Dear Mr. Hansell:

Postage for sending email?  That sounds like a greedy attempt to charge
twice without providing any added services.

Our tax dollars paid to create the Internet and the communications protocols
freely used by Microsoft and Yahoo.  But now there's a new land grab to try
to privatize our public, worldwide network that promotes freedom.

Our monthly ISP and telephone fees already pay for our usage of the
Internet.

Now we're told that for our own good, we should pay for each email sent,
even though we've already paid for that privilege.  Open source projects
rely heavily on email for developer communications and for user support.  Is
it surprising that Microsoft likes such a scheme?  Will we next have to pay
for each instant message or each web page we visit?

A typical email, like this one, is about 2 KB in size.  Today's MSN homepage
is 100 KB, with lots of unsolicited ads.  Unsolicited popup ads often run in
the 13-20 KB range.  Should Microsoft have to pay me to view these ads?
Their web page consumes 50 times more bandwidth than this email.

Lastly, there are commercial services today like Yozons.com that charge for
sending secure messages that have no spam or viruses, but at least they
offer lots of features beyond email (working return receipts, encrypted
delivery to ensure privacy, electronic signatures, status tracking of
messages sent, etc.), so many find it worth the extra money spent.

We'll pay for services we want, but paying twice for no added service is bad
all around.

Sincerely,
David Wall

+++++

Some quotes from the NYT story, since I realize I cannot post the entire
story here for copyright reasons:

"Ten days ago, Bill Gates, Microsoft's chairman, told the World Economic
Forum in Davos, Switzerland, that spam would not be a problem in two years,
in part because of systems that would require people to pay money to send
e-mail. Yahoo, meanwhile, is quietly evaluating an e-mail postage plan being
developed by Goodmail, a Silicon Valley start-up company."

"'Damn if I will pay postage for my nice list,' said David Farber, a
professor at Carnegie Mellon University, who runs a mailing list on
technology and policy with 30,000 recipients. He said electronic postage
systems are likely to be too complex and would charge noncommercial users
who should be able to send e-mail free."

"But for the big Internet access providers, or I.S.P.'s, the prospect of
e-mail postage creating a new revenue stream that could help offset the cost
of their e-mail systems is undeniably attractive"