Postgresql JDBC UTF8 Conversion Throughput

From: Paul Lindner
Date:
Hi,

On a heavily trafficked web site we found hundreds of threads stuck
looking up character set names.  This was traced back to the
encodeUTF8() method in the package org.postgresql.core.Utils.

It turns out that using more than two character sets in your Java
application causes very poor throughput because of synchronization
overhead.  I wrote about this here:

  http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html

In a web application you can easily find yourself in this situation:
  * ISO-8859-1 is often the default character set
  * UTF-8 is used for DBs and more
  * Your web container might request 'utf-8' or other aliased
    character sets while processing web requests.
  * Web browsers sometimes request the strangest encodings.
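
The effect is easy to sketch with a hypothetical micro-benchmark (not
part of this thread): on the JDKs of that era, String.getBytes(String)
cached only the most recently used encoder, so alternating between two
charset names forced a charset-name lookup on every call.

```java
// Hypothetical demo, not from the thread: alternating charset names
// defeats the single-entry encoder cache behind getBytes(String) on
// older JDKs, so each call falls back to a charset-name lookup.
public class CharsetAlternation {
    public static void main(String[] args) throws Exception {
        String s = "abcd1234";
        int n = 1000000;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            s.getBytes("UTF-8");                     // same name every time
        }
        long same = (System.nanoTime() - t0) / 1000000;

        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            // alternate names; on older JDKs this misses the cache
            s.getBytes((i & 1) == 0 ? "UTF-8" : "ISO-8859-1");
        }
        long alternating = (System.nanoTime() - t0) / 1000000;

        System.out.println("same charset:         " + same + " ms");
        System.out.println("alternating charsets: " + alternating + " ms");
    }
}
```

The absolute numbers depend heavily on the JVM version; the point is
the relative gap between the two loops.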

In Java 1.6 there's an easy way to fix this charset lookup problem.
Just create a static Charset for UTF-8 and pass that to getBytes(...)
instead of the string constant "UTF-8".

   static final Charset UTF8_CHARSET = Charset.forName("UTF-8");
   ...
   return str.getBytes(UTF8_CHARSET);

For backwards compatibility with Java 1.4 you can use the attached
patch instead.  It uses nio classes to do the UTF-8 to byte
conversion.
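
The patch itself is attached; a minimal sketch of the nio approach (my
reconstruction, so the attached patch may well differ) looks like this.
The Charset is looked up once at class-load time, and each call encodes
through a CharsetEncoder rather than the synchronized name lookup:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class Utf8Encoder {
    // Look the charset up once; Charset objects are thread-safe.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // A fresh CharsetEncoder per call, because CharsetEncoder
    // instances are NOT thread-safe.
    public static byte[] encodeUTF8(String str)
            throws CharacterCodingException {
        ByteBuffer buf = UTF8.newEncoder().encode(CharBuffer.wrap(str));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }
}
```

encode() returns a flipped buffer, so remaining() is exactly the number
of encoded bytes.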

You may want to consider applying this patch.  If not, at least
this message will be in the archives.

Comments/Suggestions welcome...

--
Paul Lindner        ||||| | | | |  |  |  |   |   |
lindner@inuus.com

Attachment

Re: Postgresql JDBC UTF8 Conversion Throughput

From: Kris Jurka
Date:

On Mon, 2 Jun 2008, Paul Lindner wrote:

> It turns out that using more than two character sets in your Java
> Application causes very poor throughput because of synchronization
> overhead.  I wrote about this here:
>
>  http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
>

Very interesting.

> In Java 1.6 there's an easy way to fix this charset lookup problem.
> Just create a static Charset for UTF-8 and pass that to getBytes(...)
> instead of the string constant "UTF-8".

Note that this is actually a performance hit (when you aren't stuck doing
charset lookups), see

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6633613

> For backwards compatibility with Java 1.4 you can use the attached
> patch instead.  It uses nio classes to do the UTF-8 to byte
> conversion.
>

This is also a performance loser in the simple case.  The attached test
case shows times of:

Doing 10000000 iterations of each.
2606 getBytes(String)
6200 getBytes(Charset)
3346 via ByteBuffer
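
The attached test case is not reproduced here; a sketch of such a
benchmark (my reconstruction, not Kris's code) timing the three
encoding paths might look like:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Reconstruction of a three-way encoding benchmark; the actual
// attached test case may differ in detail.
public class EncodeBench {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    public static void main(String[] args) throws Exception {
        String s = "abcd1234";
        int n = 1000000; // the thread used 10000000 iterations

        long t = System.nanoTime();
        for (int i = 0; i < n; i++) s.getBytes("UTF-8");
        System.out.println(((System.nanoTime() - t) / 1000000)
                + " getBytes(String)");

        t = System.nanoTime();
        for (int i = 0; i < n; i++) s.getBytes(UTF8); // Java 1.6+
        System.out.println(((System.nanoTime() - t) / 1000000)
                + " getBytes(Charset)");

        t = System.nanoTime();
        for (int i = 0; i < n; i++) {
            ByteBuffer b = UTF8.newEncoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[b.remaining()];
            b.get(out);
        }
        System.out.println(((System.nanoTime() - t) / 1000000)
                + " via ByteBuffer");
    }
}
```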

It would be nice to fix the blocking problem, but it seems like a rather
unusual situation to be in (being one charset over the two charset cache).
If you've got more than three charsets in play then fixing the JDBC driver
won't help you because at most it could eliminate one.  So I'd like the
driver to be a good citizen, but I'm not convinced the performance hit is
worth it without having some more field reports or benchmarks.

Maybe it depends how much reading vs writing is done.  Right now we have
our own UTF8 decoder so this hit only happens when encoding data to send
it to the DB.  If you're loading a lot of data this might be a problem,
but if you're sending a small query with a couple of parameters, then
perhaps the thread safety is more important.

Kris Jurka

Attachment

Re: Postgresql JDBC UTF8 Conversion Throughput

From: Paul Lindner
Date:
From: Kris Jurka <books@ejurka.com>
Date: September 19, 2008 12:29:45 AM PDT
To: Paul Lindner <lindner@inuus.com>
Subject: Re: Postgresql JDBC UTF8 Conversion Throughput



> On Mon, 2 Jun 2008, Paul Lindner wrote:
>
>> It turns out that using more than two character sets in your Java
>> application causes very poor throughput because of synchronization
>> overhead.  I wrote about this here:
>>
>> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
>
> Very interesting.
>
>> In Java 1.6 there's an easy way to fix this charset lookup problem.
>> Just create a static Charset for UTF-8 and pass that to getBytes(...)
>> instead of the string constant "UTF-8".
>
> Note that this is actually a performance hit (when you aren't stuck doing
> charset lookups), see
>
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6633613
>
>> For backwards compatibility with Java 1.4 you can use the attached
>> patch instead.  It uses nio classes to do the UTF-8 to byte
>> conversion.
>
> This is also a performance loser in the simple case.  The attached test
> case shows times of:
>
> Doing 10000000 iterations of each.
> 2606 getBytes(String)
> 6200 getBytes(Charset)
> 3346 via ByteBuffer
>
> It would be nice to fix the blocking problem, but it seems like a rather
> unusual situation to be in (being one charset over the two charset cache).
> If you've got more than three charsets in play then fixing the JDBC driver
> won't help you because at most it could eliminate one.  So I'd like the
> driver to be a good citizen, but I'm not convinced the performance hit is
> worth it without having some more field reports or benchmarks.
>
> Maybe it depends how much reading vs writing is done.  Right now we have
> our own UTF8 decoder so this hit only happens when encoding data to send
> it to the DB.  If you're loading a lot of data this might be a problem,
> but if you're sending a small query with a couple of parameters, then
> perhaps the thread safety is more important.


Hi Kris,

getBytes(String) with a constant string will always win: StringCoding.java (see http://www.docjar.net/html/api/java/lang/StringCoding.java.html) caches the charset locally.

When you use two or more character sets, single-thread performance of getBytes(Charset) and getBytes(String) is about the same, with getBytes(String) slightly ahead.  ByteBuffer ends up being the big winner:

Doing 10000000 iterations of each for string - 'abcd1234'
15662 getBytes(Charset)
14958 getBytes(String)
10098 via ByteBuffer

In any case, all of this pertains only to single-thread performance.  Our web apps are running on 8- and 16-core systems, where contention is the biggest performance killer.
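
That contention can be sketched with a hypothetical demo (not from the
thread): several threads rotating through more charset names than the
lookup cache holds all serialize on the synchronized charset lookup in
the JVMs of that era.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical contention demo, not from the thread: on older JVMs,
// threads cycling through several charset names all block on the
// synchronized charset-name lookup.
public class ContentionDemo {
    public static void main(String[] args) throws Exception {
        final String[] names = { "UTF-8", "ISO-8859-1", "US-ASCII", "UTF-16" };
        final int iterations = 200000;
        int threads = 8;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final CountDownLatch done = new CountDownLatch(threads);

        long t0 = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        String s = "abcd1234";
                        for (int i = 0; i < iterations; i++) {
                            // rotate through four charsets per thread
                            s.getBytes(names[i % names.length]);
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    } finally {
                        done.countDown();
                    }
                }
            });
        }
        done.await();
        pool.shutdown();
        System.out.println(threads + " threads, 4 charsets: "
                + (System.nanoTime() - t0) / 1000000 + " ms");
    }
}
```

On a modern JVM the lookup path is lock-free, so the demo only shows
the shape of the workload, not the original stalls.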

Attachment

Re: Postgresql JDBC UTF8 Conversion Throughput

From: Kris Jurka
Date:

On Fri, 19 Sep 2008, Paul Lindner wrote:

> When you use two or more character sets, single-thread performance of
> getBytes(Charset) and getBytes(String) is about the same, with
> getBytes(String) slightly ahead.  ByteBuffer ends up being the big winner:
>
> Doing 10000000 iterations of each for string - 'abcd1234'
> 15662 getBytes(Charset)
> 14958 getBytes(String)
> 10098 via ByteBuffer
>
> In any case, all of this pertains only to single-thread performance.  Our
> web apps are running on 8- and 16-core systems, where contention is the
> biggest performance killer.
>

I see, so this wins even single-threaded.  In that case I've applied your
patch to CVS.

Kris Jurka