Thread: Unicode database + JDBC driver performance

Unicode database + JDBC driver performance

From
Jan Ploski
Date:
Hello,

I have some questions regarding PostgreSQL handling of Unicode databases
and their performance. I am using version 7.2.1 and running two benchmarks
against a database set up with LATIN1 encoding and the same database
with UNICODE. The database consists of a single "test" table:

Column |  Type   | Modifiers
--------+---------+-----------
id     | integer | not null
txt    | text    | not null
Primary key: test_pkey

The client is written in Java, it relies on the official JDBC driver,
and is being run on the same machine as the database.

Benchmark 1:

Insert 10,000 rows (in 10 transactions, 1000 rows per transaction)
into table "test". Each row contains 674 characters, most of which
are ASCII.

Benchmark 2:

select * from test, repeated 10 times in a loop


I am measuring the disk space taken by the database in each case
(LATIN1 vs UNICODE) and the time it takes to run the benchmarks.
I don't understand the results:

Disk space change (after inserts and vacuumdb -f):
LATIN1      UNICODE
764K        640K

I would rather assume that the Unicode database takes more space,
even 2 times as more.. Apparently not (and that's nice).

Avg. Benchmark execution times (obtained with the 'time' command, repeatedly):
Benchmark 1:
LATIN1      UNICODE
11.5s       14.5s

Benchmark 2:
LATIN1      UNICODE
4.7s        8.6s

The Unicode database is slower both on INSERTs and especially on
SELECTs. I am wondering why. Since Java uses Unicode internally,
shouldn't it actually be more efficient to store/retrieve character
data in that format, with no recoding? Maybe it is an issue with the
JDBC driver? Or is handling Unicode inherently much slower on the
backend side?

Take care -
JPL


Re: Unicode database + JDBC driver performance

From
Tom Lane
Date:
Jan Ploski <jpljpl@gmx.de> writes:
> The Unicode database is slower both on INSERTs and especially on
> SELECTs.

What did you have PGCLIENTENCODING set to in each case?  I am wondering
if you were incurring an extra conversion ...

            regards, tom lane

Re: Unicode database + JDBC driver performance

From
Jan Ploski
Date:
On Sat, Dec 21, 2002 at 11:45:08AM -0500, Tom Lane wrote:
> Jan Ploski <jpljpl@gmx.de> writes:
> > The Unicode database is slower both on INSERTs and especially on
> > SELECTs.
>
> What did you have PGCLIENTENCODING set to in each case?  I am wondering
> if you were incurring an extra conversion ...

Tom,

When connecting to the Unicode database, I did not use any 'charSet'
parameter for the JDBC driver (which I assume translates to
PGCLIENTENCODING on the backend?). When I specified charSet=UNICODE,
I got an exception when executing the query. Without, it worked fine.

For the LATIN1 database, I used charSet=LATIN1 in driver's connection
string. It worked as expected (the conversion of non-ASCII characters
was not correct, but that's ok).

Is there some other way to find out what PGCLIENTENCODING is used
in each case?

Take care -
JPL


Re: Unicode database + JDBC driver performance

From
Tom Lane
Date:
Jan Ploski <jpljpl@gmx.de> writes:
> Is there some other way to find out what PGCLIENTENCODING is used
> in each case?

I think "show client_encoding" and "show server_encoding" would tell
what you want (not sure how well JDBC cooperates with the SHOW command
though).

            regards, tom lane

Re: Unicode database + JDBC driver performance

From
Jan Ploski
Date:
On Sat, Dec 21, 2002 at 02:57:51PM -0500, Tom Lane wrote:
> Jan Ploski <jpljpl@gmx.de> writes:
> > Is there some other way to find out what PGCLIENTENCODING is used
> > in each case?
>
> I think "show client_encoding" and "show server_encoding" would tell
> what you want (not sure how well JDBC cooperates with the SHOW command
> though).

Hello,

I could not get to this info via JDBC directly, but sniffing on port 5432
reveals the following output when the SHOW commands are run:

NOTICE:  Current server encoding is 'UNICODE'.
NOTICE:  Current client encoding is 'UNICODE'.

It seems correct to me, so the question about Unicode's slowness remains.
I guess that I will have a look at the driver's internals. Do you believe
that the backend should not be much slower [than LATIN1] when delivering
query results in the UNICODE encoding?

-JPL


Re: Unicode database + JDBC driver performance

From
Tom Lane
Date:
Jan Ploski <jpljpl@gmx.de> writes:
> It seems correct to me, so the question about Unicode's slowness remains.
> I guess that I will have a look at the driver's internals. Do you believe
> that the backend should not be much slower [than LATIN1] when delivering
> query results in the UNICODE encoding?

If no translation is happening then I can't see a reason for any
difference at all, given what you were doing.

I can imagine speed differences for operations that actually interpret
the character data --- for example, sorting on a text column --- but
AFAICT you weren't doing anything like that.

It'd be worth checking first to see whether the extra time is really
being spent in the backend, client, or both.

            regards, tom lane

Re: Unicode database + JDBC driver performance

From
Jan Ploski
Date:
On Sat, Dec 21, 2002 at 03:33:57PM -0500, Tom Lane wrote:
> If no translation is happening then I can't see a reason for any
> difference at all, given what you were doing.
>
> I can imagine speed differences for operations that actually interpret
> the character data --- for example, sorting on a text column --- but
> AFAICT you weren't doing anything like that.
>
> It'd be worth checking first to see whether the extra time is really
> being spent in the backend, client, or both.

Hello,

Here are the results of the more detailed benchmarks.

Legend: the left column is "utime", right column is "stime" (both in
jiffies) according to "cat /proc/<pid>/stat" output for the 'postgres'
process created by establishing the database connection. For the client,
I used the Unix 'time' command (and assumed 1 jiffie == 1/100s).

In short, we can see that the backend's performance is somewhat worse
for the Unicode database in the INSERT benchmark. The client's
(driver's) performance is much worse for Unicode in this same
benchmark.

In the SELECT benchmark, the backend performance does not vary
between UNICODE and LATIN1 (nice!). The difference in client's
performance is even bigger than in the INSERT benchmark, though.

It looks like the driver needs some profiling to find out the reason.

Thanks for your assistance.

-JPL


INSERT benchmark

backend
LATIN1      UNICODE
351 102     480 134
370 112     476 164
376 126     496 178
359 125     498 126
-------     -------
364 116     487 150     33% 29% worse

client
LATIN1      UNICODE
234  45     418  56
224  56     441  48
241  43     430  49
222  41     439  34
-------     -------
230  46     432  47     87% 2% worse


SELECT benchmark

backend
LATIN1      UNICODE
70 9        74 15
74 19       84 14
84 12       71 11
71 13       74 14
-----       -----
75 13       76 14       about the same

client
LATIN1      UNICODE
217  22     529  14
217  18     544  25
217  26     534  14
219  21     542  17
-------     -------
218  22     537  18     146% worse


Re: Unicode database + JDBC driver performance

From
Barry Lind
Date:
Jan,

You say you are using 7.2.1, is that for both server and jdbc driver?
There is a performance patch in the 7.3 driver that bypasses the built
in java routines for converting to/from utf8 with a custom one.  The
built in java routines are very slow on some jdks (although on jdk1.4
they are pretty good).  Can you try the 7.3 drivers?

thanks,
--Barry


Jan Ploski wrote:
> Hello,
>
> I have some questions regarding PostgreSQL handling of Unicode databases
> and their performance. I am using version 7.2.1 and running two benchmarks
> against a database set up with LATIN1 encoding and the same database
> with UNICODE. The database consists of a single "test" table:
>
> Column |  Type   | Modifiers
> --------+---------+-----------
> id     | integer | not null
> txt    | text    | not null
> Primary key: test_pkey
>
> The client is written in Java, it relies on the official JDBC driver,
> and is being run on the same machine as the database.
>
> Benchmark 1:
>
> Insert 10,000 rows (in 10 transactions, 1000 rows per transaction)
> into table "test". Each row contains 674 characters, most of which
> are ASCII.
>
> Benchmark 2:
>
> select * from test, repeated 10 times in a loop
>
>
> I am measuring the disk space taken by the database in each case
> (LATIN1 vs UNICODE) and the time it takes to run the benchmarks.
> I don't understand the results:
>
> Disk space change (after inserts and vacuumdb -f):
> LATIN1      UNICODE
> 764K        640K
>
> I would rather assume that the Unicode database takes more space,
> even 2 times as more.. Apparently not (and that's nice).
>
> Avg. Benchmark execution times (obtained with the 'time' command, repeatedly):
> Benchmark 1:
> LATIN1      UNICODE
> 11.5s       14.5s
>
> Benchmark 2:
> LATIN1      UNICODE
> 4.7s        8.6s
>
> The Unicode database is slower both on INSERTs and especially on
> SELECTs. I am wondering why. Since Java uses Unicode internally,
> shouldn't it actually be more efficient to store/retrieve character
> data in that format, with no recoding? Maybe it is an issue with the
> JDBC driver? Or is handling Unicode inherently much slower on the
> backend side?
>
> Take care -
> JPL
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly
>



Re: Unicode database + JDBC driver performance

From
Jan Ploski
Date:
On Mon, Dec 23, 2002 at 09:52:28AM -0800, Barry Lind wrote:
> Jan,
>
> You say you are using 7.2.1, is that for both server and jdbc driver?
> There is a performance patch in the 7.3 driver that bypasses the built
> in java routines for converting to/from utf8 with a custom one.  The
> built in java routines are very slow on some jdks (although on jdk1.4
> they are pretty good).  Can you try the 7.3 drivers?
>
> thanks,
> --Barry

Hello,

I was using the 7.2.2 JDBC driver and 7.2.1 backend in the original
benchmark. The Java runtime version was Sun's 1.3.1. I repeated the same
tests today with the 7.3 JDBC driver. The results, together with the older
driver's figures for comparison, are pasted below. The SELECT performance
is definitely improved, but the INSERT performance got even worse.

You are right. The difference is not there when I am using Sun's 1.4.

Could it be that the improvement you mentioned does not apply there?

Take care -
JPL


SELECT benchmark

client (7.3.1 JDBC driver)
LATIN1      UNICODE
212  22     281  29
222  29     275  36
216  26     275  19
228  21     275  17
-------     -------
220  25     277  25     26% worse

client (7.2.2 JDBC driver)
LATIN1      UNICODE
217  22     529  14
217  18     544  25
217  26     534  14
219  21     542  17
-------     -------
218  22     537  18     146% worse


INSERT benchmark

client (7.3.1 JDBC driver)
LATIN1      UNICODE
159  40     345  49
189  36     375  26
181  37     397  44
180  36     376  36
-------     -------
177  37     373  39     111% 5% worse

client (7.2.2 JDBC driver)
LATIN1      UNICODE
234  45     418  56
224  56     441  48
241  43     430  49
222  41     439  34
-------     -------
230  46     432  47     87% 2% worse



Re: Unicode database + JDBC driver performance

From
Barry Lind
Date:
Jan,

You are correct, the 7.3 performance improvement was only implemented
for the decode method, not the encode method, so inserts will still have
the problem on certain jvms.

You can look at the 7.3 version of org.postgresql.core.Encoding to see
what the code is using to do character set conversions.  If you want to
submit a patch that handles the encoding case as well, that would be great.

thanks,
--Barry


Jan Ploski wrote:
> On Mon, Dec 23, 2002 at 09:52:28AM -0800, Barry Lind wrote:
>
>>Jan,
>>
>>You say you are using 7.2.1, is that for both server and jdbc driver?
>>There is a performance patch in the 7.3 driver that bypasses the built
>>in java routines for converting to/from utf8 with a custom one.  The
>>built in java routines are very slow on some jdks (although on jdk1.4
>>they are pretty good).  Can you try the 7.3 drivers?
>>
>>thanks,
>>--Barry
>
>
> Hello,
>
> I was using the 7.2.2 JDBC driver and 7.2.1 backend in the original
> benchmark. The Java runtime version was Sun's 1.3.1. I repeated the same
> tests today with the 7.3 JDBC driver. The results, together with the older
> driver's figures for comparison, are pasted below. The SELECT performance
> is definitely improved, but the INSERT performance got even worse.
>
> You are right. The difference is not there when I am using Sun's 1.4.
>
> Could it be that the improvement you mentioned does not apply there?
>
> Take care -
> JPL
>
>
> SELECT benchmark
>
> client (7.3.1 JDBC driver)
> LATIN1      UNICODE
> 212  22     281  29
> 222  29     275  36
> 216  26     275  19
> 228  21     275  17
> -------     -------
> 220  25     277  25     26% worse
>
> client (7.2.2 JDBC driver)
> LATIN1      UNICODE
> 217  22     529  14
> 217  18     544  25
> 217  26     534  14
> 219  21     542  17
> -------     -------
> 218  22     537  18     146% worse
>
>
> INSERT benchmark
>
> client (7.3.1 JDBC driver)
> LATIN1      UNICODE
> 159  40     345  49
> 189  36     375  26
> 181  37     397  44
> 180  36     376  36
> -------     -------
> 177  37     373  39     111% 5% worse
>
> client (7.2.2 JDBC driver)
> LATIN1      UNICODE
> 234  45     418  56
> 224  56     441  48
> 241  43     430  49
> 222  41     439  34
> -------     -------
> 230  46     432  47     87% 2% worse
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>




Re: Unicode database + JDBC driver performance

From
Jan Ploski
Date:
Hello,

Here is my UTF8Encoder class mentioned on pgsql-general.

It should be put into org/postgresql/core, and you will also
need to patch Encoding.java, so that it uses this class:

    if (encoding.equals("UTF-8")) {
        return UTF8Encoder.encode2(s);
    }
    else {
        return s.getBytes(encoding);
    }

There are two public utility methods in UTF8Encoder, encode1 and encode2.
They use two different approaches to determining how big the output
buffer should be. Performance-wise they seem very similiar (encode2
being a bit slower), but I favor encode2 because it does less memory
allocation and copying.

Note that I did not use any shared buffer in order to avoid
synchronization of multiple threads (as I understand, the class
Encoding must ensure thread safety itself). This may be an unnecessary
concern after all... I don't know.

UTF8Encoder can be used as is, or made into a private static inner
class of Encoding.java, whatever you prefer.

UTF8Encoder.main contains some tests to assert that it stays compatible
with Java's built-in encoder. It may be nicer to move them into
a JUnit test case, you decide.

Take care -
JPL

Attachment

Re: Unicode database + JDBC driver performance

From
Jan Ploski
Date:
On Mon, Dec 30, 2002 at 09:10:24AM -0800, Barry Lind wrote:
> Jan,
>
> You are correct, the 7.3 performance improvement was only implemented
> for the decode method, not the encode method, so inserts will still have
> the problem on certain jvms.
>
> You can look at the 7.3 version of org.postgresql.core.Encoding to see
> what the code is using to do character set conversions.  If you want to
> submit a patch that handles the encoding case as well, that would be great.

Hello,

I wrote an optimized UTF8Encoder class, which I will send to you
in private. To finish this thread and for the archive, here are my
benchmark's results with the improved version, compared to those with
the original 7.3.1 driver (again, JDK 1.3.1 + Linux).

Take care -
JPL

INSERT benchmark

client (7.3.1 JDBC driver, old)
LATIN1      UNICODE
159  40     345  49
189  36     375  26
181  37     397  44
180  36     376  36
-------     -------
177  37     373  39     111% 5% worse

client (7.3.1 JDBC driver, new)

LATIN1      UNICODE
209  27     162  30
210  37     194  27
178  33     163  28
167  29     186  33
-------     -------
191  32     176  30     7% 6% better :)