Thread: Unicode database + JDBC driver performance
Hello,

I have some questions regarding PostgreSQL's handling of Unicode databases
and their performance. I am using version 7.2.1 and running two benchmarks
against a database set up with LATIN1 encoding and the same database with
UNICODE. The database consists of a single "test" table:

 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 txt    | text    | not null
Primary key: test_pkey

The client is written in Java, relies on the official JDBC driver, and is
run on the same machine as the database.

Benchmark 1:

Insert 10,000 rows (in 10 transactions, 1000 rows per transaction) into
table "test". Each row contains 674 characters, most of which are ASCII.

Benchmark 2:

select * from test, repeated 10 times in a loop

I am measuring the disk space taken by the database in each case (LATIN1
vs UNICODE) and the time it takes to run the benchmarks. I don't
understand the results:

Disk space change (after inserts and vacuumdb -f):
LATIN1   UNICODE
764K     640K

I would rather have assumed that the Unicode database takes more space,
maybe even twice as much. Apparently not (and that's nice).

Avg. benchmark execution times (obtained with the 'time' command, repeatedly):

Benchmark 1:
LATIN1   UNICODE
11.5s    14.5s

Benchmark 2:
LATIN1   UNICODE
4.7s     8.6s

The Unicode database is slower both on INSERTs and especially on SELECTs.
I am wondering why. Since Java uses Unicode internally, shouldn't it
actually be more efficient to store/retrieve character data in that
format, with no recoding? Maybe it is an issue with the JDBC driver? Or is
handling Unicode inherently much slower on the backend side?

Take care -
JPL
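(The actual client code was not posted; for the archive, Benchmark 1 boils down to a loop like the following. This is only a sketch: the class name, connection URL, id scheme, and payload contents are illustrative, not taken from the real benchmark. Run with no arguments it merely prints the payload length; connecting requires a running server.)

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InsertBench {
    // Build a 674-character, mostly-ASCII row payload (contents illustrative).
    public static String makePayload(int len) {
        StringBuffer sb = new StringBuffer(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + i % 26));
        }
        return sb.toString();
    }

    // 10 transactions of 1000 INSERTs each, as described above.
    public static void run(String url) throws Exception {
        String payload = makePayload(674);
        Connection con = DriverManager.getConnection(url);
        con.setAutoCommit(false);
        PreparedStatement ps =
            con.prepareStatement("INSERT INTO test (id, txt) VALUES (?, ?)");
        for (int tx = 0; tx < 10; tx++) {           // 10 transactions
            for (int row = 0; row < 1000; row++) {  // 1000 rows each
                ps.setInt(1, tx * 1000 + row);
                ps.setString(2, payload);
                ps.executeUpdate();
            }
            con.commit();
        }
        ps.close();
        con.close();
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            run(args[0]); // e.g. jdbc:postgresql:testdb?charSet=LATIN1
        } else {
            System.out.println(makePayload(674).length());
        }
    }
}
```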
Jan Ploski <jpljpl@gmx.de> writes:
> The Unicode database is slower both on INSERTs and especially on
> SELECTs.

What did you have PGCLIENTENCODING set to in each case? I am wondering
if you were incurring an extra conversion ...

			regards, tom lane
On Sat, Dec 21, 2002 at 11:45:08AM -0500, Tom Lane wrote:
> Jan Ploski <jpljpl@gmx.de> writes:
> > The Unicode database is slower both on INSERTs and especially on
> > SELECTs.
>
> What did you have PGCLIENTENCODING set to in each case? I am wondering
> if you were incurring an extra conversion ...

Tom,

When connecting to the Unicode database, I did not use any 'charSet'
parameter for the JDBC driver (which I assume translates to
PGCLIENTENCODING on the backend?). When I specified charSet=UNICODE, I
got an exception when executing the query. Without it, it worked fine.

For the LATIN1 database, I used charSet=LATIN1 in the driver's connection
string. It worked as expected (the conversion of non-ASCII characters was
not correct, but that's ok).

Is there some other way to find out what PGCLIENTENCODING is used in each
case?

Take care -
JPL
Jan Ploski <jpljpl@gmx.de> writes:
> Is there some other way to find out what PGCLIENTENCODING is used
> in each case?

I think "show client_encoding" and "show server_encoding" would tell you
what you want (not sure how well JDBC cooperates with the SHOW command,
though).

			regards, tom lane
On Sat, Dec 21, 2002 at 02:57:51PM -0500, Tom Lane wrote:
> Jan Ploski <jpljpl@gmx.de> writes:
> > Is there some other way to find out what PGCLIENTENCODING is used
> > in each case?
>
> I think "show client_encoding" and "show server_encoding" would tell
> what you want (not sure how well JDBC cooperates with the SHOW command
> though).

Hello,

I could not get at this info via JDBC directly, but sniffing on port 5432
reveals the following output when the SHOW commands are run:

NOTICE:  Current server encoding is 'UNICODE'.
NOTICE:  Current client encoding is 'UNICODE'.

It seems correct to me, so the question about Unicode's slowness remains.
I guess that I will have a look at the driver's internals. Do you believe
that the backend should not be much slower [than LATIN1] when delivering
query results in the UNICODE encoding?

-JPL
Jan Ploski <jpljpl@gmx.de> writes:
> It seems correct to me, so the question about Unicode's slowness remains.
> I guess that I will have a look at the driver's internals. Do you believe
> that the backend should not be much slower [than LATIN1] when delivering
> query results in the UNICODE encoding?

If no translation is happening, then I can't see a reason for any
difference at all, given what you were doing.

I can imagine speed differences for operations that actually interpret
the character data --- for example, sorting on a text column --- but
AFAICT you weren't doing anything like that.

It'd be worth checking first to see whether the extra time is really
being spent in the backend, the client, or both.

			regards, tom lane
On Sat, Dec 21, 2002 at 03:33:57PM -0500, Tom Lane wrote:
> If no translation is happening then I can't see a reason for any
> difference at all, given what you were doing.
>
> I can imagine speed differences for operations that actually interpret
> the character data --- for example, sorting on a text column --- but
> AFAICT you weren't doing anything like that.
>
> It'd be worth checking first to see whether the extra time is really
> being spent in the backend, client, or both.

Hello,

Here are the results of the more detailed benchmarks. Legend: the left
column is "utime", the right column is "stime" (both in jiffies),
according to "cat /proc/<pid>/stat" output for the 'postgres' process
created by establishing the database connection. For the client, I used
the Unix 'time' command (and assumed 1 jiffie == 1/100 s).

In short, we can see that the backend's performance is somewhat worse for
the Unicode database in the INSERT benchmark. The client's (driver's)
performance is much worse for Unicode in that same benchmark. In the
SELECT benchmark, the backend's performance does not vary between UNICODE
and LATIN1 (nice!). The difference in the client's performance is even
bigger than in the INSERT benchmark, though. It looks like the driver
needs some profiling to find out the reason.

Thanks for your assistance.

-JPL

INSERT benchmark

backend
LATIN1     UNICODE
351 102    480 134
370 112    476 164
376 126    496 178
359 125    498 126
-------    -------
364 116    487 150    33% 29% worse

client
LATIN1     UNICODE
234  45    418  56
224  56    441  48
241  43    430  49
222  41    439  34
-------    -------
230  46    432  47    87% 2% worse

SELECT benchmark

backend
LATIN1    UNICODE
70  9     74 15
74 19     84 14
84 12     71 11
71 13     74 14
-----     -----
75 13     76 14    about the same

client
LATIN1     UNICODE
217 22     529 14
217 18     544 25
217 26     534 14
219 21     542 17
-------    -------
218 22     537 18    146% worse
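(For clarity on how the "N% worse" figures in these tables are derived: they are the relative change of the column averages between LATIN1 and UNICODE, truncated to whole percent. A tiny sketch; the class and method names below are illustrative, not part of the benchmark code.)

```java
public class Delta {
    // Relative change from latin1 to unicode, truncated to whole percent
    // (truncation, not rounding, matches the figures quoted in the tables).
    public static long pctWorse(double latin1, double unicode) {
        return (long) ((unicode - latin1) / latin1 * 100.0);
    }

    public static void main(String[] args) {
        // INSERT benchmark, backend averages (utime stime) in jiffies:
        // LATIN1 364 116, UNICODE 487 150
        System.out.println(pctWorse(364, 487) + "% " + pctWorse(116, 150) + "% worse");
    }
}
```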
Jan,

You say you are using 7.2.1 -- is that for both the server and the JDBC
driver? There is a performance patch in the 7.3 driver that bypasses the
built-in Java routines for converting to/from UTF-8 with a custom one.
The built-in Java routines are very slow on some JDKs (although on JDK
1.4 they are pretty good). Can you try the 7.3 drivers?

thanks,
--Barry

Jan Ploski wrote:
> The Unicode database is slower both on INSERTs and especially on
> SELECTs. I am wondering why. Since Java uses Unicode internally,
> shouldn't it actually be more efficient to store/retrieve character
> data in that format, with no recoding? Maybe it is an issue with the
> JDBC driver? Or is handling Unicode inherently much slower on the
> backend side?
On Mon, Dec 23, 2002 at 09:52:28AM -0800, Barry Lind wrote:
> You say you are using 7.2.1, is that for both server and jdbc driver?
> There is a performance patch in the 7.3 driver that bypasses the built
> in java routines for converting to/from utf8 with a custom one. The
> built in java routines are very slow on some jdks (although on jdk1.4
> they are pretty good). Can you try the 7.3 drivers?

Hello,

I was using the 7.2.2 JDBC driver and the 7.2.1 backend in the original
benchmark. The Java runtime version was Sun's 1.3.1. I repeated the same
tests today with the 7.3 JDBC driver. The results, together with the
older driver's figures for comparison, are pasted below. The SELECT
performance is definitely improved, but the INSERT performance got even
worse.

You are right: the difference is not there when I am using Sun's 1.4.
Could it be that the improvement you mentioned does not apply there?

Take care -
JPL

SELECT benchmark

client (7.3.1 JDBC driver)
LATIN1     UNICODE
212 22     281 29
222 29     275 36
216 26     275 19
228 21     275 17
-------    -------
220 25     277 25    26% worse

client (7.2.2 JDBC driver)
LATIN1     UNICODE
217 22     529 14
217 18     544 25
217 26     534 14
219 21     542 17
-------    -------
218 22     537 18    146% worse

INSERT benchmark

client (7.3.1 JDBC driver)
LATIN1     UNICODE
159 40     345 49
189 36     375 26
181 37     397 44
180 36     376 36
-------    -------
177 37     373 39    111% 5% worse

client (7.2.2 JDBC driver)
LATIN1     UNICODE
234 45     418 56
224 56     441 48
241 43     430 49
222 41     439 34
-------    -------
230 46     432 47    87% 2% worse
Jan,

You are correct: the 7.3 performance improvement was only implemented for
the decode method, not the encode method, so inserts will still have the
problem on certain JVMs.

You can look at the 7.3 version of org.postgresql.core.Encoding to see
what the code is using to do character set conversions. If you want to
submit a patch that handles the encoding case as well, that would be
great.

thanks,
--Barry

Jan Ploski wrote:
> I was using the 7.2.2 JDBC driver and 7.2.1 backend in the original
> benchmark. The Java runtime version was Sun's 1.3.1. I repeated the
> same tests today with the 7.3 JDBC driver. The results, together with
> the older driver's figures for comparison, are pasted below. The SELECT
> performance is definitely improved, but the INSERT performance got even
> worse.
>
> You are right. The difference is not there when I am using Sun's 1.4.
> Could it be that the improvement you mentioned does not apply there?
Hello,

Here is my UTF8Encoder class mentioned on pgsql-general. It should be put
into org/postgresql/core, and you will also need to patch Encoding.java
so that it uses this class:

    if (encoding.equals("UTF-8")) {
        return UTF8Encoder.encode2(s);
    } else {
        return s.getBytes(encoding);
    }

There are two public utility methods in UTF8Encoder, encode1 and encode2.
They use two different approaches to determining how big the output
buffer should be. Performance-wise they seem very similar (encode2 being
a bit slower), but I favor encode2 because it does less memory allocation
and copying.

Note that I did not use any shared buffer, in order to avoid
synchronization of multiple threads (as I understand it, the Encoding
class must ensure thread safety itself). This may be an unnecessary
concern after all... I don't know.

UTF8Encoder can be used as is, or made into a private static inner class
of Encoding.java, whatever you prefer. UTF8Encoder.main contains some
tests to assert that it stays compatible with Java's built-in encoder. It
may be nicer to move them into a JUnit test case, you decide.

Take care -
JPL
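(The UTF8Encoder attachment itself was not preserved in this archive. For reference, a minimal hand-rolled UTF-8 encoder in the same spirit might look like the sketch below; the class name and details are mine, not the actual patch. It uses the "allocate worst case up front, copy once at the end" sizing strategy described above, and combines surrogate pairs into 4-byte sequences so its output matches String.getBytes("UTF-8").)

```java
public class Utf8Sketch {
    public static byte[] encode(String s) {
        byte[] buf = new byte[s.length() * 4]; // worst case: 4 bytes per char
        int o = 0;
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            // Combine a UTF-16 surrogate pair into one code point >= 0x10000.
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                int lo = s.charAt(i + 1);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                    i++;
                }
            }
            if (c < 0x80) {                          // 1-byte sequence (ASCII)
                buf[o++] = (byte) c;
            } else if (c < 0x800) {                  // 2-byte sequence
                buf[o++] = (byte) (0xC0 | (c >> 6));
                buf[o++] = (byte) (0x80 | (c & 0x3F));
            } else if (c < 0x10000) {                // 3-byte sequence
                buf[o++] = (byte) (0xE0 | (c >> 12));
                buf[o++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                buf[o++] = (byte) (0x80 | (c & 0x3F));
            } else {                                 // 4-byte sequence
                buf[o++] = (byte) (0xF0 | (c >> 18));
                buf[o++] = (byte) (0x80 | ((c >> 12) & 0x3F));
                buf[o++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                buf[o++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        byte[] out = new byte[o]; // single trailing copy to the exact size
        System.arraycopy(buf, 0, out, 0, o);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Sanity check against the built-in encoder.
        String s = "h\u00e9llo w\u00f6rld \u20ac";
        System.out.println(java.util.Arrays.equals(encode(s), s.getBytes("UTF-8")));
    }
}
```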
On Mon, Dec 30, 2002 at 09:10:24AM -0800, Barry Lind wrote:
> You are correct, the 7.3 performance improvement was only implemented
> for the decode method, not the encode method, so inserts will still have
> the problem on certain jvms.
>
> You can look at the 7.3 version of org.postgresql.core.Encoding to see
> what the code is using to do character set conversions. If you want to
> submit a patch that handles the encoding case as well, that would be great.

Hello,

I wrote an optimized UTF8Encoder class, which I will send to you in
private. To finish this thread and for the archive, here are my
benchmark's results with the improved version, compared to those with the
original 7.3.1 driver (again, JDK 1.3.1 + Linux).

Take care -
JPL

INSERT benchmark

client (7.3.1 JDBC driver, old)
LATIN1     UNICODE
159 40     345 49
189 36     375 26
181 37     397 44
180 36     376 36
-------    -------
177 37     373 39    111% 5% worse

client (7.3.1 JDBC driver, new)
LATIN1     UNICODE
209 27     162 30
210 37     194 27
178 33     163 28
167 29     186 33
-------    -------
191 32     176 30    7% 6% better :)