Hello,
I have some questions regarding PostgreSQL's handling of Unicode databases
and their performance. I am using version 7.2.1 and running two benchmarks
against a database set up with LATIN1 encoding and the same database
with UNICODE. The database consists of a single "test" table:
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 txt    | text    | not null
Primary key: test_pkey
The client is written in Java, uses the official JDBC driver,
and runs on the same machine as the database.
Benchmark 1:
Insert 10,000 rows (in 10 transactions, 1000 rows per transaction)
into table "test". Each row contains 674 characters, most of which
are ASCII.
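In case it helps, the insert loop is essentially the following
(simplified; the URL, credentials and row contents here are just
placeholders for my actual setup):

    import java.sql.*;

    public class InsertBench {
        // stand-in for my real data: 674 characters, here all ASCII
        static String makeRow() {
            StringBuffer sb = new StringBuffer(674);
            for (int i = 0; i < 674; i++)
                sb.append((char) ('a' + i % 26));
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            Class.forName("org.postgresql.Driver");
            Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "pass");
            con.setAutoCommit(false);
            PreparedStatement ps = con.prepareStatement(
                "INSERT INTO test (id, txt) VALUES (?, ?)");
            String row = makeRow();
            int id = 0;
            for (int tx = 0; tx < 10; tx++) {       // 10 transactions
                for (int i = 0; i < 1000; i++) {    // 1000 rows each
                    ps.setInt(1, id++);
                    ps.setString(2, row);
                    ps.executeUpdate();
                }
                con.commit();
            }
            con.close();
        }
    }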
Benchmark 2:
select * from test, repeated 10 times in a loop
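The select loop, run against the same connection as above, boils
down to this:

    // Benchmark 2: read every row back ten times, touching the
    // text column so the driver actually has to decode it
    static void selectBench(Connection con) throws SQLException {
        Statement st = con.createStatement();
        for (int i = 0; i < 10; i++) {
            ResultSet rs = st.executeQuery("SELECT * FROM test");
            while (rs.next()) {
                rs.getInt(1);
                rs.getString(2);
            }
            rs.close();
        }
        st.close();
    }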
I am measuring the disk space taken by the database in each case
(LATIN1 vs UNICODE) and the time it takes to run the benchmarks.
I don't understand the results:
Disk space change (after inserts and vacuumdb -f):
  LATIN1    UNICODE
   764K      640K
I would have expected the Unicode database to take more space,
perhaps even twice as much. Apparently it does not (and that's nice).
Average benchmark execution times (measured with the 'time' command
over repeated runs):
Benchmark 1:
  LATIN1    UNICODE
   11.5s     14.5s
Benchmark 2:
  LATIN1    UNICODE
    4.7s      8.6s
The Unicode database is slower on INSERTs and especially on
SELECTs, and I am wondering why. Since Java uses Unicode internally,
shouldn't it actually be more efficient to store and retrieve
character data in that format, with no recoding? Could this be an
issue with the JDBC driver? Or is handling Unicode inherently much
slower on the backend side?
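To get a very rough feel for the pure client-side recoding cost,
independent of the driver and the backend, I suppose one could time
the raw string-to-bytes conversions with something like this (a crude
micro-benchmark, not the driver's actual code path):

    // times only the String-to-bytes conversion for the two
    // client encodings; one conversion per simulated row
    public class RecodeCost {
        public static void main(String[] args) throws Exception {
            StringBuffer sb = new StringBuffer(674);
            for (int i = 0; i < 674; i++)
                sb.append((char) ('a' + i % 26));
            String row = sb.toString();

            long t0 = System.currentTimeMillis();
            for (int i = 0; i < 10000; i++)
                row.getBytes("ISO-8859-1");   // LATIN1 client encoding
            long t1 = System.currentTimeMillis();
            for (int i = 0; i < 10000; i++)
                row.getBytes("UTF-8");        // UNICODE client encoding
            long t2 = System.currentTimeMillis();

            System.out.println("LATIN1: " + (t1 - t0) + " ms, "
                               + "UTF-8: " + (t2 - t1) + " ms");
        }
    }

Of course that only measures the conversion itself, not whatever the
driver does with its buffers on the way, so I doubt it tells the
whole story.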
Take care -
JPL