Unicode vs SQL_ASCII DBs - Mailing list pgsql-general

From John Sidney-Woollett
Subject Unicode vs SQL_ASCII DBs
Date
Msg-id 3379.192.168.0.64.1075563161.squirrel@mercury.wardbrook.com
Responses Re: Unicode vs SQL_ASCII DBs  (Kris Jurka <books@ejurka.com>)
Re: Unicode vs SQL_ASCII DBs  (Culley Harrelson <culley@fastmail.fm>)
List pgsql-general
Hi

I need to store accented characters in a PostgreSQL (7.4) database, and
access the data (mostly) using the PostgreSQL JDBC driver (from a web app).

Does anyone know the answers to the following:

1) Is there a performance penalty to (multibyte) UNICODE vs (single-byte)
SQL_ASCII/LATINxxx character encodings, in terms of extra storage and
searching/sorting speed?

2) Can SQL_ASCII be used for accented characters?

3) If I want accented characters to sort correctly, must I choose UNICODE
(or the appropriate ISO 8859 character set) over SQL_ASCII?

4) I'm not initially expecting Arabic, Chinese, Cyrillic or other such
scripts to be stored in the database. But if they were, would UNICODE be
the best encoding to use for future-proofing the data?

5) If I decide not to support/store non-Latin languages, is it better to
use a LATIN encoding over UNICODE?

6) What is MULE_INTERNAL? Does it offer performance advantages?

[This next question probably belongs to the JDBC list, but I'll ask anyway]

7) Because the database backs a Java web application, are there other
issues I need to be aware of? For example, do I have to convert all data
received to UTF-8 before writing it to the database? And do I have to
ensure that the web server's response content-type is always set to UTF-8
for the data to render correctly in a user's browser?

Thanks for any help/advice.

John Sidney-Woollett

PS I ran some tests against two databases: one created using UNICODE
encoding, and the other using SQL_ASCII encoding. The initdb command
specified no encoding, so I guess that SQL_ASCII is the default. The
results are below:

I created the following table in two databases:

            Table "public.table1"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 id     | integer                | not null
 data   | character varying(255) |
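
For reference, the DDL for that table is simply:

create table table1 (
    id   integer not null,
    data character varying(255)
);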

Here are the databases, one with UNICODE encoding and the other with
SQL_ASCII encoding:

          List of databases
     Name     |  Owner   | Encoding
--------------+----------+-----------
 johntest     | postgres | UNICODE
 johntest2    | postgres | SQL_ASCII
 template0    | postgres | SQL_ASCII
 template1    | postgres | SQL_ASCII
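
For anyone reproducing this, I believe the two test databases can be
created with an explicit encoding along these lines (a sketch; adjust
names and owners as needed):

create database johntest with encoding = 'UNICODE';
create database johntest2 with encoding = 'SQL_ASCII';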

I inserted the data into both databases from psql (first setting the
client encoding to LATIN1). Here is the data (the funny character is the
word "test" with the e replaced by an e-acute, character 233):

set client_encoding to 'LATIN1';

insert into table1 values (1, 'tést');
insert into table1 values (2, 'tast');
insert into table1 values (3, 'tost');
insert into table1 values (4, 'test');
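
As a rough check on the storage overhead (question 1), comparing character
and byte lengths should show the difference; my assumption is that the
e-acute takes two bytes in the UNICODE database:

select id, data, char_length(data) as chars, octet_length(data) as bytes
from table1 order by id;
-- I would expect row 1 to show 4 chars / 5 bytes in the UNICODE DB,
-- and 4 chars / 4 bytes in the SQL_ASCII DB.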

Now select the data, ordered:

(UNICODE DB): select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost

(SQL_ASCII DB): select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tést

NOTE: Bad ordering using SQL_ASCII...
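
My assumption is that the SQL_ASCII database simply compares raw byte
values (at least under a C locale): e-acute is byte 233, greater than any
ASCII letter, so 'tést' sorts after 'tost'. The byte values are easy to
check:

select ascii('o') as o, ascii('é') as e_acute;
-- 111 vs 233, so byte-for-byte 'tést' > 'tost'.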

Next I set the client encoding to UNICODE and retrieved the data from both
databases:

set client_encoding to 'UNICODE';

(SQL_ASCII DB): select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tst

NOTE: You can see that the e-acute has been "lost"...
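
To diagnose this kind of mismatch, both settings can be inspected; my
guess is that SQL_ASCII performs no conversion, so the raw LATIN1 byte is
shipped as-is and then dropped as invalid UTF-8 on display:

show server_encoding;
show client_encoding;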

(UNICODE DB): select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost
