Unicode vs SQL_ASCII DBs - Mailing list pgsql-general
From: John Sidney-Woollett
Subject: Unicode vs SQL_ASCII DBs
Msg-id: 3379.192.168.0.64.1075563161.squirrel@mercury.wardbrook.com
Responses:
  Re: Unicode vs SQL_ASCII DBs
  Re: Unicode vs SQL_ASCII DBs
List: pgsql-general
Hi

I need to store accented characters in a postgres (7.4) database, and access the data (mostly) using the postgres JDBC driver (from a web app).

Does anyone know:

1) Is there a performance loss using (multibyte) UNICODE vs (single byte) SQL_ASCII/LATINxxx character encoding? (In terms of extra data, and searching/sorting speeds.)

2) Can SQL_ASCII be used for accented characters?

3) If I want accented characters to sort correctly, must I select UNICODE (or the appropriate ISO 8859 charset) over SQL_ASCII?

4) I'm not initially expecting arabic, chinese, cyrillic or other language types to be stored in the database. But if they were, would UNICODE be the best encoding scheme to use for future-proofing the data?

5) If I decide not to support/store non-latin languages, is it better to use a LATIN encoding over UNICODE?

6) What is MULE_INTERNAL? Does it offer performance advantages?

[This next question probably belongs on the JDBC list, but I'll ask anyway]

7) Because the database is being used to back a java web application, are there other issues that I need to be aware of? For example, do I have to convert all data received to UTF-8 before writing it into the database? And do I have to ensure that the response (from the webserver) content-type is always set to UTF-8 so it renders correctly in a user's browser?

Thanks for any help/advice.

John Sidney-Woollett

ps I did some tests between two databases; one created using UNICODE encoding, and the other using SQL_ASCII encoding. The initdb command specified no encoding, so I guess that SQL_ASCII is the default encoding. The results are below.

I created the following table in both databases:

             Table "public.table1"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 id     | integer                | not null
 data   | character varying(255) |

Here are the databases; one with UNICODE encoding, and the other with SQL_ASCII encoding.
        List of databases
     Name     |  Owner   | Encoding
--------------+----------+-----------
 johntest     | postgres | UNICODE
 johntest2    | postgres | SQL_ASCII
 template0    | postgres | SQL_ASCII
 template1    | postgres | SQL_ASCII

I inserted the data into both databases in psql (first setting the client encoding to LATIN1). Here is the data (the funny character is the word "test" with the e replaced by an e-acute (233)):

set client_encoding to 'LATIN1';
insert into table1 values (1, 'tést');
insert into table1 values (2, 'tast');
insert into table1 values (3, 'tost');
insert into table1 values (4, 'test');

Now select the data ordered (UNICODE DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost

(SQL_ASCII DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tést

NOTE: Bad ordering using SQL_ASCII...

Then I did some other tests: I set the client encoding to UNICODE, and retrieved the data from both databases:

set client_encoding to 'UNICODE';

(SQL_ASCII DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  3 | tost
  1 | tst

NOTE: You can see that the e-acute has been "lost"...

(UNICODE DB):

select * from table1 order by data;
 id | data
----+------
  2 | tast
  4 | test
  1 | tést
  3 | tost
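The "lost" e-acute in the SQL_ASCII test can be reproduced outside psql. SQL_ASCII stores whatever bytes the client sends, so an é written as a single LATIN1 byte (0xE9) is not a valid UTF-8 sequence when the same bytes are later read back under a UNICODE client encoding. A minimal Java sketch of the byte-level mismatch (the class and variable names are mine, not from the thread):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String word = "t\u00E9st"; // "tést"

        // LATIN1 (ISO 8859-1) encodes é as the single byte 0xE9.
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);
        // UTF-8 encodes é as the two-byte sequence 0xC3 0xA9.
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);

        System.out.println(latin1.length); // 4 bytes
        System.out.println(utf8.length);   // 5 bytes

        // Re-reading the LATIN1 bytes as UTF-8: the lone 0xE9 is
        // malformed, so the decoder substitutes U+FFFD for it.
        String roundTrip = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(roundTrip);
    }
}
```

This is the same mismatch question 7 is about: the JDBC driver, the database, and the servlet response content-type need to agree on one encoding end to end, otherwise characters silently degrade as above.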