Re: UTF8 national character data type support WIP patch and list of open issues. - Mailing list pgsql-hackers
From: MauMau
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Msg-id: B1A7485194DE4FDAB8FA781AFB570079@maumau
In response to: Re: UTF8 national character data type support WIP patch and list of open issues. (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hello,

I think it would be nice for PostgreSQL to support national character types, largely because it would ease migration from other DBMSs.

[Reasons why we need NCHAR]
--------------------------------------------------
1. Invite users of other DBMSs to PostgreSQL.

Oracle, SQL Server, MySQL, etc. all support NCHAR. PostgreSQL is probably the only major database that does not. Sadly, I've read a report from a Japanese government agency that the number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I wouldn't say that is due to NCHAR support, but it might be one reason. I want PostgreSQL to be more popular and regain those users.

2. Enhance the "open" image of PostgreSQL by implementing more features of the SQL standard.

NCHAR may be a wrong and unnecessary feature of the SQL standard now that we have Unicode support, but it is defined in the standard and widely implemented.

3. I have heard that some potential customers didn't adopt PostgreSQL due to lack of NCHAR support. However, I don't know the exact reason why they needed NCHAR.

4. I guess some users really want to continue to use ShiftJIS or EUC_JP as the database encoding, and use NCHAR for a limited set of columns to store international text in Unicode:

- to avoid code conversion between the server and the client, for performance
- because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji) than UTF-8 (3 bytes)

This use case is described in chapter 6 of "Oracle Database Globalization Support Guide".
--------------------------------------------------

I think we need to do the following:

[Minimum requirements]
--------------------------------------------------
1. Accept NCHAR/NVARCHAR as data type names and N'...' literals syntactically.

This is already implemented: PostgreSQL treats NCHAR/NVARCHAR as synonyms for CHAR/VARCHAR and ignores the N prefix. But this behavior is not documented.

2. Declare support for national characters in the manual.
1 is not sufficient because users don't want to depend on undocumented behavior. This is exactly what the "national character support" item on the PostgreSQL TODO wiki is about.

3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that:

- psql \d can display the user-specified data types.
- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR.
- additional features can be implemented for NCHAR/NVARCHAR in the future, as described below.
--------------------------------------------------

[Optional requirements]
--------------------------------------------------
1. Implement client driver support, such as:

- the NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as specified in the SQL standard.
- national character methods (e.g. setNString, getNString, setNCharacterStream) as specified in JDBC 4.0.

I think at first we can treat these national-character-specific features the same as CHAR/VARCHAR.

2. Allow NCHAR/NVARCHAR columns to be used in non-UTF-8 databases, always containing Unicode data.

I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used in UTF-8 databases and store UTF-8 strings. This allows us to reuse the input/output/send/recv functions and other infrastructure of CHAR/VARCHAR. It is a reasonable compromise to avoid duplication and keep the first implementation of NCHAR support minimal.

3. Store strings in NCHAR/NVARCHAR columns in UTF-16.

A fixed-width encoding may allow faster string manipulation, as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a truly fixed-width encoding due to supplementary characters.
--------------------------------------------------

I don't think it is good to implement NCHAR/NVARCHAR as extensions like contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side support. That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR OID values.
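For what it's worth, the storage-size figures above (2 bytes per Kanji in ShiftJIS/EUC_JP vs. 3 in UTF-8), and the caveat that UTF-16 is not truly fixed-width, are easy to check with Python's standard codecs. This is just an illustration of the claims, not part of the proposal:

```python
# Encoded size of a typical Kanji under the encodings discussed:
# EUC_JP and Shift_JIS store most Kanji in 2 bytes, UTF-8 needs 3.
kanji = "\u6f22"  # 漢, a common Kanji in the Basic Multilingual Plane
for enc in ("euc_jp", "shift_jis", "utf-8"):
    print(enc, len(kanji.encode(enc)))  # euc_jp 2, shift_jis 2, utf-8 3

# UTF-16 is not truly fixed-width: a supplementary character
# (outside the BMP) needs a surrogate pair, i.e. 4 bytes.
supplementary = "\U0002000b"  # 𠀋, from CJK Unified Ideographs Extension B
print(len(kanji.encode("utf-16-le")))          # 2 bytes (one code unit)
print(len(supplementary.encode("utf-16-le")))  # 4 bytes (surrogate pair)
```

So any code that assumes "one UTF-16 code unit = one character" breaks on supplementary characters, which is why I'm unsure the fixed-width speedup materializes.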
How do you think we should implement NCHAR support?

Regards
MauMau