Re: UTF8 national character data type support WIP patch and list of open issues. - Mailing list pgsql-hackers

From MauMau
Subject Re: UTF8 national character data type support WIP patch and list of open issues.
Date
Msg-id B1A7485194DE4FDAB8FA781AFB570079@maumau
Whole thread Raw
In response to Re: UTF8 national character data type support WIP patch and list of open issues.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: UTF8 national character data type support WIP patch and list of open issues.
Re: UTF8 national character data type support WIP patch and list of open issues.
List pgsql-hackers
Hello,

I think it would be nice for PostgreSQL to support national character types 
largely because it should ease migration from other DBMSs.

[Reasons why we need NCHAR]
--------------------------------------------------
1. Invite users of other DBMSs to PostgreSQL.  Oracle, SQL Server, MySQL, 
etc. all have NCHAR support.  PostgreSQL is probably the only database out 
of major ones that does not support NCHAR.
Sadly, I've read a report from some Japanese government agency that the 
number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 
2011.  I wouldn't say that is due to NCHAR support, but it might be one 
reason.  I want PostgreSQL to be more popular and regain those users.

2. Enhance the "open" image of PostgreSQL by implementing more features of 
SQL standard.  NCHAR may be a wrong and unnecessary feature of SQL standard 
now that we have Unicode support, but it is defined in the standard and 
widely implemented.

3. I have heard that some potential customers didn't adopt PostgreSQL due to 
lack of NCHAR support.  However, I don't know the exact reason why they need 
NCHAR.

4. I guess some users really want to continue to use ShiftJIS or EUC_JP for 
database encoding, and use NCHAR for a limited set of columns to store 
international text in Unicode:
- to avoid code conversion between the server and the client for performance
- because ShiftJIS and EUC_JP require less amount of storage (2 bytes for 
most Kanji) than UTF-8 (3 bytes)
This use case is described in chapter 6 of "Oracle Database Globalization 
Support Guide".
--------------------------------------------------


I think we need to do the following:

[Minimum requirements]
--------------------------------------------------
1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically.
This is already implemented.  PostgreSQL treats NCHAR/NVARCHAR as synonyms 
for CHAR/VARCHAR, and ignores N prefix.  But this is not documented.

2. Declare support for national character support in the manual.
1 is not sufficient because users don't want to depend on undocumented 
behavior.  This is exactly what the TODO item "national character support" 
in PostgreSQL TODO wiki is about.

3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms so that:
- psql \d can display the user-specified data types.
- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as 
CHAR/VARCHAR.
- To implement additional features for NCHAR/NVARCHAR in the future, as 
described below.
--------------------------------------------------




[Optional requirements]
--------------------------------------------------
1. Implement client driver support, such as:
- NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as 
specified in the SQL standard.
- national character methods (e.g. setNString, getNString, 
setNCharacterStream) as specified in JDBC 4.0.
I think at first we can treat these national-character-specific features as 
the same as CHAR/VARCHAR.

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always 
contain Unicode data.
I think it is sufficient at first that NCHAR/NVARCHAR columns can only be 
used in UTF-8 databases and they store UTF-8 strings.  This allows us to 
reuse the input/output/send/recv functions and other infrastructure of 
CHAR/VARCHAR.  This is a reasonable compromise to avoid duplication and 
minimize the first implementation of NCHAR support.

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in 
Oracle's manual.  But I'm not sure about this, because UTF-16 is not a real 
fixed-width encoding due to supplementary characters.
--------------------------------------------------


I don't think it is good to implement NCHAR/NVARCHAR types as extensions 
like contrib/citext, because NCHAR/NVARCHAR are basic types and need 
client-side support.  That is, client drivers need to be aware of the fixed 
NCHAR/NVARCHAR OID values.

How do you think we should implement NCHAR support?

Regards
MauMau




pgsql-hackers by date:

Previous
From: Andrew Gierth
Date:
Subject: Re: Fix picksplit with nan values
Next
From: Heikki Linnakangas
Date:
Subject: Re: patch: add MAP_HUGETLB to mmap() where supported (WIP)