Re: UTF8 national character data type support WIP patch and list of open issues. - Mailing list pgsql-hackers

From Arulappan, Arul Shaji
Subject Re: UTF8 national character data type support WIP patch and list of open issues.
Date
Msg-id 022C711CCA8AF2459F370E936F2B9E8C010E634B@SYDExchTmp.au.fjanz.com
In response to Re: UTF8 national character data type support WIP patch and list of open issues.  ("MauMau" <maumau307@gmail.com>)
List pgsql-hackers

>-----Original Message-----
>From: pgsql-hackers-owner@postgresql.org
>[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of MauMau
>
>Hello,
>
>I think it would be nice for PostgreSQL to support national character types
>largely because it should ease migration from other DBMSs.
>
>[Reasons why we need NCHAR]
>--------------------------------------------------
>1. Invite users of other DBMSs to PostgreSQL.  Oracle, SQL Server, MySQL, etc.
>all have NCHAR support.  PostgreSQL is probably the only one of the major
>databases that does not support NCHAR.
>Sadly, I've read a report from some Japanese government agency that the number
>of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011.  I
>wouldn't say that is due to NCHAR support, but it might be one reason.  I want
>PostgreSQL to be more popular and regain those users.
>
>2. Enhance the "open" image of PostgreSQL by implementing more features of the
>SQL standard.  NCHAR may be a wrong and unnecessary feature of the SQL standard
>now that we have Unicode support, but it is defined in the standard and widely
>implemented.
>
>3. I have heard that some potential customers didn't adopt PostgreSQL due to
>lack of NCHAR support.  However, I don't know the exact reason why they need
>NCHAR.

The use case we have is for customers who are modernizing their
databases on mainframes. These applications are typically written in
COBOL, which has extensive support for national characters.
Supporting national characters as built-in data types in PostgreSQL is,
without exaggeration, an important criterion in their decision whether
to use PostgreSQL. (So is Embedded COBOL, but that is a separate issue.)



>
>4. I guess some users really want to continue to use ShiftJIS or EUC_JP for
>database encoding, and use NCHAR for a limited set of columns to store
>international text in Unicode:
>- to avoid code conversion between the server and the client for performance
>- because ShiftJIS and EUC_JP require less storage (2 bytes for most Kanji)
>than UTF-8 (3 bytes).
>This use case is described in chapter 6 of "Oracle Database Globalization
>Support Guide".
>--------------------------------------------------
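
To put rough numbers on the storage point in item 4 above, here is a
minimal Python sketch. It only measures encoded byte lengths using
Python's standard codecs; the sample string is my own choice, and
nothing here reflects how PostgreSQL itself stores text.

```python
# Compare how many bytes the same Kanji text needs in each encoding.
# The sample text is an arbitrary illustration (three common Kanji).
text = "日本語"

# Python's codec names for ShiftJIS, EUC_JP and UTF-8.
encodings = ("shift_jis", "euc_jp", "utf-8")

# Map each encoding to the encoded length in bytes.
sizes = {enc: len(text.encode(enc)) for enc in encodings}

for enc, size in sizes.items():
    print(f"{enc}: {size} bytes for {len(text)} characters")
```

For these three Kanji, the two legacy encodings need 2 bytes per
character and UTF-8 needs 3, which matches the figures quoted above.
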
>
>
>I think we need to do the following:
>
>[Minimum requirements]
>--------------------------------------------------
>1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically.
>This is already implemented.  PostgreSQL treats NCHAR/NVARCHAR as synonyms for
>CHAR/VARCHAR, and ignores N prefix.  But this is not documented.
>
>2. Declare national character support in the manual.
>1 is not sufficient because users don't want to depend on undocumented
>behavior.  This is exactly what the TODO item "national character support"
>in PostgreSQL TODO wiki is about.
>
>3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms so that:
>- psql \d can display the user-specified data types.
>- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as
>CHAR/VARCHAR.
>- To implement additional features for NCHAR/NVARCHAR in the future, as
>described below.
>--------------------------------------------------
>

Agreed. This is our minimum requirement too.

Rgds,
Arul Shaji





>
>
>
>[Optional requirements]
>--------------------------------------------------
>1. Implement client driver support, such as:
>- NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as specified
>in the SQL standard.
>- national character methods (e.g. setNString, getNString,
>setNCharacterStream) as specified in JDBC 4.0.
>I think at first we can treat these national-character-specific features as the
>same as CHAR/VARCHAR.
>
>2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain
>Unicode data.
>I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used
>in UTF-8 databases and they store UTF-8 strings.  This allows us to reuse the
>input/output/send/recv functions and other infrastructure of CHAR/VARCHAR.
>This is a reasonable compromise to avoid duplication and minimize the first
>implementation of NCHAR support.
>
>3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
>Fixed-width encoding may allow faster string manipulation as described in
>Oracle's manual.  But I'm not sure about this, because UTF-16 is not a real
>fixed-width encoding due to supplementary characters.

This would definitely be a welcome addition.
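
As a small illustration of the caveat about supplementary characters,
the following Python sketch counts UTF-16 code units per character. The
helper name utf16_code_units and the two sample characters are my own
choices for illustration, not part of any proposed patch.

```python
def utf16_code_units(s: str) -> int:
    """Return the number of 16-bit code units needed to encode s in UTF-16."""
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one code unit.
    return len(s.encode("utf-16-le")) // 2

bmp_char = "\u8a9e"            # U+8A9E, a Kanji inside the Basic Multilingual Plane
supplementary = "\U00020BB7"   # U+20BB7, a Kanji outside the BMP

for ch in (bmp_char, supplementary):
    print(f"U+{ord(ch):04X}: {utf16_code_units(ch)} code unit(s)")
```

The character outside the BMP needs a surrogate pair, that is, two code
units, so character offsets cannot in general be computed by
multiplying by a fixed width.
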



>--------------------------------------------------
>
>
>I don't think it is good to implement NCHAR/NVARCHAR types as extensions like
>contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side
>support.  That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR
>OID values.
>
>How do you think we should implement NCHAR support?
>
>Regards
>MauMau
>




