Thread: UTF8 national character data type support WIP patch and list of open issues.
UTF8 national character data type support WIP patch and list of open issues.
From
"Boguk, Maksym"
Date:
Hi,

As part of my job I started developing in-core support for the UTF8 National Character types (national character / national character varying). I attached the current WIP patch (against HEAD) for community review.

Target usage: the ability to store UTF8 national characters in some selected fields inside a single-byte encoded database. For example, if I have a ru-RU.koi8r encoded database with mostly Russian text inside, it would be nice to be able to store Japanese text in one field without converting the whole database to UTF8 (converting such a database to UTF8 could easily almost double the database size, even if only one field in the whole database uses any symbols outside of the ru-RU.koi8r encoding).

What has been done:
1) Addition of the new string data types NATIONAL CHARACTER and NATIONAL CHARACTER VARYING. These types differ from the char/varchar data types in one important respect: NATIONAL string types always have UTF8 encoding, independent of the database encoding. Of course that leads to encoding conversion overhead when comparing NATIONAL string types with common string types (that is expected and unavoidable).
2) Some ECPG support for these types.
3) Some documentation patch (not finished).

What needs to be done:
1) A full set of string functions and operators for the NATIONAL types (we cannot use the generic text functions because they assume that the strings have the database encoding). Only a basic set is implemented now.
2) Some way to define a default collation for the NATIONAL types.
3) Some way to input UTF8 characters into NATIONAL types via SQL (there is a serious open problem here, described below).

The most serious open problem is that the patch in its current state doesn't allow input/output of UTF8 symbols into NATIONAL fields when those symbols cannot be represented in the database encoding. It happens because the encoding conversion from client_encoding to the database encoding happens before the syntax analysis/parse stage and throws an error for symbols which cannot be represented. I don't see any good solution to this problem, short of making the whole codebase use UTF8 encoding for all internal operations, with a huge performance hit. Maybe someone has a good idea how to deal with this issue.

This is really a WIP patch (with lots of things on the todo list / polish required). Kindly tell me what you think about this idea/patch in general.

PS: It is my first patch to PostgreSQL, so there is a lot of room for improvement in style, for sure.

Kind Regards,
Maksym
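For illustration, a minimal sketch of the intended usage, assuming the WIP patch accepts the standard NATIONAL CHARACTER VARYING column syntax; the database, table and column names are made up, the locale names vary by platform, and the NATIONAL column behavior is the patch's proposal, not current PostgreSQL:

    -- single-byte database for mostly-Russian content
    CREATE DATABASE appdb ENCODING 'KOI8R'
        LC_COLLATE 'ru_RU.KOI8-R' LC_CTYPE 'ru_RU.KOI8-R' TEMPLATE template0;

    -- only the one column that must hold non-KOI8R text is declared NATIONAL;
    -- under the patch it would always be stored as UTF8
    CREATE TABLE customers (
        id        serial PRIMARY KEY,
        name      varchar(100),                     -- stored in the KOI8R database encoding
        intl_note national character varying(200)   -- stored as UTF8 per the proposal
    );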
Attachment
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Heikki Linnakangas
Date:
On 03.09.2013 05:28, Boguk, Maksym wrote: > Target usage: ability to store UTF8 national characters in some > selected fields inside a single-byte encoded database. > For sample if I have a ru-RU.koi8r encoded database with mostly Russian > text inside, it would be nice to be able store an Japanese text in one > field without converting the whole database to UTF8 (convert such > database to UTF8 easily could almost double the database size even if > only one field in whole database will use any symbols outside of > ru-RU.koi8r encoding). Ok. > What has been done: > > 1)Addition of new string data types NATIONAL CHARACTER and NATIONAL > CHARACTER VARIABLE. > These types differ from the char/varchar data types in one important > respect: NATIONAL string types are always have UTF8 encoding even > (independent from used database encoding). I don't like the approach of adding a new data type for this. The encoding used for a text field should be an implementation detail, not something that's exposed to users at the schema-level. A separate data type makes an nvarchar field behave slightly differently from text, for example when it's passed to and from functions. It will also require drivers and client applications to know about it. > What need to be done: > > 1)Full set of string functions and operators for NATIONAL types (we > could not use generic text functions because they assume that the stings > will have database encoding). > Now only basic set implemented. > 2)Need implement some way to define default collation for a NATIONAL > types. > 3)Need implement some way to input UTF8 characters into NATIONAL types > via SQL (there are serious open problem... it will be defined later in > the text). Yeah, all of these issues stem from the fact that the NATIONAL types are separate from text. I think we should take a completely different approach to this. Two alternatives spring to mind: 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently. Then just use that in the whole database. Something like SCSU (http://www.unicode.org/reports/tr6/) should do the trick, although I'm not sure if SCSU can be used as a server-encoding. A lot of code relies on the fact that a server encoding must have the high bit set in all bytes that are part of a multi-byte character. That's why SJIS for example can only be used as a client-encoding. But surely you could come up with some subset or variant of SCSU which satisfies that requirement. 2. Compress the column. Simply do "ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN". That will make Postgres compress that field. That might not be very efficient for compressing short cyrillic text encoded in UTF-8 today, but that could be improved. There has been discussion on supporting more compression algorithms in the past, and one such algorithm could be again something like SCSU. - Heikki
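To illustrate the second alternative concretely, a minimal sketch using standard PostgreSQL commands ("foo" and "bar" are just the placeholder names from above):

    -- keep the column's data in the main table and allow it to be compressed
    ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN;

    -- check the resulting storage mode ('m' = MAIN) in the catalog
    SELECT attname, attstorage
    FROM pg_attribute
    WHERE attrelid = 'foo'::regclass AND attname = 'bar';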
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 03.09.2013 05:28, Boguk, Maksym wrote: >> Target usage: ability to store UTF8 national characters in some >> selected fields inside a single-byte encoded database. > I think we should take a completely different approach to this. Two > alternatives spring to mind: > 1. Implement a new encoding. The new encoding would be some variant of > UTF-8 that encodes languages like Russian more efficiently. +1. I'm not sure that SCSU satisfies the requirement (which I read as that Russian text should be pretty much 1 byte/character). But surely we could devise a variant that does. For instance, it could look like koi8r (or any other single-byte encoding of your choice) with one byte value, say 255, reserved as a prefix. 255 means that a UTF8 character follows. The main complication here is that you don't want to allow more than one way to represent a character --- else you break text hashing, for instance. So you'd have to take care that you never emit the 255+UTF8 representation for a character that can be represented in the single-byte encoding. In particular, you'd never encode ASCII that way, and thus this would satisfy the all-multibyte-chars-must-have-all-high-bits-set rule. Ideally we could make a variant like this for each supported single-byte encoding, and thus you could optimize a database for "mostly but not entirely LATIN1 text", etc. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"Boguk, Maksym"
Date:
>> 1)Addition of new string data types NATIONAL CHARACTER and NATIONAL
>> CHARACTER VARIABLE.
>> These types differ from the char/varchar data types in one important
>> respect: NATIONAL string types are always have UTF8 encoding even
>> (independent from used database encoding).
>
> I don't like the approach of adding a new data type for this. The encoding used for a text field should be an implementation detail, not something that's exposed to users at the schema-level. A separate data type makes an nvarchar field behave slightly differently from text, for example when it's passed to and from functions. It will also require drivers and client applications to know about it.

Hi, my task is implementing the ANSI NATIONAL character string types as part of PostgreSQL core. And the requirement "require drivers and client applications to know about it" is the reason why it cannot be done as an add-on (these new types should have fixed OIDs for most drivers, in my experience). Implementing them as a UTF8 data type is the first step which allows having NATIONAL characters with an encoding different from the database encoding (and might even allow supporting multiple encodings for common string types in the future).

>> 1)Full set of string functions and operators for NATIONAL types (we
>> could not use generic text functions because they assume that the
>> strings will have database encoding).
>> Now only basic set implemented.
>> 2)Need implement some way to define default collation for a NATIONAL
>> types.
>> 3)Need implement some way to input UTF8 characters into NATIONAL types
>> via SQL (there are serious open problem... it will be defined later
>> in the text).
>
> Yeah, all of these issues stem from the fact that the NATIONAL types are separate from text.
> I think we should take a completely different approach to this. Two alternatives spring to mind:
>
> 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently. Then just use that in the whole database. Something like SCSU (http://www.unicode.org/reports/tr6/) should do the trick, although I'm not sure if SCSU can be used as a server-encoding. A lot of code relies on the fact that a server encoding must have the high bit set in all bytes that are part of a multi-byte character. That's why SJIS for example can only be used as a client-encoding. But surely you could come up with some subset or variant of SCSU which satisfies that requirement.
>
> 2. Compress the column. Simply do "ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN". That will make Postgres compress that field. That might not be very efficient for compressing short Cyrillic text encoded in UTF-8 today, but that could be improved. There has been discussion on supporting more compression algorithms in the past, and one such algorithm could be again something like SCSU.

Both of these approaches require a dump/restore of the whole database, which is not always an option. Implementing a UTF8 NATIONAL character as a new datatype will provide the option to use pg_upgrade to the latest version and get the required functionality without prolonged downtime.

PS: is it possible to reserve some narrow type OID range in PostgreSQL core for future use?

Kind Regards,
Maksym
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
"Boguk, Maksym" <maksymb@fast.au.fujitsu.com> writes: > Hi, my task is implementing ANSI NATIONAL character string types as > part of PostgreSQL core. No, that's not a given. You have a problem to solve, ie store some UTF8 strings in a database that's mostly just 1-byte data. It is not clear that NATIONAL CHARACTER is the best solution to that problem. And I don't think that you're going to convince anybody that this is an improvement in spec compliance, because there's too much gap between what you're doing here and what it says in the spec. > Both of these approach requires dump/restore the whole database which is > not always an opinion. That's a disadvantage, agreed, but it's not a large enough one to reject the approach, because what you want to do also has very significant disadvantages. I think it is extremely likely that we will end up rejecting a patch based on NATIONAL CHARACTER altogether. It will require too much duplicative code, it requires too many application-side changes to make use of the functionality, and it will break any applications that are relying on the current behavior of that syntax. But the real problem is that you're commandeering syntax defined in the SQL spec for what is in the end quite a narrow usage. I agree that the use-case will be very handy for some applications ... but if we were ever to try to achieve real spec compliance for the SQL features around character sets, this doesn't look like a step on the way to that. I think you'd be well advised to take a hard look at the specialized-database-encoding approach. From here it looks like a 99% solution for about 1% of the effort; and since it would be quite uninvasive to the system as a whole, it's unlikely that such a patch would get rejected. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
Hello,

I think it would be nice for PostgreSQL to support national character types largely because it should ease migration from other DBMSs.

[Reasons why we need NCHAR]
--------------------------------------------------
1. Invite users of other DBMSs to PostgreSQL.
Oracle, SQL Server, MySQL, etc. all have NCHAR support. PostgreSQL is probably the only database out of major ones that does not support NCHAR. Sadly, I've read a report from some Japanese government agency that the number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I wouldn't say that is due to NCHAR support, but it might be one reason. I want PostgreSQL to be more popular and regain those users.

2. Enhance the "open" image of PostgreSQL by implementing more features of SQL standard.
NCHAR may be a wrong and unnecessary feature of SQL standard now that we have Unicode support, but it is defined in the standard and widely implemented.

3. I have heard that some potential customers didn't adopt PostgreSQL due to lack of NCHAR support. However, I don't know the exact reason why they need NCHAR.

4. I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode:
- to avoid code conversion between the server and the client for performance
- because ShiftJIS and EUC_JP require less amount of storage (2 bytes for most Kanji) than UTF-8 (3 bytes)
This use case is described in chapter 6 of "Oracle Database Globalization Support Guide".
--------------------------------------------------

I think we need to do the following:

[Minimum requirements]
--------------------------------------------------
1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically.
This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for CHAR/VARCHAR, and ignores the N prefix. But this is not documented.

2. Declare support for national character support in the manual.
1 is not sufficient because users don't want to depend on undocumented behavior. This is exactly what the TODO item "national character support" in the PostgreSQL TODO wiki is about.

3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that:
- psql \d can display the user-specified data types.
- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR.
- To implement additional features for NCHAR/NVARCHAR in the future, as described below.
--------------------------------------------------

[Optional requirements]
--------------------------------------------------
1. Implement client driver support, such as:
- NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as specified in the SQL standard.
- national character methods (e.g. setNString, getNString, setNCharacterStream) as specified in JDBC 4.0.
I think at first we can treat these national-character-specific features as the same as CHAR/VARCHAR.

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data.
I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used in UTF-8 databases and they store UTF-8 strings. This allows us to reuse the input/output/send/recv functions and other infrastructure of CHAR/VARCHAR. This is a reasonable compromise to avoid duplication and minimize the first implementation of NCHAR support.

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters.
--------------------------------------------------

I don't think it is good to implement NCHAR/NVARCHAR types as extensions like contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side support. That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR OID values.

How do you think we should implement NCHAR support?

Regards
MauMau
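To see what minimum requirement 1 means in practice, a short sketch (the table name is made up; this is the current synonym behavior, before any of the changes proposed here):

    -- NCHAR / NATIONAL CHARACTER VARYING and the N'...' literal are accepted today,
    -- but the parser maps them to char/varchar and drops the N prefix
    CREATE TABLE msg (c NCHAR(10), v NATIONAL CHARACTER VARYING(20));
    INSERT INTO msg VALUES (N'abc', N'def');
    -- psql's \d msg then shows the columns as character(10) and character varying(20),
    -- i.e. the "national" spelling the user wrote is not preserved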
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"Arulappan, Arul Shaji"
Date:
>-----Original Message----- >From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- >owner@postgresql.org] On Behalf Of MauMau > >Hello, > >I think it would be nice for PostgreSQL to support national character types >largely because it should ease migration from other DBMSs. > >[Reasons why we need NCHAR] >-------------------------------------------------- >1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server, MySQL, etc. >all have NCHAR support. PostgreSQL is probably the only database out of major >ones that does not support NCHAR. >Sadly, I've read a report from some Japanese government agency that the number >of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I >wouldn't say that is due to NCHAR support, but it might be one reason. I want >PostgreSQL to be more popular and regain those users. > >2. Enhance the "open" image of PostgreSQL by implementing more features of SQL >standard. NCHAR may be a wrong and unnecessary feature of SQL standard now >that we have Unicode support, but it is defined in the standard and widely >implemented. > >3. I have heard that some potential customers didn't adopt PostgreSQL due to >lack of NCHAR support. However, I don't know the exact reason why they need >NCHAR. The use case we have is for customer(s) who are modernizing their databases on mainframes. These applications are typically written in COBOL which does have extensive support for National Characters. Supporting National Characters as in-built data types in PostgreSQL is, not to exaggerate, an important criteria in their decision to use PostgreSQL or not. (So is Embedded COBOL. But that is a separate issue.) > >4. I guess some users really want to continue to use ShiftJIS or EUC_JP for >database encoding, and use NCHAR for a limited set of columns to store >international text in Unicode: >- to avoid code conversion between the server and the client for performance >- because ShiftJIS and EUC_JP require less amount of storage (2 bytes for most >Kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of "Oracle >Database Globalization Support Guide". >-------------------------------------------------- > > >I think we need to do the following: > >[Minimum requirements] >-------------------------------------------------- >1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically. >This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for >CHAR/VARCHAR, and ignores N prefix. But this is not documented. > >2. Declare support for national character support in the manual. >1 is not sufficient because users don't want to depend on undocumented >behavior. This is exactly what the TODO item "national character support" >in PostgreSQL TODO wiki is about. > >3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms so that: >- psql \d can display the user-specified data types. >- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as >CHAR/VARCHAR. >- To implement additional features for NCHAR/NVARCHAR in the future, as >described below. >-------------------------------------------------- > Agreed. This is our minimum requirement too. Rgds, Arul Shaji > > > >[Optional requirements] >-------------------------------------------------- >1. Implement client driver support, such as: >- NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as specified >in the SQL standard. >- national character methods (e.g. setNString, getNString, >setNCharacterStream) as specified in JDBC 4.0. 
>I think at first we can treat these national-character-specific features as the >same as CHAR/VARCHAR. > >2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain >Unicode data. >I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used >in UTF-8 databases and they store UTF-8 strings. This allows us to reuse the >input/output/send/recv functions and other infrastructure of CHAR/VARCHAR. >This is a reasonable compromise to avoid duplication and minimize the first >implementation of NCHAR support. > >3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >Fixed-width encoding may allow faster string manipulation as described in >Oracle's manual. But I'm not sure about this, because UTF-16 is not a real >fixed-width encoding due to supplementary characters. This would definitely be a welcome addition. >-------------------------------------------------- > > >I don't think it is good to implement NCHAR/NVARCHAR types as extensions like >contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side >support. That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR >OID values. > >How do you think we should implement NCHAR support? > >Regards >MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote: > 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always > contain Unicode data. ... > 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. > Fixed-width encoding may allow faster string manipulation as described in > Oracle's manual. But I'm not sure about this, because UTF-16 is not a real > fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote: >> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always >> contain Unicode data. >> ... >> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >> Fixed-width encoding may allow faster string manipulation as described in >> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real >> fixed-width encoding due to supplementary characters. > It seems to me that these two points here are the real core of your > proposal. The rest is just syntactic sugar. > Let me start with the second one: I don't think there's likely to be > any benefit in using UTF-16 as the internal encoding. In fact, I > think it's likely to make things quite a bit more complicated, because > we have a lot of code that assumes that server encodings have certain > properties that UTF-16 doesn't - specifically, that any byte with the > high-bit clear represents the corresponding ASCII character. Another point to keep in mind is that UTF16 is not really any easier to deal with than UTF8, unless you write code that fails to support characters outside the basic multilingual plane. Which is a restriction I don't believe we'd accept. But without that restriction, you're still forced to deal with variable-width characters; and there's nothing very nice about the way that's done in UTF16. So on the whole I think it makes more sense to use UTF8 for this. I share Robert's misgivings about difficulties in dealing with characters that are not representable in the database's principal encoding. Still, you probably won't find out about many of those until you try it. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Heikki Linnakangas
Date:
On 18.09.2013 16:16, Robert Haas wrote: > On Mon, Sep 16, 2013 at 8:49 AM, MauMau<maumau307@gmail.com> wrote: >> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always >> contain Unicode data. > ... >> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >> Fixed-width encoding may allow faster string manipulation as described in >> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real >> fixed-width encoding due to supplementary characters. > > It seems to me that these two points here are the real core of your > proposal. The rest is just syntactic sugar. > > Let me start with the second one: I don't think there's likely to be > any benefit in using UTF-16 as the internal encoding. In fact, I > think it's likely to make things quite a bit more complicated, because > we have a lot of code that assumes that server encodings have certain > properties that UTF-16 doesn't - specifically, that any byte with the > high-bit clear represents the corresponding ASCII character. > > As to the first one, if we're going to go to the (substantial) trouble > of building infrastructure to allow a database to store data in > multiple encodings, why limit it to storing UTF-8 in non-UTF-8 > databases? What about storing SHIFT-JIS in UTF-8 databases, or > Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other > combination you might care to name? > > Whether we go that way or not, I think storing data in one encoding in > a database with a different encoding is going to be pretty tricky and > require far-reaching changes. You haven't mentioned any of those > issues or discussed how you would solve them. I'm not too thrilled about complicating the system for that, either. If you really need to deal with many different languages, you can do that today by using UTF-8 everywhere. Sure, it might not be the most efficient encoding for some characters, but it works. There is one reason, however, that makes it a lot more compelling: we already support having databases with different encodings in the same cluster, but the encoding used in the shared catalogs, for usernames and database names for example, is not well-defined. If we dealt with different encodings in the same database, that inconsistency would go away. - Heikki
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote: >> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always >> contain Unicode data. > ... >> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >> Fixed-width encoding may allow faster string manipulation as described in >> Oracle's manual. But I'm not sure about this, because UTF-16 is not a >> real >> fixed-width encoding due to supplementary characters. > > It seems to me that these two points here are the real core of your > proposal. The rest is just syntactic sugar. No, those are "desirable if possible" features. What's important is to declare in the manual that PostgreSQL officially supports national character types, as I stated below. > 1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically. > This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms > for CHAR/VARCHAR, and ignores N prefix. But this is not documented. > > 2. Declare support for national character support in the manual. > 1 is not sufficient because users don't want to depend on undocumented > behavior. This is exactly what the TODO item "national character support" > in PostgreSQL TODO wiki is about. > > 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms so > that: > - psql \d can display the user-specified data types. > - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as > CHAR/VARCHAR. > - To implement additional features for NCHAR/NVARCHAR in the future, as > described below. And when declaring that, we had better implement NCHAR types as distinct types with their own OIDs so that we can extend NCHAR behavior in the future. As the first stage, I think it's okay to treat NCHAR types exactly the same as CHAR/VARCHAR types. For example, in ECPG: switch (type) case OID_FOR_CHAR: case OID_FOR_VARCHAR: case OID_FOR_TEXT: case OID_FOR_NCHAR: /* new code */ case OID_FOR_NVARCHAR: /* new code */some processing;break;And in JDBC, just call methods for non-national character types. Currently, those national character methods throw SQLException. public void setNString(int parameterIndex, String value) throws SQLException {setString(parameterIndex, value); } > Let me start with the second one: I don't think there's likely to be > any benefit in using UTF-16 as the internal encoding. In fact, I > think it's likely to make things quite a bit more complicated, because > we have a lot of code that assumes that server encodings have certain > properties that UTF-16 doesn't - specifically, that any byte with the > high-bit clear represents the corresponding ASCII character. > > As to the first one, if we're going to go to the (substantial) trouble > of building infrastructure to allow a database to store data in > multiple encodings, why limit it to storing UTF-8 in non-UTF-8 > databases? What about storing SHIFT-JIS in UTF-8 databases, or > Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other > combination you might care to name? > > Whether we go that way or not, I think storing data in one encoding in > a database with a different encoding is going to be pretty tricky and > require far-reaching changes. You haven't mentioned any of those > issues or discussed how you would solve them. Yes, you are probably right -- I'm not sure UTF-16 has really benefits that UTF-8 doesn't have. But why did Windows and Java choose UTF-16 for internal strings rather than UTF-8? Why did Oracle recommend UTF-16 for NCHAR? 
I have no clear idea. Anyway, I don't strongly push UTF-16 and complicate the encoding handling. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us> > Another point to keep in mind is that UTF16 is not really any easier > to deal with than UTF8, unless you write code that fails to support > characters outside the basic multilingual plane. Which is a restriction > I don't believe we'd accept. But without that restriction, you're still > forced to deal with variable-width characters; and there's nothing very > nice about the way that's done in UTF16. So on the whole I think it > makes more sense to use UTF8 for this. I feel so. I guess why Windows, Java, and Oracle chose UTF-16 is ... it was UCS-2 only with BMP when they chose it. So character handling was easier and faster thanks to fixed-width encoding. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Wed, Sep 18, 2013 at 6:42 PM, MauMau <maumau307@gmail.com> wrote: >> It seems to me that these two points here are the real core of your >> proposal. The rest is just syntactic sugar. > > No, those are "desirable if possible" features. What's important is to > declare in the manual that PostgreSQL officially supports national character > types, as I stated below. That may be what's important to you, but it's not what's important to me. I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. And I think it's an even worse idea to introduce them now, making them work one way, and then later change the behavior in a backward-incompatible fashion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > That may be what's important to you, but it's not what's important to > me. National character types support may be important to some potential users of PostgreSQL and the popularity of PostgreSQL, not me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria includes national character support. > I am not keen to introduce support for nchar and nvarchar as > differently-named types with identical semantics. Similar examples already exist: - varchar and text: the only difference is the existence of explicit length limit - numeric and decimal - int and int4, smallint and int2, bigint and int8 - real/double precison and float In addition, the SQL standard itself admits: "The <key word>s NATIONAL CHARACTER are used to specify the character type with an implementation- defined character set. Special syntax (N'string') is provided for representing literals in that character set. ... "NATIONAL CHARACTER" is equivalent to the corresponding <character string type> with a specification of "CHARACTER SET CSN", where "CSN" is an implementation-defined <character set name>." "A <national character string literal> is equivalent to a <character string literal> with the "N" replaced by "<introducer><character set specification>", where "<character set specification>" is an implementation- defined <character set name>." > And I think it's an > even worse idea to introduce them now, making them work one way, and > then later change the behavior in a backward-incompatible fashion. I understand your feeling. The concern about incompatibility can be eliminated by thinking the following way. How about this? - NCHAR can be used with any database encoding. - At first, NCHAR is exactly the same as CHAR. That is, "implementation-defined character set" described in the SQL standard is the database character set. - In the future, the character set for NCHAR can be selected at database creation like Oracle's CREATE DATABAWSE .... NATIONAL CHARACTER SET AL16UTF16. The default it the database set. Could you tell me what kind of specification we should implement if we officially support national character types? Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote: >> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always >> contain Unicode data. > ... >> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >> Fixed-width encoding may allow faster string manipulation as described in >> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real >> fixed-width encoding due to supplementary characters. > > It seems to me that these two points here are the real core of your > proposal. The rest is just syntactic sugar. > > Let me start with the second one: I don't think there's likely to be > any benefit in using UTF-16 as the internal encoding. In fact, I > think it's likely to make things quite a bit more complicated, because > we have a lot of code that assumes that server encodings have certain > properties that UTF-16 doesn't - specifically, that any byte with the > high-bit clear represents the corresponding ASCII character. Agreed. > As to the first one, if we're going to go to the (substantial) trouble > of building infrastructure to allow a database to store data in > multiple encodings, why limit it to storing UTF-8 in non-UTF-8 > databases? What about storing SHIFT-JIS in UTF-8 databases, or > Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other > combination you might care to name? > > Whether we go that way or not, I think storing data in one encoding in > a database with a different encoding is going to be pretty tricky and > require far-reaching changes. You haven't mentioned any of those > issues or discussed how you would solve them. What about limiting to use NCHAR with a database which has same encoding or "compatible" encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. I suppose what encoding is used for NCHAR should be defined in initdb time or creation of the database (if we allow this, we need to add a new column to know what encoding is used for NCHAR). For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and database encoding is UTF-8 because there is a conversion between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding is ISO-8859-1 because there's no conversion between them. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Valentine Gogichashvili
Date:
Hi,
>> That may be what's important to you, but it's not what's important to
>> me.
>
> National character types support may be important to some potential users of PostgreSQL and the popularity of PostgreSQL, not me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria includes national character support.
The whole NCHAR thing appeared as a hack for systems that did not have it from the beginning. It would not be needed if all text were magically stored in UNICODE or UTF from the beginning, and the idea of a character were the same as the idea of a rune, not a byte.
PostgreSQL has very powerful facilities for storing any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added?
It would make it possible to have a database, that talks to the clients in UTF8 and stores text and varchar data in the encoding that is the most appropriate for the situation.
It will make it impossible (or complicated) to have a database with a non-UTF8 default encoding (I wonder who would need that in this case), as conversions will not be possible from the broader charsets into the default database encoding.
One could define an additional DATABASE property like LC_ENCODING that would work for the ENCODING property of a column like LC_COLLATE for COLLATE property of a column.
Text operations should work automatically, as in memory all strings will be converted to the database encoding.
This approach will also open a possibility to implement custom ENCODINGs for the column data storage, like snappy compression or even BSON, gobs or protobufs for much more compact type storage.
Regards,
-- Valentine Gogichashvili
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Martijn van Oosterhout
Date:
On Fri, Sep 20, 2013 at 08:58:53AM +0900, Tatsuo Ishii wrote: > For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is > UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is > SHIFT-JIS and database encoding is UTF-8 because there is a conversion > between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is > SHIFT-JIS and database encoding is ISO-8859-1 because there's no > conversion between them. As far as I can tell the whole reason for introducing NCHAR is to support SHIFT-JIS, there hasn't been call for any other encodings, that I can remember anyway. So rather than this whole NCHAR thing, why not just add a type "sjistext", and a few type casts and call it a day... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Thu, Sep 19, 2013 at 6:42 PM, MauMau <maumau307@gmail.com> wrote:
> National character types support may be important to some potential users of
> PostgreSQL and the popularity of PostgreSQL, not me. That's why national
> character support is listed in the PostgreSQL TODO wiki. We might be losing
> potential users just because their selection criteria includes national
> character support.

We'd have to go back and search the archives to figure out why that item was added to the TODO, but I'd be surprised if anyone ever had it in mind to create additional types that behave just like existing types but with different names. I don't think that you'll be able to get consensus around that path on this mailing list.

>> I am not keen to introduce support for nchar and nvarchar as
>> differently-named types with identical semantics.
>
> Similar examples already exist:
>
> - varchar and text: the only difference is the existence of explicit length
> limit
> - numeric and decimal
> - int and int4, smallint and int2, bigint and int8
> - real/double precison and float

I agree that the fact we have both varchar and text feels like a wart. The other examples mostly involve different names for the same underlying type, and so are different from what you are asking for here.

> I understand your feeling. The concern about incompatibility can be
> eliminated by thinking the following way. How about this?
>
> - NCHAR can be used with any database encoding.
>
> - At first, NCHAR is exactly the same as CHAR. That is,
> "implementation-defined character set" described in the SQL standard is the
> database character set.
>
> - In the future, the character set for NCHAR can be selected at database
> creation like Oracle's CREATE DATABAWSE .... NATIONAL CHARACTER SET
> AL16UTF16. The default it the database set.

Hmm. So under that design, a database could support up to a total of two character sets, the one that you get when you say 'foo' and the other one that you get when you say n'foo'. I guess we could do that, but it seems a bit limited. If we're going to go to the trouble of supporting multiple character sets, why not support an arbitrary number instead of just two?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Thu, Sep 19, 2013 at 7:58 PM, Tatsuo Ishii <ishii@postgresql.org> wrote: > What about limiting to use NCHAR with a database which has same > encoding or "compatible" encoding (on which the encoding conversion is > defined)? This way, NCHAR text can be automatically converted from > NCHAR to the database encoding in the server side thus we can treat > NCHAR exactly same as CHAR afterward. I suppose what encoding is used > for NCHAR should be defined in initdb time or creation of the database > (if we allow this, we need to add a new column to know what encoding > is used for NCHAR). > > For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is > UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is > SHIFT-JIS and database encoding is UTF-8 because there is a conversion > between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is > SHIFT-JIS and database encoding is ISO-8859-1 because there's no > conversion between them. I think the point here is that, at least as I understand it, encoding conversion and sanitization happens at a very early stage right now, when we first receive the input from the client. If the user sends a string of bytes as part of a query or bind placeholder that's not valid in the database encoding, it's going to error out before any type-specific code has an opportunity to get control. Look at textin(), for example. There's no encoding check there. That means it's already been done at that point. To make this work, someone's going to have to figure out what to do about *that*. Until we have a sketch of what the design for that looks like, I don't see how we can credibly entertain more specific proposals. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On 9/20/13 2:22 PM, Robert Haas wrote: >>> I am not keen to introduce support for nchar and nvarchar as >>> >> differently-named types with identical semantics. >> > >> > Similar examples already exist: >> > >> > - varchar and text: the only difference is the existence of explicit length >> > limit >> > - numeric and decimal >> > - int and int4, smallint and int2, bigint and int8 >> > - real/double precison and float > I agree that the fact we have both varchar and text feels like a wart. > The other examples mostly involve different names for the same > underlying type, and so are different from what you are asking for > here. Also note that we already have NCHAR [VARYING]. It's mapped to char or varchar, respectively, in the parser, just like int, real, etc. are handled.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Tatsuo Ishii" <ishii@postgresql.org> > What about limiting to use NCHAR with a database which has same > encoding or "compatible" encoding (on which the encoding conversion is > defined)? This way, NCHAR text can be automatically converted from > NCHAR to the database encoding in the server side thus we can treat > NCHAR exactly same as CHAR afterward. I suppose what encoding is used > for NCHAR should be defined in initdb time or creation of the database > (if we allow this, we need to add a new column to know what encoding > is used for NCHAR). > > For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is > UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is > SHIFT-JIS and database encoding is UTF-8 because there is a conversion > between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is > SHIFT-JIS and database encoding is ISO-8859-1 because there's no > conversion between them. Thanks for the idea, it sounds flexible for wider use. Your cooperation would be much appreciated to devise implementation with as little code as possible. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Valentine Gogichashvili" <valgog@gmail.com> > the whole NCHAR appeared as hack for the systems, that did not have it > from > the beginning. It would not be needed, if all the text would be magically > stored in UNICODE or UTF from the beginning and idea of character would be > the same as an idea of a rune and not a byte. I guess so, too. > PostgreSQL has a very powerful possibilities for storing any kind of > encoding. So maybe it makes sense to add the ENCODING as another column > property, the same way a COLLATION was added? Some other people in this community suggested that. ANd the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARASET SET ch. > Text operations should work automatically, as in memory all strings will > be > converted to the database encoding. > > This approach will also open a possibility to implement custom ENCODINGs > for the column data storage, like snappy compression or even BSON, gobs or > protbufs for much more compact type storage. Thanks for your idea that sounds interesting, although I don't understand that well. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Martijn van Oosterhout" <kleptog@svana.org> > As far as I can tell the whole reason for introducing NCHAR is to > support SHIFT-JIS, there hasn't been call for any other encodings, that > I can remember anyway. Could you elaborate on this, giving some info sources? > So rather than this whole NCHAR thing, why not just add a type > "sjistext", and a few type casts and call it a day... The main reason for supporting NCHAR types is to ease migration from other DBMSs, not requiring DDL changes. So sjistext does not match that purpose. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > I don't think that you'll be able to > get consensus around that path on this mailing list. > I agree that the fact we have both varchar and text feels like a wart. Is that right? I don't feel varchar/text case is a wart. I think text was introduced for a positive reason to ease migration from other DBMSs. The manual says: http://www.postgresql.org/docs/current/static/datatype-character.html "Although the type text is not in the SQL standard, several other SQL database management systems have it as well." And isn't EnterpriseDB doing similar things for Oracle compatibility, although I'm not sure about the details? Could you share your idea why we won't get consensus? >> I understand your feeling. The concern about incompatibility can be >> eliminated by thinking the following way. How about this? >> >> - NCHAR can be used with any database encoding. >> >> - At first, NCHAR is exactly the same as CHAR. That is, >> "implementation-defined character set" described in the SQL standard is >> the >> database character set. >> >> - In the future, the character set for NCHAR can be selected at database >> creation like Oracle's CREATE DATABAWSE .... NATIONAL CHARACTER SET >> AL16UTF16. The default it the database set. > > Hmm. So under that design, a database could support up to a total of > two character sets, the one that you get when you say 'foo' and the > other one that you get when you say n'foo'. > > I guess we could do that, but it seems a bit limited. If we're going > to go to the trouble of supporting multiple character sets, why not > support an arbitrary number instead of just two? I agree with you about the arbitrary number. Tatsuo san gave us a good suggestion. Let's consider how to implement that. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > On Thu, Sep 19, 2013 at 7:58 PM, Tatsuo Ishii <ishii@postgresql.org> > wrote: >> What about limiting to use NCHAR with a database which has same >> encoding or "compatible" encoding (on which the encoding conversion is >> defined)? This way, NCHAR text can be automatically converted from >> NCHAR to the database encoding in the server side thus we can treat >> NCHAR exactly same as CHAR afterward. I suppose what encoding is used >> for NCHAR should be defined in initdb time or creation of the database >> (if we allow this, we need to add a new column to know what encoding >> is used for NCHAR). >> >> For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is >> UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is >> SHIFT-JIS and database encoding is UTF-8 because there is a conversion >> between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is >> SHIFT-JIS and database encoding is ISO-8859-1 because there's no >> conversion between them. > > I think the point here is that, at least as I understand it, encoding > conversion and sanitization happens at a very early stage right now, > when we first receive the input from the client. If the user sends a > string of bytes as part of a query or bind placeholder that's not > valid in the database encoding, it's going to error out before any > type-specific code has an opportunity to get control. Look at > textin(), for example. There's no encoding check there. That means > it's already been done at that point. To make this work, someone's > going to have to figure out what to do about *that*. Until we have a > sketch of what the design for that looks like, I don't see how we can > credibly entertain more specific proposals. OK, I see your point. Let's consider that design. I'll learn the code regarding this. Does anybody, especially Tatsuo san, Tom san, Peter san, have any good idea? Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> I think the point here is that, at least as I understand it, encoding
> conversion and sanitization happens at a very early stage right now,
> when we first receive the input from the client. If the user sends a
> string of bytes as part of a query or bind placeholder that's not
> valid in the database encoding, it's going to error out before any
> type-specific code has an opportunity to get control. Look at
> textin(), for example. There's no encoding check there. That means
> it's already been done at that point. To make this work, someone's
> going to have to figure out what to do about *that*. Until we have a
> sketch of what the design for that looks like, I don't see how we can
> credibly entertain more specific proposals.

I don't think the bind placeholder is the problem case. That is processed by exec_bind_message() in postgres.c. It has enough info about the type of the placeholder, and I think we can easily deal with NCHAR. The same can be said for the COPY case.

The problem is an ordinary query (simple protocol "Q" message), as you pointed out. Encoding conversion happens at a very early stage (note that the fast-path case has the same issue). If a query message contains, say, both SHIFT-JIS and EUC-JP, then we are going to run into trouble because the encoding conversion routine (pg_client_to_server) assumes that the message from the client contains only one encoding. However, my question is, does that really happen? Is there any text editor which can create SHIFT-JIS and EUC-JP mixed text? So my guess is, when a user wants to use NCHAR as SHIFT-JIS text, the rest of the query consists of either SHIFT-JIS or plain ASCII. If so, what the user needs to do is set the client encoding to SHIFT-JIS, and everything should be fine.

MauMau, is my guess correct?

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Valentine Gogichashvili
Date:
>> PostgreSQL has a very powerful possibilities for storing any kind of
>> encoding. So maybe it makes sense to add the ENCODING as another column
>> property, the same way a COLLATION was added?
>
> Some other people in this community suggested that. And the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARACTER SET cs.
>
>> Text operations should work automatically, as in memory all strings will be
>> converted to the database encoding.
>>
>> This approach will also open a possibility to implement custom ENCODINGs
>> for the column data storage, like snappy compression or even BSON, gobs or
>> protobufs for much more compact type storage.
>
> Thanks for your idea; that sounds interesting, although I don't understand it that well.
The idea is very simple:
CREATE DATABASE utf8_database ENCODING 'utf8';
\c utf8_database
CREATE TABLE a(
id serial,
ascii_data text ENCODING 'ascii', -- will use ascii_to_utf8 to read and utf8_to_ascii to write
koi8_data text ENCODING 'koi8_r', -- will use koi8_r_to_utf8 to read and utf8_to_koi8_r to write
json_data json ENCODING 'bson' -- will use bson_to_json to read and json_to_bson to write
);
The problem with bson_to_json here is that it will probably not be possible to write JSON in koi8_r, for example. But that case is not even being considered in these discussions so far anyway.
If the ENCODING machinery were given not only the encoding name but also the type OID, it should be possible to write encoders for TYPEs and arrays of TYPEs (I had to do that using casts to bytea and protobuf to minimize the storage size for an array of types when writing a lot of data, which could then be unpacked afterwards directly in the DB as normal database types).
I hope I made my point a little bit clearer.
Regards,
Valentine Gogichashvili
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Tatsuo Ishii" <ishii@postgresql.org> > I don't think the bind placeholder is the case. That is processed by > exec_bind_message() in postgres.c. It has enough info about the type > of the placeholder, and I think we can easily deal with NCHAR. Same > thing can be said to COPY case. Yes, I've learned it. Agreed. If we allow an encoding for NCHAR different from the database encoding, we can convert text from the client encoding to the NCHAR encoding in nchar_in() for example. We can retrieve the NCHAR encoding from pg_database and store it in a global variable at session start. > Problem is an ordinary query (simple protocol "Q" message) as you > pointed out. Encoding conversion happens at a very early stage (note > that fast-path case has the same issue). If a query message contains, > say, SHIFT-JIS and EUC-JP, then we are going into trouble because the > encoding conversion routine (pg_client_to_server) regards that the > message from client contains only one encoding. However my question > is, does it really happen? Because there's any text editor which can > create SHIFT-JIS and EUC-JP mixed text. So my guess is, when user want > to use NCHAR as SHIFT-JIS text, the rest of query consist of either > SHIFT-JIS or plain ASCII. If so, what the user need to do is, set the > client encoding to SJIFT-JIS and everything should be fine. > > Maumau, is my guess correct? Yes, I believe you are right. Regardless of whether we support multiple encodings in one database or not, a single client encoding will be sufficient for one session. When receiving the "Q" message, the whole SQL text is converted from the client encoding to the database encoding. This part needs no modification. During execution of the "Q" message, NCHAR values are converted from the database encoding to the NCHAR encoding. Thank you very much, Tatsuo san. Everybody, is there any other challenge we should consider to support NCHAR/NVARCHAR types as distinct types? Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Fri, Sep 20, 2013 at 8:32 PM, MauMau <maumau307@gmail.com> wrote: >> I don't think that you'll be able to >> get consensus around that path on this mailing list. >> I agree that the fact we have both varchar and text feels like a wart. > > Is that right? I don't feel varchar/text case is a wart. I think text was > introduced for a positive reason to ease migration from other DBMSs. The > manual says: > > http://www.postgresql.org/docs/current/static/datatype-character.html > > "Although the type text is not in the SQL standard, several other SQL > database management systems have it as well." > > And isn't EnterpriseDB doing similar things for Oracle compatibility, > although I'm not sure about the details? Could you share your idea why we > won't get consensus? Sure, it's EnterpriseDB's policy to add features that facilitate migrations from other databases - particularly Oracle - to our product, Advanced Server, even if those features don't otherwise add any value. However, the community is usually reluctant to add such features to PostgreSQL. Also, at least up until now, the existing aliasing of nchar and nchar varying to other data types has been adequate for the needs of our customers, and we've handled a bunch of other type-name incompatibilities with similar tricks. What you are proposing goes off in a different direction from both PostgreSQL and Advanced Server, and that's why I'm skeptical. If you were proposing something that we were doing in Advanced Server with great success, it would be a bit disingenuous of me to argue against doing the same thing in PostgreSQL, but that's not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On 9/23/13 2:53 AM, MauMau wrote: > Yes, I believe you are right. Regardless of whether we support multiple > encodings in one database or not, a single client encoding will be > sufficient for one session. When receiving the "Q" message, the whole > SQL text is converted from the client encoding to the database > encoding. This part needs no modification. During execution of the "Q" > message, NCHAR values are converted from the database encoding to the > NCHAR encoding. That assumes that the conversion client encoding -> server encoding -> NCHAR encoding is not lossy. I thought one main point of this exercise was to avoid these conversions and be able to go straight from client encoding into NCHAR.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > Sure, it's EnterpriseDB's policy to add features that facilitate > migrations from other databases - particularly Oracle - to our > product, Advanced Server, even if those features don't otherwise add > any value. However, the community is usually reluctant to add such > features to PostgreSQL. Also, at least up until now, the existing > aliasing of nchar and nchar varying to other data types has been > adequate for the needs of our customers, and we've handled a bunch of > other type-name incompatibilities with similar tricks. What you are > proposing goes off in a different direction from both PostgreSQL and > Advanced Server, and that's why I'm skeptical. If you were proposing > something that we were doing in Advanced Server with great success, it > would be a bit disingenuous of me to argue against doing the same > thing in PostgreSQL, but that's not the case. Sorry, I didn't mean to imitate EnterpriseDB. My intent is to just increase the popularity of PostgreSQL (or prevent the drop in popularity?). NCHAR is so basic that we can/should accept proper support. Aliasing would be nice to some extent, if its offical support would be documented in PG manual. However, just aliasing loses NCHAR type information through pg_dump. This is contrary to the benefit of pg_dump -- allow migration from PG to other DBMSs, possibly for performance comparison: http://www.postgresql.org/docs/current/static/app-pgdump.html "Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products." In addition, distinct types for NCHAR/NVARCHAR allow future extension such as different encoding for NCHAR and UTF-16. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Peter Eisentraut" <peter_e@gmx.net> > That assumes that the conversion client encoding -> server encoding -> > NCHAR encoding is not lossy. Yes, so Tatsuo san suggested to restrict server encoding <-> NCHAR encoding combination to those with lossless conversion. > I thought one main point of this exercise > was the avoid these conversions and be able to go straight from client > encoding into NCHAR. It's slightly different. Please see the following excerpt: http://www.postgresql.org/message-id/B1A7485194DE4FDAB8FA781AFB570079@maumau "4. I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client for performance - because ShiftJIS and EUC_JP require less amount of storage (2 bytes for most Kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of "Oracle Database Globalization Support Guide"." Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On Tue, 2013-09-24 at 21:04 +0900, MauMau wrote: > "4. I guess some users really want to continue to use ShiftJIS or EUC_JP for > database encoding, and use NCHAR for a limited set of columns to store > international text in Unicode: > - to avoid code conversion between the server and the client for performance > - because ShiftJIS and EUC_JP require less amount of storage (2 bytes for > most Kanji) than UTF-8 (3 bytes) > This use case is described in chapter 6 of "Oracle Database Globalization > Support Guide"." But your proposal wouldn't address the first point, because data would have to go client -> server -> NCHAR. The second point is valid, but it's going to be an awful amount of work for that limited result.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Peter Eisentraut" <peter_e@gmx.net> > On Tue, 2013-09-24 at 21:04 +0900, MauMau wrote: >> "4. I guess some users really want to continue to use ShiftJIS or EUC_JP >> for >> database encoding, and use NCHAR for a limited set of columns to store >> international text in Unicode: >> - to avoid code conversion between the server and the client for >> performance >> - because ShiftJIS and EUC_JP require less amount of storage (2 bytes for >> most Kanji) than UTF-8 (3 bytes) >> This use case is described in chapter 6 of "Oracle Database Globalization >> Support Guide"." > > But your proposal wouldn't address the first point, because data would > have to go client -> server -> NCHAR. > > The second point is valid, but it's going to be an awful amount of work > for that limited result. I (or, Oracle's use case) meant the following, for example: initdb -E EUC_JP CREATE DATABASE mydb ENCODING EUC_JP NATIONAL ENCODING UTF-8; CREATE TABLE mytable ( col1 char(10), -- EUC_JP text col2 Nchar(10), -- UTF-8 text ); client encoding = EUC_JP That is, 1. Currently, the user is only handling Japanese text. To avoid unnecessary conversion, he uses EUC_JP for both client and server. 2. He needs to store some limited amount of international (non-Japanese) text in a few columns for a new feature of the system. But the international text is limited, so he wants to sacrifice performance and storage cost due to code conversion for most text and more bytes for each character. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Greg Stark
Date:
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Tue, Sep 24, 2013 at 1:04 PM, MauMau <span dir="ltr"><<ahref="mailto:maumau307@gmail.com" target="_blank">maumau307@gmail.com</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Yes, so Tatsuo sansuggested to restrict server encoding <-> NCHAR encoding combination to those with lossless conversion.</blockquote></div><br/></div><div class="gmail_extra">If it's not lossy then what's the point? From the client'spoint of view it'll be functionally equivalent to text then.<br /></div><div class="gmail_extra"><br clear="all"/><br />-- <br />greg<br /></div></div>
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Greg Stark" <stark@mit.edu> > If it's not lossy then what's the point? From the client's point of view > it'll be functionally equivalent to text then. Sorry, what Tatsuo san suggested meant was "same or compatible", not lossy. I quote the relevant part below. This is enough for the use case I mentioned in my previous mail several hours ago (actually, that is what Oracle manual describes...). http://www.postgresql.org/message-id/20130920.085853.1628917054830864151.t-ishii@sraoss.co.jp [Excerpt] ---------------------------------------- What about limiting to use NCHAR with a database which has same encoding or "compatible" encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. I suppose what encoding is used for NCHAR should be defined in initdb time or creation of the database (if we allow this, we need to add a new column to know what encoding is used for NCHAR). For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and database encoding is UTF-8 because there is a conversion between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding is ISO-8859-1 because there's no conversion between them. ---------------------------------------- Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"Arulappan, Arul Shaji"
Date:
Attached is a patch that implements the first set of changes discussed in this thread originally. They are: (i) Implements NCHAR/NVARCHAR as distinct data types, not as synonyms so that: - psql \d can display the user-specified data types. - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR. - Groundwork to implement additional features for NCHAR/NVARCHAR in the future (For eg: separate encoding for nchar columns). (ii) Support for NCHAR/NVARCHAR in ECPG (iii) Documentation changes to reflect the new data type Rgds, Arul Shaji >-----Original Message----- >From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- >owner@postgresql.org] On Behalf Of MauMau > >From: "Greg Stark" <stark@mit.edu> >> If it's not lossy then what's the point? From the client's point of >> view it'll be functionally equivalent to text then. > >Sorry, what Tatsuo san suggested meant was "same or compatible", not lossy. >I quote the relevant part below. This is enough for the use case I mentioned >in my previous mail several hours ago (actually, that is what Oracle manual >describes...). > >http://www.postgresql.org/message-id/20130920.085853.162891705483086415 1.t- >ishii@sraoss.co.jp > >[Excerpt] >---------------------------------------- >What about limiting to use NCHAR with a database which has same encoding or >"compatible" encoding (on which the encoding conversion is defined)? This way, >NCHAR text can be automatically converted from NCHAR to the database encoding >in the server side thus we can treat NCHAR exactly same as CHAR afterward. I >suppose what encoding is used for NCHAR should be defined in initdb time or >creation of the database (if we allow this, we need to add a new column to know >what encoding is used for NCHAR). > >For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is >UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and >database encoding is UTF-8 because there is a conversion between UTF-8 and >SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding >is ISO-8859-1 because there's no conversion between them. >---------------------------------------- > > >Regards >MauMau > > > >-- >Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make >changes to your subscription: >http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Albe Laurenz
Date:
Arul Shaji Arulappan wrote: > Attached is a patch that implements the first set of changes discussed > in this thread originally. They are: > > (i) Implements NCHAR/NVARCHAR as distinct data types, not as synonyms so > that: > - psql \d can display the user-specified data types. > - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, > not as CHAR/VARCHAR. > - Groundwork to implement additional features for NCHAR/NVARCHAR > in the future (For eg: separate encoding for nchar columns). > (ii) Support for NCHAR/NVARCHAR in ECPG > (iii) Documentation changes to reflect the new data type If I understood the discussion correctly the use case is that there are advantages to having a database encoding different from UTF-8, but you'd still want some UTF-8 columns. Wouldn't it be a better design to allow specifying the encoding per column? That would give you more flexibility. I know that NCHAR/NVARCHAR is SQL standard, but I still think that it is a wart. Yours, Laurenz Albe
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Albe Laurenz" <laurenz.albe@wien.gv.at> > If I understood the discussion correctly the use case is that > there are advantages to having a database encoding different > from UTF-8, but you'd still want sume UTF-8 columns. > > Wouldn't it be a better design to allow specifying the encoding > per column? That would give you more flexibility. Yes, you are right. In the previous discussion: - That would be nice if available, but it is hard to implement multiple encodings in one database. - Some people (I'm not sure many or few) are NCHAR/NVARCHAR in other DBMSs. To invite them to PostgreSQL, it's important to support national character feature syntactically and document it in the manual. This is the first step. - As the second step, we can implement multiple encodings in one database. According to the SQL standard, "NCHAR(n)" is equivalent to "CHAR(n) CHARACTER SET cs", where cs is an implementation-defined character set. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Albe Laurenz
Date:
MauMau wrote: > From: "Albe Laurenz" <laurenz.albe@wien.gv.at> >> If I understood the discussion correctly the use case is that >> there are advantages to having a database encoding different >> from UTF-8, but you'd still want some UTF-8 columns. >> >> Wouldn't it be a better design to allow specifying the encoding >> per column? That would give you more flexibility. > > Yes, you are right. In the previous discussion: > > - That would be nice if available, but it is hard to implement multiple > encodings in one database. Granted. > - Some people (I'm not sure whether many or few) are using NCHAR/NVARCHAR in other DBMSs. > To invite them to PostgreSQL, it's important to support the national character > feature syntactically and document it in the manual. This is the first > step. I looked into the Standard, and it does not have NVARCHAR. The type is called NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING or NCHAR VARYING. I guess that the goal of this patch is to support Oracle syntax. But anybody trying to port CREATE TABLE statements from Oracle is already exposed to enough incompatibilities that the difference between NVARCHAR and NCHAR VARYING will not be the reason to reject PostgreSQL. In other words, I doubt that introducing the nonstandard NVARCHAR will have more benefits than drawbacks (new reserved word). Regarding the Standard compliant names of these data types, PostgreSQL already supports those. Maybe some documentation would help. > - As the second step, we can implement multiple encodings in one database. > According to the SQL standard, "NCHAR(n)" is equivalent to "CHAR(n) > CHARACTER SET cs", where cs is an implementation-defined character set. That second step would definitely have benefits. But I don't think that this requires the first step that your patch implements, it is in fact orthogonal. I don't think that there is any need to change NCHAR even if we get per-column encoding, it is just syntactic sugar to support SQL Feature F421. Why not tackle the second step first? Yours, Laurenz Albe
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Albe Laurenz" <laurenz.albe@wien.gv.at> > I looked into the Standard, and it does not have NVARCHAR. > The type is called NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING > or NCHAR VARYING. OUch, that's just a mistake in my mail. You are correct. > > I guess that the goal of this patch is to support Oracle syntax. > But anybody trying to port CREATE TABLE statements from Oracle > is already exposed to enough incompatibilities that the difference between > NVARCHAR and NCHAR VARYING will not be the reason to reject PostgreSQL. > In other words, I doubt that introducing the nonstandard NVARCHAR > will have more benefits than drawbacks (new reserved word). Agreed. But I'm in favor of supporting other DBMS's syntax if it doesn't complicate the spec or implementation too much, because it can help migrate to PostgreSQL. I understand PostgreSQL has made such efforts like PL/pgSQL which is similar to PL/SQL, text data type, AS in SELECT statement, etc. > But I don't think that this requires the first step that your patch > implements, it is in fact orthogonal. (It's not "my" patch.) > Regarding the Standard compliant names of these data types, PostgreSQL > already supports those. Maybe some documentation would help. > > I don't think that there is any need to change NCHAR even if we > get per-column encoding, it is just syntactic sugar to support > SQL Feature F421. Maybe so. I guess the distinct type for NCHAR is for future extension and user friendliness. As one user, I expect to get "national character" instead of "char character set xxx" as output of psql \d and pg_dump when I specified "national character" in DDL. In addition, that makes it easy to use the pg_dump output for importing data to other DBMSs for some reason. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote: > Implements NCHAR/NVARCHAR as distinct data types, not as synonyms If, per SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET "cs", then for some "cs", NCHAR(x) must be the same as CHAR(x). Therefore, an implementation as separate data types is wrong.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Tue, Nov 5, 2013 at 5:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote: >> Implements NCHAR/NVARCHAR as distinct data types, not as synonyms > > If, per SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET > "cs", then for some "cs", NCHAR(x) must be the same as CHAR(x). > Therefore, an implementation as separate data types is wrong. Interesting. Since the point doesn't seem to be getting through, let me try to be more clear: we're not going to accept any form of this patch. A patch that makes some progress toward actually coping with multiple encodings in the same database would be very much worth considering, but adding compatible syntax with incompatible semantics is not of interest to the PostgreSQL project. We have had this debate on many other topics in the past and will no doubt have it again in the future, but the outcome is always the same. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > On Tue, Nov 5, 2013 at 5:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote: >> On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote: >>> Implements NCHAR/NVARCHAR as distinct data types, not as synonyms >> >> If, per SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET >> "cs", then for some "cs", NCHAR(x) must be the same as CHAR(x). >> Therefore, an implementation as separate data types is wrong. > > Since the point doesn't seem to be getting through, let me try to be > more clear: we're not going to accept any form of this patch. A patch > that makes some progress toward actually coping with multiple > encodings in the same database would be very much worth considering, > but adding compatible syntax with incompatible semantics is not of > interest to the PostgreSQL project. We have had this debate on many > other topics in the past and will no doubt have it again in the > future, but the outcome is always the same. It doesn't seem that there is any semantics incompatible with the SQL standard as follows: - In the first step, "cs" is the database encoding, which is used for char/varchar/text. - In the second (or final) step, where multiple encodings per database is supported, "cs" is the national character encoding which is specified with CREATE DATABASE ... NATIONAL CHARACTER ENCODING cs. If NATIONAL CHARACTER ENCODING clause is omitted, "cs" is the database encoding as step 1. Let me repeat myself: I think the biggest and immediate issue is that PostgreSQL does not support national character types at least officially. "Officially" means the description in the manual. So I don't have strong objection against the current (hidden) implementation of nchar types in PostgreSQL which are just synonyms, as long as the official support is documented. Serious users don't want to depend on hidden features. However, doesn't the current synonym approach have any problems? Wouldn't it produce any trouble in the future? If we treat nchar as char, we lose the fact that the user requested nchar. Can we lose the fact so easily and produce irreversible result as below? -------------------------------------------------- Maybe so. I guess the distinct type for NCHAR is for future extension and user friendliness. As one user, I expect to get "national character" instead of "char character set xxx" as output of psql \d and pg_dump when I specified "national character" in DDL. In addition, that makes it easy to use the pg_dump output for importing data to other DBMSs for some reason. -------------------------------------------------- Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Albe Laurenz
Date:
MauMau wrote: > Let me repeat myself: I think the biggest and immediate issue is that > PostgreSQL does not support national character types at least officially. > "Officially" means the description in the manual. So I don't have strong > objection against the current (hidden) implementation of nchar types in > PostgreSQL which are just synonyms, as long as the official support is > documented. Serious users don't want to depend on hidden features. I agree with you there. Actually it is somewhat documented in http://www.postgresql.org/docs/9.3/static/features-sql-standard.html as "F421", but that requires that you read the SQL standard. > However, doesn't the current synonym approach have any problems? Wouldn't > it produce any trouble in the future? If we treat nchar as char, we lose > the fact that the user requested nchar. Can we lose the fact so easily and > produce irreversible result as below? I don't think that it is a problem. According to the SQL standard, the user requested a CHAR or VARCHAR with an encoding of the choice of the DBMS. PostgreSQL chooses the database encoding. In a way, it is similar to using the "data type" serial. The column will be displayed as "integer", and the information that it was a serial can only be inferred from the DEFAULT value. It seems that this is working fine and does not cause many problems, so I don't see why things should be different here. Again, for serial the behaviour is well documented, so that seconds your request for more documentation. Would you like to write a patch for that? Yours, Laurenz Albe
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Albe Laurenz" <laurenz.albe@wien.gv.at> In a way, it is similar to using the "data type" serial. The column will be displayed as "integer", and the information that it was a serial can only be inferred from the DEFAULT value. It seems that this is working fine and does not cause many problems, so I don't see why things should be different here. Yes, I agree with you in that serial being a synonym is almost no problem. But that's because serial is not an SQL-standard data type but a type unique to PostgreSQL. On the other hand, nchar is an established data type in the SQL standard. I think most people will expect to get "nchar" as output from psql \d and pg_dump as they specified in DDL. If they get "char" as output for "nchar" columns from pg_dump, wouldn't they get in trouble if they want to import schema/data from PostgreSQL to other database products? The documentation for pg_dump says that pg_dump pays attention to easing migrating to other DBMSs. I like this idea and want to respect this. http://www.postgresql.org/docs/current/static/app-pgdump.html -------------------------------------------------- Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products. ... --use-set-session-authorization Output SQL-standard SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to determine object ownership. This makes the dump more standards-compatible, ... -------------------------------------------------- Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
"MauMau" <maumau307@gmail.com> writes: > On the other hand, nchar is an established data type in the SQL standard. I > think most people will expect to get "nchar" as output from psql \d and > pg_dump as they specified in DDL. This argument seems awfully weak. You've been able to say create table nt (nf national character varying(22)); in Postgres since around 1997, but I don't recall one single bug report about how that is displayed as just "character varying(22)". The other big problem with this line of argument is that you're trying to claim better spec compliance for what is at best a rather narrow interpretation with really minimal added functionality. (In fact, until you have a solution for the problem that incoming and outgoing data must be in the database's primary encoding, you don't actually have *any* added functionality, just syntactic sugar that does nothing useful.) Unless you can demonstrate by lawyerly reading of the spec that the spec requires exactly the behavior this patch implements, you do not have a leg to stand on here. But you can't demonstrate that, because it doesn't. I'd be much more impressed by seeing a road map for how we get to a useful amount of added functionality --- which, to my mind, would be the ability to support N different encodings in one database, for N>2. But even if you think N=2 is sufficient, we haven't got a road map, and commandeering spec-mandated syntax for an inadequate feature doesn't seem like a good first step. It'll just make our backwards-compatibility problems even worse when somebody does come up with a real solution. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> I'd be much more impressed by seeing a road map for how we get to a > useful amount of added functionality --- which, to my mind, would be > the ability to support N different encodings in one database, for N>2. > But even if you think N=2 is sufficient, we haven't got a road map, and > commandeering spec-mandated syntax for an inadequate feature doesn't seem > like a good first step. It'll just make our backwards-compatibility > problems even worse when somebody does come up with a real solution.

I have been thinking about this for years, and I think the key idea here is implementing a "universal encoding". The universal encoding should have the following characteristics to support N>2 encodings in a database:

1) no loss in round-trip encoding conversion
2) no mapping table is necessary to convert from/to existing encodings

Once we implement the universal encoding, other problems such as the "pg_database with multiple encodings" problem can be solved easily. Currently there is no such universal encoding in the universe; I think the only way is to invent it ourselves. At this point the design of the encoding I have in mind is:

1) A 1-byte encoding identifier + a 7-byte body (8 bytes in total). The encoding identifier's value is between 0x80 and 0xff and is assigned to an existing encoding such as UTF-8, ASCII, EUC-JP and so on. The encodings should be limited to "database safe" encodings. The encoding body is the raw character bytes represented in the existing encoding. This form is called a "word".

2) We also have a "multibyte" representation of the universal encoding. The first byte represents the length of the multibyte character (similar to the first byte of UTF-8). The second byte is the encoding identifier explained above. The rest of the character is the same as above.

#1 and #2 are logically the same and can be converted to each other, and we can use either of them whenever we like. Form #1 is easy to handle because each word has a fixed length (8 bytes), so it would probably be used for temporary data in memory. The second form saves space and would be used in the stored data itself.

If we want to have a table encoded in an encoding different from the database encoding, the table is encoded in the universal encoding. pg_class should remember that fact to avoid confusion about what encoding a table is using. I think the majority of tables in a database use the same encoding as the database encoding; only a few tables want to have a different encoding. The design pushes the penalty onto those minorities. If we need to join two tables which have different encodings, we need to convert them into the same encoding (this should succeed if the encodings are "compatible"). If that fails, the join will fail too. We could extend the technique above to a design which allows each column to have a different encoding.

-- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
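To make the two representations easier to see, here is a toy C sketch of the proposal (illustration only -- nothing like this exists in PostgreSQL, the identifier values are made up, and the choice to count the two header bytes in the length byte is just one possible convention):

#include <stdint.h>
#include <string.h>

/*
 * Form #1: a fixed 8-byte "word" = 1 identifier byte (0x80-0xff, naming the
 * original encoding) + up to 7 raw bytes of the character in that encoding.
 * Form #2: a variable-length form = 1 length byte + 1 identifier byte + the
 * raw bytes, which saves space when stored in the data itself.
 */
typedef struct UniversalWord
{
    uint8_t  id;        /* source-encoding identifier, 0x80..0xff */
    uint8_t  body[7];   /* raw character bytes in that encoding */
} UniversalWord;

/* Convert a fixed-length word into the variable-length form; returns bytes written. */
static int
word_to_multibyte(const UniversalWord *w, int body_len, uint8_t *out)
{
    out[0] = (uint8_t) (2 + body_len);  /* total length, including the two header bytes */
    out[1] = w->id;
    memcpy(out + 2, w->body, body_len);
    return 2 + body_len;
}

/*
 * Recover the original-encoding bytes by skipping the two header bytes --
 * the cheap step that would let existing per-encoding text routines be
 * reused unchanged.
 */
static const uint8_t *
multibyte_to_raw(const uint8_t *mb, int *raw_len)
{
    *raw_len = mb[0] - 2;
    return mb + 2;
}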
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On 11/12/13, 1:57 AM, Tatsuo Ishii wrote: > Currently there's no such an universal encoding in the universe, I > think the only way is, inventing it by ourselves. I think ISO 2022 is something in that direction, but it's not ASCII-safe, AFAICT.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Martijn van Oosterhout
Date:
On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote: > I have been thinking about this for years and I think the key idea for > this is, implementing "universal encoding". The universal encoding > should have following characteristics to implement N>2 encoding in a > database. > > 1) no loss of round trip encoding conversion > > 2) no mapping table is necessary to convert from/to existing encodings > > Once we implement the universal encoding, other problem such as > "pg_database with multiple encoding problem" can be solved easily. Isn't this essentially what the MULE internal encoding is? > Currently there's no such an universal encoding in the universe, I > think the only way is, inventing it by ourselves. This sounds like a terrible idea. In the future people are only going to want more advanced text functions, regular expressions, indexing and making encodings that don't exist anywhere else seems like a way to make a lot of work for little benefit. A better idea seems to me is to (if postgres is configured properly) embed the non-round-trippable characters in the custom character part of the unicode character set. In other words, adjust the mappings tables on demand and voila. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
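A toy sketch of the on-demand mapping Martijn suggests (illustration only, not PostgreSQL code; the table layout and code-point range are just one possible choice): characters with no Unicode mapping get assigned a code point in the Private Use Area the first time they are seen, and the assignment is remembered so the original bytes can be reproduced on output.

#include <stdint.h>

#define PUA_FIRST 0xE000
#define PUA_LAST  0xF8FF

typedef struct PuaEntry
{
    uint32_t src_char;        /* unmappable character, as a value in the source encoding */
    uint32_t pua_codepoint;   /* Private Use Area code point handed out for it */
} PuaEntry;

static PuaEntry pua_table[PUA_LAST - PUA_FIRST + 1];
static int      pua_used = 0;

/*
 * Return the PUA code point for an unmappable source character, allocating
 * a fresh one the first time it is seen; returns 0 if the PUA is exhausted.
 */
static uint32_t
map_to_pua(uint32_t src_char)
{
    int     i;

    for (i = 0; i < pua_used; i++)
        if (pua_table[i].src_char == src_char)
            return pua_table[i].pua_codepoint;

    if (PUA_FIRST + pua_used > PUA_LAST)
        return 0;

    pua_table[pua_used].src_char = src_char;
    pua_table[pua_used].pua_codepoint = (uint32_t) (PUA_FIRST + pua_used);
    return pua_table[pua_used++].pua_codepoint;
}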
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes: > On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote: >> Once we implement the universal encoding, other problem such as >> "pg_database with multiple encoding problem" can be solved easily. > Isn't this essentially what the MULE internal encoding is? MULE is completely evil. It has N different encodings for the same character, not to mention no support code available. >> Currently there's no such an universal encoding in the universe, I >> think the only way is, inventing it by ourselves. > This sounds like a terrible idea. In the future people are only going > to want more advanced text functions, regular expressions, indexing and > making encodings that don't exist anywhere else seems like a way to > make a lot of work for little benefit. Agreed. > A better idea seems to me is to (if postgres is configured properly) > embed the non-round-trippable characters in the custom character part > of the unicode character set. In other words, adjust the mappings > tables on demand and voila. From the standpoint of what will happen with existing library code (like strcoll), I'm not sure it's all that easy. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> Isn't this essentially what the MULE internal encoding is? No. MULE is not powerful enough and is overly complicated for dealing with different encodings (character sets). >> Currently there's no such an universal encoding in the universe, I >> think the only way is, inventing it by ourselves. > > This sounds like a terrible idea. In the future people are only going > to want more advanced text functions, regular expressions, indexing and > making encodings that don't exist anywhere else seems like a way to > make a lot of work for little benefit. That is probably a misunderstanding. We don't need to modify existing text handling modules such as text functions, regular expressions, indexing etc. We just convert from the "universal" encoding X to the original encoding before calling them. The process is pretty easy and fast because it just requires skipping the "encoding identifier" and "encoding length" parts. Basically the encoding X should be used in the lower layers of PostgreSQL, and higher layer modules such as those living in src/backend/utils/adt should not be aware of it. > A better idea seems to me is to (if postgres is configured properly) > embed the non-round-trippable characters in the custom character part > of the unicode character set. In other words, adjust the mappings > tables on demand and voila. Using Unicode requires overhead for encoding conversion because it needs to look up mapping tables. That would be a huge handicap for large data, and that is what I want to avoid in the first place. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> MULE is completely evil. > It has N different encodings for the same > character, What's wrong with that? It aims that in the first place. > not to mention no support code available. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes: >> MULE is completely evil. >> It has N different encodings for the same character, > What's wrong with that? It aims that in the first place. It greatly complicates comparisons --- at least, if you'd like to preserve the principle that strings that appear the same are equal. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> writes: >>> MULE is completely evil. >>> It has N different encodings for the same character, > >> What's wrong with that? It aims that in the first place. > > It greatly complicates comparisons --- at least, if you'd like to preserve > the principle that strings that appear the same are equal. You don't need to consider that, because there's no place in PostgreSQL where a MULE encoded text consists of multiple encodings as far as I know. BTW, the same character being assigned different code points is pretty common in many character sets (Unicode, for example). -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes: > BTW, the same character being assigned different code points is pretty common > in many character sets (Unicode, for example). This is widely considered a security bug; read section 10 in RFC 3629 (the definition of UTF8), and search the CVE database a bit if you still doubt it's a threat. I'm going to push back very hard on any suggestion that Postgres should build itself around a text representation with that kind of weakness designed in. regards, tom lane [1] http://tools.ietf.org/html/rfc3629#section-10
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Chapman Flack
Date:
Hi, Although this is a ten-year-old message, it was the one I found quickly when looking to see what the current state of play on this might be. On 2013-09-20 14:22, Robert Haas wrote: > Hmm. So under that design, a database could support up to a total of > two character sets, the one that you get when you say 'foo' and the > other one that you get when you say n'foo'. > > I guess we could do that, but it seems a bit limited. If we're going > to go to the trouble of supporting multiple character sets, why not > support an arbitrary number instead of just two? Because that old thread came to an end without mentioning how the standard approaches that, it seemed worth adding, just to complete the record. In the draft of the standard I'm looking at (which is also around a decade old), n'foo' is nothing but a handy shorthand for _csname'foo' (which is a syntax we do not accept) for some particular csname that was chosen when setting up the db. So really, the standard contemplates letting you have columns of arbitrary different charsets (CHAR(x) CHARACTER SET csname), and literals of arbitrary charsets _csname'foo'. Then, as a bit of sugar, you get to pick which two of those charsets you'd like to have easy shorter ways of writing, 'foo' or n'foo', CHAR or NCHAR. The grammar for csname is kind of funky. It can be nothing but <SQL language identifier>, which has the nice restricted form /[A-Za-z][A-Za-z0-9_]*/. But it can also be schema-qualified, with the schema of course being a full-fledged <identifier>. So yeah, to fully meet this part of the standard, the parser'd have to know that _U&"I am a schema nameZ0021" UESCAPE 'Z'/*hi!*/.LATIN1'foo' is a string literal, expressing foo, in a character set named LATIN1, in some cutely-named schema. Never a dull moment. Regards, -Chap