Thread: UTF8 national character data type support WIP patch and list of open issues.
UTF8 national character data type support WIP patch and list of open issues.
From
"Boguk, Maksym"
Date:
Hi,

As part of my job I started developing in-core support for the UTF8 National Character types (national character / national character varying). I attached the current WIP patch (against HEAD) for community review.

Target usage: the ability to store UTF8 national characters in some selected fields inside a single-byte encoded database. For example, if I have a ru-RU.koi8r encoded database with mostly Russian text inside, it would be nice to be able to store Japanese text in one field without converting the whole database to UTF8 (converting such a database to UTF8 could easily almost double the database size, even if only one field in the whole database uses any symbols outside of the ru-RU.koi8r encoding).

What has been done:
1) Addition of the new string data types NATIONAL CHARACTER and NATIONAL CHARACTER VARYING. These types differ from the char/varchar data types in one important respect: NATIONAL string types always have UTF8 encoding, independent of the database encoding. Of course that leads to encoding conversion overhead when comparing NATIONAL string types with common string types (that is expected and unavoidable).
2) Some ECPG support for these types.
3) Some documentation patch (not finished).

What needs to be done:
1) A full set of string functions and operators for the NATIONAL types (we cannot use the generic text functions because they assume that the strings have the database encoding). Only a basic set is implemented now.
2) Some way to define a default collation for the NATIONAL types.
3) Some way to input UTF8 characters into NATIONAL types via SQL (there is a serious open problem here, described below).

The most serious open problem is that the patch in its current state doesn't allow input/output of UTF8 symbols into NATIONAL fields when those symbols cannot be represented in the database encoding. It happens because the encoding conversion from client_encoding to the database encoding happens before the syntax analysis/parse stage and throws an error for symbols which cannot be represented. I don't see any good solution to this problem, short of making the whole codebase use UTF8 encoding for all internal operations, with a huge performance hit. Maybe someone has a good idea how to deal with this issue.

This is really a WIP patch (with lots of things on the todo list / polish required). Kindly tell me what you think about this idea/patch in general.

PS: It is my first patch to PostgreSQL, so there is a lot of room for improvement in style, for sure.

Kind Regards,
Maksym
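For illustration, a minimal sketch of the intended usage, assuming the WIP patch accepts the standard NATIONAL CHARACTER VARYING column syntax; the database, table and column names are made up, the locale names vary by platform, and the NATIONAL column behavior is the patch's proposal, not current PostgreSQL:

    -- single-byte database for mostly-Russian content
    CREATE DATABASE appdb ENCODING 'KOI8R'
        LC_COLLATE 'ru_RU.KOI8-R' LC_CTYPE 'ru_RU.KOI8-R' TEMPLATE template0;

    -- only the one column that must hold non-KOI8R text is declared NATIONAL;
    -- under the patch it would always be stored as UTF8
    CREATE TABLE customers (
        id        serial PRIMARY KEY,
        name      varchar(100),                     -- stored in the KOI8R database encoding
        intl_note national character varying(200)   -- stored as UTF8 per the proposal
    );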
Attachment
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Heikki Linnakangas
Date:
On 03.09.2013 05:28, Boguk, Maksym wrote: > Target usage: ability to store UTF8 national characters in some > selected fields inside a single-byte encoded database. > For sample if I have a ru-RU.koi8r encoded database with mostly Russian > text inside, it would be nice to be able store an Japanese text in one > field without converting the whole database to UTF8 (convert such > database to UTF8 easily could almost double the database size even if > only one field in whole database will use any symbols outside of > ru-RU.koi8r encoding). Ok. > What has been done: > > 1)Addition of new string data types NATIONAL CHARACTER and NATIONAL > CHARACTER VARIABLE. > These types differ from the char/varchar data types in one important > respect: NATIONAL string types are always have UTF8 encoding even > (independent from used database encoding). I don't like the approach of adding a new data type for this. The encoding used for a text field should be an implementation detail, not something that's exposed to users at the schema-level. A separate data type makes an nvarchar field behave slightly differently from text, for example when it's passed to and from functions. It will also require drivers and client applications to know about it. > What need to be done: > > 1)Full set of string functions and operators for NATIONAL types (we > could not use generic text functions because they assume that the stings > will have database encoding). > Now only basic set implemented. > 2)Need implement some way to define default collation for a NATIONAL > types. > 3)Need implement some way to input UTF8 characters into NATIONAL types > via SQL (there are serious open problem... it will be defined later in > the text). Yeah, all of these issues stem from the fact that the NATIONAL types are separate from text. I think we should take a completely different approach to this. Two alternatives spring to mind: 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently. Then just use that in the whole database. Something like SCSU (http://www.unicode.org/reports/tr6/) should do the trick, although I'm not sure if SCSU can be used as a server-encoding. A lot of code relies on the fact that a server encoding must have the high bit set in all bytes that are part of a multi-byte character. That's why SJIS for example can only be used as a client-encoding. But surely you could come up with some subset or variant of SCSU which satisfies that requirement. 2. Compress the column. Simply do "ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN". That will make Postgres compress that field. That might not be very efficient for compressing short cyrillic text encoded in UTF-8 today, but that could be improved. There has been discussion on supporting more compression algorithms in the past, and one such algorithm could be again something like SCSU. - Heikki
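To illustrate the second alternative concretely, a minimal sketch using standard PostgreSQL commands ("foo" and "bar" are just the placeholder names from above):

    -- keep the column's data in the main table and allow it to be compressed
    ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN;

    -- check the resulting storage mode ('m' = MAIN) in the catalog
    SELECT attname, attstorage
    FROM pg_attribute
    WHERE attrelid = 'foo'::regclass AND attname = 'bar';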
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > On 03.09.2013 05:28, Boguk, Maksym wrote: >> Target usage: ability to store UTF8 national characters in some >> selected fields inside a single-byte encoded database. > I think we should take a completely different approach to this. Two > alternatives spring to mind: > 1. Implement a new encoding. The new encoding would be some variant of > UTF-8 that encodes languages like Russian more efficiently. +1. I'm not sure that SCSU satisfies the requirement (which I read as that Russian text should be pretty much 1 byte/character). But surely we could devise a variant that does. For instance, it could look like koi8r (or any other single-byte encoding of your choice) with one byte value, say 255, reserved as a prefix. 255 means that a UTF8 character follows. The main complication here is that you don't want to allow more than one way to represent a character --- else you break text hashing, for instance. So you'd have to take care that you never emit the 255+UTF8 representation for a character that can be represented in the single-byte encoding. In particular, you'd never encode ASCII that way, and thus this would satisfy the all-multibyte-chars-must-have-all-high-bits-set rule. Ideally we could make a variant like this for each supported single-byte encoding, and thus you could optimize a database for "mostly but not entirely LATIN1 text", etc. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"Boguk, Maksym"
Date:
>> 1)Addition of new string data types NATIONAL CHARACTER and NATIONAL
>> CHARACTER VARIABLE.
>> These types differ from the char/varchar data types in one important
>> respect: NATIONAL string types are always have UTF8 encoding even
>> (independent from used database encoding).
>
> I don't like the approach of adding a new data type for this. The encoding used for a text field should be an implementation detail, not something that's exposed to users at the schema-level. A separate data type makes an nvarchar field behave slightly differently from text, for example when it's passed to and from functions. It will also require drivers and client applications to know about it.

Hi, my task is implementing the ANSI NATIONAL character string types as part of PostgreSQL core. And the requirement "require drivers and client applications to know about it" is the reason why it cannot be done as an add-on (these new types should have fixed OIDs for most drivers, in my experience). Implementing them as a UTF8 data type is the first step which allows having NATIONAL characters with an encoding different from the database encoding (and might even allow supporting multiple encodings for common string types in the future).

>> 1)Full set of string functions and operators for NATIONAL types (we
>> could not use generic text functions because they assume that the
>> strings will have database encoding).
>> Now only basic set implemented.
>> 2)Need implement some way to define default collation for a NATIONAL
>> types.
>> 3)Need implement some way to input UTF8 characters into NATIONAL types
>> via SQL (there are serious open problem... it will be defined later
>> in the text).
>
> Yeah, all of these issues stem from the fact that the NATIONAL types are separate from text.
> I think we should take a completely different approach to this. Two alternatives spring to mind:
>
> 1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently. Then just use that in the whole database. Something like SCSU (http://www.unicode.org/reports/tr6/) should do the trick, although I'm not sure if SCSU can be used as a server-encoding. A lot of code relies on the fact that a server encoding must have the high bit set in all bytes that are part of a multi-byte character. That's why SJIS for example can only be used as a client-encoding. But surely you could come up with some subset or variant of SCSU which satisfies that requirement.
>
> 2. Compress the column. Simply do "ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN". That will make Postgres compress that field. That might not be very efficient for compressing short Cyrillic text encoded in UTF-8 today, but that could be improved. There has been discussion on supporting more compression algorithms in the past, and one such algorithm could be again something like SCSU.

Both of these approaches require a dump/restore of the whole database, which is not always an option. Implementing a UTF8 NATIONAL character as a new datatype will provide the option to use pg_upgrade to the latest version and get the required functionality without prolonged downtime.

PS: is it possible to reserve some narrow type OID range in PostgreSQL core for future use?

Kind Regards,
Maksym
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
"Boguk, Maksym" <maksymb@fast.au.fujitsu.com> writes: > Hi, my task is implementing ANSI NATIONAL character string types as > part of PostgreSQL core. No, that's not a given. You have a problem to solve, ie store some UTF8 strings in a database that's mostly just 1-byte data. It is not clear that NATIONAL CHARACTER is the best solution to that problem. And I don't think that you're going to convince anybody that this is an improvement in spec compliance, because there's too much gap between what you're doing here and what it says in the spec. > Both of these approach requires dump/restore the whole database which is > not always an opinion. That's a disadvantage, agreed, but it's not a large enough one to reject the approach, because what you want to do also has very significant disadvantages. I think it is extremely likely that we will end up rejecting a patch based on NATIONAL CHARACTER altogether. It will require too much duplicative code, it requires too many application-side changes to make use of the functionality, and it will break any applications that are relying on the current behavior of that syntax. But the real problem is that you're commandeering syntax defined in the SQL spec for what is in the end quite a narrow usage. I agree that the use-case will be very handy for some applications ... but if we were ever to try to achieve real spec compliance for the SQL features around character sets, this doesn't look like a step on the way to that. I think you'd be well advised to take a hard look at the specialized-database-encoding approach. From here it looks like a 99% solution for about 1% of the effort; and since it would be quite uninvasive to the system as a whole, it's unlikely that such a patch would get rejected. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
Hello,

I think it would be nice for PostgreSQL to support national character types largely because it should ease migration from other DBMSs.

[Reasons why we need NCHAR]
--------------------------------------------------
1. Invite users of other DBMSs to PostgreSQL.
Oracle, SQL Server, MySQL, etc. all have NCHAR support. PostgreSQL is probably the only database out of major ones that does not support NCHAR. Sadly, I've read a report from some Japanese government agency that the number of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I wouldn't say that is due to NCHAR support, but it might be one reason. I want PostgreSQL to be more popular and regain those users.

2. Enhance the "open" image of PostgreSQL by implementing more features of SQL standard.
NCHAR may be a wrong and unnecessary feature of SQL standard now that we have Unicode support, but it is defined in the standard and widely implemented.

3. I have heard that some potential customers didn't adopt PostgreSQL due to lack of NCHAR support. However, I don't know the exact reason why they need NCHAR.

4. I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode:
- to avoid code conversion between the server and the client for performance
- because ShiftJIS and EUC_JP require less amount of storage (2 bytes for most Kanji) than UTF-8 (3 bytes)
This use case is described in chapter 6 of "Oracle Database Globalization Support Guide".
--------------------------------------------------

I think we need to do the following:

[Minimum requirements]
--------------------------------------------------
1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically.
This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for CHAR/VARCHAR, and ignores the N prefix. But this is not documented.

2. Declare support for national character support in the manual.
1 is not sufficient because users don't want to depend on undocumented behavior. This is exactly what the TODO item "national character support" in the PostgreSQL TODO wiki is about.

3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms, so that:
- psql \d can display the user-specified data types.
- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR.
- To implement additional features for NCHAR/NVARCHAR in the future, as described below.
--------------------------------------------------

[Optional requirements]
--------------------------------------------------
1. Implement client driver support, such as:
- NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as specified in the SQL standard.
- national character methods (e.g. setNString, getNString, setNCharacterStream) as specified in JDBC 4.0.
I think at first we can treat these national-character-specific features as the same as CHAR/VARCHAR.

2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain Unicode data.
I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used in UTF-8 databases and they store UTF-8 strings. This allows us to reuse the input/output/send/recv functions and other infrastructure of CHAR/VARCHAR. This is a reasonable compromise to avoid duplication and minimize the first implementation of NCHAR support.

3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
Fixed-width encoding may allow faster string manipulation as described in Oracle's manual. But I'm not sure about this, because UTF-16 is not a real fixed-width encoding due to supplementary characters.
--------------------------------------------------

I don't think it is good to implement NCHAR/NVARCHAR types as extensions like contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side support. That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR OID values.

How do you think we should implement NCHAR support?

Regards
MauMau
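To see what minimum requirement 1 means in practice, a short sketch (the table name is made up; this is the current synonym behavior, before any of the changes proposed here):

    -- NCHAR / NATIONAL CHARACTER VARYING and the N'...' literal are accepted today,
    -- but the parser maps them to char/varchar and drops the N prefix
    CREATE TABLE msg (c NCHAR(10), v NATIONAL CHARACTER VARYING(20));
    INSERT INTO msg VALUES (N'abc', N'def');
    -- psql's \d msg then shows the columns as character(10) and character varying(20),
    -- i.e. the "national" spelling the user wrote is not preserved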
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"Arulappan, Arul Shaji"
Date:
>-----Original Message----- >From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- >owner@postgresql.org] On Behalf Of MauMau > >Hello, > >I think it would be nice for PostgreSQL to support national character types >largely because it should ease migration from other DBMSs. > >[Reasons why we need NCHAR] >-------------------------------------------------- >1. Invite users of other DBMSs to PostgreSQL. Oracle, SQL Server, MySQL, etc. >all have NCHAR support. PostgreSQL is probably the only database out of major >ones that does not support NCHAR. >Sadly, I've read a report from some Japanese government agency that the number >of MySQL users exceeded that of PostgreSQL here in Japan in 2010 or 2011. I >wouldn't say that is due to NCHAR support, but it might be one reason. I want >PostgreSQL to be more popular and regain those users. > >2. Enhance the "open" image of PostgreSQL by implementing more features of SQL >standard. NCHAR may be a wrong and unnecessary feature of SQL standard now >that we have Unicode support, but it is defined in the standard and widely >implemented. > >3. I have heard that some potential customers didn't adopt PostgreSQL due to >lack of NCHAR support. However, I don't know the exact reason why they need >NCHAR. The use case we have is for customer(s) who are modernizing their databases on mainframes. These applications are typically written in COBOL which does have extensive support for National Characters. Supporting National Characters as in-built data types in PostgreSQL is, not to exaggerate, an important criteria in their decision to use PostgreSQL or not. (So is Embedded COBOL. But that is a separate issue.) > >4. I guess some users really want to continue to use ShiftJIS or EUC_JP for >database encoding, and use NCHAR for a limited set of columns to store >international text in Unicode: >- to avoid code conversion between the server and the client for performance >- because ShiftJIS and EUC_JP require less amount of storage (2 bytes for most >Kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of "Oracle >Database Globalization Support Guide". >-------------------------------------------------- > > >I think we need to do the following: > >[Minimum requirements] >-------------------------------------------------- >1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically. >This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms for >CHAR/VARCHAR, and ignores N prefix. But this is not documented. > >2. Declare support for national character support in the manual. >1 is not sufficient because users don't want to depend on undocumented >behavior. This is exactly what the TODO item "national character support" >in PostgreSQL TODO wiki is about. > >3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms so that: >- psql \d can display the user-specified data types. >- pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as >CHAR/VARCHAR. >- To implement additional features for NCHAR/NVARCHAR in the future, as >described below. >-------------------------------------------------- > Agreed. This is our minimum requirement too. Rgds, Arul Shaji > > > >[Optional requirements] >-------------------------------------------------- >1. Implement client driver support, such as: >- NCHAR host variable type (e.g. "NCHAR var_name[12];") in ECPG, as specified >in the SQL standard. >- national character methods (e.g. setNString, getNString, >setNCharacterStream) as specified in JDBC 4.0. 
>I think at first we can treat these national-character-specific features as the >same as CHAR/VARCHAR. > >2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always contain >Unicode data. >I think it is sufficient at first that NCHAR/NVARCHAR columns can only be used >in UTF-8 databases and they store UTF-8 strings. This allows us to reuse the >input/output/send/recv functions and other infrastructure of CHAR/VARCHAR. >This is a reasonable compromise to avoid duplication and minimize the first >implementation of NCHAR support. > >3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >Fixed-width encoding may allow faster string manipulation as described in >Oracle's manual. But I'm not sure about this, because UTF-16 is not a real >fixed-width encoding due to supplementary characters. This would definitely be a welcome addition. >-------------------------------------------------- > > >I don't think it is good to implement NCHAR/NVARCHAR types as extensions like >contrib/citext, because NCHAR/NVARCHAR are basic types and need client-side >support. That is, client drivers need to be aware of the fixed NCHAR/NVARCHAR >OID values. > >How do you think we should implement NCHAR support? > >Regards >MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote: > 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always > contain Unicode data. ... > 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. > Fixed-width encoding may allow faster string manipulation as described in > Oracle's manual. But I'm not sure about this, because UTF-16 is not a real > fixed-width encoding due to supplementary characters. It seems to me that these two points here are the real core of your proposal. The rest is just syntactic sugar. Let me start with the second one: I don't think there's likely to be any benefit in using UTF-16 as the internal encoding. In fact, I think it's likely to make things quite a bit more complicated, because we have a lot of code that assumes that server encodings have certain properties that UTF-16 doesn't - specifically, that any byte with the high-bit clear represents the corresponding ASCII character. As to the first one, if we're going to go to the (substantial) trouble of building infrastructure to allow a database to store data in multiple encodings, why limit it to storing UTF-8 in non-UTF-8 databases? What about storing SHIFT-JIS in UTF-8 databases, or Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other combination you might care to name? Whether we go that way or not, I think storing data in one encoding in a database with a different encoding is going to be pretty tricky and require far-reaching changes. You haven't mentioned any of those issues or discussed how you would solve them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote: >> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always >> contain Unicode data. >> ... >> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >> Fixed-width encoding may allow faster string manipulation as described in >> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real >> fixed-width encoding due to supplementary characters. > It seems to me that these two points here are the real core of your > proposal. The rest is just syntactic sugar. > Let me start with the second one: I don't think there's likely to be > any benefit in using UTF-16 as the internal encoding. In fact, I > think it's likely to make things quite a bit more complicated, because > we have a lot of code that assumes that server encodings have certain > properties that UTF-16 doesn't - specifically, that any byte with the > high-bit clear represents the corresponding ASCII character. Another point to keep in mind is that UTF16 is not really any easier to deal with than UTF8, unless you write code that fails to support characters outside the basic multilingual plane. Which is a restriction I don't believe we'd accept. But without that restriction, you're still forced to deal with variable-width characters; and there's nothing very nice about the way that's done in UTF16. So on the whole I think it makes more sense to use UTF8 for this. I share Robert's misgivings about difficulties in dealing with characters that are not representable in the database's principal encoding. Still, you probably won't find out about many of those until you try it. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Heikki Linnakangas
Date:
On 18.09.2013 16:16, Robert Haas wrote: > On Mon, Sep 16, 2013 at 8:49 AM, MauMau<maumau307@gmail.com> wrote: >> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always >> contain Unicode data. > ... >> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >> Fixed-width encoding may allow faster string manipulation as described in >> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real >> fixed-width encoding due to supplementary characters. > > It seems to me that these two points here are the real core of your > proposal. The rest is just syntactic sugar. > > Let me start with the second one: I don't think there's likely to be > any benefit in using UTF-16 as the internal encoding. In fact, I > think it's likely to make things quite a bit more complicated, because > we have a lot of code that assumes that server encodings have certain > properties that UTF-16 doesn't - specifically, that any byte with the > high-bit clear represents the corresponding ASCII character. > > As to the first one, if we're going to go to the (substantial) trouble > of building infrastructure to allow a database to store data in > multiple encodings, why limit it to storing UTF-8 in non-UTF-8 > databases? What about storing SHIFT-JIS in UTF-8 databases, or > Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other > combination you might care to name? > > Whether we go that way or not, I think storing data in one encoding in > a database with a different encoding is going to be pretty tricky and > require far-reaching changes. You haven't mentioned any of those > issues or discussed how you would solve them. I'm not too thrilled about complicating the system for that, either. If you really need to deal with many different languages, you can do that today by using UTF-8 everywhere. Sure, it might not be the most efficient encoding for some characters, but it works. There is one reason, however, that makes it a lot more compelling: we already support having databases with different encodings in the same cluster, but the encoding used in the shared catalogs, for usernames and database names for example, is not well-defined. If we dealt with different encodings in the same database, that inconsistency would go away. - Heikki
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote: >> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always >> contain Unicode data. > ... >> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >> Fixed-width encoding may allow faster string manipulation as described in >> Oracle's manual. But I'm not sure about this, because UTF-16 is not a >> real >> fixed-width encoding due to supplementary characters. > > It seems to me that these two points here are the real core of your > proposal. The rest is just syntactic sugar. No, those are "desirable if possible" features. What's important is to declare in the manual that PostgreSQL officially supports national character types, as I stated below. > 1. Accept NCHAR/NVARCHAR as data type name and N'...' syntactically. > This is already implemented. PostgreSQL treats NCHAR/NVARCHAR as synonyms > for CHAR/VARCHAR, and ignores N prefix. But this is not documented. > > 2. Declare support for national character support in the manual. > 1 is not sufficient because users don't want to depend on undocumented > behavior. This is exactly what the TODO item "national character support" > in PostgreSQL TODO wiki is about. > > 3. Implement NCHAR/NVARCHAR as distinct data types, not as synonyms so > that: > - psql \d can display the user-specified data types. > - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as > CHAR/VARCHAR. > - To implement additional features for NCHAR/NVARCHAR in the future, as > described below. And when declaring that, we had better implement NCHAR types as distinct types with their own OIDs so that we can extend NCHAR behavior in the future. As the first stage, I think it's okay to treat NCHAR types exactly the same as CHAR/VARCHAR types. For example, in ECPG: switch (type) case OID_FOR_CHAR: case OID_FOR_VARCHAR: case OID_FOR_TEXT: case OID_FOR_NCHAR: /* new code */ case OID_FOR_NVARCHAR: /* new code */some processing;break;And in JDBC, just call methods for non-national character types. Currently, those national character methods throw SQLException. public void setNString(int parameterIndex, String value) throws SQLException {setString(parameterIndex, value); } > Let me start with the second one: I don't think there's likely to be > any benefit in using UTF-16 as the internal encoding. In fact, I > think it's likely to make things quite a bit more complicated, because > we have a lot of code that assumes that server encodings have certain > properties that UTF-16 doesn't - specifically, that any byte with the > high-bit clear represents the corresponding ASCII character. > > As to the first one, if we're going to go to the (substantial) trouble > of building infrastructure to allow a database to store data in > multiple encodings, why limit it to storing UTF-8 in non-UTF-8 > databases? What about storing SHIFT-JIS in UTF-8 databases, or > Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other > combination you might care to name? > > Whether we go that way or not, I think storing data in one encoding in > a database with a different encoding is going to be pretty tricky and > require far-reaching changes. You haven't mentioned any of those > issues or discussed how you would solve them. Yes, you are probably right -- I'm not sure UTF-16 has really benefits that UTF-8 doesn't have. But why did Windows and Java choose UTF-16 for internal strings rather than UTF-8? Why did Oracle recommend UTF-16 for NCHAR? 
I have no clear idea. Anyway, I don't strongly push UTF-16 and complicate the encoding handling. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us> > Another point to keep in mind is that UTF16 is not really any easier > to deal with than UTF8, unless you write code that fails to support > characters outside the basic multilingual plane. Which is a restriction > I don't believe we'd accept. But without that restriction, you're still > forced to deal with variable-width characters; and there's nothing very > nice about the way that's done in UTF16. So on the whole I think it > makes more sense to use UTF8 for this. I feel so. I guess why Windows, Java, and Oracle chose UTF-16 is ... it was UCS-2 only with BMP when they chose it. So character handling was easier and faster thanks to fixed-width encoding. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Wed, Sep 18, 2013 at 6:42 PM, MauMau <maumau307@gmail.com> wrote: >> It seems to me that these two points here are the real core of your >> proposal. The rest is just syntactic sugar. > > No, those are "desirable if possible" features. What's important is to > declare in the manual that PostgreSQL officially supports national character > types, as I stated below. That may be what's important to you, but it's not what's important to me. I am not keen to introduce support for nchar and nvarchar as differently-named types with identical semantics. And I think it's an even worse idea to introduce them now, making them work one way, and then later change the behavior in a backward-incompatible fashion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > That may be what's important to you, but it's not what's important to > me. National character types support may be important to some potential users of PostgreSQL and the popularity of PostgreSQL, not me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria includes national character support. > I am not keen to introduce support for nchar and nvarchar as > differently-named types with identical semantics. Similar examples already exist: - varchar and text: the only difference is the existence of explicit length limit - numeric and decimal - int and int4, smallint and int2, bigint and int8 - real/double precison and float In addition, the SQL standard itself admits: "The <key word>s NATIONAL CHARACTER are used to specify the character type with an implementation- defined character set. Special syntax (N'string') is provided for representing literals in that character set. ... "NATIONAL CHARACTER" is equivalent to the corresponding <character string type> with a specification of "CHARACTER SET CSN", where "CSN" is an implementation-defined <character set name>." "A <national character string literal> is equivalent to a <character string literal> with the "N" replaced by "<introducer><character set specification>", where "<character set specification>" is an implementation- defined <character set name>." > And I think it's an > even worse idea to introduce them now, making them work one way, and > then later change the behavior in a backward-incompatible fashion. I understand your feeling. The concern about incompatibility can be eliminated by thinking the following way. How about this? - NCHAR can be used with any database encoding. - At first, NCHAR is exactly the same as CHAR. That is, "implementation-defined character set" described in the SQL standard is the database character set. - In the future, the character set for NCHAR can be selected at database creation like Oracle's CREATE DATABAWSE .... NATIONAL CHARACTER SET AL16UTF16. The default it the database set. Could you tell me what kind of specification we should implement if we officially support national character types? Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote: >> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always >> contain Unicode data. > ... >> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns. >> Fixed-width encoding may allow faster string manipulation as described in >> Oracle's manual. But I'm not sure about this, because UTF-16 is not a real >> fixed-width encoding due to supplementary characters. > > It seems to me that these two points here are the real core of your > proposal. The rest is just syntactic sugar. > > Let me start with the second one: I don't think there's likely to be > any benefit in using UTF-16 as the internal encoding. In fact, I > think it's likely to make things quite a bit more complicated, because > we have a lot of code that assumes that server encodings have certain > properties that UTF-16 doesn't - specifically, that any byte with the > high-bit clear represents the corresponding ASCII character. Agreed. > As to the first one, if we're going to go to the (substantial) trouble > of building infrastructure to allow a database to store data in > multiple encodings, why limit it to storing UTF-8 in non-UTF-8 > databases? What about storing SHIFT-JIS in UTF-8 databases, or > Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other > combination you might care to name? > > Whether we go that way or not, I think storing data in one encoding in > a database with a different encoding is going to be pretty tricky and > require far-reaching changes. You haven't mentioned any of those > issues or discussed how you would solve them. What about limiting to use NCHAR with a database which has same encoding or "compatible" encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. I suppose what encoding is used for NCHAR should be defined in initdb time or creation of the database (if we allow this, we need to add a new column to know what encoding is used for NCHAR). For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and database encoding is UTF-8 because there is a conversion between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding is ISO-8859-1 because there's no conversion between them. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Valentine Gogichashvili
Date:
Hi,
>> That may be what's important to you, but it's not what's important to
>> me.
>
> National character types support may be important to some potential users of PostgreSQL and the popularity of PostgreSQL, not me. That's why national character support is listed in the PostgreSQL TODO wiki. We might be losing potential users just because their selection criteria includes national character support.
The whole NCHAR thing appeared as a hack for systems that did not have it from the beginning. It would not be needed if all text were magically stored in UNICODE or UTF from the beginning, and the idea of a character were the same as the idea of a rune, not a byte.
PostgreSQL has very powerful facilities for storing any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added?
It would make it possible to have a database, that talks to the clients in UTF8 and stores text and varchar data in the encoding that is the most appropriate for the situation.
It will make it impossible (or complicated) to have a database with a non-UTF8 default encoding (I wonder who would need that in this case), as conversions will not be possible from the broader charsets into the default database encoding.
One could define an additional DATABASE property like LC_ENCODING that would work for the ENCODING property of a column like LC_COLLATE for COLLATE property of a column.
Text operations should work automatically, as in memory all strings will be converted to the database encoding.
This approach will also open a possibility to implement custom ENCODINGs for the column data storage, like snappy compression or even BSON, gobs or protobufs for much more compact type storage.
Regards,
-- Valentine Gogichashvili
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Martijn van Oosterhout
Date:
On Fri, Sep 20, 2013 at 08:58:53AM +0900, Tatsuo Ishii wrote: > For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is > UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is > SHIFT-JIS and database encoding is UTF-8 because there is a conversion > between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is > SHIFT-JIS and database encoding is ISO-8859-1 because there's no > conversion between them. As far as I can tell the whole reason for introducing NCHAR is to support SHIFT-JIS, there hasn't been call for any other encodings, that I can remember anyway. So rather than this whole NCHAR thing, why not just add a type "sjistext", and a few type casts and call it a day... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Thu, Sep 19, 2013 at 6:42 PM, MauMau <maumau307@gmail.com> wrote:
> National character types support may be important to some potential users of
> PostgreSQL and the popularity of PostgreSQL, not me. That's why national
> character support is listed in the PostgreSQL TODO wiki. We might be losing
> potential users just because their selection criteria includes national
> character support.

We'd have to go back and search the archives to figure out why that item was added to the TODO, but I'd be surprised if anyone ever had it in mind to create additional types that behave just like existing types but with different names. I don't think that you'll be able to get consensus around that path on this mailing list.

>> I am not keen to introduce support for nchar and nvarchar as
>> differently-named types with identical semantics.
>
> Similar examples already exist:
>
> - varchar and text: the only difference is the existence of explicit length
> limit
> - numeric and decimal
> - int and int4, smallint and int2, bigint and int8
> - real/double precison and float

I agree that the fact we have both varchar and text feels like a wart. The other examples mostly involve different names for the same underlying type, and so are different from what you are asking for here.

> I understand your feeling. The concern about incompatibility can be
> eliminated by thinking the following way. How about this?
>
> - NCHAR can be used with any database encoding.
>
> - At first, NCHAR is exactly the same as CHAR. That is,
> "implementation-defined character set" described in the SQL standard is the
> database character set.
>
> - In the future, the character set for NCHAR can be selected at database
> creation like Oracle's CREATE DATABAWSE .... NATIONAL CHARACTER SET
> AL16UTF16. The default it the database set.

Hmm. So under that design, a database could support up to a total of two character sets, the one that you get when you say 'foo' and the other one that you get when you say n'foo'. I guess we could do that, but it seems a bit limited. If we're going to go to the trouble of supporting multiple character sets, why not support an arbitrary number instead of just two?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Thu, Sep 19, 2013 at 7:58 PM, Tatsuo Ishii <ishii@postgresql.org> wrote: > What about limiting to use NCHAR with a database which has same > encoding or "compatible" encoding (on which the encoding conversion is > defined)? This way, NCHAR text can be automatically converted from > NCHAR to the database encoding in the server side thus we can treat > NCHAR exactly same as CHAR afterward. I suppose what encoding is used > for NCHAR should be defined in initdb time or creation of the database > (if we allow this, we need to add a new column to know what encoding > is used for NCHAR). > > For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is > UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is > SHIFT-JIS and database encoding is UTF-8 because there is a conversion > between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is > SHIFT-JIS and database encoding is ISO-8859-1 because there's no > conversion between them. I think the point here is that, at least as I understand it, encoding conversion and sanitization happens at a very early stage right now, when we first receive the input from the client. If the user sends a string of bytes as part of a query or bind placeholder that's not valid in the database encoding, it's going to error out before any type-specific code has an opportunity to get control. Look at textin(), for example. There's no encoding check there. That means it's already been done at that point. To make this work, someone's going to have to figure out what to do about *that*. Until we have a sketch of what the design for that looks like, I don't see how we can credibly entertain more specific proposals. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On 9/20/13 2:22 PM, Robert Haas wrote: >>> I am not keen to introduce support for nchar and nvarchar as >>> >> differently-named types with identical semantics. >> > >> > Similar examples already exist: >> > >> > - varchar and text: the only difference is the existence of explicit length >> > limit >> > - numeric and decimal >> > - int and int4, smallint and int2, bigint and int8 >> > - real/double precison and float > I agree that the fact we have both varchar and text feels like a wart. > The other examples mostly involve different names for the same > underlying type, and so are different from what you are asking for > here. Also note that we already have NCHAR [VARYING]. It's mapped to char or varchar, respectively, in the parser, just like int, real, etc. are handled.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Tatsuo Ishii" <ishii@postgresql.org> > What about limiting to use NCHAR with a database which has same > encoding or "compatible" encoding (on which the encoding conversion is > defined)? This way, NCHAR text can be automatically converted from > NCHAR to the database encoding in the server side thus we can treat > NCHAR exactly same as CHAR afterward. I suppose what encoding is used > for NCHAR should be defined in initdb time or creation of the database > (if we allow this, we need to add a new column to know what encoding > is used for NCHAR). > > For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is > UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is > SHIFT-JIS and database encoding is UTF-8 because there is a conversion > between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is > SHIFT-JIS and database encoding is ISO-8859-1 because there's no > conversion between them. Thanks for the idea, it sounds flexible for wider use. Your cooperation would be much appreciated to devise implementation with as little code as possible. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Valentine Gogichashvili" <valgog@gmail.com> > the whole NCHAR appeared as hack for the systems, that did not have it > from > the beginning. It would not be needed, if all the text would be magically > stored in UNICODE or UTF from the beginning and idea of character would be > the same as an idea of a rune and not a byte. I guess so, too. > PostgreSQL has a very powerful possibilities for storing any kind of > encoding. So maybe it makes sense to add the ENCODING as another column > property, the same way a COLLATION was added? Some other people in this community suggested that. ANd the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARASET SET ch. > Text operations should work automatically, as in memory all strings will > be > converted to the database encoding. > > This approach will also open a possibility to implement custom ENCODINGs > for the column data storage, like snappy compression or even BSON, gobs or > protbufs for much more compact type storage. Thanks for your idea that sounds interesting, although I don't understand that well. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Martijn van Oosterhout" <kleptog@svana.org> > As far as I can tell the whole reason for introducing NCHAR is to > support SHIFT-JIS, there hasn't been call for any other encodings, that > I can remember anyway. Could you elaborate on this, giving some info sources? > So rather than this whole NCHAR thing, why not just add a type > "sjistext", and a few type casts and call it a day... The main reason for supporting NCHAR types is to ease migration from other DBMSs, not requiring DDL changes. So sjistext does not match that purpose. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > I don't think that you'll be able to > get consensus around that path on this mailing list. > I agree that the fact we have both varchar and text feels like a wart. Is that right? I don't feel varchar/text case is a wart. I think text was introduced for a positive reason to ease migration from other DBMSs. The manual says: http://www.postgresql.org/docs/current/static/datatype-character.html "Although the type text is not in the SQL standard, several other SQL database management systems have it as well." And isn't EnterpriseDB doing similar things for Oracle compatibility, although I'm not sure about the details? Could you share your idea why we won't get consensus? >> I understand your feeling. The concern about incompatibility can be >> eliminated by thinking the following way. How about this? >> >> - NCHAR can be used with any database encoding. >> >> - At first, NCHAR is exactly the same as CHAR. That is, >> "implementation-defined character set" described in the SQL standard is >> the >> database character set. >> >> - In the future, the character set for NCHAR can be selected at database >> creation like Oracle's CREATE DATABAWSE .... NATIONAL CHARACTER SET >> AL16UTF16. The default it the database set. > > Hmm. So under that design, a database could support up to a total of > two character sets, the one that you get when you say 'foo' and the > other one that you get when you say n'foo'. > > I guess we could do that, but it seems a bit limited. If we're going > to go to the trouble of supporting multiple character sets, why not > support an arbitrary number instead of just two? I agree with you about the arbitrary number. Tatsuo san gave us a good suggestion. Let's consider how to implement that. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > On Thu, Sep 19, 2013 at 7:58 PM, Tatsuo Ishii <ishii@postgresql.org> > wrote: >> What about limiting to use NCHAR with a database which has same >> encoding or "compatible" encoding (on which the encoding conversion is >> defined)? This way, NCHAR text can be automatically converted from >> NCHAR to the database encoding in the server side thus we can treat >> NCHAR exactly same as CHAR afterward. I suppose what encoding is used >> for NCHAR should be defined in initdb time or creation of the database >> (if we allow this, we need to add a new column to know what encoding >> is used for NCHAR). >> >> For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is >> UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is >> SHIFT-JIS and database encoding is UTF-8 because there is a conversion >> between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is >> SHIFT-JIS and database encoding is ISO-8859-1 because there's no >> conversion between them. > > I think the point here is that, at least as I understand it, encoding > conversion and sanitization happens at a very early stage right now, > when we first receive the input from the client. If the user sends a > string of bytes as part of a query or bind placeholder that's not > valid in the database encoding, it's going to error out before any > type-specific code has an opportunity to get control. Look at > textin(), for example. There's no encoding check there. That means > it's already been done at that point. To make this work, someone's > going to have to figure out what to do about *that*. Until we have a > sketch of what the design for that looks like, I don't see how we can > credibly entertain more specific proposals. OK, I see your point. Let's consider that design. I'll learn the code regarding this. Does anybody, especially Tatsuo san, Tom san, Peter san, have any good idea? Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> I think the point here is that, at least as I understand it, encoding
> conversion and sanitization happens at a very early stage right now,
> when we first receive the input from the client. If the user sends a
> string of bytes as part of a query or bind placeholder that's not
> valid in the database encoding, it's going to error out before any
> type-specific code has an opportunity to get control. Look at
> textin(), for example. There's no encoding check there. That means
> it's already been done at that point. To make this work, someone's
> going to have to figure out what to do about *that*. Until we have a
> sketch of what the design for that looks like, I don't see how we can
> credibly entertain more specific proposals.

I don't think the bind placeholder is the problem case. That is processed by exec_bind_message() in postgres.c. It has enough info about the type of the placeholder, and I think we can easily deal with NCHAR. The same can be said for the COPY case.

The problem is an ordinary query (simple protocol "Q" message), as you pointed out. Encoding conversion happens at a very early stage (note that the fast-path case has the same issue). If a query message contains, say, both SHIFT-JIS and EUC-JP, then we are going to run into trouble because the encoding conversion routine (pg_client_to_server) assumes that the message from the client contains only one encoding. However, my question is, does that really happen? Is there any text editor which can create SHIFT-JIS and EUC-JP mixed text? So my guess is, when a user wants to use NCHAR as SHIFT-JIS text, the rest of the query consists of either SHIFT-JIS or plain ASCII. If so, what the user needs to do is set the client encoding to SHIFT-JIS, and everything should be fine.

MauMau, is my guess correct?

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Valentine Gogichashvili
Date:
>> PostgreSQL has a very powerful possibilities for storing any kind of
>> encoding. So maybe it makes sense to add the ENCODING as another column
>> property, the same way a COLLATION was added?
>
> Some other people in this community suggested that. And the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARACTER SET cs.
>
>> Text operations should work automatically, as in memory all strings will be
>> converted to the database encoding.
>>
>> This approach will also open a possibility to implement custom ENCODINGs
>> for the column data storage, like snappy compression or even BSON, gobs or
>> protobufs for much more compact type storage.
>
> Thanks for your idea; that sounds interesting, although I don't understand it that well.
The idea is very simple:
CREATE DATABASE utf8_database ENCODING 'utf8';
\c utf8_database
CREATE TABLE a(
id serial,
ascii_data text ENCODING 'ascii', -- will use ascii_to_utf8 to read and utf8_to_ascii to write
koi8_data text ENCODING 'koi8_r', -- will use koi8_r_to_utf8 to read and utf8_to_koi8_r to write
json_data json ENCODING 'bson' -- will use bson_to_json to read and json_to_bson to write
);
The problem with bson_to_json here is that it will probably not be possible to write JSON in koi8_r, for example. But that case is not even being considered in these discussions so far anyway.
If the ENCODING machinery were given not only the encoding name but also the type OID, it should be possible to write encoders for TYPEs and arrays of TYPEs (I had to do that using casts to bytea and protobuf to minimize the storage size for an array of types when writing a lot of data, which could then be unpacked afterwards directly in the DB as normal database types).
I hope I made my point a little bit clearer.
Regards,
Valentine Gogichashvili
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Tatsuo Ishii" <ishii@postgresql.org> > I don't think the bind placeholder is the case. That is processed by > exec_bind_message() in postgres.c. It has enough info about the type > of the placeholder, and I think we can easily deal with NCHAR. Same > thing can be said to COPY case. Yes, I've learned it. Agreed. If we allow an encoding for NCHAR different from the database encoding, we can convert text from the client encoding to the NCHAR encoding in nchar_in() for example. We can retrieve the NCHAR encoding from pg_database and store it in a global variable at session start. > Problem is an ordinary query (simple protocol "Q" message) as you > pointed out. Encoding conversion happens at a very early stage (note > that fast-path case has the same issue). If a query message contains, > say, SHIFT-JIS and EUC-JP, then we are going into trouble because the > encoding conversion routine (pg_client_to_server) regards that the > message from client contains only one encoding. However my question > is, does it really happen? Because there's any text editor which can > create SHIFT-JIS and EUC-JP mixed text. So my guess is, when user want > to use NCHAR as SHIFT-JIS text, the rest of query consist of either > SHIFT-JIS or plain ASCII. If so, what the user need to do is, set the > client encoding to SJIFT-JIS and everything should be fine. > > Maumau, is my guess correct? Yes, I believe you are right. Regardless of whether we support multiple encodings in one database or not, a single client encoding will be sufficient for one session. When receiving the "Q" message, the whole SQL text is converted from the client encoding to the database encoding. This part needs no modification. During execution of the "Q" message, NCHAR values are converted from the database encoding to the NCHAR encoding. Thank you very much, Tatsuo san. Everybody, is there any other challenge we should consider to support NCHAR/NVARCHAR types as distinct types? Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Fri, Sep 20, 2013 at 8:32 PM, MauMau <maumau307@gmail.com> wrote: >> I don't think that you'll be able to >> get consensus around that path on this mailing list. >> I agree that the fact we have both varchar and text feels like a wart. > > Is that right? I don't feel varchar/text case is a wart. I think text was > introduced for a positive reason to ease migration from other DBMSs. The > manual says: > > http://www.postgresql.org/docs/current/static/datatype-character.html > > "Although the type text is not in the SQL standard, several other SQL > database management systems have it as well." > > And isn't EnterpriseDB doing similar things for Oracle compatibility, > although I'm not sure about the details? Could you share your idea why we > won't get consensus? Sure, it's EnterpriseDB's policy to add features that facilitate migrations from other databases - particularly Oracle - to our product, Advanced Server, even if those features don't otherwise add any value. However, the community is usually reluctant to add such features to PostgreSQL. Also, at least up until now, the existing aliasing of nchar and nchar varying to other data types has been adequate for the needs of our customers, and we've handled a bunch of other type-name incompatibilities with similar tricks. What you are proposing goes off in a different direction from both PostgreSQL and Advanced Server, and that's why I'm skeptical. If you were proposing something that we were doing in Advanced Server with great success, it would be a bit disingenuous of me to argue against doing the same thing in PostgreSQL, but that's not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On 9/23/13 2:53 AM, MauMau wrote: > Yes, I believe you are right. Regardless of whether we support multiple > encodings in one database or not, a single client encoding will be > sufficient for one session. When receiving the "Q" message, the whole > SQL text is converted from the client encoding to the database > encoding. This part needs no modification. During execution of the "Q" > message, NCHAR values are converted from the database encoding to the > NCHAR encoding. That assumes that the conversion client encoding -> server encoding -> NCHAR encoding is not lossy. I thought one main point of this exercise was to avoid these conversions and be able to go straight from client encoding into NCHAR.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > Sure, it's EnterpriseDB's policy to add features that facilitate > migrations from other databases - particularly Oracle - to our > product, Advanced Server, even if those features don't otherwise add > any value. However, the community is usually reluctant to add such > features to PostgreSQL. Also, at least up until now, the existing > aliasing of nchar and nchar varying to other data types has been > adequate for the needs of our customers, and we've handled a bunch of > other type-name incompatibilities with similar tricks. What you are > proposing goes off in a different direction from both PostgreSQL and > Advanced Server, and that's why I'm skeptical. If you were proposing > something that we were doing in Advanced Server with great success, it > would be a bit disingenuous of me to argue against doing the same > thing in PostgreSQL, but that's not the case. Sorry, I didn't mean to imitate EnterpriseDB. My intent is to just increase the popularity of PostgreSQL (or prevent the drop in popularity?). NCHAR is so basic that we can/should accept proper support. Aliasing would be nice to some extent, if its offical support would be documented in PG manual. However, just aliasing loses NCHAR type information through pg_dump. This is contrary to the benefit of pg_dump -- allow migration from PG to other DBMSs, possibly for performance comparison: http://www.postgresql.org/docs/current/static/app-pgdump.html "Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products." In addition, distinct types for NCHAR/NVARCHAR allow future extension such as different encoding for NCHAR and UTF-16. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Peter Eisentraut" <peter_e@gmx.net> > That assumes that the conversion client encoding -> server encoding -> > NCHAR encoding is not lossy. Yes, so Tatsuo san suggested to restrict server encoding <-> NCHAR encoding combination to those with lossless conversion. > I thought one main point of this exercise > was the avoid these conversions and be able to go straight from client > encoding into NCHAR. It's slightly different. Please see the following excerpt: http://www.postgresql.org/message-id/B1A7485194DE4FDAB8FA781AFB570079@maumau "4. I guess some users really want to continue to use ShiftJIS or EUC_JP for database encoding, and use NCHAR for a limited set of columns to store international text in Unicode: - to avoid code conversion between the server and the client for performance - because ShiftJIS and EUC_JP require less amount of storage (2 bytes for most Kanji) than UTF-8 (3 bytes) This use case is described in chapter 6 of "Oracle Database Globalization Support Guide"." Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On Tue, 2013-09-24 at 21:04 +0900, MauMau wrote: > "4. I guess some users really want to continue to use ShiftJIS or EUC_JP for > database encoding, and use NCHAR for a limited set of columns to store > international text in Unicode: > - to avoid code conversion between the server and the client for performance > - because ShiftJIS and EUC_JP require less amount of storage (2 bytes for > most Kanji) than UTF-8 (3 bytes) > This use case is described in chapter 6 of "Oracle Database Globalization > Support Guide"." But your proposal wouldn't address the first point, because data would have to go client -> server -> NCHAR. The second point is valid, but it's going to be an awful amount of work for that limited result.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Peter Eisentraut" <peter_e@gmx.net> > On Tue, 2013-09-24 at 21:04 +0900, MauMau wrote: >> "4. I guess some users really want to continue to use ShiftJIS or EUC_JP >> for >> database encoding, and use NCHAR for a limited set of columns to store >> international text in Unicode: >> - to avoid code conversion between the server and the client for >> performance >> - because ShiftJIS and EUC_JP require less amount of storage (2 bytes for >> most Kanji) than UTF-8 (3 bytes) >> This use case is described in chapter 6 of "Oracle Database Globalization >> Support Guide"." > > But your proposal wouldn't address the first point, because data would > have to go client -> server -> NCHAR. > > The second point is valid, but it's going to be an awful amount of work > for that limited result. I (or, Oracle's use case) meant the following, for example: initdb -E EUC_JP CREATE DATABASE mydb ENCODING EUC_JP NATIONAL ENCODING UTF-8; CREATE TABLE mytable ( col1 char(10), -- EUC_JP text col2 Nchar(10), -- UTF-8 text ); client encoding = EUC_JP That is, 1. Currently, the user is only handling Japanese text. To avoid unnecessary conversion, he uses EUC_JP for both client and server. 2. He needs to store some limited amount of international (non-Japanese) text in a few columns for a new feature of the system. But the international text is limited, so he wants to sacrifice performance and storage cost due to code conversion for most text and more bytes for each character. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Greg Stark
Date:
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Tue, Sep 24, 2013 at 1:04 PM, MauMau <span dir="ltr"><<ahref="mailto:maumau307@gmail.com" target="_blank">maumau307@gmail.com</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Yes, so Tatsuo sansuggested to restrict server encoding <-> NCHAR encoding combination to those with lossless conversion.</blockquote></div><br/></div><div class="gmail_extra">If it's not lossy then what's the point? From the client'spoint of view it'll be functionally equivalent to text then.<br /></div><div class="gmail_extra"><br clear="all"/><br />-- <br />greg<br /></div></div>
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Greg Stark" <stark@mit.edu> > If it's not lossy then what's the point? From the client's point of view > it'll be functionally equivalent to text then. Sorry, what Tatsuo san suggested meant was "same or compatible", not lossy. I quote the relevant part below. This is enough for the use case I mentioned in my previous mail several hours ago (actually, that is what Oracle manual describes...). http://www.postgresql.org/message-id/20130920.085853.1628917054830864151.t-ishii@sraoss.co.jp [Excerpt] ---------------------------------------- What about limiting to use NCHAR with a database which has same encoding or "compatible" encoding (on which the encoding conversion is defined)? This way, NCHAR text can be automatically converted from NCHAR to the database encoding in the server side thus we can treat NCHAR exactly same as CHAR afterward. I suppose what encoding is used for NCHAR should be defined in initdb time or creation of the database (if we allow this, we need to add a new column to know what encoding is used for NCHAR). For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and database encoding is UTF-8 because there is a conversion between UTF-8 and SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding is ISO-8859-1 because there's no conversion between them. ---------------------------------------- Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"Arulappan, Arul Shaji"
Date:
Attached is a patch that implements the first set of changes discussed in this thread originally. They are: (i) Implements NCHAR/NVARCHAR as distinct data types, not as synonyms so that: - psql \d can display the user-specified data types. - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, not as CHAR/VARCHAR. - Groundwork to implement additional features for NCHAR/NVARCHAR in the future (For eg: separate encoding for nchar columns). (ii) Support for NCHAR/NVARCHAR in ECPG (iii) Documentation changes to reflect the new data type Rgds, Arul Shaji >-----Original Message----- >From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers- >owner@postgresql.org] On Behalf Of MauMau > >From: "Greg Stark" <stark@mit.edu> >> If it's not lossy then what's the point? From the client's point of >> view it'll be functionally equivalent to text then. > >Sorry, what Tatsuo san suggested meant was "same or compatible", not lossy. >I quote the relevant part below. This is enough for the use case I mentioned >in my previous mail several hours ago (actually, that is what Oracle manual >describes...). > >http://www.postgresql.org/message-id/20130920.085853.162891705483086415 1.t- >ishii@sraoss.co.jp > >[Excerpt] >---------------------------------------- >What about limiting to use NCHAR with a database which has same encoding or >"compatible" encoding (on which the encoding conversion is defined)? This way, >NCHAR text can be automatically converted from NCHAR to the database encoding >in the server side thus we can treat NCHAR exactly same as CHAR afterward. I >suppose what encoding is used for NCHAR should be defined in initdb time or >creation of the database (if we allow this, we need to add a new column to know >what encoding is used for NCHAR). > >For example, "CREATE TABLE t1(t NCHAR(10))" will succeed if NCHAR is >UTF-8 and database encoding is UTF-8. Even succeed if NCHAR is SHIFT-JIS and >database encoding is UTF-8 because there is a conversion between UTF-8 and >SHIFT-JIS. However will not succeed if NCHAR is SHIFT-JIS and database encoding >is ISO-8859-1 because there's no conversion between them. >---------------------------------------- > > >Regards >MauMau > > > >-- >Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make >changes to your subscription: >http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Albe Laurenz
Date:
Arul Shaji Arulappan wrote: > Attached is a patch that implements the first set of changes discussed > in this thread originally. They are: > > (i) Implements NCHAR/NVARCHAR as distinct data types, not as synonyms so > that: > - psql \d can display the user-specified data types. > - pg_dump/pg_dumpall can output NCHAR/NVARCHAR columns as-is, > not as CHAR/VARCHAR. > - Groundwork to implement additional features for NCHAR/NVARCHAR > in the future (For eg: separate encoding for nchar columns). > (ii) Support for NCHAR/NVARCHAR in ECPG > (iii) Documentation changes to reflect the new data type If I understood the discussion correctly the use case is that there are advantages to having a database encoding different from UTF-8, but you'd still want some UTF-8 columns. Wouldn't it be a better design to allow specifying the encoding per column? That would give you more flexibility. I know that NCHAR/NVARCHAR is SQL standard, but I still think that it is a wart. Yours, Laurenz Albe
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Albe Laurenz" <laurenz.albe@wien.gv.at> > If I understood the discussion correctly the use case is that > there are advantages to having a database encoding different > from UTF-8, but you'd still want sume UTF-8 columns. > > Wouldn't it be a better design to allow specifying the encoding > per column? That would give you more flexibility. Yes, you are right. In the previous discussion: - That would be nice if available, but it is hard to implement multiple encodings in one database. - Some people (I'm not sure many or few) are NCHAR/NVARCHAR in other DBMSs. To invite them to PostgreSQL, it's important to support national character feature syntactically and document it in the manual. This is the first step. - As the second step, we can implement multiple encodings in one database. According to the SQL standard, "NCHAR(n)" is equivalent to "CHAR(n) CHARACTER SET cs", where cs is an implementation-defined character set. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Albe Laurenz
Date:
MauMau wrote: > From: "Albe Laurenz" <laurenz.albe@wien.gv.at> >> If I understood the discussion correctly the use case is that >> there are advantages to having a database encoding different >> from UTF-8, but you'd still want some UTF-8 columns. >> >> Wouldn't it be a better design to allow specifying the encoding >> per column? That would give you more flexibility. > > Yes, you are right. In the previous discussion: > > - That would be nice if available, but it is hard to implement multiple > encodings in one database. Granted. > - Some people (I'm not sure whether many or few) are using NCHAR/NVARCHAR in other DBMSs. > To invite them to PostgreSQL, it's important to support the national character > feature syntactically and document it in the manual. This is the first > step. I looked into the Standard, and it does not have NVARCHAR. The type is called NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING or NCHAR VARYING. I guess that the goal of this patch is to support Oracle syntax. But anybody trying to port CREATE TABLE statements from Oracle is already exposed to enough incompatibilities that the difference between NVARCHAR and NCHAR VARYING will not be the reason to reject PostgreSQL. In other words, I doubt that introducing the nonstandard NVARCHAR will have more benefits than drawbacks (new reserved word). Regarding the Standard compliant names of these data types, PostgreSQL already supports those. Maybe some documentation would help. > - As the second step, we can implement multiple encodings in one database. > According to the SQL standard, "NCHAR(n)" is equivalent to "CHAR(n) > CHARACTER SET cs", where cs is an implementation-defined character set. That second step would definitely have benefits. But I don't think that this requires the first step that your patch implements, it is in fact orthogonal. I don't think that there is any need to change NCHAR even if we get per-column encoding, it is just syntactic sugar to support SQL Feature F421. Why not tackle the second step first? Yours, Laurenz Albe
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Albe Laurenz" <laurenz.albe@wien.gv.at> > I looked into the Standard, and it does not have NVARCHAR. > The type is called NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING > or NCHAR VARYING. OUch, that's just a mistake in my mail. You are correct. > > I guess that the goal of this patch is to support Oracle syntax. > But anybody trying to port CREATE TABLE statements from Oracle > is already exposed to enough incompatibilities that the difference between > NVARCHAR and NCHAR VARYING will not be the reason to reject PostgreSQL. > In other words, I doubt that introducing the nonstandard NVARCHAR > will have more benefits than drawbacks (new reserved word). Agreed. But I'm in favor of supporting other DBMS's syntax if it doesn't complicate the spec or implementation too much, because it can help migrate to PostgreSQL. I understand PostgreSQL has made such efforts like PL/pgSQL which is similar to PL/SQL, text data type, AS in SELECT statement, etc. > But I don't think that this requires the first step that your patch > implements, it is in fact orthogonal. (It's not "my" patch.) > Regarding the Standard compliant names of these data types, PostgreSQL > already supports those. Maybe some documentation would help. > > I don't think that there is any need to change NCHAR even if we > get per-column encoding, it is just syntactic sugar to support > SQL Feature F421. Maybe so. I guess the distinct type for NCHAR is for future extension and user friendliness. As one user, I expect to get "national character" instead of "char character set xxx" as output of psql \d and pg_dump when I specified "national character" in DDL. In addition, that makes it easy to use the pg_dump output for importing data to other DBMSs for some reason. Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote: > Implements NCHAR/NVARCHAR as distinct data types, not as synonyms If, per SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET "cs", then for some "cs", NCHAR(x) must be the same as CHAR(x). Therefore, an implementation as separate data types is wrong.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Robert Haas
Date:
On Tue, Nov 5, 2013 at 5:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote: >> Implements NCHAR/NVARCHAR as distinct data types, not as synonyms > > If, per SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET > "cs", then for some "cs", NCHAR(x) must be the same as CHAR(x). > Therefore, an implementation as separate data types is wrong. Interesting. Since the point doesn't seem to be getting through, let me try to be more clear: we're not going to accept any form of this patch. A patch that makes some progress toward actually coping with multiple encodings in the same database would be very much worth considering, but adding compatible syntax with incompatible semantics is not of interest to the PostgreSQL project. We have had this debate on many other topics in the past and will no doubt have it again in the future, but the outcome is always the same. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Robert Haas" <robertmhaas@gmail.com> > On Tue, Nov 5, 2013 at 5:15 PM, Peter Eisentraut <peter_e@gmx.net> wrote: >> On 11/5/13, 1:04 AM, Arulappan, Arul Shaji wrote: >>> Implements NCHAR/NVARCHAR as distinct data types, not as synonyms >> >> If, per SQL standard, NCHAR(x) is equivalent to CHAR(x) CHARACTER SET >> "cs", then for some "cs", NCHAR(x) must be the same as CHAR(x). >> Therefore, an implementation as separate data types is wrong. > > Since the point doesn't seem to be getting through, let me try to be > more clear: we're not going to accept any form of this patch. A patch > that makes some progress toward actually coping with multiple > encodings in the same database would be very much worth considering, > but adding compatible syntax with incompatible semantics is not of > interest to the PostgreSQL project. We have had this debate on many > other topics in the past and will no doubt have it again in the > future, but the outcome is always the same. It doesn't seem that there is any semantics incompatible with the SQL standard as follows: - In the first step, "cs" is the database encoding, which is used for char/varchar/text. - In the second (or final) step, where multiple encodings per database is supported, "cs" is the national character encoding which is specified with CREATE DATABASE ... NATIONAL CHARACTER ENCODING cs. If NATIONAL CHARACTER ENCODING clause is omitted, "cs" is the database encoding as step 1. Let me repeat myself: I think the biggest and immediate issue is that PostgreSQL does not support national character types at least officially. "Officially" means the description in the manual. So I don't have strong objection against the current (hidden) implementation of nchar types in PostgreSQL which are just synonyms, as long as the official support is documented. Serious users don't want to depend on hidden features. However, doesn't the current synonym approach have any problems? Wouldn't it produce any trouble in the future? If we treat nchar as char, we lose the fact that the user requested nchar. Can we lose the fact so easily and produce irreversible result as below? -------------------------------------------------- Maybe so. I guess the distinct type for NCHAR is for future extension and user friendliness. As one user, I expect to get "national character" instead of "char character set xxx" as output of psql \d and pg_dump when I specified "national character" in DDL. In addition, that makes it easy to use the pg_dump output for importing data to other DBMSs for some reason. -------------------------------------------------- Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Albe Laurenz
Date:
MauMau wrote: > Let me repeat myself: I think the biggest and immediate issue is that > PostgreSQL does not support national character types at least officially. > "Officially" means the description in the manual. So I don't have strong > objection against the current (hidden) implementation of nchar types in > PostgreSQL which are just synonyms, as long as the official support is > documented. Serious users don't want to depend on hidden features. I agree with you there. Actually it is somewhat documented in http://www.postgresql.org/docs/9.3/static/features-sql-standard.html as "F421", but that requires that you read the SQL standard. > However, doesn't the current synonym approach have any problems? Wouldn't > it produce any trouble in the future? If we treat nchar as char, we lose > the fact that the user requested nchar. Can we lose the fact so easily and > produce irreversible result as below? I don't think that it is a problem. According to the SQL standard, the user requested a CHAR or VARCHAR with an encoding of the choice of the DBMS. PostgreSQL chooses the database encoding. In a way, it is similar to using the "data type" serial. The column will be displayed as "integer", and the information that it was a serial can only be inferred from the DEFAULT value. It seems that this is working fine and does not cause many problems, so I don't see why things should be different here. Again, for serial the behaviour is well documented, so that seconds your request for more documentation. Would you like to write a patch for that? Yours, Laurenz Albe
Re: UTF8 national character data type support WIP patch and list of open issues.
From
"MauMau"
Date:
From: "Albe Laurenz" <laurenz.albe@wien.gv.at> In a way, it is similar to using the "data type" serial. The column will be displayed as "integer", and the information that it was a serial can only be inferred from the DEFAULT value. It seems that this is working fine and does not cause many problems, so I don't see why things should be different here. Yes, I agree with you in that serial being a synonym is almost no problem. But that's because serial is not an SQL-standard data type but a type unique to PostgreSQL. On the other hand, nchar is an established data type in the SQL standard. I think most people will expect to get "nchar" as output from psql \d and pg_dump as they specified in DDL. If they get "char" as output for "nchar" columns from pg_dump, wouldn't they get in trouble if they want to import schema/data from PostgreSQL to other database products? The documentation for pg_dump says that pg_dump pays attention to easing migrating to other DBMSs. I like this idea and want to respect this. http://www.postgresql.org/docs/current/static/app-pgdump.html -------------------------------------------------- Script files can be used to reconstruct the database even on other machines and other architectures; with some modifications, even on other SQL database products. ... --use-set-session-authorization Output SQL-standard SET SESSION AUTHORIZATION commands instead of ALTER OWNER commands to determine object ownership. This makes the dump more standards-compatible, ... -------------------------------------------------- Regards MauMau
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
"MauMau" <maumau307@gmail.com> writes: > On the other hand, nchar is an established data type in the SQL standard. I > think most people will expect to get "nchar" as output from psql \d and > pg_dump as they specified in DDL. This argument seems awfully weak. You've been able to say create table nt (nf national character varying(22)); in Postgres since around 1997, but I don't recall one single bug report about how that is displayed as just "character varying(22)". The other big problem with this line of argument is that you're trying to claim better spec compliance for what is at best a rather narrow interpretation with really minimal added functionality. (In fact, until you have a solution for the problem that incoming and outgoing data must be in the database's primary encoding, you don't actually have *any* added functionality, just syntactic sugar that does nothing useful.) Unless you can demonstrate by lawyerly reading of the spec that the spec requires exactly the behavior this patch implements, you do not have a leg to stand on here. But you can't demonstrate that, because it doesn't. I'd be much more impressed by seeing a road map for how we get to a useful amount of added functionality --- which, to my mind, would be the ability to support N different encodings in one database, for N>2. But even if you think N=2 is sufficient, we haven't got a road map, and commandeering spec-mandated syntax for an inadequate feature doesn't seem like a good first step. It'll just make our backwards-compatibility problems even worse when somebody does come up with a real solution. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> I'd be much more impressed by seeing a road map for how we get to a > useful amount of added functionality --- which, to my mind, would be > the ability to support N different encodings in one database, for N>2. > But even if you think N=2 is sufficient, we haven't got a road map, and > commandeering spec-mandated syntax for an inadequate feature doesn't seem > like a good first step. It'll just make our backwards-compatibility > problems even worse when somebody does come up with a real solution.

I have been thinking about this for years, and I think the key idea here is implementing a "universal encoding". The universal encoding should have the following characteristics to support N>2 encodings in a database:

1) no loss in round-trip encoding conversion
2) no mapping table is necessary to convert from/to existing encodings

Once we implement the universal encoding, other problems such as the "pg_database with multiple encodings" problem can be solved easily. Currently there is no such universal encoding in the universe; I think the only way is to invent it ourselves. At this point the design of the encoding I have in mind is:

1) A 1-byte encoding identifier + a 7-byte body (8 bytes in total). The encoding identifier's value is between 0x80 and 0xff and is assigned to an existing encoding such as UTF-8, ASCII, EUC-JP and so on. The encodings should be limited to "database safe" encodings. The encoding body is the raw character bytes represented in the existing encoding. This form is called a "word".

2) We also have a "multibyte" representation of the universal encoding. The first byte represents the length of the multibyte character (similar to the first byte of UTF-8). The second byte is the encoding identifier explained above. The rest of the character is the same as above.

#1 and #2 are logically the same and can be converted to each other, and we can use either of them whenever we like. Form #1 is easy to handle because each word has a fixed length (8 bytes), so it would probably be used for temporary data in memory. The second form saves space and would be used in the stored data itself.

If we want to have a table encoded in an encoding different from the database encoding, the table is encoded in the universal encoding. pg_class should remember that fact to avoid confusion about what encoding a table is using. I think the majority of tables in a database use the same encoding as the database encoding; only a few tables want to have a different encoding. The design pushes the penalty onto those minorities. If we need to join two tables which have different encodings, we need to convert them into the same encoding (this should succeed if the encodings are "compatible"). If that fails, the join will fail too. We could extend the technique above to a design which allows each column to have a different encoding.

-- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
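To make the two representations easier to see, here is a toy C sketch of the proposal (illustration only -- nothing like this exists in PostgreSQL, the identifier values are made up, and the choice to count the two header bytes in the length byte is just one possible convention):

#include <stdint.h>
#include <string.h>

/*
 * Form #1: a fixed 8-byte "word" = 1 identifier byte (0x80-0xff, naming the
 * original encoding) + up to 7 raw bytes of the character in that encoding.
 * Form #2: a variable-length form = 1 length byte + 1 identifier byte + the
 * raw bytes, which saves space when stored in the data itself.
 */
typedef struct UniversalWord
{
    uint8_t  id;        /* source-encoding identifier, 0x80..0xff */
    uint8_t  body[7];   /* raw character bytes in that encoding */
} UniversalWord;

/* Convert a fixed-length word into the variable-length form; returns bytes written. */
static int
word_to_multibyte(const UniversalWord *w, int body_len, uint8_t *out)
{
    out[0] = (uint8_t) (2 + body_len);  /* total length, including the two header bytes */
    out[1] = w->id;
    memcpy(out + 2, w->body, body_len);
    return 2 + body_len;
}

/*
 * Recover the original-encoding bytes by skipping the two header bytes --
 * the cheap step that would let existing per-encoding text routines be
 * reused unchanged.
 */
static const uint8_t *
multibyte_to_raw(const uint8_t *mb, int *raw_len)
{
    *raw_len = mb[0] - 2;
    return mb + 2;
}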
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Peter Eisentraut
Date:
On 11/12/13, 1:57 AM, Tatsuo Ishii wrote: > Currently there's no such an universal encoding in the universe, I > think the only way is, inventing it by ourselves. I think ISO 2022 is something in that direction, but it's not ASCII-safe, AFAICT.
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Martijn van Oosterhout
Date:
On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote: > I have been thinking about this for years and I think the key idea for > this is, implementing "universal encoding". The universal encoding > should have following characteristics to implement N>2 encoding in a > database. > > 1) no loss of round trip encoding conversion > > 2) no mapping table is necessary to convert from/to existing encodings > > Once we implement the universal encoding, other problem such as > "pg_database with multiple encoding problem" can be solved easily. Isn't this essentially what the MULE internal encoding is? > Currently there's no such an universal encoding in the universe, I > think the only way is, inventing it by ourselves. This sounds like a terrible idea. In the future people are only going to want more advanced text functions, regular expressions, indexing and making encodings that don't exist anywhere else seems like a way to make a lot of work for little benefit. A better idea seems to me is to (if postgres is configured properly) embed the non-round-trippable characters in the custom character part of the unicode character set. In other words, adjust the mappings tables on demand and voila. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
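A toy sketch of the on-demand mapping Martijn suggests (illustration only, not PostgreSQL code; the table layout and code-point range are just one possible choice): characters with no Unicode mapping get assigned a code point in the Private Use Area the first time they are seen, and the assignment is remembered so the original bytes can be reproduced on output.

#include <stdint.h>

#define PUA_FIRST 0xE000
#define PUA_LAST  0xF8FF

typedef struct PuaEntry
{
    uint32_t src_char;        /* unmappable character, as a value in the source encoding */
    uint32_t pua_codepoint;   /* Private Use Area code point handed out for it */
} PuaEntry;

static PuaEntry pua_table[PUA_LAST - PUA_FIRST + 1];
static int      pua_used = 0;

/*
 * Return the PUA code point for an unmappable source character, allocating
 * a fresh one the first time it is seen; returns 0 if the PUA is exhausted.
 */
static uint32_t
map_to_pua(uint32_t src_char)
{
    int     i;

    for (i = 0; i < pua_used; i++)
        if (pua_table[i].src_char == src_char)
            return pua_table[i].pua_codepoint;

    if (PUA_FIRST + pua_used > PUA_LAST)
        return 0;

    pua_table[pua_used].src_char = src_char;
    pua_table[pua_used].pua_codepoint = (uint32_t) (PUA_FIRST + pua_used);
    return pua_table[pua_used++].pua_codepoint;
}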
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes: > On Tue, Nov 12, 2013 at 03:57:52PM +0900, Tatsuo Ishii wrote: >> Once we implement the universal encoding, other problem such as >> "pg_database with multiple encoding problem" can be solved easily. > Isn't this essentially what the MULE internal encoding is? MULE is completely evil. It has N different encodings for the same character, not to mention no support code available. >> Currently there's no such an universal encoding in the universe, I >> think the only way is, inventing it by ourselves. > This sounds like a terrible idea. In the future people are only going > to want more advanced text functions, regular expressions, indexing and > making encodings that don't exist anywhere else seems like a way to > make a lot of work for little benefit. Agreed. > A better idea seems to me is to (if postgres is configured properly) > embed the non-round-trippable characters in the custom character part > of the unicode character set. In other words, adjust the mappings > tables on demand and voila. From the standpoint of what will happen with existing library code (like strcoll), I'm not sure it's all that easy. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> Isn't this essentially what the MULE internal encoding is? No. MULE is not powerful enough and is overly complicated for dealing with different encodings (character sets). >> Currently there's no such an universal encoding in the universe, I >> think the only way is, inventing it by ourselves. > > This sounds like a terrible idea. In the future people are only going > to want more advanced text functions, regular expressions, indexing and > making encodings that don't exist anywhere else seems like a way to > make a lot of work for little benefit. That is probably a misunderstanding. We don't need to modify existing text handling modules such as text functions, regular expressions, indexing etc. We just convert from the "universal" encoding X to the original encoding before calling them. The process is pretty easy and fast because it just requires skipping the "encoding identifier" and "encoding length" parts. Basically the encoding X should be used in the lower layers of PostgreSQL, and higher layer modules such as those living in src/backend/utils/adt should not be aware of it. > A better idea seems to me is to (if postgres is configured properly) > embed the non-round-trippable characters in the custom character part > of the unicode character set. In other words, adjust the mappings > tables on demand and voila. Using Unicode requires overhead for encoding conversion because it needs to look up mapping tables. That would be a huge handicap for large data, and that is what I want to avoid in the first place. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> MULE is completely evil. > It has N different encodings for the same > character, What's wrong with that? It aims that in the first place. > not to mention no support code available. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes: >> MULE is completely evil. >> It has N different encodings for the same character, > What's wrong with that? It aims that in the first place. It greatly complicates comparisons --- at least, if you'd like to preserve the principle that strings that appear the same are equal. regards, tom lane
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> writes: >>> MULE is completely evil. >>> It has N different encodings for the same character, > >> What's wrong with that? It aims that in the first place. > > It greatly complicates comparisons --- at least, if you'd like to preserve > the principle that strings that appear the same are equal. You don't need to consider that, because there's no place in PostgreSQL where a MULE encoded text consists of multiple encodings as far as I know. BTW, the same character being assigned different code points is pretty common in many character sets (Unicode, for example). -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes: > BTW, the same character being assigned different code points is pretty common > in many character sets (Unicode, for example). This is widely considered a security bug; read section 10 in RFC 3629 (the definition of UTF8), and search the CVE database a bit if you still doubt it's a threat. I'm going to push back very hard on any suggestion that Postgres should build itself around a text representation with that kind of weakness designed in. regards, tom lane [1] http://tools.ietf.org/html/rfc3629#section-10
Re: UTF8 national character data type support WIP patch and list of open issues.
From
Chapman Flack
Date:
Hi, Although this is a ten-year-old message, it was the one I found quickly when looking to see what the current state of play on this might be. On 2013-09-20 14:22, Robert Haas wrote: > Hmm. So under that design, a database could support up to a total of > two character sets, the one that you get when you say 'foo' and the > other one that you get when you say n'foo'. > > I guess we could do that, but it seems a bit limited. If we're going > to go to the trouble of supporting multiple character sets, why not > support an arbitrary number instead of just two? Because that old thread came to an end without mentioning how the standard approaches that, it seemed worth adding, just to complete the record. In the draft of the standard I'm looking at (which is also around a decade old), n'foo' is nothing but a handy shorthand for _csname'foo' (which is a syntax we do not accept) for some particular csname that was chosen when setting up the db. So really, the standard contemplates letting you have columns of arbitrary different charsets (CHAR(x) CHARACTER SET csname), and literals of arbitrary charsets _csname'foo'. Then, as a bit of sugar, you get to pick which two of those charsets you'd like to have easy shorter ways of writing, 'foo' or n'foo', CHAR or NCHAR. The grammar for csname is kind of funky. It can be nothing but <SQL language identifier>, which has the nice restricted form /[A-Za-z][A-Za-z0-9_]*/. But it can also be schema-qualified, with the schema of course being a full-fledged <identifier>. So yeah, to fully meet this part of the standard, the parser'd have to know that _U&"I am a schema nameZ0021" UESCAPE 'Z'/*hi!*/.LATIN1'foo' is a string literal, expressing foo, in a character set named LATIN1, in some cutely-named schema. Never a dull moment. Regards, -Chap