Re: Proposal - Support for National Characters functionality - Mailing list pgsql-hackers
From: Boguk, Maksym
Subject: Re: Proposal - Support for National Characters functionality
Date:
Msg-id: A756FAD7EDC2E24F8CAB7E2F3B5375E918B12BC0@FALEX03.au.fjanz.com
In response to: Re: Proposal - Support for National Characters functionality (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Proposal - Support for National Characters functionality
List: pgsql-hackers
Hi everyone,

I will try to answer all the questions related to the proposed National Characters support.

>> 2) Provide support for the new GUC nchar_collation to provide the
>> database with information about the default collation that needs to be
>> used for the new data types.

> A GUC seems like completely the wrong tack to be taking. In the first
> place, that would mandate just one value (at a time anyway) of
> collation, which is surely not much of an advance over what's already
> possible. In the second place, what happens if you change the value?
> All your indexes on nchar columns are corrupt, that's what. Actually
> the data itself would be corrupt, if you intend that this setting
> determines the encoding and not just the collation. If you really are
> speaking only of collation, it's not clear to me exactly what this
> proposal offers that can't be achieved today (with greater security,
> functionality and spec compliance) by using COLLATE clauses on plain
> text columns.
> Actually, you really haven't answered at all what it is you want to do
> that COLLATE can't do.

I think I gave a wrong description there... it will not be a GUC but a GUC-like value which is initialized during CREATE DATABASE and is read-only afterwards, very similar to lc_collate. So I think the name national_lc_collate would be better. The function of this value is to provide the database with information about the default collation for NATIONAL CHARACTER data. That does not limit the user's ability to choose an alternative collation for NATIONAL CHARACTER columns at CREATE TABLE time via the COLLATE keyword. After all, if we have a second encoding inside the database, we have to record somewhere which collation it uses.
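To make the intended usage concrete, here is a rough sketch; the NATIONAL_LC_COLLATE option and the behaviour shown are proposed syntax only, nothing like this exists today:

-- Proposed (hypothetical) syntax: the national collation is fixed once,
-- at CREATE DATABASE time, and is read-only afterwards, like LC_COLLATE.
CREATE DATABASE legacydb
    ENCODING = 'LATIN1'
    NATIONAL_LC_COLLATE = 'en_US.utf8';

-- NCHAR columns pick up the database-wide national collation by default...
CREATE TABLE t1 (title NATIONAL CHARACTER VARYING(100));

-- ...but a per-column override via the COLLATE keyword stays possible.
CREATE TABLE t2 (title NATIONAL CHARACTER VARYING(100)
                 COLLATE "de_DE.utf8");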
>> 4) Because all symbols from non-UTF8 encodings could be represented as
>> UTF8 (but the reverse is not true), comparison between N* types and the
>> regular string types inside the database will be performed in UTF8 form.

> I believe that in some Far Eastern character sets there are some
> characters that map to the same Unicode glyph, but that some people
> would prefer to keep separate. So transcoding to UTF8 isn't necessarily
> lossless. This is one of the reasons why we've resisted adopting ICU or
> standardizing on UTF8 as the One True Database Encoding.
> Now this may or may not matter for comparison to strings that were in
> some other encoding to start with --- but as soon as you base your
> design on the premise that UTF8 is a universal encoding, you are
> sliding down a slippery slope to a design that will meet resistance.

Would converting both sides to pg_wchar before comparison fix this problem? In any case, if the database is going to use more than one encoding, some universal representation has to be used to allow comparison between them. After some analysis I think pg_wchar is a better candidate for that role than UTF8.

>> 6) Client input/output of NATIONAL strings - NATIONAL strings will
>> respect the client_encoding setting, and their values will be
>> transparently converted to the requested client_encoding before
>> sending (receiving) to the client (the same mechanics as used for the
>> usual string types).
>> So no mixed encoding in client input/output will be supported/available.

> If you have this restriction, then I'm really failing to see what
> benefit there is over what can be done today with COLLATE.

There are two targets for this project:

1. A legacy database with a non-UTF8 encoding which should support both old non-UTF8 applications and new UTF8 applications. In that case the old applications keep using the legacy database encoding (and, because these applications are legacy, they don't work with the new NATIONAL CHARACTER data/tables), while the new applications use client-side UTF8 encoding and are able to store international text in NATIONAL CHARACTER columns. A dump/restore of the whole database to change its encoding to UTF8 is not always possible, so an easy-to-use workaround is needed (see the sketch after my signature).

2. Better compatibility with the ANSI SQL standard.

Kind Regards,
Maksym
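PS: To illustrate target 1, continuing the hypothetical legacydb sketch from above (the table and the session shown are equally made up):

-- A table used only by the new applications; international text goes
-- into the NCHAR column even though the database encoding is LATIN1.
CREATE TABLE docs (id int, title NATIONAL CHARACTER VARYING(200));

-- Old applications keep client_encoding = 'LATIN1' and never touch docs.
-- New applications switch their sessions to UTF8 and can store any text,
-- converted transparently on input/output like ordinary string types:
SET client_encoding = 'UTF8';
INSERT INTO docs VALUES (1, '日本語のタイトル');
SELECT title FROM docs;   -- returned to the client in UTF8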