Thread: encoding names v2.

encoding names v2.

From
Karel Zak
Date:
 Hi,

 almost everything is the same as in the last version of this patch. Here are
the new changes:

 - aliases cyrillic, cp819, ibm819, isoir100x, l1-4 are removed
 - KOI8 is KOI8-R in *all* functions, maps, etc.
 - WIN is windows-1251 (WIN1251)   --- // ---
 - ALT is ALT :-)
 - UNICODE is utf-8
 - PG_ prefix is used for SQL_ASCII and all the others
 - fixed bug with --enable-unicode-conversion

 - getdatabaseencoding() is compatible with old versions, but
   is marked as deprecated in a code comment.

 - getdbencoding() is a new function that returns the correct encoding names

   test2=# select getdatabaseencoding(), getdbencoding();
   getdatabaseencoding | getdbencoding
   ---------------------+---------------
   LATIN2              | ISO-8859-2
   (1 row)

 - pg_encoding_to_char() and the other routines return the new names! Only
   getdatabaseencoding() keeps backward compatibility, which is needed for
   JDBC.

 - all encoding names use '-'. I hope we will never see a problem with
   it and some operator. Encoding names must be written as quoted strings.

   Only SQL_ASCII uses '_', because I see that JDBC has hardcoded
   "pg_encoding_to_char(1) = 'SQL_ASCII'" :-(((

 - the ./configure.in:
     * uses the new encoding names for --enable-multibyte too
     * defines MULTIBYTE, which holds the default encoding id
     * defines MULTIBYTE_NAME, which holds the default encoding name (needed
       for initdb)

   Note: the old code used the same names for macros and for encoding names,
         but now Makefile.global contains:

   MULTIBYTE         = PG_KOI8R        /* id */
   MULTIBYTE_NAME    = "KOI8-R"        /* name */

 - the backend's createdb() function now checks that the encoding is valid
   for the BE (there was a bug here)

 - 'initdb' checks whether the default template encoding is valid for a
   backend DB.

    In the old code this was hardcoded in initdb. I added a '-b' option to
    pg_encoding that checks whether the encoding is valid for a backend DB
    (meaning the encoding is not client-only). It's better than
    if [ $MULTIBYTEID -gt 31 ]
                          ^^^^^^
    in scripts.

    For example (Big5 is a client-only encoding):

   $ pg_encoding Big5
   16
   $ pg_encoding -b Big5
   $

 - initdb uses MULTIBYTE_NAME and "pg_encoding -b"

 - ODBC works with old and new names for Shift_JIS and Big5

 - the patch doesn't contain docs about encoding names... later :-)
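
 The "pg_encoding -b" behaviour described above can be sketched roughly as
 follows. This is a Python illustration only, not the C code from the patch.
 The table layout, the function signature, and every id except Big5's (16,
 taken from the example above) are assumptions made for the sketch.

```python
# Hypothetical encoding table: name -> (id, usable as a backend/database encoding).
# Only Big5's id (16) and its client-only status come from the message above;
# the other entries and ids are made up for illustration.
ENCODINGS = {
    "SQL_ASCII": (0, True),
    "KOI8-R": (5, True),
    "Big5": (16, False),  # client-only encoding
}


def pg_encoding(name, backend_only=False):
    """Return the encoding id as a string, or "" when the name is unknown
    or, with backend_only=True (like the -b option), not valid for a database."""
    entry = ENCODINGS.get(name)
    if entry is None:
        return ""
    enc_id, is_backend = entry
    if backend_only and not is_backend:
        return ""
    return str(enc_id)


print(pg_encoding("Big5"))                           # 16
print(repr(pg_encoding("Big5", backend_only=True)))  # ''
```

 With a check like this, a script only has to test for empty output instead
 of comparing the numeric id against a hardcoded threshold.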


 Note for CVS commit:

   the following files are renamed:

src/utils/mb/Unicode/KOI8_to_utf8.map  --> src/utils/mb/Unicode/KOI8R_to_utf8.map
src/utils/mb/Unicode/WIN_to_utf8.map  --> src/utils/mb/Unicode/WIN1251_to_utf8.map
src/utils/mb/Unicode/utf8_to_KOI8.map --> src/utils/mb/Unicode/utf8_to_KOI8R.map
src/utils/mb/Unicode/utf8_to_WIN.map --> src/utils/mb/Unicode/utf8_to_WIN1251.map

   new file:

src/utils/mb/encname.c

    removed file:

src/utils/mb/common.c


  The patch doesn't contain the large configure script, only configure.in.
Please run autoconf before "cvs commit"!


 Thanks for all the suggestions.

 New comments?

            Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Attachment

Re: encoding names v2.

From
Barry Lind
Date:
Karel,

If the only reason you are staying with the underscore in SQL_ASCII is
because of the JDBC driver, don't worry about it.  The code that calls
pg_encoding_to_char() expecting SQL_ASCII is new code in the 7.2 trunk.
  It does not exist in 7.1, thus we are free to change it.  Feel free to
use a dash if you prefer.

thanks,
--Barry





Re: encoding names v2.

From
Peter Eisentraut
Date:
Okay, here is some bad news:  I just looked into the SQL99 standard for
the names of the predefined character sets, and here is the list:

SQL_CHARACTER
GRAPHIC_IRV or ASCII_GRAPHIC
LATIN1                <==== !!!
ISO8BIT or ASCII_FULL
UTF16
UTF8
UCS2
SQL_TEXT
SQL_IDENTIFIER

So perhaps we should keep the LATIN1 thing after all?  I don't like it,
but the rules...

Comments?


Karel Zak writes:

>  - getdatabaseencoding() is compatible with old versions, but
>    in the code is commented as deprecated.
>
>  - getdbencoding() is new function that return correct encoding names

See my other message about this.  I don't think this is a good choice of
names.

>  - all encoding names use '-'. I hope we will never see a problem with
>    it and some operator. Encoding names must be used as quoted string.

For SQL compliance we will need to access charset names as identifiers in
the future.  So the name normalization should take effect wherever a
charset name is expected.  I suppose this is what you did.
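
One way to make normalization apply wherever a charset name is expected is
to reduce every name to a canonical key before comparing. The sketch below
is a Python illustration under assumed rules (case-insensitive, with '-',
'_' and other punctuation ignored). The thread does not spell out the exact
rules the patch uses, and normalize_name() and same_encoding() are
hypothetical helpers.

```python
def normalize_name(name):
    """Lower-case the name and keep only ASCII letters and digits."""
    return "".join(ch.lower() for ch in name if ch.isascii() and ch.isalnum())


def same_encoding(a, b):
    """Compare two charset names after normalization."""
    return normalize_name(a) == normalize_name(b)


print(normalize_name("ISO-8859-2"))               # iso88592
print(same_encoding("KOI8-R", "koi8_r"))          # True
print(same_encoding("ISO-8859-1", "ISO-8859-2"))  # False
```

Under these assumed rules a quoted string such as 'ISO-8859-2' and an
identifier-style spelling such as iso_8859_2 resolve to the same key.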

>    Only for SQL_ASCII is used '_', because I see that JDBC has hardcoded
>    "pg_encoding_to_char(1) = 'SQL_ASCII'" :-(((

This is okay, look at the list above for precedent.

>  - the ./configure.in:
>      * use new encoding names too for --enable-multibyte
>      * define MULTIBYTE that handle default encoding id

Where is this needed?

>      * define MULTIBYTE_NAME that handle default encoding name (needful
>        for initdb)

Can you rename this to something like DEFAULT_CHARACTER_SET?  There is
really nothing "multibyte" here.

>  - 'initdb' check if default template encoding is correct for backend DB.
>
>     In the old code it's in initdb very hardcoded. I add to pg_encoding
>     option '-b' that check if encoding is correct for backend DB (means
>     encoding is not client only). It's better than
>     if [ $MULTIBYTEID -gt 31 ]
>                           ^^^^^^
>     in scripts.

Good.

> src/utils/mb/Unicode/KOI8_to_utf8.map  --> src/utils/mb/Unicode/KOI8R_to_utf8.map
> src/utils/mb/Unicode/WIN_to_utf8.map  --> src/utils/mb/Unicode/WIN1251_to_utf8.map
> src/utils/mb/Unicode/utf8_to_KOI8.map --> src/utils/mb/Unicode/utf8_to_KOI8R.map
> src/utils/mb/Unicode/utf8_to_WIN.map --> src/utils/mb/Unicode/utf8_to_WIN1251.map

Can you introduce some uniform capitalization (e.g., all lower case)?

>  Thanks for all suggestion.
>
>  New comments?

Don't worry, we'll get there. ;-)

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter


Re: encoding names v2.

From
Tatsuo Ishii
Date:
> Okay, here is some bad news:  I just looked into the SQL99 standard for
> the names of predefined character set names, and here is the list:
>
> SQL_CHARACTER
> GRAPHIC_IRV or ASCII_GRAPHIC
> LATIN1                <==== !!!
> ISO8BIT or ASCII_FULL
> UTF16
> UTF8
> UCS2
> SQL_TEXT
> SQL_IDENTIFIER
>
> So perhaps we should keep the LATIN1 thing after all?  I don't like it,
> but the rules...
>
> Comments?

No way. We always need to follow the standard.

BTW, do you have the SQL99 docs online somewhere? I have a draft, but
it seems some parts of it, especially the NCHAR stuff, might have changed
at the very last stage...
--
Tatsuo Ishii


Re: encoding names v2.

From
Karel Zak
Date:
On Wed, Aug 22, 2001 at 09:38:03PM +0200, Peter Eisentraut wrote:
> Okay, here is some bad news:  I just looked into the SQL99 standard for
> the names of predefined character set names, and here is the list:
>
> SQL_CHARACTER
> GRAPHIC_IRV or ASCII_GRAPHIC
> LATIN1                <==== !!!
> ISO8BIT or ASCII_FULL
> UTF16
> UTF8
> UCS2
> SQL_TEXT
> SQL_IDENTIFIER
>
> So perhaps we should keep the LATIN1 thing after all?  I don't like it,
> but the rules...
>
> Comments?

 Oh man... what do you want to hear? :-(

 It is ***no problem*** to add an arbitrary alias (for example, LATIN1 is
still a valid name in our code), but the question is which names to select
as primary and show as output to users. I'm really unsure whether we must
blindly follow SQL99 if the standard *ignores* other standards and
conventions in some of its rules. We can accept SQL99's ignorant names as
input, for example in pg_char_to_encoding(), but we needn't show these names
to users (for example in psql's \l command).
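
 The split between "names accepted as input" and "one primary name shown as
output" can be sketched like this. It is a Python illustration, not the
code in encname.c. The ids and the alias set are made up, and the two
functions are only loosely modeled on pg_char_to_encoding() and
pg_encoding_to_char().

```python
# Hypothetical tables; ids and the alias set are made up for illustration.
PRIMARY_NAME = {8: "ISO-8859-1", 9: "ISO-8859-2"}  # id -> name shown to users
ALIASES = {                                        # accepted input -> id
    "ISO-8859-1": 8, "LATIN1": 8,
    "ISO-8859-2": 9, "LATIN2": 9,
}


def char_to_encoding(name):
    """Accept any known alias (case-insensitive); return the id or -1."""
    return ALIASES.get(name.upper(), -1)


def encoding_to_char(enc_id):
    """Return the single primary name for an id, or "" if the id is unknown."""
    return PRIMARY_NAME.get(enc_id, "")


print(encoding_to_char(char_to_encoding("latin2")))  # ISO-8859-2
```

 Adding a standard-mandated alias then touches only the input table; what
users see in listings stays under our control.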


> >  - getdatabaseencoding() is compatible with old versions, but
> >    in the code is commented as deprecated.
> >
> >  - getdbencoding() is new function that return correct encoding names
>
> See my other message about this.  I don't think this is a good choice of
> names.

 OK.

> This is okay, look at the list above for precedent.
>
> >  - the ./configure.in:
> >      * use new encoding names too for --enable-multibyte
> >      * define MULTIBYTE that handle default encoding id
>
> Where is this needed?

 In "mb/mbutils.c" the default database encoding is set by encoding id
(maybe it's never used, because the standard backend initializes the
encoding during startup, but the old code used it and I kept it).

>
> >      * define MULTIBYTE_NAME that handle default encoding name (needful
> >        for initdb)
>
> Can you rename this to something like DEFAULT_CHARACTER_SET?  There is
> really nothing "multibyte" here.

 Good point.

> > src/utils/mb/Unicode/KOI8_to_utf8.map  --> src/utils/mb/Unicode/KOI8R_to_utf8.map
> > src/utils/mb/Unicode/WIN_to_utf8.map  --> src/utils/mb/Unicode/WIN1251_to_utf8.map
> > src/utils/mb/Unicode/utf8_to_KOI8.map --> src/utils/mb/Unicode/utf8_to_KOI8R.map
> > src/utils/mb/Unicode/utf8_to_WIN.map --> src/utils/mb/Unicode/utf8_to_WIN1251.map
>
> Can you introduce some uniform capitalization (e.g., all lower case)?

 OK.

> Don't worry, we'll get there. ;-)

 I'm still happy :-)

            Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Re: encoding names v2.

From
Karel Zak
Date:
On Wed, Aug 22, 2001 at 10:08:02AM -0700, Barry Lind wrote:
> Karel,
>
> If the only reason you are staying with the underscore in SQL_ASCII is
> because of the JDBC driver, don't worry about it.  The code that calls
> pg_encoding_to_char() expecting SQL_ASCII is new code in the 7.2 trunk.
>   It does not exist in 7.1, thus we are free to change it.  Feel free to
> use a dash if you prefer.
>

 That's good news, but I'm unsure which name is more correct. We will fight
over it yet, because Peter studies the SQL standards too much... :-)

        Karel

Re: encoding names v2.

From
Bruce Momjian
Date:
Tatsuo has applied this. Thanks.



--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026