Thread: Re: [PATCHES] encoding names

Re: [PATCHES] encoding names

From
Tatsuo Ishii
Date:
>  Hi,
> 
>  attached is a patch with:
> 
> - new encoding names code with better performance (binary search
>   instead of a for() loop, and it avoids some needless searching)
> 
> - it is possible to use synonyms for an encoding (for example
>   ISO-8859-1, Latin1, l1)
> 
> - Peter's idea about "encoding names clearing" is implemented
>   (chars other than [A-Za-z0-9] are irrelevant -- 'ISO-8859-1' is 
>   the same as 'iso8859_1' or iso-8-8-5-9-1 :-)  
> 
> - the routines for this are shared between FE and BE (no more defining 
>   encoding names separately in FE and BE)
> 
> - add the prefix PG_ to the encoding identifier macros; something like 'ALT' 
>   is pretty dirty in source code, better to use PG_ALT.
> 
>  (Note: the patch adds the new file mb/encname.c and removes mb/common.c)
> 
>                 Karel
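
A minimal sketch of the name cleaning and binary search described above, not the code from the patch. The IDs, the table contents, and the names clean_name() and lookup_encoding() are assumptions made for illustration only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical IDs for illustration; not the values used by the patch. */
enum { ENC_LATIN1 = 1, ENC_KOI8 = 2, ENC_UNICODE = 3 };

typedef struct
{
    const char *name;       /* already cleaned: lower case, [a-z0-9] only */
    int         id;
} enc_alias;

/* Must stay sorted by name in strcmp() order, because bsearch() needs it. */
static const enc_alias alias_tbl[] = {
    {"iso88591", ENC_LATIN1},
    {"koi8", ENC_KOI8},
    {"koi8r", ENC_KOI8},
    {"l1", ENC_LATIN1},
    {"latin1", ENC_LATIN1},
    {"unicode", ENC_UNICODE},
    {"utf8", ENC_UNICODE},
};

/* Keep only ASCII letters and digits, folding letters to lower case. */
static void
clean_name(const char *in, char *out, size_t outlen)
{
    size_t      n = 0;

    if (outlen == 0)
        return;
    for (; *in != '\0' && n + 1 < outlen; in++)
    {
        char        c = *in;

        if (c >= 'A' && c <= 'Z')
            out[n++] = (char) (c - 'A' + 'a');
        else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
            out[n++] = c;
    }
    out[n] = '\0';
}

/* bsearch() comparator: key is a cleaned name, elem is a table entry. */
static int
cmp_alias(const void *key, const void *elem)
{
    return strcmp((const char *) key, ((const enc_alias *) elem)->name);
}

/* Returns the encoding ID, or -1 if the name is unknown. */
static int
lookup_encoding(const char *name)
{
    char        buf[64];
    const enc_alias *hit;

    clean_name(name, buf, sizeof(buf));
    hit = bsearch(buf, alias_tbl,
                  sizeof(alias_tbl) / sizeof(alias_tbl[0]),
                  sizeof(alias_tbl[0]), cmp_alias);
    return hit ? hit->id : -1;
}
```

With this table, 'ISO-8859-1', 'iso8859_1', 'iso-8-8-5-9-1' and 'l1' all clean to a key that maps to the same ID, and synonyms cost only one extra table row each.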

Thanks for the patches, but...

1) There is a compiler error if --enable-unicode-conversion is not
   enabled.

2) The patches break createdb. createdb should raise an error if a
   client-only encoding such as SJIS is specified.

3) I don't like the following ugliness. Why not change all of the
   SQL_ASCII occurrences in the sources?

    /*
     * A lot of PG stuff use 'SQL_ASCII' without prefix (dirty...)
     */
    #define SQL_ASCII    PG_SQL_ASCII

4) Encoding "official" names are inconsistent. Here are my suggested
   changes (referring to http://www.iana.org/assignments/character-sets,
   according to Peter's suggestion):

    ALT -> IBM866
    KOI8 -> KOI8_R
    UNICODE -> UTF_8 (Peter's suggestion)

   Also, I'm wondering why windows-1251, not windows_1251? Or
   ISO_8859_1, not ISO-8859-1? There seems to be some confusion about
   the usage of "_" and "-".
 

pg_enc2name pg_enc2name_tbl[] =
{
    { "SQL_ASCIII_PLACEHOLDER", 0 }
};



Re: Re: [PATCHES] encoding names

From
"Serguei Mokhov"
Date:
----- Original Message ----- 
From: Tatsuo Ishii <t-ishii@sra.co.jp>
Sent: Saturday, August 18, 2001 10:02 PM


>     ALT -> IBM866

Just a quick comment: ALT is not necessarily IBM866.
It can be any US-ASCII or 26-character-alphabet Latin set, for example
IBM819 or ISO8859-1. It is actually quite different from IBM866 in its
true meaning, and they shouldn't be aliased together. ALT is used, for
example, when none of KOI8-R, Windows-1251, or IBM866 is available to a
Russian-speaking person to read or write text and messages: we use simple
English letters to write words in Russian so that the pronunciation stays
roughly the same. It's something like russian_latin (as an equivalent to
greek_latin in the http://www.iana.org/assignments/character-sets spec),
and writing this way is a bit reminiscent of Polish or Serbian-Latin.

Serguei




Re: [PATCHES] encoding names

From
Karel Zak
Date:
On Sun, Aug 19, 2001 at 11:02:57AM +0900, Tatsuo Ishii wrote:

> 4) Encoding "official" names are inconsistent. Here are my suggested
>    changes (referring http://www.iana.org/assignments/character-sets,
>    according to Peter's suggestion):
> 
>     ALT -> IBM866
>     KOI8 -> KOI8_R
>     UNICODE -> UTF_8 (Peter's suggestion)
>     
 Right.

 But we will still need aliases UNICODE, ALT, KOI8 for back compatibility.

 Thanks, I try fix all.

            Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: Re: [PATCHES] encoding names

From
Tatsuo Ishii
Date:
> >     ALT -> IBM866
> 
> Just a quick comment: ALT is not necessarily IBM866.
> It can be any US-ASCII or 26-character-alphabet Latin set, for example
> IBM819 or ISO8859-1. It is actually quite different from IBM866 in its
> true meaning, and they shouldn't be aliased together. ALT is used, for
> example, when none of KOI8-R, Windows-1251, or IBM866 is available to a
> Russian-speaking person to read or write text and messages: we use simple
> English letters to write words in Russian so that the pronunciation stays
> roughly the same. It's something like russian_latin (as an equivalent to
> greek_latin in the http://www.iana.org/assignments/character-sets spec),
> and writing this way is a bit reminiscent of Polish or Serbian-Latin.

Ok. Let's leave ALT as it is.
--
Tatsuo Ishii


Re: [PATCHES] encoding names

From
Tatsuo Ishii
Date:
> > 4) Encoding "official" names are inconsistent. Here are my suggested
> >    changes (referring http://www.iana.org/assignments/character-sets,
> >    according to Peter's suggestion):
> > 
> >     ALT -> IBM866
> >     KOI8 -> KOI8_R
> >     UNICODE -> UTF_8 (Peter's suggestion)
> >     
> 
>  Right.
> 
>  But we will still need aliases UNICODE, ALT, KOI8 for back compatibility.

Sure. 

>  Thanks, I try fix all.

Thanks! But it seems we will leave ALT as it is (Serguei's suggestion).
--
Tatsuo Ishii
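
One way to read the outcome of this exchange (new official names, old names kept as aliases, ALT unchanged) is sketched below. It is only an illustration: the ID values, PG_ENC_COUNT, the two tables, and canonical_encoding_name() are hypothetical and not taken from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical IDs; only the PG_ prefix follows the patch description. */
typedef enum { PG_KOI8 = 0, PG_UNICODE, PG_ALT, PG_ENC_COUNT } pg_enc_id;

/* Exactly one official name per ID, indexed by ID. */
static const char *const official_name[PG_ENC_COUNT] = {
    "KOI8_R",                   /* PG_KOI8 */
    "UTF_8",                    /* PG_UNICODE */
    "ALT",                      /* PG_ALT, left as it is */
};

/* Any number of accepted spellings per ID, old names included. */
static const struct
{
    const char *name;
    pg_enc_id   id;
}           accepted[] = {
    {"KOI8_R", PG_KOI8},
    {"KOI8", PG_KOI8},          /* old name, kept for compatibility */
    {"UTF_8", PG_UNICODE},
    {"UNICODE", PG_UNICODE},    /* old name, kept for compatibility */
    {"ALT", PG_ALT},
};

/* Map any accepted name to its official name, or NULL if unknown. */
static const char *
canonical_encoding_name(const char *name)
{
    size_t      i;

    for (i = 0; i < sizeof(accepted) / sizeof(accepted[0]); i++)
    {
        if (strcmp(accepted[i].name, name) == 0)
            return official_name[accepted[i].id];
    }
    return NULL;
}
```

Keeping the ID-to-name table separate from the list of accepted spellings means an official name can change without breaking anything that still passes the old one.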



encoding: ODBC, createdb

From
Karel Zak
Date:
 I found some other things:

- why database encoding for new DB check 'createdb' script and not
  CREATE DATABASE statement? (means client only encodings, like BIG5)?

  Bug?

- ODBC -- here is some multibyte stuff too. Why ODBC code don't use
  pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
  all encoding identificators.

  IMHO we can use for ODBC same solution as for libpq and compile it
  with encname.c file too.

    Karel



Re: encoding: ODBC, createdb

From
Hiroshi Inoue
Date:
Karel Zak wrote:
> 
>  I found some other things:
> 
> - ODBC -- here is some multibyte stuff too. Why ODBC code don't use
>   pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
>   all encoding identificators.
> 
>   IMHO we can use for ODBC same solution as for libpq and compile it
>   with encname.c file too.

ODBC under Windows needs no source/header files in PostgreSQL
other than in src/interfaces/odbc. It's not preferable for 
psqlodbc driver to be sensitive about other PostgreSQL changes
because the driver has to be able to talk multiple versions of
PostgreSQL servers. In fact the current driver could talk to 
any server whose version >= 6.2(according to a person).
As for pg_wchar.h I'm not sure if it could be an exception
and we could expect for the maintainer to take care of ODBC.
If I were he, I would hate it.

regards,
Hiroshi Inoue


Re: encoding: ODBC, createdb

From
Tatsuo Ishii
Date:
>  I found some other things:
> 
> - why database encoding for new DB check 'createdb' script and not
>   CREATE DATABASE statement? (means client only encodings, like BIG5)?
> 
>   Bug?

Oh, that must be a bug. Do you want to take care of it by yourself?

> - ODBC -- here is some multibyte stuff too. Why ODBC code don't use
>   pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
>   all encoding identificators. 
> 
>   IMHO we can use for ODBC same solution as for libpq and compile it
>   with encname.c file too.

Don't know about ODBC. Hiroshi?
--
Tatsuo Ishii


Re: encoding: ODBC, createdb

From
Karel Zak
Date:
On Tue, Aug 21, 2001 at 10:00:21AM +0900, Hiroshi Inoue wrote:
> Karel Zak wrote:
> > 
> >  I found some other things:
> > 
> > - ODBC -- here is some multibyte stuff too. Why ODBC code don't use
> >   pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
> >   all encoding identificators.
> > 
> >   IMHO we can use for ODBC same solution as for libpq and compile it
> >   with encname.c file too.
> 
> ODBC under Windows needs no source/header files in PostgreSQL
> other than in src/interfaces/odbc. It's not preferable for 
> psqlodbc driver to be sensitive about other PostgreSQL changes
> because the driver has to be able to talk multiple versions of
> PostgreSQL servers. In fact the current driver could talk to 
> any server whose version >= 6.2(according to a person).
> As for pg_wchar.h I'm not sure if it could be an exception
> and we could expect for the maintainer to take care of ODBC.
> If I were he, I would hate it.

 In odbc/multibyte.h there is:

    if (strstr(str, "%27SJIS%27") || strstr(str, "'SJIS'") ||
        strstr(str, "'sjis'"))

 ...and the same line for BIG5.

 I add only the new names 'Shift_JIS' and 'Big5' here.

    Karel
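
The strstr() test quoted above could also be written as a table, so that adding a spelling such as 'Shift_JIS' is a one-line change. This is only a sketch: the MB_* codes, the mb_patterns table, and detect_multibyte() are hypothetical names, and the three older BIG5 patterns are an assumption based on "the same line for BIG5".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for illustration. */
enum { MB_NONE = 0, MB_SJIS, MB_BIG5 };

static const struct
{
    const char *pattern;
    int         code;
}           mb_patterns[] = {
    {"%27SJIS%27", MB_SJIS},
    {"'SJIS'", MB_SJIS},
    {"'sjis'", MB_SJIS},
    {"'Shift_JIS'", MB_SJIS},   /* new name */
    {"%27BIG5%27", MB_BIG5},    /* assumed to mirror the SJIS line */
    {"'BIG5'", MB_BIG5},
    {"'big5'", MB_BIG5},
    {"'Big5'", MB_BIG5},        /* new name */
};

/* Return the first multibyte code whose pattern occurs in str. */
static int
detect_multibyte(const char *str)
{
    size_t      i;

    for (i = 0; i < sizeof(mb_patterns) / sizeof(mb_patterns[0]); i++)
    {
        if (strstr(str, mb_patterns[i].pattern) != NULL)
            return mb_patterns[i].code;
    }
    return MB_NONE;
}
```

Like the original, this is a case-sensitive substring match, so any spelling not listed in the table is simply not detected.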



Re: encoding: ODBC, createdb

From
Karel Zak
Date:
On Tue, Aug 21, 2001 at 10:00:50AM +0900, Tatsuo Ishii wrote:
> >  I found some other things:
> > 
> > - why database encoding for new DB check 'createdb' script and not
> >   CREATE DATABASE statement? (means client only encodings, like BIG5)?
> > 
> >   Bug?
> 
> Oh, that must be a bug. Do you want to take care of it by yourself?

 I'll check and fix it. The 'createdb' script needn't check anything; it all
 must be in the backend.

    Karel



Re: encoding: ODBC, createdb

From
Tatsuo Ishii
Date:
> On Tue, Aug 21, 2001 at 10:00:50AM +0900, Tatsuo Ishii wrote:
> > >  I found some other things:
> > > 
> > > - why database encoding for new DB check 'createdb' script and not
> > >   CREATE DATABASE statement? (means client only encodings, like BIG5)?
> > > 
> > >   Bug?
> > 
> > Oh, that must be a bug. Do you want to take care of it by yourself?
> 
>  I'll check and fix it. The 'createdb' script needn't check anything; it all
> must be in the backend.

Agreed.
--
Tatsuo Ishii
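
A sketch of what a backend-side check could look like, assuming the encoding IDs are ordered so that all server-capable encodings come before the client-only ones. The enum, its values, and both function names are hypothetical; this is not the code that went into dbcommands.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout: server-capable encodings first, client-only after. */
typedef enum
{
    ENC_SQL_ASCII = 0,
    ENC_EUC_JP,
    ENC_UNICODE,
    ENC_LAST_SERVER = ENC_UNICODE,  /* boundary marker */
    ENC_SJIS,                       /* client-only from here on */
    ENC_BIG5,
    ENC_COUNT
} enc_id;

/* True if the ID may be used as a database (server-side) encoding. */
static bool
is_server_encoding(int enc)
{
    return enc >= 0 && enc <= ENC_LAST_SERVER;
}

/* Returns NULL if enc is acceptable for a new database, else a message. */
static const char *
check_database_encoding(int enc)
{
    if (enc < 0 || enc >= ENC_COUNT)
        return "invalid encoding";
    if (!is_server_encoding(enc))
        return "encoding is client-only and cannot be used for a database";
    return NULL;
}
```

With the check next to the code that executes CREATE DATABASE, the 'createdb' script and any other client get the same error without duplicating the list of encodings.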


Re: encoding: ODBC, createdb

From
Bruce Momjian
Date:
Was this completed?

> 
>  I found some other things:
> 
> - why database encoding for new DB check 'createdb' script and not
>   CREATE DATABASE statement? (means client only encodings, like BIG5)?
> 
>   Bug?
> 
> 
> - ODBC -- here is some multibyte stuff too. Why ODBC code don't use
>   pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
>   all encoding identificators. 
> 
>   IMHO we can use for ODBC same solution as for libpq and compile it
>   with encname.c file too.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: encoding: ODBC, createdb

From
Karel Zak
Date:
On Fri, Sep 07, 2001 at 04:11:25PM -0400, Bruce Momjian wrote:
> 
> Was this completed?
> 
> > 
> >  I found some other things:
> > 
> > - why database encoding for new DB check 'createdb' script and not
> >   CREATE DATABASE statement? (means client only encodings, like BIG5)?

 It was included in my large multibyte patch and it's complete in
 dbcommands.c (it was an unreported bug in previous releases).

> > - ODBC -- here is some multibyte stuff too. Why ODBC code don't use
> >   pg_wchar.h where is all defined? In odbc/multibyte.h is again defined
> >   all encoding identificators. 

 Probably done, but the ODBC maintainer should confirm it.

    Karel
