Thread: encoding names

encoding names

From
Karel Zak
Date:
Hi,
Some time ago I discussed with Tatsuo and Thomas support for synonyms
of encoding names (for example, allowing "ISO-8859-1" as the encoding
name) and using binary search when looking up encoding names. I think
we can also do a little cleanup of the encoding stuff during this
change. Right now PG uses different routines on the FE and BE for the
same operations on encoding names; IMHO that's a little strange. Well,
here is a possible solution:

- Use an 'enum' instead of the current #defines for encoding
  identifiers (in pg_wchar.h).

- Create a separate table holding only the encoding names, for
  conversion from an encoding name (char) to its numeric encoding
  identifier, plus search routines based on binary search (as in
  Knuth -- see datetime.c). All of this will be *shared* between
  FE and BE.

- For the BE, create a table that holds the conversion functions (like
  the current pg_conv_tbl[]). All items in this table will be accessed
  by array index, like 'pg_conv_tbl[ LATIN1 ]', instead of the current
  search via a for() loop.

Maybe also define all the tables as 'static' and access them only
through some routines. PG likes robust code :-)
Comments, better ideas?
            Karel

--
Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: encoding names

From
Tatsuo Ishii
Date:
Thank you for your suggestions. I'm not totally against them (for
example, I'm not against the idea of changing all the current encoding
names to more "standard" ones for 7.2 if that's your concern).
However, I think we should focus on more fundamental issues than those
trivial ones. Recently Thomas gave an idea of how to deal with the
internationalization (I18N) of PostgreSQL: create character sets etc.
I don't think an I18N PostgreSQL will happen in 7.2, but in my opinion
we should tackle it in the near future.
--
Tatsuo Ishii


Re: encoding names

From
Karel Zak
Date:
On Wed, Aug 15, 2001 at 11:28:35PM +0900, Tatsuo Ishii wrote:
> Thank you for your suggestions. I'm not totally against your
> suggestions (for example, I'm not against the idea of changing all
> current encoding names to more "standard" ones for 7.2 if it's your
> concern). However, I think we should focus on more fundamental issues
> than those trivial ones. Recently Thomas gave an idea how to deal with
> the internationalization (I18N) of PostgreSQL: create character set
> etc.  I don't think I18N PostgreSQL will happen in 7.2, but we should
> tackle them in the near future in my opinion.
I now have some time to implement this suggestion. Or is it better to
leave it for 7.2? You are right that it's trivial :-)

Note: My motivation for this is that I have a multi-language DB with a
      Web interface, and with the current version of PG I must maintain
      a separate table to translate "standard" names to PG encoding
      names and vice versa :-)

Well, I'll try to send a patch.
                Karel
--
Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: encoding names

From
Peter Eisentraut
Date:
Karel Zak writes:

>  Some time ago I discussed with Tatsuo and Thomas support
> for synonyms of encoding names (for example, allowing
> "ISO-8859-1" as the encoding name) and using binary search when
> looking up encoding names.

Funny, I was thinking the same thing last night...

A couple of other things I was thinking about in the encoding area:

If you want to have codeset synonyms, you should also implement the
normalization of codeset names, defined as such:
 1. Remove all characters beside numbers and letters.
 2. Fold letters to lowercase.
 3. If the name then contains only digits, prepend the string `"iso"'.
[quote glibc]

This allows ISO_8859-1 and iso88591 to be treated the same.

Here's a good resource of official character set names and aliases:

http://www.iana.org/assignments/character-sets

Also, we ought to have support for the ISO_8859-15 character set, or
people will spread the word that PostgreSQL is not ready for the Euro.

Then I figured, if the client is configured with locale, it should
automatically determine the client's encoding.  Not sure if this is
portably possible, but it would be very nice to have.

Finally, as I've mentioned before, I'd like to try out the iconv
interface. It might even become an option in 7.2.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: encoding names

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> However, I think we should focus on more fundamental issues
> than those trivial ones. Recently Thomas gave an idea how to deal with
> the internationalization (I18N) of PostgreSQL: create character set
> etc.

I haven't actually seen any real implementation proposal yet.  We all know
(I suppose) the requirements of this project and the interface is mostly
specified by SQL.  But what I haven't seen yet is how this will work
internally.  If we encode the charset into the header of the text datum
then each and every function will have to be concerned that its output
value has the right character set.  If we use the type system and create a
new text type for each character set then we'll probably have to implement
N^X (where N is the number of character sets, and X is not known yet but
>1) functions, operators, casts, etc. (not even thinking about
user-pluggable character sets) and we'll really uglify all the psql \d and
pg_dump work.  It's not at all clear.  What I'm thinking these days is
that we'd need something completely new and unprecedented -- a separate
charset mix-and-match subsystem, similar to the type system, but
different.  Not a pretty outlook, of course.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: encoding names

From
Karel Zak
Date:
On Wed, Aug 15, 2001 at 05:16:42PM +0200, Peter Eisentraut wrote:
> Karel Zak writes:
> 
> >  Some time ago I discussed with Tatsuo and Thomas support
> > for synonyms of encoding names (for example, allowing
> > "ISO-8859-1" as the encoding name) and using binary search when
> > looking up encoding names.
> 
> Funny, I was thinking the same thing last night...
:-)

> A couple of other things I was thinking about in the encoding area:
> 
> If you want to have codeset synonyms, you should also implement the
> normalization of codeset names, defined as such:
> 
>   1. Remove all characters beside numbers and letters.
> 
>   2. Fold letters to lowercase.
> 
>   3. If the name then contains only digits, prepend the string `"iso"'.
> [quote glibc]
> 
> This allows ISO_8859-1 and iso88591 to be treated the same.
My idea is (was :-) to create a table with all the "standard" synonyms
and search it case-insensitively. Something like:

PGencname pg_encnames[] =
{
    { "ISO-8859-1", LATIN1 },
    { "LATIN1",     LATIN1 }
};

But your idea of encoding name "cleaning" (removing irrelevant chars)
is cooler.

> Here's a good resource of official character set names and aliases:
> 
> http://www.iana.org/assignments/character-sets
Thanks.

> Also, we ought to have support for the ISO_8859-15 character set, or
> people will spread the word that PostgreSQL is not ready for the Euro.

That requires preparing some conversion functions and tables (UTF). Tatsuo?
> Finally, as I've mentioned before I'd like to try out the iconv interface.
Do you want to integrate the iconv stuff into the current PG multibyte
routines, or only as some extension (functions?)?

BTW, does psql have some \command that prints the list of all
supported encodings? Maybe allow something like:
SELECT pg_encoding_names();
        Karel

--
Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: encoding names

From
Tatsuo Ishii
Date:
> Finally, as I've mentioned before I'd like to try out the iconv interface.
> Might become an option in 7.2 even.

I'm curious how you would handle conversion between multibyte strings
and wide characters using iconv. This is necessary to implement
multibyte-aware functions like regex, char_length, etc. I think at the
least you need a way to determine the character boundaries in a
multibyte string.
--
Tatsuo Ishii


Re: encoding names

From
Tatsuo Ishii
Date:
>  I have now some time for implement this my suggestion. Or is better
> let down this for 7.2? You are right that it's trivial :-)

I think you should target for 7.2.

>  Note: My motivate for this is that I have some multi-language DB
>        with Web interface and for current version of PG I must maintain
>        separate table for transformation "standard" names to PG encoding
>        names and vice-versa:-) 
> 
>  Well, I try send some patch.

Thanks. BTW, I'm working on dynamically loading the Unicode conversion
functions to decrease the runtime memory requirement. The reasons I
want to do this:

o they are huge (--enable-unicode-conversion increases the load module
  size by ~1MB)

o nobody will use all of them at once; for example, most Japanese
  users are only interested in EUC/SJIS maps.
--
Tatsuo Ishii


Re: encoding names

From
Karel Zak
Date:
On Thu, Aug 16, 2001 at 03:39:28PM +0900, Tatsuo Ishii wrote:
>
> o they are huge (--enable-unicode-conversion will increase ~1MB in the
>   load module size)
> 
> o nobody will use all of them at once. For example most Japanese users
>   are only interested in EUC/SJIS maps.
> --
Good idea.
I have a question: is the PostgreSQL encoding name "KOI8" KOI8-R,
KOI8-U, or both? I need it to set the aliases correctly.
        Karel

--
Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: encoding names

From
Tatsuo Ishii
Date:
>  I have a question, the PostgreSQL encoding name "KOI8" is KOI8-R or 
> KOI8-U or both? I need it for correct alias setting.

I think it's KOI8-R. Oleg, am I correct?

P.S.
I use Makefile.shlib to create each shared object. This way is, I
think, handy and portable. However, I need to make lots of subdirs,
one for each encoding conversion function. Any suggestions?
--
Tatsuo Ishii


Re: encoding names

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Thanks. BTW, I'm working on dynamically loading the Unicode
> conversion functions to decrease the runtime memory requirement. The
> reason why I want to do this is:
> o they are huge (--enable-unicode-conversion will increase ~1MB in the
>   load module size)
> o nobody will use all of them at once. For example most Japanese users
>   are only interested in EUC/SJIS maps.

But is it really important?  All Unixen that I know of handle process
text segments on a page-by-page basis; pages that aren't actually being
touched won't get swapped in.  Thus, the unused maps will just sit on
disk, whether they are part of the main executable or a separate file.
I doubt there's any real performance gain to be had by making the maps
dynamically loadable.
        regards, tom lane


Re: encoding names

From
Karel Zak
Date:
On Thu, Aug 16, 2001 at 10:22:48PM +0900, Tatsuo Ishii wrote:
> >  I have a question, the PostgreSQL encoding name "KOI8" is KOI8-R or 
> > KOI8-U or both? I need it for correct alias setting.
> 
> I think it's KOI8-R. Oleg, am I correct?
> 
> P.S.
> I use Makefile.shlib to create each shared object. This way is, I
> think, handy and portable. However, I need to make lots of subdirs for
> each encoding conversion function. Any suggestions?

Please make a separate directory for the encoding translation table
programs too. The current mb/ is a mix of standard files and files
with main()....
        Karel

--
Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: encoding names

From
Karel Zak
Date:
On Thu, Aug 16, 2001 at 03:39:28PM +0900, Tatsuo Ishii wrote:
> >  I have now some time for implement this my suggestion. Or is better
> > let down this for 7.2? You are right that it's trivial :-)
> 
> I think you should target for 7.2.
> 
> >  Note: My motivate for this is that I have some multi-language DB
> >        with Web interface and for current version of PG I must maintain
> >        separate table for transformation "standard" names to PG encoding
> >        names and vice-versa:-) 
> > 
> >  Well, I try send some patch.
> 
> Thanks. BTW, I'm working on for dynamically loading the Unicode
Sorry Tatsuo, but I have another question :-)
Why is there a hole between the encoding numbers?

#define KOI8   16                               /* KOI8-R/U */
#define WIN    17                               /* windows-1251 */
#define ALT    18                               /* Alternativny Variant */

19..31 ?

#define SJIS 32                                 /* Shift JIS */
#define BIG5 33                                 /* Big5 */
#define WIN1250  34                             /* windows-1250 */

It makes it troublesome to create arrays of encoding stuff, like:

pg_encoding_conv_tbl[ ALT ]->to_mic
pg_encoding_conv_tbl[ SJIS ]->to_mic

Does this hole between 19..31 have some purpose?
        Karel

--
Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: encoding names

From
Oleg Bartunov
Date:
On Thu, 16 Aug 2001, Tatsuo Ishii wrote:

> >  I have a question, the PostgreSQL encoding name "KOI8" is KOI8-R or
> > KOI8-U or both? I need it for correct alias setting.
>
> I think it's KOI8-R. Oleg, am I correct?

YES

>
> P.S.
> I use Makefile.shlib to create each shared object. This way is, I
> think, handy and portable. However, I need to make lots of subdirs for
> each encoding conversion function. Any suggestions?
> --
> Tatsuo Ishii
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: encoding names

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> I use Makefile.shlib to create each shared object. This way is, I
> think, handy and portable. However, I need to make lots of subdirs for
> each encoding conversion function. Any suggestions?

Given Tom Lane's comment, I think that this would be a wasted effort.
Shared objects are normally used for extensibility at runtime, not core
memory savings.  (This would most likely take more memory in fact, given
that the code is larger and you need all the shared object handling
infrastructure.)

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: encoding names

From
Tatsuo Ishii
Date:
> But is it really important?  All Unixen that I know of handle process
> text segments on a page-by-page basis; pages that aren't actually being
> touched won't get swapped in.  Thus, the unused maps will just sit on
> disk, whether they are part of the main executable or a separate file.
> I doubt there's any real performance gain to be had by making the maps
> dynamically loadable.

I did some testing on my Linux box (kernel 2.2) and found no
performance degradation with unicode-conversion-enabled postgres. That
confirms your theory, at least on Linux.

OK, I will make the unicode conversion functionality the default when
--enable-multibyte is on.
--
Tatsuo Ishii