Thread: encoding names
Hi, some time ago I discussed with Tatsuo and Thomas supporting synonyms for encoding names (for example, allowing "ISO-8859-1" as an encoding name) and using binary search when looking up encoding names. I think we could use this change to clean up the encoding stuff a little, too. Right now PG uses different routines on the FE and BE for the same operations on encoding names, which IMHO is a little strange. Well, here is a possible solution:

- use an 'enum' instead of the current #defines for the encoding identifiers (in pg_wchar.h),

- create a separate table with encoding names only, for converting an encoding name (char) to its numerical identifier, plus search routines based on binary search (from Knuth -- see datetime.c). All of this would be *shared* between FE and BE,

- for the BE, create a table that holds the conversion functions (like the current pg_conv_tbl[]). All items in this table would be accessed by array index, like 'pg_conv_tbl[ LATIN1 ]', instead of the current search via a for() loop.

Maybe also define all tables as 'static' and work with them through routines only. PG likes robust code :-)

Comments, better ideas?

Karel

-- 
Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
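[Editorial note] Karel's enum-plus-direct-indexing proposal can be sketched as below. This is an illustrative sketch only: the enum members, struct layout, and `pg_get_conv()` are made-up names, not the actual PostgreSQL 7.x identifiers.

```c
#include <assert.h>

/* Enum instead of #defines: the compiler knows the full value range,
 * and a sentinel member gives the table size for free. */
typedef enum pg_enc
{
    PG_SQL_ASCII = 0,
    PG_EUC_JP,
    PG_LATIN1,
    PG_UTF8,
    _PG_LAST_ENCODING           /* sentinel: number of encodings */
} pg_enc;

typedef struct pg_conv
{
    pg_enc      encoding;
    const char *name;
    /* conversion function pointers would live here */
} pg_conv;

/* Table indexed directly by the enum value -- no for() search needed. */
static const pg_conv pg_conv_tbl[_PG_LAST_ENCODING] = {
    {PG_SQL_ASCII, "SQL_ASCII"},
    {PG_EUC_JP,    "EUC_JP"},
    {PG_LATIN1,    "LATIN1"},
    {PG_UTF8,      "UNICODE"},
};

/* O(1) access instead of a linear scan, with a range check so the
 * table can stay 'static' behind accessor routines. */
static const pg_conv *
pg_get_conv(pg_enc enc)
{
    if ((int) enc < 0 || enc >= _PG_LAST_ENCODING)
        return 0;
    return &pg_conv_tbl[enc];
}
```

Keeping the table `static` and exposing only the accessor, as Karel suggests, means callers cannot index it out of range.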
Thank you for your suggestions. I'm not totally against them (for example, I'm not against the idea of changing all current encoding names to more "standard" ones for 7.2, if that's your concern). However, I think we should focus on more fundamental issues than those trivial ones. Recently Thomas gave an idea of how to deal with the internationalization (I18N) of PostgreSQL: create character sets etc. I don't think an I18N PostgreSQL will happen in 7.2, but in my opinion we should tackle it in the near future.
--
Tatsuo Ishii
On Wed, Aug 15, 2001 at 11:28:35PM +0900, Tatsuo Ishii wrote:
> Thank you for your suggestions. I'm not totally against your
> suggestions (for example, I'm not against the idea that changing all
> current encoding names to more "standard" ones for 7.2 if it's your
> concern). However, I think we should focus on more fundamental issues
> than those trivial ones. Recently Thomas gave an idea how to deal with
> the internationalization (I18N) of PostgreSQL: create character set
> etc. I don't think I18N PostgreSQL will happen in 7.2, but we should
> tackle them in the near future in my opinion.

I now have some time to implement this suggestion of mine. Or is it better to put it off until after 7.2? You are right that it's trivial :-)

Note: my motivation for this is that I have a multi-language DB with a Web interface, and for the current version of PG I must maintain a separate table for translating "standard" names to PG encoding names and vice versa :-)

Well, I'll try to send a patch.

Karel

-- 
Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
Karel Zak writes:
> before some time I was discuss with Tatsuo and Thomas about support
> for synonyms of encoding names (for example allows to use
> "ISO-8859-1" as the encoding name) and use binary search for searching
> in encoding names.

Funny, I was thinking the same thing last night...

A couple of other things I was thinking about in the encoding area:

If you want to have codeset synonyms, you should also implement the normalization of codeset names, defined as such:

1. Remove all characters besides numbers and letters.

2. Fold letters to lowercase.

3. If the name only contains digits, prepend the string `"iso"'.

[quote glibc]

This allows ISO_8859-1 and iso88591 to be treated the same.

Here's a good resource of official character set names and aliases:

http://www.iana.org/assignments/character-sets

Also, we ought to have support for the ISO_8859-15 character set, or people will spread the word that PostgreSQL is not ready for the Euro.

Then I figured, if the client is configured with locale support, it should automatically determine the client's encoding. Not sure if this is portably possible, but it would be very nice to have.

Finally, as I've mentioned before, I'd like to try out the iconv interface. Might become an option in 7.2 even.

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
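[Editorial note] The three glibc-style normalization rules Peter quotes can be sketched in a few lines of C. The function name `normalize_codeset` is made up for illustration; this is not the glibc implementation itself.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Normalize a codeset name: keep only letters and digits, fold letters
 * to lowercase, and prepend "iso" if the result is all digits.
 * Writes at most outlen bytes (including the terminating NUL). */
static void
normalize_codeset(const char *name, char *out, int outlen)
{
    int     i = 0;
    int     only_digits = 1;

    for (; *name && i < outlen - 1; name++)
    {
        unsigned char c = (unsigned char) *name;

        if (isalpha(c))
        {
            out[i++] = (char) tolower(c);
            only_digits = 0;
        }
        else if (isdigit(c))
            out[i++] = (char) c;
        /* everything else (-, _, spaces, ...) is dropped */
    }
    out[i] = '\0';

    if (only_digits && i > 0 && i + 3 < outlen)
    {
        memmove(out + 3, out, i + 1);   /* shift digits, incl. NUL */
        memcpy(out, "iso", 3);
    }
}
```

With this, "ISO_8859-1", "iso88591", and even the bare "8859-1" all normalize to "iso88591", so a single sorted table of normalized names suffices for lookup.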
Tatsuo Ishii writes:
> However, I think we should focus on more fundamental issues
> than those trivial ones. Recently Thomas gave an idea how to deal with
> the internationalization (I18N) of PostgreSQL: create character set
> etc.

I haven't actually seen any real implementation proposal yet. We all know (I suppose) the requirements of this project, and the interface is mostly specified by SQL. But what I haven't seen yet is how this will work internally.

If we encode the charset into the header of the text datum, then each and every function will have to be concerned that its output value has the right character set. If we use the type system and create a new text type for each character set, then we'll probably have to implement N^X (where N is the number of character sets, and X is not known yet but >1) functions, operators, casts, etc. (not even thinking about user-pluggable character sets), and we'll really uglify all the psql \d and pg_dump work. It's not at all clear.

What I'm thinking these days is that we'd need something completely new and unprecedented -- a separate charset mix-and-match subsystem, similar to the type system, but different. Not a pretty outlook, of course.

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
On Wed, Aug 15, 2001 at 05:16:42PM +0200, Peter Eisentraut wrote:
> Karel Zak writes:
>
> > before some time I was discuss with Tatsuo and Thomas about support
> > for synonyms of encoding names (for example allows to use
> > "ISO-8859-1" as the encoding name) and use binary search for searching
> > in encoding names.
>
> Funny, I was thinking the same thing last night...

:-)

> A couple of other things I was thinking about in the encoding area:
>
> If you want to have codeset synonyms, you should also implement the
> normalization of codeset names, defined as such:
>
> 1. Remove all characters beside numbers and letters.
>
> 2. Fold letters to lowercase.
>
> 3. If the same only contains digits prepend the string `"iso"'.
> [quote glibc]
>
> This allows ISO_8859-1 and iso88591 to be treated the same.

My idea was to create a table with all "standard" synonyms and search this table case-insensitively. Something like:

    PGencname pg_encnames[] = {
        { "ISO-8859-1", LATIN1 },
        { "LATIN1",     LATIN1 }
    };

But your idea of "clearing" encoding names (removing the irrelevant chars) is cooler.

> Here's a good resource of official character set names and aliases:
>
> http://www.iana.org/assignments/character-sets

Thanks.

> Also, we ought to have support for the ISO_8859-15 character set, or
> people will spread the word that PostgreSQL is not ready for the Euro.

That requires preparing some conversion functions and tables (UTF). Tatsuo?

> Finally, as I've mentioned before I'd like to try out the iconv interface.

Do you want to integrate the iconv stuff into the current PG multibyte routines, or only as some extension (functions?)?

BTW, is there some psql \command that prints the list of all supported encodings? Maybe allow something like:

    SELECT pg_encoding_names();

Karel

-- 
Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
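[Editorial note] Karel's alias table pairs naturally with the binary search he proposed earlier in the thread: keep the table sorted by (normalized) name and use `bsearch()`. The encoding numbers below are illustrative, and the lookup assumes the caller has already normalized the name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LATIN1  8
#define KOI8    16

typedef struct PGencname
{
    const char *name;           /* already-normalized alias */
    int         encoding;
} PGencname;

/* MUST be kept sorted by name, or bsearch() will miss entries. */
static const PGencname pg_encnames[] = {
    {"iso88591", LATIN1},
    {"koi8r",    KOI8},
    {"latin1",   LATIN1},
};

static int
encname_cmp(const void *key, const void *entry)
{
    return strcmp((const char *) key, ((const PGencname *) entry)->name);
}

/* Return the encoding id for a normalized name, or -1 if unknown. */
static int
encoding_by_name(const char *normalized_name)
{
    const PGencname *e = bsearch(normalized_name, pg_encnames,
                                 sizeof(pg_encnames) / sizeof(pg_encnames[0]),
                                 sizeof(PGencname), encname_cmp);

    return e ? e->encoding : -1;
}
```

Normalizing first (Peter's suggestion) means the table needs only one entry per alias and the comparison can be plain `strcmp()` rather than a case-insensitive variant.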
> Finally, as I've mentioned before I'd like to try out the iconv interface.
> Might become an option in 7.2 even.

I'm curious how you would handle conversion between multibyte strings and wide characters using iconv. This is necessary to implement multibyte-aware versions of functions such as like, regex, char_length etc. I think at the least you need a way to determine the letter boundaries in a multibyte string.
--
Tatsuo Ishii
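[Editorial note] To illustrate Tatsuo's point about letter boundaries: PostgreSQL's multibyte support does this with per-encoding length tables (pg_mblen()); the sketch below hard-codes the UTF-8 case only, with made-up function names, to show why char_length() and friends need boundary information at all.

```c
#include <assert.h>

/* Length in bytes of the UTF-8 character whose first byte is *s.
 * (Assumes well-formed input; a real implementation must validate.) */
static int
utf8_mblen(const unsigned char *s)
{
    if (*s < 0x80)  return 1;   /* plain ASCII */
    if (*s < 0xE0)  return 2;   /* lead byte 0xC0..0xDF */
    if (*s < 0xF0)  return 3;   /* lead byte 0xE0..0xEF */
    return 4;                   /* lead byte 0xF0..     */
}

/* char_length(): count characters, not bytes, by hopping from one
 * letter boundary to the next. */
static int
utf8_char_length(const char *s)
{
    int n = 0;

    while (*s)
    {
        s += utf8_mblen((const unsigned char *) s);
        n++;
    }
    return n;
}
```

Without a boundary function like this, byte-oriented like/regex/substr code would happily split a multibyte character in the middle.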
> I have now some time for implement this my suggestion. Or is better
> let down this for 7.2? You are right that it's trivial :-)

I think you should target 7.2.

> Note: My motivate for this is that I have some multi-language DB
> with Web interface and for current version of PG I must maintain
> separate table for transformation "standard" names to PG encoding
> names and vice-versa:-)
>
> Well, I try send some patch.

Thanks. BTW, I'm working on dynamically loading the Unicode conversion functions to decrease the runtime memory requirement. The reasons why I want to do this are:

o they are huge (--enable-unicode-conversion adds ~1MB to the load module size)

o nobody will use all of them at once. For example, most Japanese users are only interested in EUC/SJIS maps.
--
Tatsuo Ishii
On Thu, Aug 16, 2001 at 03:39:28PM +0900, Tatsuo Ishii wrote:
>
> o they are huge (--enable-unicode-conversion will increase ~1MB in the
> load module size)
>
> o nobody will use all of them at once. For example most Japanese users
> are only interested in EUC/SJIS maps.

Good idea.

I have a question: is the PostgreSQL encoding name "KOI8" KOI8-R, KOI8-U, or both? I need it for correct alias setting.

Karel

-- 
Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
> I have a question, the PostgreSQL encoding name "KOI8" is KOI8-R or
> KOI8-U or both? I need it for correct alias setting.

I think it's KOI8-R. Oleg, am I correct?

P.S. I use Makefile.shlib to create each shared object. This way is, I think, handy and portable. However, I need to make lots of subdirs, one for each encoding conversion function. Any suggestions?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > Thanks. BTW, I'm working on for dynamically loading the Unicode > conversion functions to descrease the runtime memory requirement. The > reason why I want to do this is: > o they are huge (--enable-unicode-conversion will increase ~1MB in the > load module size) > o nobody will use all of them at once. For example most Japanese users > are only interested in EUC/SJIS maps. But is it really important? All Unixen that I know of handle process text segments on a page-by-page basis; pages that aren't actually being touched won't get swapped in. Thus, the unused maps will just sit on disk, whether they are part of the main executable or a separate file. I doubt there's any real performance gain to be had by making the maps dynamically loadable. regards, tom lane
On Thu, Aug 16, 2001 at 10:22:48PM +0900, Tatsuo Ishii wrote:
> > I have a question, the PostgreSQL encoding name "KOI8" is KOI8-R or
> > KOI8-U or both? I need it for correct alias setting.
>
> I think it's KOI8-R. Oleg, am I correct?
>
> P.S.
> I use Makefile.shlib to create each shared object. This way is, I
> think, handy and portble. However, I need to make lots of subdirs for
> each encoding conversion function. Any suggestions?

Please make a separate directory for the encoding translation table programs too. The current mb/ is a mix of standard files and files with main()...

Karel

-- 
Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
On Thu, Aug 16, 2001 at 03:39:28PM +0900, Tatsuo Ishii wrote:
> > I have now some time for implement this my suggestion. Or is better
> > let down this for 7.2? You are right that it's trivial :-)
>
> I think you should target for 7.2.
>
> > Note: My motivate for this is that I have some multi-language DB
> > with Web interface and for current version of PG I must maintain
> > separate table for transformation "standard" names to PG encoding
> > names and vice-versa:-)
> >
> > Well, I try send some patch.
>
> Thanks. BTW, I'm working on for dynamically loading the Unicode

Sorry Tatsuo, but I have a question again :-) Why is there a hole between the encoding numbers?

    #define KOI8     16    /* KOI8-R/U */
    #define WIN      17    /* windows-1251 */
    #define ALT      18    /* Alternativny Variant */

    19..31 ?

    #define SJIS     32    /* Shift JIS */
    #define BIG5     33    /* Big5 */
    #define WIN1250  34    /* windows-1250 */

It's trouble for creating arrays with encoding stuff, like:

    pg_encoding_conv_tbl[ ALT ]->to_mic
    pg_encoding_conv_tbl[ SJIS ]->to_mic

Does this hole between 19..31 have some purpose?

Karel

-- 
Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
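[Editorial note] Karel's problem can be seen concretely: with a hole in the ID space, a directly-indexed table must either waste slots 19..31 or remap the sparse IDs to dense indexes. A minimal remapping sketch, using the constants from Karel's mail (the function name `enc_to_index` is made up):

```c
#include <assert.h>

#define KOI8     16
#define WIN      17
#define ALT      18
/* 19..31 unused */
#define SJIS     32
#define BIG5     33
#define WIN1250  34

/* Map a (possibly sparse) encoding ID to a dense table index,
 * or return -1 for IDs that fall in the hole or out of range. */
static int
enc_to_index(int enc)
{
    if (enc >= 0 && enc <= ALT)
        return enc;                     /* 0..18 map directly */
    if (enc >= SJIS && enc <= WIN1250)
        return enc - (SJIS - ALT - 1);  /* skip the 19..31 hole */
    return -1;
}
```

The alternative is simply sizing the array at WIN1250+1 and leaving entries 19..31 unused, which trades a handful of empty slots for not needing any remapping at all.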
On Thu, 16 Aug 2001, Tatsuo Ishii wrote:
> > I have a question, the PostgreSQL encoding name "KOI8" is KOI8-R or
> > KOI8-U or both? I need it for correct alias setting.
>
> I think it's KOI8-R. Oleg, am I correct?

YES

> P.S.
> I use Makefile.shlib to create each shared object. This way is, I
> think, handy and portble. However, I need to make lots of subdirs for
> each encoding conversion function. Any suggestions?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Tatsuo Ishii writes: > I use Makefile.shlib to create each shared object. This way is, I > think, handy and portble. However, I need to make lots of subdirs for > each encoding conversion function. Any suggestions? Given Tom Lane's comment, I think that this would be a wasted effort. Shared objects are normally used for extensibility at runtime, not core memory savings. (This would most likely take more memory in fact, given that the code is larger and you need all the shared object handling infrastructure.) -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
> But is it really important? All Unixen that I know of handle process
> text segments on a page-by-page basis; pages that aren't actually being
> touched won't get swapped in. Thus, the unused maps will just sit on
> disk, whether they are part of the main executable or a separate file.
> I doubt there's any real performance gain to be had by making the maps
> dynamically loadable.

I did some testing on my Linux box (kernel 2.2) and confirmed that no performance degradation is seen with unicode-conversion-enabled postgres. This proves that your theory is correct, at least on Linux. OK, I will make the unicode conversion functionality a default if --enable-multibyte is on.
--
Tatsuo Ishii