Thread: Proposal: CREATE CONVERSION
Here is my proposal for a new CREATE CONVERSION command, which makes it possible to define a new encoding conversion mapping between two encodings on the fly.

The background:

We are getting more and more encoding conversion tables. Up to now they amount to 385352 source lines and over 3MB in compiled form in total. They are statically linked into the backend. I know this itself is not a problem, since modern OSs have smart memory management capabilities and fetch only the necessary pages from disk. However, I'm worried about the unbounded growth of these static tables. I don't think users would love a 50MB PostgreSQL backend load module.

The second problem is more serious. The conversion definitions between certain encodings, such as Unicode and others, are not well defined. For example, there are several conversion tables between Japanese Shift JIS and Unicode. This is because each vendor has its own "special characters" and defines the table so that the conversion fits its purpose.

The solution:

The proposed new CREATE CONVERSION will solve these problems. A particular conversion table is statically linked into a dynamically loaded function, and CREATE CONVERSION tells PostgreSQL that for a conversion from encoding A to encoding B, function C should be used. In this way, conversion tables are no longer statically linked into the backend. Users could also easily define their own conversion tables that best fit their purpose. And needless to say, people could define new conversions that PostgreSQL does not support yet.

Syntax proposal:

    CREATE CONVERSION <conversion name>
        SOURCE <source encoding name>
        DESTINATION <destination encoding name>
        FROM <conversion function name>;

    DROP CONVERSION <conversion name>;

Example usage:

    CREATE OR REPLACE FUNCTION euc_jp_to_utf8(TEXT, TEXT, INTEGER)
        RETURNS INTEGER AS 'euc_jp_to_utf8.so' LANGUAGE 'c';

    CREATE CONVERSION euc_jp_to_utf8
        SOURCE EUC_JP
        DESTINATION UNICODE
        FROM euc_jp_to_utf8;

Implementation:

The implementation would be quite straightforward. Create a new system table, and have CREATE CONVERSION store its info in it. pg_find_encoding_converters() (utils/mb/mbutils.c) and friends need to be modified so that they recognize dynamically defined conversions. Also, psql would need some capability to print conversion definitions.

Comments?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Syntax proposal:
> CREATE CONVERSION <conversion name>
>     SOURCE <source encoding name>
>     DESTINATION <destination encoding name>
>     FROM <conversion function name>

Doesn't a conversion currently require several support functions? How much overhead will you be adding to funnel them all through one function? Basically I'd like to see a spec for the API of the conversion function...

Also, is there anything in SQL99 that we ought to try to be compatible with?

			regards, tom lane
> > CREATE CONVERSION <conversion name>
> >     SOURCE <source encoding name>
> >     DESTINATION <destination encoding name>
> >     FROM <conversion function name>
>
> Doesn't a conversion currently require several support functions?
> How much overhead will you be adding to funnel them all through
> one function?

No, only one function is sufficient. What else do you have in mind?

> Basically I'd like to see a spec for the API of the conversion
> function...

That would be very simple (the previous example I gave was unnecessarily complex). The function signature would look like:

    conversion_function(TEXT) RETURNS TEXT

It receives the source text, converts it, then returns it. That's all.

> Also, is there anything in SQL99 that we ought to try to be
> compatible with?

As far as I know there's no such equivalent in SQL99.
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Doesn't a conversion currently require several support functions?
>> How much overhead will you be adding to funnel them all through
>> one function?

> No, only one function is sufficient. What else do you have in mind?

I see two different functions linked to from each pg_wchar_table entry... although perhaps those are associated with encodings, not with conversions.

>> Basically I'd like to see a spec for the API of the conversion
>> function...

> That would be very simple (the previous example I gave was unnecessarily
> complex). The function signature would look like:
>     conversion_function(TEXT) RETURNS TEXT
> It receives the source text, converts it, then returns it. That's all.

IIRC the existing conversion functions deal in C string pointers and lengths. I'm a little worried about the extra overhead implicit in converting to a TEXT object and back again; that probably means at least two more palloc and memcpy operations. I think you'd be better off sticking to a C-level API, because I really don't believe that anyone is going to code conversion functions in (say) plpgsql.

			regards, tom lane
Tatsuo Ishii wrote:
> Here is my proposal for a new CREATE CONVERSION command, which makes it
> possible to define a new encoding conversion mapping between two
> encodings on the fly.
>
> The background:
>
> We are getting more and more encoding conversion tables. Up to now they
> amount to 385352 source lines and over 3MB in compiled form in total.
> They are statically linked into the backend. I know this itself is not a
> problem, since modern OSs have smart memory management capabilities and
> fetch only the necessary pages from disk. However, I'm worried about the
> unbounded growth of these static tables. I don't think users would love
> a 50MB PostgreSQL backend load module.

Yes, those conversion tables are getting huge in the tarball too:

    $ pwd
    /pg/backend/utils/mb
    $ du
    4       ./CVS
    7       ./Unicode/CVS
    9541    ./Unicode
    15805   .

Look at these two files alone:

    -rw-r--r--  1 postgres  wheel  1427492 Jun 13 04:28 gb18030_to_utf8.map
    -rw-r--r--  1 postgres  wheel  1427492 Jun 13 04:28 utf8_to_gb18030.map

If we can make these loadable, that would be good. What would be really interesting is if we could split these out into a separate directory/project so development on those could take place in an independent way. This would probably stimulate even more encoding options for users.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> I see two different functions linked to from each pg_wchar_table
> entry... although perhaps those are associated with encodings,
> not with conversions.

Yes, those are not directly associated with conversions.

> IIRC the existing conversion functions deal in C string pointers and
> lengths. I'm a little worried about the extra overhead implicit
> in converting to a TEXT object and back again; that probably means at
> least two more palloc and memcpy operations. I think you'd be better
> off sticking to a C-level API, because I really don't believe that
> anyone is going to code conversion functions in (say) plpgsql.

I am worried about that too. But if we stick to a C-level API, how can we define an argument data type suitable for C strings? I don't see such data types. Maybe you are suggesting that we should not use CREATE FUNCTION?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > I am worried about that too. But if we stick a C-level API, how can we > define the argument data type suitable for C string? I don't see such > data types. Maybe you are suggesting that we should not use CREATE > FUNCTION? Well, you'd have to use the same cheat that's used for selectivity estimation functions, triggers, I/O functions and everything else that deals in internal datatypes: declare the function as taking and returning OPAQUE. This is moderately annoying but I don't think there's anything really wrong with it in practice. regards, tom lane
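For illustration, a C-level conversion routine declared with OPAQUE arguments might look roughly like the sketch below. This is only a sketch: the version-1 fmgr calling convention, the function name, and the exact argument layout are assumptions for the example, not a settled API.

    #include "postgres.h"
    #include "fmgr.h"

    PG_FUNCTION_INFO_V1(my_latin1_to_utf8);

    /* Takes source and destination buffers as raw pointers plus a length,
     * instead of TEXT values, so no extra palloc/memcpy is needed. */
    Datum
    my_latin1_to_utf8(PG_FUNCTION_ARGS)
    {
        unsigned char *src  = (unsigned char *) PG_GETARG_POINTER(0);
        unsigned char *dest = (unsigned char *) PG_GETARG_POINTER(1);
        int            len  = PG_GETARG_INT32(2);

        /* ... convert len bytes from src, writing the result into dest ... */

        PG_RETURN_INT32(0);
    }

    /* Declared to SQL with OPAQUE arguments, e.g.:
     *   CREATE FUNCTION my_latin1_to_utf8(opaque, opaque, integer)
     *       RETURNS integer AS 'my_latin1_to_utf8.so' LANGUAGE 'c';
     */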
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I am worried about that too. But if we stick to a C-level API, how can
> > we define an argument data type suitable for C strings? I don't see
> > such data types. Maybe you are suggesting that we should not use
> > CREATE FUNCTION?
>
> Well, you'd have to use the same cheat that's used for selectivity
> estimation functions, triggers, I/O functions and everything else that
> deals in internal datatypes: declare the function as taking and
> returning OPAQUE. This is moderately annoying but I don't think
> there's anything really wrong with it in practice.

Oh, I see.
--
Tatsuo Ishii
Tatsuo Ishii writes:
> > Also, is there anything in SQL99 that we ought to try to be
> > compatible with?
>
> As far as I know there's no such equivalent in SQL99.

Sure:

    11.34 <translation definition>

    Function

    Define a character translation.

    Format

    <translation definition> ::=
        CREATE TRANSLATION <translation name>
            FOR <source character set specification>
            TO <target character set specification>
            FROM <translation source>

    <source character set specification> ::= <character set specification>

    <target character set specification> ::= <character set specification>

    <translation source> ::=
        <existing translation name>
        | <translation routine>

    <existing translation name> ::= <translation name>

    <translation routine> ::= <specific routine designator>

That's pretty much exactly what you are describing.

What would be really cool is if we could somehow reuse the conversion modules provided by the C library and/or the iconv library. For example, I have 176 "modules" under /usr/lib/gconv. They should be useful for something.

--
Peter Eisentraut   peter_e@gmx.net
> Tatsuo Ishii writes:
> > > Also, is there anything in SQL99 that we ought to try to be
> > > compatible with?
> >
> > As far as I know there's no such equivalent in SQL99.
>
> Sure:
>
> 11.34 <translation definition>

I guess you are mixing up SQL99's "translate" and "convert". As far as I know, SQL99's "translation" is exactly a translation, e.g.

    rr) translation: A method of translating characters in one character
    repertoire into characters of the same or a different character
    repertoire.

For example, a certain "translation" might take an input of English text and produce an output of Japanese text (I don't know if we could implement such a translation though :-). On the other hand, "convert" just changes the "form-of-use" (SQL's term, actually equivalent to "encoding"), keeping the character repertoire.
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I guess you are mixing up SQL99's "translate" and "convert".

No, I believe Peter has read the spec correctly. Further down they have

    <character translation> is a function for changing each character of
    a given string according to some many-to-one or one-to-one mapping
    between two not necessarily distinct character sets.

So this is intended as a one-character-at-a-time mapping, not a language translation (which would be far beyond what anyone would expect of a database anyway).

One thing that's really unclear to me is what's the difference between a <character translation> and a <form-of-use conversion>, other than that they didn't provide a syntax for defining new conversions.

			regards, tom lane
Tom Lane writes:
> One thing that's really unclear to me is what's the difference between
> a <character translation> and a <form-of-use conversion>, other than
> that they didn't provide a syntax for defining new conversions.

The standard has this messed up. In part 1, a form-of-use and an encoding are two distinct things that can be applied to a character repertoire (see clause 4.6.2.1), whereas in part 2 the term encoding is used in the definition of form-of-use (clause 3.1.5 r).

When I sort it out, however, I think that what Tatsuo was describing is indeed a form-of-use conversion. Note that in part 2, clause 4.2.2.1, it says about form-of-use conversions:

    It is intended, though not enforced by this part of ISO/IEC 9075, that
    S2 be exactly the same sequence of characters as S1, but encoded
    according to some different form-of-use. A typical use might be to
    convert a character string from two-octet UCS to one-octet Latin1 or
    vice versa.

This seems to match what we're doing.

A character translation does not make this requirement, and it explicitly calls out the possibility of "many-to-one or one-to-one mapping between two not necessarily distinct character sets". I imagine that what this is intended to do is to allow the user to create mappings such as ö -> oe (as is common in German to avoid using characters with diacritic marks), or ö -> o (as one might do in French to achieve the same). In fact, it's a glorified sed command.

So I withdraw my earlier comment. But perhaps the syntax of the proposed command could be aligned with the CREATE TRANSLATION command.

--
Peter Eisentraut   peter_e@gmx.net
> So I withdraw my earlier comment. But perhaps the syntax of the proposed
> command could be aligned with the CREATE TRANSLATION command.

Ok. What about this?

    CREATE CONVERSION <conversion name>
        FOR <encoding name>
        TO <encoding name>
        FROM <conversion routine name>

    DROP CONVERSION <conversion name>

BTW, I wonder if we should invent a new access privilege for conversions.
--
Tatsuo Ishii
Here is a proposal for a new pg_conversion system table. Comments?

/*-------------------------------------------------------------------------
 *
 * pg_conversion.h
 *	  definition of the system "conversion" relation (pg_conversion)
 *	  along with the relation's initial contents.
 *
 *
 * Portions Copyright (c) 1996-2002, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * $Id$
 *
 * NOTES
 *	  the genbki.sh script reads this file and generates .bki
 *	  information from the DATA() statements.
 *
 *-------------------------------------------------------------------------
 */
#ifndef PG_CONVERSION_H
#define PG_CONVERSION_H

/* ----------------
 *	postgres.h contains the system type definitions and the
 *	CATALOG(), BOOTSTRAP and DATA() sugar words so this file
 *	can be read by both genbki.sh and the C compiler.
 * ----------------
 */

/* ----------------------------------------------------------------
 *	pg_conversion definition.
 *
 *	cpp turns this into typedef struct FormData_pg_conversion
 *
 *	conname			name of the conversion
 *	connamespace	namespace which the conversion belongs to
 *	conowner		owner of the conversion
 *	conforencoding	FOR encoding id
 *	contoencoding	TO encoding id
 *	conproc			OID of the conversion proc
 * ----------------------------------------------------------------
 */
CATALOG(pg_conversion)
{
	NameData	conname;
	Oid			connamespace;
	int4		conowner;
	int4		conforencoding;
	int4		contoencoding;
	Oid			conproc;
} FormData_pg_conversion;

/* ----------------
 *	Form_pg_conversion corresponds to a pointer to a tuple with
 *	the format of the pg_conversion relation.
 * ----------------
 */
typedef FormData_pg_conversion *Form_pg_conversion;

/* ----------------
 *	compiler constants for pg_conversion
 * ----------------
 */
#define Natts_pg_conversion					6
#define Anum_pg_conversion_conname			1
#define Anum_pg_conversion_connamespace		2
#define Anum_pg_conversion_conowner			3
#define Anum_pg_conversion_conforencoding	4
#define Anum_pg_conversion_contoencoding	5
#define Anum_pg_conversion_conproc			6

/* ----------------
 *	initial contents of pg_conversion
 * ----------------
 */

/*
 * prototypes for functions in pg_conversion.c
 */
extern Oid	ConversionCreate(const char *conname, Oid connamespace,
							 int32 conowner,
							 int4 conforencoding, int4 contoencoding,
							 Oid conproc);

#endif   /* PG_CONVERSION_H */
On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:
> What would be really cool is if we could somehow reuse the conversion
> modules provided by the C library and/or the iconv library. For example,
                                              ^^^^^^^

 Very good point. Why use our own conversion routines/tables if there is a common library for this?

 The encoding API for PostgreSQL is a really cool idea.

 I'm unsure about having only one argument for the encoding function. What if I want to use one generic function for all encodings (for example, as an API to iconv)? I think a better C interface is:

    encode( TEXT data, NAME from, NAME to );

where from/to are encoding names. The other way is to use some struct that carries this information -- like ARGS in trigger functions.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
> On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:
> > What would be really cool is if we could somehow reuse the conversion
> > modules provided by the C library and/or the iconv library. For example,
>                                               ^^^^^^^
>
> Very good point. Why use our own conversion routines/tables if there is a
> common library for this?

I'm still not sure about the details of the conversion maps used by iconv. Japanese users have enough trouble with the conversion between Unicode and other charsets. This is because there are many variations of the conversion maps provided by vendors. For example, the conversion map used for Unicode and SJIS in PostgreSQL has been carefully designed to minimize the problems described above. Another issue is the availability of iconv among platforms. If we are sure that a particular iconv conversion routine is available on all platforms and the conversion result is good enough, our conversion routine could be replaced by a new one using iconv.

> The encoding API for PostgreSQL is a really cool idea.
>
> I'm unsure about having only one argument for the encoding function. What
> if I want to use one generic function for all encodings (for example, as
> an API to iconv)?

Use a simple wrapper function.
--
Tatsuo Ishii
On Mon, Jul 08, 2002 at 09:59:44PM +0900, Tatsuo Ishii wrote:
> > On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:
> > > What would be really cool is if we could somehow reuse the conversion
> > > modules provided by the C library and/or the iconv library. For example,
> >                                               ^^^^^^^
> >
> > Very good point. Why use our own conversion routines/tables if there is a
> > common library for this?
>
> I'm still not sure about the details of the conversion maps used by
> iconv. Japanese users have enough trouble with the conversion between
> Unicode and other charsets. This is because there are many variations of
> the conversion maps provided by vendors. For example, the conversion map
> used for Unicode and SJIS in PostgreSQL has been carefully designed to
> minimize the problems described above. Another issue is the availability
> of iconv among platforms. If we are sure that a particular iconv
> conversion routine is available on all platforms and the conversion
> result is good enough, our conversion routine could be replaced by a new
> one using iconv.

 This is not a problem if we have some common API. You can use the current conversion tables (maps) and I, for example, can use iconv on my i386/Linux. I don't want to replace the current maps if somebody needs them. I would like an API.

 I see iconv is included in glibc now.

> > I'm unsure about having only one argument for the encoding function. What
> > if I want to use one generic function for all encodings (for example, as
> > an API to iconv)?
>
> Use a simple wrapper function.

 How does this function know the to/from encodings?

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> CATALOG(pg_conversion)
> {
> 	NameData	conname;
> 	Oid			connamespace;
> 	int4		conowner;
> 	int4		conforencoding;
> 	int4		contoencoding;
> 	Oid			conproc;
> } FormData_pg_conversion;

Should use type "regproc" for conproc, I think.

			regards, tom lane
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> BTW, I wonder if we should invent a new access privilege for conversions.

I believe the spec just demands USAGE on the underlying function for the TRANSLATE case, and I don't see why it should be different for CONVERT. (In principle, if we didn't use a C-only API, you could just call the underlying function directly; so there's little point in having protection restrictions different from that case.)

			regards, tom lane
...
> So I withdraw my earlier comment. But perhaps the syntax of the proposed
> command could be aligned with the CREATE TRANSLATION command.

Tatsuo, it seems that we should use SQL99 terminology and commands where appropriate. We do not yet implement the SQL99 forms of character support, and I'm not sure if our current system is modeled to fit the SQL99 framework. Are you suggesting CREATE CONVERSION to avoid infringing on SQL99 syntax to allow us to use that sometime later?

- Thomas
Thomas Lockhart writes:
> Tatsuo, it seems that we should use SQL99 terminology and commands where
> appropriate. We do not yet implement the SQL99 forms of character
> support, and I'm not sure if our current system is modeled to fit the
> SQL99 framework. Are you suggesting CREATE CONVERSION to avoid
> infringing on SQL99 syntax to allow us to use that sometime later?

SQL99 says that the method by which conversions are created is implementation-defined. Tatsuo is defining the implementation.

--
Peter Eisentraut   peter_e@gmx.net
Tatsuo Ishii writes:
> Here is a proposal for a new pg_conversion system table. Comments?

I wonder if the encodings themselves shouldn't be represented in some system table, too. Admittedly, this is nearly orthogonal to the proposed system table, except perhaps for the data type of the two encoding fields.

--
Peter Eisentraut   peter_e@gmx.net
> When you say "We do not yet implement the SQL99 forms of character
> support", I think you mean the ability to specify a per-column (or even
> per-string) charset. I don't think this will happen for 7.3 (or 8.0,
> whatever), but sometime later I would like to make it a reality.

Right.

An aside: I was thinking about this some, from the PoV of using our existing type system to handle this (as you might remember, this is an inclination I've had for quite a while). I think that most things line up fairly well to allow this (and having transaction-enabled features may require it), but do notice that the SQL feature of allowing a different character set for every column *name* does not map particularly well to our underlying structures.

- Thomas
> I believe the spec just demands USAGE on the underlying function for
> the TRANSLATE case, and I don't see why it should be different for
> CONVERT. (In principle, if we didn't use a C-only API, you could
> just call the underlying function directly; so there's little point
> in having protection restrictions different from that case.)

Ok, so:

(1) a CONVERSION can only be dropped by the superuser or its owner.

(2) a grant syntax for CONVERSION is:

    GRANT USAGE ON CONVERSION <conversion_name> to
        {<user_name> | GROUP <group_name> | PUBLIC} [, ...]

--
Tatsuo Ishii
> Tatsuo, it seems that we should use SQL99 terminology and commands where
> appropriate. We do not yet implement the SQL99 forms of character
> support, and I'm not sure if our current system is modeled to fit the
> SQL99 framework. Are you suggesting CREATE CONVERSION to avoid
> infringing on SQL99 syntax to allow us to use that sometime later?

I'm not sure I understand your question, but I would say I would like to follow SQL99 as much as possible. When you say "We do not yet implement the SQL99 forms of character support", I think you mean the ability to specify a per-column (or even per-string) charset. I don't think this will happen for 7.3 (or 8.0, whatever), but sometime later I would like to make it a reality.
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I believe the spec just demands USAGE on the underlying function for
>> the TRANSLATE case, and I don't see why it should be different for
>> CONVERT. (In principle, if we didn't use a C-only API, you could
>> just call the underlying function directly; so there's little point
>> in having protection restrictions different from that case.)

> Ok, so:
> (1) a CONVERSION can only be dropped by the superuser or its owner.

Okay ...

> (2) a grant syntax for CONVERSION is:
> 	GRANT USAGE ON CONVERSION <conversion_name> to
> 		{<user_name> | GROUP <group_name> | PUBLIC} [, ...]

No, I don't think a conversion has any privileges of its own at all. You either have USAGE on the underlying function, or not.

			regards, tom lane
> I've been thinking about this for a while too. What about collation? If
> we add new charsets A and B, and each has 10 collations, then we are
> going to have 20 new types? That seems overkill to me.

Well, afaict all of the operations we would ask of a type we will be required to provide for character sets and collations. So ordering, conversions, operators, index methods, etc etc are all required. It *does* seem like a lot of work, but the type system is specifically designed to do exactly this. Lifting those capabilities out of the type system only to reimplement them elsewhere seems all trouble with no upside.

Perhaps the current concept of "binary compatible types" could help reduce the complexity, if it were made extensible, which it needs anyway. But in most cases the character set/collation pair is a unique combination, with a limited set of possibilities for other character set/collation pairs with equivalent forms of use, which would keep us from being able to reuse pieces anyway.

For most installations, we would install just those character sets the installation/database requires, so in practice the database size need not grow much beyond what it already is. And we could have conventions on how functions and operators are named for a character set and/or collation, so we could auto-generate the SQL definitions given an implementation which meets a template standard.

Hmm, an aside which might be relevant: I've been looking at the "national character string" syntax (you know, the N'string' convention) and at the binary and hex string syntax (B'101010' and X'AB1D', as examples) and would like to implement them in the lexer and parser by having the string preceded with a type identifier, as though they were something like

    NATIONAL CHARACTER 'string'
    BIN '101010'
    HEX 'AB1D'

where both BIN and HEX result in the *same* underlying data type once ingested (or at least a reasonable facsimile). I won't be allowed to create two data types with the same type OID, but maybe if I assign them to be binary compatible then I won't have to flesh out the hex data type but only provide an input and output function.

- Thomas
> > If so, what about the "coercibility" property?
> > The standard defines four distinct coercibility properties. So in
> > my example above, you are actually going to define 80 new types?
> > (also a collation could be either "PAD SPACE" or "NO PAD", so you
> > might have 160 new types).

Well, yes I suppose so. The point is that these relationships *must be defined anyway*. Allowed and forbidden conversions must be defined, collation order must be defined, indexing operations must be defined, etc etc etc. In fact, everything typically associated with a type must be defined, including the allowed conversions between other types (character sets/collations).

So, how are we going to do this *in a general way* without carrying the infrastructure of a (the) type system along with it? What would we be able to leave out or otherwise get for free if we use another mechanism? And is that mechanism fundamentally simpler than (re)using the type system that we already have?

- Thomas
> > Use a simple wrapper function.
>
> How does this function know the to/from encodings?

For example, if you want to define a function for LATIN1 to UNICODE conversion, the function would look like:

    function_for_LATIN1_to_UTF-8(from_string opaque, to_string opaque,
                                 length integer)
    {
        :
        :
        generic_function_using_iconv(from_str, to_str,
                                     "ISO-8859-1", "UTF-8", length);
    }

    CREATE FUNCTION function_for_LATIN1_to_UTF-8(opaque, opaque, integer)
        RETURNS integer;
    CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE'
        FROM function_for_LATIN1_to_UTF-8;
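A generic iconv-based helper of the sort referenced above might be sketched like this (illustrative only; it assumes the POSIX iconv(3) API, the helper name is made up, and buffer sizing/error handling are simplified):

    #include <stddef.h>
    #include <iconv.h>

    /* Convert "length" bytes from from_str into to_str using iconv(3).
     * Assumes the caller provides a destination buffer at least four
     * times the input size; returns 0 on success, -1 on failure. */
    static int
    generic_function_using_iconv(const char *from_str, char *to_str,
                                 const char *from_charset,
                                 const char *to_charset, int length)
    {
        iconv_t cd = iconv_open(to_charset, from_charset);
        char   *in = (char *) from_str;
        char   *out = to_str;
        size_t  inbytes = (size_t) length;
        size_t  outbytes = inbytes * 4;

        if (cd == (iconv_t) -1)
            return -1;

        if (iconv(cd, &in, &inbytes, &out, &outbytes) == (size_t) -1)
        {
            iconv_close(cd);
            return -1;
        }
        iconv_close(cd);
        *out = '\0';            /* null-terminate the result */
        return 0;
    }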
> An aside: I was thinking about this some, from the PoV of using our
> existing type system to handle this (as you might remember, this is an
> inclination I've had for quite a while). I think that most things line
> up fairly well to allow this (and having transaction-enabled features
> may require it), but do notice that the SQL feature of allowing a
> different character set for every column *name* does not map
> particularly well to our underlying structures.

I've been thinking about this for a while too. What about collation? If we add new charsets A and B, and each has 10 collations, then we are going to have 20 new types? That seems overkill to me.
--
Tatsuo Ishii
> > (1) a CONVERSION can only be dropped by the superuser or its owner.
>
> Okay ...
>
> > (2) a grant syntax for CONVERSION is:
> >
> > 	GRANT USAGE ON CONVERSION <conversion_name> to
> > 		{<user_name> | GROUP <group_name> | PUBLIC} [, ...]
>
> No, I don't think a conversion has any privileges of its own at all.
> You either have USAGE on the underlying function, or not.

I see.
--
Tatsuo Ishii
> > Here is a proposal for a new pg_conversion system table. Comments?
>
> I wonder if the encodings themselves shouldn't be represented in some
> system table, too. Admittedly, this is nearly orthogonal to the proposed
> system table, except perhaps for the data type of the two encoding
> fields.

That would be ideal, but I think that would happen at the same time as CREATE CHARACTER SET is implemented.
--
Tatsuo Ishii
On Tue, Jul 09, 2002 at 10:07:11AM +0900, Tatsuo Ishii wrote:
> > > Use a simple wrapper function.
> >
> > How does this function know the to/from encodings?
>
> For example, if you want to define a function for LATIN1 to UNICODE
> conversion, the function would look like:
>
>     function_for_LATIN1_to_UTF-8(from_string opaque, to_string opaque,
>                                  length integer)
>     {
>         :
>         :
>         generic_function_using_iconv(from_str, to_str,
>                                      "ISO-8859-1", "UTF-8", length);
>     }
>
>     CREATE FUNCTION function_for_LATIN1_to_UTF-8(opaque, opaque, integer)
>         RETURNS integer;
>     CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE'
>         FROM function_for_LATIN1_to_UTF-8;

 Hmm, but it requires defining a "function_for_..." for each conversion. A trigger function, for example, I needn't define for each table; I can use only one PostgreSQL function for an arbitrary table.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
> > If so, what about the "coercibility" property?
> > The standard defines four distinct coercibility properties. So in
> > my example above, you are actually going to define 80 new types?
> > (also a collation could be either "PAD SPACE" or "NO PAD", so you
> > might have 160 new types).
>
> Well, yes I suppose so. The point is that these relationships *must be
> defined anyway*. Allowed and forbidden conversions must be defined,
> collation order must be defined, indexing operations must be defined,
> etc etc etc. In fact, everything typically associated with a type must
> be defined, including the allowed conversions between other types
> (character sets/collations).
>
> So, how are we going to do this *in a general way* without carrying the
> infrastructure of a (the) type system along with it? What would we be
> able to leave out or otherwise get for free if we use another mechanism?
> And is that mechanism fundamentally simpler than (re)using the type
> system that we already have?

Well, I think charset/collation/coercibility/pad are all string-data-type-specific properties, not common to any other data types. So it seems more appropriate for the type system not to have such type-specific knowledge. For example, S1 < S2 should raise an error if S1 has "no collating properties" and S2 has "implicit collating properties", while it is ok if S1 has "no collating properties" and S2 has "explicit collating properties". It would be very hard for the type system to handle this kind of case, since it requires special knowledge about the string data types.

Alternative? Why don't we have these properties in the string data itself? (probably we do not need to have them on disk storage). The existing text data type has length + data. I suggest extending it like:

    length + charset + collation + pad + coercibility + data

With this, the above example could be easily handled by the < operator. For indexes, maybe we could dynamically replace the varstr_cmp function according to the collation, though I have not actually examined my idea closely.
--
Tatsuo Ishii
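As a rough illustration of the extended in-memory string representation suggested above (a sketch only; the struct name, field names, and field widths are hypothetical, not an existing PostgreSQL structure):

    #include "postgres.h"

    typedef struct AnnotatedText
    {
        int32   length;         /* total length, as in an ordinary varlena */
        int16   charset;        /* character set (encoding) id */
        int16   collation;      /* collation id */
        uint8   pad;            /* PAD SPACE or NO PAD */
        uint8   coercibility;   /* none / implicit / explicit / coercible */
        char    data[1];        /* string bytes follow */
    } AnnotatedText;

    /* An operator such as "<" could inspect the coercibility and collation
     * fields of both operands before comparing, e.g. raising an error when
     * one side has no collating properties and the other only implicit
     * ones, without the type system itself knowing about collations. */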
SQL99 allows on-the-fly encoding conversion:

    CONVERT('aaa' USING myconv)

So there could be more than one conversion for a particular encoding pair. This leads to an ambiguity about which "default" conversion is used for the automatic frontend/backend encoding conversion. Can we add a flag indicating that a conversion is the "default" one? The newly proposed syntax would be:

    CREATE CONVERSION <conversion name>
        FOR <source encoding name>
        TO <destination encoding name>
        FROM <conversion function name>
        [DEFAULT]

Comments?
--
Tatsuo Ishii
> > I've been thinking about this for a while too. What about collation? If
> > we add new charsets A and B, and each has 10 collations, then we are
> > going to have 20 new types? That seems overkill to me.
>
> Well, afaict all of the operations we would ask of a type we will be
> required to provide for character sets and collations. So ordering,
> conversions, operators, index methods, etc etc are all required. It
> *does* seem like a lot of work, but the type system is specifically
> designed to do exactly this. Lifting those capabilities out of the type
> system only to reimplement them elsewhere seems all trouble with no
> upside.

If so, what about the "coercibility" property? The standard defines four distinct coercibility properties. So in my example above, you are actually going to define 80 new types? (also a collation could be either "PAD SPACE" or "NO PAD", so you might have 160 new types).
--
Tatsuo Ishii
On Tue, 2002-07-09 at 03:47, Tatsuo Ishii wrote:
> > An aside: I was thinking about this some, from the PoV of using our
> > existing type system to handle this (as you might remember, this is an
> > inclination I've had for quite a while). I think that most things line
> > up fairly well to allow this (and having transaction-enabled features
> > may require it), but do notice that the SQL feature of allowing a
> > different character set for every column *name* does not map
> > particularly well to our underlying structures.
>
> I've been thinking about this for a while too. What about collation? If
> we add new charsets A and B, and each has 10 collations, then we are
> going to have 20 new types? That seems overkill to me.

Can't we do all collating in Unicode and convert charsets A and B to and from it?

I would even recommend going a step further and storing all 'national' character sets in Unicode.

--------------
Hannu
Thomas Lockhart writes:
> An aside: I was thinking about this some, from the PoV of using our
> existing type system to handle this (as you might remember, this is an
> inclination I've had for quite a while). I think that most things line
> up fairly well to allow this (and having transaction-enabled features
> may require it), but do notice that the SQL feature of allowing a
> different character set for every column *name* does not map
> particularly well to our underlying structures.

The more I think about it, the more I come to the conclusion that the SQL framework for "character sets" is both bogus and a red herring. (And it begins with figuring out exactly what a character set is, as opposed to a form-of-use, a.k.a.(?) encoding, but let's ignore that.)

The ability to store each column value in a different encoding sounds interesting, because it allows you to create tables such as

    product_id | product_name_en | product_name_kr | product_name_jp

but you might as well create a table such as

    product_id | lang | product_name

with product_name in Unicode, and have a more extensible application that way, too.

I think it's fine to have the encoding fixed for the entire database. It sure makes coding easier. If you want to be international, you use Unicode. If not, you can "optimize" your database by using a more efficient encoding. In fact, I think we should consider making UTF-8 the default encoding sometime.

The real issue is the collation. But the collation is a small subset of the whole locale/character set gobbledygook. Standardized collation rules in standardized forms exist. Finding/creating routines to interpret and apply them should be the focus. SQL's notion of funneling the decision about which collation rule to apply through the character sets is bogus. It's impossible to pick a default collation rule for many character sets without applying bias.

--
Peter Eisentraut   peter_e@gmx.net
Hannu Krosing writes:
> Can't we do all collating in Unicode and convert charsets A and B to and
> from it?
>
> I would even recommend going a step further and storing all 'national'
> character sets in Unicode.

Sure. However, Tatsuo maintains that the customary Japanese character sets don't map very well to Unicode. Personally, I believe that this is an issue that should be fixed, not avoided, but I don't understand the issues well enough.

--
Peter Eisentraut   peter_e@gmx.net
On Wed, 10 Jul 2002 08:21, Peter Eisentraut wrote:
> Hannu Krosing writes:
...
> > I would even recommend going a step further and storing all 'national'
> > character sets in Unicode.
>
> Sure. However, Tatsuo maintains that the customary Japanese character
> sets don't map very well to Unicode. Personally, I believe that this is
> an issue that should be fixed, not avoided, but I don't understand the
> issues well enough.

Presumably improving the Unicode support to cover the full UTF32 (or UCS4) range would help with this. Last time I checked, PostgreSQL only supports the UCS2 subset of Unicode, i.e. 16 bits. From the Unicode propaganda I've read, it seems that one of the main goals of the expansion of the range beyond 16 bits was to answer the complaints of Japanese users.

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/
http://www4.tpg.com.au/users/rita_tim/
On Wed, 10 Jul 2002, Peter Eisentraut wrote:
> Sure. However, Tatsuo maintains that the customary Japanese character
> sets don't map very well to Unicode. Personally, I believe that this is
> an issue that should be fixed, not avoided, but I don't understand the
> issues well enough.

I hear this all the time, but I have yet to have someone show me what, in ISO-2022-JP, EUC-JP or SJIS, cannot be transparently translated into Unicode and back.

cjs
--
Curt Sampson  <cjs@cynic.net>  +81 90 7737 2974  http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
> > For example, if you want to define a function for LATIN1 to UNICODE
> > conversion, the function would look like:
> >
> >     function_for_LATIN1_to_UTF-8(from_string opaque, to_string opaque,
> >                                  length integer)
> >     {
> >         :
> >         :
> >         generic_function_using_iconv(from_str, to_str,
> >                                      "ISO-8859-1", "UTF-8", length);
> >     }
> >
> >     CREATE FUNCTION function_for_LATIN1_to_UTF-8(opaque, opaque, integer)
> >         RETURNS integer;
> >     CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE'
> >         FROM function_for_LATIN1_to_UTF-8;
>
> Hmm, but it requires defining a "function_for_..." for each conversion.
> A trigger function, for example, I needn't define for each table; I can
> use only one PostgreSQL function for an arbitrary table.

I don't think this is a big problem, IMO.

However, thinking about it more, I came to the conclusion that passing encoding ids would be a good thing. With the encoding id parameters, the function could check whether it is called with the correct encodings, and this would prevent disasters. New interface proposal:

    pgconv(
        INTEGER,    -- source encoding id
        INTEGER,    -- destination encoding id
        OPAQUE,     -- source string (null terminated C string)
        OPAQUE,     -- destination string (null terminated C string)
        INTEGER     -- source string length
    ) returns INTEGER;  -- dummy. returns nothing, actually.

--
Tatsuo Ishii
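A conversion function following this interface might be sketched in C as below. This is a sketch only: it assumes the version-1 fmgr calling convention and the encoding id constants from mb/pg_wchar.h; the encoding-id sanity check is the point of the example, not a committed implementation.

    #include "postgres.h"
    #include "fmgr.h"
    #include "mb/pg_wchar.h"

    PG_FUNCTION_INFO_V1(euc_jp_to_utf8);

    Datum
    euc_jp_to_utf8(PG_FUNCTION_ARGS)
    {
        int    src_encoding  = PG_GETARG_INT32(0);
        int    dest_encoding = PG_GETARG_INT32(1);
        char  *src  = (char *) PG_GETARG_POINTER(2);
        char  *dest = (char *) PG_GETARG_POINTER(3);
        int    len  = PG_GETARG_INT32(4);

        /* refuse to run if registered for the wrong encoding pair */
        if (src_encoding != PG_EUC_JP || dest_encoding != PG_UTF8)
            elog(ERROR, "euc_jp_to_utf8: called with wrong encodings");

        /* ... convert len bytes from src into dest here ... */

        PG_RETURN_INT32(0);     /* dummy return value */
    }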
On Thu, Jul 11, 2002 at 03:37:49PM +0900, Tatsuo Ishii wrote:
> > > CREATE FUNCTION function_for_LATIN1_to_UTF-8(opaque, opaque, integer)
> > >     RETURNS integer;
> > > CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE'
> > >     FROM function_for_LATIN1_to_UTF-8;
> >
> > Hmm, but it requires defining a "function_for_..." for each conversion.
> > A trigger function, for example, I needn't define for each table; I can
> > use only one PostgreSQL function for an arbitrary table.
>
> I don't think this is a big problem, IMO.
>
> However, thinking about it more, I came to the conclusion that passing
> encoding ids would be a good thing. With the encoding id parameters, the
> function could check whether it is called with the correct encodings, and
> this would prevent disasters. New interface proposal:

 OK.

>     pgconv(
>         INTEGER,    -- source encoding id
>         INTEGER,    -- destination encoding id

 Where/how is the mapping between encoding id and encoding name described? (Maybe I'm overlooking something :-) I expect the new encoding system will be extendable and the list of encodings will not be hardcoded like now (extendable = add a new encoding without a PostgreSQL rebuild).

 BTW, the client side needs routines for working with encoding names too (pg_char_to_encoding()). Hmm... it can't be extendable, or can it?

>         OPAQUE,     -- source string (null terminated C string)
>         OPAQUE,     -- destination string (null terminated C string)
>         INTEGER     -- source string length
>     ) returns INTEGER;  -- dummy. returns nothing, actually.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
> Where/how is the mapping between encoding id and encoding name
> described? (Maybe I'm overlooking something :-) I expect the new
> encoding system will be extendable and the list of encodings will not
> be hardcoded like now (extendable = add a new encoding without a
> PostgreSQL rebuild).

User-defined charsets (encodings) are under discussion and I believe they will not happen for 7.3.

> BTW, the client side needs routines for working with encoding names too
> (pg_char_to_encoding()). Hmm... it can't be extendable, or can it?

pg_char_to_encoding() is already in libpq. Or am I missing something?
--
Tatsuo Ishii
On Thu, Jul 11, 2002 at 05:26:01PM +0900, Tatsuo Ishii wrote:
> > Where/how is the mapping between encoding id and encoding name
> > described? (Maybe I'm overlooking something :-) I expect the new
> > encoding system will be extendable and the list of encodings will not
> > be hardcoded like now (extendable = add a new encoding without a
> > PostgreSQL rebuild).
>
> User-defined charsets (encodings) are under discussion and I believe
> they will not happen for 7.3.
>
> > BTW, the client side needs routines for working with encoding names too
> > (pg_char_to_encoding()). Hmm... it can't be extendable, or can it?
>
> pg_char_to_encoding() is already in libpq. Or am I missing something?

 It works with an encoding table (pg_enc2name_tbl) and it's compiled into the backend and the client too. It means the number of encodings cannot change after compilation, and you (the user) can't add a new encoding without changing pg_enc2name_tbl[]. I originally thought we could add new encodings on-the-fly in 7.3 :-) You're right.

 IMHO implementing "user-defined charsets (encodings)" will be a problem for the current libpq design.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
> > pg_char_to_encoding() is already in libpq. Or am I missing something?
>
> It works with an encoding table (pg_enc2name_tbl) and it's compiled
> into the backend and the client too. It means the number of encodings
> cannot change after compilation, and you (the user) can't add a new
> encoding without changing pg_enc2name_tbl[]. I originally thought we
> could add new encodings on-the-fly in 7.3 :-) You're right.
>
> IMHO implementing "user-defined charsets (encodings)" will be a problem
> for the current libpq design.

No, it's not a libpq problem, but a more general "client/server" problem IMO. It's very hard to share dynamically created objects (info) effectively between client and server.
--
Tatsuo Ishii
On Thu, Jul 11, 2002 at 05:52:18PM +0900, Tatsuo Ishii wrote:
> > > pg_char_to_encoding() is already in libpq. Or am I missing something?
> >
> > It works with an encoding table (pg_enc2name_tbl) and it's compiled
> > into the backend and the client too. It means the number of encodings
> > cannot change after compilation, and you (the user) can't add a new
> > encoding without changing pg_enc2name_tbl[]. I originally thought we
> > could add new encodings on-the-fly in 7.3 :-) You're right.
> >
> > IMHO implementing "user-defined charsets (encodings)" will be a problem
> > for the current libpq design.
>
> No, it's not a libpq problem, but a more general "client/server" problem
> IMO. It's very hard to share dynamically created objects (info)
> effectively between client and server.

 IMHO the dynamic objects should be kept on the server, and the client must ask the server for the information it wants.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
> > No, it's not a libpq problem, but a more general "client/server" problem
> > IMO. It's very hard to share dynamically created objects (info)
> > effectively between client and server.
>
> IMHO the dynamic objects should be kept on the server, and the client
> must ask the server for the information it wants.

I agree with you. However, the real problem is how fast it could be. For example, pg_mblen() is called for each word processed by libpq to determine the byte length of the word. If each call to pg_mblen() accessed the backend, the performance might be unacceptably slow.
--
Tatsuo Ishii
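To illustrate the call pattern: client-side code typically steps through a string one character at a time, making one length-function call per character, so turning each call into a server round trip would be painful. A sketch, assuming libpq's PQmblen(string, encoding id) interface (some libpq versions declare the string argument as unsigned char *, so a cast may be needed):

    #include "libpq-fe.h"

    /* Count characters in a multibyte string by stepping one character
     * at a time -- one PQmblen() call per character. */
    static int
    count_chars(const char *s, int encoding)
    {
        int     n = 0;

        while (*s)
        {
            s += PQmblen(s, encoding);  /* byte length of this character */
            n++;
        }
        return n;
    }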
On Thu, Jul 11, 2002 at 06:30:48PM +0900, Tatsuo Ishii wrote:
> > > No, it's not a libpq problem, but a more general "client/server"
> > > problem IMO. It's very hard to share dynamically created objects
> > > (info) effectively between client and server.
> >
> > IMHO the dynamic objects should be kept on the server, and the client
> > must ask the server for the information it wants.
>
> I agree with you. However, the real problem is how fast it could be. For
> example, pg_mblen() is called for each word processed by libpq to
> determine the byte length of the word. If each call to pg_mblen()
> accessed the backend, the performance might be unacceptably slow.

 It must load all relevant information about the actual encoding(s) and cache it in libpq.

 IMHO basic encoding information like name and id is not a problem. PQmblen() is the big problem. Strange question: is PQmblen() really needed? I see it's used for result printing, but why doesn't the backend mark the size of each field (word) in the result? If the backend knows the size of the data well, why not send this information to the client together with the data?

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
> IMHO basic encoding information like name and id is not a problem.
> PQmblen() is the big problem. Strange question: is PQmblen() really
> needed? I see it's used for result printing, but why doesn't the backend
> mark the size of each field (word) in the result? If the backend knows
> the size of the data well, why not send this information to the client
> together with the data?

PQmblen() is used by psql in many places. It is used for parsing query texts supplied by the user, not only for data sent from the backend.
--
Tatsuo Ishii