Thread: Proposal: CREATE CONVERSION

Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
Here is my proposal for a new CREATE CONVERSION command, which makes it
possible to define a new encoding conversion mapping between two encodings
on the fly.

The background:

We are getting more and more encoding conversion tables. By now they
amount to 385352 source lines and over 3MB in compiled form in
total. They are statically linked into the backend. I know this
itself is not a problem, since modern OSs have smart memory management
capabilities to fetch only the necessary pages from disk. However, I'm
worried about the unbounded growth of these static tables.  I think
users won't love a 50MB PostgreSQL backend load module.

The second problem is more serious. The conversion definitions between
certain encodings, such as Unicode and others, are not well
defined. For example, there are several conversion tables between Japanese
Shift JIS and Unicode. This is because each vendor has its own
"special characters" and defines the table so that the conversion
fits its purpose.

The solution:

The proposed new CREATE CONVERSION will solve these problems. A
particular conversion table is statically linked into a dynamically
loaded function, and CREATE CONVERSION tells PostgreSQL that if
a conversion from encoding A to encoding B is requested, then function C
should be used. In this way, conversion tables are no longer statically
linked into the backend.

Users could also easily define their own conversion tables that best
fit their purpose. And needless to say, people could define
new conversions which PostgreSQL does not support yet.

Syntax proposal:

CREATE CONVERSION <conversion name>
       SOURCE <source encoding name>
       DESTINATION <destination encoding name>
       FROM <conversion function name>;

DROP CONVERSION <conversion name>;

Example usage:

CREATE OR REPLACE FUNCTION euc_jp_to_utf8(TEXT, TEXT, INTEGER)
       RETURNS INTEGER AS 'euc_jp_to_utf8.so' LANGUAGE 'c';

CREATE CONVERSION euc_jp_to_utf8
       SOURCE EUC_JP DESTINATION UNICODE
       FROM euc_jp_to_utf8;

Implementation:

Implementation would be quite straightforward. Create a new system
table, and have CREATE CONVERSION store its info there.
pg_find_encoding_converters() (utils/mb/mbutils.c) and friends need
to be modified so that they recognize dynamically defined conversions.
Also, psql would need some capability to print conversion definition
info.

Comments?
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Syntax proposal:
> CREATE CONVERSION <conversion name>
>        SOURCE <source encoding name>
>        DESTINATION <destination encoding name>
>        FROM <conversion function name>

Doesn't a conversion currently require several support functions?
How much overhead will you be adding to funnel them all through
one function?

Basically I'd like to see a spec for the API of the conversion
function...

Also, is there anything in SQL99 that we ought to try to be
compatible with?
        regards, tom lane




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> > CREATE CONVERSION <conversion name>
> >        SOURCE <source encoding name>
> >        DESTINATION <destination encoding name>
> >        FROM <conversion function name>
> 
> Doesn't a conversion currently require several support functions?
> How much overhead will you be adding to funnel them all through
> one function?

No, only one function is sufficient. What else are you thinking of?

> Basically I'd like to see a spec for the API of the conversion
> function...

That would be very simple (the previous example I gave was unnecessarily
complex). The function signature would look like:

conversion_function(TEXT) RETURNS TEXT

It receives the source text, converts it, and returns it. That's all.
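
For illustration, a minimal sketch (not from the thread) of what such a
TEXT-in, TEXT-out conversion function could look like with the fmgr
version-1 interface; the name noop_conversion is hypothetical, and the
body merely copies its input where a real converter would rewrite the
bytes:

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(noop_conversion);

Datum
noop_conversion(PG_FUNCTION_ARGS)
{
    text   *src = PG_GETARG_TEXT_P(0);
    int     len = VARSIZE(src) - VARHDRSZ;
    text   *dst = (text *) palloc(VARHDRSZ + len);

    VARATT_SIZEP(dst) = VARHDRSZ + len;         /* set the result size */
    memcpy(VARDATA(dst), VARDATA(src), len);    /* real conversion goes here */

    PG_RETURN_TEXT_P(dst);
}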

> Also, is there anything in SQL99 that we ought to try to be
> compatible with?

As far as I know there's no equivalent in SQL99.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Doesn't a conversion currently require several support functions?
>> How much overhead will you be adding to funnel them all through
>> one function?

> No, only one function is sufficient. What else are you thinking of?

I see two different functions linked to from each pg_wchar_table
entry... although perhaps those are associated with encodings
not with conversions.

>> Basically I'd like to see a spec for the API of the conversion
>> function...

> That would be very simple (the previous example I gave was unnecessarily
> complex). The function signature would look like:
> conversion_function(TEXT) RETURNS TEXT
> It receives the source text, converts it, and returns it. That's all.

IIRC the existing conversion functions deal in C string pointers and
lengths.  I'm a little worried about the extra overhead implicit
in converting to a TEXT object and back again; that probably means at
least two more palloc and memcpy operations.  I think you'd be better
off sticking to a C-level API, because I really don't believe that
anyone is going to code conversion functions in (say) plpgsql.
        regards, tom lane




Re: Proposal: CREATE CONVERSION

From
Bruce Momjian
Date:
Tatsuo Ishii wrote:
> Here is my proposal for a new CREATE CONVERSION command, which makes it
> possible to define a new encoding conversion mapping between two encodings
> on the fly.
> 
> The background:
> 
> We are getting more and more encoding conversion tables. By now they
> amount to 385352 source lines and over 3MB in compiled form in
> total. They are statically linked into the backend. I know this
> itself is not a problem, since modern OSs have smart memory management
> capabilities to fetch only the necessary pages from disk. However, I'm
> worried about the unbounded growth of these static tables.  I think
> users won't love a 50MB PostgreSQL backend load module.

Yes, those conversion tables are getting huge in the tarball too:

    $ pwd
    /pg/backend/utils/mb
    $ du
    4       ./CVS
    7       ./Unicode/CVS
    9541    ./Unicode
    15805   .

Look at these two file alone:

-rw-r--r--  1 postgres  wheel  1427492 Jun 13 04:28 gb18030_to_utf8.map
-rw-r--r--  1 postgres  wheel  1427492 Jun 13 04:28 utf8_to_gb18030.map

If we can make these loadable, that would be good.  What would be really
interesting is if we could split these out into a separate
directory/project so development on those could take place in an
independent way.  This would probably stimulate even more encoding
options for users.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> I see two different functions linked to from each pg_wchar_table
> entry... although perhaps those are associated with encodings
> not with conversions.

Yes, those are not directly associated with conversions.

> IIRC the existing conversion functions deal in C string pointers and
> lengths.  I'm a little worried about the extra overhead implicit
> in converting to a TEXT object and back again; that probably means at
> least two more palloc and memcpy operations.  I think you'd be better
> off sticking to a C-level API, because I really don't believe that
> anyone is going to code conversion functions in (say) plpgsql.

I am worried about that too. But if we stick to a C-level API, how can we
define an argument data type suitable for C strings? I don't see such a
data type. Maybe you are suggesting that we should not use CREATE
FUNCTION?
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I am worried about that too. But if we stick to a C-level API, how can we
> define an argument data type suitable for C strings? I don't see such a
> data type. Maybe you are suggesting that we should not use CREATE
> FUNCTION?

Well, you'd have to use the same cheat that's used for selectivity
estimation functions, triggers, I/O functions and everything else that
deals in internal datatypes: declare the function as taking and
returning OPAQUE.  This is moderately annoying but I don't think
there's anything really wrong with it in practice.
        regards, tom lane
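
For concreteness, a sketch (not from the thread) of what the OPAQUE cheat
looks like in practice: the function is declared in SQL as taking
(opaque, opaque, integer), while at the C level the opaque arguments are
plain byte pointers. All names here are hypothetical:

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(my_c_level_conversion);

/*
 * The SQL-level declaration would be roughly:
 *   CREATE FUNCTION my_c_level_conversion(opaque, opaque, integer)
 *   RETURNS integer AS 'my_conversion.so' LANGUAGE 'c';
 */
Datum
my_c_level_conversion(PG_FUNCTION_ARGS)
{
    unsigned char *src  = (unsigned char *) PG_GETARG_POINTER(0);
    unsigned char *dest = (unsigned char *) PG_GETARG_POINTER(1);
    int            len  = PG_GETARG_INT32(2);

    memcpy(dest, src, len);     /* a real converter rewrites the bytes here */
    dest[len] = '\0';

    PG_RETURN_INT32(0);         /* dummy return value */
}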




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > I am worried about that too. But if we stick to a C-level API, how can we
> > define an argument data type suitable for C strings? I don't see such a
> > data type. Maybe you are suggesting that we should not use CREATE
> > FUNCTION?
> 
> Well, you'd have to use the same cheat that's used for selectivity
> estimation functions, triggers, I/O functions and everything else that
> deals in internal datatypes: declare the function as taking and
> returning OPAQUE.  This is moderately annoying but I don't think
> there's anything really wrong with it in practice.

Oh, I see.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> > Also, is there anything in SQL99 that we ought to try to be
> > compatible with?
>
> As far as I know there's no equivalent in SQL99.

Sure:
        11.34  <translation definition>
        Function
        Define a character translation.
        Format
        <translation definition> ::=
             CREATE TRANSLATION <translation name>
               FOR <source character set specification>
                 TO <target character set specification>
               FROM <translation source>

        <source character set specification> ::= <character set specification>

        <target character set specification> ::= <character set specification>

        <translation source> ::=
               <existing translation name>
             | <translation routine>

        <existing translation name> ::= <translation name>

        <translation routine> ::= <specific routine designator>


That's pretty much exactly what you are describing.

What would be really cool is if we could somehow reuse the conversion
modules provided by the C library and/or the iconv library.  For example,
I have 176 "modules" under /usr/lib/gconv.  They should be useful for
something.

-- 
Peter Eisentraut   peter_e@gmx.net





Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> Tatsuo Ishii writes:
> 
> > > Also, is there anything in SQL99 that we ought to try to be
> > > compatible with?
> >
> > As far as I know there's no equivalent in SQL99.
> 
> Sure:
> 
>          11.34  <translation definition>

I guess you are mixing up SQL99's "translate" and "convert".

As far as I know, SQL99's "translation" is exactly a translation, e.g.

       rr) translation: A method of translating characters in one character
           repertoire into characters of the same or a different character
           repertoire.

For example, a certain "translation" might take an input of English text
and produce an output of Japanese text (I don't know if we could
implement such a translation, though :-).

On the other hand "convert" just changes the "form-of-use" (SQL's
term, actually equivalent to "encoding"), keeping the character
repertoire.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I guess you are mixing up SQL99's "translate" and "convert".

No, I believe Peter has read the spec correctly.  Further down they have
        <character translation> is a function for changing each character
        of a given string according to some many-to-one or one-to-one
        mapping between two not necessarily distinct character sets.

So this is intended as a one-character-at-a-time mapping, not a language
translation (which would be far beyond what anyone would expect of a
database anyway).

One thing that's really unclear to me is what's the difference between
a <character translation> and a <form-of-use conversion>, other than
that they didn't provide a syntax for defining new conversions.
        regards, tom lane




Re: Proposal: CREATE CONVERSION

From
Peter Eisentraut
Date:
Tom Lane writes:

> One thing that's really unclear to me is what's the difference between
> a <character translation> and a <form-of-use conversion>, other than
> that they didn't provide a syntax for defining new conversions.

The standard has this messed up.  In part 1, a form-of-use and an encoding
are two distinct things that can be applied to a character repertoire (see
clause 4.6.2.1), whereas in part 2 the term encoding is used in the
definition of form-of-use (clause 3.1.5 r).

When I sort it out, however, I think that what Tatsuo was describing is
indeed a form-of-use conversion.  Note that in part 2, clause 4.2.2.1, it
says about form-of-use conversions,
   It is intended, though not enforced by this part of ISO/IEC 9075,
   that S2 be exactly the same sequence of characters as S1, but encoded
   according to some different form-of-use. A typical use might be to
   convert a character string from two-octet UCS to one-octet Latin1
   or vice versa.

This seems to match what we're doing.

A character translation does not make this requirement and it explicitly
calls out the possibility of "many-to-one or one-to-one mapping between
two not necessarily distinct character sets".  I imagine that what this is
intended to do is to allow the user to create mappings such as ö
-> oe (as is common in German to avoid using characters with diacritic
marks), or ö -> o (as one might do in French to achieve the same).  In
fact, it's a glorified sed command.

So I withdraw my earlier comment.  But perhaps the syntax of the proposed
command could be aligned with the CREATE TRANSLATION command.

-- 
Peter Eisentraut   peter_e@gmx.net






Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> So I withdraw my earlier comment.  But perhaps the syntax of the proposed
> command could be aligned with the CREATE TRANSLATION command.

Ok. What about this?
   CREATE CONVERSION <conversion name>
     FOR <encoding name>
     TO <encoding name>
     FROM <conversion routine name>

   DROP CONVERSION <conversion name>

BTW, I wonder if we should invent a new access privilege for conversions.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
Here is a proposal for a new pg_conversion system table. Comments?

/*-------------------------------------------------------------------------
 *
 * pg_conversion.h
 *      definition of the system "conversion" relation (pg_conversion)
 *      along with the relation's initial contents.
 *
 *
 * Portions Copyright (c) 1996-2002, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * $Id$
 *
 * NOTES
 *      the genbki.sh script reads this file and generates .bki
 *      information from the DATA() statements.
 *
 *-------------------------------------------------------------------------
 */
#ifndef PG_CONVERSION_H
#define PG_CONVERSION_H

/* ----------------
 *        postgres.h contains the system type definitions and the
 *        CATALOG(), BOOTSTRAP and DATA() sugar words so this file
 *        can be read by both genbki.sh and the C compiler.
 * ----------------
 */

/* ----------------------------------------------------------------
 *        pg_conversion definition.
 *
 *        cpp turns this into typedef struct FormData_pg_conversion
 *
 *    conname            name of the conversion
 *    connamespace       namespace which the conversion belongs to
 *    conowner           owner of the conversion
 *    conforencoding     FOR encoding id
 *    contoencoding      TO encoding id
 *    conproc            OID of the conversion proc
 * ----------------------------------------------------------------
 */
CATALOG(pg_conversion)
{
    NameData    conname;
    Oid         connamespace;
    int4        conowner;
    int4        conforencoding;
    int4        contoencoding;
    Oid         conproc;
} FormData_pg_conversion;

/* ----------------
 *        Form_pg_conversion corresponds to a pointer to a tuple with
 *        the format of the pg_conversion relation.
 * ----------------
 */
typedef FormData_pg_conversion *Form_pg_conversion;

/* ----------------
 *        compiler constants for pg_conversion
 * ----------------
 */
#define Natts_pg_conversion                6
#define Anum_pg_conversion_conname         1
#define Anum_pg_conversion_connamespace    2
#define Anum_pg_conversion_conowner        3
#define Anum_pg_conversion_conforencoding  4
#define Anum_pg_conversion_contoencoding   5
#define Anum_pg_conversion_conproc         6

/* ----------------
 * initial contents of pg_conversion
 * ----------------
 */

/*
 * prototypes for functions in pg_conversion.c
 */
extern Oid ConversionCreate(const char *conname, Oid connamespace,
                            int32 conowner,
                            int4 conforencoding, int4 contoencoding,
                            Oid conproc);

#endif   /* PG_CONVERSION_H */




Re: Proposal: CREATE CONVERSION

From
Karel Zak
Date:
On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:
> What would be really cool is if we could somehow reuse the conversion
> modules provided by the C library and/or the iconv library.  For example,
^^^^^^^
 Very good point. Why use our own conversion routines/tables if there is
 a common library for this?

 The encoding API for PostgreSQL is a really cool idea.

 I'm unsure about only one argument for the encoding function. What if I
 want to use one generic function for all encodings (for example as an
 API to iconv)? I think a better C interface is:

   encode( TEXT data, NAME from, NAME to );

 where from/to are encoding names. The other way is to use some struct
 that handles this information -- like ARGS in trigger functions.

   Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:
> > What would be really cool is if we could somehow reuse the conversion
> > modules provided by the C library and/or the iconv library.  For example,
>                                               ^^^^^^^
> 
>  Very good point. Why use own conversion routines/tables if there is common
>  library for this?

I'm still not sure about the details of the conversion maps used by
iconv. Japanese users have enough trouble with the conversion between
Unicode and other charsets. This is because there are many variations of
the conversion maps provided by vendors. For example, the conversion map
used for Unicode and SJIS in PostgreSQL has been carefully designed to
minimize the problems described above. Another issue is the availability
of iconv among platforms. If we are sure that a particular iconv
conversion routine is available on all platforms and the conversion
result is good enough, our conversion routine could be replaced by a new
one using iconv.

>  The encoding API for PostgreSQL is a really cool idea.
> 
>  I'm unsure about only one argument for the encoding function. What if I
>  want to use one generic function for all encodings (for example as an
>  API to iconv)?

Use a simple wrapper function.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Karel Zak
Date:
On Mon, Jul 08, 2002 at 09:59:44PM +0900, Tatsuo Ishii wrote:
> > On Sun, Jul 07, 2002 at 12:58:07PM +0200, Peter Eisentraut wrote:
> > > What would be really cool is if we could somehow reuse the conversion
> > > modules provided by the C library and/or the iconv library.  For example,
> >                                               ^^^^^^^
> > 
> >  Very good point. Why use own conversion routines/tables if there is common
> >  library for this?
> 
> I'm still not sure about the details of the conversion maps used by
> iconv. Japanese users have enough trouble with the conversion between
> Unicode and other charsets. This is because there are many variations of
> the conversion maps provided by vendors. For example, the conversion map
> used for Unicode and SJIS in PostgreSQL has been carefully designed to
> minimize the problems described above. Another issue is the availability
> of iconv among platforms. If we are sure that a particular iconv
> conversion routine is available on all platforms and the conversion
> result is good enough, our conversion routine could be replaced by a new
> one using iconv.
 This is not a problem if we have some common API. You can use the
 current conversion tables (maps) and, for example, I can use iconv on
 my i386/Linux.

 I don't want to replace the current maps if somebody needs them. I would
 like an API.

 I see iconv is included in glibc now.

> >  I'm unsure about only one argument for the encoding function. What if I
> >  want to use one generic function for all encodings (for example as an
> >  API to iconv)?
> 
> Use a simple wrapper function.
 How does this function know the to/from encodings?
  Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz




Re: Proposal: CREATE CONVERSION

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> CATALOG(pg_conversion)
> {
>     NameData    conname;
>     Oid            connamespace;
>     int4        conowner;
>     int4        conforencoding;
>     int4        contoencoding;
>     Oid            conproc;
> } FormData_pg_conversion;

Should use type "regproc" for conproc, I think.
        regards, tom lane




Re: Proposal: CREATE CONVERSION

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> BTW, I wonder if we should invent a new access privilege for conversions.

I believe the spec just demands USAGE on the underlying function for
the TRANSLATE case, and I don't see why it should be different for
CONVERT.  (In principle, if we didn't use a C-only API, you could
just call the underlying function directly; so there's little point
in having protection restrictions different from that case.)
        regards, tom lane




Re: Proposal: CREATE CONVERSION

From
Thomas Lockhart
Date:
...
> So I withdraw my earlier comment.  But perhaps the syntax of the proposed
> command could be aligned with the CREATE TRANSLATION command.

Tatsuo, it seems that we should use SQL99 terminology and commands where
appropriate. We do not yet implement the SQL99 forms of character
support, and I'm not sure if our current system is modeled to fit the
SQL99 framework. Are you suggesting CREATE CONVERSION to avoid
infringing on SQL99 syntax to allow us to use that sometime later?
                 - Thomas




Re: Proposal: CREATE CONVERSION

From
Peter Eisentraut
Date:
Thomas Lockhart writes:

> Tatsuo, it seems that we should use SQL99 terminology and commands where
> appropriate. We do not yet implement the SQL99 forms of character
> support, and I'm not sure if our current system is modeled to fit the
> SQL99 framework. Are you suggesting CREATE CONVERSION to avoid
> infringing on SQL99 syntax to allow us to use that sometime later?

SQL99 says that the method by which conversions are created is
implementation-defined.  Tatsuo is defining the implementation.

-- 
Peter Eisentraut   peter_e@gmx.net





Re: Proposal: CREATE CONVERSION

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> Here is a proposal for a new pg_conversion system table. Comments?

I wonder if the encodings themselves shouldn't be represented in some
system table, too.  Admittedly, this is nearly orthogonal to the proposed
system table, except perhaps the data type of the two encoding fields.

-- 
Peter Eisentraut   peter_e@gmx.net





Re: Proposal: CREATE CONVERSION

From
Thomas Lockhart
Date:
> When you say "We do not yet implement the SQL99 forms of character
> support", I think you mean the ability to specify per column (or even
> per string) charset. I don't think this would happen for 7.3 (or 8.0
> whatever), but sometime later I would like to make it a reality.

Right.

An aside: I was thinking about this some, from the PoV of using our
existing type system to handle this (as you might remember, this is an
inclination I've had for quite a while). I think that most things line
up fairly well to allow this (and having transaction-enabled features
may require it), but do notice that the SQL feature of allowing a
different character set for every column *name* does not map
particularly well to our underlying structures.
                    - Thomas




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> I believe the spec just demands USAGE on the underlying function for
> the TRANSLATE case, and I don't see why it should be different for
> CONVERT.  (In principle, if we didn't use a C-only API, you could
> just call the underlying function directly; so there's little point
> in having protection restrictions different from that case.)

Ok, so:

(1) a CONVERSION can only be dropped by the superuser or its owner.
(2) a grant syntax for CONVERSION is:
   GRANT USAGE ON CONVERSION <conversion_name> to
   {<user_name> | GROUP <group_name> | PUBLIC} [, ...]
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> Tatsuo, it seems that we should use SQL99 terminology and commands where
> appropriate. We do not yet implement the SQL99 forms of character
> support, and I'm not sure if our current system is modeled to fit the
> SQL99 framework. Are you suggesting CREATE CONVERSION to avoid
> infringing on SQL99 syntax to allow us to use that sometime later?

I'm not sure I understand your question, but I would say I would like
to follow SQL99 as much as possible.

When you say "We do not yet implement the SQL99 forms of character
support", I think you mean the ability to specify per column (or even
per string) charset. I don't think this would happen for 7.3 (or 8.0
whatever), but sometime later I would like to make it a reality.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I believe the spec just demands USAGE on the underlying function for
>> the TRANSLATE case, and I don't see why it should be different for
>> CONVERT.  (In principle, if we didn't use a C-only API, you could
>> just call the underlying function directly; so there's little point
>> in having protection restrictions different from that case.)

> Ok, so:

> (1) a CONVERSION can only be dropped by the superuser or its owner.

Okay ...

> (2) a grant syntax for CONVERSION is:

>     GRANT USAGE ON CONVERSION <conversion_name> to
>     {<user_name> | GROUP <group_name> | PUBLIC} [, ...]

No, I don't think a conversion has any privileges of its own at all.
You either have USAGE on the underlying function, or not.
        regards, tom lane




Re: Proposal: CREATE CONVERSION

From
Thomas Lockhart
Date:
> I've been thinking about this for a while too. What about collation? If we add
> new charsets A and B, and each has 10 collations, then we are going to
> have 20 new types? That seems overkill to me.

Well, afaict all of the operations we would ask of a type we will be
required to provide for character sets and collations. So ordering,
conversions, operators, index methods, etc etc are all required. It
*does* seem like a lot of work, but the type system is specifically
designed to do exactly this. Lifting those capabilities out of the type
system only to reimplement them elsewhere seems all trouble with no
upside.

Perhaps the current concept of "binary compatible types" could help
reduce the complexity, if it were made extensible, which it needs
anyway. But in most cases the character set/collation pair is a unique
combination, with a limited set of possibilities for other character
set/collation pairs with equivalent forms of use, which would keep us
from being able to reuse pieces anyway.

For most installations, we would install just those character sets the
installation/database requires, so in practice the database size need
not grow much beyond what it already is. And we could have conventions
on how functions and operators are named for a character set and/or
collation, so we could auto-generate the SQL definitions given an
implementation which meets a template standard.

Hmm, an aside which might be relevant: I've been looking at the
"national character string" syntax (you know, the N'string' convention)
and at the binary and hex string syntax (B'101010' and X'AB1D', as
examples) and would like to implement them in the lexer and parser by
having the string preceded with a type identifier as though they were
something like

NATIONAL CHARACTER 'string'
BIN '101010'
HEX 'AB1D'

where both BIN and HEX result in the *same* underlying data type once
ingested (or at least a reasonable facsimile). I won't be allowed to
create two data types with the same type OID, but maybe if I assign them
to be binary compatible then I won't have to flesh out the hex data type
but only provide an input and output function.
                  - Thomas




Re: Proposal: CREATE CONVERSION

From
Thomas Lockhart
Date:
> If so, what about the "coercibility" property?
> The standard defines four distinct coercibility properties. So in
> my example above, you are actually going to define 80 new types?
> (also a collation could be either "PAD SPACE" or "NO PAD". So you
> might have 160 new types).

Well, yes I suppose so. The point is that these relationships *must be
defined anyway*. Allowed and forbidden conversions must be defined,
collation order must be defined, indexing operations must be defined,
etc etc etc. In fact, everything typically associated with a type must
be defined, including the allowed conversions between other types
(character sets/collations).

So, how are we going to do this *in a general way* without carrying the
infrastructure of a (the) type system along with it? What would we be
able to leave out or otherwise get for free if we use another mechanism?
And is that mechanism fundamentally simpler than (re)using the type
system that we already have?
                 - Thomas




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> > Use a simple wrap function.
> 
>  How knows this function to/from encoding? 

For example, if you want to define a function for LATIN1 to UNICODE
conversion, the function would look like:

function_for_LATIN1_to_UTF-8(from_string opaque, to_string opaque, length integer)
{
    :
    :
    generic_function_using_iconv(from_str, to_str, "ISO-8859-1", "UTF-8", length);
}

CREATE FUNCTION function_for_LATIN1_to_UTF-8(opaque, opaque, integer)
RETURNS integer;
CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE' FROM
function_for_LATIN1_to_UTF-8;
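
A rough sketch of what a generic_function_using_iconv() along these lines
could be built on: the iconv(3) calls are real, everything else is
hypothetical, error handling is omitted, and a sufficiently large
destination buffer is assumed:

#include <iconv.h>

static void
generic_function_using_iconv(const char *src, char *dest,
                             const char *from_charset,
                             const char *to_charset, size_t len)
{
    iconv_t cd = iconv_open(to_charset, from_charset);
    char   *in = (char *) src;
    size_t  in_left = len;
    size_t  out_left = len * 4;     /* worst-case growth, assumed available */

    iconv(cd, &in, &in_left, &dest, &out_left);
    *dest = '\0';                   /* iconv() advanced dest past the output */
    iconv_close(cd);
}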





Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> An aside: I was thinking about this some, from the PoV of using our
> existing type system to handle this (as you might remember, this is an
> inclination I've had for quite a while). I think that most things line
> up fairly well to allow this (and having transaction-enabled features
> may require it), but do notice that the SQL feature of allowing a
> different character set for every column *name* does not map
> particularly well to our underlying structures.

I've been thinking about this for a while too. What about collation? If we add
new charsets A and B, and each has 10 collations, then we are going to
have 20 new types? That seems overkill to me.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> > (1) a CONVERSION can only be dropped by the superuser or its owner.
> 
> Okay ...
> 
> > (2) a grant syntax for CONVERSION is:
> 
> >     GRANT USAGE ON CONVERSION <conversion_name> to
> >     {<user_name> | GROUP <group_name> | PUBLIC} [, ...]
> 
> No, I don't think a conversion has any privileges of its own at all.
> You either have USAGE on the underlying function, or not.

I see.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> Here is a proposal for a new pg_conversion system table. Comments?
> 
> I wonder if the encodings themselves shouldn't be represented in some
> system table, too.  Admittedly, this is nearly orthogonal to the proposed
> system table, except perhaps the data type of the two encoding fields.

That would be ideal, but I think that would happen at the same time
as CREATE CHARACTER SET is implemented.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Karel Zak
Date:
On Tue, Jul 09, 2002 at 10:07:11AM +0900, Tatsuo Ishii wrote:
> > > Use a simple wrapper function.
> > 
> >  How does this function know the to/from encodings?
> 
> For example, if you want to define a function for LATIN1 to UNICODE
> conversion, the function would look like:
> 
> function_for_LATIN1_to_UTF-8(from_string opaque, to_string opaque, length
> integer)
> {
>     :
>     :
>     generic_function_using_iconv(from_str, to_str, "ISO-8859-1", "UTF-8",
>     length);
> }
> 
> CREATE FUNCTION function_for_LATIN1_to_UTF-8(opaque, opaque, integer)
> RETURNS integer;
> CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE' FROM
> function_for_LATIN1_to_UTF-8;
 Hmm, but it requires defining a "function_for_..." for each conversion.
 A trigger function, for example, I needn't define for each table; I can
 use only one PostgreSQL function for an arbitrary table.
   Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> > If so, what about the "coercibility" property?
> > The standard defines four distinct coercibility properties. So in
> > my example above, you are actually going to define 80 new types?
> > (also a collation could be either "PAD SPACE" or "NO PAD". So you
> > might have 160 new types).
> 
> Well, yes I suppose so. The point is that these relationships *must be
> defined anyway*. Allowed and forbidden conversions must be defined,
> collation order must be defined, indexing operations must be defined,
> etc etc etc. In fact, everything typically associated with a type must
> be defined, including the allowed conversions between other types
> (character sets/collations).
> 
> So, how are we going to do this *in a general way* without carrying the
> infrastructure of a (the) type system along with it? What would we be
> able to leave out or otherwise get for free if we use another mechanism?
> And is that mechanism fundamentally simpler than (re)using the type
> system that we already have?

Well, I think charset/collation/coercibility/pad are all properties
specific to string data types, not common to any other data types. So it
seems more appropriate for the type system not to have such type-specific
knowledge. For example,

S1 < S2

should raise an error if S1 has "no collating properties" and S2 has
"implicit collating properties", while being ok if S1 has "no collating
properties" and S2 has "explicit collating properties". It would be
very hard for the type system to handle this kind of case since it
requires special knowledge about string data types.

Alternative?

Why don't we have these properties in the string data itself?
(Probably we do not need to have them in disk storage.)  The existing
text data type has length + data. I suggest extending it like:

length + charset + collation + pad + coercibility + data

With this, the above example could be easily handled by the < operator.
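
Purely as an illustration, the layout described above might look like the
following C struct; none of these fields exist in PostgreSQL, and all the
names are made up:

typedef struct
{
    int32   length;         /* total size, as in any varlena datum */
    int16   charset;        /* character set id */
    int16   collation;      /* collation id */
    uint8   pad;            /* PAD SPACE or NO PAD */
    uint8   coercibility;   /* implicit / explicit / coercible / none */
    char    data[1];        /* the string bytes themselves */
} AttributedTextData;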

For indexes, maybe we could dynamically replace the varstr_cmp function
according to the collation, though I have not actually examined this idea
closely.
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
SQL99 allows on-the-fly encoding conversion:

CONVERT('aaa' USING myconv)

So there could be more than one conversion for a particular encoding
pair. This leads to an ambiguity about the "default" conversion used for
the automatic frontend/backend encoding conversion. Can we add a flag
indicating that this is the "default" conversion? The new proposed
syntax would be:

CREATE CONVERSION <conversion name>
      FOR <source encoding name>
      TO <destination encoding name>
      FROM <conversion function name>
      [DEFAULT]

Comments?
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> > I've been thinking about this for a while too. What about collation? If we add
> > new charsets A and B, and each has 10 collations, then we are going to
> > have 20 new types? That seems overkill to me.
> 
> Well, afaict all of the operations we would ask of a type we will be
> required to provide for character sets and collations. So ordering,
> conversions, operators, index methods, etc etc are all required. It
> *does* seem like a lot of work, but the type system is specifically
> designed to do exactly this. Lifting those capabilities out of the type
> system only to reimplement them elsewhere seems all trouble with no
> upside.

If so, what about the "coercibility" property?
The standard defines four distinct coercibility properties. So in
my example above, you are actually going to define 80 new types?
(also a collation could be either "PAD SPACE" or "NO PAD". So you
might have 160 new types).
--
Tatsuo Ishii




Re: Proposal: CREATE CONVERSION

From
Hannu Krosing
Date:
On Tue, 2002-07-09 at 03:47, Tatsuo Ishii wrote:
> > An aside: I was thinking about this some, from the PoV of using our
> > existing type system to handle this (as you might remember, this is an
> > inclination I've had for quite a while). I think that most things line
> > up fairly well to allow this (and having transaction-enabled features
> > may require it), but do notice that the SQL feature of allowing a
> > different character set for every column *name* does not map
> > particularly well to our underlying structures.
> 
> I've been thinking about this for a while too. What about collation? If we add
> new charsets A and B, and each has 10 collations, then we are going to
> have 20 new types? That seems overkill to me.

Can't we do all collating in Unicode and convert charsets A and B to and
from it?

I would even recommend going a step further and storing all 'national'
character sets in Unicode.

--------------
Hannu






Re: Proposal: CREATE CONVERSION

From
Peter Eisentraut
Date:
Thomas Lockhart writes:

> An aside: I was thinking about this some, from the PoV of using our
> existing type system to handle this (as you might remember, this is an
> inclination I've had for quite a while). I think that most things line
> up fairly well to allow this (and having transaction-enabled features
> may require it), but do notice that the SQL feature of allowing a
> different character set for every column *name* does not map
> particularly well to our underlying structures.

The more I think about it, the more I come to the conclusion that the
SQL framework for "character sets" is both bogus and a red herring.  (And
it begins with figuring out exactly what a character set is, as opposed
to a form-of-use, a.k.a.(?) encoding, but let's ignore that.)

The ability to store each column value in a different encoding sounds
interesting, because it allows you to create tables such as
   product_id | product_name_en | product_name_kr | product_name_jp

but you might as well create a table such as
   product_id | lang | product_name

with product_name in Unicode, and have a more extensible application that
way, too.

I think it's fine to have the encoding fixed for the entire database.  It
sure makes coding easier.  If you want to be international, you use
Unicode.  If not you can "optimize" your database by using a more
efficient encoding.  In fact, I think we should consider making UTF-8 the
default encoding sometime.

The real issue is the collation.  But the collation is a small subset of
the whole locale/character set gobbledygook.  Standardized collation rules
in standardized forms exist.  Finding/creating routines to interpret and
apply them should be the focus.  SQL's notion to funnel the decision which
collation rule to apply through the character sets is bogus.  It's
impossible to pick a default collation rule for many character sets
without applying bias.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Proposal: CREATE CONVERSION

From
Peter Eisentraut
Date:
Hannu Krosing writes:

> Can't we do all collating in Unicode and convert charsets A and B to and
> from it?
>
> I would even recommend going a step further and storing all 'national'
> character sets in Unicode.

Sure.  However, Tatsuo maintains that the customary Japanese character
sets don't map very well with Unicode.  Personally, I believe that this is
an issue that should be fixed, not avoided, but I don't understand the
issues well enough.

-- 
Peter Eisentraut   peter_e@gmx.net



Re: Proposal: CREATE CONVERSION

From
Tim Allen
Date:
On Wed, 10 Jul 2002 08:21, Peter Eisentraut wrote:
> Hannu Krosing writes:
...
> > I would even recommend going a step further and storing all 'national'
> > character sets in Unicode.
>
> Sure.  However, Tatsuo maintains that the customary Japanese character
> sets don't map very well with Unicode.  Personally, I believe that this is
> an issue that should be fixed, not avoided, but I don't understand the
> issues well enough.

Presumably improving the Unicode support to cover the full UTF32 (or UCS4) 
range would help with this. Last time I checked, PostgreSQL only supports the 
UCS2 subset of Unicode, ie 16 bits. From the Unicode propaganda I've read, it 
seems that one of the main goals of the expansion of the range beyond 16 bits 
was to answer the complaints of Japanese users.

Tim

-- 
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/
                   http://www4.tpg.com.au/users/rita_tim/



Re: Proposal: CREATE CONVERSION

From
Curt Sampson
Date:
On Wed, 10 Jul 2002, Peter Eisentraut wrote:

> Sure.  However, Tatsuo maintains that the customary Japanese character
> sets don't map very well with Unicode.  Personally, I believe that this is
> an issue that should be fixed, not avoided, but I don't understand the
> issues well enough.

I hear this all the time, but I have yet to have someone show me what,
in ISO-2022-JP, EUC-JP or SJIS, cannot be transparently translated into
Unicode and back.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC



Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> > For example, if you want to define a function for LATIN1 to UNICODE
> > conversion, the function would look like:
> > 
> > function_for_LATIN1_to_UTF-8(from_string opaque, to_string opaque, length
> > integer)
> > {
> >     :
> >     :
> >     generic_function_using_iconv(from_str, to_str, "ISO-8859-1", "UTF-8",
> >     length);
> > }
> > 
> > CREATE FUNCTION function_for_LATIN1_to_UTF-8(opaque, opaque, integer)
> > RETURNS integer;
> > CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE' FROM
> > function_for_LATIN1_to_UTF-8;
> 
>  Hmm, but it requires defining a "function_for_..." for each conversion.
>  A trigger function, for example, I needn't define for each table; I can
>  use only one PostgreSQL function for an arbitrary table.

I don't think this is a big problem.

However, thinking about it more, I came to the conclusion that passing
encoding ids would be a good thing. With the encoding id parameters, the
function could check if it is called with the correct encodings, and this
would prevent disaster. New interface proposal:

pgconv(
    INTEGER,    -- source encoding id
    INTEGER,    -- destination encoding id
    OPAQUE,     -- source string (null terminated C string)
    OPAQUE,     -- destination string (null terminated C string)
    INTEGER     -- source string length
) returns INTEGER;    -- dummy. returns nothing, actually.
 

--
Tatsuo Ishii
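
As an illustration only, a conversion routine matching this proposed
interface might look like the sketch below. The encoding-id check and the
LATIN1-to-UTF-8 loop are illustrative, and PG_LATIN1/PG_UTF8 are assumed
to be the encoding ids from pg_wchar.h:

#include "postgres.h"
#include "fmgr.h"
#include "mb/pg_wchar.h"

PG_FUNCTION_INFO_V1(latin1_to_utf8);

Datum
latin1_to_utf8(PG_FUNCTION_ARGS)
{
    int            src_enc  = PG_GETARG_INT32(0);
    int            dest_enc = PG_GETARG_INT32(1);
    unsigned char *src      = (unsigned char *) PG_GETARG_POINTER(2);
    unsigned char *dest     = (unsigned char *) PG_GETARG_POINTER(3);
    int            len      = PG_GETARG_INT32(4);
    int            i;

    /* the encoding-id arguments let the function fail loudly instead of
     * silently corrupting data when bound to the wrong conversion */
    if (src_enc != PG_LATIN1 || dest_enc != PG_UTF8)
        elog(ERROR, "latin1_to_utf8: called for the wrong encodings");

    for (i = 0; i < len; i++)
    {
        unsigned char c = src[i];

        if (c < 0x80)
            *dest++ = c;                    /* ASCII passes through */
        else
        {
            *dest++ = 0xc0 | (c >> 6);      /* Latin-1 high half becomes */
            *dest++ = 0x80 | (c & 0x3f);    /* a two-byte UTF-8 sequence */
        }
    }
    *dest = '\0';

    PG_RETURN_INT32(0);                     /* dummy; returns nothing, actually */
}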


Re: Proposal: CREATE CONVERSION

From
Karel Zak
Date:
On Thu, Jul 11, 2002 at 03:37:49PM +0900, Tatsuo Ishii wrote:
> > > CREATE FUNCTION function_for_LATIN1_to_UTF-8(opaque, opaque, integer)
> > > RETURNS integer;
> > > CREATE CONVERSION myconversion FOR 'LATIN1' TO 'UNICODE' FROM
> > > function_for_LATIN1_to_UTF-8;
> > 
> >  Hmm, but it require define "function_for_..." for each conversion. 
> >  For example trigger function I needn't define for each table, but I can
> >  use only one PostgreSQL function for arbirary table.
> 
> I don't think this is a big problem, IMO.
> 
> However, thinking about it more, I came to the conclusion that passing
> encoding ids would be a good thing. With the encoding id parameters, the
> function could check if it is called with the correct encodings, and this
> would prevent disaster. New interface proposal:
OK.

>  pgconv(
>         INTEGER,    -- source encoding id
>         INTEGER,    -- destination encoding id
 Where/how is the conversion between encoding id and encoding name
 described? (Maybe I am overlooking something :-) I expect the new
 encoding system will be extendable and the encoding list will not be
 hardcoded like now.
 (extendable = add a new encoding without a PostgreSQL rebuild)

 BTW, the client side needs routines to work with encoding names too
 (pg_char_to_encoding()). Hmm.. can it be extendable, or not?

>         OPAQUE,        -- source string (null terminated C string)
>         OPAQUE,        -- destination string (null terminated C string)
>         INTEGER     -- source string length
>  ) returns INTEGER;    -- dummy. returns nothing, actually.
   Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
>  Where/how is the conversion between encoding id and encoding name
>  described? (Maybe I am overlooking something :-) I expect the new
>  encoding system will be extendable and the encoding list will not be
>  hardcoded like now.
>  (extendable = add a new encoding without a PostgreSQL rebuild)

User-defined charsets (encodings) are under discussion and I believe they
will not happen for 7.3.

>  BTW, the client side needs routines to work with encoding names too
>  (pg_char_to_encoding()). Hmm.. can it be extendable, or not?

pg_char_to_encoding() is already in libpq. Or am I missing something?
--
Tatsuo Ishii


Re: Proposal: CREATE CONVERSION

From
Karel Zak
Date:
On Thu, Jul 11, 2002 at 05:26:01PM +0900, Tatsuo Ishii wrote:
> >  Where/how is the conversion between encoding id and encoding name
> >  described? (Maybe I am overlooking something :-) I expect the new
> >  encoding system will be extendable and the encoding list will not be
> >  hardcoded like now.
> >  (extendable = add a new encoding without a PostgreSQL rebuild)
> 
> User-defined charsets (encodings) are under discussion and I believe they
> will not happen for 7.3.
> 
> >  BTW, the client side needs routines to work with encoding names too
> >  (pg_char_to_encoding()). Hmm.. can it be extendable, or not?
> 
> pg_char_to_encoding() is already in libpq. Or am I missing something?
 It works with the encoding table (pg_enc2name_tbl), which is compiled
 into the backend and the client too. It means the number of encodings
 cannot change after compilation, and you (the user) can't add a new
 encoding without changing pg_enc2name_tbl[]. I originally thought we
 could add new encodings on-the-fly in 7.3 :-) You're right.

 IMHO implementing "user-defined charsets (encodings)" will be a problem
 for the current libpq design.

   Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> > pg_char_to_encoding() is already in libpq. Or am I missing something?
> 
>  It works with the encoding table (pg_enc2name_tbl), which is compiled
>  into the backend and the client too. It means the number of encodings
>  cannot change after compilation, and you (the user) can't add a new
>  encoding without changing pg_enc2name_tbl[]. I originally thought we
>  could add new encodings on-the-fly in 7.3 :-) You're right.
> 
>  IMHO implementing "user-defined charsets (encodings)" will be a problem
>  for the current libpq design.

No, it's not a libpq problem, but a more general "client/server" problem
IMO. It's very hard to share dynamically created objects (info)
effectively between client and server.
--
Tatsuo Ishii


Re: Proposal: CREATE CONVERSION

From
Karel Zak
Date:
On Thu, Jul 11, 2002 at 05:52:18PM +0900, Tatsuo Ishii wrote:
> > > pg_char_to_encoding() is already in libpq. Or am I missing something?
> > 
> >  It works with the encoding table (pg_enc2name_tbl), which is compiled
> >  into the backend and the client too. It means the number of encodings
> >  cannot change after compilation, and you (the user) can't add a new
> >  encoding without changing pg_enc2name_tbl[]. I originally thought we
> >  could add new encodings on-the-fly in 7.3 :-) You're right.
> > 
> >  IMHO implementing "user-defined charsets (encodings)" will be a problem
> >  for the current libpq design.
> 
> No, it's not a libpq problem, but a more general "client/server" problem
> IMO. It's very hard to share dynamically created objects (info)
> effectively between client and server.
 IMHO the server will keep the dynamic objects, and the client must ask
 the server for the information it wants.
   Karel
--
 Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
> > No, it's not a libpq problem, but a more general "client/server" problem
> > IMO. It's very hard to share dynamically created objects (info)
> > effectively between client and server.
> 
>  IMHO the server will keep the dynamic objects, and the client must ask
>  the server for the information it wants.

I agree with you. However, the real problem is how fast it could be. For
example, pg_mblen() is called for each word processed by libpq to determine
the byte length of the word. If each call to pg_mblen() accessed the
backend, the performance might be unacceptably slow.
--
Tatsuo Ishii


Re: Proposal: CREATE CONVERSION

From
Karel Zak
Date:
On Thu, Jul 11, 2002 at 06:30:48PM +0900, Tatsuo Ishii wrote:
> > > No, it's not a libpq problem, but a more general "client/server" problem
> > > IMO. It's very hard to share dynamically created objects (info)
> > > effectively between client and server.
> > 
> >  IMHO the server will keep the dynamic objects, and the client must ask
> >  the server for the information it wants.
> 
> I agree with you. However, the real problem is how fast it could be. For
> example, pg_mblen() is called for each word processed by libpq to determine
> the byte length of the word. If each call to pg_mblen() accessed the
> backend, the performance might be unacceptably slow.
 It must load all relevant information about the actual encoding(s) and
 cache it in libpq.

 IMHO basic encoding information like name and id is not a problem. The
 PQmblen() is the big problem. Strange question: is PQmblen() really
 needed? I see it's used for result printing, but why doesn't the backend
 mark the size of each field (word) in the result? If the backend knows
 the size of the data well, why not send this information to the client
 together with the data?

   Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>  http://home.zf.jcu.cz/~zakkr/
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: Proposal: CREATE CONVERSION

From
Tatsuo Ishii
Date:
>  IMHO basic encoding information like name and id is not a problem.
>  The PQmblen() is the big problem. Strange question: is PQmblen() really
>  needed? I see it's used for result printing, but why doesn't the backend
>  mark the size of each field (word) in the result? If the backend knows
>  the size of the data well, why not send this information to the client
>  together with the data?

PQmblen() is used by psql in many places. It is used for parsing query
texts supplied by the user, not only for data sent from the backend.
--
Tatsuo Ishii