Thread: Re: [PATCHES] Postgres-6.3.2 locale patch

Re: [PATCHES] Postgres-6.3.2 locale patch

From
"Thomas G. Lockhart"
Date:
Hi. I'm looking for non-English-using Postgres hackers to participate in
implementing NCHAR() and alternate character sets in Postgres. I think
I've worked out how to do the implementation (not the details, just a
strategy) so that multiple character sets will be allowed in a single
database, additional character sets can be loaded at run-time, and so
that everything will behave transparently.

I would propose to do this for v6.4 as user-defined packages (with
compile-time parser support) on top of the existing USE_LOCALE and MB
patches so that the existing compile-time options are not changed or
damaged.

So, the initial questions:

1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
non-English applications? Do other databases use this SQL92 convention,
or does it have difficulties?

2) Would anyone be interested in helping to define the character sets
and helping to test? I don't know the correct collation sequences and
don't think they would display properly on my screen...

3) I'd like to implement the existing Cyrillic and EUC-jp character
sets, and also some European languages (French and ??) which use the
Latin-1 alphabet but might have different collation sequences. Any
suggestions for candidates??

                       - Tom

Re: [HACKERS] Re: [PATCHES] Postgres-6.3.2 locale patch

From
Patrice Hédé
Date:
Hi Tom,

> I would propose to do this for v6.4 as user-defined packages (with
> compile-time parser support) on top of the existing USE_LOCALE and MB
> patches so that the existing compile-time options are not changed or
> damaged.

Be careful: the system locales may not be installed, even though you may
need the locale information in Postgres. They may also be broken (which is
in fact often the case), so don't depend on them.

> So, the initial questions:
>
> 1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
> non-English applications? Do other databases use this SQL92 convention,
> or does it have difficulties?

Don't know (yet).
>
> 2) Would anyone be interested in helping to define the character sets
> and helping to test? I don't know the correct collation sequences and
> don't think they would display properly on my screen...

I can help with French, Icelandic, German and Norwegian (though for the
last two, I guess there are more appropriate people on this list :).

> 3) I'd like to implement the existing Cyrillic and EUC-jp character
> sets, and also some European languages (French and ??) which use the
> Latin-1 alphabet but might have different collation sequences. Any
> suggestions for candidates??

They all do, as soon as we take accents into account: an English system
puts all accented letters at the end. And of course, the rules are
different for each language :)
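
The point about accents can be illustrated with a minimal Python sketch. It compares plain code-point ordering (roughly what an "English" system does) with a naive ordering that ignores accents on a first pass. The helper names `strip_accents` and `naive_accent_key` are made up for this illustration, and the key is only an assumption-laden approximation: real per-language collation rules are more involved and, as noted above, differ for each language.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Return the word with combining accent marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_accent_key(word: str):
    """Hypothetical sort key: compare base letters first, original text as tie-breaker."""
    return (strip_accents(word).lower(), word)


words = ["zone", "été", "eau", "école"]

# Code-point order: every accented letter sorts after plain "z".
by_code_point = sorted(words)

# Base-letter order: accented words sort next to their unaccented neighbours.
by_base_letter = sorted(words, key=naive_accent_key)

print(by_code_point)   # ['eau', 'zone', 'école', 'été']
print(by_base_letter)  # ['eau', 'école', 'été', 'zone']
```

The tie-breaker in the key is arbitrary here; a language-specific collation would define how accented and unaccented forms of the same base letter compare.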

Patrice

PS: I'm sorry, Tom, I haven't been able to work on the FAQ for the past
month :(( because I've been busy in my free time learning Norwegian! I
will submit something very soon, I promise!

--
Patrice HÉDÉ --------------------------------- patrice@idf.net -----
                     ... Looking for a job in Iceland or in Norway !
Ingénieur informaticien   -   Computer engineer   -   Tölvufræðingur
----- http://www.idf.net/patrice/ ----------------------------------


Re: [PATCHES] Postgres-6.3.2 locale patch

From
t-ishii@sra.co.jp
Date:
>Hi. I'm looking for non-English-using Postgres hackers to participate in
>implementing NCHAR() and alternate character sets in Postgres. I think
>I've worked out how to do the implementation (not the details, just a
>strategy) so that multiple character sets will be allowed in a single
>database, additional character sets can be loaded at run-time, and so
>that everything will behave transparently.

Sounds like an interesting idea... But before going into the discussion,
let me clarify what "character set" means. A character set consists of
some characters. One of the most famous character sets is ISO 646
(almost the same as ASCII). In western Europe, the ISO 8859 series of
character sets is widely used. For example, ISO 8859-1 covers English,
French, German etc. and ISO 8859-2 covers Albanian, Romanian
etc. These are "single byte", and there is a one-to-many correspondence
between the character set and languages.

Example1:
ISO 8859-1 <------> English, French, German

On the other hand, some Asian languages such as Japanese, Chinese, and
Korean do not correspond to a single character set, but rather to
multiple character sets.

Example2:
ASCII, JIS X0208, JIS X0201, JIS X0212 <-------> Japanese
(ASCII, JIS X0208, JIS X0201, JIS X0212 are individual character sets)

An "encoding" is a way to represent a set of character sets in
computers. The above set of character sets is encoded in the EUC_JP
encoding.

I think SQL92 uses the term "character set" to mean an encoding.
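
The distinction between character sets and an encoding can be made concrete with a short Python sketch. It uses Python's standard `euc_jp` codec to encode one character from each of three of the character sets named above, and checks that pure ASCII text is byte-for-byte the same in both encodings. The sample characters and the `samples` dictionary are choices made for this illustration only.

```python
# One character from each of several character sets, all carried by EUC-JP.
samples = {
    "ASCII": "A",
    "JIS X0208": "\u65e5",             # a kanji
    "JIS X0201 (katakana)": "\uff71",  # a half-width katakana
}


def euc_jp_bytes(text: str) -> bytes:
    """Encode text with Python's built-in EUC-JP codec."""
    return text.encode("euc_jp")


encoded = {name: euc_jp_bytes(ch) for name, ch in samples.items()}

for name, raw in encoded.items():
    # Show how many bytes each character set needs inside the one encoding.
    print(f"{name:22} {len(raw)} byte(s): {raw.hex(' ')}")

# ASCII text is encoded identically in ASCII and in EUC-JP, which is the
# sense in which ASCII is a "subset" of the EUC_JP encoding.
ascii_is_subset = "select 1".encode("ascii") == euc_jp_bytes("select 1")
print(ascii_is_subset)
```

ASCII characters stay single-byte, while the other character sets are distinguished by their lead bytes; the half-width katakana, for instance, is introduced by the 0x8E shift byte.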

>So, the initial questions:
>
>1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
>non-English applications? Do other databases use this SQL92 convention,
>or does it have difficulties?

As far as I know, there is no commercial RDBMS that supports the
NCHAR/NVARCHAR/CHARACTER SET syntax. Oracle supports multiple
encodings. The encoding for a database is defined when creating the
database and cannot be changed at runtime. Clients can use a different
encoding as long as it is a "subset" of the database's encoding. For
example, an Oracle client can use ASCII if the database encoding is
EUC_JP.

I think the idea of defining the "default" encoding for a database at
database creation time is nice.

create database with encoding EUC_JP;

If the NCHAR/NVARCHAR/CHARACTER SET syntax were supported, a user
could use an encoding other than EUC_JP. Sounds very nice too.

>2) Would anyone be interested in helping to define the character sets
>and helping to test? I don't know the correct collation sequences and
>don't think they would display properly on my screen...

I would be able to help you in the Japanese part. For Chinese and
Korean, I'm going to find volunteers in the local PostgreSQL mailing
list I'm running if necessary.

>3) I'd like to implement the existing Cyrillic and EUC-jp character
>sets, and also some European languages (French and ??) which use the
>Latin-1 alphabet but might have different collation sequences. Any
>suggestions for candidates??

Collation sequences for EUC_JP? How nice it would be! One problem
with collation sequences for multi-byte encodings is that the sequence
might become huge. It seems you have a solution for that. Please let me
know more details.
--
Tatsuo Ishii
t-ishii@sra.co.jp

Re: [PATCHES] Postgres-6.3.2 locale patch

From
Oleg Broytmann
Date:
Hello!

On Wed, 3 Jun 1998, Thomas G. Lockhart wrote:
> Hi. I'm looking for non-English-using Postgres hackers to participate in
> implementing NCHAR() and alternate character sets in Postgres. I think
> I've worked out how to do the implementation (not the details, just a
> strategy) so that multiple character sets will be allowed in a single
> database, additional character sets can be loaded at run-time, and so
> that everything will behave transparently.

   All this sounds nice, but I am afraid the job is not for me. Actually I
am very new to the Postgres and SQL world. I started to learn SQL 3 months
ago; I started to play with Postgres 2 months ago. I started to hack the
Postgres sources (about locale) a little more than a month ago.

> 2) Would anyone be interested in helping to define the character sets
> and helping to test? I don't know the correct collation sequences and
> don't think they would display properly on my screen...

   It would be nice to test it, provided that it doesn't break existing
code. Our site is running hundreds of CGIs that rely on the current locale
support in Postgres...

Oleg.
----
  Oleg Broytmann     http://members.tripod.com/~phd2/     phd2@earthling.net
           Programmers don't die, they just GOSUB without RETURN.


Re: [PATCHES] Postgres-6.3.2 locale patch

From
"Jose' Soares Da Silva"
Date:
On Thu, 4 Jun 1998 t-ishii@sra.co.jp wrote:

> >Hi. I'm looking for non-English-using Postgres hackers to participate in
> >implementing NCHAR() and alternate character sets in Postgres. I think
> >I've worked out how to do the implementation (not the details, just a
> >strategy) so that multiple character sets will be allowed in a single
> >database, additional character sets can be loaded at run-time, and so
> >that everything will behave transparently.
>
> Sounds like an interesting idea... But before going into the discussion,
> let me clarify what "character set" means. A character set consists of
> some characters. One of the most famous character sets is ISO 646
> (almost the same as ASCII). In western Europe, the ISO 8859 series of
> character sets is widely used. For example, ISO 8859-1 covers English,
> French, German etc. and ISO 8859-2 covers Albanian, Romanian
> etc. These are "single byte", and there is a one-to-many correspondence
> between the character set and languages.
>
> Example1:
> ISO 8859-1 <------> English, French, German
>
> On the other hand, some Asian languages such as Japanese, Chinese, and
> Korean do not correspond to a single character set, but rather to
> multiple character sets.
>
> Example2:
> ASCII, JIS X0208, JIS X0201, JIS X0212 <-------> Japanese
> (ASCII, JIS X0208, JIS X0201, JIS X0212 are individual character sets)
>
> An "encoding" is a way to represent a set of character sets in
> computers. The above set of character sets is encoded in the EUC_JP
> encoding.
>
> I think SQL92 uses the term "character set" to mean an encoding.
>
> >So, the initial questions:
> >
> >1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
> >non-English applications? Do other databases use this SQL92 convention,
> >or does it have difficulties?
>
> As far as I know, there is no commercial RDBMS that supports the
> NCHAR/NVARCHAR/CHARACTER SET syntax. Oracle supports multiple
> encodings. The encoding for a database is defined when creating the
> database and cannot be changed at runtime. Clients can use a different
> encoding as long as it is a "subset" of the database's encoding. For
> example, an Oracle client can use ASCII if the database encoding is
> EUC_JP.

I tried the following databases on Linux and none of them has this feature:
. MySql
. Solid
. Empress
. Kubl
. ADABAS D

I found only one under M$-Windows that implements this feature:
. OCELOT
I'm playing with it, but so far I don't understand its behavior.
There's some interesting documentation about it in the OCELOT manual;
if you want, I can send it to you.

>
> I think the idea of defining the "default" encoding for a database at
> database creation time is nice.
>
> create database with encoding EUC_JP;
>
> If the NCHAR/NVARCHAR/CHARACTER SET syntax were supported, a user
> could use an encoding other than EUC_JP. Sounds very nice too.
>
> >2) Would anyone be interested in helping to define the character sets
> >and helping to test? I don't know the correct collation sequences and
> >don't think they would display properly on my screen...
>
> I would be able to help you in the Japanese part. For Chinese and
> Korean, I'm going to find volunteers in the local PostgreSQL mailing
> list I'm running if necessary.

I may help with Italian, Spanish and Portuguese.

>
> >3) I'd like to implement the existing Cyrillic and EUC-jp character
> >sets, and also some European languages (French and ??) which use the
> >Latin-1 alphabet but might have different collation sequences. Any
> >suggestions for candidates??
>
> Collation sequences for EUC_JP? How nice it would be! One problem
> with collation sequences for multi-byte encodings is that the sequence
> might become huge. It seems you have a solution for that. Please let me
> know more details.
> --
> Tatsuo Ishii
> t-ishii@sra.co.jp
                                                            Ciao, Jose'


Re: [PATCHES] Postgres-6.3.2 locale patch

From
"Thomas G. Lockhart"
Date:
> > Sounds like an interesting idea... But before going into the discussion,
> > let me clarify what "character set" means.
> > An "encoding" is a way to represent a set of character sets in
> > computers.
> > I think SQL92 uses the term "character set" to mean an encoding.

I have found the SQL92 terminology confusing, because they do not seem
to make the nice clear distinction between encoding and collation
sequence which you have pointed out. I suppose that there can be an
issue of visual appearance of an alphabet for different locales also.

afaik, SQL92 uses the term "character set" to mean an encoding with an
implicit collation sequence. SQL92 allows alternate collation sequences
to be specified for a "character set" when it can be made meaningful.

I would propose to implement
  VARCHAR(length) WITH CHARACTER SET setname

as a type with a type name of, for example, "VARSETNAME". This type
would have the comparison functions and operators which implement
collation sequences.

I would propose to implement
  VARCHAR(length) WITH CHARACTER SET setname COLLATION collname

as a type with a name of, for example, "VARCOLLNAME". For the EUC-jp
encoding, "collname" could be "Korean" or "Japanese" so the type name
would become "varkorean" or "varjapanese". Don't know for sure yet
whether this is adequate, but other possibilities can be used if
necessary.
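
The naming scheme described above can be sketched in a few lines of Python. This is only an illustration of the proposal as stated in this message, not actual Postgres code: the function `derived_type_name` is hypothetical, and lowercasing the result is an assumption based on the "varkorean" / "varjapanese" examples.

```python
from typing import Optional


def derived_type_name(charset: str, collation: Optional[str] = None) -> str:
    """Build the internal type name for VARCHAR(n) WITH CHARACTER SET ...

    Without a COLLATION clause the name is "var" + character set name;
    with one, it is "var" + collation name (hypothetical helper).
    """
    suffix = collation if collation is not None else charset
    return "var" + suffix.lower()


# VARCHAR(20) WITH CHARACTER SET latin1
print(derived_type_name("LATIN1"))

# VARCHAR(20) WITH CHARACTER SET euc_jp COLLATION japanese
print(derived_type_name("EUC_JP", "Japanese"))
```

Each derived type would then carry its own comparison functions and operators, which is where the collation sequence would actually be implemented.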

When a database is created, it can be specified with a default character
set/collation sequence for the database; this would correspond to the
NCHAR/NVARCHAR/NTEXT types. We could implement a
  SET NATIONAL CHARACTER SET = 'language';

command to determine the default character set for the session when
NCHAR is used.

The SQL92 technique for specifying an encoding/collation sequence in a
literal string is
  _language 'string'

so for example to specify a string in the French language (implying an
encoding, collation, and representation?) you would use
  _FRENCH 'string'

> > I would be able to help you in the Japanese part. For Chinese and
> > Korean, I'm going to find volunteers in the local PostgreSQL mailing
> > list I'm running if necessary.
>
> I may help with Italian, Spanish and Portuguese.

Great, and perhaps Oleg could help test with Cyrillic (I assume I can
steal code from the existing "CYR_LOCALE" blocks in the Postgres
backend).

> > Collation sequences for EUC_JP? How nice it would be! One problem
> > with collation sequences for multi-byte encodings is that the
> > sequence might become huge. It seems you have a solution for that.
> > Please let me know more details.

Um, no, I just assume we can find a solution :/ I'd like to implement
the infrastructure in the Postgres parser to allow multiple
encodings/collations, and then see where we are. As I mentioned, this
would be done for v6.4 as a transparent add-on, so that existing
capabilities are not touched or damaged. Implementing everything for
some European languages (with the 1-byte Latin-1 encoding?) may be
easiest, but the Asian languages might be more fun :)

                       - Tom

Re: [PATCHES] Postgres-6.3.2 locale patch

From
Oleg Broytmann
Date:
Hi!

On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:
> Great, and perhaps Oleg could help test with Cyrillic (I assume I can
> steal code from the existing "CYR_LOCALE" blocks in the Postgres
> backend).

   Before sending my patch to pgsql-patches I gave it out to a few testers
here. It wouldn't be too hard to find testers for Cyrillic support, sure.

Oleg.
----
  Oleg Broytmann     http://members.tripod.com/~phd2/     phd2@earthling.net
           Programmers don't die, they just GOSUB without RETURN.


Re: [PATCHES] Postgres-6.3.2 locale patch

From
t-ishii@sra.co.jp
Date:
>When a database is created, it can be specified with a default character
>set/collation sequence for the database; this would correspond to the
>NCHAR/NVARCHAR/NTEXT types. We could implement a
>  SET NATIONAL CHARACTER SET = 'language';

In the current implementation of MB, the encoding used by the backend (BE)
is determined at compile time. This time I would like to add more
flexibility, so that the encoding can be specified when creating a
database. I would like to add a new option to the CREATE DATABASE
statement:

CREATE DATABASE WITH ENCODING 'encoding';

I'm not sure whether this kind of thing is defined in the
standard. Any suggestions?
--
Tatsuo Ishii
t-ishii@sra.co.jp