Thread: encoding names

encoding names

From
Karel Zak
Date:
 Hi,

 attached is a patch with:

- new encoding name handling with better performance (binary search
  instead of a for() loop, and it avoids some needless searching)

- it is possible to use synonyms for an encoding (for example ISO-8859-1,
  Latin1, l1)

- implemented Peter's idea about "encoding name cleaning"
  (characters other than [A-Za-z0-9] are irrelevant -- 'ISO-8859-1' is
  the same as 'iso8859_1' or iso-8-8-5-9-1 :-)

- the routines for this are shared between FE and BE (no more defining
  encoding names separately in FE and BE)

- added the prefix PG_ to the encoding identifier macros; something like
  'ALT' is pretty dirty in source code, so rather use PG_ALT.

 (Note: the patch adds a new file mb/encname.c and removes mb/common.c)
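
 For readers without the attachment, here is a minimal sketch of the two
 ideas above (name cleaning plus binary search over a sorted alias table).
 It is not the code from the patch: the type enc_alias, the ENC_* ids, the
 table contents and the function names are all assumptions made up for
 illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical alias entry; names and ids are illustrative only. */
typedef struct
{
    const char *name;   /* cleaned alias: lower case, [a-z0-9] only */
    int         id;     /* encoding id the alias maps to */
} enc_alias;

enum { ENC_LATIN1 = 1, ENC_KOI8R = 2 };

/* Must stay sorted by name in strcmp() order, because bsearch() needs it. */
static const enc_alias alias_table[] = {
    {"iso88591", ENC_LATIN1},
    {"koi8r", ENC_KOI8R},
    {"l1", ENC_LATIN1},
    {"latin1", ENC_LATIN1},
};

/* Copy src to dst keeping only [A-Za-z0-9], folded to lower case. */
static void
clean_name(const char *src, char *dst, size_t dstlen)
{
    size_t n = 0;

    if (dstlen == 0)
        return;
    for (; *src != '\0' && n + 1 < dstlen; src++)
    {
        char c = *src;

        if (c >= 'A' && c <= 'Z')
            dst[n++] = (char) (c - 'A' + 'a');
        else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
            dst[n++] = c;
        /* every other character is dropped */
    }
    dst[n] = '\0';
}

/* bsearch() comparator: key is a cleaned name, elem is a table entry. */
static int
cmp_alias(const void *key, const void *elem)
{
    return strcmp((const char *) key, ((const enc_alias *) elem)->name);
}

/* Return the encoding id for name, or -1 if the name is unknown. */
int
lookup_encoding(const char *name)
{
    char             buf[32];
    const enc_alias *hit;

    clean_name(name, buf, sizeof(buf));
    hit = bsearch(buf, alias_table,
                  sizeof(alias_table) / sizeof(alias_table[0]),
                  sizeof(alias_table[0]), cmp_alias);
    return hit ? hit->id : -1;
}
```

 With this table, lookup_encoding("ISO-8859-1"), lookup_encoding("iso8859_1")
 and lookup_encoding("l1") all give the same id, because the name is cleaned
 before the search.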

                Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Attachment

Re: encoding names

From
Peter Eisentraut
Date:
Karel Zak writes:

> - possible is use synonyms for encoding (an example ISO-8859-1,
>   Latin1, l1)

On the choice of synonyms:  Do we really want to add that many synonyms
that are not the standard name?  I think the following are not necessary:

cyrillic, cp819, ibm819, isoir100x, l1-4

ISO 8859 is a pretty well-known term these days.

KOI8 needs to be aliased as koi8r.  Unicode is not a valid encoding name,
actually.  Do you know what encoding it stands for and could you add that
as an alias?

On the code:

#ifdef WIN32
   #include "win32.h"
#else
   #include <unistd.h>
#endif

needs to be written as

#ifdef WIN32
#   include "win32.h"
#else
#   include <unistd.h>
#endif

for portability.

For extra credit:  A patch to configure and the documentation.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter


Re: encoding names

From
Barry Lind
Date:
This patch will break the JDBC driver.  The JDBC driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to Unicode.  This patch changes the return values
for getdatabaseencoding() such that the driver will no longer work.  For
example "LATIN1" which used to be returned will now come back as
"iso88591".  This change in behaviour impacts the JDBC driver and any
other application that is depending on the output of the
getdatabaseencoding() function.

I would recommend that getdatabaseencoding() return the old names for
backward compatibility and then deprecate this function to be removed in
the future.  Then create a new function that returns the new encoding
names that can be used going forward.
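
One way to picture this recommendation is a small table that maps each new
name back to the name older clients expect. The sketch below is only an
illustration: enc_name_pair and legacy_encoding_name() are hypothetical, and
the only pair listed is the one mentioned in this message.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pairing of a new-style name with the name returned before. */
typedef struct
{
    const char *canonical;  /* name produced by the new code */
    const char *legacy;     /* name existing clients were written against */
} enc_name_pair;

static const enc_name_pair name_pairs[] = {
    {"iso88591", "LATIN1"},     /* the example given in this thread */
};

/*
 * Map a new-style name to the legacy one for the old function to return.
 * Names without an entry pass through unchanged.
 */
const char *
legacy_encoding_name(const char *canonical)
{
    size_t i;

    for (i = 0; i < sizeof(name_pairs) / sizeof(name_pairs[0]); i++)
    {
        if (strcmp(name_pairs[i].canonical, canonical) == 0)
            return name_pairs[i].legacy;
    }
    return canonical;
}
```

The old function would return legacy_encoding_name(...) of the stored name,
while the new function would return the stored name directly.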

thanks,
--Barry




Re: encoding names

From
Peter Eisentraut
Date:
Barry Lind writes:

> This patch will break the JDBC driver.  The jdbc driver relies on the
> value returned by getdatabaseencoding() to determine the server encoding
> so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well.  (Or couldn't we convert it to Unicode in the server?)

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter


Re: encoding names

From
"Serguei Mokhov"
Date:
----- Original Message -----
From: Peter Eisentraut <peter_e@gmx.net>
Sent: Friday, August 17, 2001 12:11 PM


> Karel Zak writes:
>
> > - possible is use synonyms for encoding (an example ISO-8859-1,
> >   Latin1, l1)
>
> On the choice of synonyms:  Do we really want to add that many synonyms
> that are not the standard name?  I think the following are not necessary:
>
> cyrillic, cp819, ibm819, isoir100x, l1-4

I'm not sure about the others, but 'cyrillic' is quite an ambiguous alias,
because it can denote many Slavic languages: Russian, Ukrainian,
Bulgarian and Serbian are a few examples, so I believe it should be excluded
from the list of synonyms.

> KOI8 needs to be aliased as koi8r.

... and Karel can you change these so they are consistent with
others:

> KOI8_to_utf(unsigned char *iso, unsigned char *utf, int len)
> {
>  local_to_utf(iso, utf, LUmapKOI8, sizeof(LUmapKOI8) / sizeof(pg_local_to_utf), PG_KOI8, len);
> }

to

koi8r_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^
{
 local_to_utf(iso, utf, LUmapKOI8R, sizeof(LUmapKOI8R) / sizeof(pg_local_to_utf), PG_KOI8R, len);
}                            ^^^^^              ^^^^^                                ^^^^^

> WIN_to_utf(unsigned char *iso, unsigned char *utf, int len)
> {
>   local_to_utf(iso, utf, LUmapWIN, sizeof(LUmapWIN) / sizeof(pg_local_to_utf), PG_WIN1251, len);
> }

to

win1251_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^^^
{
  local_to_utf(iso, utf, LUmapWIN1251, sizeof(LUmapWIN1251) / sizeof(pg_local_to_utf), PG_WIN1251, len);
                              ^^^^^^^              ^^^^^^^                                ^^^^^^^
}

S.



Re: encoding names

From
Tatsuo Ishii
Date:
> Barry Lind writes:
>
> > This patch will break the JDBC driver.  The jdbc driver relies on the
> > value returned by getdatabaseencoding() to determine the server encoding
> > so that it can convert to unicode.
>
> Then the driver needs to be changed to accept the new encoding names as
> well.  (Or couldn't we convert it to Unicode in the server?)

This will break the backward compatibility. I agree with Barry's opinion.
--
Tatsuo Ishii

Re: encoding names

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> > Barry Lind writes:
> >
> > > This patch will break the JDBC driver.  The jdbc driver relies on the
> > > value returned by getdatabaseencoding() to determine the server encoding
> > > so that it can convert to unicode.
> >
> > Then the driver needs to be changed to accept the new encoding names as
> > well.  (Or couldn't we convert it to Unicode in the server?)
>
> This will break the backward compatibility.

How so?

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter


Re: encoding names

From
Tatsuo Ishii
Date:
> > > > This patch will break the JDBC driver.  The jdbc driver relies on the
> > > > value returned by getdatabaseencoding() to determine the server encoding
> > > > so that it can convert to unicode.
> > >
> > > Then the driver needs to be changed to accept the new encoding names as
> > > well.  (Or couldn't we convert it to Unicode in the server?)
> >
> > This will break the backward compatibility.
>
> How so?

Apparently the 7.1 JDBC driver does not understand the value that the 7.2
getdatabaseencoding() returns.
--
Tatsuo Ishii

Re: encoding names

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> Apparently 7.1 JDBC driver does not understand the value 7.2
> getdatabaseencoding() returns.

Then the server needs to look at the protocol number to decide what to
send back.  But we need to be able to move forward with the encoding names
sooner or later anyway.

However, the 7.1 JDBC driver is going to be incompatible with a 7.2 server
in a number of other areas as well, so I'm not completely sure whether
it'd be worth the effort.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter


Re: encoding names

From
Tatsuo Ishii
Date:
> Then the server needs to look at the protocol number to decide what to
> send back.  But we need to be able to move forward with the encoding names
> sooner or later anyway.

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.
--
Tatsuo Ishii

Re: encoding names

From
Bruce Momjian
Date:
> > Then the server needs to look at the protocol number to decide what to
> > send back.  But we need to be able to move forward with the encoding names
> > sooner or later anyway.
>
> I'm not sure if we are going to raise the FE/BE protocol number for
> 7.2.

We are not, as far as I know.  I have made my changes without doing
that.

However, this brings up the issue of how a backend will fail if the
client provides a newer protocol version.  I think we should get it to
send back its current protocol version and see if the client responds
with a protocol version we can accept.  I know we don't need it now, but
when we do need to up the protocol version number, we are stuck because
of the older releases that can't cope with this.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: encoding names

From
Tatsuo Ishii
Date:
> > I'm not sure if we are going to raise the FE/BE protocol number for
> > 7.2.
>
> We are not, as far as I know.  I have made my changes without doing
> that.

Ok. I think we should keep getdatabaseencoding() behaving as in 7.1, and
add a new function which returns the official encoding names.
--
Tatsuo Ishii

Re: encoding names

From
Karel Zak
Date:
On Fri, Aug 17, 2001 at 06:11:00PM +0200, Peter Eisentraut wrote:
> Karel Zak writes:
>
> > - possible is use synonyms for encoding (an example ISO-8859-1,
> >   Latin1, l1)
>
> On the choice of synonyms:  Do we really want to add that many synonyms
> that are not the standard name?  I think the following are not necessary:
>
> cyrillic, cp819, ibm819, isoir100x, l1-4

 IMHO it is not a problem if PG understands more aliases, or is there some
relevant problem with it? :-)

> ISO 8859 is a pretty well-know term these days.
>
> KOI8 needs to be aliased as koi8r.  Unicode is not a valid encoding name,

 Agree.

> actually.  Do you know what encoding is stands for and could you add that
> as an alias?
>
> On the code:
>
> #ifdef WIN32
>    #include "win32.h"
> #else
>    #include <unistd.h>
> #endif
>
> needs to be written as
>
> #ifdef WIN32
> #   include "win32.h"
> #else
> #   include <unistd.h>
> #endif
>
> for portability.

 OK, but it sounds curious (how does a compiler have a problem with it?)

> For extra credit:  A patch to configure and the documentation.

 :-) It needs time... but yes, I will add it to the next patch version.


 Thanks for the suggestions.

        Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Re: encoding names

From
Karel Zak
Date:
On Fri, Aug 17, 2001 at 10:37:18AM -0700, Barry Lind wrote:
> This patch will break the JDBC driver.  The jdbc driver relies on the
> value returned by getdatabaseencoding() to determine the server encoding
> so that it can convert to unicode.  This patch changes the return values
> for getdatabaseencoding() such that the driver will no longer work.  For
> example "LATIN1" which used to be returned will now come back as
> "iso88591".  This change in behaviour impacts the JDBC driver and any
> other application that is depending on the output of the
> getdatabaseencoding() function.

 Hmm... but I agree with Peter that the correct solution is to rewrite it to
use the standard names.

> I would recommend that getdatabaseencoding() return the old names for
> backward compatibility and then deprecate this function to be removed in
  ^^^^^^^^^^^^^^^^^^^^^
 We could end up like the great Microsoft systems... a nice face but terrible
old stuff in the kernel.

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Re: encoding names

From
Karel Zak
Date:
On Sun, Aug 19, 2001 at 11:02:49AM +0900, Tatsuo Ishii wrote:
> > > I'm not sure if we are going to raise the FE/BE protocol number for
> > > 7.2.
> >
> > We are not, as far as I know.  I have made my changes without doing
> > that.
>
> Ok. I think we should keep getdatabaseencoding() behaves as 7.1, and
> add a new function which returns official encoding names.

 OK, is there some suggestion for the name of this function?

                Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Re: encoding names

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> However, this brings up the issue of how a backend will fail if the
> client provides a newer protocol version.  I think we should get it to
> send back its current protocol version and see if the client responds
> with a protocol version we can accept.

Why?  A client that wants to do this can retry with a lower version
number upon seeing the "unsupported protocol version" failure.  There's
no need to change the postmaster code --- indeed, doing so would negate
the main value of such a feature, namely being able to talk to *old*
postmasters.
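
As a rough sketch of the client-side retry described here: try the newest
version first and step down only when the failure is the "unsupported
protocol version" one. Everything named below (conn_status, try_connect_fn,
negotiate_protocol, fake_old_postmaster) is hypothetical and is not libpq
API; the fake postmaster only stands in for an old server.

```c
/* Hypothetical result codes and connect callback; not libpq API. */
typedef enum
{
    CONN_OK,
    CONN_UNSUPPORTED_PROTOCOL,
    CONN_FAILED
} conn_status;

typedef conn_status (*try_connect_fn) (int protocol_version, void *arg);

/*
 * Try protocol versions from newest down to oldest.
 * Returns the version that was accepted, or -1 if none was.
 */
int
negotiate_protocol(try_connect_fn try_connect, void *arg,
                   int newest, int oldest)
{
    int v;

    for (v = newest; v >= oldest; v--)
    {
        conn_status st = try_connect(v, arg);

        if (st == CONN_OK)
            return v;
        if (st != CONN_UNSUPPORTED_PROTOCOL)
            return -1;      /* some other failure: a lower version won't help */
    }
    return -1;
}

/* Stand-in for an old postmaster that accepts nothing newer than *arg. */
static conn_status
fake_old_postmaster(int protocol_version, void *arg)
{
    int max_supported = *(int *) arg;

    return protocol_version <= max_supported
        ? CONN_OK
        : CONN_UNSUPPORTED_PROTOCOL;
}
```

The point of keeping the loop in the client is that the server side stays
untouched, so the same client can still talk to old postmasters.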

            regards, tom lane

Re: encoding names

From
Tatsuo Ishii
Date:
> On Sun, Aug 19, 2001 at 11:02:49AM +0900, Tatsuo Ishii wrote:
> > > > I'm not sure if we are going to raise the FE/BE protocol number for
> > > > 7.2.
> > >
> > > We are not, as far as I know.  I have made my changes without doing
> > > that.
> >
> > Ok. I think we should keep getdatabaseencoding() behaves as 7.1, and
> > add a new function which returns official encoding names.
>
>  Ok, Is here some suggestion for name of this function?

The new function returns "canonical database encoding names", so
"get_canonical_database_encoding" or a shorter name looks appropriate
to me.
--
Tatsuo Ishii

Re: encoding names

From
Karel Zak
Date:
On Wed, Aug 22, 2001 at 05:09:50PM +0900, Tatsuo Ishii wrote:
>
> The new function returns "canonical database encoding names". So
>
> "get_canonical_database_encoding" or shorter name looks appropriate
> to me.

 Oops, I overlooked this mail in my inbox. Hmm... I used getdbencoding(),
but we can change it later (before the 7.2 release, of course). It's
a cosmetic change.

        Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

Re: encoding names

From
Tatsuo Ishii
Date:
> On Wed, Aug 22, 2001 at 05:09:50PM +0900, Tatsuo Ishii wrote:
> >
> > The new function returns "canonical database encoding names". So
> >
> > "get_canonical_database_encoding" or shorter name looks appropriate
> > to me.
>
>  Oops, I overlook this mail in my inbox. Hmm .. I use getdbencoding(),
> but we can change it later (before 7.2 release of course). It's
> cosmetic change.

I don't think you need to change the function name "getdbencoding".
"get_canonical_database_encoding" is too long anyway.
--
Tatsuo Ishii

Re: encoding names

From
Peter Eisentraut
Date:
Tatsuo Ishii writes:

> I don't think you need to change the function name "getdbencoding".
> "get_canonical_database_encoding" is too long anyway.

But getdbencoding isn't semantically different from the old
getdatabaseencoding.  "encoding" isn't the right term anyway, methinks, it
should be "character set".  So maybe database_character_set()?  (No "get"
please.)

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: encoding names

From
Tatsuo Ishii
Date:
> Tatsuo Ishii writes:
>
> > I don't think you need to change the function name "getdbencoding".
> > "get_canonical_database_encoding" is too long anyway.
>
> But getdbencoding isn't semantically different from the old
> getdatabaseencoding.  "encoding" isn't the right term anyway, methinks, it
> should be "character set".  So maybe database_character_set()?  (No "get"
> please.)

I'm not a native English speaker, so please feel free to choose a more
appropriate name.

BTW, what's wrong with "encoding"? I don't think that, for example, EUC-JP
or UTF-8 are character set names.
--
Tatsuo Ishii