Thread: encoding names
Hi,

attached is a patch with:

 - new encoding names stuff with better performance (binary search
   instead of for(), and some needless searching is prevented)

 - it is possible to use synonyms for an encoding (for example
   ISO-8859-1, Latin1, l1)

 - Peter's idea about "encoding names clearing" is implemented
   (chars other than [A-Za-z0-9] are irrelevant -- 'ISO-8859-1' is the
   same as 'iso8859_1' or iso-8-8-5-9-1 :-)

 - the routines for this are shared between FE and BE (no more defining
   encoding names separately in FE and BE)

 - the prefix PG_ is added to the encoding identifier macros; something
   like 'ALT' is pretty dirty in source code, rather use PG_ALT.

 (Note: the patch adds the new file mb/encname.c and removes mb/common.c)

			Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
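For readers without the attachment, the name clearing and binary search described in this message could look roughly like the sketch below. It is not the code from the patch: the `enc_id` values, the alias table contents, and the function names `clean_name` and `encoding_from_name` are all made up for illustration. The idea is only that a name is reduced to lower-case [a-z0-9] first, and the cleaned key is then looked up with `bsearch()` in a table sorted by name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical identifiers; the real PG_* list is not shown in this thread. */
typedef enum { ENC_UNKNOWN = -1, ENC_LATIN1, ENC_KOI8R, ENC_WIN1251 } enc_id;

typedef struct
{
    const char *name;   /* already cleaned: lower-case, [a-z0-9] only */
    enc_id      id;
} enc_alias;

/* Illustrative aliases only. Must stay sorted in strcmp() order for bsearch(). */
static const enc_alias alias_table[] = {
    {"iso88591", ENC_LATIN1},
    {"koi8r",    ENC_KOI8R},
    {"l1",       ENC_LATIN1},
    {"latin1",   ENC_LATIN1},
    {"win1251",  ENC_WIN1251},
};

/*
 * Copy only [A-Za-z0-9] from "in" to "out", folding to lower case.
 * Returns 0 on success, -1 if the cleaned name does not fit in "out".
 */
static int
clean_name(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    assert(in != NULL && out != NULL && outsz > 0);
    for (; *in != '\0'; in++)
    {
        char c = *in;

        if (c >= 'A' && c <= 'Z')
            c = (char) (c - 'A' + 'a');
        else if (!((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')))
            continue;           /* irrelevant character, skip it */
        if (n + 1 >= outsz)
            return -1;          /* too long to be any known name */
        out[n++] = c;
    }
    out[n] = '\0';
    return 0;
}

/* bsearch() comparator: key is a cleaned string, elem is a table entry. */
static int
cmp_alias(const void *key, const void *elem)
{
    return strcmp((const char *) key, ((const enc_alias *) elem)->name);
}

/* Map any spelling of an encoding name to its identifier. */
enc_id
encoding_from_name(const char *name)
{
    char             key[32];
    const enc_alias *hit;

    if (clean_name(name, key, sizeof key) != 0)
        return ENC_UNKNOWN;
    hit = bsearch(key, alias_table,
                  sizeof alias_table / sizeof alias_table[0],
                  sizeof alias_table[0], cmp_alias);
    return hit ? hit->id : ENC_UNKNOWN;
}
```

With a table like this, 'ISO-8859-1', 'iso8859_1' and 'iso-8-8-5-9-1' all clean to the same key, so a synonym costs one more table row and no extra code.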
Karel Zak writes:

> - possible is use synonyms for encoding (an example ISO-8859-1,
>   Latin1, l1)

On the choice of synonyms: Do we really want to add that many synonyms
that are not the standard name? I think the following are not necessary:

cyrillic, cp819, ibm819, isoir100x, l1-4

ISO 8859 is a pretty well-known term these days.

KOI8 needs to be aliased as koi8r. Unicode is not a valid encoding name,
actually. Do you know what encoding it stands for and could you add that
as an alias?

On the code:

#ifdef WIN32
#include "win32.h"
#else
#include <unistd.h>
#endif

needs to be written as

#ifdef WIN32
# include "win32.h"
#else
# include <unistd.h>
#endif

for portability.

For extra credit: A patch to configure and the documentation.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
This patch will break the JDBC driver. The JDBC driver relies on the
value returned by getdatabaseencoding() to determine the server encoding
so that it can convert to Unicode. This patch changes the return values
for getdatabaseencoding() such that the driver will no longer work. For
example, "LATIN1", which used to be returned, will now come back as
"iso88591". This change in behaviour impacts the JDBC driver and any
other application that depends on the output of the
getdatabaseencoding() function.

I would recommend that getdatabaseencoding() return the old names for
backward compatibility and then deprecate this function to be removed in
the future. Then create a new function that returns the new encoding
names that can be used going forward.

thanks,
--Barry
Barry Lind writes:

> This patch will break the JDBC driver. The jdbc driver relies on the
> value returned by getdatabaseencoding() to determine the server encoding
> so that it can convert to unicode.

Then the driver needs to be changed to accept the new encoding names as
well. (Or couldn't we convert it to Unicode in the server?)

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
----- Original Message -----
From: Peter Eisentraut <peter_e@gmx.net>
Sent: Friday, August 17, 2001 12:11 PM

> Karel Zak writes:
>
> > - possible is use synonyms for encoding (an example ISO-8859-1,
> >   Latin1, l1)
>
> On the choice of synonyms: Do we really want to add that many synonyms
> that are not the standard name? I think the following are not necessary:
>
> cyrillic, cp819, ibm819, isoir100x, l1-4

I'm not sure about the others, but 'cyrillic' is quite an ambiguous alias,
because it can denote many Slavic languages: Russian, Ukrainian, Bulgarian
and Serbian are a few examples. So I believe it should be excluded from the
list of synonyms.

> KOI8 needs to be aliased as koi8r.

... and Karel, can you change these so they are consistent with the others:

> KOI8_to_utf(unsigned char *iso, unsigned char *utf, int len)
> {
>     local_to_utf(iso, utf, LUmapKOI8, sizeof(LUmapKOI8) / sizeof(pg_local_to_utf), PG_KOI8, len);
> }

to

koi8r_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^
{
    local_to_utf(iso, utf, LUmapKOI8R, sizeof(LUmapKOI8R) / sizeof(pg_local_to_utf), PG_KOI8R, len);
                                ^^^^^              ^^^^^                                ^^^^^
}

> WIN_to_utf(unsigned char *iso, unsigned char *utf, int len)
> {
>     local_to_utf(iso, utf, LUmapWIN, sizeof(LUmapWIN) / sizeof(pg_local_to_utf), PG_WIN1251, len);
> }

to

win1251_to_utf(unsigned char *iso, unsigned char *utf, int len)
^^^^^^^
{
    local_to_utf(iso, utf, LUmapWIN1251, sizeof(LUmapWIN1251) / sizeof(pg_local_to_utf), PG_WIN1251, len);
                                ^^^^^^^              ^^^^^^^                                ^^^^^^^
}

S.
> Barry Lind writes:
>
> > This patch will break the JDBC driver. The jdbc driver relies on the
> > value returned by getdatabaseencoding() to determine the server encoding
> > so that it can convert to unicode.
>
> Then the driver needs to be changed to accept the new encoding names as
> well. (Or couldn't we convert it to Unicode in the server?)

This will break the backward compatibility. I agree with Barry's opinion.
--
Tatsuo Ishii
Tatsuo Ishii writes:

> > Barry Lind writes:
> >
> > > This patch will break the JDBC driver. The jdbc driver relies on the
> > > value returned by getdatabaseencoding() to determine the server encoding
> > > so that it can convert to unicode.
> >
> > Then the driver needs to be changed to accept the new encoding names as
> > well. (Or couldn't we convert it to Unicode in the server?)
>
> This will break the backward compatibility.

How so?

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
> > > > This patch will break the JDBC driver. The jdbc driver relies on the
> > > > value returned by getdatabaseencoding() to determine the server encoding
> > > > so that it can convert to unicode.
> > >
> > > Then the driver needs to be changed to accept the new encoding names as
> > > well. (Or couldn't we convert it to Unicode in the server?)
> >
> > This will break the backward compatibility.
>
> How so?

Apparently the 7.1 JDBC driver does not understand the value that the 7.2
getdatabaseencoding() returns.
--
Tatsuo Ishii
Tatsuo Ishii writes:

> Apparently 7.1 JDBC driver does not understand the value 7.2
> getdatabaseencoding() returns.

Then the server needs to look at the protocol number to decide what to
send back. But we need to be able to move forward with the encoding names
sooner or later anyway.

However, the 7.1 JDBC driver is going to be incompatible with a 7.2 server
in a number of other areas as well, so I'm not completely sure whether
it'd be worth the effort.

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
> Then the server needs to look at the protocol number to decide what to
> send back. But we need to be able to move forward with the encoding names
> sooner or later anyway.

I'm not sure if we are going to raise the FE/BE protocol number for
7.2.
--
Tatsuo Ishii
> > Then the server needs to look at the protocol number to decide what to
> > send back. But we need to be able to move forward with the encoding names
> > sooner or later anyway.
>
> I'm not sure if we are going to raise the FE/BE protocol number for
> 7.2.

We are not, as far as I know. I have made my changes without doing
that.

However, this brings up the issue of how a backend will fail if the
client provides a newer protocol version. I think we should get it to
send back its current protocol version and see if the client responds
with a protocol version we can accept. I know we don't need it now, but
when we do need to up the protocol version number, we are stuck because
of the older releases that can't cope with this.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > I'm not sure if we are going to raise the FE/BE protocol number for
> > 7.2.
>
> We are not, as far as I know. I have made my changes without doing
> that.

Ok. I think we should keep getdatabaseencoding() behaving as in 7.1, and
add a new function which returns the official encoding names.
--
Tatsuo Ishii
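The two-function arrangement proposed here can be pictured with a single table that carries both spellings, as in the sketch below. Everything in it is hypothetical: the `enc_id` values and the function names `legacy_encoding_name` and `canonical_encoding_name` are invented, and the real functions would read the current database's encoding rather than take an argument. The "LATIN1"/"iso88591" pair is the one quoted earlier in the thread; the KOI8 row is an assumption based on the alias remark.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical identifiers, not the patch's PG_* macros. */
typedef enum { ENC_LATIN1, ENC_KOI8R, ENC_COUNT } enc_id;

typedef struct
{
    const char *legacy_name;     /* what a 7.1 client expects to see */
    const char *canonical_name;  /* the new, cleaned spelling */
} enc_names;

static const enc_names names[ENC_COUNT] = {
    [ENC_LATIN1] = {"LATIN1", "iso88591"},  /* pair quoted in the thread */
    [ENC_KOI8R]  = {"KOI8",   "koi8r"},     /* assumed for illustration */
};

/* Stand-in for the old function: keeps returning the 7.1 spelling. */
const char *
legacy_encoding_name(enc_id id)
{
    assert((int) id >= 0 && id < ENC_COUNT);
    return names[id].legacy_name;
}

/* Stand-in for the proposed new function: returns the new spelling. */
const char *
canonical_encoding_name(enc_id id)
{
    assert((int) id >= 0 && id < ENC_COUNT);
    return names[id].canonical_name;
}
```

Old clients that compare against "LATIN1" keep working, while new code can ask for the canonical spelling, and both answers come from one table so they cannot drift apart.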
On Fri, Aug 17, 2001 at 06:11:00PM +0200, Peter Eisentraut wrote:
> Karel Zak writes:
>
> > - possible is use synonyms for encoding (an example ISO-8859-1,
> >   Latin1, l1)
>
> On the choice of synonyms: Do we really want to add that many synonyms
> that are not the standard name? I think the following are not necessary:
>
> cyrillic, cp819, ibm819, isoir100x, l1-4

 IMHO it is not a problem if PG understands more aliases, or is there
 some relevant problem with it? :-)

> ISO 8859 is a pretty well-know term these days.
>
> KOI8 needs to be aliased as koi8r. Unicode is not a valid encoding name,

 Agreed.

> actually. Do you know what encoding is stands for and could you add that
> as an alias?
>
> On the code:
>
> #ifdef WIN32
> #include "win32.h"
> #else
> #include <unistd.h>
> #endif
>
> needs to be written as
>
> #ifdef WIN32
> # include "win32.h"
> #else
> # include <unistd.h>
> #endif
>
> for portability.

 OK, but it sounds curious (how can a compiler have a problem with it?)

> For extra credit: A patch to configure and the documentation.

 :-) That needs time... but yes, I will add it to the next patch version.

 Thanks for the suggestions.

			Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
On Fri, Aug 17, 2001 at 10:37:18AM -0700, Barry Lind wrote:
> This patch will break the JDBC driver. The jdbc driver relies on the
> value returned by getdatabaseencoding() to determine the server encoding
> so that it can convert to unicode. This patch changes the return values
> for getdatabaseencoding() such that the driver will no longer work. For
> example "LATIN1" which used to be returned will now come back as
> "iso88591". This change in behaviour impacts the JDBC driver and any
> other application that is depending on the output of the
> getdatabaseencoding() function.

 Hmm.. but I agree with Peter that the correct solution is to rewrite it
 to use the standard names.

> I would recommend that getdatabaseencoding() return the old names for
> backward compatibility and then deprecate this function to be removed in
  ^^^^^^^^^^^^^^^^^^^^^

 We can finish like the great Microsoft systems... nice face but terrible
 old stuff in the kernel.

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
On Sun, Aug 19, 2001 at 11:02:49AM +0900, Tatsuo Ishii wrote:
> > > I'm not sure if we are going to raise the FE/BE protocol number for
> > > 7.2.
> >
> > We are not, as far as I know. I have made my changes without doing
> > that.
>
> Ok. I think we should keep getdatabaseencoding() behaves as 7.1, and
> add a new function which returns official encoding names.

 OK, is there any suggestion for the name of this function?

			Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> However, this brings up the issue of how a backend will fail if the
> client provides a newer protocol version. I think we should get it to
> send back its current protocol version and see if the client responds
> with a protocol version we can accept.

Why? A client that wants to do this can retry with a lower version
number upon seeing the "unsupported protocol version" failure. There's
no need to change the postmaster code --- indeed, doing so would negate
the main value of such a feature, namely being able to talk to *old*
postmasters.

			regards, tom lane
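The client-side retry described in this message might be sketched as below. This is an illustration only, not libpq code: `connect_fn`, `conn_result`, `connect_with_fallback` and the `fake_connect` stub are invented names, and protocol versions are simplified to plain integers. The point is that the fallback lives entirely in the client, so it also works against servers that were never changed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one connection attempt. */
typedef enum { CONN_OK, CONN_BAD_VERSION, CONN_ERROR } conn_result;

/* Hypothetical callback that tries to connect with a given protocol version. */
typedef conn_result (*connect_fn) (int protocol_version, void *ctx);

/*
 * Try the newest version first and step down only when the server rejects
 * the version as unsupported. Returns the version that worked, or -1 if
 * none did or if an unrelated error occurred.
 */
int
connect_with_fallback(connect_fn try_connect, void *ctx, int newest, int oldest)
{
    assert(try_connect != NULL && newest >= oldest);
    for (int v = newest; v >= oldest; v--)
    {
        conn_result r = try_connect(v, ctx);

        if (r == CONN_OK)
            return v;
        if (r != CONN_BAD_VERSION)
            return -1;          /* some other failure: retrying will not help */
    }
    return -1;
}

/*
 * Stand-in for a real connection attempt: ctx points to the highest version
 * the pretend server speaks; anything newer is rejected as unsupported.
 */
conn_result
fake_connect(int protocol_version, void *ctx)
{
    int max_supported = *(const int *) ctx;

    return protocol_version <= max_supported ? CONN_OK : CONN_BAD_VERSION;
}
```

A client that speaks versions 1 through 3 and meets a server that only knows version 2 would fail once and then succeed on the second attempt, with no change on the server side.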
> On Sun, Aug 19, 2001 at 11:02:49AM +0900, Tatsuo Ishii wrote:
> > > > I'm not sure if we are going to raise the FE/BE protocol number for
> > > > 7.2.
> > >
> > > We are not, as far as I know. I have made my changes without doing
> > > that.
> >
> > Ok. I think we should keep getdatabaseencoding() behaves as 7.1, and
> > add a new function which returns official encoding names.
>
> Ok, Is here some suggestion for name of this function?

The new function returns "canonical database encoding names". So

"get_canonical_database_encoding" or shorter name looks appropriate
to me.
--
Tatsuo Ishii
On Wed, Aug 22, 2001 at 05:09:50PM +0900, Tatsuo Ishii wrote:
>
> The new function returns "canonical database encoding names". So
>
> "get_canonical_database_encoding" or shorter name looks appropriate
> to me.

 Oops, I overlooked this mail in my inbox. Hmm .. I use getdbencoding(),
 but we can change it later (before the 7.2 release of course). It's a
 cosmetic change.

			Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/

 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
> On Wed, Aug 22, 2001 at 05:09:50PM +0900, Tatsuo Ishii wrote:
> >
> > The new function returns "canonical database encoding names". So
> >
> > "get_canonical_database_encoding" or shorter name looks appropriate
> > to me.
>
> Oops, I overlook this mail in my inbox. Hmm .. I use getdbencoding(),
> but we can change it later (before 7.2 release of course). It's
> cosmetic change.

I don't think you need to change the function name "getdbencoding".
"get_canonical_database_encoding" is too long anyway.
--
Tatsuo Ishii
Tatsuo Ishii writes:

> I don't think you need to change the function name "getdbencoding".
> "get_canonical_database_encoding" is too long anyway.

But getdbencoding isn't semantically different from the old
getdatabaseencoding. "encoding" isn't the right term anyway, methinks, it
should be "character set". So maybe database_character_set()? (No "get"
please.)

--
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter
> Tatsuo Ishii writes:
>
> > I don't think you need to change the function name "getdbencoding".
> > "get_canonical_database_encoding" is too long anyway.
>
> But getdbencoding isn't semantically different from the old
> getdatabaseencoding. "encoding" isn't the right term anyway, methinks, it
> should be "character set". So maybe database_character_set()? (No "get"
> please.)

I'm not a native English speaker, so please feel free to choose a more
appropriate name. BTW, what's wrong with "encoding"? I don't think that,
for example, EUC-JP or UTF-8 are character set names.
--
Tatsuo Ishii