Thread: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
"Marcus Raphelt"
Date:
The following bug has been logged online:

Bug reference:      1775
Logged by:          Marcus Raphelt
Email address:      marcus@raphelt.de
PostgreSQL version: latest
Operating system:   WinXP (dev), Win2000 (server)
Description:        UTF8 to ISO does not convert the german "sharp s"
(ß)
Details:

Hi,

I currently have a iso_8859_1 app and included the opengeodb tables to have
a set of german cities and zipcodes.

http://opengeodb.sourceforge.net/index.php?FrontPage_en
I downloaded this dump:

http://prdownloads.sourceforge.net/opengeodb/opengeodb-0.2.3c-UTF8-postgres.
zip?download

They offer a Postgresql dump that can be imported without hassle, but is in
UTF8. I did several attempts using convert() to convert the city names to
iso. It works, and it also works fine on german umlauts (äöü), but it
seems like convert() does NOT convert the german "ß" / sharp s /
ß...

cheers,
marcus

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
"Magnus Hagander"
Date:
PiBCdWcgcmVmZXJlbmNlOiAgICAgIDE3NzUNCj4gTG9nZ2VkIGJ5OiAgICAg
ICAgICBNYXJjdXMgUmFwaGVsdA0KPiBFbWFpbCBhZGRyZXNzOiAgICAgIG1h
cmN1c0ByYXBoZWx0LmRlDQo+IFBvc3RncmVTUUwgdmVyc2lvbjogbGF0ZXN0
DQo+IE9wZXJhdGluZyBzeXN0ZW06ICAgV2luWFAgKGRldiksIFdpbjIwMDAg
KHNlcnZlcikNCj4gRGVzY3JpcHRpb246ICAgICAgICBVVEY4IHRvIElTTyBk
b2VzIG5vdCBjb252ZXJ0IHRoZSBnZXJtYW4gInNoYXJwIHMiDQo+ICgmc3ps
aWc7KQ0KPiBEZXRhaWxzOiANCj4gDQo+IEhpLA0KPiANCj4gSSBjdXJyZW50
bHkgaGF2ZSBhIGlzb184ODU5XzEgYXBwIGFuZCBpbmNsdWRlZCB0aGUgb3Bl
bmdlb2RiIA0KPiB0YWJsZXMgdG8gaGF2ZSBhIHNldCBvZiBnZXJtYW4gY2l0
aWVzIGFuZCB6aXBjb2Rlcy4NCj4gDQo+IGh0dHA6Ly9vcGVuZ2VvZGIuc291
cmNlZm9yZ2UubmV0L2luZGV4LnBocD9Gcm9udFBhZ2VfZW4NCj4gSSBkb3du
bG9hZGVkIHRoaXMgZHVtcDoNCj4gDQo+IGh0dHA6Ly9wcmRvd25sb2Fkcy5z
b3VyY2Vmb3JnZS5uZXQvb3Blbmdlb2RiL29wZW5nZW9kYi0wLjIuM2MtDQo+
IFVURjgtcG9zdGdyZXMuDQo+IHppcD9kb3dubG9hZA0KPiANCj4gVGhleSBv
ZmZlciBhIFBvc3RncmVzcWwgZHVtcCB0aGF0IGNhbiBiZSBpbXBvcnRlZCB3
aXRob3V0IA0KPiBoYXNzbGUsIGJ1dCBpcyBpbiBVVEY4LiBJIGRpZCBzZXZl
cmFsIGF0dGVtcHRzIHVzaW5nIA0KPiBjb252ZXJ0KCkgdG8gY29udmVydCB0
aGUgY2l0eSBuYW1lcyB0byBpc28uIEl0IHdvcmtzLCBhbmQgaXQgDQo+IGFs
c28gd29ya3MgZmluZSBvbiBnZXJtYW4gdW1sYXV0cyAow4PCpMODwrbDg8K8
KSwgYnV0IGl0IHNlZW1zIGxpa2UgDQo+IGNvbnZlcnQoKSBkb2VzIE5PVCBj
b252ZXJ0IHRoZSBnZXJtYW4gIsODxbgiIC8gc2hhcnAgcyAvICZzemxpZzsu
Li4NCg0KUG9zdGdyZVNRTCA4LjAgZG9lcyBub3Qgc3VwcG9ydCBVVEY4IG9u
IFdpbjMyLiBZb3Ugd2lsbCBuZWVkIHRvIGNvbnZlcnQgdGhlIGZpbGUgYmVm
b3JlIHlvdSBsb2FkIGl0IGludG8gdGhlIGRhdGFiYXNlIC0gbG9vayBhdCBm
b3IgZXhhbXBsZSB0aGUgR05VIGljb252IHByb2dyYW0gdG8gZG8gdGhpcy4N
Cg0KLy9NYWdudXMNCg==

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
Peter Eisentraut
Date:
Magnus Hagander wrote:
> PostgreSQL 8.0 does not support UTF8 on Win32. You will need to
> convert the file before you load it into the database - look at for
> example the GNU iconv program to do this.

PostgreSQL does not support UTF8 locales on Windows, but the character
conversion should work the same on all platforms, shouldn't it?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
"Magnus Hagander"
Date:
> > PostgreSQL 8.0 does not support UTF8 on Win32. You will need to=20
> > convert the file before you load it into the database - look at for=20
> > example the GNU iconv program to do this.
>=20
> PostgreSQL does not support UTF8 locales on Windows, but the=20
> character conversion should work the same on all platforms,=20
> shouldn't it?

I thought that was also handled in the locale code. But now that you say
it, it's probably not - it should be the same.
I'll withdraw my comment then, clearly wrong. Thanks for clearing that
up.


//Magnus

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
"John Hansen"
Date:
Magnus Hagander Wrote:
> > > PostgreSQL 8.0 does not support UTF8 on Win32. You will need to=20
> > > convert the file before you load it into the database -=20
> look at for=20
> > > example the GNU iconv program to do this.
> >=20
> > PostgreSQL does not support UTF8 locales on Windows, but=20
> the character=20
> > conversion should work the same on all platforms, shouldn't it?
>=20
> I thought that was also handled in the locale code. But now=20
> that you say it, it's probably not - it should be the same.
> I'll withdraw my comment then, clearly wrong. Thanks for=20
> clearing that up.

Currently, it is very much so handled by the systems locale support,
hence it is handled differently depending on platform.

ICU support should make it to 8.1 I hear, which will make the bahaviour
uniform.

... John

>=20
>=20
> //Magnus
>=20
> ---------------------------(end of=20
> broadcast)---------------------------
> TIP 6: explain analyze is your friend
>=20
>=20

Re: BUG #1775: UTF8 to ISO does not convert the german

From
Tatsuo Ishii
Date:
> Magnus Hagander wrote:
> > PostgreSQL 8.0 does not support UTF8 on Win32. You will need to
> > convert the file before you load it into the database - look at for
> > example the GNU iconv program to do this.
>
> PostgreSQL does not support UTF8 locales on Windows, but the character
> conversion should work the same on all platforms, shouldn't it?

Right.

BTW the original bug report sounds interesting to me. The conversion
between UTF-8 and ISO 8859-1 is handled by a program logic, not by
conversion tables. So if german umlauts are converted fine, there's no
reason the conversion for german sharp s does not work.

Marcus,

Can you give me the exact error message from PostgreSQL when the
conversio failed?
--
Tatsuo Ishii

Re: BUG #1775: UTF8 to ISO does not convert the german

From
Tatsuo Ishii
Date:
> Hi Tatsuo,
>
> Am Mittwoch, 20. Juli 2005 um 01:00 schrieben Sie:
>
> TI> conversion tables. So if german umlauts are converted fine, there's no
> TI> reason the conversion for german sharp s does not work.
>
> TI> Marcus,
>
> TI> Can you give me the exact error message from PostgreSQL when the
> TI> conversio failed?
>
> Well - actually, there IS no error message, convert() just returns an
> empty sting. In detail, here's what I did:
>
> -Set up PostgreSQL (Ascii)
>
> -Imported the opengeodb
> http://sourceforge.net/project/showfiles.php?group_id=132421
> As they offer a PostgreSQL dump, I chose this one.
>
> -Imported the dump using pgAdminIII
>
> -Created a view that returns all german cities with ZIP and
> opengeodb-locationID:
>
> CREATE OR REPLACE VIEW orte_de AS
> SELECT code.text_val AS plz, code.loc_id, town.text_val AS ort
> FROM geodb_hierarchies hi, geodb_textdata state, geodb_textdata town, geodb_textdata code
> WHERE hi.id_lvl2 = state.loc_id AND state.text_val = 'DE'::text
> AND state.text_type = 500100001 AND town.loc_id = hi.loc_id
> AND town.text_type = 500100000 AND code.loc_id = town.loc_id
> AND code.text_type = 500300000;
>
> So now I've got three columns: "plz" (zip), "ort" (city) and loc_id.
> Assuming I want to retrieve cites in the Hamburg / Hannover area...
> (perfect for this task, as they have pretty strange city names there
> :)
>
> SELECT plz, loc_id, ort, convert(ort using utf_8_to_iso_8859_1) as
> ort_conv
> from orte_de
> where plz between 20000 and 30000
> order by ort_conv
>
> This query returns empty values for "ort_conv" if "ort" contains a
> sharp s.
> Btw, it seems like it is the same for "Ä" (Ä), have a look at
> loc_id 25182.

I see no problem with encoding conversion itself:

utf8=# \encoding latin1
utf8=# select * from t1;
 t
----
 $Bd_(B
(1 row)

> SELECT plz, loc_id, ort, convert(ort using utf_8_to_iso_8859_1) as
> ort_conv
> from orte_de
> where plz between 20000 and 30000
> order by ort_conv

Problem here is the result of convert(ort using utf_8_to_iso_8859_1)
is ISO-8859-1 but your database encoding is UTF-8, so the terminal
treats ISO-8859-1 chars as UTF-8 which will result in unexpected
characters displayed.

I guess what you want to do is:

SELECT plz, loc_id, ort from orte_de
where plz between 20000 and 30000
order by convert(ort using utf_8_to_iso_8859_1)
--
Tatsuo Ishii

Re: BUG #1775: UTF8 to ISO does not convert the german

From
Tatsuo Ishii
Date:
> TI> Problem here is the result of convert(ort using utf_8_to_iso_8859_1)
> TI> is ISO-8859-1 but your database encoding is UTF-8,
>
> Not exactly, my database is in iso_8859_1, only the opengeodb-dump is
> solely available in utf-8.
> The whole thing is no problem so far, as I can convert the values in
> the application that uses the data - I just wanted to let you know.

That's surprising. Since you try to convert from UTF-8 to ISO 8859-1,
I thought you store data as UTF-8.

> TI> I guess what you want to do is:
> TI> SELECT plz, loc_id, ort from orte_de
> TI> where plz between 20000 and 30000
> TI> order by convert(ort using utf_8_to_iso_8859_1)
>
> Well, that was a fantasy-query :) just to show some "odd" values, I do
> not need that much data in the application. Currently, I cannot
> retrieve ready-converted iso data from the orte_de view, as converted
> names that contain "ß" are not shown...

Let me make sure. You store UTF-8 data in ISO 8859-1 DB?
--
Tatsuo Ishii

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
Marcus Raphelt
Date:
Hi,

Am Mittwoch, 20. Juli 2005 um 10:27 schrieben Sie:

TI> Problem here is the result of convert(ort using utf_8_to_iso_8859_1)
TI> is ISO-8859-1 but your database encoding is UTF-8,

Not exactly, my database is in iso_8859_1, only the opengeodb-dump is
solely available in utf-8.
The whole thing is no problem so far, as I can convert the values in
the application that uses the data - I just wanted to let you know.

TI> I guess what you want to do is:
TI> SELECT plz, loc_id, ort from orte_de
TI> where plz between 20000 and 30000
TI> order by convert(ort using utf_8_to_iso_8859_1)

Well, that was a fantasy-query :) just to show some "odd" values, I do
not need that much data in the application. Currently, I cannot
retrieve ready-converted iso data from the orte_de view, as converted
names that contain "ß" are not shown...


--
Mit freundlichen Grüssen
Marcus Raphelt                            mailto:marcus@raphelt.de

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
Marcus Raphelt
Date:
Hi Tatsuo,

Am Mittwoch, 20. Juli 2005 um 01:00 schrieben Sie:

TI> conversion tables. So if german umlauts are converted fine, there's no
TI> reason the conversion for german sharp s does not work.

TI> Marcus,

TI> Can you give me the exact error message from PostgreSQL when the
TI> conversio failed?

Well - actually, there IS no error message, convert() just returns an
empty sting. In detail, here's what I did:

-Set up PostgreSQL (Ascii)

-Imported the opengeodb
http://sourceforge.net/project/showfiles.php?group_id=132421
As they offer a PostgreSQL dump, I chose this one.

-Imported the dump using pgAdminIII

-Created a view that returns all german cities with ZIP and
opengeodb-locationID:

CREATE OR REPLACE VIEW orte_de AS
SELECT code.text_val AS plz, code.loc_id, town.text_val AS ort
FROM geodb_hierarchies hi, geodb_textdata state, geodb_textdata town, geodb_textdata code
WHERE hi.id_lvl2 = state.loc_id AND state.text_val = 'DE'::text
AND state.text_type = 500100001 AND town.loc_id = hi.loc_id
AND town.text_type = 500100000 AND code.loc_id = town.loc_id
AND code.text_type = 500300000;

So now I've got three columns: "plz" (zip), "ort" (city) and loc_id.
Assuming I want to retrieve cites in the Hamburg / Hannover area...
(perfect for this task, as they have pretty strange city names there
:)

SELECT plz, loc_id, ort, convert(ort using utf_8_to_iso_8859_1) as
ort_conv
from orte_de
where plz between 20000 and 30000
order by ort_conv

This query returns empty values for "ort_conv" if "ort" contains a
sharp s.
Btw, it seems like it is the same for "Ä" (Ä), have a look at
loc_id 25182.

--
Mit freundlichen Grüssen
Marcus Raphelt                            mailto:marcus@raphelt.de

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
Marcus Raphelt
Date:
Hi,

PE> PostgreSQL does not support UTF8 locales on Windows, but the character
PE> conversion should work the same on all platforms, shouldn't it?

It works pretty fine even under win32, all umlauts are coverted as
expexted, except for the "ß" (ß).
For example, "Köln" converts to "Köln", but ""Groß Kreutz" results
in an empty string.

--
Mit freundlichen Grüssen
Marcus Raphelt                            mailto:marcus@raphelt.de

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
Bruno Wolff III
Date:
On Wed, Jul 20, 2005 at 08:35:33 +1000,
  John Hansen <john@geeknet.com.au> wrote:
>
> ICU support should make it to 8.1 I hear, which will make the bahaviour
> uniform.

There was some discussion about that, but my impression was that that
wasn't going to happen.

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
"Magnus Hagander"
Date:
> > ICU support should make it to 8.1 I hear, which will make the=20
> > bahaviour uniform.
>=20
> There was some discussion about that, but my impression was=20
> that that wasn't going to happen.

AFAIK the result was that it would go in as an alternative, but not to
*replace* what's there now.

//Magnus

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
Tom Lane
Date:
"Magnus Hagander" <mha@sollentuna.net> writes:
>>> ICU support should make it to 8.1 I hear, which will make the
>>> bahaviour uniform.
>>
>> There was some discussion about that, but my impression was
>> that that wasn't going to happen.

> AFAIK the result was that it would go in as an alternative, but not to
> *replace* what's there now.

The whole issue is still TBD as far as I know --- there were arguments
on both sides but not consensus, and if any committer has reviewed the
ICU patch in any detail, I didn't hear about it.

It'd probably be easier to get consensus for a "make it a compile-time
option" proposal than for a "wholesale replacement" proposal.

            regards, tom lane

Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß)

From
"Magnus Hagander"
Date:
> >>> ICU support should make it to 8.1 I hear, which will make the=20
> >>> bahaviour uniform.
> >>=20
> >> There was some discussion about that, but my impression=20
> was that that=20
> >> wasn't going to happen.
>=20
> > AFAIK the result was that it would go in as an alternative,=20
> but not to
> > *replace* what's there now.
>=20
> The whole issue is still TBD as far as I know --- there were=20
> arguments on both sides but not consensus, and if any=20
> committer has reviewed the ICU patch in any detail, I didn't=20
> hear about it.

Naturally it has to pass the standard requirements for an acceptable
patch. But IIRC the early version was checked over (I think by Bruce).
Not in details, but as an overview. I fully realise that doesn't mean
it's approved, but it's a step ;-)


> It'd probably be easier to get consensus for a "make it a=20
> compile-time option" proposal than for a "wholesale=20
> replacement" proposal.

My point exactly. Considering it's vital for e.g. win32 to get unicode,
and apparantly pretty good for freebsd people if I read Palle right,
whereas other platforms have what they need already...


//Magnus