Thread: proposal: UTF8 to_ascii function
Hello, combination functions to_ascii and convert_to is broken now. Problem is in convert_to function. It doesn't support 8bit output encoding. Current workaround: CREATE FUNCTION to_ascii(bytea, name) RETURNS text AS 'to_ascii_encname' LANGUAGE internal; SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2'); I don't expect column collate for 8.4, so we need to have workable to_ascii function. I propose function to_ascii(text, name) that internally convert text from utf8 encoding when it's necessary. Regards Pavel Stehule
Pavel Stehule wrote: > Hello, > > combination functions to_ascii and convert_to is broken now. Problem > is in convert_to function. It doesn't support 8bit output encoding. > > Current workaround: > > CREATE FUNCTION to_ascii(bytea, name) > RETURNS text AS 'to_ascii_encname' LANGUAGE internal; > > SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2'); > > I don't expect column collate for 8.4, so we need to have workable > to_ascii function. > > I propose function to_ascii(text, name) that internally convert text > from utf8 encoding when it's necessary.cheers > > > convert_to is not broken. It returns a bytea, and it is up to you to de-escape it if you get the text representation. We are surely not going to go back to a situation where we have functions returning text in any encoding other than the database encoding. That becomes a vehicle for storing wrongly encoded data in the database, and we have just gone through the exercise of plugging those holes. I privately predicted when we did this work that it might motivate people who had been abusing convert_to to get proper support for multiple encodings done. That is the right way to go, not re-opening holes we have just very deliberately plugged. cheers andrew
Hello 2008/8/11 Andrew Dunstan <andrew@dunslane.net>: > > > Pavel Stehule wrote: >> >> Hello, >> >> combination functions to_ascii and convert_to is broken now. Problem >> is in convert_to function. It doesn't support 8bit output encoding. >> >> Current workaround: >> >> CREATE FUNCTION to_ascii(bytea, name) >> RETURNS text AS 'to_ascii_encname' LANGUAGE internal; >> >> SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2'); >> >> I don't expect column collate for 8.4, so we need to have workable >> to_ascii function. >> >> I propose function to_ascii(text, name) that internally convert text >> from utf8 encoding when it's necessary.cheers >> >> >> > > convert_to is not broken. It returns a bytea, and it is up to you to > de-escape it if you get the text representation. > ok, I talked about combination convert_to and to_ascii. to_ascii doesn't support bytea, what is probably correct. We cannot use descape, because it remove 8bit. This issue was noticed more times - http://archives.postgresql.org/pgsql-general/2008-06/msg00495.php > We are surely not going to go back to a situation where we have functions > returning text in any encoding other than the database encoding. That > becomes a vehicle for storing wrongly encoded data in the database, and we > have just gone through the exercise of plugging those holes. I privately > predicted when we did this work that it might motivate people who had been > abusing convert_to to get proper support for multiple encodings done. That > is the right way to go, not re-opening holes we have just very deliberately > plugged. > to_ascii isn't related to multiple encodings. And actually there is only one man who works on it. We will be happy for database collation in 8.4. So without any change this feature will be broken more than two years. Regards Pavel > > > cheers > > andrew
2008/8/11 Andrew Dunstan <andrew@dunslane.net>: > > > Pavel Stehule wrote: >> >> Hello, >> >> combination functions to_ascii and convert_to is broken now. Problem >> is in convert_to function. It doesn't support 8bit output encoding. >> >> Current workaround: >> >> CREATE FUNCTION to_ascii(bytea, name) >> RETURNS text AS 'to_ascii_encname' LANGUAGE internal; >> >> SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2'); >> >> I don't expect column collate for 8.4, so we need to have workable >> to_ascii function. >> >> I propose function to_ascii(text, name) that internally convert text >> from utf8 encoding when it's necessary.cheers >> >> >> > > convert_to is not broken. It returns a bytea, and it is up to you to > de-escape it if you get the text representation. One note - convert_to is correct. But we have to use to_ascii without decode functions. It has same behave - convert from bytea to text. Text in "incorrect" encoding is dafacto bytea. So correct to_ascii function prototypes are: to_ascii(text) to_ascii(bytea, integer); to_ascii(bytea, name); Regards Pavel Stehule > > We are surely not going to go back to a situation where we have functions > returning text in any encoding other than the database encoding. That > becomes a vehicle for storing wrongly encoded data in the database, and we > have just gone through the exercise of plugging those holes. I privately > predicted when we did this work that it might motivate people who had been > abusing convert_to to get proper support for multiple encodings done. That > is the right way to go, not re-opening holes we have just very deliberately > plugged. > > > > cheers > > andrew >
Pavel Stehule wrote: > > > One note - convert_to is correct. But we have to use to_ascii without > decode functions. It has same behave - convert from bytea to text. > Text in "incorrect" encoding is dafacto bytea. So correct to_ascii > function prototypes are: > > to_ascii(text) > to_ascii(bytea, integer); > to_ascii(bytea, name); > > >> What you have not said is how you propose to convert UTF8 to ASCII. Currently to_ascii() converts a small number of single byte charsets to ASCII by folding the chars with high bits set, so what we get is a pure ASCII result which is safe in any server encoding, as they are all ASCII supersets. But what conversion rule will you use for the gazillions of Unicode characters? I honestly do not understand the use case for this at all. cheers andrew
Andrew Dunstan wrote: > > > Pavel Stehule wrote: >> >> >> One note - convert_to is correct. But we have to use to_ascii without >> decode functions. It has same behave - convert from bytea to text. >> Text in "incorrect" encoding is dafacto bytea. So correct to_ascii >> function prototypes are: >> >> to_ascii(text) >> to_ascii(bytea, integer); >> to_ascii(bytea, name); >> >> >>> > > What you have not said is how you propose to convert UTF8 to ASCII. > > Currently to_ascii() converts a small number of single byte charsets to > ASCII by folding the chars with high bits set, so what we get is a pure > ASCII result which is safe in any server encoding, as they are all ASCII > supersets. > > But what conversion rule will you use for the gazillions of Unicode > characters? > > I honestly do not understand the use case for this at all. I do. Often clients want their searches to be accented-or-language-specific letters insensitive. So searching for 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of such facility made me consider not upgrading particular client to 8.3...). Or maybe there's a better way to do it? Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
Jan Urbański wrote: > Andrew Dunstan wrote: >> >> >> Pavel Stehule wrote: >>> >>> >>> One note - convert_to is correct. But we have to use to_ascii without >>> decode functions. It has same behave - convert from bytea to text. >>> Text in "incorrect" encoding is dafacto bytea. So correct to_ascii >>> function prototypes are: >>> >>> to_ascii(text) >>> to_ascii(bytea, integer); >>> to_ascii(bytea, name); >>> >>> >>>> >> >> What you have not said is how you propose to convert UTF8 to ASCII. >> >> Currently to_ascii() converts a small number of single byte charsets >> to ASCII by folding the chars with high bits set, so what we get is a >> pure ASCII result which is safe in any server encoding, as they are >> all ASCII supersets. >> >> But what conversion rule will you use for the gazillions of Unicode >> characters? >> >> I honestly do not understand the use case for this at all. > > I do. Often clients want their searches to be > accented-or-language-specific letters insensitive. So searching for > 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of > such facility made me consider not upgrading particular client to > 8.3...). > Or maybe there's a better way to do it? Well, my first question would be "Why aren't you using a database encoding that supports to_ascii()?" However, I suppose that your use case would support this signature: to_ascii(bytea, name) where it would just error out if the encoding name were something other than LATIN1, LATIN2, LATIN9, or WIN1250. But what would be the meaning of this?: to_ascii(bytea, integer) cheers andrew
Andrew Dunstan wrote: > > > Jan Urbański wrote: >> Andrew Dunstan wrote: >>> >>> >>> Pavel Stehule wrote: >>> What you have not said is how you propose to convert UTF8 to ASCII. >>> >>> Currently to_ascii() converts a small number of single byte charsets >>> to ASCII by folding the chars with high bits set, so what we get is a >>> pure ASCII result which is safe in any server encoding, as they are >>> all ASCII supersets. >>> >>> But what conversion rule will you use for the gazillions of Unicode >>> characters? >>> >>> I honestly do not understand the use case for this at all. >> >> I do. Often clients want their searches to be >> accented-or-language-specific letters insensitive. So searching for >> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of >> such facility made me consider not upgrading particular client to >> 8.3...). >> Or maybe there's a better way to do it? > > Well, my first question would be "Why aren't you using a database > encoding that supports to_ascii()?" Because I want UTF-8 in it ;) It's mostly LATIN2, but clients sometimes input Cyrillic, Greek or Hebrew letters, and sometimes use Unicode characters like (U+2026) HORIZONTAL ELLIPSIS. I'd like to have to_ascii(text, [error_handling]) returns text So no bytea, to_ascii would accept text that's legal in my current database encoding and return text in that encoding. And error_handling would be something like: - 'error' (the default, throw an error if a character is untranslable to ASCII) - 'ignore' (omit untranslable characters) - 'transliterate' (do your best to transliterate the character, or leave it as it is if impossible). Examples would include (assuming UTF-8 database) to_ascii('łódź') -> 'lodz' to_ascii('china is written 中國') -> ERROR to_ascii('china is written 中國', 'ignore') -> 'china is written ' to_ascii('china is written 中國', 'transliterate') -> 'china is written zhong guo' (in an ideal world) to_ascii('china is written 中國', 'transliterate') -> 'china is written 中國' (in reality)\ These would have the property, that: to_ascii(X, 'ignore') is always pure ASCII data and never throws an error to_ascii(X, 'transliterate') is sometimes non-ASCII data and never throws an error to_ascii(X) is sometimes non-ASCII data and sometimes throws an error It's something like PHP's iconv that can have //TRANSLIT or somesuch (forgive me for giving PHP as an example...). Now I'd love to hear people punch holes in my daydreaming design ;) Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin
2008/8/11 Andrew Dunstan <andrew@dunslane.net>: > > > Pavel Stehule wrote: >> >> >> One note - convert_to is correct. But we have to use to_ascii without >> decode functions. It has same behave - convert from bytea to text. >> Text in "incorrect" encoding is dafacto bytea. So correct to_ascii >> function prototypes are: >> >> to_ascii(text) >> to_ascii(bytea, integer); >> to_ascii(bytea, name); >> >> >>> >>> > > What you have not said is how you propose to convert UTF8 to ASCII. > > Currently to_ascii() converts a small number of single byte charsets to > ASCII by folding the chars with high bits set, so what we get is a pure > ASCII result which is safe in any server encoding, as they are all ASCII > supersets. > > But what conversion rule will you use for the gazillions of Unicode > characters? > > I honestly do not understand the use case for this at all. > It's typical case in czech language, where some searchings are accents insensitive - Stěhule, Stehule, Novotný, Novotny. > cheers > > andrew >
2008/8/11 Andrew Dunstan <andrew@dunslane.net>: > > > Jan Urbański wrote: >> >> Andrew Dunstan wrote: >>> >>> >>> Pavel Stehule wrote: >>>> >>>> >>>> One note - convert_to is correct. But we have to use to_ascii without >>>> decode functions. It has same behave - convert from bytea to text. >>>> Text in "incorrect" encoding is dafacto bytea. So correct to_ascii >>>> function prototypes are: >>>> >>>> to_ascii(text) >>>> to_ascii(bytea, integer); >>>> to_ascii(bytea, name); >>>> >>>> >>>>> >>>>> >>> >>> What you have not said is how you propose to convert UTF8 to ASCII. >>> >>> Currently to_ascii() converts a small number of single byte charsets to >>> ASCII by folding the chars with high bits set, so what we get is a pure >>> ASCII result which is safe in any server encoding, as they are all ASCII >>> supersets. >>> >>> But what conversion rule will you use for the gazillions of Unicode >>> characters? >>> >>> I honestly do not understand the use case for this at all. >> >> I do. Often clients want their searches to be >> accented-or-language-specific letters insensitive. So searching for 'łódź' >> returns 'lodz'. So the use case is there (in fact, the lack of such facility >> made me consider not upgrading particular client to 8.3...). >> Or maybe there's a better way to do it? > > Well, my first question would be "Why aren't you using a database encoding > that supports to_ascii()?" > > However, I suppose that your use case would support this signature: > > to_ascii(bytea, name) > > where it would just error out if the encoding name were something other than > LATIN1, LATIN2, LATIN9, or WIN1250. > > But what would be the meaning of this?: > > to_ascii(bytea, integer) > it's symmetric. Nothing more. > > cheers > > andrew > >
Pavel Stehule wrote: >> >> But what would be the meaning of this?: >> >> to_ascii(bytea, integer) >> >> > > it's symmetric. Nothing more. > > Symmetric to what? What is the second argument supposed to be? cheers andrew
2008/8/11 Andrew Dunstan <andrew@dunslane.net>: > > > Pavel Stehule wrote: >>> >>> But what would be the meaning of this?: >>> >>> to_ascii(bytea, integer) >>> >>> >> >> it's symmetric. Nothing more. >> >> > > Symmetric to what? What is the second argument supposed to be? > postgres=# \df to_ascii List of functions Schema | Name | Result data type | Argument data types ------------+----------+------------------+---------------------pg_catalog | to_ascii | text | textpg_catalog| to_ascii | text | text, integerpg_catalog | to_ascii | text | text, name postgres=# select to_ascii('pavel',8);to_ascii ----------pavel (1 row) > cheers > > andrew >
Pavel Stehule wrote: > 2008/8/11 Andrew Dunstan <andrew@dunslane.net>: > >> Pavel Stehule wrote: >> >>>> But what would be the meaning of this?: >>>> >>>> to_ascii(bytea, integer) >>>> >>>> >>>> >>> it's symmetric. Nothing more. >>> >>> >>> >> Symmetric to what? What is the second argument supposed to be? >> >> > > postgres=# \df to_ascii > List of functions > Schema | Name | Result data type | Argument data types > ------------+----------+------------------+--------------------- > pg_catalog | to_ascii | text | text > pg_catalog | to_ascii | text | text, integer > pg_catalog | to_ascii | text | text, name > > postgres=# select to_ascii('pavel',8); > to_ascii > ---------- > pavel > (1 row) > > > > Hmm. That's not documented, and I suspect shouldn't be there. Everywhere else pretty much that I can think of we pass the encoding as a name, and I think we should be consistent about it. cheers andrew
2008/8/11 Andrew Dunstan <andrew@dunslane.net>: > > > Pavel Stehule wrote: >> >> 2008/8/11 Andrew Dunstan <andrew@dunslane.net>: >> >>> >>> Pavel Stehule wrote: >>> >>>>> >>>>> But what would be the meaning of this?: >>>>> >>>>> to_ascii(bytea, integer) >>>>> >>>>> >>>>> >>>> >>>> it's symmetric. Nothing more. >>>> >>>> >>>> >>> >>> Symmetric to what? What is the second argument supposed to be? >>> >>> >> >> postgres=# \df to_ascii >> List of functions >> Schema | Name | Result data type | Argument data types >> ------------+----------+------------------+--------------------- >> pg_catalog | to_ascii | text | text >> pg_catalog | to_ascii | text | text, integer >> pg_catalog | to_ascii | text | text, name >> >> postgres=# select to_ascii('pavel',8); >> to_ascii >> ---------- >> pavel >> (1 row) >> >> >> > > Hmm. That's not documented, and I suspect shouldn't be there. Everywhere > else pretty much that I can think of we pass the encoding as a name, and I > think we should be consistent about it. > I don't need it regards Pavel > cheers > > andrew >
On Monday 11 August 2008 16:23:29 Jan Urbański wrote: > Often clients want their searches to be > accented-or-language-specific letters insensitive. So searching for > 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of > such facility made me consider not upgrading particular client to 8.3...). These are valid ideas, but then please design a new function that addresses your use case in a well-defined way, and don't overload questionable old interfaces for new purposes. In the Unicode standard you can find well-defined methods to decompose characters into diacritic marks, and then you could strip them off. But this has nothing to do with ASCII or UTF8 or encodings. Cyrillic characters can have diacritic marks as well, for example.
2008/8/11 Peter Eisentraut <peter_e@gmx.net>: > On Monday 11 August 2008 16:23:29 Jan Urbański wrote: >> Often clients want their searches to be >> accented-or-language-specific letters insensitive. So searching for >> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of >> such facility made me consider not upgrading particular client to 8.3...). > > These are valid ideas, but then please design a new function that addresses > your use case in a well-defined way, and don't overload questionable old > interfaces for new purposes. > > In the Unicode standard you can find well-defined methods to decompose > characters into diacritic marks, and then you could strip them off. But this > has nothing to do with ASCII or UTF8 or encodings. Cyrillic characters can > have diacritic marks as well, for example. > Hi Peter, changes to_ascii from text to bytea is more bugfix than new feature and should be done immediately. Correct conversions are related to colum collation - is not necessary repeat same work and same code from some unicode libs. Regards Pavel
Peter Eisentraut wrote: > On Monday 11 August 2008 16:23:29 Jan Urbański wrote: >> Often clients want their searches to be >> accented-or-language-specific letters insensitive. So searching for >> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of >> such facility made me consider not upgrading particular client to 8.3...). > > These are valid ideas, but then please design a new function that addresses > your use case in a well-defined way, and don't overload questionable old > interfaces for new purposes. > > In the Unicode standard you can find well-defined methods to decompose > characters into diacritic marks, and then you could strip them off. But this > has nothing to do with ASCII or UTF8 or encodings. Cyrillic characters can > have diacritic marks as well, for example. OK, I was envisioning something like that: http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm but now that I think of it, I can always just write a plperlu function that uses that module. The only inconvenience is having to have plperlu in the db, but I can live with that. Postgres extensibility rocks and I rest my case. Cheers, Jan -- Jan Urbanski GPG key ID: E583D7D2 ouden estin