Thread: proposal: UTF8 to_ascii function

proposal: UTF8 to_ascii function

From
"Pavel Stehule"
Date:
Hello,

combination functions to_ascii and convert_to is broken now. Problem
is in convert_to function. It doesn't support 8bit output encoding.

Current workaround:

CREATE FUNCTION to_ascii(bytea, name)
RETURNS text AS 'to_ascii_encname' LANGUAGE internal;

SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2');

I don't expect column collate for 8.4, so we need to have workable
to_ascii function.

I propose function to_ascii(text, name) that internally convert text
from utf8 encoding when it's necessary.

Regards
Pavel Stehule

Re: proposal: UTF8 to_ascii function

From
Andrew Dunstan
Date:

Pavel Stehule wrote:
> Hello,
>
> combination functions to_ascii and convert_to is broken now. Problem
> is in convert_to function. It doesn't support 8bit output encoding.
>
> Current workaround:
>
> CREATE FUNCTION to_ascii(bytea, name)
> RETURNS text AS 'to_ascii_encname' LANGUAGE internal;
>
> SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2');
>
> I don't expect column collate for 8.4, so we need to have workable
> to_ascii function.
>
> I propose function to_ascii(text, name) that internally convert text
> from utf8 encoding when it's necessary.cheers
>
>
>   

convert_to is not broken. It returns a bytea, and it is up to you to 
de-escape it if you get the text representation.

We are surely not going to go back to a situation where we have 
functions returning text in any encoding other than the database 
encoding. That becomes a vehicle for storing wrongly encoded data in the 
database, and we have just gone through the exercise of plugging those 
holes. I privately predicted when we did this work that it might 
motivate people who had been abusing convert_to to get proper support 
for multiple encodings done. That is the right way to go, not re-opening 
holes we have just very deliberately plugged.



cheers

andrew


Re: proposal: UTF8 to_ascii function

From
"Pavel Stehule"
Date:
Hello

2008/8/11 Andrew Dunstan <andrew@dunslane.net>:
>
>
> Pavel Stehule wrote:
>>
>> Hello,
>>
>> combination functions to_ascii and convert_to is broken now. Problem
>> is in convert_to function. It doesn't support 8bit output encoding.
>>
>> Current workaround:
>>
>> CREATE FUNCTION to_ascii(bytea, name)
>> RETURNS text AS 'to_ascii_encname' LANGUAGE internal;
>>
>> SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2');
>>
>> I don't expect column collate for 8.4, so we need to have workable
>> to_ascii function.
>>
>> I propose function to_ascii(text, name) that internally convert text
>> from utf8 encoding when it's necessary.cheers
>>
>>
>>
>
> convert_to is not broken. It returns a bytea, and it is up to you to
> de-escape it if you get the text representation.
>

ok, I talked about combination convert_to and to_ascii. to_ascii
doesn't support bytea, what is probably correct. We cannot use
descape, because it remove 8bit. This issue was noticed more times -
http://archives.postgresql.org/pgsql-general/2008-06/msg00495.php


> We are surely not going to go back to a situation where we have functions
> returning text in any encoding other than the database encoding. That
> becomes a vehicle for storing wrongly encoded data in the database, and we
> have just gone through the exercise of plugging those holes. I privately
> predicted when we did this work that it might motivate people who had been
> abusing convert_to to get proper support for multiple encodings done. That
> is the right way to go, not re-opening holes we have just very deliberately
> plugged.
>

to_ascii isn't related to multiple encodings. And actually there is
only one man who works on it. We will be happy for database collation
in 8.4. So without any change this feature will be broken more than
two years.

Regards
Pavel

>
>
> cheers
>
> andrew

Re: proposal: UTF8 to_ascii function

From
"Pavel Stehule"
Date:
2008/8/11 Andrew Dunstan <andrew@dunslane.net>:
>
>
> Pavel Stehule wrote:
>>
>> Hello,
>>
>> combination functions to_ascii and convert_to is broken now. Problem
>> is in convert_to function. It doesn't support 8bit output encoding.
>>
>> Current workaround:
>>
>> CREATE FUNCTION to_ascii(bytea, name)
>> RETURNS text AS 'to_ascii_encname' LANGUAGE internal;
>>
>> SELECT to_ascii(convert_to('Příliš žlutý kůň', 'latin2'),'latin2');
>>
>> I don't expect column collate for 8.4, so we need to have workable
>> to_ascii function.
>>
>> I propose function to_ascii(text, name) that internally convert text
>> from utf8 encoding when it's necessary.cheers
>>
>>
>>
>
> convert_to is not broken. It returns a bytea, and it is up to you to
> de-escape it if you get the text representation.

One note - convert_to is correct. But we have to use to_ascii without
decode functions. It has same behave - convert from bytea to text.
Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
function prototypes are:

to_ascii(text)
to_ascii(bytea, integer);
to_ascii(bytea, name);

Regards
Pavel Stehule
>
> We are surely not going to go back to a situation where we have functions
> returning text in any encoding other than the database encoding. That
> becomes a vehicle for storing wrongly encoded data in the database, and we
> have just gone through the exercise of plugging those holes. I privately
> predicted when we did this work that it might motivate people who had been
> abusing convert_to to get proper support for multiple encodings done. That
> is the right way to go, not re-opening holes we have just very deliberately
> plugged.
>
>
>
> cheers
>
> andrew
>

Re: proposal: UTF8 to_ascii function

From
Andrew Dunstan
Date:

Pavel Stehule wrote:
>
>
> One note - convert_to is correct. But we have to use to_ascii without
> decode functions. It has same behave - convert from bytea to text.
> Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
> function prototypes are:
>
> to_ascii(text)
> to_ascii(bytea, integer);
> to_ascii(bytea, name);
>
>   
>>     

What you have not said is how you propose to convert UTF8 to ASCII.

Currently to_ascii() converts a small number of single byte charsets to 
ASCII by folding the chars with high bits set, so what we get is a pure 
ASCII result which is safe in any server encoding, as they are all ASCII 
supersets.

But what conversion rule will you use for the gazillions of Unicode 
characters?

I honestly do not understand the use case for this at all.

cheers

andrew


Re: proposal: UTF8 to_ascii function

From
Jan Urbański
Date:
Andrew Dunstan wrote:
>
>
> Pavel Stehule wrote:
>>
>>
>> One note - convert_to is correct. But we have to use to_ascii without
>> decode functions. It has same behave - convert from bytea to text.
>> Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
>> function prototypes are:
>>
>> to_ascii(text)
>> to_ascii(bytea, integer);
>> to_ascii(bytea, name);
>>
>>
>>>
>
> What you have not said is how you propose to convert UTF8 to ASCII.
>
> Currently to_ascii() converts a small number of single byte charsets to
> ASCII by folding the chars with high bits set, so what we get is a pure
> ASCII result which is safe in any server encoding, as they are all ASCII
> supersets.
>
> But what conversion rule will you use for the gazillions of Unicode
> characters?
>
> I honestly do not understand the use case for this at all.

I do. Often clients want their searches to be
accented-or-language-specific letters insensitive. So searching for
'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
such facility made me consider not upgrading particular client to 8.3...).
Or maybe there's a better way to do it?

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin



Re: proposal: UTF8 to_ascii function

From
Andrew Dunstan
Date:

Jan Urbański wrote:
> Andrew Dunstan wrote:
>>
>>
>> Pavel Stehule wrote:
>>>
>>>
>>> One note - convert_to is correct. But we have to use to_ascii without
>>> decode functions. It has same behave - convert from bytea to text.
>>> Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
>>> function prototypes are:
>>>
>>> to_ascii(text)
>>> to_ascii(bytea, integer);
>>> to_ascii(bytea, name);
>>>
>>>  
>>>>     
>>
>> What you have not said is how you propose to convert UTF8 to ASCII.
>>
>> Currently to_ascii() converts a small number of single byte charsets 
>> to ASCII by folding the chars with high bits set, so what we get is a 
>> pure ASCII result which is safe in any server encoding, as they are 
>> all ASCII supersets.
>>
>> But what conversion rule will you use for the gazillions of Unicode 
>> characters?
>>
>> I honestly do not understand the use case for this at all.
>
> I do. Often clients want their searches to be 
> accented-or-language-specific letters insensitive. So searching for 
> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of 
> such facility made me consider not upgrading particular client to 
> 8.3...).
> Or maybe there's a better way to do it?

Well, my first question would be "Why aren't you using a database 
encoding that supports to_ascii()?"

However, I suppose that your use case would support this signature:
   to_ascii(bytea, name)

where it would just error out if the encoding name were something other 
than LATIN1, LATIN2, LATIN9, or WIN1250.

But what would be the meaning of this?:
   to_ascii(bytea, integer)


cheers

andrew



Re: proposal: UTF8 to_ascii function

From
Jan Urbański
Date:
Andrew Dunstan wrote:
>
>
> Jan Urbański wrote:
>> Andrew Dunstan wrote:
>>>
>>>
>>> Pavel Stehule wrote:
>>> What you have not said is how you propose to convert UTF8 to ASCII.
>>>
>>> Currently to_ascii() converts a small number of single byte charsets
>>> to ASCII by folding the chars with high bits set, so what we get is a
>>> pure ASCII result which is safe in any server encoding, as they are
>>> all ASCII supersets.
>>>
>>> But what conversion rule will you use for the gazillions of Unicode
>>> characters?
>>>
>>> I honestly do not understand the use case for this at all.
>>
>> I do. Often clients want their searches to be
>> accented-or-language-specific letters insensitive. So searching for
>> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
>> such facility made me consider not upgrading particular client to
>> 8.3...).
>> Or maybe there's a better way to do it?
>
> Well, my first question would be "Why aren't you using a database
> encoding that supports to_ascii()?"

Because I want UTF-8 in it ;) It's mostly LATIN2, but clients sometimes
input Cyrillic, Greek or Hebrew letters, and sometimes use Unicode
characters like (U+2026)  HORIZONTAL ELLIPSIS.

I'd like to have
to_ascii(text, [error_handling]) returns text

So no bytea, to_ascii would accept text that's legal in my current
database encoding and return text in that encoding. And error_handling
would be something like:
- 'error' (the default, throw an error if a character is untranslable to
ASCII)
- 'ignore' (omit untranslable characters)
- 'transliterate' (do your best to transliterate the character, or leave
it as it is if impossible).

Examples would include (assuming UTF-8 database)
to_ascii('łódź') -> 'lodz'
to_ascii('china is written 中國') -> ERROR
to_ascii('china is written 中國', 'ignore') -> 'china is written '
to_ascii('china is written 中國', 'transliterate') -> 'china is written
zhong guo' (in an ideal world)
to_ascii('china is written 中國', 'transliterate') -> 'china is written
中國' (in reality)\

These would have the property, that:
to_ascii(X, 'ignore') is always pure ASCII data and never throws an error
to_ascii(X, 'transliterate') is sometimes non-ASCII data and never
throws an error
to_ascii(X) is sometimes non-ASCII data and sometimes throws an error

It's something like PHP's iconv that can have //TRANSLIT or somesuch
(forgive me for giving PHP as an example...). Now I'd love to hear
people punch holes in my daydreaming design ;)

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin



Re: proposal: UTF8 to_ascii function

From
"Pavel Stehule"
Date:
2008/8/11 Andrew Dunstan <andrew@dunslane.net>:
>
>
> Pavel Stehule wrote:
>>
>>
>> One note - convert_to is correct. But we have to use to_ascii without
>> decode functions. It has same behave - convert from bytea to text.
>> Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
>> function prototypes are:
>>
>> to_ascii(text)
>> to_ascii(bytea, integer);
>> to_ascii(bytea, name);
>>
>>
>>>
>>>
>
> What you have not said is how you propose to convert UTF8 to ASCII.
>
> Currently to_ascii() converts a small number of single byte charsets to
> ASCII by folding the chars with high bits set, so what we get is a pure
> ASCII result which is safe in any server encoding, as they are all ASCII
> supersets.
>
> But what conversion rule will you use for the gazillions of Unicode
> characters?
>
> I honestly do not understand the use case for this at all.
>

It's typical case in czech language, where some searchings are accents
insensitive - Stěhule, Stehule, Novotný, Novotny.

> cheers
>
> andrew
>

Re: proposal: UTF8 to_ascii function

From
"Pavel Stehule"
Date:
2008/8/11 Andrew Dunstan <andrew@dunslane.net>:
>
>
> Jan Urbański wrote:
>>
>> Andrew Dunstan wrote:
>>>
>>>
>>> Pavel Stehule wrote:
>>>>
>>>>
>>>> One note - convert_to is correct. But we have to use to_ascii without
>>>> decode functions. It has same behave - convert from bytea to text.
>>>> Text in "incorrect" encoding is dafacto bytea. So correct to_ascii
>>>> function prototypes are:
>>>>
>>>> to_ascii(text)
>>>> to_ascii(bytea, integer);
>>>> to_ascii(bytea, name);
>>>>
>>>>
>>>>>
>>>>>
>>>
>>> What you have not said is how you propose to convert UTF8 to ASCII.
>>>
>>> Currently to_ascii() converts a small number of single byte charsets to
>>> ASCII by folding the chars with high bits set, so what we get is a pure
>>> ASCII result which is safe in any server encoding, as they are all ASCII
>>> supersets.
>>>
>>> But what conversion rule will you use for the gazillions of Unicode
>>> characters?
>>>
>>> I honestly do not understand the use case for this at all.
>>
>> I do. Often clients want their searches to be
>> accented-or-language-specific letters insensitive. So searching for 'łódź'
>> returns 'lodz'. So the use case is there (in fact, the lack of such facility
>> made me consider not upgrading particular client to 8.3...).
>> Or maybe there's a better way to do it?
>
> Well, my first question would be "Why aren't you using a database encoding
> that supports to_ascii()?"
>
> However, I suppose that your use case would support this signature:
>
>   to_ascii(bytea, name)
>
> where it would just error out if the encoding name were something other than
> LATIN1, LATIN2, LATIN9, or WIN1250.
>
> But what would be the meaning of this?:
>
>   to_ascii(bytea, integer)
>

it's symmetric. Nothing more.

>
> cheers
>
> andrew
>
>

Re: proposal: UTF8 to_ascii function

From
Andrew Dunstan
Date:

Pavel Stehule wrote:
>>
>> But what would be the meaning of this?:
>>
>>   to_ascii(bytea, integer)
>>
>>     
>
> it's symmetric. Nothing more.
>
>   

Symmetric to what? What is the second argument supposed to be?

cheers

andrew


Re: proposal: UTF8 to_ascii function

From
"Pavel Stehule"
Date:
2008/8/11 Andrew Dunstan <andrew@dunslane.net>:
>
>
> Pavel Stehule wrote:
>>>
>>> But what would be the meaning of this?:
>>>
>>>  to_ascii(bytea, integer)
>>>
>>>
>>
>> it's symmetric. Nothing more.
>>
>>
>
> Symmetric to what? What is the second argument supposed to be?
>

postgres=# \df to_ascii                      List of functions  Schema   |   Name   | Result data type | Argument data
types
------------+----------+------------------+---------------------pg_catalog | to_ascii | text             |
textpg_catalog| to_ascii | text             | text, integerpg_catalog | to_ascii | text             | text, name
 

postgres=# select to_ascii('pavel',8);to_ascii
----------pavel
(1 row)


> cheers
>
> andrew
>


Re: proposal: UTF8 to_ascii function

From
Andrew Dunstan
Date:

Pavel Stehule wrote:
> 2008/8/11 Andrew Dunstan <andrew@dunslane.net>:
>   
>> Pavel Stehule wrote:
>>     
>>>> But what would be the meaning of this?:
>>>>
>>>>  to_ascii(bytea, integer)
>>>>
>>>>
>>>>         
>>> it's symmetric. Nothing more.
>>>
>>>
>>>       
>> Symmetric to what? What is the second argument supposed to be?
>>
>>     
>
> postgres=# \df to_ascii
>                        List of functions
>    Schema   |   Name   | Result data type | Argument data types
> ------------+----------+------------------+---------------------
>  pg_catalog | to_ascii | text             | text
>  pg_catalog | to_ascii | text             | text, integer
>  pg_catalog | to_ascii | text             | text, name
>
> postgres=# select to_ascii('pavel',8);
>  to_ascii
> ----------
>  pavel
> (1 row)
>
>
>   
>

Hmm. That's not documented, and I suspect shouldn't be there. Everywhere 
else pretty much that I can think of we pass the encoding as a name, and 
I think we should be consistent about it.

cheers

andrew


Re: proposal: UTF8 to_ascii function

From
"Pavel Stehule"
Date:
2008/8/11 Andrew Dunstan <andrew@dunslane.net>:
>
>
> Pavel Stehule wrote:
>>
>> 2008/8/11 Andrew Dunstan <andrew@dunslane.net>:
>>
>>>
>>> Pavel Stehule wrote:
>>>
>>>>>
>>>>> But what would be the meaning of this?:
>>>>>
>>>>>  to_ascii(bytea, integer)
>>>>>
>>>>>
>>>>>
>>>>
>>>> it's symmetric. Nothing more.
>>>>
>>>>
>>>>
>>>
>>> Symmetric to what? What is the second argument supposed to be?
>>>
>>>
>>
>> postgres=# \df to_ascii
>>                       List of functions
>>   Schema   |   Name   | Result data type | Argument data types
>> ------------+----------+------------------+---------------------
>>  pg_catalog | to_ascii | text             | text
>>  pg_catalog | to_ascii | text             | text, integer
>>  pg_catalog | to_ascii | text             | text, name
>>
>> postgres=# select to_ascii('pavel',8);
>>  to_ascii
>> ----------
>>  pavel
>> (1 row)
>>
>>
>>
>
> Hmm. That's not documented, and I suspect shouldn't be there. Everywhere
> else pretty much that I can think of we pass the encoding as a name, and I
> think we should be consistent about it.
>

I don't need it

regards
Pavel

> cheers
>
> andrew
>


Re: proposal: UTF8 to_ascii function

From
Peter Eisentraut
Date:
On Monday 11 August 2008 16:23:29 Jan Urbański wrote:
> Often clients want their searches to be
> accented-or-language-specific letters insensitive. So searching for
> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
> such facility made me consider not upgrading particular client to 8.3...).

These are valid ideas, but then please design a new function that addresses
your use case in a well-defined way, and don't overload questionable old
interfaces for new purposes.

In the Unicode standard you can find well-defined methods to decompose
characters into diacritic marks, and then you could strip them off.  But this
has nothing to do with ASCII or UTF8 or encodings.  Cyrillic characters can
have diacritic marks as well, for example.


Re: proposal: UTF8 to_ascii function

From
"Pavel Stehule"
Date:
2008/8/11 Peter Eisentraut <peter_e@gmx.net>:
> On Monday 11 August 2008 16:23:29 Jan Urbański wrote:
>> Often clients want their searches to be
>> accented-or-language-specific letters insensitive. So searching for
>> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
>> such facility made me consider not upgrading particular client to 8.3...).
>
> These are valid ideas, but then please design a new function that addresses
> your use case in a well-defined way, and don't overload questionable old
> interfaces for new purposes.
>
> In the Unicode standard you can find well-defined methods to decompose
> characters into diacritic marks, and then you could strip them off.  But this
> has nothing to do with ASCII or UTF8 or encodings.  Cyrillic characters can
> have diacritic marks as well, for example.
>

Hi Peter,

changes to_ascii from text to bytea is more bugfix than new feature
and should be done immediately. Correct conversions are related to
colum collation - is not necessary repeat same work and same code from
some unicode libs.

Regards
Pavel

Re: proposal: UTF8 to_ascii function

From
Jan Urbański
Date:
Peter Eisentraut wrote:
> On Monday 11 August 2008 16:23:29 Jan Urbański wrote:
>> Often clients want their searches to be
>> accented-or-language-specific letters insensitive. So searching for
>> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
>> such facility made me consider not upgrading particular client to 8.3...).
>
> These are valid ideas, but then please design a new function that addresses
> your use case in a well-defined way, and don't overload questionable old
> interfaces for new purposes.
>
> In the Unicode standard you can find well-defined methods to decompose
> characters into diacritic marks, and then you could strip them off.  But this
> has nothing to do with ASCII or UTF8 or encodings.  Cyrillic characters can
> have diacritic marks as well, for example.

OK, I was envisioning something like that:
http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm
but now that I think of it, I can always just write a plperlu function
that uses that module. The only inconvenience is having to have plperlu
in the db, but I can live with that.
Postgres extensibility rocks and I rest my case.

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin