No,
I need a solution that is as generic as possible. I use UTF-8 encoded
Unicode strings at all levels. This is what I have done so far:
1) Writing a separate Python command-line script for testing. This works
as expected:
#!/usr/bin/python
import sys
import unicodedata
str = sys.argv[1].decode('UTF-8')
str = unicodedata.normalize('NFKD', str)
str = ''.join(c for c in str if unicodedata.combining(c) == 0)
print str
2) Transferring this to PL/Python:
CREATE OR REPLACE FUNCTION test (str text)
RETURNS text
AS $$
import unicodedata
return unicodedata.normalize('NFKD', str.decode('UTF-8'))
$$ LANGUAGE plpythonu;
Problem: PL/Python throws an error where my command-line script worked
correctly:
# select test('aÄÖÜ');
ERROR: plpython: function "test" could not create return value
DETAIL: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
encode character u'\u0308' in position 2: ordinal not in range(128)
I use PG 8.3 and Python 2.5.2. How can I make PL/Python behave as it
does in a normal Python environment?
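The error can be reproduced outside the database, which may help narrow
it down. The sketch below is written for Python 3, not the Python 2.5.2
used above. It assumes, without this being confirmed for PL/Python on
8.3, that the return value is being encoded implicitly with the ASCII
codec. strip_diacritics is a hypothetical helper name that wraps the
steps from the script in 1).

```python
import unicodedata


def strip_diacritics(text):
    """Decompose with NFKD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if unicodedata.combining(c) == 0)


# 'aÄÖÜ' written with escapes so the literal is unambiguous.
sample = 'a\u00c4\u00d6\u00dc'
decomposed = unicodedata.normalize('NFKD', sample)

# Assumption: the failure comes from an implicit ASCII encode of the result.
# Encoding the decomposed string as ASCII fails on the first combining
# diaeresis (U+0308), matching the character named in the DETAIL line.
try:
    decomposed.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding explicitly as UTF-8 succeeds and round-trips.
utf8_bytes = decomposed.encode('UTF-8')

print(ascii_error)
print(strip_diacritics(sample))
```

If that assumption holds, returning an explicitly UTF-8 encoded byte
string from the function, or stripping the combining marks before
returning, might avoid the implicit ASCII step.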
In the end it should look like this:
CREATE TABLE t (
...
ts tsvector NOT NULL
);
INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));
Andi
David Fetter wrote:
> On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
>
>> Has somebody integrated Unicode normalization into Postgres? If not, I
>> would have to implement my own function by using this CPAN module:
>> http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
>>
>> I need a function which removes all diacritics (1) and transforms some
>> characters to a more compatible form (2) to get a better index on
>> strings.
>>
>> Best,
>>
>> Andi
>>
>>
>> 1) à,ä, ... => a
>> 2) ø => o, ƒ => f, ª => a
>>
>
> You mean something like this?
>
> http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
>
> Cheers,
> David.
>