Re: Unicode normalization - Mailing list pgsql-general

From Andreas Kalsch
Subject Re: Unicode normalization
Msg-id 4AB13DE6.3040800@gmx.de
In response to Re: Unicode normalization  (David Fetter <david@fetter.org>)
List pgsql-general
No,

I need a solution that is as generic as possible. I use UTF-8 encoded
Unicode strings at all levels. This is what I have done so far:


1) Writing a separate Python command line script for testing - works as
expected:

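#!/usr/bin/python
# Python 2: strip diacritics from a UTF-8 encoded command line argument.

import sys
import unicodedata

text = sys.argv[1].decode('UTF-8')           # byte string -> unicode
text = unicodedata.normalize('NFKD', text)   # split into base chars + marks
# Keep only characters that are not combining marks.
text = ''.join(c for c in text if unicodedata.combining(c) == 0)
print text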
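
The same steps can also be sketched as a reusable function. This is only an illustration written in Python 3 syntax (the script above targets Python 2), and the name strip_diacritics is hypothetical, not something that exists in Postgres or in the unicodedata module:

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks removed after NFKD decomposition."""
    # NFKD splits e.g. U+00C4 into 'A' followed by U+0308 (combining diaeresis).
    decomposed = unicodedata.normalize('NFKD', text)
    # combining() returns 0 for base characters and non-zero for combining marks.
    return ''.join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == '__main__':
    print(strip_diacritics('a\u00c4\u00d6\u00dc'))  # input is 'aÄÖÜ'
```

Note that characters without a Unicode decomposition, such as ø and ƒ, pass through this function unchanged, so mappings like ø => o would probably need an additional hand-written table.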


2) Transferring this to PL/Python:

CREATE OR REPLACE FUNCTION test (str text)
  RETURNS text
AS $$
    import unicodedata
    return unicodedata.normalize('NFKD', str.decode('UTF-8'))
$$ LANGUAGE plpythonu;

Problem: PL/Python throws an error where my command line script worked
correctly:

# select test('aÄÖÜ');

ERROR:  plpython: function "test" could not create return value
DETAIL:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
encode character u'\u0308' in position 2: ordinal not in range(128)
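
The position reported in the error matches what happens when the already normalized string is pushed through the ascii codec, which suggests the normalization itself succeeds and the failure occurs while the return value is converted. The following sketch reproduces that outside of Postgres, in Python 3 syntax. That PL/Python 8.3 performs such an implicit ascii conversion on the returned unicode object is an assumption here, not something confirmed in this thread:

```python
import unicodedata

# 'aÄÖÜ' decomposes to: a, A, U+0308, O, U+0308, U, U+0308
decomposed = unicodedata.normalize('NFKD', 'a\u00c4\u00d6\u00dc')

try:
    decomposed.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    # The first non-ASCII character is the combining diaeresis at index 2.
    failure = exc

# Encoding the same string explicitly as UTF-8 succeeds.
utf8_bytes = decomposed.encode('utf-8')

if __name__ == '__main__':
    print(failure)
    print(utf8_bytes)
```

If that assumption holds, returning an explicitly encoded byte string from the function body (for example by appending .encode('UTF-8') to the normalized value) might avoid the error, but this is untested against PG 8.3.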



I use PG 8.3 and Python 2.5.2. How can I make PL/Python behave as it
does in a normal Python environment?


In the end it should look like this:

CREATE TABLE t (
...
ts tsvector NOT NULL
);

INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));

Andi


David Fetter wrote:
> On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
>
>> Has somebody integrated Unicode normalization into Postgres? If not, I
>> would have to implement my own function by using this CPAN module:
>> http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
>>
>> I need a function which removes all diacritics (1) and transforms some
>> characters to a more compatible form (2) to get a better index on
>> strings.
>>
>> Best,
>>
>> Andi
>>
>>
>> 1) à,ä, ... => a
>> 2) ø => o, ƒ => f, ª => a
>>
>
> You mean something like this?
>
> http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
>
> Cheers,
> David.
>

