Thread: bug in Google translate snippet

bug in Google translate snippet

From
Alvaro Herrera
Date:
Hi,

I was having a look at this snippet:
http://wiki.postgresql.org/wiki/Google_Translate
and it turns out that it doesn't work if the result contains non-ASCII
chars.  Does anybody know how to fix it?

alvherre=# select gtranslate('en', 'es', 'he');
ERROR:  plpython: function "gtranslate" could not create return value
DETALLE:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
 

By adding a plpy.log() call you can see that the answer is "él":
LOG:  (u'\xe9l',)

I guess it needs some treatment similar to the one in this function:
http://wiki.postgresql.org/wiki/Strip_accents_from_strings


For completeness, here is the code:

CREATE OR REPLACE FUNCTION gtranslate(src text, target text, phrase text) RETURNS text
LANGUAGE plpythonu
AS $$
import re
import urllib
import simplejson as json
class UrlOpener(urllib.FancyURLopener):
    version = "py-gtranslate/1.0"

base_uri = "http://ajax.googleapis.com/ajax/services/language/translate"
default_params = {'v': '1.0'}

def translate(src, to, phrase):
    args = default_params.copy()
    args.update({
        'langpair': '%s%%7C%s' % (src, to),
        'q': urllib.quote_plus(phrase),
    })
    argstring = '%s' % ('&'.join(['%s=%s' % (k,v) for (k,v) in args.iteritems()]))
    resp = json.load(UrlOpener().open('%s?%s' % (base_uri, argstring)))
    try:
        return resp['responseData']['translatedText']
    except:
        # should probably warn about failed translation
        return phrase

return translate(src, target, phrase)
$$;

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: bug in Google translate snippet

From
Andrew Dunstan
Date:

Alvaro Herrera wrote:
> Hi,
>
> I was having a look at this snippet:
> http://wiki.postgresql.org/wiki/Google_Translate
> and it turns out that it doesn't work if the result contains non-ASCII
> chars.  Does anybody know how to fix it?
>
> alvherre=# select gtranslate('en', 'es', 'he');
> ERROR:  plpython: function "gtranslate" could not create return value
> DETALLE:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
 
>   

This looks like a python issue rather than a Postgres issue. The problem 
is probably in python-simplejson.

cheers

andrew




Re: bug in Google translate snippet

From
Alvaro Herrera
Date:
Andrew Dunstan wrote:
>
>
> Alvaro Herrera wrote:
>> Hi,
>>
>> I was having a look at this snippet:
>> http://wiki.postgresql.org/wiki/Google_Translate
>> and it turns out that it doesn't work if the result contains non-ASCII
>> chars.  Does anybody know how to fix it?
>>
>> alvherre=# select gtranslate('en', 'es', 'he');
>> ERROR:  plpython: function "gtranslate" could not create return value
>> DETALLE:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
 
>
> This looks like a python issue rather than a Postgres issue. The problem  
> is probably in python-simplejson.

I think the problem happens when the PL tries to create the output
value.  Otherwise I wouldn't be able to see the value in plpy.log.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: bug in Google translate snippet

From
Jan Urbański
Date:
Alvaro Herrera wrote:
> Andrew Dunstan wrote:
>>
>> Alvaro Herrera wrote:
>>> Hi,
>>>
>>> I was having a look at this snippet:
>>> http://wiki.postgresql.org/wiki/Google_Translate
>>> and it turns out that it doesn't work if the result contains non-ASCII
>>> chars.  Does anybody know how to fix it?
>>>
>>> alvherre=# select gtranslate('en', 'es', 'he');
>>> ERROR:  plpython: function "gtranslate" could not create return value
>>> DETALLE:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
 
>> This looks like a python issue rather than a Postgres issue. The problem  
>> is probably in python-simplejson.
> 
> I think the problem happens when the PL tries to create the output
> value.  Otherwise I wouldn't be able to see the value in plpy.log.

The problem is that the thing you are trying to return
(resp['responseData']['translatedText']) is a Unicode object, so it
has to be encoded before it can be output as a string. The error comes
from Python complaining that you are trying to output a non-ASCII
character using the 'ascii' codec, which cannot encode it.

One solution is to explicitly encode the Unicode string with some codec,
that is, ask Python to convert the Unicode object into a byte string using
some encoding, UTF-8 being a good choice here. For instance, changing the
return statement to

    return resp['responseData']['translatedText'].encode('utf-8')

worked for me.
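
To make the difference concrete, here is a minimal sketch that runs outside PL/Python. It takes the u'\xe9l' value shown in the plpy.log() output earlier in the thread and encodes it with both codecs. The variable names are illustrative only, and it assumes UTF-8 is an acceptable target, which depends on the database encoding (not stated in this thread).

```python
# The translated text from the log output, as a Unicode string ("él").
translated = u'\xe9l'

# Encoding with the 'ascii' codec fails: U+00E9 is outside range(128).
# This is the same UnicodeEncodeError class reported in the error above.
try:
    translated.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 succeeds and yields a byte string:
# two bytes (0xC3 0xA9) for the accented e, followed by the byte for 'l'.
utf8_bytes = translated.encode('utf-8')

# Decoding those bytes as UTF-8 gives back the original Unicode string.
round_trip = utf8_bytes.decode('utf-8')
```

The point of the sketch is only that the failure comes from the choice of codec, not from the value itself; the same string encodes cleanly once a codec that covers é is named explicitly.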

See also http://docs.python.org/tutorial/introduction.html#unicode-strings

Cheers,
Jan

