Thread: bug in Google translate snippet
Hi,

I was having a look at this snippet:
http://wiki.postgresql.org/wiki/Google_Translate
and it turns out that it doesn't work if the result contains non-ASCII
chars. Does anybody know how to fix it?

alvherre=# select gtranslate('en', 'es', 'he');
ERROR:  plpython: function "gtranslate" could not create return value
DETALLE:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

By adding a plpy.log() call you can see that the answer is "él":
LOG:  (u'\xe9l',)

I guess it needs some treatment similar to the one in this function:
http://wiki.postgresql.org/wiki/Strip_accents_from_strings

For completeness, here is the code:

CREATE OR REPLACE FUNCTION gtranslate(src text, target text, phrase text) RETURNS text
LANGUAGE plpythonu
AS $$

import re
import urllib

import simplejson as json

class UrlOpener(urllib.FancyURLopener):
    version = "py-gtranslate/1.0"

base_uri = "http://ajax.googleapis.com/ajax/services/language/translate"
default_params = {'v': '1.0'}

def translate(src, to, phrase):
    args = default_params.copy()
    args.update({
        'langpair': '%s%%7C%s' % (src, to),
        'q': urllib.quote_plus(phrase),
    })
    argstring = '%s' % ('&'.join(['%s=%s' % (k,v) for (k,v) in args.iteritems()]))
    resp = json.load(UrlOpener().open('%s?%s' % (base_uri, argstring)))
    try:
        return resp['responseData']['translatedText']
    except:
        # should probably warn about failed translation
        return phrase

return translate(src, target, phrase)

$$;

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
> Hi,
>
> I was having a look at this snippet:
> http://wiki.postgresql.org/wiki/Google_Translate
> and it turns out that it doesn't work if the result contains non-ASCII
> chars. Does anybody know how to fix it?
>
> alvherre=# select gtranslate('en', 'es', 'he');
> ERROR:  plpython: function "gtranslate" could not create return value
> DETALLE:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

This looks like a python issue rather than a Postgres issue. The problem
is probably in python-simplejson.

cheers

andrew
Andrew Dunstan wrote:
> Alvaro Herrera wrote:
>> Hi,
>>
>> I was having a look at this snippet:
>> http://wiki.postgresql.org/wiki/Google_Translate
>> and it turns out that it doesn't work if the result contains non-ASCII
>> chars. Does anybody know how to fix it?
>>
>> alvherre=# select gtranslate('en', 'es', 'he');
>> ERROR:  plpython: function "gtranslate" could not create return value
>> DETALLE:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>
> This looks like a python issue rather than a Postgres issue. The problem
> is probably in python-simplejson.

I think the problem happens when the PL tries to create the output
value. Otherwise I wouldn't be able to see the value in plpy.log.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote:
> Andrew Dunstan wrote:
>> Alvaro Herrera wrote:
>>> Hi,
>>>
>>> I was having a look at this snippet:
>>> http://wiki.postgresql.org/wiki/Google_Translate
>>> and it turns out that it doesn't work if the result contains non-ASCII
>>> chars. Does anybody know how to fix it?
>>>
>>> alvherre=# select gtranslate('en', 'es', 'he');
>>> ERROR:  plpython: function "gtranslate" could not create return value
>>> DETALLE:  <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>> This looks like a python issue rather than a Postgres issue. The problem
>> is probably in python-simplejson.
>
> I think the problem happens when the PL tries to create the output
> value. Otherwise I wouldn't be able to see the value in plpy.log.

The problem is that the thing you are trying to return
(resp['responseData']['translatedText']) is a Unicode object, so you
can't just output it as-is. The error comes from Python complaining that
you are trying to output a non-ASCII character using the 'ascii' codec,
which cannot encode it.

One solution is to explicitly encode the Unicode string with some codec,
that is: ask Python to convert the Unicode object into a byte string
using some encoding, UTF-8 being a good choice here.

For instance

    return resp['responseData']['translatedText'].encode('utf-8')

worked for me.

See also http://docs.python.org/tutorial/introduction.html#unicode-strings

Cheers,
Jan
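
[Editor's note] The behaviour Jan describes can be reproduced outside PL/Python. What follows is only a minimal sketch, not the wiki snippet itself: the helper names to_ascii_or_none and to_utf8 are made up for illustration, and the value u'\xe9l' is taken from the plpy.log() output earlier in the thread. It shows that the implicit 'ascii' encoding fails on "él" while an explicit UTF-8 encoding succeeds. Whether UTF-8 is the right target presumably depends on the database's server encoding, which the thread does not state.

```python
# The value the translation service returned, as logged by plpy.log().
translated = u'\xe9l'  # "él"


def to_ascii_or_none(text):
    """Try the 'ascii' codec (the one named in the error message).

    Returns the encoded bytes, or None when the text contains a
    character outside the ASCII range.
    """
    try:
        return text.encode('ascii')
    except UnicodeEncodeError:
        return None


def to_utf8(text):
    """Explicitly encode the Unicode string as UTF-8 bytes (Jan's fix)."""
    return text.encode('utf-8')


ascii_result = to_ascii_or_none(translated)  # fails: U+00E9 is not ASCII
utf8_result = to_utf8(translated)            # succeeds: two bytes for U+00E9

print(repr(ascii_result))
print(repr(utf8_result))
```

The same .encode('utf-8') call is what goes on the return line of translate() in the function above.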