Thread: Unicode and unaccent()
I am trying to write an unaccent function because I need to do some queries comparing data that has accents and data that does not.
The encoding on my DB is Unicode, so far I have found an unaccent() function by looking in the mail archives it looks like the following:
CREATE FUNCTION unaccent(text) RETURNS text AS $$
BEGIN
RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou')
; END; $$ LANGUAGE plpgsql IMMUTABLE STRICT;
My problem is that the values like \342 are for LATIN1 type encoding. I have tried and failed to get this working using the what I think is the Unicode escaping method \u0032 for example.
Even if someone could help me with the Unicode escaping method that would be useful. For example if I wanted to find a Unicode character 0x00E2 with a select statement how would I?
Something like select * from table where field like ‘%\u00e2%’;
Doesn’t seem to work.
Does anyone have a good method for unaccenting Unicode dbs/characters?
I am using PG7.4 on FC2
Thank you
Mark Borins wrote: > My problem is that the values like \342 are for LATIN1 type encoding. > I have tried and failed to get this working using the what I think > is the Unicode escaping method \u0032 for example. There is no Unicode escaping method. You need to encode the characters into UTF-8 yourself and write out the individual bytes using the octal escape sequences. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Mark Borins wrote: > The encoding on my DB is Unicode, so far I have found an unaccent() function > by looking in the mail archives it looks like the following: > > > CREATE FUNCTION unaccent(text) RETURNS text AS $$ > BEGIN > RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou') > ; END; $$ LANGUAGE plpgsql IMMUTABLE STRICT; > > My problem is that the values like \342 are for LATIN1 type encoding. I Why wouldn't this: RETURN translate($1, 'éçàêè...', 'ecaee...') ; work just fine? It's even portable across encodings. -- Daniel PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
I am not sure how I could encode the characters into UTF-8. For example, I went to Unicode.org and looked up in the specs for lets say an â is 00E2. If I wanted to do search for all names with an â in them how would I do that? 00E2 into Octal is: 342 So would I do: Select * from table where name like '%\342%' This leads to a greater question. I am trying to convert a Unicode DB to Latin1 because I realized we have absolutely no reason to be using Unicode. When I try to restore the back of a Unicode database into Latin1 I am getting some conversion errors as there are characters in Unicode that cannot be converted automatically into Latin1. These are erroneous characters and I would like to find them, I am give the hex value of the offending character. For example, 0x00E2, how would I search for this character? Thanks in advance for any help. -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Peter Eisentraut Sent: May 6, 2005 2:12 AM To: Mark Borins Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Unicode and unaccent() Mark Borins wrote: > My problem is that the values like \342 are for LATIN1 type encoding. > I have tried and failed to get this working using the what I think > is the Unicode escaping method \u0032 for example. There is no Unicode escaping method. You need to encode the characters into UTF-8 yourself and write out the individual bytes using the octal escape sequences. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend