At 01:25 02/10/2011, Reuven M. Lerner wrote:
>Hi, everyone.=C2 I'm working on a project on=20
>PostgreSQL 9.0 (soon to be upgraded to 9.1,=20
>given that we haven't yet launched).=C2 The=20
>project will involve numerous text fields=20
>containing English, Spanish, and=20
>Portuguese.=C2 Some of those text fields will be=20
>searchable by the user.=C2 That's easy enough to=20
>do; for our purposes, I was planning to use some=20
>combination of LIKE searches; the database is=20
>small enough that this doesn't take very much=20
>time, and we don't expect the number of=20
>searchable records (or columns within those records) to be all that large.
>
>The thing is, the people running the site want=20
>searches to work on what I'm calling (for lack=20
>of a better term) "bare" letters.=C2 That is, if=20
>the user searches for "n", then the search=20
>should also match Spanish words containing=20
>"=C3=B1".=C2 I'm told by Spanish-speaking members of=20
>the team that this is how they would expect=20
>searches to work.=C2 However, when I just did a=20
>quick test using a UTF-8 encoded 9.0 database, I=20
>found that PostgreSQL didn't=C2 see the two=20
>characters as identical.=C2 (I must say, this is=20
>the behavior that I would have expected, had the=20
>Spanish-speaking team member not said anything on the subject.)
>
>So my question is whether I can somehow wrangle=20
>PostgreSQL into thinking that "n" and "=C3=B1" are=20
>the same character for search purposes, or if I=20
>need to do something else -- use regexps, keep a=20
>"naked," searchable version of each column=20
>alongside the native one, or something else entirely -- to get this to wor=
k.
>
>Any ideas?
You can use perceptual hashing for that. There=20
are multiple algorithms, some of them can be tuned for specific languages.
See this documentation:
http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description,
http://en.wikipedia.org/wiki/Soundex is the first one developed, very old,
http://en.wikipedia.org/wiki/Metaphone is a=20
family of several modern algorithms.
Remember that they are hashing algorithms, some=20
words can collide because they have the same pronunciation but write differ=
ent.
I remember that datapark search engine uses them=20
with dictionaries. You can check it too.
http://www.dataparksearch.org/
>Thanks,
>
>Reuven
HTH=20