Thread: Searching for "bare" letters

Searching for "bare" letters

From
"Reuven M. Lerner"
Date:
Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched).  The project will involve numerous text fields containing English, Spanish, and Portuguese.  Some of those text fields will be searchable by the user.  That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.

The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters.  That is, if the user searches for "n", then the search should also match Spanish words containing "ñ".  I'm told by Spanish-speaking members of the team that this is how they would expect searches to work.  However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't see the two characters as identical.  (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)

So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.

Any ideas?

Thanks,

Reuven

-- 
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner

Re: Searching for "bare" letters

From
planas
Date:
On Sun, 2011-10-02 at 01:25 +0200, Reuven M. Lerner wrote:
> Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched).  The project will involve numerous text fields containing English, Spanish, and Portuguese.  Some of those text fields will be searchable by the user.  That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.
>
> The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters.  That is, if the user searches for "n", then the search should also match Spanish words containing "ñ".  I'm told by Spanish-speaking members of the team that this is how they would expect searches to work.  However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't  see the two characters as identical.  (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)
>
> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.

Could you parse the search string for the non-English characters and convert them to the appropriate English character? My skills are not that good or I would offer more details.
> Any ideas?
>
> Thanks,
>
> Reuven
>
> -- 
> Reuven M. Lerner -- Web development, consulting, and training
> Mobile: +972-54-496-8405 * US phone: 847-230-9795
> Skype/AIM: reuvenlerner


--
Jay Lozier
jslozier@gmail.com

Re: Searching for "bare" letters

From
Uwe Schroeder
Date:

> Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon to be
> upgraded to 9.1, given that we haven't yet launched).  The project will
> involve numerous text fields containing English, Spanish, and Portuguese.
> Some of those text fields will be searchable by the user.  That's easy
> enough to do; for our purposes, I was planning to use some combination of
> LIKE searches; the database is small enough that this doesn't take very
> much time, and we don't expect the number of searchable records (or
> columns within those records) to be all that large. The thing is, the
> people running the site want searches to work on what I'm calling (for
> lack of a better term) "bare" letters.  That is, if the user searches for
> "n", then the search should also match Spanish words containing "ñ".  I'm
> told by Spanish-speaking members of the team that this is how they would
> expect searches to work.  However, when I just did a quick test using a
> UTF-8 encoded 9.0 database, I found that PostgreSQL didn't  see the two
> characters as identical.  (I must say, this is the behavior that I would
> have expected, had the Spanish-speaking team member not said anything on
> the subject.) So my question is whether I can somehow wrangle PostgreSQL
> into thinking that "n" and "ñ" are the same character for search purposes,
> or if I need to do something else -- use regexps, keep a "naked,"
> searchable version of each column alongside the native one, or something
> else entirely -- to get this to work. Any ideas?
> Thanks,
> Reuven



What kind of "client" are the users using?  I assume you will have some kind
of user interface, and for me this is a typical job for the user interface. The
number of letters with "equivalents" in different languages is extremely
small, so a simple matching routine in the user interface should give you a
way to issue the proper query.
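[Editor's note: the matching routine described above might look like the following Python sketch. The EQUIVALENTS table is an illustrative, incomplete sample of Spanish/Portuguese letters, not a definitive mapping.]

```python
# Hypothetical table of "bare" letters and their accented equivalents.
EQUIVALENTS = {
    "a": "aáàâãä", "e": "eéèêë", "i": "iíìîï",
    "o": "oóòôõö", "u": "uúùûü", "n": "nñ", "c": "cç",
}

def to_regex(term):
    """Turn a bare search term into a regex that also matches accented forms."""
    return "".join(
        "[%s]" % EQUIVALENTS[ch] if ch in EQUIVALENTS else ch
        for ch in term.lower()
    )

print(to_regex("nino"))  # [nñ][iíìîï][nñ][oóòôõö]
```

The resulting pattern can then be handed to a `~` (regex) query instead of LIKE.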

Uwe

Re: Searching for "bare" letters

From
Cody Caughlan
Date:
One approach would be to "normalize" all the text and search against that.

That is, basically convert all accented characters to their ASCII equivalents.

I've had to do this in Solr for searching, for exactly the reasons you've outlined: treat "ñ" as "n". Ditto for "ü" -> "u", "é" -> "e", etc.

This is easily done in Solr via the included ASCIIFoldingFilterFactory:


You could look at the code to see how they do the conversion and implement it.
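[Editor's note: the same folding that ASCIIFoldingFilterFactory performs can be approximated with Unicode NFD decomposition. A sketch in Python; it handles combining accents but not special cases such as "ß" or "æ", which Solr's filter also folds.]

```python
import unicodedata

def ascii_fold(text):
    """Decompose accented characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(ascii_fold("ñ"), ascii_fold("ü"), ascii_fold("é"))  # n u e
```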

/Cody

On Oct 1, 2011, at 7:09 PM, planas wrote:

> On Sun, 2011-10-02 at 01:25 +0200, Reuven M. Lerner wrote:
>> Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched).  The project will involve numerous text fields containing English, Spanish, and Portuguese.  Some of those text fields will be searchable by the user.  That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.
>>
>> The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters.  That is, if the user searches for "n", then the search should also match Spanish words containing "ñ".  I'm told by Spanish-speaking members of the team that this is how they would expect searches to work.  However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't  see the two characters as identical.  (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)
>>
>> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.
>
> Could you parse the search string for the non-English characters and convert them to the appropriate English character? My skills are not that good or I would offer more details.
>
>> Any ideas?
>>
>> Thanks,
>>
>> Reuven
>>
>> -- 
>> Reuven M. Lerner -- Web development, consulting, and training
>> Mobile: +972-54-496-8405 * US phone: 847-230-9795
>> Skype/AIM: reuvenlerner
>
> --
> Jay Lozier
> jslozier@gmail.com

Re: Searching for "bare" letters

From
"Reuven M. Lerner"
Date:
Hi, everyone.  Uwe wrote:

> What kind of "client" are the users using?  I assume you will have some kind
> of user interface. For me this is a typical job for a user interface. The
> number of letters with "equivalents" in different languages are extremely
> limited, so a simple matching routine in the user interface should give you a
> way to issue the proper query.

The user interface will be via a Web application.  But we need to store
the data with the European characters, such as ñ, so that we can display
them appropriately.  So much as I like your suggestion, we need to do
the opposite of what you're saying -- namely, take a bare letter, and
then search for letters with accents and such on them.

I am beginning to think that storing two versions of each name, one bare
and the other not, might be the easiest way to go.   But hey, I'm open
to more suggestions.

Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner

Re: Searching for "bare" letters

From
hamann.w@t-online.de
Date:
Reuven M. Lerner wrote:

>>     <p>Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon
>>       to be upgraded to 9.1, given that we haven't yet launched).  The
>>       project will involve numerous text fields containing English,
>>       Spanish, and Portuguese.  Some of those text fields will be
>>       searchable by the user.  That's easy enough to do; for our
>>       purposes, I was planning to use some combination of LIKE searches;
>>       the database is small enough that this doesn't take very much
>>       time, and we don't expect the number of searchable records (or
>>       columns within those records) to be all that large.</p>
>>     <p>The thing is, the people running the site want searches to work
>>       on what I'm calling (for lack of a better term) "bare" letters. 
>>       That is, if the user searches for "n", then the search should also
>>       match Spanish words containing "ñ".  I'm told by Spanish-speaking
>>       members of the team that this is how they would expect searches to
>>       work.  However, when I just did a quick test using a UTF-8 encoded
>>       9.0 database, I found that PostgreSQL didn't  see the two
>>       characters as identical.  (I must say, this is the behavior that I
>>       would have expected, had the Spanish-speaking team member not said
>>       anything on the subject.)</p>
>>     <p>So my question is whether I can somehow wrangle PostgreSQL into
>>       thinking that "n" and "ñ" are the same character for search
>>       purposes, or if I need to do something else -- use regexps, keep a
>>       "naked," searchable version of each column alongside the native
>>       one, or something else entirely -- to get this to work.</p>
>>     <p>Any ideas?</p>
>>     <p>Thanks,</p>
>>     <p>Reuven<br>

I had the same problem with German (there is ä ö ü).
I ended up with a normalized version of the database (for many purposes, this could
be just an extra column) plus preprocessing the input.
There is one difficulty with German searches: these letters are commonly transliterated as
"ue" etc., as in "Muenchen". So depending on culture, some people would expect a "u" search
term to match, and others "ue". So preprocessing the query means replacing a bare "u"
(not followed by "e") with a "ue?" regex.
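[Editor's note: that query preprocessing might be sketched as follows; the rewrite lets a bare vowel match the accented letter or its two-character transliteration. A toy example, covering only ä/ö/ü.]

```python
import re

def german_pattern(term):
    """Rewrite a bare a/o/u (not already followed by 'e') into a regex
    alternative matching the plain letter, the umlaut, or the digraph."""
    term = re.sub(r"u(?!e)", "(?:u|ü|ue)", term)
    term = re.sub(r"o(?!e)", "(?:o|ö|oe)", term)
    term = re.sub(r"a(?!e)", "(?:a|ä|ae)", term)
    return term

print(german_pattern("Munchen"))  # M(?:u|ü|ue)nchen
```

A query for "Munchen" then matches "München" and "Muenchen" alike.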

BTW: if your search form does not explicitly tell the browser to use UTF-8 to encode the search field,
you can expect a small proportion of ISO-Latin-1 requests.

Regards
Wolfgang




Re: Searching for "bare" letters

From
Uwe Schroeder
Date:
> Hi, everyone.  Uwe wrote:
> > What kind of "client" are the users using?  I assume you will have some
> > kind of user interface. For me this is a typical job for a user
> > interface. The number of letters with "equivalents" in different
> > languages are extremely limited, so a simple matching routine in the
> > user interface should give you a way to issue the proper query.
>
> The user interface will be via a Web application.  But we need to store
> the data with the European characters, such as ñ, so that we can display
> them appropriately.  So much as I like your suggestion, we need to do
> the opposite of what you're saying -- namely, take a bare letter, and
> then search for letters with accents and such on them.
>
> I am beginning to think that storing two versions of each name, one bare
> and the other not, might be the easiest way to go.   But hey, I'm open
> to more suggestions.
>
> Reuven


That still doesn't prevent you from using a matching algorithm. Here is a simple
example (as I understand the problem):
you have texts stored in the DB containing both "n" and "ñ". Now a client
enters "n" on the website. What you want to do is look for both variations, so
"n" translates into "n" or "ñ".
There you have it. In the routine that receives the request, you have a
matching method that matches on "n" (or any of the few other characters with
equivalents), and the routine issues a query with "xx LIKE '%n%' OR xx
LIKE '%ñ%'" (personally I would use ILIKE, since that eliminates the case
problem).
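[Editor's note: generating that expanded query might look like the sketch below. The column name `name` and the psycopg2-style `%s` placeholders are assumptions for illustration.]

```python
def ilike_query(variants):
    """Build an OR'd ILIKE condition and its parameter list for one column,
    given the list of character variants to search for."""
    clause = " OR ".join(["name ILIKE %s"] * len(variants))
    params = ["%" + v + "%" for v in variants]
    return clause, params

clause, params = ilike_query(["n", "ñ"])  # the variants of a bare "n"
print(clause)  # name ILIKE %s OR name ILIKE %s
print(params)  # ['%n%', '%ñ%']
```

Passing the parameters separately (rather than interpolating them) keeps the query safe from SQL injection.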

Since you're referring to a "name", I don't know the specifics of the
problem or data layout, but from what I know, I think you can tackle this with a
rather primitive "match -> translate to" kind of algorithm.

One thing I'd not do: store duplicate versions. There's always a way to deal
with data the way it is. In my opinion storing different versions of the same
data just bloats a database in favor of a smarter way to deal with the initial
data.

Uwe



Re: Searching for "bare" letters

From
Oleg Bartunov
Date:
I don't see the problem -- you can have a dictionary which does all the work of
recognizing bare letters and outputs several versions. Have you seen the unaccent
dictionary?

Oleg
On Sun, 2 Oct 2011, Uwe Schroeder wrote:

>> Hi, everyone.  Uwe wrote:
>>> What kind of "client" are the users using?  I assume you will have some
>>> kind of user interface. For me this is a typical job for a user
>>> interface. The number of letters with "equivalents" in different
>>> languages are extremely limited, so a simple matching routine in the
>>> user interface should give you a way to issue the proper query.
>>
>> The user interface will be via a Web application.  But we need to store
>> the data with the European characters, such as ñ, so that we can display
>> them appropriately.  So much as I like your suggestion, we need to do
>> the opposite of what you're saying -- namely, take a bare letter, and
>> then search for letters with accents and such on them.
>>
>> I am beginning to think that storing two versions of each name, one bare
>> and the other not, might be the easiest way to go.   But hey, I'm open
>> to more suggestions.
>>
>> Reuven
>
>
> That still doesn't hinder you from using a matching algorithm. Here a simple
> example (to my understanding of the problem)
> You have texts stored in the db both containing a n and a ñ. Now a client
> enters "n" on the website. What you want to do is look for both variations, so
> "n" translates into "n" or "ñ".
> There you have it. In the routine that receives the request you have a
> matching method that matches on "n" (or any of the few other characters with
> equivalents) and the routine will issue a query with a "xx like "%n%" or xx
> like "%ñ%" (personally I would use ilike, since that eliminates the case
> problem).
>
> Since you're referring to a "name", I sure don't know the specifics of the
> problem or data layout, but by what I know I think you can tackle this with a
> rather primitive "match -> translate to" kind of algorithm.
>
> One thing I'd not do: store duplicate versions. There's always a way to deal
> with data the way it is. In my opinion storing different versions of the same
> data just bloats a database in favor of a smarter way to deal with the initial
> data.
>
> Uwe
>
>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Searching for "bare" letters

From
"Reuven M. Lerner"
Date:
Hi, Oleg.  You wrote:

> I don't see the problem - you can have a dictionary, which does all work
> on recognizing bare letters and output several versions. Have you seen
> unaccent
> dictionary ?

This seems to be the direction that everyone is suggesting, and I'm
quite grateful for that.  (I really hadn't ever needed to deal with such
issues in the past, having worked mostly with English and Hebrew, which
don't have such accent marks.)

As for the unaccent dictionary, I hadn't heard of it before, but just
saw it now in contrib, and it looks like it might fit perfectly.  I'll
take a look; thanks for the suggestion.

Reuven

--
Reuven M. Lerner -- Web development, consulting, and training
Mobile: +972-54-496-8405 * US phone: 847-230-9795
Skype/AIM: reuvenlerner

Re: Searching for "bare" letters

From
Mike Christensen
Date:
>> I don't see the problem - you can have a dictionary, which does all work
>> on recognizing bare letters and output several versions. Have you seen
>> unaccent
>> dictionary ?
>
> This seems to be the direction that everyone is suggesting, and I'm quite
> grateful for that.  (I really hadn't ever needed to deal with such issues in
> the past, having worked mostly with English and Hebrew, which don't have
> such accent marks.)
>
> As for the unaccent dictionary, I hadn't heard of it before, but just saw it
> now in contrib, and it looks like it might fit perfectly.  I'll take a look;
> thanks for the suggestion.

I wrote this code for something similar I was doing, feel free to rip
it off or copy the regular expressions:


// Fold Latin-1 accented characters to their base ASCII letters.
input = Regex.Replace(input, @"[\xC0-\xC5\xE0-\xE5]", "a");  // À-Å, à-å -> "a"
input = Regex.Replace(input, @"[\xC8-\xCB\xE8-\xEB]", "e");  // È-Ë, è-ë -> "e"
input = Regex.Replace(input, @"[\xCC-\xCF\xEC-\xEF]", "i");  // Ì-Ï, ì-ï -> "i"
input = Regex.Replace(input, @"[\xD1\xF1]", "n");            // Ñ, ñ -> "n"
input = Regex.Replace(input, @"[\xD2-\xD6\xF2-\xF6]", "o");  // Ò-Ö, ò-ö -> "o"
input = Regex.Replace(input, @"[\xD9-\xDC\xF9-\xFC]", "u");  // Ù-Ü, ù-ü -> "u"
input = Regex.Replace(input, @"[\xDD\xFD\xFF]", "y");        // Ý, ý, ÿ -> "y"
input = Regex.Replace(input, @"\xDF", "ss");                 // ß -> "ss", not "y"

Re: Searching for "bare" letters

From
Eduardo Morras
Date:
At 01:25 02/10/2011, Reuven M. Lerner wrote:

>Hi, everyone.  I'm working on a project on
>PostgreSQL 9.0 (soon to be upgraded to 9.1,
>given that we haven't yet launched).  The
>project will involve numerous text fields
>containing English, Spanish, and
>Portuguese.  Some of those text fields will be
>searchable by the user.  That's easy enough to
>do; for our purposes, I was planning to use some
>combination of LIKE searches; the database is
>small enough that this doesn't take very much
>time, and we don't expect the number of
>searchable records (or columns within those records) to be all that large.
>
>The thing is, the people running the site want
>searches to work on what I'm calling (for lack
>of a better term) "bare" letters.  That is, if
>the user searches for "n", then the search
>should also match Spanish words containing
>"ñ".  I'm told by Spanish-speaking members of
>the team that this is how they would expect
>searches to work.  However, when I just did a
>quick test using a UTF-8 encoded 9.0 database, I
>found that PostgreSQL didn't  see the two
>characters as identical.  (I must say, this is
>the behavior that I would have expected, had the
>Spanish-speaking team member not said anything on the subject.)
>
>So my question is whether I can somehow wrangle
>PostgreSQL into thinking that "n" and "ñ" are
>the same character for search purposes, or if I
>need to do something else -- use regexps, keep a
>"naked," searchable version of each column
>alongside the native one, or something else entirely -- to get this to work.
>
>Any ideas?

You can use phonetic hashing for that. There
are multiple algorithms, and some of them can be tuned for specific languages.

See this documentation:

http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description,

http://en.wikipedia.org/wiki/Soundex is the first one developed, very old,

http://en.wikipedia.org/wiki/Metaphone is a
family of several modern algorithms.

Remember that these are hashing algorithms: some
words can collide because they have the same pronunciation but are written differently.
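[Editor's note: such collisions are by design; a minimal sketch of the classic simplified American Soundex shows how. Real implementations (e.g. PostgreSQL's fuzzystrmatch `soundex()`) handle a few more edge cases.]

```python
def soundex(word):
    """Simplified American Soundex: first letter plus up to three digits."""
    mapping = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            mapping[ch] = digit

    word = word.upper()
    code = word[0]
    prev = mapping.get(word[0])
    for ch in word[1:]:
        digit = mapping.get(ch)
        if digit:
            if digit != prev:    # adjacent letters with the same code collapse
                code += digit
            prev = digit
        elif ch not in "HW":     # vowels separate codes; H and W do not
            prev = None
    return (code + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- a deliberate collision
```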

I remember that the DataparkSearch engine uses them
with dictionaries. You can check it too.

http://www.dataparksearch.org/

>Thanks,
>
>Reuven

HTH



Re: Searching for "bare" letters

From
Andrew Sullivan
Date:
On Sun, Oct 02, 2011 at 05:45:48PM +0200, Reuven M. Lerner wrote:
> quite grateful for that.  (I really hadn't ever needed to deal with
> such issues in the past, having worked mostly with English and
> Hebrew, which don't have such accent marks.)

That isn't quite true about English.  We have words like coöperate and
naïve.  The former is sometimes fixed with a hyphen instead, but the
latter can't be.

I think what happened is that English speakers, because we're already
used to being sloppy (you can't tell what's a subjunctive in English,
either, just by looking) were willing to adapt our spelling to reflect
the limitations of typewriters.  Also, English never really had an
official standard spelling -- by the time the English were attempting
to standardize seriously, there was already an American branch with
its own Bossypants Official Reformer of Spelling ("BORS", which in
that case was Noah Webster.  See G.B. Shaw for a British example).  So
we mostly lost the accents in standard spelling.  We also lost various
standard digraphs, like that in encyclopædia (which, depending on
which branch of nonsense you subscribe to, can be spelled instead
"encyclopedia" or "encyclopaedia"; both would have been called "wrong"
once upon a time).

> As for the unaccent dictionary, I hadn't heard of it before, but
> just saw it now in contrib, and it looks like it might fit
> perfectly.  I'll take a look; thanks for the suggestion.

The big problem there is what someone else pointed to up-thread: in
some languages, the natural thing to do is to transliterate using
multiple characters.  The usual example is German, where it is common
to use "e" after a vowel to approximate the umlaut, so "ö" becomes
"oe".  Unfortunately, in Swedish this is clearly a mistake: if you
can't use the diaeresis, then you just use the "undecorated" character
instead.  The famous Swedish ship called the Götheborg cannot be
transliterated as Goetheborg.  Even in German, the rule is
complicated, because it's not two-way: you can't spell the famous
writer's name Göthe (even though Google seems to think you can).

As far as I can tell, the unaccent dictionary doesn't handle the
two-character case, though it sure looks like it could be extended to
do it.  But it doesn't seem to have a facility for differentiating
based on the language of the string.  I don't know whether that could
be added.

The upshot is that, if you need to store multilingual input and do
special handling on the strings afterwards, you are wise to store the
string with a language tag so that you can apply the right rules later
on.  See RFC 5646 (http://www.rfc-editor.org/rfc/rfc5646.txt) for some
pointers.  If just "stripping accents" is good enough for you, then
the unaccent dictionary will probably be good enough.
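[Editor's note: the language-tag approach can be sketched as below. The rule tables are toy examples covering only the letters discussed in this thread, and lowercase input is assumed; tags follow RFC 5646.]

```python
# Per-language folding rules: German transliterates umlauts as digraphs,
# Swedish simply drops the diacritic.
FOLD_RULES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "sv": {"å": "a", "ä": "a", "ö": "o"},
}

def fold(text, lang):
    """Fold a lowercase string using the rules for its language tag."""
    for src, dst in FOLD_RULES.get(lang, {}).items():
        text = text.replace(src, dst)
    return text

print(fold("götheborg", "sv"))  # gotheborg
print(fold("münchen", "de"))    # muenchen
```

Storing the tag alongside the string is what makes picking the right table possible later on.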

A

--
Andrew Sullivan
ajs@crankycanuck.ca