Thread: Searching for "bare" letters
Hi, everyone. I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched). The project will involve numerous text fields containing English, Spanish, and Portuguese. Some of those text fields will be searchable by the user. That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.

The thing is, the people running the site want searches to work on what I'm calling (for lack of a better term) "bare" letters. That is, if the user searches for "n", then the search should also match Spanish words containing "ñ". I'm told by Spanish-speaking members of the team that this is how they would expect searches to work. However, when I just did a quick test using a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't see the two characters as identical. (I must say, this is the behavior that I would have expected, had the Spanish-speaking team member not said anything on the subject.)

So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.

Any ideas?

Thanks,

Reuven
-- Reuven M. Lerner -- Web development, consulting, and training Mobile: +972-54-496-8405 * US phone: 847-230-9795 Skype/AIM: reuvenlerner
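A quick way to reproduce the behavior Reuven describes in a UTF-8 encoded database (a sketch; "España" stands in for any searchable value):

    SELECT 'España' LIKE '%n%';    -- false: LIKE compares characters, so 'ñ' never matches 'n'
    SELECT 'España' ILIKE '%n%';   -- still false: ILIKE only folds case, not accents
    SELECT 'España' LIKE '%ñ%';    -- true: the exact character does match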
On Sun, 2011-10-02 at 01:25 +0200, Reuven M. Lerner wrote:
> Hi, everyone. I'm working on a project on PostgreSQL 9.0 (soon to be upgraded to 9.1, given that we haven't yet launched). The project will involve numerous text fields containing English, Spanish, and Portuguese. Some of those text fields will be searchable by the user. That's easy enough to do; for our purposes, I was planning to use some combination of LIKE searches; the database is small enough that this doesn't take very much time, and we don't expect the number of searchable records (or columns within those records) to be all that large.
> [...]

Could you parse the search string for the non-English characters and convert them to the appropriate English character? My skills are not that good or I would offer more details.

-- Jay Lozier jslozier@gmail.com
> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.
> Any ideas?

What kind of "client" are the users using? I assume you will have some kind of user interface. For me this is a typical job for a user interface. The number of letters with "equivalents" in different languages is extremely limited, so a simple matching routine in the user interface should give you a way to issue the proper query.

Uwe
One approach would be to "normalize" all the text and search against that.
That is, basically convert all non-ASCII characters to their plain-ASCII equivalents.
I've had to do this in Solr for exactly the reasons you've outlined: treat "ñ" as "n". Ditto for "ü" -> "u", "é" -> "e", etc.
This is easily done in Solr via the included ASCIIFoldingFilterFactory.
You could look at its code to see how it does the conversion and implement the same mapping yourself.
/Cody
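A rough sketch of the same normalization idea done inside PostgreSQL, using the built-in translate() function (the table and column names are made up, and the character list would need extending for full coverage):

    -- fold the accented characters used in Spanish and Portuguese to bare letters
    SELECT translate(lower('São Paulo / España'),
                     'áàâãäéèêëíìîïóòôõöúùûüçñ',
                     'aaaaaeeeeiiiiooooouuuucn');   -- 'sao paulo / espana'

    -- the same expression works on both sides of a LIKE search
    SELECT *
      FROM items
     WHERE translate(lower(title),
                     'áàâãäéèêëíìîïóòôõöúùûüçñ',
                     'aaaaaeeeeiiiiooooouuuucn')
           LIKE '%' || translate(lower('espana'),
                                 'áàâãäéèêëíìîïóòôõöúùûüçñ',
                                 'aaaaaeeeeiiiiooooouuuucn') || '%';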
Hi, everyone. Uwe wrote:

> What kind of "client" are the users using? I assume you will have some kind of user interface. For me this is a typical job for a user interface. The number of letters with "equivalents" in different languages is extremely limited, so a simple matching routine in the user interface should give you a way to issue the proper query.

The user interface will be via a Web application. But we need to store the data with the European characters, such as ñ, so that we can display them appropriately. So much as I like your suggestion, we need to do the opposite of what you're saying -- namely, take a bare letter, and then search for letters with accents and such on them.

I am beginning to think that storing two versions of each name, one bare and the other not, might be the easiest way to go. But hey, I'm open to more suggestions.

Reuven
-- Reuven M. Lerner -- Web development, consulting, and training Mobile: +972-54-496-8405 * US phone: 847-230-9795 Skype/AIM: reuvenlerner
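A sketch of the two-column idea Reuven describes, with the bare version kept in sync by a trigger (the table, column, and function names are hypothetical, and the folding expression is the simple translate() call shown earlier in the thread):

    ALTER TABLE people ADD COLUMN name_bare text;

    CREATE OR REPLACE FUNCTION people_set_name_bare() RETURNS trigger AS $$
    BEGIN
        -- keep a lower-cased, accent-stripped copy of "name" alongside the original
        NEW.name_bare := translate(lower(NEW.name),
                                   'áàâãäéèêëíìîïóòôõöúùûüçñ',
                                   'aaaaaeeeeiiiiooooouuuucn');
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER people_name_bare
        BEFORE INSERT OR UPDATE ON people
        FOR EACH ROW EXECUTE PROCEDURE people_set_name_bare();

    UPDATE people SET name = name;   -- backfill existing rows through the trigger

    -- searches fold the user's term the same way and hit the bare column
    SELECT * FROM people WHERE name_bare LIKE '%' || lower('munoz') || '%';

(If the tables ever grow, an index on name_bare could be added as well.)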
Reuven M. Lerner wrote:
>> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes [...]

I had the same problem with German (there are ä, ö, ü). I ended up with a normalized version of the database (for many purposes, this could be just an extra column), plus preprocessing the input.

There is one difficulty with German searches: these letters are commonly transliterated into "ue" etc., as in "Muenchen". So depending on culture, some people would expect a "u" search term to match, and others the "ue". So preprocessing the query means replacing a bare "u" (not followed by "e") with a "ue?" regex.

BTW: if your search form does not explicitly tell the browser to use UTF-8 to encode the search field, you might expect a small proportion of ISO Latin-1 requests.

Regards
Wolfgang
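A sketch of that preprocessing step using PostgreSQL's regexp functions, matching with the regex operator rather than LIKE; this version also folds in "ü", going slightly beyond the bare "ue?" idea (the search term "munchen" is made up):

    -- expand a bare "u" (when not already written as "ue") so it matches u, ü, or ue
    SELECT regexp_replace('munchen', 'u(?!e)', '(u|ü|ue)', 'g');   -- 'm(u|ü|ue)nchen'

    SELECT 'München'  ~* 'm(u|ü|ue)nchen';    -- true
    SELECT 'Muenchen' ~* 'm(u|ü|ue)nchen';    -- true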
> The user interface will be via a Web application. But we need to store the data with the European characters, such as ñ, so that we can display them appropriately. So much as I like your suggestion, we need to do the opposite of what you're saying -- namely, take a bare letter, and then search for letters with accents and such on them.
> I am beginning to think that storing two versions of each name, one bare and the other not, might be the easiest way to go. But hey, I'm open to more suggestions.

That still doesn't hinder you from using a matching algorithm. Here is a simple example (to my understanding of the problem). You have texts stored in the db containing both an "n" and an "ñ". Now a client enters "n" on the website. What you want to do is look for both variations, so "n" translates into "n" or "ñ". There you have it. In the routine that receives the request you have a matching method that matches on "n" (or any of the few other characters with equivalents), and the routine will issue a query with "xx LIKE '%n%' OR xx LIKE '%ñ%'" (personally I would use ILIKE, since that eliminates the case problem).

Since you're referring to a "name", I sure don't know the specifics of the problem or data layout, but from what I know I think you can tackle this with a rather primitive "match -> translate to" kind of algorithm.

One thing I'd not do: store duplicate versions. There's always a way to deal with data the way it is. In my opinion, storing different versions of the same data just bloats a database in favor of a smarter way to deal with the initial data.

Uwe
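In SQL terms, the query such a routine would end up issuing for the search term "n" might look like this (the table and column names are made up):

    SELECT *
      FROM people
     WHERE name ILIKE '%n%'
        OR name ILIKE '%ñ%';

    -- or, equivalently, a single case-insensitive regex match
    SELECT * FROM people WHERE name ~* '[nñ]';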
I don't see the problem - you can have a dictionary which does all the work of recognizing bare letters and outputs several versions. Have you seen the unaccent dictionary?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Hi, Oleg. You wrote:

> I don't see the problem - you can have a dictionary which does all the work of recognizing bare letters and outputs several versions. Have you seen the unaccent dictionary?

This seems to be the direction that everyone is suggesting, and I'm quite grateful for that. (I really hadn't ever needed to deal with such issues in the past, having worked mostly with English and Hebrew, which don't have such accent marks.)

As for the unaccent dictionary, I hadn't heard of it before, but just saw it now in contrib, and it looks like it might fit perfectly. I'll take a look; thanks for the suggestion.

Reuven
-- Reuven M. Lerner -- Web development, consulting, and training Mobile: +972-54-496-8405 * US phone: 847-230-9795 Skype/AIM: reuvenlerner
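For reference, a minimal sketch of what the contrib unaccent module provides (on 9.1 it can be installed with CREATE EXTENSION; on 9.0 it ships as a contrib SQL script; the table and column names are made up):

    CREATE EXTENSION unaccent;              -- PostgreSQL 9.1 and later

    SELECT unaccent('Señor');               -- 'Senor'
    SELECT unaccent('ñ') = unaccent('n');   -- true

    -- fold both the stored text and the search term before comparing
    SELECT *
      FROM people
     WHERE unaccent(name) ILIKE '%' || unaccent('n') || '%';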
> As for the unaccent dictionary, I hadn't heard of it before, but just saw it now in contrib, and it looks like it might fit perfectly. I'll take a look; thanks for the suggestion.

I wrote this code for something similar I was doing; feel free to rip it off or copy the regular expressions:

    // C# (System.Text.RegularExpressions): fold Latin-1 accented characters to bare letters
    input = Regex.Replace(input, @"[\xC0-\xC5\xE0-\xE5]", "a"); // À-Å, à-å -> "a"
    input = Regex.Replace(input, @"[\xC8-\xCB\xE8-\xEB]", "e"); // È-Ë, è-ë -> "e"
    input = Regex.Replace(input, @"[\xCC-\xCF\xEC-\xEF]", "i"); // Ì-Ï, ì-ï -> "i"
    input = Regex.Replace(input, @"[\xD1\xF1]", "n");           // Ñ, ñ -> "n"
    input = Regex.Replace(input, @"[\xD2-\xD6\xF2-\xF6]", "o"); // Ò-Ö, ò-ö -> "o"
    input = Regex.Replace(input, @"[\xD9-\xDC\xF9-\xFC]", "u"); // Ù-Ü, ù-ü -> "u"
    input = Regex.Replace(input, @"[\xDD\xDF\xFF]", "y");       // Ý, ß, ÿ -> "y"
At 01:25 02/10/2011, Reuven M. Lerner wrote:

> So my question is whether I can somehow wrangle PostgreSQL into thinking that "n" and "ñ" are the same character for search purposes, or if I need to do something else -- use regexps, keep a "naked," searchable version of each column alongside the native one, or something else entirely -- to get this to work.
> Any ideas?

You can use phonetic hashing for that. There are multiple algorithms, and some of them can be tuned for specific languages. See this documentation: http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description; http://en.wikipedia.org/wiki/Soundex is the first one developed, and very old; http://en.wikipedia.org/wiki/Metaphone is a family of several modern algorithms.

Remember that they are hashing algorithms: some words can collide because they have the same pronunciation but are written differently.

I remember that the DataparkSearch engine uses them with dictionaries. You can check it too: http://www.dataparksearch.org/

HTH
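PostgreSQL's contrib fuzzystrmatch module ships implementations of some of these algorithms (Soundex, Metaphone, Double Metaphone); a minimal sketch, bearing in mind that the module's documentation warns these functions do not handle multibyte (UTF-8) input well:

    SELECT soundex('Robert'), soundex('Rupert');   -- both 'R163'
    SELECT difference('Anne', 'Ann');              -- 4: the two Soundex codes are identical
    SELECT metaphone('Thompson', 10);              -- Metaphone code, at most 10 characters long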
On Sun, Oct 02, 2011 at 05:45:48PM +0200, Reuven M. Lerner wrote:
> quite grateful for that. (I really hadn't ever needed to deal with such issues in the past, having worked mostly with English and Hebrew, which don't have such accent marks.)

That isn't quite true about English. We have words like coöperate and naïve. The former is sometimes fixed with a hyphen instead, but the latter can't be. I think what happened is that English speakers, because we're already used to being sloppy (you can't tell what's a subjunctive in English, either, just by looking), were willing to adapt our spelling to reflect the limitations of typewriters. Also, English never really had an official standard spelling -- by the time the English were attempting to standardize seriously, there was already an American branch with its own Bossypants Official Reformer of Spelling ("BORS", which in that case was Noah Webster. See G.B. Shaw for a British example). So we mostly lost the accents in standard spelling. We also lost various standard digraphs, like that in encyclopædia (which, depending on which branch of nonsense you subscribe to, can be spelled instead "encyclopedia" or "encyclopaedia"; both would have been called "wrong" once upon a time).

> As for the unaccent dictionary, I hadn't heard of it before, but just saw it now in contrib, and it looks like it might fit perfectly. I'll take a look; thanks for the suggestion.

The big problem there is what someone else pointed to up-thread: in some languages, the natural thing to do is to transliterate using multiple characters. The usual example is that in German it is common to use "e" after a vowel to approximate the umlaut. So, "ö" becomes "oe". Unfortunately, in Swedish this is clearly a mistake, and if you can't use the diaeresis, then you just use the "undecorated" character instead. The famous Swedish ship called the Götheborg cannot be transliterated as Goetheborg. Even in German, the rule is complicated, because it's not two-way: you can't spell the famous writer's name Göthe (even though Google seems to think you can).

As far as I can tell, the unaccent dictionary doesn't handle the two-character case, though it sure looks like it could be extended to do it. But it doesn't seem to have a facility for differentiating based on the language of the string. I don't know whether that could be added.

The upshot is that, if you need to store multilingual input and do special handling on the strings afterwards, you are wise to store the string with a language tag so that you can apply the right rules later on. See RFC 5646 (http://www.rfc-editor.org/rfc/rfc5646.txt) for some pointers. If just "stripping accents" is good enough for you, then the unaccent dictionary will probably be good enough.

A
-- Andrew Sullivan ajs@crankycanuck.ca
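A sketch of the language-tag suggestion, with hypothetical table and column names; the tag then decides which folding or transliteration rules apply to each row:

    CREATE TABLE phrases (
        id    serial PRIMARY KEY,
        lang  text NOT NULL,     -- BCP 47 / RFC 5646 tag: 'en', 'es', 'pt-BR', 'de', 'sv', ...
        body  text NOT NULL
    );

    -- e.g. apply the German-style "ue" expansion only where it is appropriate
    SELECT * FROM phrases WHERE lang = 'de' AND body ~* 'm(u|ü|ue)nchen';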