Enhancing phonetic search support for more languages - GSoC 2010 - Mailing list pgsql-hackers

From Dhiraj Lohiya
Subject Enhancing phonetic search support for more languages - GSoC 2010
Date
Msg-id h2rb268c9e91004071324r2ea2471p3135f5d4b485ad30@mail.gmail.com
Responses Re: Enhancing phonetic search support for more languages - GSoC 2010  (Josh Berkus <josh@agliodbs.com>)
Re: Enhancing phonetic search support for more languages - GSoC 2010  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hello

I am Dhiraj Lohiya, a Computer Science undergraduate from BITS Pilani. I would like to propose an idea for improving phonetic search support, initially for some Indian languages such as Hindi and Marathi, with a framework that makes it easy to extend to other languages by contributing rules in a simple format. I am looking to take this forward as a GSoC project. Please let me know whether you find it interesting enough:

I plan to customize the Soundex algorithm so that each language can have its own set of phonetic equivalence classes (generally around 20 rules for most of the Indian languages I have worked with). I would keep the approach layered, so that rule sets for additional languages can easily be contributed by others.
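A minimal sketch of what such a layered design could look like, written in Python purely for illustration. The names `RULE_SETS`, `build_table` and `phonetic_key` are hypothetical, and the "demo" rule set simply reuses the classic English Soundex letter groups as a placeholder; it is not a real Hindi or Marathi rule set. The encoding is also simplified: any letter without a class just breaks a run of repeated codes.

```python
# Hypothetical rule table: class code -> letters treated as sounding alike.
# These are the classic English Soundex groups, used only as a placeholder;
# a real Hindi or Marathi rule set would be contributed separately.
RULE_SETS = {
    "demo": {"1": "bpfv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"},
}


def build_table(groups):
    """Invert {code: letters} into {letter: code} for fast lookup."""
    return {letter: code for code, letters in groups.items() for letter in letters}


def phonetic_key(word, language="demo", length=4):
    """Return a Soundex-style key for `word` using the rule set of `language`."""
    table = build_table(RULE_SETS[language])
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()          # the first letter is kept as-is
    previous = table.get(letters[0])
    for ch in letters[1:]:
        code = table.get(ch)          # None for vowels and any unlisted letter
        if code is not None and code != previous:
            key += code               # emit a code only when the class changes
        previous = code
    return key[:length].ljust(length, "0")


for a, b in [("naiyya", "nayya"), ("pyar", "pyaar")]:
    print(a, b, phonetic_key(a) == phonetic_key(b))
```

Adding a language in this sketch means adding one more entry to `RULE_SETS`; the encoding function itself does not change.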

Moreover, once someone has defined a base set of rules, it is important that the rules can be extended and can evolve based on user input and usage.
For instance, if many users (above a threshold set by us) enter a search string for which no wanted result is retrieved, we could track what each of them finally selects and then append to or modify our phonetic rules based on the phonetic mismatch between the query entered and the result wanted. In this way the rule sets could evolve on their own as we collect usage statistics from users. This feature would add a new dimension to the search functionality and would make it stand out.
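The threshold idea could be sketched roughly as follows, again in Python and only as an illustration. The class `RuleFeedback` and its methods are hypothetical names; the sketch assumes the caller records an event only when a search failed and the user then picked some other item, and it counts distinct users rather than raw events.

```python
from collections import defaultdict


class RuleFeedback:
    """Collects (query, selected) pairs and reports those seen by enough users."""

    def __init__(self, threshold):
        self.threshold = threshold              # minimum number of distinct users
        self.users_by_pair = defaultdict(set)   # (query, selected) -> user ids

    def record(self, user_id, query, selected):
        """Note that `user_id` typed `query` but finally chose `selected`."""
        q, s = query.lower(), selected.lower()
        if q != s:
            self.users_by_pair[(q, s)].add(user_id)

    def candidates(self):
        """Pairs reported by at least `threshold` users: inputs for a rule review."""
        return sorted(
            pair
            for pair, users in self.users_by_pair.items()
            if len(users) >= self.threshold
        )


feedback = RuleFeedback(threshold=2)
feedback.record("user-1", "nayya", "naiyya")
feedback.record("user-2", "nayya", "naiyya")
print(feedback.candidates())
```

Whether a candidate pair leads to an automatic rule change or to a manual review is a separate design decision that this sketch leaves open.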

Initially I plan to code this for a few Indian languages such as Hindi and Marathi, and to define a simple way in which rules for other languages can be added directly, so that people who know those languages can contribute. This would probably be a GUI based on the Google Image Labeler concept, in which two words that sound similar are mapped to each other to improve the rule set.


Samples (from the case of Hindi songs):
  • If I search for a song that has the word "naiyya" but spell it "nayya", presently no result is returned, since that spelling is not in the playlist.
  • If "pyar" is searched, the results differ from those for "pyaar", but it is easy to see that both are the same word and hence should give the same results.
Some background on this:
I have already worked out a basic customized version of the Soundex algorithm as part of my intern project at PennyWiseSolutions and implemented it in Java (it could also improve its own rule set when given two phonetically similar input words). Right now, the rule sets are designed only for Hindi and Marathi. For these two languages the results are narrowed down well, with far fewer false positives. Since the algorithm stays the same (almost equivalent to Soundex) and only the rule sets for other languages need to be contributed for the algorithm to process, I think this approach could work. One specific customization was to skip Soundex's handling of silent letters, since users do not really write silent letters when spelling a Hindi word in English.
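To illustrate the idea of learning from two phonetically similar words, here is a small Python sketch. It is not the Java implementation mentioned above; it only shows one possible first step, using the standard library's `difflib.SequenceMatcher` to find where two spellings differ. The function name `spelling_differences` is hypothetical, and turning these differences into actual rules is left out.

```python
from difflib import SequenceMatcher


def spelling_differences(word_a, word_b):
    """List the (operation, text_in_a, text_in_b) spans where two spellings differ."""
    matcher = SequenceMatcher(None, word_a, word_b, autojunk=False)
    return [
        (tag, word_a[i1:i2], word_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"   # keep only inserted, deleted or replaced spans
    ]


print(spelling_differences("pyar", "pyaar"))
print(spelling_differences("nayya", "naiyya"))
```

For the sample words, the only difference reported is an extra vowel, which is the kind of pattern a rule such as "treat a doubled vowel like a single one" would cover.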

I would be glad to have more input on this.

--
Regards
Dhiraj Lohiya
