Re: Enhancing phonetic search support for more languages - GSoC 2010 - Mailing list pgsql-hackers

From Dhiraj Lohiya
Subject Re: Enhancing phonetic search support for more languages - GSoC 2010
Date
Msg-id h2ib268c9e91004072035g2eae0879sbaa147605e68478@mail.gmail.com
In response to Re: Enhancing phonetic search support for more languages - GSoC 2010  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Enhancing phonetic search support for more languages - GSoC 2010  (Dhiraj Lohiya <lohiya.dhiraj@gmail.com>)
List pgsql-hackers


I'm also curious why you chose to focus on the extremely imprecise
soundex instead of the more discriminating metaphone.


The main reason for choosing Soundex over Metaphone/Double Metaphone is that, for Indian languages, Soundex with some customizations already works quite well. Double Metaphone only adds processing overhead, along with the need to store two hashes per word, while the quality of the matches would remain the same. This is because pronunciation in Indian languages follows the phonology of the Devanagari script, which has no silent letters or other accent-related irregularities (handling these was a major consideration in the design of Double Metaphone).

One more customization is required for Indian languages: the characters of a word are not processed one by one. Instead, the word is broken into substrings of consecutive vowels and consecutive consonants, and each substring is mapped to its equivalence class. Also, one rule from Metaphone needs to be incorporated: Soundex keeps the first letter of the word as is rather than encoding it, whereas we would map the first letter to its equivalence class as well.

This Soundex approach (no handling of silent letters, and splitting the word into substrings rather than processing it character by character) delivers almost the same performance with much less overhead than Double Metaphone, whose handling of silent letters, accents, etc. has little impact on Indian languages. Hence it would be more efficient.
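To make the idea concrete, here is a minimal sketch in Python of a Soundex variant along these lines. It is only an illustration, not the proposed implementation: the equivalence classes in CONSONANT_CLASSES are hypothetical classes for romanized Hindi, the function names are made up, and the decision to keep a vowel run only at the start of a word (dropping the others, as classic Soundex does) is my own assumption.

```python
import re

# Hypothetical equivalence classes for romanized Hindi (illustration only).
# Multi-letter clusters such as "dh" or "sh" are treated as single units.
CONSONANT_CLASSES = {
    "k": "1", "kh": "1", "g": "1", "gh": "1", "q": "1", "c": "1",
    "ch": "2", "chh": "2", "j": "2", "jh": "2", "z": "2",
    "t": "3", "th": "3", "d": "3", "dh": "3",
    "p": "4", "ph": "4", "f": "4", "b": "4", "bh": "4", "v": "4", "w": "4",
    "m": "5", "n": "5",
    "s": "6", "sh": "6",
    "r": "7", "l": "7",
    "y": "8",
    "h": "9",
}
VOWEL_CLASS = "A"  # assumption: every vowel run collapses to one class
MAX_CLUSTER = max(len(cluster) for cluster in CONSONANT_CLASSES)


def encode_consonant_run(run):
    """Greedily split a consonant run into known clusters (longest first)
    and return their class codes, collapsing adjacent duplicates."""
    codes = []
    i = 0
    while i < len(run):
        for size in range(min(MAX_CLUSTER, len(run) - i), 0, -1):
            code = CONSONANT_CLASSES.get(run[i:i + size])
            if code is not None:
                if not codes or codes[-1] != code:
                    codes.append(code)
                i += size
                break
        else:
            i += 1  # letter with no class in this sketch: skip it
    return codes


def phonetic_key(word, length=4):
    """Encode a word run by run; the first run is encoded too,
    instead of keeping the literal first letter as classic Soundex does."""
    word = re.sub(r"[^a-z]", "", word.lower())
    runs = re.findall(r"[aeiou]+|[^aeiou]+", word)
    codes = []
    for index, run in enumerate(runs):
        if run[0] in "aeiou":
            if index == 0:  # assumption: vowel runs count only at word start
                codes.append(VOWEL_CLASS)
            continue
        codes.extend(encode_consonant_run(run))
    return "".join(codes)[:length].ljust(length, "0")


if __name__ == "__main__":
    for name in ("Dhiraj", "Dheeraj", "Vikas", "Bikash", "Amit"):
        print(name, phonetic_key(name))
```

With these assumed classes, "Dhiraj" and "Dheeraj" get the same key because "i" and "ee" are each a single vowel run, and "Vikas" and "Bikash" match because the first letter is encoded by class rather than kept literally.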

For Western languages, Double Metaphone is known to give very good results, so it could be used there.

My previous mail concentrated on Soundex because I had also considered how the algorithm could improve its own rule set of equivalence classes. That is a little trickier in Double Metaphone, whereas in Soundex we can derive new rules from the mappings that are already present. Whether we want to include this functionality as well can be looked at later.

So for the SoC project, as proposed, I could concentrate on the algorithmic part of multilingual support. Once the framework is ready, with tutorials and a wiki explaining how to add rules for a new language, the community could contribute rules for more languages, and these could be incorporated after being tested against a quality threshold.

Thanks for the inputs. More suggestions/reviews please!

--
Regards
Dhiraj Lohiya
