Re: Enhancing phonetic search support for more languages - GSoC 2010 - Mailing list pgsql-hackers

From	Dhiraj Lohiya
Subject	Re: Enhancing phonetic search support for more languages - GSoC 2010
Date	April 8, 2010 03:35:43
Msg-id	h2ib268c9e91004072035g2eae0879sbaa147605e68478@mail.gmail.com
In response to	Re: Enhancing phonetic search support for more languages - GSoC 2010 (Josh Berkus <josh@agliodbs.com>)
Responses	Re: Enhancing phonetic search support for more languages - GSoC 2010
List	pgsql-hackers

> I'm also curious why you chose to focus on the extremely imprecise
> soundex instead of the more discriminating metaphone.

The main reason to choose Soundex over Metaphone/Double Metaphone is that, for Indian languages, Soundex with some customizations already works pretty well. Double Metaphone would only add processing overhead, along with the need to store two hashes per word, while the matching performance would remain the same. The way words are pronounced in Indian languages follows the phonology of the Devanagari script, which has no silent letters or other accent-related inclusions (handling those was a major consideration in the design of Double Metaphone).

One more customization required for Indian languages is that the characters of a word aren't taken one by one. Instead, the word is broken into substrings of consecutive vowels and consecutive consonants, and each substring is mapped to its equivalence class. Also, one rule from Metaphone needs to be incorporated: Soundex keeps the first letter of the word as a literal and does not encode it, but we would encode it to its equivalence class as well.
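
As a rough illustration of the substring idea, here is a minimal Python sketch. It is not the proposed implementation: the class tables are made up for the example, it works on romanized (Latin-letter) input rather than Devanagari, and the names `phonetic_key`, `encode_run`, `CONSONANT_CLASSES` and `VOWEL_CLASSES` are hypothetical. The longest-match fallback inside a run and the collapsing of repeated codes are also assumptions of this sketch.

```python
from itertools import groupby

VOWELS = set("aeiou")

# Hypothetical equivalence classes, for illustration only.
CONSONANT_CLASSES = {
    "k": "1", "kh": "1", "g": "1", "gh": "1",
    "ch": "2", "j": "2", "jh": "2",
    "t": "3", "th": "3", "d": "3", "dh": "3",
    "p": "4", "ph": "4", "b": "4", "bh": "4", "f": "4",
    "s": "5", "sh": "5",
    "r": "6", "l": "6",
    "m": "7", "n": "7",
    "v": "8", "w": "8",
    "y": "9",
    "h": "0",
}
VOWEL_CLASSES = {
    "a": "A", "aa": "A",
    "i": "I", "ii": "I", "ee": "I",
    "u": "U", "uu": "U", "oo": "U",
    "e": "E", "ai": "E",
    "o": "O", "au": "O",
}


def encode_run(run, table):
    """Map one run of vowels or consonants to class codes.

    Tries the longest substring first, so "dh" is looked up as a unit
    before falling back to "d" and "h". Unknown characters are skipped.
    """
    codes = []
    i = 0
    while i < len(run):
        for j in range(len(run), i, -1):
            code = table.get(run[i:j])
            if code is not None:
                codes.append(code)
                i = j
                break
        else:
            i += 1  # no rule for this character: skip it
    return codes


def phonetic_key(word):
    """Encode a romanized word run by run, first letter included."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    codes = []
    # Split into maximal runs of vowels and runs of consonants.
    for is_vowel, group in groupby(letters, key=lambda c: c in VOWELS):
        table = VOWEL_CLASSES if is_vowel else CONSONANT_CLASSES
        codes.extend(encode_run("".join(group), table))
    # Collapse adjacent identical codes, as classic Soundex does.
    key = []
    for code in codes:
        if not key or key[-1] != code:
            key.append(code)
    return "".join(key)


print(phonetic_key("dhiraj"), phonetic_key("dheeraj"), phonetic_key("suraj"))
```

With these toy tables, "dhiraj" and "dheeraj" get the same key, and the key starts with a class code instead of the literal first letter. Only one key is stored per word, unlike the two keys of Double Metaphone.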

With this approach, Soundex (ignoring silent letters, and breaking the word into substrings rather than going character by character) delivers almost the same performance as Double Metaphone with much less overhead. Double Metaphone's handling of silent letters, accents etc. doesn't have much impact on Indian languages, so Soundex would be the more efficient choice.

For Western languages, Double Metaphone is known to give very good results, so it could be used there.

My previous mail concentrated on Soundex because I had also considered how the algorithm could improve its own rule set of equivalence classes. That is a little trickier in Double Metaphone, whereas in Soundex we can derive new rules from the mappings that are already present. Whether we want to include this functionality as well can be looked at later.

So for the SoC project, as proposed, I could probably concentrate on the algorithmic part of the multilingual support. Once the framework is ready, with tutorials and a wiki page on how to add rules for a new language, the community could contribute rules for more languages, and these could be incorporated after being tested against a quality threshold.

Thanks for the inputs. More suggestions/reviews please!

--
Regards
Dhiraj Lohiya
