fuzzystrmatch module buggy? observations - Mailing list pgsql-general

From	r d
Subject	fuzzystrmatch module buggy? observations
Date	October 30, 2012 13:29:18
Msg-id	CALtFtELVR8pL72vC83Qq2Tdax1-28xCGfmW8imNgQjhYxjP7=A@mail.gmail.com
Responses	Re: fuzzystrmatch module buggy? observations
List	pgsql-general

The fuzzystrmatch module (http://www.postgresql.org/docs/9.2/static/fuzzystrmatch.html) is currently, as of 9.2.1, documented with the caution "At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8)".

While the venerable algorithms contained in the module seem to generally work for Latin-script strings from European languages, which have accented/diacritic characters such as äöüñáéíóúàèìòù, they return NULL (empty) or plain weirdness for languages with non-Latin scripts such as Cyrillic, Hebrew, Arabic, or Chinese.

Some examples:

dmetaphone ('Новости') = 'NN'

soundex ('Новости') = NULL

dmetaphone ('לפחות') = NULL

soundex ('לפחות') = NULL

soundex ('相关搜索') = NULL

dmetaphone ('相关搜索') = NULL
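One plausible reason for the empty results above is that classic phonetic codes are defined only over the 26 ASCII letters. The following is a minimal Python sketch of a simplified Soundex under that assumption. It is not the code used by fuzzystrmatch, and the name `ascii_soundex` is made up for this illustration.

```python
# Simplified, ASCII-only Soundex sketch (illustrative; NOT PostgreSQL's code).
# Assumption: only the letters A-Z carry a code, everything else is ignored.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def ascii_soundex(text: str) -> str:
    """Return a 4-character code, or '' if the input has no ASCII letters."""
    # Drop every character that is not an ASCII letter (accents, Cyrillic, ...).
    letters = [ch.upper() for ch in text if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "0")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "0")  # vowels, H, W, Y map to "0"
        # Emit a digit only for coded letters that differ from the previous code.
        if code != "0" and code != previous:
            result += code
        previous = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros


if __name__ == "__main__":
    for word in ["Robert", "Новости", "לפחות", "相关搜索"]:
        print(repr(word), "->", repr(ascii_soundex(word)))
```

With this sketch, 'Robert' yields 'R163', while the Cyrillic, Hebrew, and Chinese strings yield an empty string because none of their characters fall in the coded alphabet.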

metaphone() fails with SQL state 42883 for all these strings (the error hint tells me I should cast the 'unknown' input).

The string 'äöüñáéíóúàèìòù' causes metaphone(), dmetaphone(), dmetaphone_alt(), and soundex() to fail.

Only levenshtein() appears to function correctly with all above inputs, even when I let it compare Hebrew against Chinese strings.
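That edit distance behaves well here is not surprising: it only compares characters for equality and needs no knowledge of any alphabet. A minimal Python sketch of the textbook dynamic-programming algorithm over Unicode code points (an illustration only, not the fuzzystrmatch implementation) could look like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(levenshtein("לפחות", "相关搜索"))
```

Since the Hebrew and Chinese strings share no characters, the distance between them is simply the length of the longer one.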

Summarizing my experience:

* for English (ASCII equivalent), the module works,

* for the rest of the Latin charsets (equivalent to ISO 8859-x), the module works unreliably,

* for non-Latin chars (UTF-8 with 2-4 bytes per char), the module does not work.
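To see how the three groups above differ at the byte level, here is a small Python sketch that compares the number of characters with the number of UTF-8 bytes for the sample strings used in this message. The helper name `utf8_profile` is made up for this illustration.

```python
def utf8_profile(text: str) -> tuple[int, int]:
    """Return (number of code points, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


if __name__ == "__main__":
    for sample in ["Robert", "Новости", "לפחות", "相关搜索"]:
        chars, size = utf8_profile(sample)
        print(f"{sample!r}: {chars} chars, {size} bytes")
```

Code that walks such a string one byte at a time and treats each byte as a letter will see only ASCII text correctly. In these samples the Cyrillic and Hebrew letters take 2 bytes each and the Chinese characters take 3 bytes each.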

Note: My DB and the OS are set up for UTF-8.

This would appear to be less a problem of PostgreSQL and the fuzzystrmatch module itself than of there being no replacement algorithms adequate for a multilingual world. At least that is my impression after looking at the IPA and http://www.lt-world.org websites and branching out from there.

Given all this, I have no idea whether this is a bug at all or whether the state of the art around this topic is simply inadequate.

Questions (to the developers):

- Is there anything in the works or planned for the fuzzystrmatch module?

- Does anybody know about adequate replacements or upgrades of the soundex, metaphone etc. algorithms from academia?
