Thread: Re: Fuzzy matching
> > Our usual practice with stuff of uncertain usefulness has been to stick
> > it in contrib for awhile and see if anyone uses it. If there's
> > sufficient interest, we'll promote it to mainstream in a future
> > release.
>
> Makes sense to me. Go, Joe!

Per this discussion, here's a patch to implement both levenshtein() and metaphone() in a contrib module. There seem to be a fair number of different approaches to both of these algorithms. I used the simplest case for levenshtein, which has a cost of 1 for any character insertion, deletion, or substitution. For metaphone, I adapted the same code from CPAN that the PHP folks did.

A couple of questions:

1. Does it make sense to fold the soundex contrib together with this one?

2. I was debating trying to add multibyte support to levenshtein (it would make no sense at all for metaphone), but a quick search through the contrib directory found no hits on the word MULTIBYTE. Should I worry about adding multibyte support to levenshtein()?

Thanks,

Joe
Attachment
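For readers unfamiliar with the "simplest case" described above, here is a minimal Python sketch of a unit-cost Levenshtein distance, where every insertion, deletion, or substitution costs 1. It is an illustration only, not the C code from the attached patch; the function name and the two-row layout are assumptions made for this example.

```python
def levenshtein(source: str, target: str) -> int:
    """Unit-cost edit distance: each insertion, deletion, or substitution costs 1."""
    # previous[j] is the distance between the prefix of source handled so far
    # and the first j characters of target. Start with the empty prefix.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning the first i characters of source into "" takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s_char == t_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Only two rows of the distance table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.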
"Joe Conway" <joseph.conway@home.com> writes: > 1. Does it make sense to fold the soundex contrib together with this one? > 2. I was debating trying to add multibyte support to levenshtein (it would > make no sense at all for metaphone), but a quick search through the contrib > directory found no hits on the word MULTIBYTE. Should worry about adding > multibyte support to levenshtein()? Both of these seem like reasonable things to do, if you have the energy. regards, tom lane
Your patch has been added to the PostgreSQL unapplied patches list at:

	http://candle.pha.pa.us/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
I have added this into /contrib under fuzzystrmatch. Not sure if we want to merge soundex into that, but it can be done later.

-- 
  Bruce Momjian
> I have added this into /contrib under fuzzystrmatch. Not sure if we
> want to merge soundex into that but it can be done later.

Sorry - I should have gotten to this sooner. Here's a patch which you should be able to apply against what you just committed. It rolls soundex into fuzzystrmatch.

I don't think I will multibyte-enable levenshtein right now. If there is some interest in it, I'll do that later.

Joe
Attachment
> > I have added this into /contrib under fuzzystrmatch. Not sure if we
> > want to merge soundex into that but it can be done later.
>
> Sorry - I should have gotten to this sooner. Here's a patch which you should
> be able to apply against what you just committed. It rolls soundex into
> fuzzystrmatch.
>
> I don't think I will multibyte-enable levenshtein right now. If there is
> some interest in it, I'll do that later.

OK, I have removed /contrib/soundex and /contrib/metaphone and added your patch so they are all in fuzzystrmatch.

-- 
  Bruce Momjian