Thread: Re: Fuzzy matching

Re: Fuzzy matching

From
"Joe Conway"
Date:
> > Our usual practice with stuff of uncertain usefulness has been to stick
> > it in contrib for awhile and see if anyone uses it.  If there's
> > sufficient interest, we'll promote it to mainstream in a future
> > release.
>
> Makes sense to me.  Go, Joe!
>

Per this discussion, here's a patch to implement both levenshtein() and
metaphone() in a contrib module. There seem to be a fair number of different
approaches to both of these algorithms. I used the simplest case for
levenshtein, which has a cost of 1 for any character insertion, deletion, or
substitution. For metaphone, I adapted the same code from CPAN that the PHP
folks did.
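
For readers unfamiliar with the unit-cost variant mentioned above, here is a
minimal sketch in Python. It is not the code from the patch (which is C); the
function name and the two-row layout are assumptions made for illustration
only.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance: insertion, deletion, substitution cost 1 each."""
    # prev[j] is the distance between the already-processed prefix of s
    # and the first j characters of t. Row 0 is "insert j characters".
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        # Column 0 of row i: delete all i characters of s.
        curr = [i]
        for j, tc in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete sc from s
                curr[j - 1] + 1,           # insert tc into s
                prev[j - 1] + (sc != tc),  # substitute, free when equal
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # prints 3
```

Only two rows of the distance table are kept at a time, so memory grows with
the length of the second string rather than with the product of both lengths.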

A couple of questions:
1. Does it make sense to fold the soundex contrib together with this one?

2. I was debating trying to add multibyte support to levenshtein (it would
make no sense at all for metaphone), but a quick search through the contrib
directory found no hits on the word MULTIBYTE. Should I worry about adding
multibyte support to levenshtein()?
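
To illustrate why the multibyte question matters, here is a small Python
sketch. It assumes UTF-8 encoding and uses a hypothetical helper,
edit_distance(), that is not part of the patch. It compares the unit-cost
distance computed over characters with the same distance computed over raw
bytes.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Unit-cost edit distance over any two sequences (str or bytes)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution, free when equal
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    left, right = "na\u00efve", "naive"  # i-with-diaeresis vs plain i
    # Character-wise: one substitution.
    print(edit_distance(left, right))  # prints 1
    # Byte-wise on UTF-8: the accented letter is two bytes, so the same
    # change costs one substitution plus one deletion.
    print(edit_distance(left.encode("utf-8"), right.encode("utf-8")))  # prints 2
```

A byte-oriented implementation therefore overstates the distance whenever a
differing character occupies more than one byte.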

Thanks,

Joe


Attachment

Re: Re: Fuzzy matching

From
Tom Lane
Date:
"Joe Conway" <joseph.conway@home.com> writes:
> 1. Does it make sense to fold the soundex contrib together with this one?

> 2. I was debating trying to add multibyte support to levenshtein (it would
> make no sense at all for metaphone), but a quick search through the contrib
> directory found no hits on the word MULTIBYTE. Should I worry about adding
> multibyte support to levenshtein()?

Both of these seem like reasonable things to do, if you have the energy.

            regards, tom lane

Re: Re: Fuzzy matching

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://candle.pha.pa.us/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Re: Fuzzy matching

From
Bruce Momjian
Date:
I have added this into /contrib under fuzzystrmatch.  Not sure if we
want to merge soundex into that but it can be done later.



Re: Re: Fuzzy matching

From
"Joe Conway"
Date:
>
> I have added this into /contrib under fuzzystrmatch.  Not sure if we
> want to merge soundex into that but it can be done later.
>
>

Sorry - I should have gotten to this sooner. Here's a patch which you should
be able to apply against what you just committed. It rolls soundex into
fuzzystrmatch.

I don't think I will multibyte-enable levenshtein right now. If there is
some interest in it, I'll do that later.

Joe


Attachment

Re: Re: Fuzzy matching

From
Bruce Momjian
Date:
> >
> > I have added this into /contrib under fuzzystrmatch.  Not sure if we
> > want to merge soundex into that but it can be done later.
> >
> >
>
> Sorry - I should have gotten to this sooner. Here's a patch which you should
> be able to apply against what you just committed. It rolls soundex into
> fuzzystrmatch.
>
> I don't think I will multibyte-enable levenshtein right now. If there is
> some interest in it, I'll do that later.

OK, I have removed /contrib/soundex and /contrib/metaphone and added
your patch so they are all in fuzzystrmatch.
