Thread: Bug in metaphone (contrib/fuzzystrmatch)

Bug in metaphone (contrib/fuzzystrmatch)

From: "Jim C. Nasby"
The second argument to metaphone is supposed to limit the number of
characters returned, but it breaks on some phrases:

usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from (select 'Hello world'::varchar AS a) a;
 HLW       | HLWR      | HLWRLT

usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
 AKM       | AKMKS     | AKMKSMMRL

In every case I've found that does this, the 4th and 5th letters are
always 'KS'.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Bug in metaphone (contrib/fuzzystrmatch)

From: "Jim C. Nasby"
Great, I'll try it right away. I was also wondering why the function
bombs if it's fed an empty string or a NULL? It seems it would be much
nicer to have it return an empty string or NULL, respectively.

Re: Bug in metaphone (contrib/fuzzystrmatch)

From: Joe Conway
Jim C. Nasby wrote:
> Second argument to metaphone is supposed to set the limit on the
> number of characters to return, but it breaks on some phrases:
>
> usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from
> (select 'Hello world'::varchar AS a) a;
> HLW       | HLWR      | HLWRLT
>
> usps=# select metaphone(a,3),metaphone(a,4),metaphone(a,20) from
> (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
> AKM       | AKMKS     | AKMKSMMRL
>
> In every case I've found that does this, the 4th and 5th letters are
> always 'KS'.

Nice catch.

There was a bug in the original metaphone algorithm from CPAN. Patch
attached (while I was at it I updated my email address, changed the
copyright to PGDG, and removed an unnecessary palloc). Here's how it
looks now:

regression=# select metaphone(a,4) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
  metaphone
-----------
  AKMK
(1 row)

regression=# select metaphone(a,5) from (select 'A A COMEAUX MEMORIAL'::varchar AS a) a;
  metaphone
-----------
  AKMKS
(1 row)

Please apply.

Thanks,

Joe
Index: contrib/fuzzystrmatch/README.fuzzystrmatch
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/fuzzystrmatch/README.fuzzystrmatch,v
retrieving revision 1.2
diff -c -r1.2 README.fuzzystrmatch
*** contrib/fuzzystrmatch/README.fuzzystrmatch    7 Aug 2001 18:16:01 -0000    1.2
--- contrib/fuzzystrmatch/README.fuzzystrmatch    6 Jun 2003 16:37:54 -0000
***************
*** 3,9 ****
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
   *
   * levenshtein()
   * -------------
--- 3,12 ----
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Joe Conway <mail@joeconway.com>
!  *
!  * Copyright (c) 2001, 2002, 2003 by PostgreSQL Global Development Group
!  * ALL RIGHTS RESERVED;
   *
   * levenshtein()
   * -------------
Index: contrib/fuzzystrmatch/fuzzystrmatch.c
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/fuzzystrmatch/fuzzystrmatch.c,v
retrieving revision 1.7
diff -c -r1.7 fuzzystrmatch.c
*** contrib/fuzzystrmatch/fuzzystrmatch.c    10 Mar 2003 22:28:17 -0000    1.7
--- contrib/fuzzystrmatch/fuzzystrmatch.c    6 Jun 2003 16:38:06 -0000
***************
*** 3,9 ****
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
   *
   * levenshtein()
   * -------------
--- 3,12 ----
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Joe Conway <mail@joeconway.com>
!  *
!  * Copyright (c) 2001, 2002, 2003 by PostgreSQL Global Development Group
!  * ALL RIGHTS RESERVED;
   *
   * levenshtein()
   * -------------
***************
*** 221,229 ****
      if (!(reqlen > 0))
          elog(ERROR, "metaphone: Requested Metaphone output length must be > 0");

-     metaph = palloc(reqlen);
-     memset(metaph, '\0', reqlen);
-
      retval = _metaphone(str_i, reqlen, &metaph);
      if (retval == META_SUCCESS)
      {
--- 224,229 ----
***************
*** 629,635 ****
                  /* KS */
              case 'X':
                  Phonize('K');
!                 Phonize('S');
                  break;
                  /* Y if followed by a vowel */
              case 'Y':
--- 629,636 ----
                  /* KS */
              case 'X':
                  Phonize('K');
!                 if (max_phonemes == 0 || Phone_Len < max_phonemes)
!                     Phonize('S');
                  break;
                  /* Y if followed by a vowel */
              case 'Y':
Index: contrib/fuzzystrmatch/fuzzystrmatch.h
===================================================================
RCS file: /opt/src/cvs/pgsql-server/contrib/fuzzystrmatch/fuzzystrmatch.h,v
retrieving revision 1.6
diff -c -r1.6 fuzzystrmatch.h
*** contrib/fuzzystrmatch/fuzzystrmatch.h    5 Sep 2002 00:43:06 -0000    1.6
--- contrib/fuzzystrmatch/fuzzystrmatch.h    6 Jun 2003 16:38:13 -0000
***************
*** 3,9 ****
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Copyright (c) Joseph Conway <joseph.conway@home.com>, 2001;
   *
   * levenshtein()
   * -------------
--- 3,12 ----
   *
   * Functions for "fuzzy" comparison of strings
   *
!  * Joe Conway <mail@joeconway.com>
!  *
!  * Copyright (c) 2001, 2002, 2003 by PostgreSQL Global Development Group
!  * ALL RIGHTS RESERVED;
   *
   * levenshtein()
   * -------------
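
For readers following the 'X' hunk in fuzzystrmatch.c above, here is a
minimal standalone sketch of why the unguarded second Phonize('S') can
overshoot the requested length. This is not the fuzzystrmatch code: the
function toy_encode and its guard_x flag are hypothetical, every letter
other than 'X' simply maps to itself, and the sketch assumes the length
limit is otherwise checked only once per input character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy encoder (hypothetical, NOT the real metaphone rules): each input
 * character is copied as-is, except 'X', which expands to the two
 * phonemes "KS". max_phonemes == 0 means "no limit", mirroring the
 * condition used in the patch.
 *
 * guard_x == 0: emit 'S' unconditionally (pre-patch behaviour)
 * guard_x != 0: re-check the limit before 'S' (patched behaviour)
 */
static void
toy_encode(const char *in, size_t max_phonemes, int guard_x,
           char *out, size_t out_size)
{
    size_t len = 0;

    assert(in != NULL && out != NULL && out_size > 0);

    for (; *in != '\0'; in++)
    {
        /* The limit is tested only once per input character. */
        if (max_phonemes != 0 && len >= max_phonemes)
            break;

        /* Keep room for up to two phonemes plus the terminator. */
        if (len + 2 >= out_size)
            break;

        if (*in == 'X')
        {
            out[len++] = 'K';
            /* Without the guard, 'S' is appended even at the limit. */
            if (!guard_x || max_phonemes == 0 || len < max_phonemes)
                out[len++] = 'S';
        }
        else
            out[len++] = *in;
    }
    out[len] = '\0';
}
```

Calling toy_encode("AKMX", 4, 0, buf, sizeof buf) leaves five characters
in buf even though four were requested, which matches the shape of the
reported AKMKS result; with guard_x set to 1 the output stops after the
'K'.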