Thread: Fuzzy matching?
Folks, For many of my programs, it would be extremely useful to have some form of "fuzzy matching" for VARCHAR fields. There are two kinds of fuzzy matching for words that I know of: 1. Phonetic matching, which would be nice but will have to wait for someone's $100,000 project; 2. Textual mathcing, which I will outline below. The way textual fuzzy matching should work is as follows: The developer supplies two VARCHARs to match and a number/percent of character mis-match that is acceptable: Fuzzy_match('Thornton','Tornton',1) And the fuzzy_match should return True if the two phrases are no more than that number of characters different. Thus, we should get: fuzzy_match('Thornton','Tornton',1) = TRUE fuzzy_match('Thornton','Torntin',1) = FALSE fuzzy_match('Thornton','Torntin',2) = TRUE Unfortunately, I cannot think of a way to make this happen in a function without cycling through all the possible permutations of characters for both words or doing some character-by-character comparison with elaborate logic for placement. Either of these approaches would be very slow, and completely unsuitable for column comparisons on large tables. Can anyone suggest some shortcuts here? Perhaps using pl/perl or something similar? Grazie! -Josh Berkus ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Attachment
> And the fuzzy_match should return True if the two phrases are no more > than that number of characters different. Thus, we should get: > > fuzzy_match('Thornton','Tornton',1) = TRUE > fuzzy_match('Thornton','Torntin',1) = FALSE > fuzzy_match('Thornton','Torntin',2) = TRUE > > Unfortunately, I cannot think of a way to make this happen in a function > without cycling through all the possible permutations of characters for > both words or doing some character-by-character comparison with > elaborate logic for placement. Either of these approaches would be very > slow, and completely unsuitable for column comparisons on large tables. > > Can anyone suggest some shortcuts here? Perhaps using pl/perl or > something similar? Sounds like you want something along the lines of soundex or metaphone? I don't see either function in PostgreSQL, but take a look at the PHP manual to see examples: http://www.php.net/manual/en/function.soundex.php , http://www.php.net/manual/en/function.metaphone.php I looked at the soundex function in the PHP source, and it looks like it would be fairly easy to port to a Postgres C function. The algorithm itself comes from Donald Knuth in "The Art Of Computer Programming, vol. 3: Sorting And Searching", Addison-Wesley (1973), pp. 391-392. HTH, -- Joe
Here's an off the cuff reply: It sounds like fuzzy_match(str1,str2,num) is really just a tokenizer-type operation. The number is exactly one less than the potential number of string segments that you are interested in. For example: fuzzy_match('Thornton','Tornton',1) = TRUE Because the two segements are 'T' and 'ornton' And also: fuzzy_match('Thornton','Torntin',2) = TRUE Becuse the three segments are 'T', "ornt', and 'n' So, it seems like you could try to build the tokens, which would be probably more efficient than just trying all permutations. HTH -Robby -----Original Message----- From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org]On Behalf Of Josh Berkus Sent: Tuesday, July 31, 2001 11:05 AM To: pgsql-sql@postgresql.org Subject: [SQL] Fuzzy matching? Folks, For many of my programs, it would be extremely useful to have some form of "fuzzy matching" for VARCHAR fields. There are two kinds of fuzzy matching for words that I know of: 1. Phonetic matching, which would be nice but will have to wait for someone's $100,000 project; 2. Textual mathcing, which I will outline below. The way textual fuzzy matching should work is as follows: The developer supplies two VARCHARs to match and a number/percent of character mis-match that is acceptable: Fuzzy_match('Thornton','Tornton',1) And the fuzzy_match should return True if the two phrases are no more than that number of characters different. Thus, we should get: fuzzy_match('Thornton','Tornton',1) = TRUE fuzzy_match('Thornton','Torntin',1) = FALSE fuzzy_match('Thornton','Torntin',2) = TRUE Unfortunately, I cannot think of a way to make this happen in a function without cycling through all the possible permutations of characters for both words or doing some character-by-character comparison with elaborate logic for placement. Either of these approaches would be very slow, and completely unsuitable for column comparisons on large tables. Can anyone suggest some shortcuts here? Perhaps using pl/perl or something similar? Grazie! -Josh Berkus ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
"Josh Berkus" <josh@agliodbs.com> writes: > For many of my programs, it would be extremely useful to have some form > of "fuzzy matching" for VARCHAR fields. There are two kinds of fuzzy > matching for words that I know of: > 1. Phonetic matching, which would be nice but will have to wait for > someone's $100,000 project; Uh, have you looked at contrib/soundex? The Soundex code is kinda specialized but might be just what you want... regards, tom lane
> > Sounds like you want something along the lines of soundex or metaphone? I > > don't see either function in PostgreSQL, but take a look at the PHP manual > > to see examples: http://www.php.net/manual/en/function.soundex.php , > > http://www.php.net/manual/en/function.metaphone.php > > > > See /contrib/soundex. Sorry, missed that -- I only looked in the Documentation :( I guess it's not there because it is a contrib. FWIW, both Oracle and MSSQL have a built-in soundex function. In any case, metaphone is reportedly more accurate (at least for English words) than soundex, and levenshtein offers an entirely different and interesting approach. Any interest in having all three of these in the backend? -- Joe
> > See /contrib/soundex. > > Sorry, missed that -- I only looked in the Documentation :( > I guess it's not there because it is a contrib. FWIW, both Oracle and MSSQL > have a built-in soundex function. > > In any case, metaphone is reportedly more accurate (at least for English > words) than soundex, and levenshtein offers an entirely different and > interesting approach. Any interest in having all three of these in the > backend? Sure. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > > > Actually, this may even be closer to what you want: > > http://www.php.net/manual/en/function.levenshtein.php > > Hey, that's terrific! I didn't know that those programs existed > outside fo expensive proprietary software. > > Now, who can I talk into porting them (metaphone, levenstein) to > Postgres? Hey, GreatBridge folks? (this would be a significant value > enhancement for Postgres) > > -Josh I wouldn't mind doing it if the core team agrees. It will probably be a couple of weeks before I can get to it though -- not sure if that's soon enough to make it into 7.2. Should it be a contrib, or in the backend? -- Joe
> Sounds like you want something along the lines of soundex or metaphone? I > don't see either function in PostgreSQL, but take a look at the PHP manual > to see examples: http://www.php.net/manual/en/function.soundex.php , > http://www.php.net/manual/en/function.metaphone.php > See /contrib/soundex. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Joe, > In any case, metaphone is reportedly more accurate (at least for > English > words) than soundex, and levenshtein offers an entirely different and > interesting approach. Any interest in having all three of these in > the > backend? I'm quite interested, myself. How difficult is it for somebody that doesn't program C to attach a function from the Contrib directory? If it's not very difficult, then I'd recommend putting metaphone in /contrib, and levenstein in the backend. My reasoning is that levenstein is useful for all roman alphabets, but metaphone is not so useful for non-english versions of postgres. -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Joe, > > Sounds like you want something along the lines of soundex or > metaphone? I > > don't see either function in PostgreSQL, but take a look at the PHP > manual > > to see examples: http://www.php.net/manual/en/function.soundex.php > , > > http://www.php.net/manual/en/function.metaphone.php > > > > I looked at the soundex function in the PHP source, and it looks > like it > > would be fairly easy to port to a Postgres C function. The > algorithm > itself > > comes from Donald Knuth in "The Art Of Computer Programming, vol. > 3: > Sorting > > And Searching", Addison-Wesley (1973), pp. 391-392. > > > > Actually, this may even be closer to what you want: > http://www.php.net/manual/en/function.levenshtein.php Hey, that's terrific! I didn't know that those programs existed outside fo expensive proprietary software. Now, who can I talk into porting them (metaphone, levenstein) to Postgres? Hey, GreatBridge folks? (this would be a significant value enhancement for Postgres) -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
"Joe Conway" <joseph.conway@home.com> writes: > In any case, metaphone is reportedly more accurate (at least for English > words) than soundex, and levenshtein offers an entirely different and > interesting approach. Any interest in having all three of these in the > backend? I'd certainly accept such functions as /contrib material, but probably would want to wait to see if they get used much before putting them in the standard backend ... regards, tom lane
With version 7.2 we will have pl/perlu (untrusted), which will allow use of the various Perl modules which do this sort of thing. > -----Original Message----- > From: Josh Berkus [SMTP:josh@agliodbs.com] > Sent: Tuesday, July 31, 2001 1:16 PM > To: Joe Conway; Bruce Momjian > Cc: Josh Berkus; pgsql-sql@postgresql.org > Subject: Re: Fuzzy matching? > > Joe, > > > In any case, metaphone is reportedly more accurate (at least for > > English > > words) than soundex, and levenshtein offers an entirely different and > > interesting approach. Any interest in having all three of these in > > the > > backend? > > I'm quite interested, myself. How difficult is it for somebody that > doesn't program C to attach a function from the Contrib directory? If > it's not very difficult, then I'd recommend putting metaphone in > /contrib, and levenstein in the backend. My reasoning is that > levenstein is useful for all roman alphabets, but metaphone is not so > useful for non-english versions of postgres. > > -Josh > > ______AGLIO DATABASE SOLUTIONS___________________________ > Josh Berkus > Complete information technology josh@agliodbs.com > and data management solutions (415) 565-7293 > for law firms, small businesses fax 621-2533 > and non-profit organizations. San Francisco > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html
Tom, Joe, > Our usual practice with stuff of uncertain usefulness has been to > stick > it in contrib for awhile and see if anyone uses it. If there's > sufficient interest, we'll promote it to mainstream in a future > release. Makes sense to me. Go, Joe! Since I can't help with the porting of metaphone or levenstein (but will benefit immensely), I pledge to write a name-alike data checking PL/pgSQL function which I will post to Roberto's library for public consumption. -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
"Josh Berkus" <josh@agliodbs.com> writes: > I'm quite interested, myself. How difficult is it for somebody that > doesn't program C to attach a function from the Contrib directory? Run the install script. > If it's not very difficult, then I'd recommend putting metaphone in > /contrib, and levenstein in the backend. My reasoning is that > levenstein is useful for all roman alphabets, but metaphone is not so > useful for non-english versions of postgres. Our usual practice with stuff of uncertain usefulness has been to stick it in contrib for awhile and see if anyone uses it. If there's sufficient interest, we'll promote it to mainstream in a future release. regards, tom lane
On Wed, 01 Aug 2001 05:44, Tom Lane wrote: > "Josh Berkus" <josh@agliodbs.com> writes: > > For many of my programs, it would be extremely useful to have some form > > of "fuzzy matching" for VARCHAR fields. There are two kinds of fuzzy > > matching for words that I know of: > > > > 1. Phonetic matching, which would be nice but will have to wait for > > someone's $100,000 project; > > Uh, have you looked at contrib/soundex? The Soundex code is kinda > specialized but might be just what you want... Be aware that Soundex simply does not work at all well on names of Pacific origin. Samoan in particular is a total no no.
I posted this many moons ago to pgsql-hackers. 'Guess nobody noticed. Tim Josh Berkus wrote: >Folks, > >For many of my programs, it would be extremely useful to have some form >of "fuzzy matching" for VARCHAR fields. There are two kinds of fuzzy >matching for words that I know of: > >1. Phonetic matching, which would be nice but will have to wait for >someone's $100,000 project; > >2. Textual mathcing, which I will outline below. > >The way textual fuzzy matching should work is as follows: >The developer supplies two VARCHARs to match and a number/percent of >character mis-match that is acceptable: > >Fuzzy_match('Thornton','Tornton',1) > >And the fuzzy_match should return True if the two phrases are no more >than that number of characters different. Thus, we should get: > >fuzzy_match('Thornton','Tornton',1) = TRUE >fuzzy_match('Thornton','Torntin',1) = FALSE >fuzzy_match('Thornton','Torntin',2) = TRUE > >Unfortunately, I cannot think of a way to make this happen in a function >without cycling through all the possible permutations of characters for >both words or doing some character-by-character comparison with >elaborate logic for placement. Either of these approaches would be very >slow, and completely unsuitable for column comparisons on large tables. > >Can anyone suggest some shortcuts here? Perhaps using pl/perl or >something similar? > >Grazie! > >-Josh Berkus > >______AGLIO DATABASE SOLUTIONS___________________________ > Josh Berkus > Complete information technology josh@agliodbs.com > and data management solutions (415) 565-7293 > for law firms, small businesses fax 621-2533 > and non-profit organizations. San Francisco > > >------------------------------------------------------------------------ > > > >------------------------------------------------------------------------ > > > >------------------------------------------------------------------------ > > > >------------------------------------------------------------------------ > > >---------------------------(end of broadcast)--------------------------- >TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) > > Part 1.2 > > Content-Type: > > text/plain > Content-Encoding: > > base64 > > > ------------------------------------------------------------------------ > Part 1.3 > > Content-Type: > > text/plain > Content-Encoding: > > base64 > > > ------------------------------------------------------------------------ > Part 1.4 > > Content-Type: > > text/plain > Content-Encoding: > > base64 > > > ------------------------------------------------------------------------ > Part 1.5 > > Content-Type: > > text/plain > Content-Encoding: > > binary > > -- Timothy H. Keitt Department of Ecology and Evolution State University of New York at Stony Brook Stony Brook, New York 11794 USA Phone: 631-632-1101, FAX: 631-632-7626 http://life.bio.sunysb.edu/ee/keitt/ /* Copyright (c) 2000 Timothy H. Keitt */ /* Licence: GPL version 2 or higher (see http://www.gnu.org/) */ #include "postgres.h" #define STATIC_SIZE 32 /* This must be changed if STATIC_SIZE is changed */ static int4 static_array[STATIC_SIZE][STATIC_SIZE] = { {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}, {1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {26, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {27, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {28, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, {31, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}; /* Fast version for strings up to STATIC_SIZE characters */ int4 levenshtein_distance(text *s1, text *s2) { register int i, j, l, m, n, add, rows, columns; columns = VARSIZE(s1) - VARHDRSZ + 1; rows = VARSIZE(s2) - VARHDRSZ + 1; /* Use slower dynamically allocated version for larger strings */ if (columns > STATIC_SIZE || rows > STATIC_SIZE) return levenshtein_distance_dynamic(s1, s2); for (j=1; j<rows; ++j) { for (i=1; i<columns; ++i) { if (VARDATA(s1)[i-1] == VARDATA(s2)[j-1]) add=0; else add=1; m = 1 + static_array[j-1][i]; l = 1 + static_array[j][i-1]; n = add + static_array[j-1][i-1]; static_array[j][i] = (m < l ? (m < n ? m : n): (l < n ? l : n)); } /* next column (i) */ } /* next row (j) */ return static_array[rows-1][columns-1]; } /* end levenshtein_distance(text *s1, text *s2) */ int4 levenshtein_distance_dynamic(text *s1, text *s2) { register int i, j, l, m, n, add, rows, columns, out; int4 *dynamic_array; columns = VARSIZE(s1) - VARHDRSZ + 1; rows = VARSIZE(s2) - VARHDRSZ + 1; dynamic_array = (int4 *) palloc(rows * columns * sizeof(int4)); if (dynamic_array == NULL) return -1; for (i=0; i<columns; ++i) dynamic_array[i] = i; for (j=0; j<rows; ++j) dynamic_array[j*columns] = j; for (j=1; j<rows; ++j) { for (i=1; i<columns; ++i) { if (VARDATA(s1)[i-1] == VARDATA(s2)[j-1]) add=0; else add=1; m = 1 + dynamic_array[(j-1)*columns+i]; l = 1 + dynamic_array[j*columns+i-1]; n = add + dynamic_array[(j-1)*columns+i-1]; dynamic_array[j*columns+i] = (m < l ? (m < n ? m : n): (l < n ? l : n)); } /* next column (i) */ } /* next row (j) */ out = dynamic_array[(rows-1)*columns+columns-1]; pfree(dynamic_array); return out; } /* end levenshtein_distance_dynamic(text *s1, text *s2) */
Oh and despite the copyright notice, I'm happy to put it in the public domain, so feel free to incorporate into postgresql. Tim Timothy H. Keitt wrote: > I posted this many moons ago to pgsql-hackers. 'Guess nobody noticed. > > Tim > > Josh Berkus wrote: > >> Folks, >> >> For many of my programs, it would be extremely useful to have some form >> of "fuzzy matching" for VARCHAR fields. There are two kinds of fuzzy >> matching for words that I know of: >> >> 1. Phonetic matching, which would be nice but will have to wait for >> someone's $100,000 project; >> >> 2. Textual mathcing, which I will outline below. >> >> The way textual fuzzy matching should work is as follows: >> The developer supplies two VARCHARs to match and a number/percent of >> character mis-match that is acceptable: >> >> Fuzzy_match('Thornton','Tornton',1) >> >> And the fuzzy_match should return True if the two phrases are no more >> than that number of characters different. Thus, we should get: >> >> fuzzy_match('Thornton','Tornton',1) = TRUE >> fuzzy_match('Thornton','Torntin',1) = FALSE >> fuzzy_match('Thornton','Torntin',2) = TRUE >> >> Unfortunately, I cannot think of a way to make this happen in a function >> without cycling through all the possible permutations of characters for >> both words or doing some character-by-character comparison with >> elaborate logic for placement. Either of these approaches would be very >> slow, and completely unsuitable for column comparisons on large tables. >> >> Can anyone suggest some shortcuts here? Perhaps using pl/perl or >> something similar? >> >> Grazie! >> >> -Josh Berkus >> >> ______AGLIO DATABASE SOLUTIONS___________________________ >> Josh Berkus >> Complete information technology josh@agliodbs.com >> and data management solutions (415) 565-7293 >> for law firms, small businesses fax 621-2533 >> and non-profit organizations. San Francisco >> >> >> ------------------------------------------------------------------------ >> >> >> >> ------------------------------------------------------------------------ >> >> >> >> ------------------------------------------------------------------------ >> >> >> >> ------------------------------------------------------------------------ >> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 2: you can get off all lists at once with the unregister command >> (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >> >> Part 1.2 >> >> Content-Type: >> >> text/plain >> Content-Encoding: >> >> base64 >> >> >> ------------------------------------------------------------------------ >> Part 1.3 >> >> Content-Type: >> >> text/plain >> Content-Encoding: >> >> base64 >> >> >> ------------------------------------------------------------------------ >> Part 1.4 >> >> Content-Type: >> >> text/plain >> Content-Encoding: >> >> base64 >> >> >> ------------------------------------------------------------------------ >> Part 1.5 >> >> Content-Type: >> >> text/plain >> Content-Encoding: >> >> binary >> >> > > >------------------------------------------------------------------------ > >/* Copyright (c) 2000 Timothy H. Keitt */ >/* Licence: GPL version 2 or higher (see http://www.gnu.org/) */ > >#include "postgres.h" > >#define STATIC_SIZE 32 > >/* This must be changed if STATIC_SIZE is changed */ >static int4 static_array[STATIC_SIZE][STATIC_SIZE] = { > {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}, > {1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {17, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {26, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {27, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {28, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, > {31, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}; > >/* Fast version for strings up to STATIC_SIZE characters */ >int4 >levenshtein_distance(text *s1, text *s2) { > register int i, j, l, m, n, add, rows, columns; > > columns = VARSIZE(s1) - VARHDRSZ + 1; > rows = VARSIZE(s2) - VARHDRSZ + 1; > > /* Use slower dynamically allocated version for larger strings */ > if (columns > STATIC_SIZE || rows > STATIC_SIZE) > return levenshtein_distance_dynamic(s1, s2); > > for (j=1; j<rows; ++j) { > for (i=1; i<columns; ++i) { > > if (VARDATA(s1)[i-1] == VARDATA(s2)[j-1]) add=0; else add=1; > > m = 1 + static_array[j-1][i]; > l = 1 + static_array[j][i-1]; > n = add + static_array[j-1][i-1]; > > static_array[j][i] = (m < l ? (m < n ? m : n): (l < n ? l : n)); > > } /* next column (i) */ > } /* next row (j) */ > > return static_array[rows-1][columns-1]; > >} /* end levenshtein_distance(text *s1, text *s2) */ > >int4 >levenshtein_distance_dynamic(text *s1, text *s2) { > register int i, j, l, m, n, add, rows, columns, out; > int4 *dynamic_array; > > columns = VARSIZE(s1) - VARHDRSZ + 1; > rows = VARSIZE(s2) - VARHDRSZ + 1; > > dynamic_array = (int4 *) palloc(rows * columns * sizeof(int4)); > if (dynamic_array == NULL) return -1; > > for (i=0; i<columns; ++i) dynamic_array[i] = i; > for (j=0; j<rows; ++j) dynamic_array[j*columns] = j; > > for (j=1; j<rows; ++j) { > for (i=1; i<columns; ++i) { > > if (VARDATA(s1)[i-1] == VARDATA(s2)[j-1]) add=0; else add=1; > > m = 1 + dynamic_array[(j-1)*columns+i]; > l = 1 + dynamic_array[j*columns+i-1]; > n = add + dynamic_array[(j-1)*columns+i-1]; > > dynamic_array[j*columns+i] = > (m < l ? (m < n ? m : n): (l < n ? l : n)); > > } /* next column (i) */ > } /* next row (j) */ > > out = dynamic_array[(rows-1)*columns+columns-1]; > > pfree(dynamic_array); > > return out; > >} /* end levenshtein_distance_dynamic(text *s1, text *s2) */ > > > >------------------------------------------------------------------------ > > >---------------------------(end of broadcast)--------------------------- >TIP 4: Don't 'kill -9' the postmaster > > levenshtein_distance.c > > Content-Type: > > text/plain > Content-Encoding: > > 7bit > > > ------------------------------------------------------------------------ > Part 1.3 > > Content-Type: > > text/plain > Content-Encoding: > > binary > > -- Timothy H. Keitt Department of Ecology and Evolution State University of New York at Stony Brook Stony Brook, New York 11794 USA Phone: 631-632-1101, FAX: 631-632-7626 http://life.bio.sunysb.edu/ee/keitt/
I have C-codes which implement Levenstein Distance Algorithms Here is a header for soe info: // Levenstein Distance Algorithms Module // // Contains functions to do "Inexact Alphanumeric Comparisons". The original // concept for this code came from "The C Users Journal", May 1991, starting // on page 127. The author was Hans Z. Zwakenberg. // // These routines have been modified so they do not need to use double // subscripted arrays in the CUJ code. An additional function was created to // "look up" a word from a list, returning the most likely matches. // // Written initailly by R. Bruce Roberts, MCI Systems Engineering, 1/7/93 // Compiled with Borland C++, V3.1 // It's compiled and works well. Don't know about license but it's published and it's possible to ask author directly zen:~/app/algo/dist/ldist$ ./test simple sample Returns 'Levenstein Distance' of two strings on command line. Determining Distance for simple and sample. Length is limited to 40 characters. Words are alike, L Distance is 1, Threshold was 3. PLease let me know if you need it. I thought about implementing the same feature as Google has into our OpenFTS search Regards, Oleg On Tue, 31 Jul 2001, Tom Lane wrote: > "Josh Berkus" <josh@agliodbs.com> writes: > > I'm quite interested, myself. How difficult is it for somebody that > > doesn't program C to attach a function from the Contrib directory? > > Run the install script. > > > If it's not very difficult, then I'd recommend putting metaphone in > > /contrib, and levenstein in the backend. My reasoning is that > > levenstein is useful for all roman alphabets, but metaphone is not so > > useful for non-english versions of postgres. > > Our usual practice with stuff of uncertain usefulness has been to stick > it in contrib for awhile and see if anyone uses it. If there's > sufficient interest, we'll promote it to mainstream in a future release. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://www.postgresql.org/search.mpl > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Josh Berkus writes: > For many of my programs, it would be extremely useful to have some form > of "fuzzy matching" for VARCHAR fields. For lexical similarity, check out the agrep algorithm. Last I checked the source code wasn't quite Free(tm), but the algorithm was published in an academic work. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
The Levenshtein-distance code I posted uses the agrep algorithm. Tim Peter Eisentraut wrote: >Josh Berkus writes: > >>For many of my programs, it would be extremely useful to have some form >>of "fuzzy matching" for VARCHAR fields. >> > >For lexical similarity, check out the agrep algorithm. Last I checked the >source code wasn't quite Free(tm), but the algorithm was published in an >academic work. > -- Timothy H. Keitt Department of Ecology and Evolution State University of New York at Stony Brook Stony Brook, New York 11794 USA Phone: 631-632-1101, FAX: 631-632-7626 http://life.bio.sunysb.edu/ee/keitt/