Thread: Metaphone function attachment
-- Joel Burton <jburton@scw.org> Director of Information Systems, Support Center of Washington
Attachment
For those curious about what this actually does, README attached. It will appear in 7.2. Seems similar to Soundex. > > > -- > Joel Burton <jburton@scw.org> > Director of Information Systems, Support Center of Washington Content-Description: [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 This directory contains a module that implements the "Metaphone" code as a PostgreSQL user-defined function. The Metaphone system is a method of matching similar sounding names (or any words) to the same code. Metaphone was invented by Lawrence Philips as an improvement to the popular name-hashing routine, Soundex. This metaphone code is from Michael Kuhn, and is detailed at http://aspell.sourceforge.net/metaphone/metaphone-kuhn.txt Code for this (including this help file!) was liberally borrowed from the soundex() module for PostgreSQL. There are two functions: metaphone(text) : returns hash of a name metaphone(text,int) : returns hash (maximum length of int) of name --- To install it, first configure the main source tree, then run make; make install in this directory. Finally, load the function definition with psql: psql -f PREFIX/share/contrib/metaphone.sql The following are some usage examples: SELECT text_metaphone('hello world!'); SELECT text_metaphone('hello world!', 4); CREATE TABLE s (nm text)\g insert into s values ('john')\g insert into s values ('joan')\g insert into s values ('wobbly')\g select * from s where text_metaphone(nm) = text_metaphone('john')\g select nm from s a, s b where text_metaphone(a.nm) = text_metaphone(b.nm) and a.oid <> b.oid\g CREATE FUNCTION text_mp_eq(text, text) RETURNS bool AS 'select text_metaphone($1) = text_metaphone($2)' LANGUAGE 'sql'\g CREATE FUNCTION text_mp_lt(text,text) RETURNS bool AS 'select text_metaphone($1) < text_metaphone($2)' LANGUAGE 'sql'\g CREATE FUNCTION text_mp_gt(text,text) RETURNS bool AS 'select text_metaphone($1) > text_metaphone($2)' LANGUAGE 'sql'; CREATE FUNCTION text_mp_le(text,text) RETURNS bool AS 'select text_metaphone($1) <= text_metaphone($2)' LANGUAGE 'sql'; CREATE FUNCTION text_mp_ge(text,text) RETURNS bool AS 'select text_metaphone($1) >= text_metaphone($2)' LANGUAGE 'sql'; CREATE FUNCTION text_mp_ne(text,text) RETURNS bool AS 'select text_metaphone($1) <> text_metaphone($2)' LANGUAGE 'sql'; DROP OPERATOR #= (text,text)\g CREATE OPERATOR #= (leftarg=text, rightarg=text, procedure=text_mp_eq, commutator=text_mp_eq)\g SELECT * FROM s WHERE text_mp_eq(nm,'pillsbury')\g SELECT * from s where s.nm #= 'pillsbury';
Why not start a new project at greatbridge.org? I'd be happy to see metaphone() move in there, soundex() would make sense. I have a hashing algorithm that grabs the first letter off of words, except for user-definable 'stop words', which we use to look for likely organization name matches. These could all fall under a project of PG string functions. I think, as little things in contrib/, it's easy for people to miss these. With a project page, some discussion, etc. (& a place in contrib/), more people would be able to use these. PG functions are one of the things that separates PG from MySQL (which has only C UDFs, and IIRC, not on some platforms) and InterBase (which has plsql-like procedures, but functions can only be written in C). I think our functions are one of our strongest cases, and the more we can show people examples of how to use them, and the larger our useful library, the more we win. P.S. What exactly does contains() do?
Joel Burton wrote: > > -- > Joel Burton <jburton@scw.org> > Director of Information Systems, Support Center of Washington > > ------------------------------------------------------------------------------- > Name: contrib-metaphone.tgz > contrib-metaphone.tgz Type: unspecified type (APPLICATION/octet-stream) > Encoding: BASE64 > > ------------------------------------------------------------------------------- > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster I have written a similar function, and it is in a library I wrote "pgcontains" which started out as a simple implementation of "contains(...)" but has grown WAY beyond it's original scope. Anyway, Metaphone is great for doing some cool searches, but now that we are on the subject of cool search functions, I have a few that may be useful and would be glad to contribute. My only question is how? (and I'd have to write them for 7.1, because they are still in 7.0.x format) contains(...) A simple implementation of contains. Forces a table scan, but does some cool things including phrase detection. decode(...) Similar to "case" but works as decode for oracle queries. In 7.1 and higher it should be easy to make one function take a variable number of parameters. Right now I have a stub for the most common numbers. strip(...) Strips out all but alphanumeric characters and returns a lowercase string. "Oops I did it again" comes back as "oopsididitagain." This is cool for lightly fuzzy searches. striprev(...) Like strip, but reverses the string. Allows you to use an index for records which end in something. For instance: "select * from table where field like 'abc%'" can use an index, where as "select * from table where field like '%xyx'" will not. However, "select * from table where striprev(field) like striprev('xyz') || '%'" can. Example: cdinfo=# select title, striprev(title) from ztitles where striprev(title) like striprev('wall') || '%' limit 3; title | striprev --------------------------------------------+------------------------------------ A Giggle Can Wiggle Its Way Through A Wall | llawahguorhtyawstielggiwnacelggiga Shadows On A Wall * | llawanoswodahs The Wall | llaweht (3 rows) cdinfo=# explain select title, striprev(title) from ztitles where striprev(title) like striprev('wall') || '%' limit 3; NOTICE: QUERY PLAN: Limit (cost=0.00..10.21 rows=3 width=12) -> Index Scan using f1 on ztitles (cost=0.00..7579.94 rows=2227 width=12) EXPLAIN int64_t *strtonumu(text, int4 base) Converts a string to a number with an arbitrary base. (Is there a function to do this already?) -- I'm not offering myself as an example; every life evolves by its own laws. ------------------------ http://www.mohawksoft.com
select * from table where contains(field, 'text string', 10) > 1 order by score(10); Contains returns a number based on an evaluation of the 'text string' against the field. If the field has the words contained in 'text string' in it, it returns a number based on some default assumptions. The assumptions are things like points for the first occurrence of a word, and next occurrence of the word. Points for words in the right order as specified in 'text string', etc. Who do I contact at greatbridge? Joel Burton wrote: > Why not start a new project at greatbridge.org? > > I'd be happy to see metaphone() move in there, soundex() would make > sense. I have a hashing algorithm that grabs the first letter off of > words, except for user-definable 'stop words', which we use to look for > likely organization name matches. > > These could all fall under a project of PG string functions. > > I think, as little things in contrib/, it's easy for people to miss > these. With a project page, some discussion, etc. (& a place in contrib/), > more people would be able to use these. > > PG functions are one of the things that separates PG from MySQL (which has > only C UDFs, and IIRC, not on some platforms) and InterBase (which has > plsql-like procedures, but functions can only be written in C). I think > our functions are one of our strongest cases, and the more we can show > people examples of how to use them, and the larger our useful library, the > more we win. > > P.S. What exactly does contains() do? > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html
Sure, send them over and we can put them in /contrib. Send them to the patches list, I think. > Joel Burton wrote: > > > > -- > > Joel Burton <jburton@scw.org> > > Director of Information Systems, Support Center of Washington > > > > ------------------------------------------------------------------------------- > > Name: contrib-metaphone.tgz > > contrib-metaphone.tgz Type: unspecified type (APPLICATION/octet-stream) > > Encoding: BASE64 > > > > ------------------------------------------------------------------------------- > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 4: Don't 'kill -9' the postmaster > > I have written a similar function, and it is in a library I wrote "pgcontains" > which started out as a simple implementation of "contains(...)" but has grown > WAY beyond it's original scope. > > Anyway, Metaphone is great for doing some cool searches, but now that we are on > the subject of cool search functions, I have a few that may be useful and would > be glad to contribute. My only question is how? (and I'd have to write them for > 7.1, because they are still in 7.0.x format) > > > contains(...) > A simple implementation of contains. Forces a table scan, but does some cool > things including phrase detection. > > decode(...) > Similar to "case" but works as decode for oracle queries. In 7.1 and higher it > should be easy to make one function take a variable number of parameters. Right > now I have a stub for the most common numbers. > > strip(...) > Strips out all but alphanumeric characters and returns a lowercase string. > "Oops I did it again" comes back as "oopsididitagain." This is cool for lightly > fuzzy searches. > > striprev(...) > Like strip, but reverses the string. Allows you to use an index for records > which end in something. For instance: "select * from table where field like > 'abc%'" can use an index, where as "select * from table where field like > '%xyx'" will not. However, "select * from table where striprev(field) like > striprev('xyz') || '%'" can. > > Example: > cdinfo=# select title, striprev(title) from ztitles where striprev(title) like > striprev('wall') || '%' limit 3; > title | striprev > --------------------------------------------+------------------------------------ > A Giggle Can Wiggle Its Way Through A Wall | > llawahguorhtyawstielggiwnacelggiga > Shadows On A Wall * | llawanoswodahs > The Wall | llaweht > (3 rows) > > cdinfo=# explain select title, striprev(title) from ztitles where > striprev(title) like striprev('wall') || '%' limit > 3; > NOTICE: QUERY PLAN: > > Limit (cost=0.00..10.21 rows=3 width=12) > -> Index Scan using f1 on ztitles (cost=0.00..7579.94 rows=2227 width=12) > > EXPLAIN > > int64_t *strtonumu(text, int4 base) > Converts a string to a number with an arbitrary base. (Is there a function to > do this already?) > > > -- > I'm not offering myself as an example; every life evolves by its own laws. > ------------------------ > http://www.mohawksoft.com > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Added to our /contrib. This is a loadable module, so it clearly belongs in /contrib. > > > -- > Joel Burton <jburton@scw.org> > Director of Information Systems, Support Center of Washington Content-Description: [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026