Thread: Metaphone function attachment

Metaphone function attachment

From
Joel Burton
Date:

--
Joel Burton   <jburton@scw.org>
Director of Information Systems, Support Center of Washington

Attachment

Re: [HACKERS] Metaphone function attachment

From
Bruce Momjian
Date:
For those curious about what this actually does, README attached.  It
will appear in 7.2.  Seems similar to Soundex.

>
>
> --
> Joel Burton   <jburton@scw.org>
> Director of Information Systems, Support Center of Washington

Content-Description:

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
This directory contains a module that implements the "Metaphone" code as
a PostgreSQL user-defined function.  The Metaphone system is a method of
matching similar sounding names (or any words) to the same code.

Metaphone was invented by Lawrence Philips as an improvement to the popular
name-hashing routine, Soundex.

This metaphone code is from Michael Kuhn, and is detailed at
   http://aspell.sourceforge.net/metaphone/metaphone-kuhn.txt

Code for this (including this help file!) was liberally borrowed from
the soundex() module for PostgreSQL.

There are two functions:
  metaphone(text)       : returns hash of a name
  metaphone(text,int)   : returns hash (maximum length of int) of name

---

To install it, first configure the main source tree, then run make;
make install in this directory.  Finally, load the function definition
with psql:

    psql -f PREFIX/share/contrib/metaphone.sql

The following are some usage examples:

SELECT text_metaphone('hello world!');
SELECT text_metaphone('hello world!', 4);

CREATE TABLE s (nm text)\g

insert into s values ('john')\g
insert into s values ('joan')\g
insert into s values ('wobbly')\g

select * from s
where text_metaphone(nm) = text_metaphone('john')\g

select nm from s a, s b
where text_metaphone(a.nm) = text_metaphone(b.nm)
and a.oid <> b.oid\g

CREATE FUNCTION text_mp_eq(text, text) RETURNS bool AS
'select text_metaphone($1) = text_metaphone($2)'
LANGUAGE 'sql'\g

CREATE FUNCTION text_mp_lt(text,text) RETURNS bool AS
'select text_metaphone($1) < text_metaphone($2)'
LANGUAGE 'sql'\g

CREATE FUNCTION text_mp_gt(text,text) RETURNS bool AS
'select text_metaphone($1) > text_metaphone($2)'
LANGUAGE 'sql';

CREATE FUNCTION text_mp_le(text,text) RETURNS bool AS
'select text_metaphone($1) <= text_metaphone($2)'
LANGUAGE 'sql';

CREATE FUNCTION text_mp_ge(text,text) RETURNS bool AS
'select text_metaphone($1) >= text_metaphone($2)'
LANGUAGE 'sql';

CREATE FUNCTION text_mp_ne(text,text) RETURNS bool AS
'select text_metaphone($1) <> text_metaphone($2)'
LANGUAGE 'sql';

DROP OPERATOR #= (text,text)\g

CREATE OPERATOR #= (leftarg=text, rightarg=text, procedure=text_mp_eq,
commutator=text_mp_eq)\g

SELECT *
FROM s
WHERE text_mp_eq(nm,'pillsbury')\g

SELECT *
from s
where s.nm #= 'pillsbury';

Re: Metaphone function attachment

From
Joel Burton
Date:
Why not start a new project at greatbridge.org?

I'd be happy to see metaphone() move in there, soundex() would make
sense. I have a hashing algorithm that grabs the first letter off of
words, except for user-definable 'stop words', which we use to look for
likely organization name matches.

These could all fall under a project of PG string functions.

I think, as little things in contrib/, it's easy for people to miss
these. With a project page, some discussion, etc. (& a place in contrib/),
more people would be able to use these.

PG functions are one of the things that separates PG from MySQL (which has
only C UDFs, and IIRC, not on some platforms) and InterBase (which has
plsql-like procedures, but functions can only be written in C). I think
our functions are one of our strongest cases, and the more we can show
people examples of how to use them, and the larger our useful library, the
more we win.

P.S. What exactly does contains() do?


Re: Metaphone function attachment

From
mlw
Date:
Joel Burton wrote:
>
> --
> Joel Burton   <jburton@scw.org>
> Director of Information Systems, Support Center of Washington
>
>   -------------------------------------------------------------------------------
>                             Name: contrib-metaphone.tgz
>    contrib-metaphone.tgz    Type: unspecified type (APPLICATION/octet-stream)
>                         Encoding: BASE64
>
>   -------------------------------------------------------------------------------
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

I have written a similar function, and it is in a library I wrote "pgcontains"
which started out as a simple implementation of "contains(...)" but has grown
WAY beyond it's original scope.

Anyway, Metaphone is great for doing some cool searches, but now that we are on
the subject of cool search functions, I have a few that may be useful and would
be glad to contribute. My only question is how? (and I'd have to write them for
7.1, because they are still in 7.0.x format)


contains(...)
A simple implementation of contains. Forces a table scan, but does some cool
things including phrase detection.

decode(...)
Similar to "case" but works as decode for oracle queries. In 7.1 and higher it
should be easy to make one function take a variable number of parameters. Right
now I have a stub for the most common numbers.

strip(...)
Strips out all but alphanumeric characters and returns a lowercase string.
"Oops I did it again" comes back as "oopsididitagain." This is cool for lightly
fuzzy searches.

striprev(...)
Like strip, but reverses the string. Allows you to use an index for records
which end in something. For instance: "select * from table where field like
'abc%'" can use an index, where as "select * from table where field like
'%xyx'" will not. However, "select * from table where striprev(field) like
striprev('xyz') || '%'" can.

Example:
cdinfo=# select title, striprev(title) from ztitles where striprev(title) like
striprev('wall') || '%' limit 3;
                   title                    |              striprev
--------------------------------------------+------------------------------------
 A Giggle Can Wiggle Its Way Through A Wall |
llawahguorhtyawstielggiwnacelggiga
 Shadows On A Wall *                        | llawanoswodahs
 The Wall                                   | llaweht
(3 rows)

cdinfo=# explain select title, striprev(title) from ztitles where
striprev(title) like striprev('wall') || '%' limit
3;
NOTICE:  QUERY PLAN:

Limit  (cost=0.00..10.21 rows=3 width=12)
  ->  Index Scan using f1 on ztitles  (cost=0.00..7579.94 rows=2227 width=12)

EXPLAIN

int64_t *strtonumu(text, int4 base)
Converts a string to a number with an arbitrary base. (Is there a function to
do this already?)


--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

Re: Metaphone function attachment

From
mlw
Date:
select * from table where contains(field, 'text string', 10) > 1  order by
score(10);

Contains returns a number based on an evaluation of the 'text string' against
the field. If the field has the words contained in 'text string' in it, it
returns a number based on some default assumptions. The assumptions are
things like points for the first occurrence of a word, and next occurrence of
the word. Points for words in the right order as specified in 'text string',
etc.

Who do I contact at greatbridge?


Joel Burton wrote:

> Why not start a new project at greatbridge.org?
>
> I'd be happy to see metaphone() move in there, soundex() would make
> sense. I have a hashing algorithm that grabs the first letter off of
> words, except for user-definable 'stop words', which we use to look for
> likely organization name matches.
>
> These could all fall under a project of PG string functions.
>
> I think, as little things in contrib/, it's easy for people to miss
> these. With a project page, some discussion, etc. (& a place in contrib/),
> more people would be able to use these.
>
> PG functions are one of the things that separates PG from MySQL (which has
> only C UDFs, and IIRC, not on some platforms) and InterBase (which has
> plsql-like procedures, but functions can only be written in C). I think
> our functions are one of our strongest cases, and the more we can show
> people examples of how to use them, and the larger our useful library, the
> more we win.
>
> P.S. What exactly does contains() do?
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html


Re: Re: Metaphone function attachment

From
Bruce Momjian
Date:
Sure, send them over and we can put them in /contrib.  Send them to the
patches list, I think.


> Joel Burton wrote:
> >
> > --
> > Joel Burton   <jburton@scw.org>
> > Director of Information Systems, Support Center of Washington
> >
> >   -------------------------------------------------------------------------------
> >                             Name: contrib-metaphone.tgz
> >    contrib-metaphone.tgz    Type: unspecified type (APPLICATION/octet-stream)
> >                         Encoding: BASE64
> >
> >   -------------------------------------------------------------------------------
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 4: Don't 'kill -9' the postmaster
>
> I have written a similar function, and it is in a library I wrote "pgcontains"
> which started out as a simple implementation of "contains(...)" but has grown
> WAY beyond it's original scope.
>
> Anyway, Metaphone is great for doing some cool searches, but now that we are on
> the subject of cool search functions, I have a few that may be useful and would
> be glad to contribute. My only question is how? (and I'd have to write them for
> 7.1, because they are still in 7.0.x format)
>
>
> contains(...)
> A simple implementation of contains. Forces a table scan, but does some cool
> things including phrase detection.
>
> decode(...)
> Similar to "case" but works as decode for oracle queries. In 7.1 and higher it
> should be easy to make one function take a variable number of parameters. Right
> now I have a stub for the most common numbers.
>
> strip(...)
> Strips out all but alphanumeric characters and returns a lowercase string.
> "Oops I did it again" comes back as "oopsididitagain." This is cool for lightly
> fuzzy searches.
>
> striprev(...)
> Like strip, but reverses the string. Allows you to use an index for records
> which end in something. For instance: "select * from table where field like
> 'abc%'" can use an index, where as "select * from table where field like
> '%xyx'" will not. However, "select * from table where striprev(field) like
> striprev('xyz') || '%'" can.
>
> Example:
> cdinfo=# select title, striprev(title) from ztitles where striprev(title) like
> striprev('wall') || '%' limit 3;
>                    title                    |              striprev
> --------------------------------------------+------------------------------------
>  A Giggle Can Wiggle Its Way Through A Wall |
> llawahguorhtyawstielggiwnacelggiga
>  Shadows On A Wall *                        | llawanoswodahs
>  The Wall                                   | llaweht
> (3 rows)
>
> cdinfo=# explain select title, striprev(title) from ztitles where
> striprev(title) like striprev('wall') || '%' limit
> 3;
> NOTICE:  QUERY PLAN:
>
> Limit  (cost=0.00..10.21 rows=3 width=12)
>   ->  Index Scan using f1 on ztitles  (cost=0.00..7579.94 rows=2227 width=12)
>
> EXPLAIN
>
> int64_t *strtonumu(text, int4 base)
> Converts a string to a number with an arbitrary base. (Is there a function to
> do this already?)
>
>
> --
> I'm not offering myself as an example; every life evolves by its own laws.
> ------------------------
> http://www.mohawksoft.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: [HACKERS] Metaphone function attachment

From
Bruce Momjian
Date:
Added to our /contrib.  This is a loadable module, so it clearly belongs
in /contrib.


>
>
> --
> Joel Burton   <jburton@scw.org>
> Director of Information Systems, Support Center of Washington

Content-Description:

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026