Thread: Metaphone function attachment

Metaphone function attachment

From
Joel Burton
Date:

--
Joel Burton   <jburton@scw.org>
Director of Information Systems, Support Center of Washington

Attachment

Re: Metaphone function attachment

From
Bruce Momjian
Date:
For those curious about what this actually does, README attached.  It
will appear in 7.2.  Seems similar to Soundex.

>
>
> --
> Joel Burton   <jburton@scw.org>
> Director of Information Systems, Support Center of Washington

Content-Description:

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
This directory contains a module that implements the "Metaphone" code as
a PostgreSQL user-defined function.  The Metaphone system is a method of
matching similar sounding names (or any words) to the same code.

Metaphone was invented by Lawrence Philips as an improvement to the popular
name-hashing routine, Soundex.

This metaphone code is from Michael Kuhn, and is detailed at
   http://aspell.sourceforge.net/metaphone/metaphone-kuhn.txt

Code for this (including this help file!) was liberally borrowed from
the soundex() module for PostgreSQL.

There are two functions:
  metaphone(text)       : returns hash of a name
  metaphone(text,int)   : returns hash (maximum length of int) of name

---

To install it, first configure the main source tree, then run make;
make install in this directory.  Finally, load the function definition
with psql:

    psql -f PREFIX/share/contrib/metaphone.sql

The following are some usage examples:

SELECT text_metaphone('hello world!');
SELECT text_metaphone('hello world!', 4);

CREATE TABLE s (nm text)\g

insert into s values ('john')\g
insert into s values ('joan')\g
insert into s values ('wobbly')\g

select * from s
where text_metaphone(nm) = text_metaphone('john')\g

select nm from s a, s b
where text_metaphone(a.nm) = text_metaphone(b.nm)
and a.oid <> b.oid\g

CREATE FUNCTION text_mp_eq(text, text) RETURNS bool AS
'select text_metaphone($1) = text_metaphone($2)'
LANGUAGE 'sql'\g

CREATE FUNCTION text_mp_lt(text,text) RETURNS bool AS
'select text_metaphone($1) < text_metaphone($2)'
LANGUAGE 'sql'\g

CREATE FUNCTION text_mp_gt(text,text) RETURNS bool AS
'select text_metaphone($1) > text_metaphone($2)'
LANGUAGE 'sql';

CREATE FUNCTION text_mp_le(text,text) RETURNS bool AS
'select text_metaphone($1) <= text_metaphone($2)'
LANGUAGE 'sql';

CREATE FUNCTION text_mp_ge(text,text) RETURNS bool AS
'select text_metaphone($1) >= text_metaphone($2)'
LANGUAGE 'sql';

CREATE FUNCTION text_mp_ne(text,text) RETURNS bool AS
'select text_metaphone($1) <> text_metaphone($2)'
LANGUAGE 'sql';

DROP OPERATOR #= (text,text)\g

CREATE OPERATOR #= (leftarg=text, rightarg=text, procedure=text_mp_eq,
commutator=text_mp_eq)\g

SELECT *
FROM s
WHERE text_mp_eq(nm,'pillsbury')\g

SELECT *
from s
where s.nm #= 'pillsbury';

Re: Metaphone function attachment

From
mlw
Date:
Joel Burton wrote:
>
> --
> Joel Burton   <jburton@scw.org>
> Director of Information Systems, Support Center of Washington
>
>   -------------------------------------------------------------------------------
>                             Name: contrib-metaphone.tgz
>    contrib-metaphone.tgz    Type: unspecified type (APPLICATION/octet-stream)
>                         Encoding: BASE64
>
>   -------------------------------------------------------------------------------
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

I have written a similar function, and it is in a library I wrote "pgcontains"
which started out as a simple implementation of "contains(...)" but has grown
WAY beyond it's original scope.

Anyway, Metaphone is great for doing some cool searches, but now that we are on
the subject of cool search functions, I have a few that may be useful and would
be glad to contribute. My only question is how? (and I'd have to write them for
7.1, because they are still in 7.0.x format)


contains(...)
A simple implementation of contains. Forces a table scan, but does some cool
things including phrase detection.

decode(...)
Similar to "case" but works as decode for oracle queries. In 7.1 and higher it
should be easy to make one function take a variable number of parameters. Right
now I have a stub for the most common numbers.

strip(...)
Strips out all but alphanumeric characters and returns a lowercase string.
"Oops I did it again" comes back as "oopsididitagain." This is cool for lightly
fuzzy searches.

striprev(...)
Like strip, but reverses the string. Allows you to use an index for records
which end in something. For instance: "select * from table where field like
'abc%'" can use an index, where as "select * from table where field like
'%xyx'" will not. However, "select * from table where striprev(field) like
striprev('xyz') || '%'" can.

Example:
cdinfo=# select title, striprev(title) from ztitles where striprev(title) like
striprev('wall') || '%' limit 3;
                   title                    |              striprev
--------------------------------------------+------------------------------------
 A Giggle Can Wiggle Its Way Through A Wall |
llawahguorhtyawstielggiwnacelggiga
 Shadows On A Wall *                        | llawanoswodahs
 The Wall                                   | llaweht
(3 rows)

cdinfo=# explain select title, striprev(title) from ztitles where
striprev(title) like striprev('wall') || '%' limit
3;
NOTICE:  QUERY PLAN:

Limit  (cost=0.00..10.21 rows=3 width=12)
  ->  Index Scan using f1 on ztitles  (cost=0.00..7579.94 rows=2227 width=12)

EXPLAIN

int64_t *strtonumu(text, int4 base)
Converts a string to a number with an arbitrary base. (Is there a function to
do this already?)


--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

Re: Metaphone function attachment

From
Joel Burton
Date:
Why not start a new project at greatbridge.org?

I'd be happy to see metaphone() move in there, soundex() would make
sense. I have a hashing algorithm that grabs the first letter off of
words, except for user-definable 'stop words', which we use to look for
likely organization name matches.

These could all fall under a project of PG string functions.

I think, as little things in contrib/, it's easy for people to miss
these. With a project page, some discussion, etc. (& a place in contrib/),
more people would be able to use these.

PG functions are one of the things that separates PG from MySQL (which has
only C UDFs, and IIRC, not on some platforms) and InterBase (which has
plsql-like procedures, but functions can only be written in C). I think
our functions are one of our strongest cases, and the more we can show
people examples of how to use them, and the larger our useful library, the
more we win.

P.S. What exactly does contains() do?


Re: Re: Metaphone function attachment

From
Peter Eisentraut
Date:
Joel Burton writes:

> I think, as little things in contrib/, it's easy for people to miss
> these. With a project page, some discussion, etc. (& a place in contrib/),
> more people would be able to use these.

Most of the extension functions and types in contrib should, in my mind,
eventually be moved into the core.  contrib is a nice place for things
that we don't really know how/whether they work, but once we're confident
about the quality we might as well offer it by default.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Re: Metaphone function attachment

From
Joel Burton
Date:
On Fri, 4 May 2001, Peter Eisentraut wrote:

> Joel Burton writes:
> 
> > I think, as little things in contrib/, it's easy for people to miss
> > these. With a project page, some discussion, etc. (& a place in contrib/),
> > more people would be able to use these.
> 
> Most of the extension functions and types in contrib should, in my mind,
> eventually be moved into the core.  contrib is a nice place for things
> that we don't really know how/whether they work, but once we're confident
> about the quality we might as well offer it by default.

Yeah, but things do seem to languish there for quite a while. (soundex(),
for instance, was in contrib when I first looked at PG).

Also, some things are in contrib/ that seem a bit out of date (I think
there was still some early RI stuff in there last time I went through it)

I understand the need not to stuff PG full of *everything* -- and perhaps
stuff like soundex(), metaphone(), etc., shouldn't go into the core *. But
I think if we leave them in contrib/, after a while, it feels like there's
an implied comment on the quality/soundness of the code.

Would it work to have a different mechanism for distributing proven yet
out-of-the-mainstream stuff, like soundex(), etc.


* - soundex(), in particular, should go into the core, though. 
Many other DBs have it built in, so users could reasonably have the
expectation that we should have it.

-- 
Joel Burton   <jburton@scw.org>
Director of Information Systems, Support Center of Washington



Re: Metaphone function attachment

From
mlw
Date:
select * from table where contains(field, 'text string', 10) > 1  order by
score(10);

Contains returns a number based on an evaluation of the 'text string' against
the field. If the field has the words contained in 'text string' in it, it
returns a number based on some default assumptions. The assumptions are
things like points for the first occurrence of a word, and next occurrence of
the word. Points for words in the right order as specified in 'text string',
etc.

Who do I contact at greatbridge?


Joel Burton wrote:

> Why not start a new project at greatbridge.org?
>
> I'd be happy to see metaphone() move in there, soundex() would make
> sense. I have a hashing algorithm that grabs the first letter off of
> words, except for user-definable 'stop words', which we use to look for
> likely organization name matches.
>
> These could all fall under a project of PG string functions.
>
> I think, as little things in contrib/, it's easy for people to miss
> these. With a project page, some discussion, etc. (& a place in contrib/),
> more people would be able to use these.
>
> PG functions are one of the things that separates PG from MySQL (which has
> only C UDFs, and IIRC, not on some platforms) and InterBase (which has
> plsql-like procedures, but functions can only be written in C). I think
> our functions are one of our strongest cases, and the more we can show
> people examples of how to use them, and the larger our useful library, the
> more we win.
>
> P.S. What exactly does contains() do?
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html


Re: Re: Metaphone function attachmenty

From
Bruce Momjian
Date:
> Yeah, but things do seem to languish there for quite a while. (soundex(),
> for instance, was in contrib when I first looked at PG).
> 
> Also, some things are in contrib/ that seem a bit out of date (I think
> there was still some early RI stuff in there last time I went through it)
> 
> I understand the need not to stuff PG full of *everything* -- and perhaps
> stuff like soundex(), metaphone(), etc., shouldn't go into the core *. But
> I think if we leave them in contrib/, after a while, it feels like there's
> an implied comment on the quality/soundness of the code.
> 
> Would it work to have a different mechanism for distributing proven yet
> out-of-the-mainstream stuff, like soundex(), etc.
> 
> 
> * - soundex(), in particular, should go into the core, though. 
> Many other DBs have it built in, so users could reasonably have the
> expectation that we should have it.

Added to TODO:
* Move some things from /contrib into main tree, like soundex    


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [GENERAL] Re: Metaphone function attachment

From
Bruce Momjian
Date:
Sure, send them over and we can put them in /contrib.  Send them to the
patches list, I think.


> Joel Burton wrote:
> >
> > --
> > Joel Burton   <jburton@scw.org>
> > Director of Information Systems, Support Center of Washington
> >
> >   -------------------------------------------------------------------------------
> >                             Name: contrib-metaphone.tgz
> >    contrib-metaphone.tgz    Type: unspecified type (APPLICATION/octet-stream)
> >                         Encoding: BASE64
> >
> >   -------------------------------------------------------------------------------
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 4: Don't 'kill -9' the postmaster
>
> I have written a similar function, and it is in a library I wrote "pgcontains"
> which started out as a simple implementation of "contains(...)" but has grown
> WAY beyond it's original scope.
>
> Anyway, Metaphone is great for doing some cool searches, but now that we are on
> the subject of cool search functions, I have a few that may be useful and would
> be glad to contribute. My only question is how? (and I'd have to write them for
> 7.1, because they are still in 7.0.x format)
>
>
> contains(...)
> A simple implementation of contains. Forces a table scan, but does some cool
> things including phrase detection.
>
> decode(...)
> Similar to "case" but works as decode for oracle queries. In 7.1 and higher it
> should be easy to make one function take a variable number of parameters. Right
> now I have a stub for the most common numbers.
>
> strip(...)
> Strips out all but alphanumeric characters and returns a lowercase string.
> "Oops I did it again" comes back as "oopsididitagain." This is cool for lightly
> fuzzy searches.
>
> striprev(...)
> Like strip, but reverses the string. Allows you to use an index for records
> which end in something. For instance: "select * from table where field like
> 'abc%'" can use an index, where as "select * from table where field like
> '%xyx'" will not. However, "select * from table where striprev(field) like
> striprev('xyz') || '%'" can.
>
> Example:
> cdinfo=# select title, striprev(title) from ztitles where striprev(title) like
> striprev('wall') || '%' limit 3;
>                    title                    |              striprev
> --------------------------------------------+------------------------------------
>  A Giggle Can Wiggle Its Way Through A Wall |
> llawahguorhtyawstielggiwnacelggiga
>  Shadows On A Wall *                        | llawanoswodahs
>  The Wall                                   | llaweht
> (3 rows)
>
> cdinfo=# explain select title, striprev(title) from ztitles where
> striprev(title) like striprev('wall') || '%' limit
> 3;
> NOTICE:  QUERY PLAN:
>
> Limit  (cost=0.00..10.21 rows=3 width=12)
>   ->  Index Scan using f1 on ztitles  (cost=0.00..7579.94 rows=2227 width=12)
>
> EXPLAIN
>
> int64_t *strtonumu(text, int4 base)
> Converts a string to a number with an arbitrary base. (Is there a function to
> do this already?)
>
>
> --
> I'm not offering myself as an example; every life evolves by its own laws.
> ------------------------
> http://www.mohawksoft.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Metaphone function attachment

From
Bruce Momjian
Date:
Added to our /contrib.  This is a loadable module, so it clearly belongs
in /contrib.


>
>
> --
> Joel Burton   <jburton@scw.org>
> Director of Information Systems, Support Center of Washington

Content-Description:

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026