Thread: Generalized edit function?

Generalized edit function?

From
fork
Date:
Hi hackers,

I am interested in extending Postgres with a "generalized edit function" like
SAS's "compged"[1], which is basically levenshtein distance with transposes (ab
<-> ba) and LOTS of different weights for certain ops (like insert a blank
versus delete from the end versus insert a regular character).  

Compged seems to work really well for us when trying to match addresses (MUCH
better than pure levenshtein), and it would be a great tool for data miners.

I have a number of questions:

1.  Does anybody else care? I would love to see this in contrib, but if the
chances are slim, then I would like to know that too.

2.  Has anybody else done something like this and can give ideas or source?  It
seems to me that the code will have to be a mess of pointers and indexes, but if
there is some theory that simplifies it I haven't heard about it.  (Levenshtein
without transposes is theoretically clean, but I think the fact that we have
transposes means we look ahead 2 chars and lose all the nice dynamic programming
stuff.)
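For what it's worth, adjacent transpositions don't actually break the dynamic program: the Damerau-Levenshtein ("optimal string alignment") recurrence handles ab <-> ba with one extra lookback of a single row and column in the same table. A Python sketch with per-operation weights -- the weight values here are made up for illustration, not compged's actual defaults:

```python
# Weighted edit distance with adjacent transpositions (the "optimal
# string alignment" variant of Damerau-Levenshtein).  Each operation
# carries its own cost; the transposition case just looks back one
# extra row/column in the same DP table.  Weight values are illustrative.

def weighted_edit_distance(a, b, ins=10, dele=10, sub=20, trans=5):
    n, m = len(a), len(b)
    # d[i][j] = cost of editing a[:i] into b[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * dele
    for j in range(1, m + 1):
        d[0][j] = j * ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i-1] == b[j-1] else sub
            d[i][j] = min(d[i-1][j] + dele,    # delete a[i-1]
                          d[i][j-1] + ins,     # insert b[j-1]
                          d[i-1][j-1] + cost)  # substitute / match
            # adjacent transposition: ab <-> ba
            if i > 1 and j > 1 and a[i-1] == b[j-2] and a[i-2] == b[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + trans)
    return d[n][m]

print(weighted_edit_distance("ab", "ba"))    # 5: one transposition
print(weighted_edit_distance("abc", "abd"))  # 20: one substitution
```

The full compged cost model (blank-insertion vs. end-deletion vs. ordinary characters, etc.) would just be more cases in the min(), each with its own weight.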

3.  I will probably implement this for ascii characters -- if anyone has any
thoughts on other encodings, please share.

Thanks for everyone's time.  I will try to implement a command line version and
put that on pastebin for people to look at while I port it to the postgres
environment.


[1]
(http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm)



Re: Generalized edit function?

From
Josh Berkus
Date:
"Fork",

> 1.  Does anybody else care? I would love to see this in contrib, but if the
> chances are slim, then I would like to know that too.

That really depends on how well it works, and how much code it is.  It's
way too early for anyone to have a viewpoint on this.  For example, a
few years ago I'd have said that trigrams were hopelessly specialized
for mainstream PostgreSQL, but now they're not.

The path for this is to create an external project (on pgfoundry,
github, sourceforge, code.google.com, etc.) and then package it as an
Extension.  Once it's production-quality, and if contrib is even
relevant at that point, you can propose it.

> 3.  I will probably implement this for ascii characters -- if anyone has any
> thoughts on other encodings, please share.

Why would the implementation for ASCII be different from Unicode?
Surely any distance algorithm is encoding-neutral.

Anyway, if it's ASCII-only, that's a guaranteed way to make sure it
isn't taken seriously.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Generalized edit function?

From
Robert Haas
Date:
On Sat, Feb 26, 2011 at 4:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Anyway, if it's ASCII-only, that's a guaranteed way to make sure it
> isn't taken seriously.

Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff
in contrib/fuzzystrmatch still is.  We had to work pretty hard to
avoid a massive performance loss when we made it multi-byte aware, and
IIRC there is still a pretty significant performance loss if any
multibyte characters are actually present.  But at least now it
doesn't return totally bogus answers.
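To illustrate what "bogus answers" means here: in UTF-8, "é" is two bytes, so a distance computed over raw bytes double-counts a single character change. A quick sketch in Python (plain unweighted Levenshtein, purely for illustration):

```python
# Why a byte-oriented edit distance gives bogus answers on UTF-8 input:
# "café" vs "cafe" differ by one *character*, but "é" is two bytes in
# UTF-8, so a distance computed over raw bytes reports two edits.

def levenshtein(a, b):
    # plain unweighted Levenshtein over any sequence (str or bytes)
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)
    return d[n][m]

print(levenshtein("café", "cafe"))                   # 1 (over characters)
print(levenshtein("café".encode("utf-8"), b"cafe"))  # 2 (over raw bytes)
```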

So I have some sympathy with the OP's desire not to burden himself
with the non-ASCII case if he doesn't need it for his application, but
I also agree with your point that we probably wouldn't accept code
into contrib that doesn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Generalized edit function?

From
fork
Date:
Robert Haas <robertmhaas@gmail.com> writes:

> On Sat, Feb 26, 2011 at 4:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> > Anyway, if it's ASCII-only, that's a guaranteed way to make sure it
> > isn't taken seriously.
>
> Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff
> in contrib/fuzzystrmatch still is.

I am only looking at 9.0.3 for levenshtein, so I don't have any thoughts on
multi-byteness yet.  I will have to figure out the multibyte handling
once I get the basic algorithm working -- any thoughts on that?  Any pitfalls
in porting?

> So I have some sympathy with the OP's desire not to burden himself
> with the non-ASCII case if he doesn't need it for his application, but
> I also agree with your point that we probably wouldn't accept code
> into contrib that doesn't.

Good to know.  I will try to avoid backing myself into an ASCII corner.

Re: Generalized edit function?

From
Robert Haas
Date:
On Sat, Feb 26, 2011 at 7:40 PM, fork <forkandwait@gmail.com> wrote:
>> Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff
>> in contrib/fuzzystrmatch still is.
>
> I am only looking at 9.0.3 for levenshtein, so I don't have any thoughts on
> multi-byteness yet.  I will have to figure out the multibyte handling
> once I get the basic algorithm working -- any thoughts on that?  Any pitfalls
> in porting?

The main thing with levenshtein() is that if you're working with
single-byte characters then you can reference the i'th character as
x[i], whereas if you have multi-byte characters then you need to build
an offset table and look at length[i] bytes beginning at
&x[offset[i]].  That turns out to be significantly more expensive.

As initially proposed, the patch to add multi-byte awareness built this
lookup table for any multi-byte encoding and used the faster technique
for single-byte encodings, but that wasn't actually so hot, because
the most widely used encoding these days is probably UTF-8, which of
course is a multi-byte encoding.  What we ended up with is a fast path
for the case where both strings contain only single-byte characters,
which will always be true in a single-byte encoding but might easily
also be true in a multi-byte encoding, especially for English
speakers.

I don't know if that's exactly right for what you're trying
to do -- you'll probably need to try some different things and
benchmark.  I would, however, recommend that you look at the
master-branch implementation of levenshtein() rather than the old
9.0.x one, because it's significantly different, and forward-porting
your changes will probably be hard.
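A sketch of that offset/length bookkeeping, in Python for brevity (the real code is C, and this assumes the input is already-validated UTF-8 -- a UTF-8 character's byte length is determined by its first byte):

```python
# Sketch of the two-level approach described above: with single-byte
# characters you can index x[i] directly; otherwise you first build an
# offset/length table so that character i occupies the bytes
# x[offsets[i] : offsets[i] + lengths[i]].  Assumes valid UTF-8 input.

def utf8_char_table(s: bytes):
    """Return (offsets, lengths) of the UTF-8 characters in s."""
    offsets, lengths = [], []
    i = 0
    while i < len(s):
        b = s[i]
        if b < 0x80:
            n = 1   # 0xxxxxxx: ASCII
        elif b < 0xE0:
            n = 2   # 110xxxxx: two-byte sequence
        elif b < 0xF0:
            n = 3   # 1110xxxx: three-byte sequence
        else:
            n = 4   # 11110xxx: four-byte sequence
        offsets.append(i)
        lengths.append(n)
        i += n
    return offsets, lengths

def all_single_byte(s: bytes) -> bool:
    # the fast-path test: true iff every character is one byte,
    # so the table can be skipped and s[i] indexed directly
    return all(b < 0x80 for b in s)

print(utf8_char_table("café".encode("utf-8")))  # ([0, 1, 2, 3], [1, 1, 1, 2])
print(all_single_byte(b"cafe"))                 # True
```

The point of the fast path is exactly what's described above: the table build and indirect indexing are paid only when some character actually needs it.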

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company