Thread: Generalized edit function?
Hi hackers,

I am interested in extending Postgres with a "generalized edit function" like SAS's compged [1], which is basically Levenshtein distance with transposes (ab <-> ba) and LOTS of different weights for certain ops (like insert a blank versus delete from the end versus insert a regular character). Compged seems to work really well for us when trying to match addresses (MUCH better than pure Levenshtein), and it would be a great tool for data miners.

I have a number of questions:

1. Does anybody else care? I would love to see this in contrib, but if the chances are slim, then I would like to know that too.

2. Has anybody else done something like this and can give ideas or source? It seems to me that the code will have to be a mess of pointers and indexes, but if there is some theory that simplifies it, I haven't heard about it. (Levenshtein without transposes is theoretically clean, but I think the fact that we have transposes means we look ahead 2 chars and lose all the nice dynamic programming stuff.)

3. I will probably implement this for ASCII characters -- if anyone has any thoughts on other encodings, please share.

Thanks for everyone's time. I will try to implement a command line version and put that on pastebin for people to look at while I port it to the Postgres environment.

[1] http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm
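P.S. Roughly what I have in mind, as a minimal command-line sketch: a restricted Damerau-Levenshtein (transposition-aware) distance with per-operation weights. It is ASCII-only, and the weights are placeholders rather than SAS's actual compged cost table (which also distinguishes things like deletes at the end of the string).

/*
 * Minimal sketch of a weighted edit distance with transposition
 * (restricted Damerau-Levenshtein).  ASCII-only; the weights are
 * illustrative placeholders, not the real compged costs.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define W_INSERT     100    /* insert a regular character */
#define W_INS_BLANK   10    /* insert a blank (cheaper, as in compged) */
#define W_DELETE     100    /* delete a character */
#define W_REPLACE    120    /* replace a character */
#define W_TRANSPOSE   50    /* swap two adjacent characters */

static int
weighted_edit(const char *s, const char *t)
{
    int     m = strlen(s);
    int     n = strlen(t);
    int    *d = malloc((m + 1) * (n + 1) * sizeof(int));
    int     result;

#define D(i, j) d[(i) * (n + 1) + (j)]
    D(0, 0) = 0;
    for (int i = 1; i <= m; i++)
        D(i, 0) = D(i - 1, 0) + W_DELETE;
    for (int j = 1; j <= n; j++)
        D(0, j) = D(0, j - 1) + ((t[j - 1] == ' ') ? W_INS_BLANK : W_INSERT);

    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
        {
            int     ins = (t[j - 1] == ' ') ? W_INS_BLANK : W_INSERT;
            int     sub = (s[i - 1] == t[j - 1]) ? 0 : W_REPLACE;
            int     best = D(i - 1, j - 1) + sub;

            if (D(i - 1, j) + W_DELETE < best)
                best = D(i - 1, j) + W_DELETE;
            if (D(i, j - 1) + ins < best)
                best = D(i, j - 1) + ins;

            /* transposition: ab <-> ba */
            if (i > 1 && j > 1 &&
                s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1] &&
                D(i - 2, j - 2) + W_TRANSPOSE < best)
                best = D(i - 2, j - 2) + W_TRANSPOSE;

            D(i, j) = best;
        }

    result = D(m, n);
#undef D
    free(d);
    return result;
}

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s string1 string2\n", argv[0]);
        return 1;
    }
    printf("%d\n", weighted_edit(argv[1], argv[2]));
    return 0;
}

Note that the transposition case only reaches back one extra row and column, so the dynamic programming table still works; whether that survives once all of compged's operations are added is exactly the part I am unsure about.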
"Fork", > 1. Does anybody else care? I would love to see this in contrib, but if the > chances are slim, then I would like to know that too. That really depends on how well it works, and how much code it is. It's way too early for anyone to have a viewpoint on this. For example, a few years ago I'd have said that trigrams were hopelessly specialized for mainstream PostgreSQL, but not they're not. The path for this is to create an external project (on pgfoundry, github, sourceforge, code.google,com, etc.) and then an Extension for this. Once it's production-quality, and if contrib is even relevant at that point, you can propose it. > 3. I will probably implement this for ascii characters -- if anyone has any > thoughts on other encodings, please share. Why would the implementation for ASCII be different from Unicode? Surely any distance algorithm is encoding-neutral. Anyway, if it's ASCII-only, that's a guaranteed way to make sure it isn't taken seriously. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Sat, Feb 26, 2011 at 4:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Anyway, if it's ASCII-only, that's a guaranteed way to make sure it
> isn't taken seriously.

Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff in contrib/fuzzystrmatch still is. We had to work pretty hard to avoid a massive performance loss when we made it multi-byte aware, and IIRC there is still a pretty significant performance loss if any multibyte characters are actually present. But at least now it doesn't return totally bogus answers.

So I have some sympathy with the OP's desire not to burden himself with the non-ASCII case if he doesn't need it for his application, but I also agree with your point that we probably wouldn't accept code into contrib that doesn't handle it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas <at> gmail.com> writes:
>
> On Sat, Feb 26, 2011 at 4:19 PM, Josh Berkus <josh <at> agliodbs.com> wrote:
> > Anyway, if it's ASCII-only, that's a guaranteed way to make sure it
> > isn't taken seriously.
>
> Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff
> in contrib/fuzzystrmatch still is.

I am only looking at 9.0.3 for levenshtein, so I don't have any thoughts on multi-byteness yet. I will have to figure out the multibyte character work once I get the basic algorithm working -- any thoughts on that? Any pitfalls in porting?

> So I have some sympathy with the OP's desire not to burden himself
> with the non-ASCII case if he doesn't need it for his application, but
> I also agree with your point that we probably wouldn't accept code
> into contrib that doesn't.

Good to know. I will try to avoid backing myself into an ASCII corner.
On Sat, Feb 26, 2011 at 7:40 PM, fork <forkandwait@gmail.com> wrote:
>> Pre-9.1 levenshtein is ASCII-only, and I think some of the other stuff
>> in contrib/fuzzystrmatch still is.
>
> I am only looking at 9.0.3 for levenshtein, so I don't have any thoughts on
> multi-byteness yet. I will have to figure out the multibyte character work
> once I get the basic algorithm working -- any thoughts on that? Any pitfalls
> in porting?

The main thing with levenshtein() is that if you're working with single-byte characters then you can reference the i'th character as x[i], whereas if you have multi-byte characters then you need to build an offset table and look at length[i] bytes beginning at &x[offset[i]]. That turns out to be significantly more expensive.

As initially proposed, the patch to add multi-byte awareness built this lookup table for any multi-byte encoding and used the faster technique for single-byte encodings, but that wasn't actually so hot, because the most widely used encoding these days is probably UTF-8, which of course is a multi-byte encoding. What we ended up with is a fast path for the case where both strings contain only single-byte characters, which will always be true in a single-byte encoding but might easily also be true in a multi-byte encoding, especially for English speakers.

I don't know if that's exactly right for what you're trying to do - you'll probably need to try some different things and benchmark. I would however recommend that you look at the master-branch implementation of levenshtein() rather than the old 9.0.x one, because it's significantly different, and forward-porting your changes will probably be hard.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
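P.S. To make the offset-table idea concrete, here is a rough sketch of the kind of thing I mean -- this is not the actual fuzzystrmatch code, just an illustration of the multi-byte bookkeeping using pg_mblen():

/*
 * Rough sketch only: precompute, for each character of a multi-byte
 * string, its byte offset and byte length, so that "character i" can
 * be found without rescanning the string.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

typedef struct
{
    int     offset;         /* byte offset of character i */
    int     len;            /* byte length of character i */
} CharPos;

static CharPos *
build_char_map(const char *s, int bytelen, int *nchars)
{
    /* worst case: every character is a single byte */
    CharPos *map = palloc(sizeof(CharPos) * (bytelen + 1));
    int      i = 0;
    int      pos = 0;

    while (pos < bytelen)
    {
        int     clen = pg_mblen(s + pos);   /* 1 everywhere for pure ASCII */

        map[i].offset = pos;
        map[i].len = clen;
        pos += clen;
        i++;
    }
    *nchars = i;
    return map;
}

/*
 * Comparing character i of s with character j of t becomes a memcmp
 * over the mapped byte ranges instead of a single-byte comparison,
 * which is where the extra cost comes from.
 */
static bool
char_eq(const char *s, const CharPos *smap, int i,
        const char *t, const CharPos *tmap, int j)
{
    if (smap[i].len != tmap[j].len)
        return false;
    return memcmp(s + smap[i].offset, t + tmap[j].offset, smap[i].len) == 0;
}

The fast path in the current code is essentially "if both strings turn out to contain only single-byte characters, skip the map and index the bytes directly"; whether the same split makes sense for your weighted version is something you'd have to benchmark.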