Thread: BK-Tree Implementation on top of GiST
Hi, In an address search framework of a company, we need to deal with queries including potential spelling errors. After some external address space constraints (e.g. match first letters, word length, etc.) we're still ending up with a huge data set to filter through Levenshtein like distance metrics. Sequential scanning a record set with roughly 100,000 entries through some sort of distance metric is not something we'd want in production. For this purpose, I plan to implement BK-Trees[1] on top of GiST, which will (technically) reduce overhead complexity from O(n) to O(logn). As far as I'm concerned, such a work will worth the time it will take when compared to overhead reduction it will bring. [1] Some approaches to best-match file searching http://portal.acm.org/citation.cfm?id=362003.362025 Anyway, I have some experience with source code of intarray module. Does anybody have suggestions/warnings/comments about such a project? Would PostgreSQL team welcome such a patch to get integrated into fuzzystrmatch module, or should I create my own project at pgfoundry? BTW, as you'd imagine, related implementation won't be something specific to Levenshtein. Any distance metric on any kind of data will be able to benefit from BK-Trees. Regards.
* Volkan YAZICI: > [1] Some approaches to best-match file searching > http://portal.acm.org/citation.cfm?id=362003.362025 http://citeseer.ist.psu.edu/1593.html suggests that this uninteresting (too much of the database is examined) once you go past an edit distance of 1. I don't know if this is a problem in your case (it is in mine). It's a pity that this whole set of problems is still mostly unsolved.
Have you seen contrib/pg_trgm module ? Oleg On Sun, 28 Oct 2007, Volkan YAZICI wrote: > Hi, > > In an address search framework of a company, we need to deal with > queries including potential spelling errors. After some external > address space constraints (e.g. match first letters, word length, > etc.) we're still ending up with a huge data set to filter through > Levenshtein like distance metrics. > > Sequential scanning a record set with roughly 100,000 entries through > some sort of distance metric is not something we'd want in > production. For this purpose, I plan to implement BK-Trees[1] on top > of GiST, which will (technically) reduce overhead complexity from O(n) > to O(logn). As far as I'm concerned, such a work will worth the time > it will take when compared to overhead reduction it will bring. > > [1] Some approaches to best-match file searching > http://portal.acm.org/citation.cfm?id=362003.362025 > > Anyway, I have some experience with source code of intarray > module. Does anybody have suggestions/warnings/comments about such a > project? Would PostgreSQL team welcome such a patch to get integrated > into fuzzystrmatch module, or should I create my own project at > pgfoundry? > > BTW, as you'd imagine, related implementation won't be something > specific to Levenshtein. Any distance metric on any kind of data will > be able to benefit from BK-Trees. > > > Regards. > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Florian Weimer <fw@deneb.enyo.de> writes: > http://citeseer.ist.psu.edu/1593.html suggests that this uninteresting > (too much of the database is examined) once you go past an edit distance > of 1. I don't know if this is a problem in your case (it is in mine). Did you see the test results in bk-tree[1] project? Results will change with respect to metric distance distribution of your input data, but I was quite impressed by the numbers when I first saw them. [1] http://www.cliki.net/bk-tree Regards.