Re: Partial match in GIN (next vesrion) - Mailing list pgsql-patches

From Oleg Bartunov
Subject Re: Partial match in GIN (next vesrion)
Date
Msg-id Pine.LNX.4.64.0805162226440.21547@sn.sai.msu.ru
Whole thread Raw
In response to Re: Partial match in GIN (next vesrion)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-patches
Wildspeed was designed as an example application of the GIN's partial
match and as a useful extension for *short* strings. It's also good
standalone demonstration of GIN API. We tried to stay away from full text
search, parser, word delimiters and etc.
From that point of view it might be
useful contrib, but I agree we have to think better to let it more
usable.

Oleg

On Fri, 16 May 2008, Tom Lane wrote:

> Teodor Sigaev <teodor@sigaev.ru> writes:
>> http://www.sigaev.ru/misc/partial_match_gin-0.10.gz
>> http://www.sigaev.ru/misc/tsearch_prefix-0.9.gz
>> http://www.sigaev.ru/misc/wildspeed-0.12.tgz
>
> I've applied the first two of these with minor editorialization (mostly
> fixing documentation).  However, I'm having a hard time convincing myself
> that anyone will find wildspeed useful in its current form.  I did a
> simple experiment using a table of titles of database papers:
>
> contrib_regression=# select count(*), avg(length(title)) from pub;
> count  |         avg
> --------+---------------------
> 236984 | 64.7647520507713601
> (1 row)
>
> This takes about 22MB on disk as a Postgres table.  I was expecting the
> wildspeed index to be about 65 times as large, which is bad enough
> already, but actually it weighed in at 2165MB or nearly 100X bigger.
> Plus it took forever to build: 35 minutes on a fairly fast machine
> with maintenance_work_mem set to 512MB.
>
> In comparison, building a conventional full-text-search index (GIN
> tsvector) took about 22 seconds including constructing the tsvector
> column, and the tsvectors plus index take about 54MB.  The relative
> search performance is about what you'd expect from the difference in
> index sizes, ie, wildspeed loses.
>
> So I'm thinking wildspeed really needs to be redesigned if it's to be
> anything but a toy.  I can't see putting it into contrib in this form.
>
> One idea that I had was to break the given string into words (splitting
> at spaces or punctuation) and store the rotations of individual words
> instead of the whole string.  (Actually, maybe you only need suffixes
> not rotations, ie for 'abcd' store 'abcd', 'bcd', 'cd', 'd'.)  Then
> similarly break the LIKE pattern apart at words to create word-fragment
> search keys.  In this scheme the operator would always(?) require
> rechecking since any part of the pattern involving punctuation wouldn't
> be checkable by the index.  The advantage is that the index bloat factor
> is governed by the average word length not the average whole-string
> length.
>
> There are probably other approaches that would help, too.
>
>             regards, tom lane
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

pgsql-patches by date:

Previous
From: Andrew Chernow
Date:
Subject: Re: libpq object hooks
Next
From: "Merlin Moncure"
Date:
Subject: Re: libpq object hooks