Re: Partial match in GIN (next version) - Mailing list pgsql-patches

From Tom Lane
Subject Re: Partial match in GIN (next version)
Date
Msg-id 7971.1210961003@sss.pgh.pa.us
In response to Re: Partial match in GIN (next version)  (Teodor Sigaev <teodor@sigaev.ru>)
Responses Re: Partial match in GIN (next version)  (Oleg Bartunov <oleg@sai.msu.su>)
List pgsql-patches
Teodor Sigaev <teodor@sigaev.ru> writes:
> http://www.sigaev.ru/misc/partial_match_gin-0.10.gz
> http://www.sigaev.ru/misc/tsearch_prefix-0.9.gz
> http://www.sigaev.ru/misc/wildspeed-0.12.tgz

I've applied the first two of these with minor editorialization (mostly
fixing documentation).  However, I'm having a hard time convincing myself
that anyone will find wildspeed useful in its current form.  I did a
simple experiment using a table of titles of database papers:

contrib_regression=# select count(*), avg(length(title)) from pub;
 count  |         avg
--------+---------------------
 236984 | 64.7647520507713601
(1 row)

This takes about 22MB on disk as a Postgres table.  I was expecting the
wildspeed index to be about 65 times as large, which is bad enough
already, but actually it weighed in at 2165MB or nearly 100X bigger.
Plus it took forever to build: 35 minutes on a fairly fast machine
with maintenance_work_mem set to 512MB.
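To make the size expectation concrete: as I read it, wildspeed stores every cyclic rotation of each indexed string, so a string of length L contributes L keys of roughly length L each. A back-of-the-envelope sketch (Python, purely illustrative, not wildspeed's actual code):

```python
def rotations(s):
    # All cyclic rotations of s -- the per-string key set that lets
    # any LIKE '%pat%' be answered as a prefix search on some rotation.
    return [s[i:] + s[:i] for i in range(len(s))]

title = "partial match in GIN"
keys = rotations(title)

# One key per character position, each the full string length, so the
# raw key bytes are about len(title) times the string itself -- hence
# the ~65x expectation for an average title length of 64.76 characters.
bloat = sum(len(k) for k in keys) / len(title)
print(len(keys), bloat)  # both equal len(title)
```

The observed ~100x presumably adds per-key index overhead on top of that raw payload factor.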

In comparison, building a conventional full-text-search index (GIN
tsvector) took about 22 seconds including constructing the tsvector
column, and the tsvectors plus index take about 54MB.  The relative
search performance is about what you'd expect from the difference in
index sizes, ie, wildspeed loses.

So I'm thinking wildspeed really needs to be redesigned if it's to be
anything but a toy.  I can't see putting it into contrib in this form.

One idea that I had was to break the given string into words (splitting
at spaces or punctuation) and store the rotations of individual words
instead of the whole string.  (Actually, maybe you only need suffixes
not rotations, ie for 'abcd' store 'abcd', 'bcd', 'cd', 'd'.)  Then
similarly break the LIKE pattern apart at words to create word-fragment
search keys.  In this scheme the operator would always(?) require
rechecking since any part of the pattern involving punctuation wouldn't
be checkable by the index.  The advantage is that the index bloat factor
is governed by the average word length not the average whole-string
length.
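That per-word scheme could be sketched like so (Python, illustrative only; the regex word definition and the lowercasing are my assumptions, not part of the proposal):

```python
import re

def word_suffix_keys(text):
    # Proposed key set: split at spaces/punctuation, then store every
    # suffix of each word ('abcd' -> 'abcd', 'bcd', 'cd', 'd').
    keys = set()
    for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
        for i in range(len(word)):
            keys.add(word[i:])
    return keys

def pattern_fragments(pattern):
    # Break a LIKE pattern at wildcards and punctuation into word
    # fragments; each fragment probes the index as a prefix of some
    # stored suffix.  Hits must be rechecked against the heap, since
    # any inter-word context in the pattern is lost here.
    return [f for f in re.split(r"[%_\W]+", pattern.lower()) if f]

print(word_suffix_keys("GIN index"))
print(pattern_fragments("%data_base%"))  # ['data', 'base']
```

With this layout the key bytes per word are about w(w+1)/2 for word length w, independent of how long the whole title is, which is where the improved bloat factor comes from.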

There are probably other approaches that would help, too.

            regards, tom lane
