
Re: wildcard search support for pg_trgm - Mailing list pgsql-hackers

From	Jesper Krogh
Subject	Re: wildcard search support for pg_trgm
Date	January 24, 2011 12:14:26
Msg-id	4D3DA553.2070909@krogh.cc
In response to	Re: wildcard search support for pg_trgm (Alexander Korotkov <aekorotkov@gmail.com>)
Responses	Re: wildcard search support for pg_trgm
List	pgsql-hackers


On 2011-01-24 16:34, Alexander Korotkov wrote:
> Hi!
>
> On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański<wulczer@wulczer.org>  wrote:
>
>> I see two issues with this patch. First of them is the resulting index
>> size. I created a table with 5 copies of
>> /usr/share/dict/american-english in it and a gin index on it, using
>> gin_trgm_ops. The results were:
>>
>>   * relation size: 18MB
>>   * index size: 109 MB
>>
>> while without the patch the GIN index was 43 MB. I'm not really sure
>> *why* this happens, as it's not obvious from reading the patch what
>> exactly is this extra data that gets stored in the index, making it more
>> than double its size.
>>
> Are you sure that you did the comparison correctly? The order of index
> building and data insertion does matter. I tried building a GIN index on 5
> copies of /usr/share/dict/american-english with the patch and got a 43 MB
> index.
>
>
>> That leads me to the second issue. The pg_trgm code is already woefully
>> uncommented, and after spending quite some time reading it back and
>> forth I have to admit that I don't really understand what the code does
>> in the first place, and so I don't understand what that patch
>> changes. I read all the changes in detail and I couldn't find any obvious
>> mistakes like reading over array boundaries or dereferencing
>> uninitialized pointers, but I can't tell if the patch is correct
>> semantically. All test cases I threw at it work, though.
>>
> I'll try to write sufficient comments and send a new revision of the patch.
>
Would it be hard to make it support "n-grams" (i.e. making the length
configurable) instead of trigrams? I actually had the feeling that
penta-grams (pen-tuples or whatever they would be called) would
be better for my use case (large substring search in large documents,
e.g. 500 within 3.000).

Larger sizes => less "sensitivity" => faster lookup .. or perhaps my logic
is wrong?
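
To make the idea concrete, here is a rough Python sketch (not pg_trgm code). The names ngrams, raw_ngrams and candidates are made up for illustration. The padding rule for n other than 3 is my assumption; for n = 3 it reduces to the documented pg_trgm convention of two leading blanks and one trailing blank. The second half only mimics how an index could pre-filter candidate documents for a substring search.

```python
def ngrams(word: str, n: int = 3) -> set[str]:
    """N-grams of a single word, padded pg_trgm-style.

    Assumption: n-1 leading and n-2 trailing blanks, which for n=3
    gives the two-leading/one-trailing blank rule pg_trgm documents.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    padded = " " * (n - 1) + word.lower() + " " * (n - 2)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def raw_ngrams(text: str, n: int) -> set[str]:
    """Unpadded n-grams, as one would extract from a substring pattern."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def candidates(query: str, docs: list[str], n: int) -> list[str]:
    """Documents containing every n-gram of the query.

    These are only candidates: each one still needs a recheck against
    the real substring condition. If the query is shorter than n, no
    n-grams are extracted and every document is returned.
    """
    needed = raw_ngrams(query, n)
    return [d for d in docs if needed <= raw_ngrams(d, n)]


docs = ["abcde fgh", "abc bcd cde", "xyz"]
print(sorted(ngrams("cat")))
print(candidates("abcde", docs, 3))  # includes a false candidate
print(candidates("abcde", docs, 5))  # only the real match
```

With trigrams, "abc bcd cde" survives the filter although it does not contain "abcde"; with 5-grams it is rejected up front. The price is that patterns shorter than n cannot be filtered at all, and the number of distinct keys in the index grows.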

Hm.. or will the knngist stuff help me here by selecting the best using
pentuples from the beginning?

The above comment is actually general to pg_trgm and not specific to the
wildcard search patch above.

Jesper
-- 
Jesper
