Re: wildcard search support for pg_trgm - Mailing list pgsql-hackers

From Jesper Krogh
Subject Re: wildcard search support for pg_trgm
Date
Msg-id 4D3DA553.2070909@krogh.cc
In response to Re: wildcard search support for pg_trgm  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: wildcard search support for pg_trgm  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 2011-01-24 16:34, Alexander Korotkov wrote:
> Hi!
>
> On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański <wulczer@wulczer.org> wrote:
>
>> I see two issues with this patch. The first is the resulting index
>> size. I created a table with 5 copies of
>> /usr/share/dict/american-english in it and a gin index on it, using
>> gin_trgm_ops. The results were:
>>
>>   * relation size: 18MB
>>   * index size: 109 MB
>>
>> while without the patch the GIN index was 43 MB. I'm not really sure
>> *why* this happens, as it's not obvious from reading the patch what
>> exactly is this extra data that gets stored in the index, making it more
>> than double its size.
>>
> Are you sure that you did the comparison correctly? The order of index
> building and data insertion does matter. I tried building a GIN index on 5
> copies of /usr/share/dict/american-english with the patch and got a 43 MB
> index.
>
>
>> That leads me to the second issue. The pg_trgm code is already woefully
>> uncommented, and after spending quite some time reading it back and
>> forth I have to admit that I don't really understand what the code does
>> in the first place, and so I don't understand what the patch
>> changes. I read all the changes in detail and I couldn't find any obvious
>> mistakes like reading over array boundaries or dereferencing
>> uninitialized pointers, but I can't tell if the patch is correct
>> semantically. All test cases I threw at it work, though.
>>
> I'll try to write sufficient comments and send a new revision of the patch.
>
Would it be hard to make it support "n-grams" (i.e. making the length
configurable) instead of trigrams? I actually had the feeling that
pentagrams (5-grams, or whatever they would be called) would
be better for my use case (large substring searches in large documents,
e.g. 500 within 3,000).

Larger n-grams mean less "sensitivity" and therefore faster lookups, or
perhaps my logic is wrong?
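
To make the trade-off concrete, here is a minimal Python sketch of
n-gram candidate filtering with a configurable length. It is not
pg_trgm's actual code: it ignores pg_trgm's word splitting and blank
padding, and the names ngrams, candidates and docs are made up for
illustration. It only shows why a longer n-gram can reject rows that
trigrams would still have to recheck.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Return the distinct n-grams of text (plain sliding window, no padding)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def candidates(docs: list, query: str, n: int) -> list:
    """Return docs containing every n-gram of query.

    This mimics the index filter step only: a real substring search
    would still recheck each candidate against the original text.
    """
    wanted = ngrams(query, n)
    return [d for d in docs if wanted <= ngrams(d, n)]


# Hypothetical sample data: only the first row contains "abcde".
docs = ["xx abcde yy", "abcd bcde"]

# Trigrams abc, bcd, cde all occur in both rows, so both need a recheck.
print(candidates(docs, "abcde", 3))  # ['xx abcde yy', 'abcd bcde']

# The single 5-gram "abcde" occurs only in the first row.
print(candidates(docs, "abcde", 5))  # ['xx abcde yy']

# A query shorter than n yields no n-grams, so nothing can be filtered.
print(candidates(docs, "abc", 5))    # ['xx abcde yy', 'abcd bcde']
```

So in this toy model a larger n does cut down false candidates for long
queries, at the price that queries shorter than n get no help from the
filter at all. Whether that carries over to real index size and lookup
speed would have to be measured.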

Hm, or will the KNN-GiST stuff help me here by selecting the best using
pentagrams from the beginning?

The above comment is actually general to pg_trgm and not specific to the
wildcard search patch above.

Jesper
-- 
Jesper


