Re: Why not keeping positions in GIN? - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: Why not keeping positions in GIN?
Date
Msg-id Pine.LNX.4.64.0705281722520.12152@sn.sai.msu.ru
In response to Re: Why not keeping positions in GIN?  ("Hitoshi Harada" <hitoshi_harada@forcia.com>)
List pgsql-hackers
Hitoshi,

there is no problem writing an n-gram dictionary for tsearch2! The problem
is how to define word boundaries.
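
To illustrate the point from the quoted message below (why an n-gram index
needs positions and not only posting lists), here is a minimal sketch in
Python. It is not tsearch2 or GIN code: the function names, the in-memory
dict layout, and the choice of character bigrams are all assumptions made
for the example.

```python
from collections import defaultdict


def bigrams(text):
    """Yield (position, bigram) for every overlapping 2-character window."""
    for i in range(len(text) - 1):
        yield i, text[i:i + 2]


def build_index(docs):
    """Hypothetical index layout: bigram -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, gram in bigrams(text):
            index[gram][doc_id].append(pos)
    return index


def candidates(index, query):
    """Docs containing every query bigram, using posting lists only."""
    grams = [g for _, g in bigrams(query)]
    sets = [set(index.get(g, {})) for g in grams]
    return set.intersection(*sets) if sets else set()


def phrase_match(index, query):
    """Docs where the query bigrams occur at consecutive positions."""
    grams = [g for _, g in bigrams(query)]
    result = set()
    for doc_id in candidates(index, query):
        # Possible start offsets of the whole query inside this document.
        starts = set(index[grams[0]][doc_id])
        for offset, gram in enumerate(grams[1:], start=1):
            starts &= {p - offset for p in index[gram][doc_id]}
        if starts:
            result.add(doc_id)
    return result


docs = {1: "東京都庁", 2: "京都と東京"}
index = build_index(docs)
print(candidates(index, "東京都"))    # both documents contain both bigrams
print(phrase_match(index, "東京都"))  # only document 1 has them adjacent
```

Without positions, the index can only return the candidate set, and each
candidate has to be rechecked against the original text.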

Oleg

On Sat, 26 May 2007, Hitoshi Harada wrote:

>> FYI, Tatsuo uses tsearch2 for indexing japanese documents. But I agree,
>> n-gram index would be more universal for asian languages.
> Yeah, I know, but in tsearch2 for japanese sample you must use external
> morphological analysis libraries to separate words. It is powerful but I
> need more "lightweight" approach. Also especially when you search for
> non-document(such like titles, names, or pattern in the genome), the
> approach above is not so useful.
>
> As I mentioned, GIN is also powerful for array data type search, so I am
> very expecting it will have additional information.
>
> Anyway, thanks a lot for much information. I try to read it.
>
> Regards,
>
> Hitoshi Harada
>
>> -----Original Message-----
>> From: Oleg Bartunov [mailto:oleg@sai.msu.su]
>> Sent: Saturday, May 26, 2007 10:12 PM
>> To: Hitoshi Harada
>> Cc: pgsql-hackers@postgresql.org
>> Subject: Re: [HACKERS] Why not keeping positions in GIN?
>>
>> On Fri, 25 May 2007, Hitoshi Harada wrote:
>>
>>> Hi,
>>>
>>> I was walking through GIN am source code these days, and found that it
>>> has only posting lists but no positions related those.
>>>
>>> The reason I was doing that is, to try to implement n-gram text search
>>> index on GIN for myself. As you know Japanese is not like English or
>>> other European languages. If you write Japanese (or other 'not
>>> separated') text index by n-gram, it should have entry positions on the
>>> entry as well as the posting lists, because you must know if each split
>>> query key are joined with each other in the data. To know this,
>>> position must be there.
>>
>> FYI, Tatsuo uses tsearch2 for indexing japanese documents. But I agree,
>> n-gram index would be more universal for asian languages.
>>
>>>
>>> It's not only about Japanese. When you search "phrase" for text in
>>> English, the same logic above will be needed. I don't research about
>>> tsearch2 but is there any problem?? Also, in some case int-array
>>> inverted index needs the entry positions as well, I guess. Obtaining
>>> positions with posting lists is "general" enough for GIN, isn't it?
>>>
>>> Is there any future plan around it?
>>
>> Yes, we do have plans. See our todo,
>> http://www.sai.msu.su/~megera/wiki/todo
>> You may read also FTSBOOK, http://www.sai.msu.su/~megera/postgres/fts/doc
>> and slides from PGCon2007,
>> http://www.sai.msu.su/~megera/postgres/talks/fts-pgcon2007.pdf
>>>
>>>
>>> Regards,
>>>
>>> Hitoshi Harada
>>>
>>
>>      Regards,
>>          Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>

