Thread: GIN : Working with term positions

GIN : Working with term positions

From

Yoann Moreau

Date:

21 October 2011, 07:30:14

Hello,
I'm using a GIN index for a text column on a big table. I use it to rank
the rows, but I also need to get the term positions for each document of
a subset of documents. I assume these positions are stored in the index,
because doc says positions can be used for cover density ranking and
because to_tsvector function gives them :

select * from to_tsvector('I get lexemes and I get term positions.');
             to_tsvector
----------------------------------------
  'get':2,6 'lexem':3 'posit':8 'term':7

I can get the term positions with to_tsvector but only by parsing the
result string, is there any more handy way ? Something like :
select * from term_and_positions('I get lexemes and I get term positions.');
  term    | positions
---------+-----------
    'get' |     {2,6}
  'lexem' |       {3}


Then, from the term positions, I need to get the character offset of
these term positions. I assume it is NOT stored in the GIN index. By
character offset I mean the character count from string begining to the
term. For the previous example it would be : 'get' --> {2,20}.

I thought about using ts_headline to return the whole text with term
tagged and then parse it to compute the character offsets from the tags.
But this function is very slow, seems like it does not use the GIN index
at all. And I suppose it can't because there is no way to know from a
term position where its substring is in the text.

Now I think the only solution is to make my own C function parsing the
text like to_tsvector does and counting terms AND characters read from
the begining of the text to match them. I got a look on the code, and it
does not seems easy to do because characters offset or string length are
never used by the parsetext function (ts_parse.c). If you have any other
suggestion, would love to hear that !

Regards, Yoann Moreau

Re: GIN : Working with term positions

From

Yoann Moreau

Date:

26 October 2011, 13:20:14

On 21/10/11 12:23, Yoann Moreau wrote:
> Hello,
> I'm using a GIN index for a text column on a big table. I use it to
> rank the rows, but I also need to get the term positions for each
> document of a subset of documents. I assume these positions are stored
> in the index, because doc says positions can be used for cover density
> ranking and because to_tsvector function gives them :
>
> select * from to_tsvector('I get lexemes and I get term positions.');
>             to_tsvector
> ----------------------------------------
>  'get':2,6 'lexem':3 'posit':8 'term':7
>
> I can get the term positions with to_tsvector but only by parsing the
> result string, is there any more handy way ? Something like :
> select * from term_and_positions('I get lexemes and I get term
> positions.');
>  term    | positions
> ---------+-----------
>    'get' |     {2,6}
>  'lexem' |       {3}
>
>
> Then, from the term positions, I need to get the character offset of
> these term positions. I assume it is NOT stored in the GIN index. By
> character offset I mean the character count from string begining to
> the term. For the previous example it would be : 'get' --> {2,20}.
>
> I thought about using ts_headline to return the whole text with term
> tagged and then parse it to compute the character offsets from the
> tags. But this function is very slow, seems like it does not use the
> GIN index at all. And I suppose it can't because there is no way to
> know from a term position where its substring is in the text.
>
> Now I think the only solution is to make my own C function parsing the
> text like to_tsvector does and counting terms AND characters read from
> the begining of the text to match them. I got a look on the code, and
> it does not seems easy to do because characters offset or string
> length are never used by the parsetext function (ts_parse.c). If you
> have any other suggestion, would love to hear that !
>
> Regards, Yoann Moreau

Hello again, I'm sorry my need is actually a bit different than what I
have asked. I need to get the term positions using the GIN index, when I
query my text column, i.e. for a given term.

For example for 2 rows of a 'docs' table with a text column 'text' :
'I get lexemes and I get term positions.'
'Did you get the positions ?'

I'd need a function like this :
select term_positions(text, 'get') from docs;
   id_doc | positions
---------+-----------
        1 |     {2,6}
        2 |       {3}

I know it can't be as simple as this, because the query would first need
to be filtered with a WHERE using a tsquery and this can't be done in
the function called like in my example. I suppose such a feature does
not exist, but is there any way to get the positions of the matching
terms when querying a GIN index ?

The only possible way I imagine right now is to firstly filter the rows
with "to_tsvector(text) @@ to_tsquery('get')" and then call
"to_tsvector(text)" for the n highest ranked rows, parsing the string
returned by the function to find the term and its positions. But would
be way more efficient to get them directly at the first call when
matching the terms with @@ operator. I know it would be impossible if
the query contain more than 1 term because it can't return 2 arrays of
position in one row (i.e. for one document), but for now I'm trying to
do this for 1 query term.
Any help or advice would be welcome !

By the way, I have done the C function computing the character offset of
a given term position for a text column. It's not done in a good way,
but it's more a topic for hackers postgresql list.

Regards, Yoann Moreau

Re: GIN : Working with term positions

From

Cédric Villemain

Date:

27 October 2011, 07:24:20

2011/10/26 Yoann Moreau <yoann.moreau@univ-avignon.fr>:
> On 21/10/11 12:23, Yoann Moreau wrote:
>>
>> Hello,
>> I'm using a GIN index for a text column on a big table. I use it to rank
>> the rows, but I also need to get the term positions for each document of a
>> subset of documents. I assume these positions are stored in the index,
>> because doc says positions can be used for cover density ranking and because
>> to_tsvector function gives them :
>>
>> select * from to_tsvector('I get lexemes and I get term positions.');
>>            to_tsvector
>> ----------------------------------------
>>  'get':2,6 'lexem':3 'posit':8 'term':7
>>
>> I can get the term positions with to_tsvector but only by parsing the
>> result string, is there any more handy way ? Something like :
>> select * from term_and_positions('I get lexemes and I get term
>> positions.');
>>  term    | positions
>> ---------+-----------
>>   'get' |     {2,6}
>>  'lexem' |       {3}
>>
>>
>> Then, from the term positions, I need to get the character offset of these
>> term positions. I assume it is NOT stored in the GIN index. By character
>> offset I mean the character count from string begining to the term. For the
>> previous example it would be : 'get' --> {2,20}.
>>
>> I thought about using ts_headline to return the whole text with term
>> tagged and then parse it to compute the character offsets from the tags. But
>> this function is very slow, seems like it does not use the GIN index at all.
>> And I suppose it can't because there is no way to know from a term position
>> where its substring is in the text.
>>
>> Now I think the only solution is to make my own C function parsing the
>> text like to_tsvector does and counting terms AND characters read from the
>> begining of the text to match them. I got a look on the code, and it does
>> not seems easy to do because characters offset or string length are never
>> used by the parsetext function (ts_parse.c). If you have any other
>> suggestion, would love to hear that !
>>
>> Regards, Yoann Moreau
>
> Hello again, I'm sorry my need is actually a bit different than what I have
> asked. I need to get the term positions using the GIN index, when I query my
> text column, i.e. for a given term.
>
> For example for 2 rows of a 'docs' table with a text column 'text' :
> 'I get lexemes and I get term positions.'
> 'Did you get the positions ?'
>
> I'd need a function like this :
> select term_positions(text, 'get') from docs;
>  id_doc | positions
> ---------+-----------
>       1 |     {2,6}
>       2 |       {3}
>
> I know it can't be as simple as this, because the query would first need to
> be filtered with a WHERE using a tsquery and this can't be done in the
> function called like in my example. I suppose such a feature does not exist,
> but is there any way to get the positions of the matching terms when
> querying a GIN index ?
>
> The only possible way I imagine right now is to firstly filter the rows with
> "to_tsvector(text) @@ to_tsquery('get')" and then call "to_tsvector(text)"
> for the n highest ranked rows, parsing the string returned by the function
> to find the term and its positions. But would be way more efficient to get
> them directly at the first call when matching the terms with @@ operator. I
> know it would be impossible if the query contain more than 1 term because it
> can't return 2 arrays of position in one row (i.e. for one document), but
> for now I'm trying to do this for 1 query term.
> Any help or advice would be welcome !
>
> By the way, I have done the C function computing the character offset of a
> given term position for a text column. It's not done in a good way, but it's
> more a topic for hackers postgresql list.
>

Don't forget when you success that word positions are affected by the
word removed by stop-words.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation