Re: Term positions in GIN fulltext index - Mailing list pgsql-hackers

From Florian Pflug
Subject Re: Term positions in GIN fulltext index
Date
Msg-id 20322C15-13A6-41C2-8860-BED67328378D@phlo.org
Whole thread Raw
In response to Term positions in GIN fulltext index  (Yoann Moreau <yoann.moreau@univ-avignon.fr>)
Responses Re: Term positions in GIN fulltext index
List pgsql-hackers
On Nov3, 2011, at 16:52 , Yoann Moreau wrote:
> I'm using a GIN index for a text column on a big table. I use it to rank
> the rows, but I also need to get the term positions for each document of a
> subset of documents for one or more terms. I suppose these positions are stored
> in the index as the to_tsvector shows them : 'lexeme':{positions}

There's a difference between values of type tsvector, and what GIN indices
on columns or expressions of type tsvector store.

Values of type tsvector, of course, store weights and positions for each lexem.

But GIN indices store only the bare lexems without weights and positions. In
general, GIN indices work by extracting "elements" from values to be indexed,
and store these "elements" in a btree, together with pointers to the rows
containing the indexed values.

Thus, if you created a function index on the results of to_tsvector, i.e.
if you do CREATE INDEX gin_idx ON docs USING gin (to_tsvector(text))
then the weights and positions aren't stored anywhere - they'll only exists in
the transient, in-memory tsvector value that to_tsvector returns, but not in
the on-disk GIN index gin_idx.

For the positions and weights to be store, you need to store the result of
to_tsvector in a column of type tsvector, say text_tsvector, and create the
index as CREATE INDEX gin_idx ON docs USING gin (text_tsvector)

The GIN index gin_idx still won't store weights and positions, but the column
text_tsvector will.

> For example, for 2 rows of a 'docs' table with a text column 'text' (indexed with GIN) :
> 'I get lexemes and I get term positions.'
> 'Did you get the positions ?'
>
> I'd need a function like this :
> select term_positions(text, 'get') from docs;
> id_doc | positions
> --------+-----------
>      1 |     {2,6}
>      2 |       {3}

As I pointed out above, you'll first need to make sure to store the result of
to_tsvector in a columns. Then, what you need seems to be a functions that
takes a tsvector value and returns the contained lexems as individual rows.

Postgres doesn't seem to contain such a function currently (don't believe that,
though - go and recheck the documentation. I don't know all thousands of built-in
functions by heart). But it's easy to add one. You could either use PL/pgSQL
to parse the tsvector's textual representation, or write a C function. If you
go the PL/pgSQL route, regexp_split_to_table() might come in handy.

> I'd like to add this function in my database, for experimental purpose.
> I got a look at the source code but didn't find some code example using the GIN index ;
> I can not figure out where the GIN index is read as a tsvector
> or where the '@@' operator gets the matching tsvectors for the terms of the tsquery.

The basic flow of information is:

to_tsvector takes a string, parses and, applies various dictionaries according
to the textsearch configuration, and finally returns a value of type tsvector.
See the files names tsvector* for the implementation of that process, and for
the implementation of the various support functions which work on values of type
tsvector.

The GIN index machinery then calls the tsvector's extractValue() function to extract
the "elements" mentioned above from the tsvector value. That function is called
gin_extract_tsvector() and lives in tsginidx.c. The extracted "elements" are
then added to the GIN index's internal btree.

During query execution, if postgres sees that the operator tsvector @@ tsquery
is used, and that the left argument is a GIN-indexed column, it will use the
extractQuery() and consistent() functions to quickly find matching rows by
scanning the internal btree index. In the case of tsvector and tsquery, the
implementation of these functions are gin_extract_tsquery() and
gin_tsquery_consistent(), found also in tsginidx.c.

I suggest you read http://www.postgresql.org/docs/9.1/interactive/gin.html,
it explains all of this in (much) more detail.

best regards,
Florian Pflug



pgsql-hackers by date:

Previous
From: Pavan Deolasee
Date:
Subject: Storing hot members of PGPROC out of the band
Next
From: Simon Riggs
Date:
Subject: Further plans to refactor xlog.c