Re: Term positions in GIN fulltext index - Mailing list pgsql-hackers
From | Florian Pflug |
---|---|
Subject | Re: Term positions in GIN fulltext index |
Date | |
Msg-id | 20322C15-13A6-41C2-8860-BED67328378D@phlo.org Whole thread Raw |
In response to | Term positions in GIN fulltext index (Yoann Moreau <yoann.moreau@univ-avignon.fr>) |
Responses |
Re: Term positions in GIN fulltext index
|
List | pgsql-hackers |
On Nov3, 2011, at 16:52 , Yoann Moreau wrote: > I'm using a GIN index for a text column on a big table. I use it to rank > the rows, but I also need to get the term positions for each document of a > subset of documents for one or more terms. I suppose these positions are stored > in the index as the to_tsvector shows them : 'lexeme':{positions} There's a difference between values of type tsvector, and what GIN indices on columns or expressions of type tsvector store. Values of type tsvector, of course, store weights and positions for each lexem. But GIN indices store only the bare lexems without weights and positions. In general, GIN indices work by extracting "elements" from values to be indexed, and store these "elements" in a btree, together with pointers to the rows containing the indexed values. Thus, if you created a function index on the results of to_tsvector, i.e. if you do CREATE INDEX gin_idx ON docs USING gin (to_tsvector(text)) then the weights and positions aren't stored anywhere - they'll only exists in the transient, in-memory tsvector value that to_tsvector returns, but not in the on-disk GIN index gin_idx. For the positions and weights to be store, you need to store the result of to_tsvector in a column of type tsvector, say text_tsvector, and create the index as CREATE INDEX gin_idx ON docs USING gin (text_tsvector) The GIN index gin_idx still won't store weights and positions, but the column text_tsvector will. > For example, for 2 rows of a 'docs' table with a text column 'text' (indexed with GIN) : > 'I get lexemes and I get term positions.' > 'Did you get the positions ?' > > I'd need a function like this : > select term_positions(text, 'get') from docs; > id_doc | positions > --------+----------- > 1 | {2,6} > 2 | {3} As I pointed out above, you'll first need to make sure to store the result of to_tsvector in a columns. Then, what you need seems to be a functions that takes a tsvector value and returns the contained lexems as individual rows. Postgres doesn't seem to contain such a function currently (don't believe that, though - go and recheck the documentation. I don't know all thousands of built-in functions by heart). But it's easy to add one. You could either use PL/pgSQL to parse the tsvector's textual representation, or write a C function. If you go the PL/pgSQL route, regexp_split_to_table() might come in handy. > I'd like to add this function in my database, for experimental purpose. > I got a look at the source code but didn't find some code example using the GIN index ; > I can not figure out where the GIN index is read as a tsvector > or where the '@@' operator gets the matching tsvectors for the terms of the tsquery. The basic flow of information is: to_tsvector takes a string, parses and, applies various dictionaries according to the textsearch configuration, and finally returns a value of type tsvector. See the files names tsvector* for the implementation of that process, and for the implementation of the various support functions which work on values of type tsvector. The GIN index machinery then calls the tsvector's extractValue() function to extract the "elements" mentioned above from the tsvector value. That function is called gin_extract_tsvector() and lives in tsginidx.c. The extracted "elements" are then added to the GIN index's internal btree. During query execution, if postgres sees that the operator tsvector @@ tsquery is used, and that the left argument is a GIN-indexed column, it will use the extractQuery() and consistent() functions to quickly find matching rows by scanning the internal btree index. In the case of tsvector and tsquery, the implementation of these functions are gin_extract_tsquery() and gin_tsquery_consistent(), found also in tsginidx.c. I suggest you read http://www.postgresql.org/docs/9.1/interactive/gin.html, it explains all of this in (much) more detail. best regards, Florian Pflug
pgsql-hackers by date: