Thread: Term positions in GIN fulltext index
Hello, I'm using a GIN index for a text column on a big table. I use it to rank the rows, but I also need to get the term positions for each document of a subset of documents for one or more terms. I suppose these positions are stored in the index as the to_tsvector shows them : 'lexeme':{positions} I've searched and asked on general postgresql mailing list, and I assume there is no simple way to get these term positions. For example, for 2 rows of a 'docs' table with a text column 'text' (indexed with GIN) : 'I get lexemes and I get term positions.' 'Did you get the positions ?' I'd need a function like this : select term_positions(text, 'get') from docs; id_doc | positions --------+----------- 1 | {2,6} 2 | {3} I'd like to add this function in my database, for experimental purpose. I got a look at the source code but didn't find some code example using the GIN index ; I can not figure out where the GIN index is read as a tsvector or where the '@@' operator gets the matching tsvectors for the terms of the tsquery. Any help about where to start reading would be very welcome :) Regards, Yoann Moreau
Yoann Moreau <yoann.moreau@univ-avignon.fr> wrote: > I'd need a function like this : > select term_positions(text, 'get') from docs; > id_doc | positions > --------+----------- > 1 | {2,6} > 2 | {3} > > I'd like to add this function in my database, for experimental > purpose. I got a look at the source code but didn't find some code > example using the GIN index ; > I can not figure out where the GIN index is read as a tsvector > or where the '@@' operator gets the matching tsvectors for the > terms of the tsquery. > > Any help about where to start reading would be very welcome :) I'm not really clear on what you want to read about. Do you need help creating your own function on the fly, or with how to access the information to write the function? If the former, these links might help: http://www.postgresql.org/docs/9.1/interactive/extend.html http://www.postgresql.org/docs/9.1/interactive/sql-createfunction.html If the latter, have you looked at this file?: src/backend/utils/adt/tsrank.c Or was it something else that I'm missing? -Kevin
On Nov3, 2011, at 16:52 , Yoann Moreau wrote: > I'm using a GIN index for a text column on a big table. I use it to rank > the rows, but I also need to get the term positions for each document of a > subset of documents for one or more terms. I suppose these positions are stored > in the index as the to_tsvector shows them : 'lexeme':{positions} There's a difference between values of type tsvector, and what GIN indices on columns or expressions of type tsvector store. Values of type tsvector, of course, store weights and positions for each lexem. But GIN indices store only the bare lexems without weights and positions. In general, GIN indices work by extracting "elements" from values to be indexed, and store these "elements" in a btree, together with pointers to the rows containing the indexed values. Thus, if you created a function index on the results of to_tsvector, i.e. if you do CREATE INDEX gin_idx ON docs USING gin (to_tsvector(text)) then the weights and positions aren't stored anywhere - they'll only exists in the transient, in-memory tsvector value that to_tsvector returns, but not in the on-disk GIN index gin_idx. For the positions and weights to be store, you need to store the result of to_tsvector in a column of type tsvector, say text_tsvector, and create the index as CREATE INDEX gin_idx ON docs USING gin (text_tsvector) The GIN index gin_idx still won't store weights and positions, but the column text_tsvector will. > For example, for 2 rows of a 'docs' table with a text column 'text' (indexed with GIN) : > 'I get lexemes and I get term positions.' > 'Did you get the positions ?' > > I'd need a function like this : > select term_positions(text, 'get') from docs; > id_doc | positions > --------+----------- > 1 | {2,6} > 2 | {3} As I pointed out above, you'll first need to make sure to store the result of to_tsvector in a columns. Then, what you need seems to be a functions that takes a tsvector value and returns the contained lexems as individual rows. Postgres doesn't seem to contain such a function currently (don't believe that, though - go and recheck the documentation. I don't know all thousands of built-in functions by heart). But it's easy to add one. You could either use PL/pgSQL to parse the tsvector's textual representation, or write a C function. If you go the PL/pgSQL route, regexp_split_to_table() might come in handy. > I'd like to add this function in my database, for experimental purpose. > I got a look at the source code but didn't find some code example using the GIN index ; > I can not figure out where the GIN index is read as a tsvector > or where the '@@' operator gets the matching tsvectors for the terms of the tsquery. The basic flow of information is: to_tsvector takes a string, parses and, applies various dictionaries according to the textsearch configuration, and finally returns a value of type tsvector. See the files names tsvector* for the implementation of that process, and for the implementation of the various support functions which work on values of type tsvector. The GIN index machinery then calls the tsvector's extractValue() function to extract the "elements" mentioned above from the tsvector value. That function is called gin_extract_tsvector() and lives in tsginidx.c. The extracted "elements" are then added to the GIN index's internal btree. During query execution, if postgres sees that the operator tsvector @@ tsquery is used, and that the left argument is a GIN-indexed column, it will use the extractQuery() and consistent() functions to quickly find matching rows by scanning the internal btree index. In the case of tsvector and tsquery, the implementation of these functions are gin_extract_tsquery() and gin_tsquery_consistent(), found also in tsginidx.c. I suggest you read http://www.postgresql.org/docs/9.1/interactive/gin.html, it explains all of this in (much) more detail. best regards, Florian Pflug
Yoann Moreau <yoann.moreau@univ-avignon.fr> writes: > I'm using a GIN index for a text column on a big table. I use it to rank > the rows, but I also need to get the term positions for each document of a > subset of documents for one or more terms. I suppose these positions are stored > in the index as the to_tsvector shows them : 'lexeme':{positions} I'm pretty sure that a GIN index on tsvector does *not* store positions --- it only knows about the strings. Don't know one way or the other about GIST. regards, tom lane
On Thu, Nov 3, 2011 at 4:52 PM, Yoann Moreau <yoann.moreau@univ-avignon.fr> wrote: > I'd need a function like this : > select term_positions(text, 'get') from docs; > id_doc | positions > --------+----------- > 1 | {2,6} > 2 | {3} > check this out: http://www.postgresql.org/docs/current/static/textsearch-debugging.html ts_debug does what You want, and more. Look at it's source - it`s a plain sql function, You can make something based on it. Greetings Marcin Mańk
On Thu, Nov 3, 2011 at 11:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
GiST index doesn't store positions too. See gtsvector_compress. It converts tsvector to array of crc32 of words. If that value is anyway too large then function converts it to signature.Yoann Moreau <yoann.moreau@univ-avignon.fr> writes:I'm pretty sure that a GIN index on tsvector does *not* store positions
> I'm using a GIN index for a text column on a big table. I use it to rank
> the rows, but I also need to get the term positions for each document of a
> subset of documents for one or more terms. I suppose these positions are stored
> in the index as the to_tsvector shows them : 'lexeme':{positions}
--- it only knows about the strings. Don't know one way or the other
about GIST.
------
With best regards,
Alexander Korotkov.
On 03/11/11 19:19, Florian Pflug wrote: > There's a difference between values of type tsvector, and what GIN indices > on columns or expressions of type tsvector store. I was wondering what was the point about storing the tsvector in the table, I now understand. I then should use the GIN index to rank my documents, and work on the stored tsvectors for positions. > As I pointed out above, you'll first need to make sure to store the result of > to_tsvector in a columns. Then, what you need seems to be a functions that > takes a tsvector value and returns the contained lexems as individual rows. > > Postgres doesn't seem to contain such a function currently (don't believe that, > though - go and recheck the documentation. I don't know all thousands of built-in > functions by heart). But it's easy to add one. You could either use PL/pgSQL > to parse the tsvector's textual representation, or write a C function. If you > go the PL/pgSQL route, regexp_split_to_table() might come in handy. This seems easier to program than what I was thinking about, I'm going to do that. But I'm wondering about size of database with the GIN index plus the tsvector column, and performance about parsing the whole tsvectors for each document I need positions from (as I need them for a very few terms). Maybe some external fulltext engine managing lexemes and positions would be more efficient for my purpose. I'll try some different things and let you know the results. Thanks all for your help Regards, Yoann Moreau
On Nov4, 2011, at 11:15 , Yoann Moreau wrote: > On 03/11/11 19:19, Florian Pflug wrote: >> Postgres doesn't seem to contain such a function currently (don't believe that, >> though - go and recheck the documentation. I don't know all thousands of built-in >> functions by heart). But it's easy to add one. You could either use PL/pgSQL >> to parse the tsvector's textual representation, or write a C function. If you >> go the PL/pgSQL route, regexp_split_to_table() might come in handy. > > This seems easier to program than what I was thinking about, I'm going to do that. > But I'm wondering about size of database with the GIN index plus the tsvector column, > and performance about parsing the whole tsvectors for each document I need positions > from (as I need them for a very few terms). AFAICS, the internal storage layout of tsvector should allow you to extract an individual lexem's positions quite efficiently (with time complexity log(N) where N is the number of lexems in the tsvector). Doing so will require you to implement your function in C though - any solution that works from a tsvector's textual representation will obviously have time complexity N. best regards, Florian Pflug
On 04/11/11 12:15, Florian Pflug wrote: > AFAICS, the internal storage layout of tsvector should allow you to extract an > individual lexem's positions quite efficiently (with time complexity log(N) where > N is the number of lexems in the tsvector). Doing so will require you to implement > your function in C though - any solution that works from a tsvector's textual > representation will obviously have time complexity N. > > best regards, > Florian Pflug > I'll do a pl/pgsql function first, I need to test it with other parts of the project. But I will look for more efficient algorithms for a C function as soon as possible if we still decide to use the postgresql fulltext engine. Regards, Yoann Moreau