Re: GIN - Generalized Inverted iNdex. - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: GIN - Generalized Inverted iNdex.
Date
Msg-id Pine.GSO.4.63.0604071750480.23110@ra.sai.msu.su
Whole thread Raw
In response to GIN - Generalized Inverted iNdex.  (Teodor Sigaev <teodor@sigaev.ru>)
List pgsql-hackers
One addition - some results of Gin testing is available:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/GinTest

Oleg

On Fri, 7 Apr 2006, Teodor Sigaev wrote:

> We (me and Oleg) are glad to present GIN to PostgreSQL. If community will 
> agree, we will commit it to HEAD branch.
>
> http://www.sigaev.ru/gin/gin.gz
> http://www.sigaev.ru/gin/README
>
> Install:
> % cd  pgsql
> % zcat gin.gz | patch -p0
> make and initdb
>
> README:
>
> Gin for PostgreSQL
>
> Gin stands for Generalized Inverted iNdex and should be considered as a 
> genie,
> not drink.
>
> Generalized means that index doesn't know what operation it accelerates, it
> works with strategies, defined for specific data type (Index Method 
> Strategies).
> In that sense, gin is similar to GiST and differ from btree index,which has
> predefined comparison based operations.
>
> Inverted index is an index structure storing a (key,posting list) pairs, 
> where
> posting list is a set of documents, which contain this key (document, 
> usually,
> contains many keys). The primary goal of the Gin index is a scalable full 
> text
> search in PostgreSQL.
>
> Gin is consists of a B-tree builded over entries (ET, entries tree), where
> entry is an element of indexed value ( element of array, lexeme for tsvector)
> and where each tuple in the leaf pages is either pointer to B-tree over item
> pointers (PT, posting tree), or list of item pointers (PL, posting list) if
> tuple is small enough.
>
> Notes: There is no delete operation for ET. The reason for that is that from
> our experience, a set of unique words of a whole collection changed very 
> rare.
> This greatly simplify code and concurrency algorithm.
>
> Gin comes with built-in support for one dimensional arrays ( integer[], 
> text[],
> no support for NULL elements) and following operations:
>
>    * contains : value_array @ query_array
>    * overlap : value_array && query_array
>    * contained: value_array ~ query_array
>
> Synopsis
>
> =# create index txt_idx on aa using gin(a);
>
> Features
>
>    * Concurrency
>    * WAL-logging (recoverable)
>    * user-defined opclass, the scheme is similar to GiST
>    * optimized index creation (make use of maintenance_work_mem to 
> accumulate
>      postings in memory)
>
> Limitations
>
>    * no support for multicolumn indices
>    * Gin doesn't uses scan->kill_prior_tuple & scan->ignore_killed_tuples
>    * Gin search entry only by equality matching, this may be improved in 
> future
>
> Gin interface
>
> OpClass interface (pseudocode). Example for opclass is in ginarayproc.c.
>
> Datum* extractValue(Datum inputValue, uint32* nentries)
>    Returns array of Datum of entries of value to be indexed, nentries should
>    contains number of returning entries
> int compareEntry( Datum a, Datum b )
>    Compares two entries (not the indexing values!)
> Datum* extractQuery(Datum query, uint32* nentries, StrategyNumber n)
>    Returns array of Datum of entries of query to be searched, n contains
>    Strategy number of operation.
> bool consistent( bool[] check, StrategyNumber n, Datum query)
>    The size of check array is the same as sizeof of array returned by
>    extractQuery. Each element of check array is true if indexed value has
>    corresponding entry in the query, i.e. if check[i] == TRUE then i-th 
> entry
>    of query is presented in indexed value. Function should returns true if
>    indexed value matches by StrategyNumber and query.
>
> Open items (We appreciate any comments, help, advice and recommendations).
>
>    * teach optimizer/executor, that GIN is intrinsically clustered, i.e., it
>      always returns ItemPointer in ascending order.
>    * tweak gincostestimate
>    * GIN stores several ItemPointer to heap tuple, so vacuum full produces
>      warning message:
>     WARNING:  index "idx" contains 88395 row versions, but table contains 
> 51812 row versions
>     HINT:  Rebuild the index with REINDEX.
>
> TODO
>
> Nearest future:
>
>    * add tsearch2 opclass
>    * create full scale test suite
>
> Distant future:
>
>    * Replace entry btree to something like to GiST
>    * Add multicolumn support
>    * Optimize insert operation (use background index insertion)
>
>
> Authors:
>     Teodor Sigaev <teodor@sigaev.ru>
>     Oleg Bartunov <oleg@sai.msu.su>
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


pgsql-hackers by date:

Previous
From: "Nicolas Barbier"
Date:
Subject: Re: WAL Bypass for indexes
Next
From: Oleg Bartunov
Date:
Subject: Re: GIN - Generalized Inverted iNdex.