Re: GIN - Generalized Inverted iNdex. - Mailing list pgsql-hackers
From | Oleg Bartunov |
---|---|
Subject | Re: GIN - Generalized Inverted iNdex. |
Date | |
Msg-id | Pine.GSO.4.63.0604071750480.23110@ra.sai.msu.su Whole thread Raw |
In response to | GIN - Generalized Inverted iNdex. (Teodor Sigaev <teodor@sigaev.ru>) |
List | pgsql-hackers |
One addition - some results of Gin testing is available: http://www.sai.msu.su/~megera/oddmuse/index.cgi/GinTest Oleg On Fri, 7 Apr 2006, Teodor Sigaev wrote: > We (me and Oleg) are glad to present GIN to PostgreSQL. If community will > agree, we will commit it to HEAD branch. > > http://www.sigaev.ru/gin/gin.gz > http://www.sigaev.ru/gin/README > > Install: > % cd pgsql > % zcat gin.gz | patch -p0 > make and initdb > > README: > > Gin for PostgreSQL > > Gin stands for Generalized Inverted iNdex and should be considered as a > genie, > not drink. > > Generalized means that index doesn't know what operation it accelerates, it > works with strategies, defined for specific data type (Index Method > Strategies). > In that sense, gin is similar to GiST and differ from btree index,which has > predefined comparison based operations. > > Inverted index is an index structure storing a (key,posting list) pairs, > where > posting list is a set of documents, which contain this key (document, > usually, > contains many keys). The primary goal of the Gin index is a scalable full > text > search in PostgreSQL. > > Gin is consists of a B-tree builded over entries (ET, entries tree), where > entry is an element of indexed value ( element of array, lexeme for tsvector) > and where each tuple in the leaf pages is either pointer to B-tree over item > pointers (PT, posting tree), or list of item pointers (PL, posting list) if > tuple is small enough. > > Notes: There is no delete operation for ET. The reason for that is that from > our experience, a set of unique words of a whole collection changed very > rare. > This greatly simplify code and concurrency algorithm. > > Gin comes with built-in support for one dimensional arrays ( integer[], > text[], > no support for NULL elements) and following operations: > > * contains : value_array @ query_array > * overlap : value_array && query_array > * contained: value_array ~ query_array > > Synopsis > > =# create index txt_idx on aa using gin(a); > > Features > > * Concurrency > * WAL-logging (recoverable) > * user-defined opclass, the scheme is similar to GiST > * optimized index creation (make use of maintenance_work_mem to > accumulate > postings in memory) > > Limitations > > * no support for multicolumn indices > * Gin doesn't uses scan->kill_prior_tuple & scan->ignore_killed_tuples > * Gin search entry only by equality matching, this may be improved in > future > > Gin interface > > OpClass interface (pseudocode). Example for opclass is in ginarayproc.c. > > Datum* extractValue(Datum inputValue, uint32* nentries) > Returns array of Datum of entries of value to be indexed, nentries should > contains number of returning entries > int compareEntry( Datum a, Datum b ) > Compares two entries (not the indexing values!) > Datum* extractQuery(Datum query, uint32* nentries, StrategyNumber n) > Returns array of Datum of entries of query to be searched, n contains > Strategy number of operation. > bool consistent( bool[] check, StrategyNumber n, Datum query) > The size of check array is the same as sizeof of array returned by > extractQuery. Each element of check array is true if indexed value has > corresponding entry in the query, i.e. if check[i] == TRUE then i-th > entry > of query is presented in indexed value. Function should returns true if > indexed value matches by StrategyNumber and query. > > Open items (We appreciate any comments, help, advice and recommendations). > > * teach optimizer/executor, that GIN is intrinsically clustered, i.e., it > always returns ItemPointer in ascending order. > * tweak gincostestimate > * GIN stores several ItemPointer to heap tuple, so vacuum full produces > warning message: > WARNING: index "idx" contains 88395 row versions, but table contains > 51812 row versions > HINT: Rebuild the index with REINDEX. > > TODO > > Nearest future: > > * add tsearch2 opclass > * create full scale test suite > > Distant future: > > * Replace entry btree to something like to GiST > * Add multicolumn support > * Optimize insert operation (use background index insertion) > > > Authors: > Teodor Sigaev <teodor@sigaev.ru> > Oleg Bartunov <oleg@sai.msu.su> > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
pgsql-hackers by date: