Re: Updated tsearch documentation - Mailing list pgsql-hackers
From | Oleg Bartunov |
---|---|
Subject | Re: Updated tsearch documentation |
Date | |
Msg-id | Pine.LNX.4.64.0706211347360.1881@sn.sai.msu.ru |
In response to | Re: Updated tsearch documentation (Bruce Momjian <bruce@momjian.us>) |
List | pgsql-hackers |
On Wed, 20 Jun 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> On Wed, 20 Jun 2007, Bruce Momjian wrote:
>>>> Comments to editorial work of Bruce Momjian.
>>>>
>>>> fulltext-intro.sgml:
>>>>
>>>> it is useful to have a predefined list of lexemes.
>>>>
>>>> Bruce, here should be list of types of lexemes !
>>>
>>> Agreed. Are the list of lexemes parser-specific?
>>>
>>
>> yes, it it parser which defines types of lexemes.
>
> OK, how will users get a list of supported lexemes? Do we need a list
> per supported parser?

it's documented, see "Parser functions" for token_type();

postgres=# select * from token_type('default');
 tokid |    alias     |            description
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity

>>>> The integer option controls several behaviors which is done using bit-wise
>>>> fields and <literal>|</literal> (for example, <literal>2|4</literal>):
>>>> <!-- why so complex? -->
>>>>
>>>>> to avoid 2 arguments
>>>
>>> But I don't see why you would want to set two of those values --- they
>>> seem mutually exclusive, e.g.
>>>
>>> 1 divides the rank by the 1 + logarithm of the document length
>>> 2 divides the rank by the length itself
>>>
>>> I assume you do either one, not both.
>>
>> but what's about others variants ?
>
> OK, here is the full list:
>
> 0 (the default) ignores document length
> 1 divides the rank by the 1 + logarithm of the document length
> 2 divides the rank by the length itself
> 4 divides the rank by the mean harmonic distance between extents
> 8 divides the rank by the number of unique words in document
> 16 divides the rank by 1 + logarithm of the number of unique words in
> document
>
> so which ones would be both enabled?

No particular ones! This is a list of the possible values of the rank
normalization flag, which can be ORed together.

=# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1);
  rank_cd
-----------
 0.0279055

=# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1|16);
  rank_cd
-----------
 0.0139528

>
>>
>> What I missed is the definition of extent.
>>
>>> From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking
>> Extent is a shortest and non-nested sequence of words, which satisfy a query.
>
> I don't understand how that relates to this.

Because of "4 divides the rank by the mean harmonic distance between extents"
                                                                     ^^^^^^^
It reflects how densely the extents that satisfy the query are packed in the
document.

>
>>>
>>>> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
>>>> if none is specified that the current configuration is used.
>>>>
>>>>> I don't understand this question
>>>
>>> Same issue as above --- why allow a number here when the name works just
>>> fine. We don't allow tables to be specified by number, so why
>>> configurations?
>>>
>>>> <para>
>>>> <!-- why? -->
>>>> Note that the cascade dropping of the <function>headline</function> function
>>>> cause dropping of the <literal>parser</literal> used in fulltext configuration
>>>> <replaceable>tsname</replaceable>.
>>>> </para>
>>>>
>>>>> hmm, probably it should be reversed - cascade dropping of the parser cause
>>>>> dropping of the headline function.
>>>
>>> Agreed.
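
The values in the rank normalization list quoted earlier in this message are powers of two, which is why several of them can be packed into one integer argument with |. Below is a minimal Python sketch of how such a mask decomposes into individual flags. It is not PostgreSQL code: the helper name describe_normalization is hypothetical, and the descriptions are simply copied from the quoted list.

```python
# Flag values and descriptions copied from the list quoted in this message.
# Each value is a distinct power of two, so any combination is unambiguous.
NORMALIZATION_FLAGS = {
    1: "divide rank by 1 + logarithm of the document length",
    2: "divide rank by the document length",
    4: "divide rank by the mean harmonic distance between extents",
    8: "divide rank by the number of unique words in the document",
    16: "divide rank by 1 + logarithm of the number of unique words",
}


def describe_normalization(mask):
    """Hypothetical helper: list the descriptions of every flag set in mask."""
    if mask == 0:
        return ["ignore document length (the default)"]
    # A flag is enabled when its bit survives a bitwise AND with the mask.
    return [text for bit, text in NORMALIZATION_FLAGS.items() if mask & bit]


combined = 1 | 16  # the same mask as in the rank_cd(..., 1|16) call
print(combined)
for description in describe_normalization(combined):
    print(description)
```

A single integer keeps the function signature short ("to avoid 2 arguments"), at the price that the caller has to know which combinations are meaningful.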
>>>
>>>>
>>>> In example below, <literal>fulltext_idx</literal> is
>>>> a GIN index:<!-- why isn't this automatic -->
>>>>
>>>>> It's explained above. The problem is that current index api doesn't allow
>>>>> to say if search was lossy or exact, so to preserve performance of
>>>>> GIN index we had to introduce @@@ operator, which is the same as @@, but
>>>>> lossy.
>>>
>>> Well, then we have to fix the API. Telling users to use a different
>>> operator based on what index is defined is just bad style.
>>
>> This was raised by Heikki and we discussed it a bit in Ottawa, but it's
>> unclear if it's doable for 8.3. @@@ operator is in rare use, so we could
>> say it will be improved in future versions.
>
> Uh, I am wondering if we just have to force heap access in all cases
> until it is fixed.

No, no! We would lose the performance of the GIN index, which isn't lossy and
doesn't need heap access. I don't see what's wrong with saying that some
feature isn't supported by the text search operator with a GIN index.

>> We need to decide if we need oids as user-visible argument. I don't see
>> any value, probably Teodor think other way.
>
> This is a good time to clean up the API because there are going to be
> user-visible changes anyway.

I agree. Let's keep this in mind until we get the more serious tasks done.

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83