Re: Updated tsearch documentation - Mailing list pgsql-hackers

From Oleg Bartunov
Subject Re: Updated tsearch documentation
Date
Msg-id Pine.LNX.4.64.0706211347360.1881@sn.sai.msu.ru
In response to Re: Updated tsearch documentation  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Updated tsearch documentation
Re: Updated tsearch documentation
List pgsql-hackers
On Wed, 20 Jun 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> On Wed, 20 Jun 2007, Bruce Momjian wrote:
>>>> Comments to editorial work of Bruce Momjian.
>>>>
>>>> fulltext-intro.sgml:
>>>>
>>>> it is useful to have a predefined list of lexemes.
>>>>
>>>> Bruce, there should be a list of lexeme types here!
>>>
>>> Agreed.  Is the list of lexemes parser-specific?
>>>
>>
>> Yes, it is the parser which defines the types of lexemes.
>
> OK, how will users get a list of supported lexemes?  Do we need a list
> per supported parser?

It's documented; see "Parser functions" for token_type():

postgres=# select * from token_type('default');
 tokid |    alias     |            description
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity
 

>>>> The integer option controls several behaviors which is done using bit-wise
>>>> fields and <literal>|</literal> (for example, <literal>2|4</literal>):
>>>> <!-- why so complex? -->
>>>>
>>>>> to avoid 2 arguments
>>>
>>> But I don't see why you would want to set two of those values --- they
>>> seem mutually exclusive, e.g.
>>>
>>>     1 divides the rank by the 1 + logarithm of the document length
>>>     2 divides the rank by the length itself
>>>
>>> I assume you do either one, not both.
>>
>> But what about the other variants?
>
> OK, here is the full list:
>
>     0 (the default) ignores document length
>     1 divides the rank by the 1 + logarithm of the document length
>     2 divides the rank by the length itself
>     4 divides the rank by the mean harmonic distance between extents
>     8 divides the rank by the number of unique words in document
>     16 divides the rank by 1 + logarithm of the number of unique words in
>        document
>
> so which ones would be both enabled?

None in particular! This is a list of possible values of the rank normalization
flag, which can be ORed together.

=# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1);
  rank_cd
-----------
 0.0279055
=# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1|16);
  rank_cd
-----------
 0.0139528
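
To illustrate how the flag bits combine, here is a minimal Python sketch. It
is not the tsearch C code. The function name, the argument names and the
extent_distance parameter are hypothetical. The sketch assumes that each set
bit applies its divisor in the order listed above, and that "1 + logarithm"
means log2(n + 1). That last assumption is only inferred from the factor of
two between the two results above for a document with three unique words.

```python
import math

# Flag values as listed in the thread.
NORM_LOG_LENGTH = 1    # divide by log of document length
NORM_LENGTH = 2        # divide by document length
NORM_EXTENT_DIST = 4   # divide by mean harmonic distance between extents
NORM_UNIQUE = 8        # divide by number of unique words
NORM_LOG_UNIQUE = 16   # divide by log of number of unique words


def normalize_rank(rank, flags, length, unique_words, extent_distance=1.0):
    """Apply every normalization whose bit is set in flags (hypothetical helper).

    Assumption: the logarithmic divisors are log2(n + 1).
    extent_distance is taken as an input; computing it is out of scope here.
    """
    if flags & NORM_LOG_LENGTH:
        rank /= math.log2(length + 1)
    if flags & NORM_LENGTH:
        rank /= length
    if flags & NORM_EXTENT_DIST:
        rank /= extent_distance
    if flags & NORM_UNIQUE:
        rank /= unique_words
    if flags & NORM_LOG_UNIQUE:
        rank /= math.log2(unique_words + 1)
    return rank


# The example document has 5 word positions and 3 unique words.
only_1 = normalize_rank(1.0, 1, length=5, unique_words=3)
both = normalize_rank(1.0, 1 | 16, length=5, unique_words=3)
print(both / only_1)
```

Under this assumption the 1|16 result is half of the result for 1 alone,
which matches the ratio between 0.0279055 and 0.0139528.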


>
>>
>> What I missed is the definition of extent.
>>
>> From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking:
>> An extent is a shortest, non-nested sequence of words which satisfies a query.
>
> I don't understand how that relates to this.

because of
"4 divides the rank by the mean harmonic distance between extents"
                                                          ^^^^^^^

It reflects how densely the extents that satisfy the query are packed in the document.
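
As an illustration of the definition, here is a minimal Python sketch that
finds extents for a plain AND query. It is not the tsearch implementation:
the function name and the (position, term) input format are hypothetical,
and only AND of single terms is handled. It slides a window over the
positions of query terms and keeps each shortest window that contains every
term and has no smaller covering window inside it.

```python
def find_extents(tokens, query_terms):
    """Return (start, end) positions of the shortest non-nested windows
    that contain every term of an AND query (hypothetical helper).

    tokens: iterable of (position, term) pairs.
    """
    need = set(query_terms)
    # Only positions holding a query term matter.
    hits = [(pos, term) for pos, term in sorted(tokens) if term in need]
    counts = {term: 0 for term in need}
    missing = len(need)   # query terms not yet inside the window
    left = 0              # index in hits of the window's left edge
    last_start = None
    result = []
    for pos_r, term_r in hits:
        if counts[term_r] == 0:
            missing -= 1
        counts[term_r] += 1
        if missing:
            continue
        # Drop leading hits whose term occurs again later in the window.
        while counts[hits[left][1]] > 1:
            counts[hits[left][1]] -= 1
            left += 1
        start = hits[left][0]
        # Same start as the previous extent means this window only
        # extends it to the right, so it is not a new extent.
        if start != last_start:
            result.append((start, pos_r))
            last_start = start
    return result


# The document from the rank_cd example: '1:1,2,3 4:5 6:7', query '1&4'.
doc = [(1, "1"), (2, "1"), (3, "1"), (5, "4"), (7, "6")]
print(find_extents(doc, ["1", "4"]))
```

For that document the only extent runs from position 3 to position 5: the
occurrences of '1' at positions 1 and 2 would give longer windows that
contain it.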
>
>>>
>>>> its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
>>>> if none is specified that the current configuration is used.
>>>>
>>>>> I don't understand this question
>>>
>>> Same issue as above --- why allow a number here when the name works just
>>> fine.  We don't allow tables to be specified by number, so why
>>> configurations?
>>>
>>>> <para>
>>>> <!-- why?  -->
>>>> Note that the cascade dropping of the <function>headline</function> function
>>>> cause dropping of the <literal>parser</literal> used in fulltext configuration
>>>> <replaceable>tsname</replaceable>.
>>>> </para>
>>>>
>>>>> Hmm, probably it should be reversed: cascade dropping of the parser causes
>>>>> dropping of the headline function.
>>>
>>> Agreed.
>>>
>>>>
>>>> In example below, <literal>fulltext_idx</literal> is
>>>> a GIN index:<!-- why isn't this automatic -->
>>>>
>>>>> It's explained above. The problem is that current index api doesn't allow
>>>>> to say if search was lossy or exact, so to preserve performance of
>>>>> GIN index we had to introduce @@@ operator, which is the same as @@, but
>>>>> lossy.
>>>
>>> Well, then we have to fix the API.  Telling users to use a different
>>> operator based on what index is defined is just bad style.
>>
>> This was raised by Heikki and we discussed it a bit in Ottawa, but it's
>> unclear if it's doable for 8.3.  @@@ operator is in rare use, so we could
>> say it will be improved in future versions.
>
> Uh, I am wondering if we just have to force heap access in all cases
> until it is fixed.

No, no! We would lose the performance of the GIN index, which isn't lossy and
doesn't need heap access. I don't see what's wrong with saying that some feature
isn't supported by the text search operator with a GIN index.

>> We need to decide whether we need oids as a user-visible argument. I don't see
>> any value, though Teodor probably thinks otherwise.
>
> This is a good time to clean up the API because there are going to be
> user-visible changes anyway.

I agree. Let's keep this in mind until we get the more serious tasks done.
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

