Thread: tsearch2 anomaly?

tsearch2 anomaly?

From
RC Gobeille
Date:
I'm having trouble understanding to_tsvector.  (PostgreSQL 8.1.9 contrib)

In this first case, the conversion of 'gallery2-httpd-conf' makes sense to
me and is exactly what I want.  It looks like the entire string is indexed,
plus the substrings separated by '-'.


ossdb=# select to_tsvector('gallery2-httpd-conf');
                        to_tsvector
---------------------------------------------------------
'conf':4 'httpd':3 'gallery2':2 'gallery2-httpd-conf':1


However, I'd expect the same to happen in the httpd example - but it
does not appear to.

ossdb=# select to_tsvector('httpd-2.2.3-5.src.rpm');
         to_tsvector
---------------------------
'httpd-2.2.3-5.src.rpm':1

Why don't I get: 'httpd', 'src', 'rpm', 'httpd-2.2.3-5.src.rpm' ?

Is this a bug or design?


Thank you!
Bob


Re: tsearch2 anomaly?

From
Oleg Bartunov
Date:
This is how the default parser works.  See the output of
select * from ts_debug('gallery2-httpd-conf');
and
select * from ts_debug('httpd-2.2.3-5.src.rpm');

To list all token types:

select * from token_type();


On Thu, 6 Sep 2007, RC Gobeille wrote:

> [...]
> Why don't I get: 'httpd', 'src', 'rpm', 'httpd-2.2.3-5.src.rpm' ?
>
> Is this a bug or design?

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsearch2 anomaly?

From
RC Gobeille
Date:
Thanks. I didn't know about ts_debug, so thanks for that as well.

For the record, I see how to use my own processing function (e.g.
dropatsymbol) to get what I need:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html

However, can you explain the logic behind the parsing difference if I just
add an "s" after the trailing "." of a string:


ossdb=# select ts_debug('gallery2-httpd-2.1-conf.');
                                                ts_debug
-----------------------------------------------------------------------
 (default,hword,"Hyphenated word",gallery2-httpd-2,{simple},"'2' 'httpd' 'gallery2' 'gallery2-httpd-2'")
 (default,part_hword,"Part of hyphenated word",gallery2,{simple},'gallery2')
 (default,lpart_hword,"Latin part of hyphenated word",httpd,{en_stem},'httpd')
 (default,float,"Decimal notation",2.1,{simple},'2.1')
 (default,lpart_hword,"Latin part of hyphenated word",conf,{en_stem},'conf')
(5 rows)

ossdb=# select ts_debug('gallery2-httpd-2.1-conf.s');
                                      ts_debug
---------------------------------------------------------------------
 (default,host,Host,gallery2-httpd-2.1-conf.s,{simple},'gallery2-httpd-2.1-conf.s')
(1 row)

Thanks again,
Bob


On 9/6/07 11:19 AM, "Oleg Bartunov" <oleg@sai.msu.su> wrote:

> This is how default parser works.  See output from
> select * from ts_debug('gallery2-httpd-conf');
> and
> select * from ts_debug('httpd-2.2.3-5.src.rpm');
>
> All token type:
>
> select * from token_type();
>



Re: tsearch2 anomaly?

From
Teodor Sigaev
Date:
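Ordinary text has no strict syntax rules, so the parser tries to recognize
the most probable token type.  Something made of '.', '-' and alphanumeric
characters is often a filename, but a filename very rarely starts or ends
with a dot.  That is why 'gallery2-httpd-2.1-conf.' (ending in a dot) is
split into parts, while 'gallery2-httpd-2.1-conf.s' is taken as a single
token.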


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/