Thread: tsearch2 anomaly?
I'm having trouble understanding to_tsvector (PostgreSQL 8.1.9 contrib).

In this first case, converting 'gallery2-httpd-conf' makes sense to me and is exactly what I want. It looks like the entire string is indexed, plus the substrings separated by '-'.

ossdb=# select to_tsvector('gallery2-httpd-conf');
                       to_tsvector
---------------------------------------------------------
 'conf':4 'httpd':3 'gallery2':2 'gallery2-httpd-conf':1

However, I'd expect the same to happen in the httpd example, but it does not appear to.

ossdb=# select to_tsvector('httpd-2.2.3-5.src.rpm');
        to_tsvector
---------------------------
 'httpd-2.2.3-5.src.rpm':1

Why don't I get: 'httpd', 'src', 'rpm', 'httpd-2.2.3-5.src.rpm' ?

Is this a bug or design?

Thank you!
Bob
This is how the default parser works. See the output from

select * from ts_debug('gallery2-httpd-conf');

and

select * from ts_debug('httpd-2.2.3-5.src.rpm');

To list all token types:

select * from token_type();

On Thu, 6 Sep 2007, RC Gobeille wrote:

> Why don't I get: 'httpd', 'src', 'rpm', 'httpd-2.2.3-5.src.rpm' ?
>
> Is this a bug or design?

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Thanks, and I didn't know about ts_debug, so thanks for that also.

For the record, I see how to use my own processing function (e.g. dropatsymbol) to get what I need:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html

However, can you explain the logic behind the parsing difference if I just add a ".s" to a string:

ossdb=# select ts_debug('gallery2-httpd-2.1-conf.');
 ts_debug
-----------------------------------------------------------------------
 (default,hword,"Hyphenated word",gallery2-httpd-2,{simple},"'2' 'httpd' 'gallery2' 'gallery2-httpd-2'")
 (default,part_hword,"Part of hyphenated word",gallery2,{simple},'gallery2')
 (default,lpart_hword,"Latin part of hyphenated word",httpd,{en_stem},'httpd')
 (default,float,"Decimal notation",2.1,{simple},'2.1')
 (default,lpart_hword,"Latin part of hyphenated word",conf,{en_stem},'conf')
(5 rows)

ossdb=# select ts_debug('gallery2-httpd-2.1-conf.s');
 ts_debug
---------------------------------------------------------------------
 (default,host,Host,gallery2-httpd-2.1-conf.s,{simple},'gallery2-httpd-2.1-conf.s')
(1 row)

Thanks again,
Bob

On 9/6/07 11:19 AM, "Oleg Bartunov" <oleg@sai.msu.su> wrote:

> This is how default parser works. See output from
> select * from ts_debug('gallery2-httpd-conf');
> and
> select * from ts_debug('httpd-2.2.3-5.src.rpm');
>
> All token type:
>
> select * from token_type();
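The pre-processing idea mentioned in this message (running text through your own function before it reaches to_tsvector) could look roughly like the sketch below. It is written in Python on the client side purely for illustration; `expand_name` is a hypothetical helper, not the `dropatsymbol` function from the tsearch2 documentation, and the rule "keep only parts that contain a letter" is an assumption chosen to match the tokens asked for in the original question.

```python
import re

def expand_name(name):
    """Return the original name followed by its letter-bearing parts.

    Hypothetical helper: splits on '-' and '.', drops purely numeric
    parts (version numbers), and keeps the full name as the first token.
    """
    parts = [p for p in re.split(r"[-.]", name) if re.search(r"[A-Za-z]", p)]
    seen = []
    # Full name first, then each part once, preserving the original order.
    for token in [name] + parts:
        if token not in seen:
            seen.append(token)
    return " ".join(seen)

print(expand_name("httpd-2.2.3-5.src.rpm"))
# httpd-2.2.3-5.src.rpm httpd src rpm
```

The expanded string, rather than the raw name, would then be the text handed to to_tsvector. How the parser tokenizes it is still worth confirming with ts_debug.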
Ordinary text has no strict syntax rules, so the parser tries to recognize the most probable token. Something made up of '.', '-' and alphanumeric characters is often a filename, but a filename very rarely begins or ends with a dot.

RC Gobeille wrote:

> However, can you explain the logic behind the parsing difference if I just
> add a ".s" to a string:
>
> ossdb=# select ts_debug('gallery2-httpd-2.1-conf.');
> ossdb=# select ts_debug('gallery2-httpd-2.1-conf.s');

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
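The heuristic described in this reply can be illustrated with a toy check. This is not the tsearch2 parser's code or grammar; the pattern `DOTTED_NAME` and the function `looks_like_single_token` are assumptions made up for this sketch. It only mirrors the stated rule of thumb: a run of alphanumerics joined by '.' and '-' that contains a dot, and neither starts nor ends with a separator, is treated as one name-like token.

```python
import re

# Toy pattern (an assumption, not the real parser's grammar): alphanumeric
# runs joined by single '.' or '-' characters, so the string cannot start
# or end with a separator.
DOTTED_NAME = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*")

def looks_like_single_token(text):
    """Guess whether text would be kept whole, like a host or file name."""
    return "." in text and DOTTED_NAME.fullmatch(text) is not None

for sample in ("httpd-2.2.3-5.src.rpm",
               "gallery2-httpd-2.1-conf.s",
               "gallery2-httpd-2.1-conf.",
               "gallery2-httpd-conf"):
    print(sample, looks_like_single_token(sample))
```

For the four strings discussed in this thread, the toy check agrees with the observed ts_debug and to_tsvector results: the first two are kept whole, and the last two are broken into parts. It should not be relied on beyond these examples.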