Thread: to_tsvector in 8.2.3
It may solve my query if anybody can tell me if anything has changes in tsearch2 recently? otherwise... I have two installations of Postgres on Windows machines, one is 8.2.1 and the other is 8.2.3 Both installed the same way, selecting tsearch2 during the installation and restoring the same backup database file. In 8.2.1 select to_tsvector('test text') returns 'test':1 'text':2 as expected In 8.2.3 the same code returns 'test text':1 Can anybody help? -- View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9592230 Sent from the PostgreSQL - general mailing list archive at Nabble.com.
What parse returns ? 8.1.5 and 8.3 return www=# select parse('test text'); parse ---------- (1,test) (12," ") (1,text) (3 rows) Also, what is your configuration ? On Wed, 21 Mar 2007, richardcraig wrote: > > It may solve my query if anybody can tell me if anything has changes in > tsearch2 recently? otherwise... > > I have two installations of Postgres on Windows machines, one is 8.2.1 and > the other is 8.2.3 > > Both installed the same way, selecting tsearch2 during the installation and > restoring the same backup database file. > > In 8.2.1 > select to_tsvector('test text') > returns > 'test':1 'text':2 > as expected > > In 8.2.3 the same code returns > 'test text':1 > > Can anybody help? > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Thanks for the quick response. On 8.2.1 I get the same result as yourself, however on 8.2.3 I get (2,"test text") Configuration? Both databases are UTF8 encoded with language set to C Standard windows installation except for tsearch2 I also have another machine with SQL_ASCII encoding on 8.2.1 and that's OK As far as I can tell they are identical except for the postgresql versions. It's been a long time since I did much with tsearch2 so I'm suspecting a config problem with it that's I've long forgotten about. Richard Oleg Bartunov wrote: > > What parse returns ? 8.1.5 and 8.3 return > > www=# select parse('test text'); > parse > ---------- > (1,test) > (12," ") > (1,text) > (3 rows) > > Also, what is your configuration ? > > > On Wed, 21 Mar 2007, richardcraig wrote: > >> >> It may solve my query if anybody can tell me if anything has changes in >> tsearch2 recently? otherwise... >> >> I have two installations of Postgres on Windows machines, one is 8.2.1 >> and >> the other is 8.2.3 >> >> Both installed the same way, selecting tsearch2 during the installation >> and >> restoring the same backup database file. >> >> In 8.2.1 >> select to_tsvector('test text') >> returns >> 'test':1 'text':2 >> as expected >> >> In 8.2.3 the same code returns >> 'test text':1 >> >> Can anybody help? >> > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match > > -- View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9592631 Sent from the PostgreSQL - general mailing list archive at Nabble.com.
As an update, I've just copied the tsvector.dll file from an 8.2.1 computer to the 8.2.3 computer and it's working OK now. I suspect a change in the dll is causing the problem. Richard richardcraig wrote: > > Oleg > > Thanks for the quick response. > > On 8.2.1 I get the same result as yourself, however on 8.2.3 I get > (2,"test text") > > Configuration? > Both databases are UTF8 encoded with language set to C > Standard windows installation except for tsearch2 > I also have another machine with SQL_ASCII encoding on 8.2.1 and that's OK > > As far as I can tell they are identical except for the postgresql > versions. > > It's been a long time since I did much with tsearch2 so I'm suspecting a > config problem with it that's I've long forgotten about. > > Richard > > > Oleg Bartunov wrote: >> >> What parse returns ? 8.1.5 and 8.3 return >> >> www=# select parse('test text'); >> parse >> ---------- >> (1,test) >> (12," ") >> (1,text) >> (3 rows) >> >> Also, what is your configuration ? >> >> >> On Wed, 21 Mar 2007, richardcraig wrote: >> >>> >>> It may solve my query if anybody can tell me if anything has changes in >>> tsearch2 recently? otherwise... >>> >>> I have two installations of Postgres on Windows machines, one is 8.2.1 >>> and >>> the other is 8.2.3 >>> >>> Both installed the same way, selecting tsearch2 during the installation >>> and >>> restoring the same backup database file. >>> >>> In 8.2.1 >>> select to_tsvector('test text') >>> returns >>> 'test':1 'text':2 >>> as expected >>> >>> In 8.2.3 the same code returns >>> 'test text':1 >>> >>> Can anybody help? >>> >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 9: In versions below 8.0, the planner will ignore your desire to >> choose an index scan if your joining column's datatypes do not >> match >> >> > > -- View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9592812 Sent from the PostgreSQL - general mailing list archive at Nabble.com.
I can't reproduce your problem, but I have not Windows box, can anybody reproduce that? contrib_regression=# select version(); version PostgreSQL 8.2.3 on i386-unknown-freebsd6.2, compiled by GCC gcc (GCC) 3.4.6 [FreeBSD] 20060305 (1 row) contrib_regression=# show server_encoding ; server_encoding ----------------- UTF8 (1 row) contrib_regression=# show lc_collate; lc_collate ------------ C (1 row) contrib_regression=# show lc_ctype; lc_ctype ---------- C (1 row) contrib_regression=# select to_tsvector('test text'); to_tsvector ------------------- 'test':1 'text':2 (1 row) -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Wednesday 21 March 2007 14:25, Teodor Sigaev wrote: | I can't reproduce your problem, but I have not Windows box, can anybody | reproduce that? just a guess in the wild; I once had a similar phenomen and tracked it down to a "non breaking space character" (0xA0). Since then I'm patching the tsearch2 lexer: --- postgresql-8.1.5/contrib/tsearch2/wordparser/parser.l +++ postgresql-8.1.4/contrib/tsearch2/wordparser/parser.l @@ -78,8 +78,8 @@ /* cyrillic koi8 char */ CYRALNUM [0-9\200-\377] CYRALPHA [\200-\377] -ALPHA [a-zA-Z\200-\377] -ALNUM [0-9a-zA-Z\200-\377] +ALPHA [a-zA-Z\200-\237\241-\377] +ALNUM [0-9a-zA-Z\200-\237\241-\377] HOSTNAME ([-_[:alnum:]]+\.)+[[:alpha:]]+ @@ -307,7 +307,7 @@ return UWORD; } -[ \r\n\t]+ { +[ \240\r\n\t]+ { token = tsearch2_yytext; tokenlen = tsearch2_yyleng; return SPACE; Ciao, Thomas -- Thomas Pundt <thomas.pundt@rp-online.de> ---- http://rp-online.de/ ----
8.2 has fully rewritten text parser based on POSIX is* functions. Thomas Pundt wrote: > On Wednesday 21 March 2007 14:25, Teodor Sigaev wrote: > | I can't reproduce your problem, but I have not Windows box, can anybody > | reproduce that? > > just a guess in the wild; I once had a similar phenomen and tracked it down > to a "non breaking space character" (0xA0). Since then I'm patching the > tsearch2 lexer: > > --- postgresql-8.1.5/contrib/tsearch2/wordparser/parser.l > +++ postgresql-8.1.4/contrib/tsearch2/wordparser/parser.l > @@ -78,8 +78,8 @@ > /* cyrillic koi8 char */ > CYRALNUM [0-9\200-\377] > CYRALPHA [\200-\377] > -ALPHA [a-zA-Z\200-\377] > -ALNUM [0-9a-zA-Z\200-\377] > +ALPHA [a-zA-Z\200-\237\241-\377] > +ALNUM [0-9a-zA-Z\200-\237\241-\377] > > > HOSTNAME ([-_[:alnum:]]+\.)+[[:alpha:]]+ > @@ -307,7 +307,7 @@ > return UWORD; > } > > -[ \r\n\t]+ { > +[ \240\r\n\t]+ { > token = tsearch2_yytext; > tokenlen = tsearch2_yyleng; > return SPACE; > > > Ciao, > Thomas > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Wed, 21 Mar 2007, richardcraig wrote: > > As an update, I've just copied the tsvector.dll file from an 8.2.1 computer > to the 8.2.3 computer and it's working OK now. I suspect a change in the dll > is causing the problem. Interesting. > > Richard > > > > richardcraig wrote: >> >> Oleg >> >> Thanks for the quick response. >> >> On 8.2.1 I get the same result as yourself, however on 8.2.3 I get >> (2,"test text") >> >> Configuration? >> Both databases are UTF8 encoded with language set to C >> Standard windows installation except for tsearch2 >> I also have another machine with SQL_ASCII encoding on 8.2.1 and that's OK >> >> As far as I can tell they are identical except for the postgresql >> versions. >> >> It's been a long time since I did much with tsearch2 so I'm suspecting a >> config problem with it that's I've long forgotten about. >> >> Richard >> >> >> Oleg Bartunov wrote: >>> >>> What parse returns ? 8.1.5 and 8.3 return >>> >>> www=# select parse('test text'); >>> parse >>> ---------- >>> (1,test) >>> (12," ") >>> (1,text) >>> (3 rows) >>> >>> Also, what is your configuration ? >>> >>> >>> On Wed, 21 Mar 2007, richardcraig wrote: >>> >>>> >>>> It may solve my query if anybody can tell me if anything has changes in >>>> tsearch2 recently? otherwise... >>>> >>>> I have two installations of Postgres on Windows machines, one is 8.2.1 >>>> and >>>> the other is 8.2.3 >>>> >>>> Both installed the same way, selecting tsearch2 during the installation >>>> and >>>> restoring the same backup database file. >>>> >>>> In 8.2.1 >>>> select to_tsvector('test text') >>>> returns >>>> 'test':1 'text':2 >>>> as expected >>>> >>>> In 8.2.3 the same code returns >>>> 'test text':1 >>>> >>>> Can anybody help? >>>> >>> >>> Regards, >>> Oleg >>> _____________________________________________________________ >>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >>> Sternberg Astronomical Institute, Moscow University, Russia >>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >>> phone: +007(495)939-16-83, +007(495)939-23-83 >>> >>> ---------------------------(end of broadcast)--------------------------- >>> TIP 9: In versions below 8.0, the planner will ignore your desire to >>> choose an index scan if your joining column's datatypes do not >>> match >>> >>> >> >> > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
On Wed, Mar 21, 2007 at 04:25:30PM +0300, Teodor Sigaev wrote: > I can't reproduce your problem, but I have not Windows box, can anybody > reproduce that? > > > contrib_regression=# select version(); > version > > PostgreSQL 8.2.3 on i386-unknown-freebsd6.2, compiled by GCC gcc (GCC) > 3.4.6 [FreeBSD] 20060305 > (1 row) > contrib_regression=# show server_encoding ; > server_encoding > ----------------- > UTF8 > (1 row) > > contrib_regression=# show lc_collate; > lc_collate > ------------ > C > (1 row) > > contrib_regression=# show lc_ctype; > lc_ctype > ---------- > C > (1 row) > > contrib_regression=# select to_tsvector('test text'); > to_tsvector > ------------------- > 'test':1 'text':2 > (1 row) postgres=# select version();show lc_collate;show lc_ctype; version -------------------------------------------------------------------------------- ------------- PostgreSQL 8.3devel on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2 (mi ngw-special) (1 row) lc_collate ------------ C (1 row) lc_ctype ---------- C (1 row) postgres=# select to_tsvector('test text'); to_tsvector --------------- 'test text':1 (1 row) (encoding is UTF8 on a fresh database) //Magnus
> postgres=# select to_tsvector('test text'); > to_tsvector > --------------- > 'test text':1 > (1 row) Ok. that's related to http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/tsearch2/wordparser/parser.c.diff?r1=1.11;r2=1.12;f=h commit. Thomas pointed that it can be non-breakable space (0xa0) and that commit assumes any character with C locale and multibyte encoding and > 0x7f is alpha. To check theory, pls, apply attached patch. If so, I'm confused, we can not assume that 0xa0 is a space symbol in any multibyte encoding, even in Windows. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/ *** ./contrib/tsearch2/wordparser/parser.c.orig Wed Mar 21 20:41:23 2007 --- ./contrib/tsearch2/wordparser/parser.c Wed Mar 21 21:10:39 2007 *************** *** 124,130 **** --- 124,134 ---- * with C-locale is an alpha character */ if ( c > 0x7f ) + { + if ( c == 0xa0 ) + return 0; return 1; + } return isalnum(0xff & c); } *************** *** 157,163 **** --- 161,171 ---- * with C-locale is an alpha character */ if ( c > 0x7f ) + { + if ( c == 0xa0 ) + return 0; return 1; + } return isalpha(0xff & c); }
On Wed, Mar 21, 2007 at 09:13:55PM +0300, Teodor Sigaev wrote: > >postgres=# select to_tsvector('test text'); > > to_tsvector > >--------------- > > 'test text':1 > >(1 row) > Ok. that's related to > http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/tsearch2/wordparser/parser.c.diff?r1=1.11;r2=1.12;f=h > commit. Thomas pointed that it can be non-breakable space (0xa0) and that > commit assumes any character with C locale and multibyte encoding and > > 0x7f is alpha. > To check theory, pls, apply attached patch. > > If so, I'm confused, we can not assume that 0xa0 is a space symbol in any > multibyte encoding, even in Windows. Nope, same result with this patch. //Magnus
Solved, see attached patch. I had found old Celeron-300 box and install Windows on it, and it was very slow :) > Nope, same result with this patch. Thank you. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/ *** ./contrib/tsearch2.orig/./wordparser/parser.c Thu Mar 22 18:39:23 2007 --- ./contrib/tsearch2/./wordparser/parser.c Thu Mar 22 18:51:23 2007 *************** *** 117,123 **** { if (lc_ctype_is_c()) { ! unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar); /* * any non-ascii symbol with multibyte encoding --- 117,123 ---- { if (lc_ctype_is_c()) { ! unsigned int c = *(prs->wstr + prs->state->poschar); /* * any non-ascii symbol with multibyte encoding
Teodor As a non-C windows user (yes - throw stones at me :) ) Do you have a fixed dll for this patch that I can try? Thanks Richard Teodor Sigaev-2 wrote: > > Solved, see attached patch. I had found old Celeron-300 box and install > Windows > on it, and it was very slow :) > > >> Nope, same result with this patch. > Thank you. > > -- > Teodor Sigaev E-mail: teodor@sigaev.ru > WWW: > http://www.sigaev.ru/ > > *** ./contrib/tsearch2.orig/./wordparser/parser.c Thu Mar 22 18:39:23 2007 > --- ./contrib/tsearch2/./wordparser/parser.c Thu Mar 22 18:51:23 2007 > *************** > *** 117,123 **** > { > if (lc_ctype_is_c()) > { > ! unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar); > > /* > * any non-ascii symbol with multibyte encoding > --- 117,123 ---- > { > if (lc_ctype_is_c()) > { > ! unsigned int c = *(prs->wstr + prs->state->poschar); > > /* > * any non-ascii symbol with multibyte encoding > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > > -- View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9790691 Sent from the PostgreSQL - general mailing list archive at Nabble.com.
Sorry, no - I tested on CVS HEAD, so dll isn't compatible :( Wait a bit for 8.2.4 richardcraig wrote: > Teodor > > As a non-C windows user (yes - throw stones at me :) ) Do you have a fixed > dll for this patch that I can try? > > Thanks > > Richard > > > Teodor Sigaev-2 wrote: >> Solved, see attached patch. I had found old Celeron-300 box and install >> Windows >> on it, and it was very slow :) >> >> >>> Nope, same result with this patch. >> Thank you. >> >> -- >> Teodor Sigaev E-mail: teodor@sigaev.ru >> WWW: >> http://www.sigaev.ru/ >> >> *** ./contrib/tsearch2.orig/./wordparser/parser.c Thu Mar 22 18:39:23 2007 >> --- ./contrib/tsearch2/./wordparser/parser.c Thu Mar 22 18:51:23 2007 >> *************** >> *** 117,123 **** >> { >> if (lc_ctype_is_c()) >> { >> ! unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar); >> >> /* >> * any non-ascii symbol with multibyte encoding >> --- 117,123 ---- >> { >> if (lc_ctype_is_c()) >> { >> ! unsigned int c = *(prs->wstr + prs->state->poschar); >> >> /* >> * any non-ascii symbol with multibyte encoding >> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 3: Have you checked our extensive FAQ? >> >> http://www.postgresql.org/docs/faq >> >> > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/