Thread: to_tsvector in 8.2.3

to_tsvector in 8.2.3

From
richardcraig
Date:
It may solve my query if anybody can tell me if anything has changes in
tsearch2 recently? otherwise...

I have two installations of Postgres on Windows machines, one is 8.2.1 and
the other is 8.2.3

Both installed the same way, selecting tsearch2 during the installation and
restoring the same backup database file.

In 8.2.1
      select to_tsvector('test text')
returns
      'test':1 'text':2
as expected

In 8.2.3 the same code returns
      'test text':1

Can anybody help?
--
View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9592230
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: to_tsvector in 8.2.3

From
Oleg Bartunov
Date:
What parse returns ? 8.1.5 and 8.3 return

www=# select parse('test text');
   parse
----------
  (1,test)
  (12," ")
  (1,text)
(3 rows)

Also, what is your configuration ?


On Wed, 21 Mar 2007, richardcraig wrote:

>
> It may solve my query if anybody can tell me if anything has changes in
> tsearch2 recently? otherwise...
>
> I have two installations of Postgres on Windows machines, one is 8.2.1 and
> the other is 8.2.3
>
> Both installed the same way, selecting tsearch2 during the installation and
> restoring the same backup database file.
>
> In 8.2.1
>      select to_tsvector('test text')
> returns
>      'test':1 'text':2
> as expected
>
> In 8.2.3 the same code returns
>      'test text':1
>
> Can anybody help?
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: to_tsvector in 8.2.3

From
richardcraig
Date:
Oleg

Thanks for the quick response.

On 8.2.1 I get the same result as yourself, however on 8.2.3 I get
(2,"test text")

Configuration?
Both databases are UTF8 encoded with language set to C
Standard windows installation except for tsearch2
I also have another machine with SQL_ASCII encoding on 8.2.1 and that's OK

As far as I can tell they are identical except for the postgresql versions.

It's been a long time since I did much with tsearch2 so I'm suspecting a
config problem with it that's I've long forgotten about.

Richard


Oleg Bartunov wrote:
>
> What parse returns ? 8.1.5 and 8.3 return
>
> www=# select parse('test text');
>    parse
> ----------
>   (1,test)
>   (12," ")
>   (1,text)
> (3 rows)
>
> Also, what is your configuration ?
>
>
> On Wed, 21 Mar 2007, richardcraig wrote:
>
>>
>> It may solve my query if anybody can tell me if anything has changes in
>> tsearch2 recently? otherwise...
>>
>> I have two installations of Postgres on Windows machines, one is 8.2.1
>> and
>> the other is 8.2.3
>>
>> Both installed the same way, selecting tsearch2 during the installation
>> and
>> restoring the same backup database file.
>>
>> In 8.2.1
>>      select to_tsvector('test text')
>> returns
>>      'test':1 'text':2
>> as expected
>>
>> In 8.2.3 the same code returns
>>      'test text':1
>>
>> Can anybody help?
>>
>
>      Regards,
>          Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
>
>

--
View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9592631
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: to_tsvector in 8.2.3

From
richardcraig
Date:
As an update, I've just copied the tsvector.dll file from an 8.2.1 computer
to the 8.2.3 computer and it's working OK now. I suspect a change in the dll
is causing the problem.

Richard



richardcraig wrote:
>
> Oleg
>
> Thanks for the quick response.
>
> On 8.2.1 I get the same result as yourself, however on 8.2.3 I get
> (2,"test text")
>
> Configuration?
> Both databases are UTF8 encoded with language set to C
> Standard windows installation except for tsearch2
> I also have another machine with SQL_ASCII encoding on 8.2.1 and that's OK
>
> As far as I can tell they are identical except for the postgresql
> versions.
>
> It's been a long time since I did much with tsearch2 so I'm suspecting a
> config problem with it that's I've long forgotten about.
>
> Richard
>
>
> Oleg Bartunov wrote:
>>
>> What parse returns ? 8.1.5 and 8.3 return
>>
>> www=# select parse('test text');
>>    parse
>> ----------
>>   (1,test)
>>   (12," ")
>>   (1,text)
>> (3 rows)
>>
>> Also, what is your configuration ?
>>
>>
>> On Wed, 21 Mar 2007, richardcraig wrote:
>>
>>>
>>> It may solve my query if anybody can tell me if anything has changes in
>>> tsearch2 recently? otherwise...
>>>
>>> I have two installations of Postgres on Windows machines, one is 8.2.1
>>> and
>>> the other is 8.2.3
>>>
>>> Both installed the same way, selecting tsearch2 during the installation
>>> and
>>> restoring the same backup database file.
>>>
>>> In 8.2.1
>>>      select to_tsvector('test text')
>>> returns
>>>      'test':1 'text':2
>>> as expected
>>>
>>> In 8.2.3 the same code returns
>>>      'test text':1
>>>
>>> Can anybody help?
>>>
>>
>>      Regards,
>>          Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 9: In versions below 8.0, the planner will ignore your desire to
>>        choose an index scan if your joining column's datatypes do not
>>        match
>>
>>
>
>

--
View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9592812
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: to_tsvector in 8.2.3

From
Teodor Sigaev
Date:
I can't reproduce your problem, but I have not Windows box, can anybody
reproduce that?


contrib_regression=# select version();
                                              version

  PostgreSQL 8.2.3 on i386-unknown-freebsd6.2, compiled by GCC gcc (GCC) 3.4.6
[FreeBSD] 20060305
(1 row)
contrib_regression=#  show server_encoding ;
  server_encoding
-----------------
  UTF8
(1 row)

contrib_regression=# show lc_collate;
  lc_collate
------------
  C
(1 row)

contrib_regression=# show lc_ctype;
  lc_ctype
----------
  C
(1 row)

contrib_regression=# select to_tsvector('test text');
     to_tsvector
-------------------
  'test':1 'text':2
(1 row)

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: to_tsvector in 8.2.3

From
Thomas Pundt
Date:
On Wednesday 21 March 2007 14:25, Teodor Sigaev wrote:
| I can't reproduce your problem, but I have not Windows box, can anybody
| reproduce that?

just a guess in the wild; I once had a similar phenomen and tracked it down
to a "non breaking space character" (0xA0). Since then I'm patching the
tsearch2 lexer:

--- postgresql-8.1.5/contrib/tsearch2/wordparser/parser.l
+++ postgresql-8.1.4/contrib/tsearch2/wordparser/parser.l
@@ -78,8 +78,8 @@
 /* cyrillic koi8 char */
 CYRALNUM       [0-9\200-\377]
 CYRALPHA       [\200-\377]
-ALPHA          [a-zA-Z\200-\377]
-ALNUM          [0-9a-zA-Z\200-\377]
+ALPHA          [a-zA-Z\200-\237\241-\377]
+ALNUM          [0-9a-zA-Z\200-\237\241-\377]


 HOSTNAME       ([-_[:alnum:]]+\.)+[[:alpha:]]+
@@ -307,7 +307,7 @@
        return UWORD;
 }

-[ \r\n\t]+ {
+[ \240\r\n\t]+ {
        token = tsearch2_yytext;
        tokenlen = tsearch2_yyleng;
        return SPACE;


Ciao,
Thomas

--
Thomas Pundt <thomas.pundt@rp-online.de> ---- http://rp-online.de/ ----

Re: to_tsvector in 8.2.3

From
Teodor Sigaev
Date:
8.2 has fully rewritten text parser based on POSIX is* functions.

Thomas Pundt wrote:
> On Wednesday 21 March 2007 14:25, Teodor Sigaev wrote:
> | I can't reproduce your problem, but I have not Windows box, can anybody
> | reproduce that?
>
> just a guess in the wild; I once had a similar phenomen and tracked it down
> to a "non breaking space character" (0xA0). Since then I'm patching the
> tsearch2 lexer:
>
> --- postgresql-8.1.5/contrib/tsearch2/wordparser/parser.l
> +++ postgresql-8.1.4/contrib/tsearch2/wordparser/parser.l
> @@ -78,8 +78,8 @@
>  /* cyrillic koi8 char */
>  CYRALNUM       [0-9\200-\377]
>  CYRALPHA       [\200-\377]
> -ALPHA          [a-zA-Z\200-\377]
> -ALNUM          [0-9a-zA-Z\200-\377]
> +ALPHA          [a-zA-Z\200-\237\241-\377]
> +ALNUM          [0-9a-zA-Z\200-\237\241-\377]
>
>
>  HOSTNAME       ([-_[:alnum:]]+\.)+[[:alpha:]]+
> @@ -307,7 +307,7 @@
>         return UWORD;
>  }
>
> -[ \r\n\t]+ {
> +[ \240\r\n\t]+ {
>         token = tsearch2_yytext;
>         tokenlen = tsearch2_yyleng;
>         return SPACE;
>
>
> Ciao,
> Thomas
>

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: to_tsvector in 8.2.3

From
Oleg Bartunov
Date:
On Wed, 21 Mar 2007, richardcraig wrote:

>
> As an update, I've just copied the tsvector.dll file from an 8.2.1 computer
> to the 8.2.3 computer and it's working OK now. I suspect a change in the dll
> is causing the problem.

Interesting.

>
> Richard
>
>
>
> richardcraig wrote:
>>
>> Oleg
>>
>> Thanks for the quick response.
>>
>> On 8.2.1 I get the same result as yourself, however on 8.2.3 I get
>> (2,"test text")
>>
>> Configuration?
>> Both databases are UTF8 encoded with language set to C
>> Standard windows installation except for tsearch2
>> I also have another machine with SQL_ASCII encoding on 8.2.1 and that's OK
>>
>> As far as I can tell they are identical except for the postgresql
>> versions.
>>
>> It's been a long time since I did much with tsearch2 so I'm suspecting a
>> config problem with it that's I've long forgotten about.
>>
>> Richard
>>
>>
>> Oleg Bartunov wrote:
>>>
>>> What parse returns ? 8.1.5 and 8.3 return
>>>
>>> www=# select parse('test text');
>>>    parse
>>> ----------
>>>   (1,test)
>>>   (12," ")
>>>   (1,text)
>>> (3 rows)
>>>
>>> Also, what is your configuration ?
>>>
>>>
>>> On Wed, 21 Mar 2007, richardcraig wrote:
>>>
>>>>
>>>> It may solve my query if anybody can tell me if anything has changes in
>>>> tsearch2 recently? otherwise...
>>>>
>>>> I have two installations of Postgres on Windows machines, one is 8.2.1
>>>> and
>>>> the other is 8.2.3
>>>>
>>>> Both installed the same way, selecting tsearch2 during the installation
>>>> and
>>>> restoring the same backup database file.
>>>>
>>>> In 8.2.1
>>>>      select to_tsvector('test text')
>>>> returns
>>>>      'test':1 'text':2
>>>> as expected
>>>>
>>>> In 8.2.3 the same code returns
>>>>      'test text':1
>>>>
>>>> Can anybody help?
>>>>
>>>
>>>      Regards,
>>>          Oleg
>>> _____________________________________________________________
>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>>> Sternberg Astronomical Institute, Moscow University, Russia
>>> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
>>> phone: +007(495)939-16-83, +007(495)939-23-83
>>>
>>> ---------------------------(end of broadcast)---------------------------
>>> TIP 9: In versions below 8.0, the planner will ignore your desire to
>>>        choose an index scan if your joining column's datatypes do not
>>>        match
>>>
>>>
>>
>>
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: to_tsvector in 8.2.3

From
Magnus Hagander
Date:
On Wed, Mar 21, 2007 at 04:25:30PM +0300, Teodor Sigaev wrote:
> I can't reproduce your problem, but I have not Windows box, can anybody
> reproduce that?
>
>
> contrib_regression=# select version();
>                                              version
>
>  PostgreSQL 8.2.3 on i386-unknown-freebsd6.2, compiled by GCC gcc (GCC)
>  3.4.6 [FreeBSD] 20060305
> (1 row)
> contrib_regression=#  show server_encoding ;
>  server_encoding
> -----------------
>  UTF8
> (1 row)
>
> contrib_regression=# show lc_collate;
>  lc_collate
> ------------
>  C
> (1 row)
>
> contrib_regression=# show lc_ctype;
>  lc_ctype
> ----------
>  C
> (1 row)
>
> contrib_regression=# select to_tsvector('test text');
>     to_tsvector
> -------------------
>  'test':1 'text':2
> (1 row)

postgres=# select version();show lc_collate;show lc_ctype;
                                           version

--------------------------------------------------------------------------------
-------------
 PostgreSQL 8.3devel on i686-pc-mingw32, compiled by GCC gcc.exe (GCC)
3.4.2 (mi
ngw-special)
(1 row)

 lc_collate
------------
 C
(1 row)

 lc_ctype
----------
 C
(1 row)

postgres=# select to_tsvector('test text');
  to_tsvector
---------------
 'test text':1
(1 row)


(encoding is UTF8 on a fresh database)

//Magnus


Re: to_tsvector in 8.2.3

From
Teodor Sigaev
Date:
> postgres=# select to_tsvector('test text');
>   to_tsvector
> ---------------
>  'test text':1
> (1 row)
Ok. that's related to
http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/tsearch2/wordparser/parser.c.diff?r1=1.11;r2=1.12;f=h
commit. Thomas pointed that it can be non-breakable space (0xa0) and that commit
assumes any character with C locale and multibyte encoding and > 0x7f is alpha.
To check theory, pls, apply attached patch.

If so, I'm confused, we can not assume that 0xa0 is a space symbol in any
multibyte encoding, even in Windows.



--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/
*** ./contrib/tsearch2/wordparser/parser.c.orig    Wed Mar 21 20:41:23 2007
--- ./contrib/tsearch2/wordparser/parser.c    Wed Mar 21 21:10:39 2007
***************
*** 124,130 ****
--- 124,134 ----
               * with C-locale is an alpha character
               */
              if ( c > 0x7f )
+             {
+                 if ( c == 0xa0 )
+                     return 0;
                  return 1;
+             }

              return isalnum(0xff & c);
          }
***************
*** 157,163 ****
--- 161,171 ----
               * with C-locale is an alpha character
               */
              if ( c > 0x7f )
+             {
+                 if ( c == 0xa0 )
+                     return 0;
                  return 1;
+             }

              return isalpha(0xff & c);
          }

Re: to_tsvector in 8.2.3

From
Magnus Hagander
Date:
On Wed, Mar 21, 2007 at 09:13:55PM +0300, Teodor Sigaev wrote:
> >postgres=# select to_tsvector('test text');
> >  to_tsvector
> >---------------
> > 'test text':1
> >(1 row)
> Ok. that's related to
> http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/tsearch2/wordparser/parser.c.diff?r1=1.11;r2=1.12;f=h
> commit. Thomas pointed that it can be non-breakable space (0xa0) and that
> commit assumes any character with C locale and multibyte encoding and >
> 0x7f is alpha.
> To check theory, pls, apply attached patch.
>
> If so, I'm confused, we can not assume that 0xa0 is a space symbol in any
> multibyte encoding, even in Windows.

Nope, same result with this patch.

//Magnus


Re: to_tsvector in 8.2.3

From
Teodor Sigaev
Date:
Solved, see attached patch. I had found old Celeron-300 box and install Windows
on it, and it was very slow :)


> Nope, same result with this patch.
Thank you.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/
*** ./contrib/tsearch2.orig/./wordparser/parser.c    Thu Mar 22 18:39:23 2007
--- ./contrib/tsearch2/./wordparser/parser.c    Thu Mar 22 18:51:23 2007
***************
*** 117,123 ****
      {
          if (lc_ctype_is_c())
          {
!             unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar);

              /*
               * any non-ascii symbol with multibyte encoding
--- 117,123 ----
      {
          if (lc_ctype_is_c())
          {
!             unsigned int c = *(prs->wstr + prs->state->poschar);

              /*
               * any non-ascii symbol with multibyte encoding

Re: to_tsvector in 8.2.3

From
richardcraig
Date:
Teodor

As a non-C windows user (yes - throw stones at me :) ) Do you have a fixed
dll for this patch that I can try?

Thanks

Richard


Teodor Sigaev-2 wrote:
>
> Solved, see attached patch. I had found old Celeron-300 box and install
> Windows
> on it, and it was very slow :)
>
>
>> Nope, same result with this patch.
> Thank you.
>
> --
> Teodor Sigaev                                   E-mail: teodor@sigaev.ru
>                                                     WWW:
> http://www.sigaev.ru/
>
> *** ./contrib/tsearch2.orig/./wordparser/parser.c    Thu Mar 22 18:39:23 2007
> --- ./contrib/tsearch2/./wordparser/parser.c    Thu Mar 22 18:51:23 2007
> ***************
> *** 117,123 ****
>       {
>           if (lc_ctype_is_c())
>           {
> !             unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar);
>
>               /*
>                * any non-ascii symbol with multibyte encoding
> --- 117,123 ----
>       {
>           if (lc_ctype_is_c())
>           {
> !             unsigned int c = *(prs->wstr + prs->state->poschar);
>
>               /*
>                * any non-ascii symbol with multibyte encoding
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>
>

--
View this message in context: http://www.nabble.com/to_tsvector-in-8.2.3-tf3440132.html#a9790691
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: to_tsvector in 8.2.3

From
Teodor Sigaev
Date:
Sorry, no - I tested on CVS HEAD, so dll isn't compatible :(
Wait a bit for 8.2.4

richardcraig wrote:
> Teodor
>
> As a non-C windows user (yes - throw stones at me :) ) Do you have a fixed
> dll for this patch that I can try?
>
> Thanks
>
> Richard
>
>
> Teodor Sigaev-2 wrote:
>> Solved, see attached patch. I had found old Celeron-300 box and install
>> Windows
>> on it, and it was very slow :)
>>
>>
>>> Nope, same result with this patch.
>> Thank you.
>>
>> --
>> Teodor Sigaev                                   E-mail: teodor@sigaev.ru
>>                                                     WWW:
>> http://www.sigaev.ru/
>>
>> *** ./contrib/tsearch2.orig/./wordparser/parser.c    Thu Mar 22 18:39:23 2007
>> --- ./contrib/tsearch2/./wordparser/parser.c    Thu Mar 22 18:51:23 2007
>> ***************
>> *** 117,123 ****
>>       {
>>           if (lc_ctype_is_c())
>>           {
>> !             unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar);
>>
>>               /*
>>                * any non-ascii symbol with multibyte encoding
>> --- 117,123 ----
>>       {
>>           if (lc_ctype_is_c())
>>           {
>> !             unsigned int c = *(prs->wstr + prs->state->poschar);
>>
>>               /*
>>                * any non-ascii symbol with multibyte encoding
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 3: Have you checked our extensive FAQ?
>>
>>                http://www.postgresql.org/docs/faq
>>
>>
>

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/