Thread: regression test crashes at tsearch

regression test crashes at tsearch

From
Hiroshi Inoue
Date:
Hi,

I see a regression test failure in my mingw-vista port
when I invoke the command
  make check MULTIBYTE=euc_jp NO_LOCALE=yes
.
It causes a crash at tsearch.
The crash seems to occur when the server encoding isn't
UTF-8 with no locale.
The attached is a patch to avoid the crash.

regards,
Hiroshi Inoue


Index: backend/utils/mb/mbutils.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/mb/mbutils.c,v
retrieving revision 1.78
diff -c -r1.78 mbutils.c
*** backend/utils/mb/mbutils.c    22 Jan 2009 10:09:48 -0000    1.78
--- backend/utils/mb/mbutils.c    17 Feb 2009 21:59:26 -0000
***************
*** 575,580 ****
--- 575,584 ----
  wchar2char(char *to, const wchar_t *from, size_t tolen)
  {
      size_t result;
+ #ifdef    WIN32
+     int    encoding = GetDatabaseEncoding();
+     bool    useWcstombs = !(encoding == PG_UTF8 || lc_ctype_is_c());
+ #endif

      if (tolen == 0)
          return 0;
***************
*** 584,602 ****
       * On Windows, the "Unicode" locales assume UTF16 not UTF8 encoding,
       * and for some reason mbstowcs and wcstombs won't do this for us,
       * so we use MultiByteToWideChar().
       */
!     if (GetDatabaseEncoding() == PG_UTF8)
      {
!         result = WideCharToMultiByte(CP_UTF8, 0, from, -1, to, tolen,
                                  NULL, NULL);
          /* A zero return is failure */
!         if (result <= 0)
              result = -1;
          else
          {
-             Assert(result <= tolen);
              /* Microsoft counts the zero terminator in the result */
!             result--;
          }
      }
      else
--- 588,624 ----
       * On Windows, the "Unicode" locales assume UTF16 not UTF8 encoding,
       * and for some reason mbstowcs and wcstombs won't do this for us,
       * so we use MultiByteToWideChar().
+      * Also note wcstombs/mbstowcs is unavailable when LC_CTYPE is C.
       */
!     if (!useWcstombs)
      {
!         int    utf8len = tolen;
!         char *utf8str = to;
!
!         if (encoding != PG_UTF8)
!         {
!             utf8len = pg_encoding_max_length(PG_UTF8) * tolen;
!             utf8str = palloc(utf8len + 1);
!         }
!         utf8len = WideCharToMultiByte(CP_UTF8, 0, from, -1, utf8str, utf8len,
                                  NULL, NULL);
          /* A zero return is failure */
!         if (utf8len <= 0)
              result = -1;
          else
          {
              /* Microsoft counts the zero terminator in the result */
!             result = utf8len - 1;
!             if (encoding != PG_UTF8)
!             {
!                 char *mbstr = pg_do_encoding_conversion((unsigned char *) utf8str, result, PG_UTF8, encoding);
!                 result = strlcpy(to, mbstr, tolen);
!                 if (utf8str != to)
!                     pfree(utf8str);
!                 if (mbstr != utf8str)
!                     pfree(mbstr);
!             }
!             Assert(result <= tolen);
          }
      }
      else
***************
*** 618,637 ****
  char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen)
  {
      size_t        result;

      if (tolen == 0)
          return 0;

  #ifdef WIN32
!     /* See WIN32 "Unicode" comment above */
!     if (GetDatabaseEncoding() == PG_UTF8)
      {
          /* Win32 API does not work for zero-length input */
!         if (fromlen == 0)
              result = 0;
          else
          {
!             result = MultiByteToWideChar(CP_UTF8, 0, from, fromlen, to, tolen - 1);
              /* A zero return is failure */
              if (result == 0)
                  result = -1;
--- 640,672 ----
  char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen)
  {
      size_t        result;
+ #ifdef    WIN32
+     int    encoding = GetDatabaseEncoding();
+     bool    useMbstowcs = !(encoding == PG_UTF8 || lc_ctype_is_c());
+ #endif

      if (tolen == 0)
          return 0;

  #ifdef WIN32
!     if (!useMbstowcs)
      {
+         int    utf8len = fromlen;
+         unsigned char *utf8str = (unsigned char *) from;
+
+         if (encoding != PG_UTF8)
+         {
+             utf8str = pg_do_encoding_conversion(from, fromlen, encoding, PG_UTF8);
+             if (utf8str != from)
+                 utf8len = strlen(utf8str);
+         }
+         /* See WIN32 "Unicode" comment above */
          /* Win32 API does not work for zero-length input */
!         if (utf8len == 0)
              result = 0;
          else
          {
!             result = MultiByteToWideChar(CP_UTF8, 0, utf8str, utf8len, to, tolen - 1);
              /* A zero return is failure */
              if (result == 0)
                  result = -1;
***************
*** 643,648 ****
--- 678,685 ----
              /* Append trailing null wchar (MultiByteToWideChar() does not) */
              to[result] = 0;
          }
+         if (utf8str != from)
+             pfree(utf8str);
      }
      else
  #endif   /* WIN32 */

Re: regression test crashes at tsearch

From
"Hiroshi Saito"
Date:
Hi.

Sorry late reaction.

I try check of CVS-HEAD now.

$ make check NO_LOCALE=true
...
=======================All 120 tests passed.
=======================

However, same action as Inou-sane is seen.

make check MULTIBYTE=euc_jp NO_LOCALE=true

The differences that caused some tests to fail can be viewed in the
file "C:/MinGW/home/HIROSHI/pgsql/src/test/regress/regression.diffs".  A copy of the test summary 
that you see
above is saved in the file "C:/MinGW/home/HIROSHI/pgsql/src/test/regress/regression.out".

make[2]: *** [check] Error 1
make[2]: Leaving directory `/home/HIROSHI/pgsql/src/test/regress'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/home/HIROSHI/pgsql/src/test'
make: *** [check] Error 2

http://winpg.jp/~saito/pg_bug/20090223-CVS-regression.diffs
http://winpg.jp/~saito/pg_bug/20090223-CVS-regression.out

Then, It is comfortable after applying Inoue-san patch.
make check MULTIBYTE=euc_jp NO_LOCALE=true
...
=======================All 120 tests passed.
=======================

I think that Mr. Inoue's patch is right.
why isn't it taken into consideration yet?

Regards,
Hiroshi Saito

----- Original Message ----- 
From: "Hiroshi Inoue" <inoue@tpf.co.jp>


> Hi,
>
> I see a regression test failure in my mingw-vista port
> when I invoke the command
>  make check MULTIBYTE=euc_jp NO_LOCALE=yes
> .
> It causes a crash at tsearch.
> The crash seems to occur when the server encoding isn't
> UTF-8 with no locale.
> The attached is a patch to avoid the crash.
>
> regards,
> Hiroshi Inoue
>
>
> 



Re: regression test crashes at tsearch

From
Teodor Sigaev
Date:
> I think that Mr. Inoue's patch is right.
> why isn't it taken into consideration yet?

I can't check that patch because I don't have a Windows box.
But I did some investigations. As I understand, the patch prevents from calling 
of wcstombs/mbstowcs with C locale and I checked trace for that. But 
wcstombs/mbstowcs doesn't called at all with C-locale.  With C locale char2wchar 
calls pg_mb2wchar_with_len(), and char2wchar is called from single place: 
TParserInit. wchar2char isn't called at all. Next, regression tests are 
successfully passed until ispell_tst configuration was created. Looking at diff   SELECT to_tsquery('ispell_tst',
'footballyklubber:b& rebookings:A & sky');^M                                               to_tsquery
     ^M
 
! 
-----------------------------------------------------------------------------------------------------^M
!  'f':B & 'o':B & 'b':B & 'l':B & 'y':B & 'l':B & 'b':B & 'e':B & 'r':A & 'b':A 
& 'o':A & 'g':A & 'y'^M

I believe that even here parser is working correctly but there is problems in 
ispell.

So, although patch fixes the problem, it seems to me patch doesn't do it in 
right place. The patch converts multibyte string in non-utf encoding with 
C-locale into wide string by multibyte->utf8->wide conversation instead of 
pg_mb2wchar_with_len() call. Is output of pg_mb2wchar_with_len() compatible with 
isalpha (and others) functions?



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: regression test crashes at tsearch

From
Hiroshi Inoue
Date:
Teodor Sigaev wrote:
>> I think that Mr. Inoue's patch is right.
>> why isn't it taken into consideration yet?
> 
> I can't check that patch because I don't have a Windows box.
> But I did some investigations. As I understand, the patch prevents from 
> calling of wcstombs/mbstowcs with C locale and I checked trace for that. 
> But wcstombs/mbstowcs doesn't called at all with C-locale.  With C 
> locale char2wchar calls pg_mb2wchar_with_len(), 

pg_mb2wchar_with_len() converts server encoded strings to pg_wchar
strings. But pg_wchar is typedef'd as unsigned int which is not the
same as wchar_t at least on Windows (unsigned short).
If pg_mb2wchar_with_len() works for wchar_t on Windows, we had better call it in all cases on Windows to implement
char2wchar().
> and char2wchar is called from single place: TParserInit.
> wchar2char isn't called at all.

I modified it corresponding to the change in char2wchar() so that
wchar2char(char2wchar(x)) becomes x. Though I'm not sure if it is
called or not, isn't it an principle?
If there's an effective function like pg_wchar2mb_with_len() which
converts wchar_t strings to server encoded strings, we had better
simply call it for char2wchar().

regards,
Hiroshi Inoue



Re: regression test crashes at tsearch

From
Teodor Sigaev
Date:
> pg_mb2wchar_with_len() converts server encoded strings to pg_wchar
> strings. But pg_wchar is typedef'd as unsigned int which is not the
> same as wchar_t at least on Windows (unsigned short).
Oops. The problem is here. TParserInit allocates twice less memory than needed.
And it happens if sizeof(wchar_t) < sizeof(pg_wchar) and C-locale for
non-Windows box. Also for Windows, encoding should be non-utf. So, all p_is*
functions are broken in this case because they work with wrong data.

.
> I modified it corresponding to the change in char2wchar() so that
> wchar2char(char2wchar(x)) becomes x. Though I'm not sure if it is
mbstowcs/wcstombs doesn't work with C-locale in other OSes too, so that's not
needed.

> If there's an effective function like pg_wchar2mb_with_len() which
> converts wchar_t strings to server encoded strings, we had better
> simply call it for char2wchar().
I don't see a way to produce correct result of char2wchar with C-locale and
sizeof(wchar_t) = 2.

In summary, I suggest to remove support of C-locale from char2wchar function and
  tsearch's parser should directly use pg_mb2wchar_with_len() in case of
C-locale and multibyte encoding. In all other places char2wchar is called only
for non-C locale.

Please, test attached patch.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: regression test crashes at tsearch

From
"Hiroshi Saito"
Date:
Hi Teodor-san.

Sorry late reaction.

----- Original Message ----- 
From: "Teodor Sigaev" <teodor@sigaev.ru>


>> If there's an effective function like pg_wchar2mb_with_len() which
>> converts wchar_t strings to server encoded strings, we had better
>> simply call it for char2wchar().
> I don't see a way to produce correct result of char2wchar with C-locale and 
> sizeof(wchar_t) = 2.
> 
> In summary, I suggest to remove support of C-locale from char2wchar function and 
>  tsearch's parser should directly use pg_mb2wchar_with_len() in case of 
> C-locale and multibyte encoding. In all other places char2wchar is called only 
> for non-C locale.
> 
> Please, test attached patch.

Um, I think your patch like the overkill reaction of C-locale...
However, I tried your patch. 

make check MULTIBYTE=euc_jp NO_LOCALE=true
...
=======================All 120 tests passed.
=======================

Anyway, either should be applied. 
Thanks.

Regards,
Hiroshi Saito


Re: regression test crashes at tsearch

From
Teodor Sigaev
Date:
> Um, I think your patch like the overkill reaction of C-locale...
Patch makes char2wchar and wchar2char symmetric to C-locale.

> However, I tried your patch.
> make check MULTIBYTE=euc_jp NO_LOCALE=true
> ...
> =======================
> All 120 tests passed.
> =======================
> 
> Anyway, either should be applied. Thanks.
Thank you. Changes are committed up to 8.2
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: regression test crashes at tsearch

From
Greg Stark
Date:
On Wed, Feb 25, 2009 at 6:44 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> mbstowcs/wcstombs doesn't work with C-locale in other OSes too, so that's
> not needed.


Say what? What OSes is that?

-- 
greg


Re: regression test crashes at tsearch

From
Teodor Sigaev
Date:
>
> Say what? What OSes is that?
>
See attached test program. It tries to convert multibyte russian word in
UTF8 to wide char with C, ru_RU-KOI8-R and ru_RU.UTF-8 locales. The word
contains 6 letters.

FreeBSD 7.2 (short output):
========C==========
mbstowcs returns 12
========ru_RU.KOI8-R==========
mbstowcs returns 12
========ru_RU.UTF-8==========
mbstowcs returns 6

Linux 2.6.23 libc 2.5 (short output):
========C==========
mbstowcs returns -1
========ru_RU.KOI8-R==========
mbstowcs returns 12
========ru_RU.UTF-8==========
mbstowcs returns 6


The program also prints test of iswalpha:
Linux 2.6.23 libc 2.5 (full output):
========C==========
mbstowcs returns -1
ERROR
========ru_RU.KOI8-R==========
mbstowcs returns 12
         0-th chacter is alpha
         1-th chacter is NOT alpha
         2-th chacter is alpha
         3-th chacter is NOT alpha
         4-th chacter is alpha
         5-th chacter is NOT alpha
         6-th chacter is alpha
         7-th chacter is NOT alpha
         8-th chacter is alpha
         9-th chacter is NOT alpha
         10-th chacter is alpha
         11-th chacter is NOT alpha
========ru_RU.UTF-8==========
mbstowcs returns 6
         0-th chacter is alpha
         1-th chacter is alpha
         2-th chacter is alpha
         3-th chacter is alpha
         4-th chacter is alpha
         5-th chacter is alpha

Attachment

Re: regression test crashes at tsearch

From
Greg Stark
Date:
Hmm well the KOI8 tests unsurprisingly produce random results on non- 
KOI8 input. It's pure chance you didn't get EILSEQ.

What errno did you get for the C locale test? On which input character? 
Perhaps it's sihnalljng EILSEQ for every byte >0x80 ? That seems  
broken to me but perhaps not to a glibc pedant out there.

-- 
Greg


On 2 Mar 2009, at 19:17, Teodor Sigaev <teodor@sigaev.ru> wrote:

>> Say what? What OSes is that?
> See attached test program. It tries to convert multibyte russian  
> word in UTF8 to wide char with C, ru_RU-KOI8-R and ru_RU.UTF-8  
> locales. The word contains 6 letters.
>
> FreeBSD 7.2 (short output):
> ========C==========
> mbstowcs returns 12
> ========ru_RU.KOI8-R==========
> mbstowcs returns 12
> ========ru_RU.UTF-8==========
> mbstowcs returns 6
>
> Linux 2.6.23 libc 2.5 (short output):
> ========C==========
> mbstowcs returns -1
> ========ru_RU.KOI8-R==========
> mbstowcs returns 12
> ========ru_RU.UTF-8==========
> mbstowcs returns 6
>
>
> The program also prints test of iswalpha:
> Linux 2.6.23 libc 2.5 (full output):
> ========C==========
> mbstowcs returns -1
> ERROR
> ========ru_RU.KOI8-R==========
> mbstowcs returns 12
>        0-th chacter is alpha
>        1-th chacter is NOT alpha
>        2-th chacter is alpha
>        3-th chacter is NOT alpha
>        4-th chacter is alpha
>        5-th chacter is NOT alpha
>        6-th chacter is alpha
>        7-th chacter is NOT alpha
>        8-th chacter is alpha
>        9-th chacter is NOT alpha
>        10-th chacter is alpha
>        11-th chacter is NOT alpha
> ========ru_RU.UTF-8==========
> mbstowcs returns 6
>        0-th chacter is alpha
>        1-th chacter is alpha
>        2-th chacter is alpha
>        3-th chacter is alpha
>        4-th chacter is alpha
>        5-th chacter is alpha
> <t.c.gz>


Re: regression test crashes at tsearch

From
Teodor Sigaev
Date:
> Hmm well the KOI8 tests unsurprisingly produce random results on
> non-KOI8 input. It's pure chance you didn't get EILSEQ.
Because KOI8 is not multibyte encoding.

> What errno did you get for the C locale test? On which input
> character?Perhaps it's sihnalljng EILSEQ for every byte >0x80 ? That
> seems broken to me but perhaps not to a glibc pedant out there.

Linux
========C==========
mbstowcs returns -1 errno: 84 Invalid or incomplete multibyte or wide
character
========ru_RU.KOI8-R==========
mbstowcs returns 12 errno: 0 Success
========ru_RU.UTF-8==========
mbstowcs returns 6 errno: 0 Success

FreeBSD
========C==========
mbstowcs returns 12 errno: 0 Unknown error: 0
========ru_RU.KOI8-R==========
mbstowcs returns 12 errno: 0 Unknown error: 0
========ru_RU.UTF-8==========
mbstowcs returns 6 errno: 0 Unknown error: 0

In any case, we could not trust mbstowcs if current locale is not match
to encoding. And we could not check every pair of locale/encoding but
mbstowcs with C-locale and multibyte encoding obviously doesn't work.


Attachment