Thread: regression test crashes at tsearch
Hi, I see a regression test failure in my mingw-vista port when I invoke the command make check MULTIBYTE=euc_jp NO_LOCALE=yes . It causes a crash at tsearch. The crash seems to occur when the server encoding isn't UTF-8 with no locale. The attached is a patch to avoid the crash. regards, Hiroshi Inoue Index: backend/utils/mb/mbutils.c =================================================================== RCS file: /projects/cvsroot/pgsql/src/backend/utils/mb/mbutils.c,v retrieving revision 1.78 diff -c -r1.78 mbutils.c *** backend/utils/mb/mbutils.c 22 Jan 2009 10:09:48 -0000 1.78 --- backend/utils/mb/mbutils.c 17 Feb 2009 21:59:26 -0000 *************** *** 575,580 **** --- 575,584 ---- wchar2char(char *to, const wchar_t *from, size_t tolen) { size_t result; + #ifdef WIN32 + int encoding = GetDatabaseEncoding(); + bool useWcstombs = !(encoding == PG_UTF8 || lc_ctype_is_c()); + #endif if (tolen == 0) return 0; *************** *** 584,602 **** * On Windows, the "Unicode" locales assume UTF16 not UTF8 encoding, * and for some reason mbstowcs and wcstombs won't do this for us, * so we use MultiByteToWideChar(). */ ! if (GetDatabaseEncoding() == PG_UTF8) { ! result = WideCharToMultiByte(CP_UTF8, 0, from, -1, to, tolen, NULL, NULL); /* A zero return is failure */ ! if (result <= 0) result = -1; else { - Assert(result <= tolen); /* Microsoft counts the zero terminator in the result */ ! result--; } } else --- 588,624 ---- * On Windows, the "Unicode" locales assume UTF16 not UTF8 encoding, * and for some reason mbstowcs and wcstombs won't do this for us, * so we use MultiByteToWideChar(). + * Also note wcstombs/mbstowcs is unavailable when LC_CTYPE is C. */ ! if (!useWcstombs) { ! int utf8len = tolen; ! char *utf8str = to; ! ! if (encoding != PG_UTF8) ! { ! utf8len = pg_encoding_max_length(PG_UTF8) * tolen; ! utf8str = palloc(utf8len + 1); ! } ! utf8len = WideCharToMultiByte(CP_UTF8, 0, from, -1, utf8str, utf8len, NULL, NULL); /* A zero return is failure */ ! if (utf8len <= 0) result = -1; else { /* Microsoft counts the zero terminator in the result */ ! result = utf8len - 1; ! if (encoding != PG_UTF8) ! { ! char *mbstr = pg_do_encoding_conversion((unsigned char *) utf8str, result, PG_UTF8, encoding); ! result = strlcpy(to, mbstr, tolen); ! if (utf8str != to) ! pfree(utf8str); ! if (mbstr != utf8str) ! pfree(mbstr); ! } ! Assert(result <= tolen); } } else *************** *** 618,637 **** char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen) { size_t result; if (tolen == 0) return 0; #ifdef WIN32 ! /* See WIN32 "Unicode" comment above */ ! if (GetDatabaseEncoding() == PG_UTF8) { /* Win32 API does not work for zero-length input */ ! if (fromlen == 0) result = 0; else { ! result = MultiByteToWideChar(CP_UTF8, 0, from, fromlen, to, tolen - 1); /* A zero return is failure */ if (result == 0) result = -1; --- 640,672 ---- char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen) { size_t result; + #ifdef WIN32 + int encoding = GetDatabaseEncoding(); + bool useMbstowcs = !(encoding == PG_UTF8 || lc_ctype_is_c()); + #endif if (tolen == 0) return 0; #ifdef WIN32 ! if (!useMbstowcs) { + int utf8len = fromlen; + unsigned char *utf8str = (unsigned char *) from; + + if (encoding != PG_UTF8) + { + utf8str = pg_do_encoding_conversion(from, fromlen, encoding, PG_UTF8); + if (utf8str != from) + utf8len = strlen(utf8str); + } + /* See WIN32 "Unicode" comment above */ /* Win32 API does not work for zero-length input */ ! if (utf8len == 0) result = 0; else { ! result = MultiByteToWideChar(CP_UTF8, 0, utf8str, utf8len, to, tolen - 1); /* A zero return is failure */ if (result == 0) result = -1; *************** *** 643,648 **** --- 678,685 ---- /* Append trailing null wchar (MultiByteToWideChar() does not) */ to[result] = 0; } + if (utf8str != from) + pfree(utf8str); } else #endif /* WIN32 */
Hi. Sorry late reaction. I try check of CVS-HEAD now. $ make check NO_LOCALE=true ... =======================All 120 tests passed. ======================= However, same action as Inou-sane is seen. make check MULTIBYTE=euc_jp NO_LOCALE=true The differences that caused some tests to fail can be viewed in the file "C:/MinGW/home/HIROSHI/pgsql/src/test/regress/regression.diffs". A copy of the test summary that you see above is saved in the file "C:/MinGW/home/HIROSHI/pgsql/src/test/regress/regression.out". make[2]: *** [check] Error 1 make[2]: Leaving directory `/home/HIROSHI/pgsql/src/test/regress' make[1]: *** [check] Error 2 make[1]: Leaving directory `/home/HIROSHI/pgsql/src/test' make: *** [check] Error 2 http://winpg.jp/~saito/pg_bug/20090223-CVS-regression.diffs http://winpg.jp/~saito/pg_bug/20090223-CVS-regression.out Then, It is comfortable after applying Inoue-san patch. make check MULTIBYTE=euc_jp NO_LOCALE=true ... =======================All 120 tests passed. ======================= I think that Mr. Inoue's patch is right. why isn't it taken into consideration yet? Regards, Hiroshi Saito ----- Original Message ----- From: "Hiroshi Inoue" <inoue@tpf.co.jp> > Hi, > > I see a regression test failure in my mingw-vista port > when I invoke the command > make check MULTIBYTE=euc_jp NO_LOCALE=yes > . > It causes a crash at tsearch. > The crash seems to occur when the server encoding isn't > UTF-8 with no locale. > The attached is a patch to avoid the crash. > > regards, > Hiroshi Inoue > > >
> I think that Mr. Inoue's patch is right. > why isn't it taken into consideration yet? I can't check that patch because I don't have a Windows box. But I did some investigations. As I understand, the patch prevents from calling of wcstombs/mbstowcs with C locale and I checked trace for that. But wcstombs/mbstowcs doesn't called at all with C-locale. With C locale char2wchar calls pg_mb2wchar_with_len(), and char2wchar is called from single place: TParserInit. wchar2char isn't called at all. Next, regression tests are successfully passed until ispell_tst configuration was created. Looking at diff SELECT to_tsquery('ispell_tst', 'footballyklubber:b& rebookings:A & sky');^M to_tsquery ^M ! -----------------------------------------------------------------------------------------------------^M ! 'f':B & 'o':B & 'b':B & 'l':B & 'y':B & 'l':B & 'b':B & 'e':B & 'r':A & 'b':A & 'o':A & 'g':A & 'y'^M I believe that even here parser is working correctly but there is problems in ispell. So, although patch fixes the problem, it seems to me patch doesn't do it in right place. The patch converts multibyte string in non-utf encoding with C-locale into wide string by multibyte->utf8->wide conversation instead of pg_mb2wchar_with_len() call. Is output of pg_mb2wchar_with_len() compatible with isalpha (and others) functions? -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Teodor Sigaev wrote: >> I think that Mr. Inoue's patch is right. >> why isn't it taken into consideration yet? > > I can't check that patch because I don't have a Windows box. > But I did some investigations. As I understand, the patch prevents from > calling of wcstombs/mbstowcs with C locale and I checked trace for that. > But wcstombs/mbstowcs doesn't called at all with C-locale. With C > locale char2wchar calls pg_mb2wchar_with_len(), pg_mb2wchar_with_len() converts server encoded strings to pg_wchar strings. But pg_wchar is typedef'd as unsigned int which is not the same as wchar_t at least on Windows (unsigned short). If pg_mb2wchar_with_len() works for wchar_t on Windows, we had better call it in all cases on Windows to implement char2wchar(). > and char2wchar is called from single place: TParserInit. > wchar2char isn't called at all. I modified it corresponding to the change in char2wchar() so that wchar2char(char2wchar(x)) becomes x. Though I'm not sure if it is called or not, isn't it an principle? If there's an effective function like pg_wchar2mb_with_len() which converts wchar_t strings to server encoded strings, we had better simply call it for char2wchar(). regards, Hiroshi Inoue
> pg_mb2wchar_with_len() converts server encoded strings to pg_wchar > strings. But pg_wchar is typedef'd as unsigned int which is not the > same as wchar_t at least on Windows (unsigned short). Oops. The problem is here. TParserInit allocates twice less memory than needed. And it happens if sizeof(wchar_t) < sizeof(pg_wchar) and C-locale for non-Windows box. Also for Windows, encoding should be non-utf. So, all p_is* functions are broken in this case because they work with wrong data. . > I modified it corresponding to the change in char2wchar() so that > wchar2char(char2wchar(x)) becomes x. Though I'm not sure if it is mbstowcs/wcstombs doesn't work with C-locale in other OSes too, so that's not needed. > If there's an effective function like pg_wchar2mb_with_len() which > converts wchar_t strings to server encoded strings, we had better > simply call it for char2wchar(). I don't see a way to produce correct result of char2wchar with C-locale and sizeof(wchar_t) = 2. In summary, I suggest to remove support of C-locale from char2wchar function and tsearch's parser should directly use pg_mb2wchar_with_len() in case of C-locale and multibyte encoding. In all other places char2wchar is called only for non-C locale. Please, test attached patch. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Attachment
Hi Teodor-san. Sorry late reaction. ----- Original Message ----- From: "Teodor Sigaev" <teodor@sigaev.ru> >> If there's an effective function like pg_wchar2mb_with_len() which >> converts wchar_t strings to server encoded strings, we had better >> simply call it for char2wchar(). > I don't see a way to produce correct result of char2wchar with C-locale and > sizeof(wchar_t) = 2. > > In summary, I suggest to remove support of C-locale from char2wchar function and > tsearch's parser should directly use pg_mb2wchar_with_len() in case of > C-locale and multibyte encoding. In all other places char2wchar is called only > for non-C locale. > > Please, test attached patch. Um, I think your patch like the overkill reaction of C-locale... However, I tried your patch. make check MULTIBYTE=euc_jp NO_LOCALE=true ... =======================All 120 tests passed. ======================= Anyway, either should be applied. Thanks. Regards, Hiroshi Saito
> Um, I think your patch like the overkill reaction of C-locale... Patch makes char2wchar and wchar2char symmetric to C-locale. > However, I tried your patch. > make check MULTIBYTE=euc_jp NO_LOCALE=true > ... > ======================= > All 120 tests passed. > ======================= > > Anyway, either should be applied. Thanks. Thank you. Changes are committed up to 8.2 -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
On Wed, Feb 25, 2009 at 6:44 PM, Teodor Sigaev <teodor@sigaev.ru> wrote: > mbstowcs/wcstombs doesn't work with C-locale in other OSes too, so that's > not needed. Say what? What OSes is that? -- greg
> > Say what? What OSes is that? > See attached test program. It tries to convert multibyte russian word in UTF8 to wide char with C, ru_RU-KOI8-R and ru_RU.UTF-8 locales. The word contains 6 letters. FreeBSD 7.2 (short output): ========C========== mbstowcs returns 12 ========ru_RU.KOI8-R========== mbstowcs returns 12 ========ru_RU.UTF-8========== mbstowcs returns 6 Linux 2.6.23 libc 2.5 (short output): ========C========== mbstowcs returns -1 ========ru_RU.KOI8-R========== mbstowcs returns 12 ========ru_RU.UTF-8========== mbstowcs returns 6 The program also prints test of iswalpha: Linux 2.6.23 libc 2.5 (full output): ========C========== mbstowcs returns -1 ERROR ========ru_RU.KOI8-R========== mbstowcs returns 12 0-th chacter is alpha 1-th chacter is NOT alpha 2-th chacter is alpha 3-th chacter is NOT alpha 4-th chacter is alpha 5-th chacter is NOT alpha 6-th chacter is alpha 7-th chacter is NOT alpha 8-th chacter is alpha 9-th chacter is NOT alpha 10-th chacter is alpha 11-th chacter is NOT alpha ========ru_RU.UTF-8========== mbstowcs returns 6 0-th chacter is alpha 1-th chacter is alpha 2-th chacter is alpha 3-th chacter is alpha 4-th chacter is alpha 5-th chacter is alpha
Attachment
Hmm well the KOI8 tests unsurprisingly produce random results on non- KOI8 input. It's pure chance you didn't get EILSEQ. What errno did you get for the C locale test? On which input character? Perhaps it's sihnalljng EILSEQ for every byte >0x80 ? That seems broken to me but perhaps not to a glibc pedant out there. -- Greg On 2 Mar 2009, at 19:17, Teodor Sigaev <teodor@sigaev.ru> wrote: >> Say what? What OSes is that? > See attached test program. It tries to convert multibyte russian > word in UTF8 to wide char with C, ru_RU-KOI8-R and ru_RU.UTF-8 > locales. The word contains 6 letters. > > FreeBSD 7.2 (short output): > ========C========== > mbstowcs returns 12 > ========ru_RU.KOI8-R========== > mbstowcs returns 12 > ========ru_RU.UTF-8========== > mbstowcs returns 6 > > Linux 2.6.23 libc 2.5 (short output): > ========C========== > mbstowcs returns -1 > ========ru_RU.KOI8-R========== > mbstowcs returns 12 > ========ru_RU.UTF-8========== > mbstowcs returns 6 > > > The program also prints test of iswalpha: > Linux 2.6.23 libc 2.5 (full output): > ========C========== > mbstowcs returns -1 > ERROR > ========ru_RU.KOI8-R========== > mbstowcs returns 12 > 0-th chacter is alpha > 1-th chacter is NOT alpha > 2-th chacter is alpha > 3-th chacter is NOT alpha > 4-th chacter is alpha > 5-th chacter is NOT alpha > 6-th chacter is alpha > 7-th chacter is NOT alpha > 8-th chacter is alpha > 9-th chacter is NOT alpha > 10-th chacter is alpha > 11-th chacter is NOT alpha > ========ru_RU.UTF-8========== > mbstowcs returns 6 > 0-th chacter is alpha > 1-th chacter is alpha > 2-th chacter is alpha > 3-th chacter is alpha > 4-th chacter is alpha > 5-th chacter is alpha > <t.c.gz>
> Hmm well the KOI8 tests unsurprisingly produce random results on > non-KOI8 input. It's pure chance you didn't get EILSEQ. Because KOI8 is not multibyte encoding. > What errno did you get for the C locale test? On which input > character?Perhaps it's sihnalljng EILSEQ for every byte >0x80 ? That > seems broken to me but perhaps not to a glibc pedant out there. Linux ========C========== mbstowcs returns -1 errno: 84 Invalid or incomplete multibyte or wide character ========ru_RU.KOI8-R========== mbstowcs returns 12 errno: 0 Success ========ru_RU.UTF-8========== mbstowcs returns 6 errno: 0 Success FreeBSD ========C========== mbstowcs returns 12 errno: 0 Unknown error: 0 ========ru_RU.KOI8-R========== mbstowcs returns 12 errno: 0 Unknown error: 0 ========ru_RU.UTF-8========== mbstowcs returns 6 errno: 0 Unknown error: 0 In any case, we could not trust mbstowcs if current locale is not match to encoding. And we could not check every pair of locale/encoding but mbstowcs with C-locale and multibyte encoding obviously doesn't work.