Thread: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
Yes, I do build with --enable-locale, but I do not build with --enable-mb and do not use client_encoding or server_encoding.

The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the environment in both server and client, and this has worked for me in 6.4.2.

Regards,
Daniel

>>>"Hiroshi Inoue" said:
> > -----Original Message-----
> > From: owner-pgsql-hackers@postgreSQL.org
> > [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Daniel Kalchev
> > Sent: Wednesday, June 09, 1999 5:04 PM
> > To: pgsql-hackers@hub.org
> > Subject: [HACKERS] Postgres 6.5 beta2 and beta3 problem
> >
> > Hello,
> >
> > Sorry that I picked this up too late in the release cycle, but other work
> > prevented my earlier involvement in the testing.
> >
> > There are at least two serious problems that I have discovered so far with
> > Postgres 6.5 beta2 and beta3 (running on BSD/OS 4.0.1):
> >
> > 1. LIKE with indexes works worse than without indexes.
>
> Did you build with --enable-locale ?
>
> Regards.
>
> Hiroshi Inoue
> Inoue@tpf.co.jp
> Yes,
>
> I do build with --enable-locale, but I do not build with --enable-mb and do
> not use client_encoding or server_encoding.
>
> The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
> environment in both server and client, and this has worked for me in 6.4.2.

This certainly explains it. With locale enabled, LIKE does not use indexes: we can't do the indexing trick with non-ASCII character sets, because we can't figure out the maximum character value for a particular encoding.

We didn't do the check in 6.4.*, and LIKE was not returning the proper results for queries at those sites that used locale.

-- 
Bruce Momjian                        |  http://www.op.net/~candle
maillist@candle.pha.pa.us            |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce,

This is extremely bad news, because it will make PostgreSQL 6.5 unusable for most of my applications. Perhaps something can be done to resolve this situation?

I understand the problem for 16-bit character support, but for most of the encodings that support only 8-bit characters it should be safe to assume the maximum character value is 255.

Anyway, making this check compile-time defined would certainly fix things here, because in my case the cyrillic letter order matches that of the binary encoding (that is, the first alphabet letter is before the second, etc.).

Perhaps the locale data can be used to gather this information?

I will do some testing without using locale to see what happens.

Regards,
Daniel

>>>Bruce Momjian said:
> > Yes,
> >
> > I do build with --enable-locale, but I do not build with --enable-mb and do
> > not use client_encoding or server_encoding.
> >
> > The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
> > environment in both server and client, and this has worked for me in 6.4.2.
>
> This certainly explains it. With locale enabled, LIKE does not use
> indexes because we can't figure out how to do the indexing trick with
> non-ASCII character sets because we can't figure out the maximum
> character value for a particular encoding.
>
> We didn't do the check in 6.4.*, and LIKE was not returning the proper
> results for queries at those sites that used locale.
On Wed, 9 Jun 1999, Bruce Momjian wrote:

> Date: Wed, 9 Jun 1999 08:43:50 -0400 (EDT)
> From: Bruce Momjian <maillist@candle.pha.pa.us>
> To: Daniel Kalchev <daniel@digsys.bg>
> Cc: Hiroshi Inoue <Inoue@tpf.co.jp>, pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
>
> > Yes,
> >
> > I do build with --enable-locale, but I do not build with --enable-mb and do
> > not use client_encoding or server_encoding.
> >
> > The content of the keys is in cyrillic. I have LC_CTYPE=CP1251 in the
> > environment in both server and client, and this has worked for me in 6.4.2.
>
> This certainly explains it. With locale enabled, LIKE does not use
> indexes because we can't figure out how to do the indexing trick with
> non-ASCII character sets because we can't figure out the maximum
> character value for a particular encoding.

If so, why does explain report 'Index Scan ...'?

apod=> \d
Database    = apod
+------------------+----------------------------------+----------+
|  Owner           |             Relation             |   Type   |
+------------------+----------------------------------+----------+
| megera           | idx_adate                        | index    |
| megera           | idx_atitle                       | index    |
| megera           | idx_url                          | index    |
| megera           | titles                           | table    |
+------------------+----------------------------------+----------+

apod=> explain select atitle from titles where atitle like 'Sun%';
NOTICE:  QUERY PLAN:

Index Scan using idx_atitle on titles  (cost=33.28 rows=1 width=12)

Current cvs, --enable-locale --with-mb=KOI8

Regards,
Oleg

> We didn't do the check in 6.4.*, and LIKE was not returning the proper
> results for queries at those sites that used locale.

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
On Wed, 9 Jun 1999, Daniel Kalchev wrote:

> Date: Wed, 09 Jun 1999 16:15:58 +0300
> From: Daniel Kalchev <daniel@digsys.bg>
> To: Bruce Momjian <maillist@candle.pha.pa.us>
> Cc: pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] Postgres 6.5 beta2 and beta3 problem
>
> Bruce,
>
> This is extremely bad news, because it will make PostgreSQL 6.5 unusable for
> most of my applications. Perhaps something can be done to resolve this
> situation?
>
> I understand the problem for 16-bit character support, but for most of the
> encodings that support only 8-bit characters it should be safe to assume the
> maximum character value is 255.
>
> Anyway, making this check compile-time defined would certainly fix things
> here, because in my case the cyrillic letter order matches that of the binary
> encoding (that is, the first alphabet letter is before the second, etc.).
>
> Perhaps the locale data can be used to gather this information?

It's certainly there! Locale data contains all the information about a specific character set and encoding. Is it possible to use it to create indices? It would be slower, but the benefit of using indices would cover the expense for non-US people. I didn't notice such behaviour in Informix or Oracle. Fixing this would be a good point for the popularity of Postgres. Is there any chance to place it in TODO for 6.6? At least explain should reflect this fact!

Regards,
Oleg

> I will do some testing without using locale to see what happens.
>
> Regards,
> Daniel

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> This certainly explains it. With locale enabled, LIKE does not use
> indexes because we can't figure out how to do the indexing trick with
> non-ASCII character sets because we can't figure out the maximum
> character value for a particular encoding.

We don't actually need the *maximum* character value, what we need is to be able to generate a *slightly larger* character value.

For example, what the parser is doing now:

    fld LIKE 'abc%'   ==>   fld <= 'abc\377'

is not even really right in ASCII locale, because it will reject a data value like 'abc\377x'.

I think what we really want is to generate the "next value of the same length" and use a < comparison. In ASCII locale this means

    fld LIKE 'abc%'   ==>   fld < 'abd'

which is reliable regardless of what comes after abc in the data.

The trick is to figure out a "next" value without assuming a lot about the local character set and collation sequence. I had been thinking about a brute force method: generate a string and check to see whether strcmp claims it's greater than the original or not; if not, increment the last byte and try again. You'd also have to be able to back up and increment earlier bytes if you maxed out the last one. For example, in French locale,

    fld LIKE 'ab\376%'

you'd first produce 'ab\377' but discover that it's less than 'ab\376' (because \377 is y-dieresis, which sorts like 'y'). Your next try must be 'ac\377', which will succeed.

But I am worried whether this trick will work in multibyte locales --- incrementing the last byte might generate an invalid character sequence and produce unpredictable results from strcmp. So we need some help from someone who knows a lot about collation orders and multibyte character representations.

        regards, tom lane
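Tom's brute-force search can be sketched in C. This is only an illustration under the thread's assumptions (single-byte strings, strcoll() reflecting the active locale); the function name is hypothetical, and where Tom keeps the string length, this sketch truncates instead, which still yields a valid exclusive upper bound:

```c
#include <stdio.h>
#include <string.h>
#include <locale.h>

/*
 * Given the fixed prefix of a LIKE pattern, find a string that sorts
 * strictly greater than every value starting with that prefix, by bumping
 * bytes and checking with strcoll() so the locale's collation is honoured.
 */
static int
make_greater_string(const char *prefix, char *out, size_t outlen)
{
    size_t len = strlen(prefix);

    if (len + 1 > outlen)
        return 0;
    memcpy(out, prefix, len + 1);

    while (len > 0)
    {
        unsigned char *last = (unsigned char *) &out[len - 1];

        /* try successively larger values for the last remaining byte */
        while (*last < 0xff)
        {
            (*last)++;
            if (strcoll(out, prefix) > 0)
                return 1;       /* e.g. "abc" -> "abd" in the C locale */
        }

        /* last byte exhausted: truncate and bump the byte before it */
        out[--len] = '\0';
    }
    return 0;   /* no strictly greater string could be built */
}
```

In the C locale this turns 'abc' into 'abd'; in Tom's French example it would plausibly walk from 'ab\376' through the rejected 'ab\377' and stop at 'ac', which also bounds every 'ab\376...' value from above. His worry about multibyte sequences applies unchanged: the incremented bytes may not form valid characters.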
Tom Lane wrote:
> We don't actually need the *maximum* character value, what we need is
> to be able to generate a *slightly larger* character value.
>
> For example, what the parser is doing now:
>     fld LIKE 'abc%'  ==>  fld <= 'abc\377'
> is not even really right in ASCII locale, because it will reject a
> data value like 'abc\377x'.
>
> I think what we really want is to generate the "next value of the
> same length" and use a < comparison. In ASCII locale this means
>     fld LIKE 'abc%'  ==>  fld < 'abd'
> which is reliable regardless of what comes after abc in the data.
> The trick is to figure out a "next" value without assuming a lot
> about the local character set and collation sequence.

in single-byte locales it should be easy:

1. sort a char[256] array from 0-255 using the current locale settings;
   do it once, either at startup or when first needed.

2. use binary search on that array to locate the last char before % in
   this sorted array:
   if (it is not the last char in the sorted array)
   then (replace that char with the one at index+1)
   else (if (it is not the first char in the like string)
         then (discard the last char and goto 2.)
         else (don't do the end restriction))

some locales where the string is already sorted may use special treatment (ASCII, CYRILLIC)

> But I am worried whether this trick will work in multibyte locales ---
> incrementing the last byte might generate an invalid character sequence
> and produce unpredictable results from strcmp. So we need some help
> from someone who knows a lot about collation orders and multibyte
> character representations.

for double-byte locales something similar should work, but getting the initial array is probably tricky

----------------
Hannu
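Hannu's two steps can be sketched as follows. The table is built once with qsort() over strcoll(); all names here are made up for illustration, and a linear scan stands in for the binary search he suggests:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build, once, a table of the byte values 1..255 sorted by the current
 * locale's collation.  The "next" character after c is then simply the
 * entry that follows c in the table.
 */
static unsigned char collation_order[255];

static int
coll_cmp(const void *a, const void *b)
{
    /* compare two bytes as one-character strings under the locale */
    char s1[2] = { (char) *(const unsigned char *) a, '\0' };
    char s2[2] = { (char) *(const unsigned char *) b, '\0' };

    return strcoll(s1, s2);
}

static void
build_collation_order(void)
{
    int i;

    for (i = 0; i < 255; i++)
        collation_order[i] = (unsigned char) (i + 1);   /* skip NUL */
    qsort(collation_order, 255, 1, coll_cmp);
}

/* Return the character collating just after c, or 0 if c is the last one. */
static unsigned char
next_in_collation(unsigned char c)
{
    int i;

    for (i = 0; i < 254; i++)   /* a binary search would also work */
        if (collation_order[i] == c)
            return collation_order[i + 1];
    return 0;
}
```

When next_in_collation() returns 0, the caller falls back to his "discard the last char" step, or drops the end restriction entirely.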
It was me who found the bug and supplied the ugly fix, but that was the only quick fix we could find until someone figures out how to get the correct one from locale.

USE_LOCALE uses strcoll instead of strcmp for comparison and indexes, and it sorts correctly, but it does not work with the MakeIndexable trick, since \377 might not be the highest char in the alphabet, or worse, might not be a char of the alphabet at all (as in the sv/fi locales).

The patch gives a slower correct result instead of a fast incorrect one, which is better IMHO, but maybe I am old-fashioned. :-)

Anyway, I think a correct solution would be for "a LIKE 'abc%'" to generate:

    a >= 'abc\0' AND a < 'abd'

where 'd' is the last char before % plus 1, or more correctly, the next char in the alphabet after the last char before %.

Still looking for a locale guru. Hello!! :-(

regards,
-- 
-----------------
Göran Thyni
This is Penguin Country. On a quiet night you can hear Windows NT reboot!
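The rewrite described above can be illustrated with a small helper that produces the two range bounds from a LIKE prefix. This is an ASCII-only simplification with a hypothetical name; a locale-correct version would take the successor character from the collation order, as discussed elsewhere in the thread:

```c
#include <string.h>

/*
 * Sketch of:  a LIKE 'abc%'  =>  a >= 'abc' AND a < 'abd'
 * The lower bound is the prefix itself; the upper bound bumps the last
 * prefix character to its successor (here simply the next byte value).
 */
static void
like_prefix_bounds(const char *prefix, char *lower, char *upper)
{
    size_t len = strlen(prefix);

    strcpy(lower, prefix);              /* a >= 'abc' */
    strcpy(upper, prefix);
    if (len > 0)
        upper[len - 1]++;               /* a <  'abd' */
}
```

The point of the pair of bounds is exactly Göran's: both are index-usable range comparisons, and together they are correct regardless of what follows the prefix in the data, unlike the old 'abc\377' upper bound.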
> Bruce,
>
> This is extremely bad news, because it will make PostgreSQL 6.5 unusable for
> most of my applications. Perhaps something can be done to resolve this
> situation?
>
> I understand the problem for 16-bit character support, but for most of the
> encodings that support only 8-bit characters it should be safe to assume the
> maximum character value is 255.
>
> Anyway, making this check compile-time defined would certainly fix things
> here, because in my case the cyrillic letter order matches that of the binary
> encoding (that is, the first alphabet letter is before the second, etc.).
>
> Perhaps the locale data can be used to gather this information?
>
> I will do some testing without using locale to see what happens.

The locale check is near the bottom of parser/gram.y, so you can certainly enable indexing there.

I am told that French does not have 255 as its max character, and there is no collating interface to request the highest character. I suppose one hack would be to go out and test all the char values to see which is highest.

-- 
Bruce Momjian                        |  http://www.op.net/~candle
maillist@candle.pha.pa.us            |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> It's certainly there! Locale data contains all the information about a
> specific character set and encoding. Is it possible to use it to create
> indices? It would be slower, but the benefit of using indices would cover
> the expense for non-US people. I didn't notice such behaviour in Informix
> or Oracle. Fixing this would be a good point for the popularity of
> Postgres. Is there any chance to place it in TODO for 6.6?
> At least explain should reflect this fact!

Added to TODO:

    * Allow indexing of LIKE with locale character sets

-- 
Bruce Momjian                        |  http://www.op.net/~candle
maillist@candle.pha.pa.us            |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Yes. This pretty much sums up the problem. > Bruce Momjian <maillist@candle.pha.pa.us> writes: > > This certainly explains it. With locale enabled, LIKE does not use > > indexes because we can't figure out how to do the indexing trick with > > non-ASCII character sets because we can't figure out the maximum > > character value for a particular encoding. > > We don't actually need the *maximum* character value, what we need is > to be able to generate a *slightly larger* character value. > > For example, what the parser is doing now: > fld LIKE 'abc%' ==> fld <= 'abc\377' > is not even really right in ASCII locale, because it will reject a > data value like 'abc\377x'. > > I think what we really want is to generate the "next value of the > same length" and use a < comparison. In ASCII locale this means > fld LIKE 'abc%' ==> fld < 'abd' > which is reliable regardless of what comes after abc in the data. > > The trick is to figure out a "next" value without assuming a lot > about the local character set and collation sequence. I had > been thinking about a brute force method: generate a string and > check to see whether strcmp claims it's greater than the original > or not; if not, increment the last byte and try again. You'd > also have to be able to back up and increment earlier bytes if > you maxed out the last one. For example, in French locale, > fld LIKE 'ab\376%' > you'd first produce 'ab\377' but discover that it's less than > 'ab\376' (because \377 is y-dieresis which sorts like 'y'). > Your next try must be 'ac\377' which will succeed. > > But I am worried whether this trick will work in multibyte locales --- > incrementing the last byte might generate an invalid character sequence > and produce unpredictable results from strcmp. So we need some help > from someone who knows a lot about collation orders and multibyte > character representations. 
>
>       regards, tom lane

-- 
Bruce Momjian                        |  http://www.op.net/~candle
maillist@candle.pha.pa.us            |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>>>Bruce Momjian said:
> The locale check is near the bottom of parser/gram.y, so you can
> certainly enable indexing there.

I commented out the two ifdefs that had USE_LOCALE in gram.y, and now LIKE uses indexes for me...

> I am told that french does not have 255 as its max character, and there
> is no collating interface to request the highest character. I suppose
> one hack would be to go out and test all the char values to see which
> is highest.

As I understand the current 'non-locale' way the LIKE indexing works, it should cause this problem for the Bulgarian cyrillic (cp1251) too, because character code 255 is a valid cyrillic character (the last one in the alphabet). Therefore, the solution proposed by Hannu Krosing should be implemented.

I believe we can make the assumption that if you --enable-locale but do not use --with-mb, then you are using single-byte locales, and therefore the hack might work properly. If you use --with-mb, you are out of luck until someone better explains how multibyte characters are ordered.

Is there any place other than gram.y where this hack needs to be implemented?

While I am bashing the locale support... why does the ~* operator not work with locales? That is, I need to use a construct like

    SELECT key FROM t WHERE upper(a) ~ upper('somestring');

instead of

    SELECT key FROM t WHERE a ~* 'somestring';

or store everything in the database in uppercase (works for keys) and upper the string in the frontend. The result is that this construct does not use indices. We also cannot create indexes on upper(a). I believe this was an outstanding problem in pre-6.4.

I found this latter problem to be resolved by modifying backend/regex/Makefile to add -funsigned-char to CFLAGS. This is under BSD/OS 4.0.1 - I found out that, by default, characters that are 'alpha' in cyrillic are treated by the compiler as negative, and therefore isalpha() returns zero... I believe this should be fixed, as it may be causing other problems with non-ASCII locales.

My proposal is to add -funsigned-char to all places where routines such as 'isalpha' are used, and ifdef it for USE_LOCALE.

Daniel
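The isalpha() symptom described above comes from C's promotion rules: where plain char is signed, a CP1251 letter byte such as 0xE0 becomes a negative int, and passing a negative value other than EOF to isalpha() is undefined behaviour (on BSD/OS it simply returned 0). Besides the -funsigned-char flag, a portable source-level fix is a cast, sketched here with a hypothetical wrapper name:

```c
#include <ctype.h>

/*
 * Cast through unsigned char before calling isalpha(), so the argument is
 * always in 0..255 and never a negative int, regardless of whether the
 * platform's plain char is signed.
 */
static int
is_alpha_locale(char c)
{
    return isalpha((unsigned char) c) != 0;
}
```

With the cast in place, high-bit bytes reach the locale's classification table instead of indexing it with a negative value, which is the same effect -funsigned-char achieves for these call sites.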
> I believe we can make the assumption that if you --enable-locale but do not
> use --with-mb, then you are using single-byte locales, and therefore the hack
> might work properly.

I believe at least one person uses both --enable-locale and --with-mb for a single-byte locale.

> If you use --with-mb, you are out of luck until someone
> better explains how multibyte characters are ordered.

I think that until NATIONAL CHARACTER is fully implemented, we will not be able to properly handle sort orders with multi-byte characters. (Thomas, I've been thinking about implementing NCHAR too!)

So you could do whatever you like, as long as it is in #ifndef MULTIBYTE :-)

> I found this latter problem to be resolved by modifying
> backend/regex/Makefile to add -funsigned-char to CFLAGS. This is under
> BSD/OS 4.0.1 - I found out that, by default, characters that are 'alpha'
> in cyrillic are treated by the compiler as negative, and therefore
> isalpha() returns zero... I believe this should be fixed, as it may be
> causing other problems with non-ASCII locales.
>
> My proposal is to add -funsigned-char to all places where routines such as
> 'isalpha' are used, and ifdef it for USE_LOCALE.

Once I proposed another solution to a BSD/OS guy (I don't remember who he was) and asked him to check if it worked. Unfortunately, I got no answer from him. Then I merged it into the #ifdef MULTIBYTE sections in regcomp.c, and it seems to work, at least for the cyrillic locale (thanks Oleg for testing!).
---
Tatsuo Ishii
Tatsuo Ishii wrote:
> I think until NATIONAL CHARACTER is fully implemented, we would not be
> able to properly handle sort orders with multi-byte characters.
> (Thomas, I've been thinking about implementing NCHAR too!)
>
> So you could do whatever you like as long as it is in #ifndef
> MULTIBYTE :-)

Hmm, and the world is moving towards Unicode. We definitely need working support for locales and multibyte chars. Perhaps the right way to do it is to store everything in some Unicode format internally and convert the output according to the suggested "per-column locale definition".

Larry Wall has hacked UTF8 support into Perl, so it is doable; let's see what we can do for 6.6 (time permitting, as always).

regards,
Göran
Hi!

On Thu, 10 Jun 1999, Tatsuo Ishii wrote:

> > I believe we can make the assumption that if you --enable-locale but do not
> > use --with-mb, then you are using single-byte locales, and therefore the
> > hack might work properly.
>
> I believe at least one person uses both --enable-locale and --with-mb
> for a single-byte locale.

I think it's me. But very soon everyone in Russia will use locale+mb for the koi8 and win1251 locales - all of us always need to convert between them. Making Postgres convert on-the-fly is a big win.

> Once I proposed another solution to a BSD/OS guy (I don't remember who he
> was) and asked him to check if it worked. Unfortunately, I got no
> answer from him. Then I merged it into the #ifdef MULTIBYTE sections in

I remember he raised the issue a few times, and finally we agreed on -funsigned-char. I think he got a working Postgres and locale after all.

> regcomp.c, and it seems to work, at least for the cyrillic locale (thanks
> Oleg for testing!).

It works pretty well now!

> ---
> Tatsuo Ishii

Oleg.
----
Oleg Broytmann      http://members.xoom.com/phd2/      phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.
> and the world is moving towards Unicode.
> we definitely need working support for locales and multichar.
> Perhaps the right way to do it is to store everything in
> some Unicode format internally and convert the output
> according to the suggested "per-column-locale-definition".

No. There's nothing perfect in the world, and Unicode is no exception. So we need to keep the freedom of choice of internal encodings. Currently the mb support allows several internal encodings, including Unicode and the mule internal code (yes, you can apply regexp/LIKE to Unicode data if mb support is enabled).
--
Tatsuo Ishii
On 11-Jun-99 Goran Thyni wrote:
> Tatsuo Ishii wrote:
>> I think until NATIONAL CHARACTER is fully implemented, we would not be
>> able to properly handle sort orders with multi-byte characters.
>> (Thomas, I've been thinking about implementing NCHAR too!)
>>
>> So you could do whatever you like as long as it is in #ifndef
>> MULTIBYTE :-)
>
> Hmm, and the world is moving towards Unicode.
> We definitely need working support for locales and multichar.
> Perhaps the right way to do it is to store everything in
> some Unicode format internally and convert the output
> according to the suggested "per-column-locale-definition".
>
> Larry Wall has hacked UTF8 support into perl so it is doable,
> so let's see what we can do for 6.6 (time permitting as always).

IMHO, it would also be pleasant to be able to set the charset/unicode per database, or possibly per table, at runtime - not for the whole postgres installation at configure time.

---
Dmitry Samersoff, dms@wplus.net, ICQ:3161705
http://devnull.wplus.net
* There will come soft rains ...
Tatsuo Ishii <t-ishii@sra.co.jp> writes: > Currently the mb support allows serveral internal > encodings including Unicode and mule-internal-code. > (yes, you can do regexp/like to Unicode data if mb support is > enabled). One of the things that bothers me about makeIndexable() is that it doesn't seem to be multibyte-aware; does it really work in MB case? regards, tom lane
Re: [HACKERS] Re: locales and MB (was: Postgres 6.5 beta2 and beta3 problem)
From: Tatsuo Ishii
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > Currently the mb support allows several internal
> > encodings including Unicode and mule-internal-code.
> > (yes, you can do regexp/like to Unicode data if mb support is
> > enabled).
>
> One of the things that bothers me about makeIndexable() is that it
> doesn't seem to be multibyte-aware; does it really work in the MB case?

Yes. This is because I carefully chose multibyte encodings for the backend that have the following characteristics:

  o if the 8th bit of a byte is off, then it is an ASCII character
  o otherwise, it is part of a non-ASCII multibyte character

With these assumptions, makeIndexable() works very well with multibyte chars.

Not all multibyte encodings satisfy the above conditions. For example, SJIS (an encoding for Japanese) and Big5 (for traditional Chinese) do not satisfy these requirements. In these encodings the first byte of a double-byte character always has the 8th bit on, but in the second byte the 8th bit is sometimes off: this means we cannot distinguish it from ASCII, since it may accidentally match the bit pattern of an ASCII char. This is why I do not allow SJIS and Big5 as server encodings. Users can still use SJIS and Big5 as the client encoding, however.

You might ask why I don't make makeIndexable() multibyte-aware. It is definitely possible. But you should know there are many places that would need to be multibyte-aware in this sense - the parser is a good example. Making everything in the backend multibyte-aware is not worth doing, in my opinion.
---
Tatsuo Ishii
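The invariant Tatsuo relies on can be stated as a one-line predicate. Under it, a byte-wise scan for pattern metacharacters such as '%' and '_' is safe in the allowed server encodings, because no byte of a multibyte character can be mistaken for an ASCII byte:

```c
/*
 * In the allowed server encodings (EUC variants, Unicode, mule internal
 * code), a byte with the 8th bit off is always a plain ASCII character,
 * and a byte with it on is always part of a multibyte character.
 */
static int
is_ascii_byte(unsigned char b)
{
    return (b & 0x80) == 0;
}
```

SJIS is the counterexample he gives: some SJIS trailing bytes fall in the ASCII range (0x5C, the ASCII backslash, is a well-known case), so this predicate would report "ASCII" for a byte that is actually the middle of a character, which is why SJIS and Big5 are restricted to the client side.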
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Yes. This is because I carefully chose multibyte encodings for
> the backend that have the following characteristics:
> o if the 8th bit of a byte is off, then it is an ASCII character
> o otherwise, it is part of a non-ASCII multibyte character

Ah so.

> You might ask why I don't make makeIndexable() multibyte-aware. It
> is definitely possible. But you should know there are many places that
> would need to be multibyte-aware in this sense. The parser is a
> good example.

Right, it's much easier to dodge the problem by restricting backend encodings, and since we have conversions, that doesn't hurt anyone. Now that I think about it, all the explicitly MB-aware code that I've seen is in frontend stuff. Thanks for the clue...

        regards, tom lane