Thread: LIKE gripes
1. like.c doesn't compile cleanly anymore:

$ make
gcc -c -I../../../../src/include -O1 -Wall -Wmissing-prototypes -Wmissing-declarations -g -o like.o like.c
like.c:143: warning: no previous prototype for `inamelike'
like.c:155: warning: no previous prototype for `inamenlike'
like.c:167: warning: no previous prototype for `inamelike_escape'
like.c:180: warning: no previous prototype for `inamenlike_escape'
like.c:193: warning: no previous prototype for `itextlike'
like.c:205: warning: no previous prototype for `itextnlike'
like.c:217: warning: no previous prototype for `itextlike_escape'
like.c:230: warning: no previous prototype for `itextnlike_escape'
like.c: In function `MatchTextLower':
like.c:401: warning: implicit declaration of function `tolower'

2. The strings regress test fails. I think you forgot to commit the expected file to go with the new test file?

			regards, tom lane
> 1. like.c doesn't compile cleanly anymore:
> $ make
> gcc -c -I../../../../src/include -O1 -Wall -Wmissing-prototypes -Wmissing-declarations -g -o like.o like.c
> like.c:143: warning: no previous prototype for `inamelike'
> like.c:155: warning: no previous prototype for `inamenlike'
> like.c:167: warning: no previous prototype for `inamelike_escape'
> like.c:180: warning: no previous prototype for `inamenlike_escape'
> like.c:193: warning: no previous prototype for `itextlike'
> like.c:205: warning: no previous prototype for `itextnlike'
> like.c:217: warning: no previous prototype for `itextlike_escape'
> like.c:230: warning: no previous prototype for `itextnlike_escape'
> like.c: In function `MatchTextLower':
> like.c:401: warning: implicit declaration of function `tolower'

OK, will look at it. The first ones I see here; the second I'm not sure I do.
Will see what other files have for #include's.

> 2. The strings regress test fails. I think you forgot to commit the
> expected file to go with the new test file?

Yup. I just noticed that here after updating the tree.

			- Thomas
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
>> like.c: In function `MatchTextLower':
>> like.c:401: warning: implicit declaration of function `tolower'

> OK, will look at it. The first ones I see here; the second I'm not sure
> I do. Will see what other files have for #include's.

I think <ctype.h> is what's needed here.

			regards, tom lane
> > OK, will look at it. The first ones I see here; the second I'm not sure
> > I do. Will see what other files have for #include's.
> I think <ctype.h> is what's needed here.

OK, I've updated builtins.h and like.c to eliminate the compiler warnings, and updated the strings.out regression test result file.

I've also updated the like support code to

1) eliminate a gratuitous case statement (actually, two) in favor of an if/else if/else construct.

2) eliminate the extraneous backslash-quoting code for the like() functions only. As I mentioned in the cvs note, this behavior can be put back in if we really want it by replacing a NULL with a "\" in two function calls.

			- Thomas
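For reference, a minimal sketch of the kind of fix being described; the header location, the fmgr-style signatures, and the exact declarations are assumptions inferred from the warnings above, not the committed diff:

/*
 * Sketch only: silence the two classes of warnings reported above.
 * The prototypes go in builtins.h so -Wmissing-prototypes is satisfied;
 * <ctype.h> goes in like.c so tolower() is declared for MatchTextLower().
 */
#include "postgres.h"
#include "fmgr.h"
#include <ctype.h>		/* declares tolower() */

/* assumed additions to src/include/utils/builtins.h: */
extern Datum inamelike(PG_FUNCTION_ARGS);
extern Datum inamenlike(PG_FUNCTION_ARGS);
extern Datum inamelike_escape(PG_FUNCTION_ARGS);
extern Datum inamenlike_escape(PG_FUNCTION_ARGS);
extern Datum itextlike(PG_FUNCTION_ARGS);
extern Datum itextnlike(PG_FUNCTION_ARGS);
extern Datum itextlike_escape(PG_FUNCTION_ARGS);
extern Datum itextnlike_escape(PG_FUNCTION_ARGS);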
Where has the MULTIBYTE stuff in like.c gone?

Hiroshi Inoue

> -----Original Message-----
> From: Thomas Lockhart
> Sent: Monday, August 07, 2000 10:07 AM
> To: Tom Lane
> Cc: pgsql-hackers@postgreSQL.org
> Subject: [HACKERS] Re: LIKE gripes
>
> > 1. like.c doesn't compile cleanly anymore:
> > $ make
> > gcc -c -I../../../../src/include -O1 -Wall -Wmissing-prototypes -Wmissing-declarations -g -o like.o like.c
> > like.c:143: warning: no previous prototype for `inamelike'
> > like.c:155: warning: no previous prototype for `inamenlike'
> > like.c:167: warning: no previous prototype for `inamelike_escape'
> > like.c:180: warning: no previous prototype for `inamenlike_escape'
> > like.c:193: warning: no previous prototype for `itextlike'
> > like.c:205: warning: no previous prototype for `itextnlike'
> > like.c:217: warning: no previous prototype for `itextlike_escape'
> > like.c:230: warning: no previous prototype for `itextnlike_escape'
> > like.c: In function `MatchTextLower':
> > like.c:401: warning: implicit declaration of function `tolower'
>
> OK, will look at it. The first ones I see here; the second I'm not sure
> I do. Will see what other files have for #include's.
>
> > 2. The strings regress test fails. I think you forgot to commit the
> > expected file to go with the new test file?
>
> Yup. I just noticed that here after updating the tree.
>
> 			- Thomas
> Where has the MULTIBYTE stuff in like.c gone?

Uh, I was wondering where it was in the first place! Will fix it asap...

There was some string copying stuff in a middle layer of the like() code, but I had thought that it was there only to get a null-terminated string. When I rewrote the code to eliminate the need for null termination (by using the length attribute of the text data type), the need for copying went away. Or so I thought :(

The other piece to the puzzle is that the lowest-level like() support routine traversed the strings using the increment operator, and so I didn't understand that there was any MB support in there. I now see that *all* of these strings get stuffed into unsigned int arrays during copying; I had (sort of) understood some of the encoding schemes (most use a combination of one to three byte sequences for each character) and didn't realize that this normalization was being done on the fly.

So, this answers some questions I have related to implementing character sets:

1) For each character set, we would need to provide operators for "next character" and for boolean comparisons for each character set. Why don't we have those now? Answer: because everything is getting promoted to a 32-bit internal encoding every time a comparison or traversal is required.

2) For each character set, we would need to provide conversion functions to other "compatible" character sets, or to a character "superset". Why don't we have those conversion functions? Answer: we do! There is an internal 32-bit encoding within which all comparisons are done.

Anyway, I think it will be pretty easy to put the MB stuff back in, by #ifdef'ing some string copying inside each of the routines (such as namelike()). The underlying routine no longer requires a null-terminated string (using explicit lengths instead) so I'll generate those lengths in the same place unless they are already provided by the char->int MB support code.

In the future, I'd like to see us use alternate encodings as-is, or as a common set like Unicode (16 bits wide afaik) rather than having to do this widening to 32 bits on the fly. Then, each supported character set can be efficiently manipulated internally, and only converted to another encoding when mixing with another character set.

Any and all advice welcome and accepted (though "keep your hands off the MB code!" seems a bit too late ;)

Sorry for the shake-up...

			- Thomas
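To make the widening concrete, here is a minimal sketch of the kind of copying being described, assuming an MB conversion routine shaped like pg_mb2wchar_with_len(src, dst, srclen) that zero-terminates its output and returns nothing; the name, signature, and wrapper below are assumptions, not the committed code:

/*
 * Sketch only: widen a length-delimited text value to an array of 32-bit
 * pg_wchar so the matcher can step through it one character at a time.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

static int
widen_for_match(text *t, pg_wchar **out)
{
	int			bytelen = VARSIZE(t) - VARHDRSZ;
	pg_wchar   *w;
	int			nchars;

	/*
	 * Worst case, every source byte becomes one pg_wchar, so allocate
	 * bytelen + 1 entries (the 4-to-1 expansion discussed later in the
	 * thread).
	 */
	w = (pg_wchar *) palloc((bytelen + 1) * sizeof(pg_wchar));
	pg_mb2wchar_with_len(VARDATA(t), w, bytelen);

	/* the conversion returns nothing, so count characters by scanning
	 * for the zero terminator */
	for (nchars = 0; w[nchars] != 0; nchars++)
		;

	*out = w;					/* caller pfree()s after matching */
	return nchars;
}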
> > Where has the MULTIBYTE stuff in like.c gone?

I didn't know that :-)

> Uh, I was wondering where it was in the first place! Will fix it asap...
>
> There was some string copying stuff in a middle layer of the like() code,
> but I had thought that it was there only to get a null-terminated string.
> When I rewrote the code to eliminate the need for null termination (by
> using the length attribute of the text data type), the need for copying
> went away. Or so I thought :(
>
> The other piece to the puzzle is that the lowest-level like() support
> routine traversed the strings using the increment operator, and so I
> didn't understand that there was any MB support in there. I now see that
> *all* of these strings get stuffed into unsigned int arrays during
> copying; I had (sort of) understood some of the encoding schemes (most
> use a combination of one to three byte sequences for each character) and
> didn't realize that this normalization was being done on the fly.
>
> So, this answers some questions I have related to implementing character
> sets:
>
> 1) For each character set, we would need to provide operators for "next
> character" and for boolean comparisons for each character set. Why don't
> we have those now? Answer: because everything is getting promoted to a
> 32-bit internal encoding every time a comparison or traversal is
> required.

MB has something similar to the "next character" function, called pg_encoding_mblen. It tells the length of the MB word pointed to, so that you can move forward to the next MB word, etc.

> 2) For each character set, we would need to provide conversion functions
> to other "compatible" character sets, or to a character "superset". Why
> don't we have those conversion functions? Answer: we do! There is an
> internal 32-bit encoding within which all comparisons are done.

Right.

> Anyway, I think it will be pretty easy to put the MB stuff back in, by
> #ifdef'ing some string copying inside each of the routines (such as
> namelike()). The underlying routine no longer requires a null-terminated
> string (using explicit lengths instead) so I'll generate those lengths
> in the same place unless they are already provided by the char->int MB
> support code.

I have not taken a look at your new like code, but I guess you could use

	pg_mbstrlen(const unsigned char *mbstr)

It tells the number of words in mbstr (however, mbstr needs to be null-terminated).

> In the future, I'd like to see us use alternate encodings as-is, or as a
> common set like Unicode (16 bits wide afaik) rather than having to do
> this widening to 32 bits on the fly. Then, each supported character set
> can be efficiently manipulated internally, and only converted to another
> encoding when mixing with another character set.

If you are planning to convert everything to Unicode or whatever before storing it on disk, I'd like to object to that idea. It's not only a waste of disk space but will bring serious performance degradation. For example, each ISO 8859 byte occupies 2 bytes after being converted to Unicode. I don't think doubling the disk space consumption is acceptable.
--
Tatsuo Ishii
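A minimal sketch of that "next character" walk, assuming pg_encoding_mblen(encoding, ptr) returns the byte length of the multibyte character at ptr (the exact signature is an assumption here):

/*
 * Sketch only: step through a length-delimited multibyte string one
 * character at a time, without widening it to pg_wchar.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

static int
count_mb_chars(const char *str, int bytelen, int encoding)
{
	const char *p = str;
	int			nchars = 0;

	while (p < str + bytelen)
	{
		/* advance by the byte length of the character at p */
		p += pg_encoding_mblen(encoding, p);
		nchars++;
	}
	return nchars;
}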
> MB has something similar to the "next character" function, called
> pg_encoding_mblen. It tells the length of the MB word pointed to, so
> that you can move forward to the next MB word, etc.

> > 2) For each character set, we would need to provide conversion functions
> > to other "compatible" character sets, or to a character "superset". Why
> > don't we have those conversion functions? Answer: we do! There is an
> > internal 32-bit encoding within which all comparisons are done.

> Right.

OK. As you know, I have an interest in this, but little knowledge ;)

> > Anyway, I think it will be pretty easy to put the MB stuff back in, by
> > #ifdef'ing some string copying inside each of the routines (such as
> > namelike()). The underlying routine no longer requires a null-terminated
> > string (using explicit lengths instead) so I'll generate those lengths
> > in the same place unless they are already provided by the char->int MB
> > support code.

> I have not taken a look at your new like code, but I guess you could use
> 	pg_mbstrlen(const unsigned char *mbstr)
> It tells the number of words in mbstr (however, mbstr needs to be
> null-terminated).

To get the length I'm now just running through the output string looking for a zero value. This should be more efficient than reading the original string twice; it might be nice if the conversion routines (which now return nothing) returned the actual number of pg_wchars in the output.

The original like() code allocates a pg_wchar array dimensioned by the number of bytes in the input string (which happens to be the absolute upper limit for the size of the 32-bit-encoded string). Worst case, this results in a 4-to-1 expansion of memory, and it always requires a palloc()/pfree() for each call to the comparison routines.

I think I have a solution for the current code; could someone test its behavior with MB enabled? It is now committed to the source tree; I know it compiles, but afaik I am not equipped to test it :(

> > In the future, I'd like to see us use alternate encodings as-is, or as a
> > common set like Unicode (16 bits wide afaik) rather than having to do
> > this widening to 32 bits on the fly. Then, each supported character set
> > can be efficiently manipulated internally, and only converted to another
> > encoding when mixing with another character set.

> If you are planning to convert everything to Unicode or whatever
> before storing it on disk, I'd like to object to that idea. It's
> not only a waste of disk space but will bring serious performance
> degradation. For example, each ISO 8859 byte occupies 2 bytes after
> being converted to Unicode. I don't think doubling the disk space
> consumption is acceptable.

I am not planning on converting everything to Unicode for disk storage. What I would *like* to do is the following:

1) support each encoding "natively", using Postgres' type system to distinguish between them. This would allow strings with the same encoding to be used without conversion, and would both minimize storage requirements *and* run-time conversion costs.

2) support conversions between encodings, again using Postgres' type system to suggest the appropriate conversion routines. This would allow strings with different but compatible encodings to be mixed, but requires internal conversions *only* if someone is mixing encodings inside their database.

3) one of the supported encodings might be Unicode, and if one chooses, that could be used for on-disk storage. Same with the other existing encodings.
4) this different approach to encoding support can coexist with the existing MB support, since (1)-(3) are done without reference to existing MB internal features. So you can choose which scheme to use, and can test the new scheme without breaking the existing one.

imho this comes closer to one of the important goals of maximizing performance for internal operations (since there is less internal string copying/conversion required), even at the expense of extra conversion cost when doing input/output (a good trade since *usually* there are lots of internal operations to a few i/o operations).

Comments?

			- Thomas
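As a side note on the length question raised above: a minimal sketch of what it might look like if the conversion routine reported the number of pg_wchars itself, so callers would not need the zero-value scan. The wrapper name is made up and this is only an illustration, not the enhancement that was later committed:

/*
 * Sketch only: a thin wrapper that returns the pg_wchar count.  A real
 * enhancement would count during the conversion itself rather than
 * re-scanning the output.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

static int
mb2wchar_with_count(const unsigned char *from, pg_wchar *to, int len)
{
	int			nchars = 0;

	pg_mb2wchar_with_len(from, to, len);	/* assumed to zero-terminate */

	while (to[nchars] != 0)
		nchars++;

	return nchars;				/* number of pg_wchars written, terminator excluded */
}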
> To get the length I'm now just running through the output string looking
> for a zero value. This should be more efficient than reading the
> original string twice; it might be nice if the conversion routines
> (which now return nothing) returned the actual number of pg_wchars in
> the output.

Sounds reasonable. I'm going to enhance them as you suggested.

> The original like() code allocates a pg_wchar array dimensioned by the
> number of bytes in the input string (which happens to be the absolute
> upper limit for the size of the 32-bit-encoded string). Worst case, this
> results in a 4-to-1 expansion of memory, and it always requires a
> palloc()/pfree() for each call to the comparison routines.

Right. There would be another approach that avoids using that extra memory space. However, I am not sure it is worth implementing right now.

> I think I have a solution for the current code; could someone test its
> behavior with MB enabled? It is now committed to the source tree; I know
> it compiles, but afaik I am not equipped to test it :(

It passed the MB test, but fails the string test. Yes, I know it fails because ILIKE for MB is not implemented (yet). I'm looking forward to implementing the missing part. Is that ok with you, Thomas?

> I am not planning on converting everything to Unicode for disk storage.

Glad to hear that.

> What I would *like* to do is the following:
>
> 1) support each encoding "natively", using Postgres' type system to
> distinguish between them. This would allow strings with the same
> encoding to be used without conversion, and would both minimize storage
> requirements *and* run-time conversion costs.
>
> 2) support conversions between encodings, again using Postgres' type
> system to suggest the appropriate conversion routines. This would allow
> strings with different but compatible encodings to be mixed, but
> requires internal conversions *only* if someone is mixing encodings
> inside their database.
>
> 3) one of the supported encodings might be Unicode, and if one chooses,
> that could be used for on-disk storage. Same with the other existing
> encodings.
>
> 4) this different approach to encoding support can coexist with the
> existing MB support, since (1)-(3) are done without reference to
> existing MB internal features. So you can choose which scheme to use,
> and can test the new scheme without breaking the existing one.
>
> imho this comes closer to one of the important goals of maximizing
> performance for internal operations (since there is less internal string
> copying/conversion required), even at the expense of extra conversion
> cost when doing input/output (a good trade since *usually* there are
> lots of internal operations to a few i/o operations).
>
> Comments?

Please note that the existing MB implementation does not need such an extra conversion cost except for some MB-aware functions (text_length etc.), regex, like, and the input/output stage. Also, MB stores native encodings 'as is' on disk.

Anyway, it looks like MB would eventually be merged into/deprecated by your new implementation of multiple-encoding support.

BTW, Thomas, do you have a plan to support collation functions?
--
Tatsuo Ishii
> > I think I have a solution for the current code; could someone test its
> > behavior with MB enabled? It is now committed to the source tree; I know
> > it compiles, but afaik I am not equipped to test it :(

> It passed the MB test, but fails the string test. Yes, I know it fails
> because ILIKE for MB is not implemented (yet). I'm looking forward to
> implementing the missing part. Is that ok with you, Thomas?

Whew! I'm glad "fails the string test" is because of the ILIKE/tolower() issue; I was afraid you would say "... because Thomas' bad code dumps core..." :)

Yes, feel free to implement the missing parts. I'm not even sure how to do it! Do you think it would be best in the meantime to disable the ILIKE tests, or perhaps to separate them out into a different test?

> Please note that the existing MB implementation does not need such an
> extra conversion cost except for some MB-aware functions (text_length
> etc.), regex, like, and the input/output stage. Also, MB stores native
> encodings 'as is' on disk.

Yes. I am probably getting a skewed view of MB since the LIKE code is an edge case which illustrates the difficulties in handling character sets in general, no matter what solution is used.

> Anyway, it looks like MB would eventually be merged into/deprecated by
> your new implementation of multiple-encoding support.

I've started writing up a description of my plans (based on our previous discussions), and as I do so I appreciate more and more your current solution ;) imho you have solved several issues such as storage format, client/server communication, and mixed-encoding comparison and manipulation which would all need to be solved by a "new implementation".

My current thought is to leave MB intact, and to start implementing "character sets" as distinct types (I know you have said that this is a lot of work, and I agree that is true for the complete set). Once I have done one or a few character sets (perhaps using a Latin subset of Unicode, so I can test it by converting between ASCII and Unicode using character sets I know how to read ;) then we can start implementing a "complete solution" for those character sets, which includes character and string comparison building blocks like "<", ">", and "tolower()", full comparison functions, and conversion routines between different character sets.

But that by itself does not solve, for example, client/server encoding issues, so let's think about that again once we have some "type-full" character sets to play with. The default solution will of course use MB to handle this.

> BTW, Thomas, do you have a plan to support collation functions?

Yes, that is something that I hope will come out naturally from a combination of SQL9x language features and use of the type system to handle character sets. Then, for example (hmm, examples might be better in Japanese since you have such a rich mix of encodings ;),

  CREATE TABLE t1 (name TEXT COLLATE francais);

will (or might ;) result in using the "francais" data type for the name column.

  SELECT * FROM t1 WHERE name < _FRANCAIS 'merci';

will use the "francais" data type for the string literal. And

  CREATE TABLE t1 (name VARCHAR(10) CHARACTER SET latin1 COLLATE francais);

will (might?) use, say, the "latin1_francais" data type. Each of these data types will be a loadable module (which could be installed into template1 to make them available to every new database), and each can reuse underlying support routines to avoid as much duplicate code as possible.
Maybe a default encoding would be defined for each type, say "latin1" for "francais", so that the backend or some external scripts could help set these up. There is a good chance we will need (yet another) system table to allow us to tie these types into character sets and collations; otherwise Postgres might not be able to recognize that a type is implementing these language features and, for example, pg_dump might not be able to reconstruct the correct table creation syntax.

I notice that SQL99 has *a lot* of new specifics on character set support, which prescribe things like CREATE COLLATION... and DROP COLLATION... This means that there is less thinking involved in the syntax, but more work to make those exact commands fit into Postgres. SQL92 left most of this as an exercise for the reader.

I'd be happier if we knew this stuff *could* be implemented by seeing another DB implement it. Are you aware of any that do (besides our own, of course)?

			- Thomas
> > > I think I have a solution for the current code; could someone test its
> > > behavior with MB enabled? It is now committed to the source tree; I know
> > > it compiles, but afaik I am not equipped to test it :(
> > It passed the MB test, but fails the string test. Yes, I know it fails
> > because ILIKE for MB is not implemented (yet). I'm looking forward to
> > implementing the missing part. Is that ok with you, Thomas?
>
> Whew! I'm glad "fails the string test" is because of the ILIKE/tolower()
> issue; I was afraid you would say "... because Thomas' bad code dumps
> core..." :)
>
> Yes, feel free to implement the missing parts. I'm not even sure how to
> do it! Do you think it would be best in the meantime to disable the
> ILIKE tests, or perhaps to separate them out into a different test?

Done. I have committed changes to like.c.

> > Please note that the existing MB implementation does not need such an
> > extra conversion cost except for some MB-aware functions (text_length
> > etc.), regex, like, and the input/output stage. Also, MB stores native
> > encodings 'as is' on disk.
>
> Yes. I am probably getting a skewed view of MB since the LIKE code is an
> edge case which illustrates the difficulties in handling character sets
> in general, no matter what solution is used.

This time I have slightly modified the way MB is supported, so as to eliminate the up-to-4-times memory consumption problem. The regression test has passed (including the strings test) with/without MB on Red Hat Linux 5.2. Tests on other platforms are welcome.
--
Tatsuo Ishii
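One plausible way to avoid the widening copy altogether, shown here only as an illustration of the idea (it is not necessarily what was committed), is to have the matcher advance through the raw multibyte bytes, assuming pg_mblen(ptr) returns the byte length of the character at ptr in the database encoding:

/*
 * Sketch only: advance through a length-delimited multibyte string
 * in place, instead of copying it into a pg_wchar array first.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

/* Step to the next character: one byte for single-byte encodings,
 * pg_mblen() bytes when MULTIBYTE is enabled. */
#ifdef MULTIBYTE
#define NextChar(p, plen) \
	do { int __l = pg_mblen(p); (p) += __l; (plen) -= __l; } while (0)
#else
#define NextChar(p, plen) ((p)++, (plen)--)
#endif

/* Example use: count characters without allocating anything. */
static int
mb_char_count(const char *p, int plen)
{
	int			n = 0;

	while (plen > 0)
	{
		NextChar(p, plen);
		n++;
	}
	return n;
}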
> > Yes, feel free to implement the missing parts...
> Done. I have committed changes to like.c.
...
> This time I have slightly modified the way MB is supported, so as to
> eliminate the up-to-4-times memory consumption problem.

Great! btw, would it be OK if I took README.mb, README.locale, and README.Charsets and consolidated them into a chapter or two in the main docs? istm that they are more appropriate there than in these isolated README files. I've already gotten a good start on it...

Also, I propose to consolidate and eliminate README.fsync, which duplicates (or will duplicate) info available in the Admin Guide. The fact that it hasn't been touched since 1996, and still refers to Postgres'95, is a clue that some changes are in order.

			- Thomas
> Great! btw, would it be OK if I took README.mb, README.locale, and
> README.Charsets and consolidated them into a chapter or two in the main
> docs? istm that they are more appropriate there than in these isolated
> README files. I've already gotten a good start on it...

Sure, no problem at all for README.mb (also, please feel free to correct any grammatical mistakes in it). I'm not sure about README.locale and README.Charsets, but I guess Oleg would be glad to hear that...

> Also, I propose to consolidate and eliminate README.fsync, which
> duplicates (or will duplicate) info available in the Admin Guide. The
> fact that it hasn't been touched since 1996, and still refers to
> Postgres'95, is a clue that some changes are in order.

Agreed.
--
Tatsuo Ishii
On Tue, 22 Aug 2000, Thomas Lockhart wrote:

> Date: Tue, 22 Aug 2000 06:57:51 +0000
> From: Thomas Lockhart <lockhart@alumni.caltech.edu>
> To: Tatsuo Ishii <t-ishii@sra.co.jp>
> Cc: Inoue@tpf.co.jp, pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] LIKE gripes (and charset docs)
>
> > > Yes, feel free to implement the missing parts...
> > Done. I have committed changes to like.c.
> ...
> > This time I have slightly modified the way MB is supported, so as to
> > eliminate the up-to-4-times memory consumption problem.
>
> Great! btw, would it be OK if I took README.mb, README.locale, and
> README.Charsets and consolidated them into a chapter or two in the main
> docs? istm that they are more appropriate there than in these isolated
> README files. I've already gotten a good start on it...

I think one chapter about internationalization support would be ok.

Regards,
Oleg

> Also, I propose to consolidate and eliminate README.fsync, which
> duplicates (or will duplicate) info available in the Admin Guide. The
> fact that it hasn't been touched since 1996, and still refers to
> Postgres'95, is a clue that some changes are in order.
>
> 			- Thomas

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Also, I propose to consolidate and eliminate README.fsync, which
> duplicates (or will duplicate) info available in the Admin Guide.

?? Peter did that already. README.fsync was "cvs remove"d over a month ago...

			regards, tom lane