Thread: LIKE gripes
1. like.c doesn't compile cleanly anymore:

$ make
gcc -c -I../../../../src/include -O1 -Wall -Wmissing-prototypes -Wmissing-declarations -g -o like.o like.c
like.c:143: warning: no previous prototype for `inamelike'
like.c:155: warning: no previous prototype for `inamenlike'
like.c:167: warning: no previous prototype for `inamelike_escape'
like.c:180: warning: no previous prototype for `inamenlike_escape'
like.c:193: warning: no previous prototype for `itextlike'
like.c:205: warning: no previous prototype for `itextnlike'
like.c:217: warning: no previous prototype for `itextlike_escape'
like.c:230: warning: no previous prototype for `itextnlike_escape'
like.c: In function `MatchTextLower':
like.c:401: warning: implicit declaration of function `tolower'

2. The strings regress test fails. I think you forgot to commit the expected file to go with the new test file?

			regards, tom lane
> 1. like.c doesn't compile cleanly anymore:
> $ make
> gcc -c -I../../../../src/include -O1 -Wall -Wmissing-prototypes -Wmissing-declarations -g -o like.o like.c
> like.c:143: warning: no previous prototype for `inamelike'
> like.c:155: warning: no previous prototype for `inamenlike'
> like.c:167: warning: no previous prototype for `inamelike_escape'
> like.c:180: warning: no previous prototype for `inamenlike_escape'
> like.c:193: warning: no previous prototype for `itextlike'
> like.c:205: warning: no previous prototype for `itextnlike'
> like.c:217: warning: no previous prototype for `itextlike_escape'
> like.c:230: warning: no previous prototype for `itextnlike_escape'
> like.c: In function `MatchTextLower':
> like.c:401: warning: implicit declaration of function `tolower'

OK, will look at it. The first ones I see here; the second I'm not sure I do.
Will see what other files have for #include's.

> 2. The strings regress test fails. I think you forgot to commit the
> expected file to go with the new test file?

Yup. I just noticed that here after updating the tree.

			- Thomas
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
>> like.c: In function `MatchTextLower':
>> like.c:401: warning: implicit declaration of function `tolower'

> OK, will look at it. The first ones I see here; the second I'm not sure
> I do. Will see what other files have for #include's.

I think <ctype.h> is what's needed here.

			regards, tom lane
> > OK, will look at it. The first ones I see here; the second I'm not sure
> > I do. Will see what other files have for #include's.
> I think <ctype.h> is what's needed here.

OK, I've updated builtins.h and like.c to eliminate the compiler warnings, and updated the strings.out regression test result file.

I've also updated the like support code to

1) eliminate a gratuitous case statement (actually, two) in favor of an if/else if/else construct.

2) eliminate the extraneous backslash-quoting code for the like() functions only. As I mentioned in the cvs note, this behavior can be put back in if we really want it by replacing a NULL with a "\" in two function calls.

			- Thomas
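For reference, a minimal sketch of the kind of fix being described; the header location, the fmgr-style signatures, and the exact declarations are assumptions inferred from the warnings above, not the committed diff:

/*
 * Sketch only: silence the two classes of warnings reported above.
 * The prototypes go in builtins.h so -Wmissing-prototypes is satisfied;
 * <ctype.h> goes in like.c so tolower() is declared for MatchTextLower().
 */
#include "postgres.h"
#include "fmgr.h"
#include <ctype.h>		/* declares tolower() */

/* assumed additions to src/include/utils/builtins.h: */
extern Datum inamelike(PG_FUNCTION_ARGS);
extern Datum inamenlike(PG_FUNCTION_ARGS);
extern Datum inamelike_escape(PG_FUNCTION_ARGS);
extern Datum inamenlike_escape(PG_FUNCTION_ARGS);
extern Datum itextlike(PG_FUNCTION_ARGS);
extern Datum itextnlike(PG_FUNCTION_ARGS);
extern Datum itextlike_escape(PG_FUNCTION_ARGS);
extern Datum itextnlike_escape(PG_FUNCTION_ARGS);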
Where has the MULTIBYTE stuff in like.c gone?

Hiroshi Inoue

> -----Original Message-----
> From: Thomas Lockhart
> Sent: Monday, August 07, 2000 10:07 AM
> To: Tom Lane
> Cc: pgsql-hackers@postgreSQL.org
> Subject: [HACKERS] Re: LIKE gripes
>
> > 1. like.c doesn't compile cleanly anymore:
> > $ make
> > gcc -c -I../../../../src/include -O1 -Wall -Wmissing-prototypes -Wmissing-declarations -g -o like.o like.c
> > like.c:143: warning: no previous prototype for `inamelike'
> > like.c:155: warning: no previous prototype for `inamenlike'
> > like.c:167: warning: no previous prototype for `inamelike_escape'
> > like.c:180: warning: no previous prototype for `inamenlike_escape'
> > like.c:193: warning: no previous prototype for `itextlike'
> > like.c:205: warning: no previous prototype for `itextnlike'
> > like.c:217: warning: no previous prototype for `itextlike_escape'
> > like.c:230: warning: no previous prototype for `itextnlike_escape'
> > like.c: In function `MatchTextLower':
> > like.c:401: warning: implicit declaration of function `tolower'
>
> OK, will look at it. The first ones I see here; the second I'm not sure
> I do. Will see what other files have for #include's.
>
> > 2. The strings regress test fails. I think you forgot to commit the
> > expected file to go with the new test file?
>
> Yup. I just noticed that here after updating the tree.
>
> 			- Thomas
> Where has the MULTIBYTE stuff in like.c gone?

Uh, I was wondering where it was in the first place! Will fix it asap...

There was some string copying stuff in a middle layer of the like() code, but I had thought that it was there only to get a null-terminated string. When I rewrote the code to eliminate the need for null termination (by using the length attribute of the text data type), the need for copying went away. Or so I thought :(

The other piece to the puzzle is that the lowest-level like() support routine traversed the strings using the increment operator, and so I didn't understand that there was any MB support in there. I now see that *all* of these strings get stuffed into unsigned int arrays during copying; I had (sort of) understood some of the encoding schemes (most use a combination of one to three byte sequences for each character) and didn't realize that this normalization was being done on the fly.

So, this answers some questions I have related to implementing character sets:

1) For each character set, we would need to provide operators for "next character" and for boolean comparisons for each character set. Why don't we have those now? Answer: because everything is getting promoted to a 32-bit internal encoding every time a comparison or traversal is required.

2) For each character set, we would need to provide conversion functions to other "compatible" character sets, or to a character "superset". Why don't we have those conversion functions? Answer: we do! There is an internal 32-bit encoding within which all comparisons are done.

Anyway, I think it will be pretty easy to put the MB stuff back in, by #ifdef'ing some string copying inside each of the routines (such as namelike()). The underlying routine no longer requires a null-terminated string (using explicit lengths instead) so I'll generate those lengths in the same place unless they are already provided by the char->int MB support code.

In the future, I'd like to see us use alternate encodings as-is, or as a common set like Unicode (16 bits wide afaik) rather than having to do this widening to 32 bits on the fly. Then, each supported character set can be efficiently manipulated internally, and only converted to another encoding when mixing with another character set.

Any and all advice welcome and accepted (though "keep your hands off the MB code!" seems a bit too late ;)

Sorry for the shake-up...

			- Thomas
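To make the widening concrete, here is a minimal sketch of the kind of copying being described, assuming an MB conversion routine shaped like pg_mb2wchar_with_len(src, dst, srclen) that zero-terminates its output and returns nothing; the name, signature, and wrapper below are assumptions, not the committed code:

/*
 * Sketch only: widen a length-delimited text value to an array of 32-bit
 * pg_wchar so the matcher can step through it one character at a time.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

static int
widen_for_match(text *t, pg_wchar **out)
{
	int			bytelen = VARSIZE(t) - VARHDRSZ;
	pg_wchar   *w;
	int			nchars;

	/*
	 * Worst case, every source byte becomes one pg_wchar, so allocate
	 * bytelen + 1 entries (the 4-to-1 expansion discussed later in the
	 * thread).
	 */
	w = (pg_wchar *) palloc((bytelen + 1) * sizeof(pg_wchar));
	pg_mb2wchar_with_len(VARDATA(t), w, bytelen);

	/* the conversion returns nothing, so count characters by scanning
	 * for the zero terminator */
	for (nchars = 0; w[nchars] != 0; nchars++)
		;

	*out = w;					/* caller pfree()s after matching */
	return nchars;
}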
> > Where has the MULTIBYTE stuff in like.c gone?

I didn't know that :-)

> Uh, I was wondering where it was in the first place! Will fix it asap...
>
> There was some string copying stuff in a middle layer of the like() code,
> but I had thought that it was there only to get a null-terminated string.
> When I rewrote the code to eliminate the need for null termination (by
> using the length attribute of the text data type), the need for copying
> went away. Or so I thought :(
>
> The other piece to the puzzle is that the lowest-level like() support
> routine traversed the strings using the increment operator, and so I
> didn't understand that there was any MB support in there. I now see that
> *all* of these strings get stuffed into unsigned int arrays during
> copying; I had (sort of) understood some of the encoding schemes (most
> use a combination of one to three byte sequences for each character) and
> didn't realize that this normalization was being done on the fly.
>
> So, this answers some questions I have related to implementing character
> sets:
>
> 1) For each character set, we would need to provide operators for "next
> character" and for boolean comparisons for each character set. Why don't
> we have those now? Answer: because everything is getting promoted to a
> 32-bit internal encoding every time a comparison or traversal is
> required.

MB has something similar to the "next character" function, called pg_encoding_mblen. It tells the length of the MB word pointed to, so that you can move forward to the next MB word, etc.

> 2) For each character set, we would need to provide conversion functions
> to other "compatible" character sets, or to a character "superset". Why
> don't we have those conversion functions? Answer: we do! There is an
> internal 32-bit encoding within which all comparisons are done.

Right.

> Anyway, I think it will be pretty easy to put the MB stuff back in, by
> #ifdef'ing some string copying inside each of the routines (such as
> namelike()). The underlying routine no longer requires a null-terminated
> string (using explicit lengths instead) so I'll generate those lengths
> in the same place unless they are already provided by the char->int MB
> support code.

I have not taken a look at your new like code, but I guess you could use

	pg_mbstrlen(const unsigned char *mbstr)

It tells the number of words in mbstr (however, mbstr needs to be null-terminated).

> In the future, I'd like to see us use alternate encodings as-is, or as a
> common set like Unicode (16 bits wide afaik) rather than having to do
> this widening to 32 bits on the fly. Then, each supported character set
> can be efficiently manipulated internally, and only converted to another
> encoding when mixing with another character set.

If you are planning to convert everything to Unicode or whatever before storing it on disk, I'd like to object to that idea. It's not only a waste of disk space but will bring serious performance degradation. For example, each ISO 8859 byte occupies 2 bytes after being converted to Unicode. I don't think doubling the disk space consumption is acceptable.
--
Tatsuo Ishii
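A minimal sketch of that "next character" walk, assuming pg_encoding_mblen(encoding, ptr) returns the byte length of the multibyte character at ptr (the exact signature is an assumption here):

/*
 * Sketch only: step through a length-delimited multibyte string one
 * character at a time, without widening it to pg_wchar.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

static int
count_mb_chars(const char *str, int bytelen, int encoding)
{
	const char *p = str;
	int			nchars = 0;

	while (p < str + bytelen)
	{
		/* advance by the byte length of the character at p */
		p += pg_encoding_mblen(encoding, p);
		nchars++;
	}
	return nchars;
}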
> MB has something similar to the "next character" function, called
> pg_encoding_mblen. It tells the length of the MB word pointed to, so
> that you can move forward to the next MB word, etc.

> > 2) For each character set, we would need to provide conversion functions
> > to other "compatible" character sets, or to a character "superset". Why
> > don't we have those conversion functions? Answer: we do! There is an
> > internal 32-bit encoding within which all comparisons are done.

> Right.

OK. As you know, I have an interest in this, but little knowledge ;)

> > Anyway, I think it will be pretty easy to put the MB stuff back in, by
> > #ifdef'ing some string copying inside each of the routines (such as
> > namelike()). The underlying routine no longer requires a null-terminated
> > string (using explicit lengths instead) so I'll generate those lengths
> > in the same place unless they are already provided by the char->int MB
> > support code.

> I have not taken a look at your new like code, but I guess you could use
> 	pg_mbstrlen(const unsigned char *mbstr)
> It tells the number of words in mbstr (however, mbstr needs to be
> null-terminated).

To get the length I'm now just running through the output string looking for a zero value. This should be more efficient than reading the original string twice; it might be nice if the conversion routines (which now return nothing) returned the actual number of pg_wchars in the output.

The original like() code allocates a pg_wchar array dimensioned by the number of bytes in the input string (which happens to be the absolute upper limit for the size of the 32-bit-encoded string). Worst case, this results in a 4-to-1 expansion of memory, and it always requires a palloc()/pfree() for each call to the comparison routines.

I think I have a solution for the current code; could someone test its behavior with MB enabled? It is now committed to the source tree; I know it compiles, but afaik I am not equipped to test it :(

> > In the future, I'd like to see us use alternate encodings as-is, or as a
> > common set like Unicode (16 bits wide afaik) rather than having to do
> > this widening to 32 bits on the fly. Then, each supported character set
> > can be efficiently manipulated internally, and only converted to another
> > encoding when mixing with another character set.

> If you are planning to convert everything to Unicode or whatever
> before storing it on disk, I'd like to object to that idea. It's
> not only a waste of disk space but will bring serious performance
> degradation. For example, each ISO 8859 byte occupies 2 bytes after
> being converted to Unicode. I don't think doubling the disk space
> consumption is acceptable.

I am not planning on converting everything to Unicode for disk storage. What I would *like* to do is the following:

1) support each encoding "natively", using Postgres' type system to distinguish between them. This would allow strings with the same encoding to be used without conversion, and would both minimize storage requirements *and* run-time conversion costs.

2) support conversions between encodings, again using Postgres' type system to suggest the appropriate conversion routines. This would allow strings with different but compatible encodings to be mixed, but requires internal conversions *only* if someone is mixing encodings inside their database.

3) one of the supported encodings might be Unicode, and if one chooses, that could be used for on-disk storage. Same with the other existing encodings.
4) this different approach to encoding support can coexist with the existing MB support, since (1)-(3) are done without reference to existing MB internal features. So you can choose which scheme to use, and can test the new scheme without breaking the existing one.

imho this comes closer to one of the important goals of maximizing performance for internal operations (since there is less internal string copying/conversion required), even at the expense of extra conversion cost when doing input/output (a good trade since *usually* there are lots of internal operations to a few i/o operations).

Comments?

			- Thomas
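As a side note on the length question raised above: a minimal sketch of what it might look like if the conversion routine reported the number of pg_wchars itself, so callers would not need the zero-value scan. The wrapper name is made up and this is only an illustration, not the enhancement that was later committed:

/*
 * Sketch only: a thin wrapper that returns the pg_wchar count.  A real
 * enhancement would count during the conversion itself rather than
 * re-scanning the output.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

static int
mb2wchar_with_count(const unsigned char *from, pg_wchar *to, int len)
{
	int			nchars = 0;

	pg_mb2wchar_with_len(from, to, len);	/* assumed to zero-terminate */

	while (to[nchars] != 0)
		nchars++;

	return nchars;				/* number of pg_wchars written, terminator excluded */
}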
> To get the length I'm now just running through the output string looking
> for a zero value. This should be more efficient than reading the
> original string twice; it might be nice if the conversion routines
> (which now return nothing) returned the actual number of pg_wchars in
> the output.

Sounds reasonable. I'm going to enhance them as you suggested.

> The original like() code allocates a pg_wchar array dimensioned by the
> number of bytes in the input string (which happens to be the absolute
> upper limit for the size of the 32-bit-encoded string). Worst case, this
> results in a 4-to-1 expansion of memory, and it always requires a
> palloc()/pfree() for each call to the comparison routines.

Right. There would be another approach that avoids using that extra memory space. However, I am not sure it is worth implementing right now.

> I think I have a solution for the current code; could someone test its
> behavior with MB enabled? It is now committed to the source tree; I know
> it compiles, but afaik I am not equipped to test it :(

It passed the MB test, but fails the string test. Yes, I know it fails because ILIKE for MB is not implemented (yet). I'm looking forward to implementing the missing part. Is that ok with you, Thomas?

> I am not planning on converting everything to Unicode for disk storage.

Glad to hear that.

> What I would *like* to do is the following:
>
> 1) support each encoding "natively", using Postgres' type system to
> distinguish between them. This would allow strings with the same
> encoding to be used without conversion, and would both minimize storage
> requirements *and* run-time conversion costs.
>
> 2) support conversions between encodings, again using Postgres' type
> system to suggest the appropriate conversion routines. This would allow
> strings with different but compatible encodings to be mixed, but
> requires internal conversions *only* if someone is mixing encodings
> inside their database.
>
> 3) one of the supported encodings might be Unicode, and if one chooses,
> that could be used for on-disk storage. Same with the other existing
> encodings.
>
> 4) this different approach to encoding support can coexist with the
> existing MB support, since (1)-(3) are done without reference to
> existing MB internal features. So you can choose which scheme to use,
> and can test the new scheme without breaking the existing one.
>
> imho this comes closer to one of the important goals of maximizing
> performance for internal operations (since there is less internal string
> copying/conversion required), even at the expense of extra conversion
> cost when doing input/output (a good trade since *usually* there are
> lots of internal operations to a few i/o operations).
>
> Comments?

Please note that the existing MB implementation does not need such an extra conversion cost except for some MB-aware functions (text_length etc.), regex, like, and the input/output stage. Also, MB stores native encodings 'as is' on disk.

Anyway, it looks like MB would eventually be merged into/deprecated by your new implementation of multiple-encoding support.

BTW, Thomas, do you have a plan to support collation functions?
--
Tatsuo Ishii
> > I think I have a solution for the current code; could someone test its
> > behavior with MB enabled? It is now committed to the source tree; I know
> > it compiles, but afaik I am not equipped to test it :(

> It passed the MB test, but fails the string test. Yes, I know it fails
> because ILIKE for MB is not implemented (yet). I'm looking forward to
> implementing the missing part. Is that ok with you, Thomas?

Whew! I'm glad "fails the string test" is because of the ILIKE/tolower() issue; I was afraid you would say "... because Thomas' bad code dumps core..." :)

Yes, feel free to implement the missing parts. I'm not even sure how to do it! Do you think it would be best in the meantime to disable the ILIKE tests, or perhaps to separate them out into a different test?

> Please note that the existing MB implementation does not need such an
> extra conversion cost except for some MB-aware functions (text_length
> etc.), regex, like, and the input/output stage. Also, MB stores native
> encodings 'as is' on disk.

Yes. I am probably getting a skewed view of MB since the LIKE code is an edge case which illustrates the difficulties in handling character sets in general, no matter what solution is used.

> Anyway, it looks like MB would eventually be merged into/deprecated by
> your new implementation of multiple-encoding support.

I've started writing up a description of my plans (based on our previous discussions), and as I do so I appreciate more and more your current solution ;) imho you have solved several issues such as storage format, client/server communication, and mixed-encoding comparison and manipulation which would all need to be solved by a "new implementation".

My current thought is to leave MB intact, and to start implementing "character sets" as distinct types (I know you have said that this is a lot of work, and I agree that is true for the complete set). Once I have done one or a few character sets (perhaps using a Latin subset of Unicode, so I can test it by converting between ASCII and Unicode using character sets I know how to read ;) then we can start implementing a "complete solution" for those character sets, which includes character and string comparison building blocks like "<", ">", and "tolower()", full comparison functions, and conversion routines between different character sets.

But that by itself does not solve, for example, client/server encoding issues, so let's think about that again once we have some "type-full" character sets to play with. The default solution will of course use MB to handle this.

> BTW, Thomas, do you have a plan to support collation functions?

Yes, that is something that I hope will come out naturally from a combination of SQL9x language features and use of the type system to handle character sets. Then, for example (hmm, examples might be better in Japanese since you have such a rich mix of encodings ;),

  CREATE TABLE t1 (name TEXT COLLATE francais);

will (or might ;) result in using the "francais" data type for the name column.

  SELECT * FROM t1 WHERE name < _FRANCAIS 'merci';

will use the "francais" data type for the string literal. And

  CREATE TABLE t1 (name VARCHAR(10) CHARACTER SET latin1 COLLATE francais);

will (might?) use, say, the "latin1_francais" data type. Each of these data types will be a loadable module (which could be installed into template1 to make them available to every new database), and each can reuse underlying support routines to avoid as much duplicate code as possible.
Maybe a default encoding would be defined for each type, say "latin1" for "francais", so that the backend or some external scripts could help set these up. There is a good chance we will need (yet another) system table to allow us to tie these types into character sets and collations; otherwise Postgres might not be able to recognize that a type is implementing these language features and, for example, pg_dump might not be able to reconstruct the correct table creation syntax.

I notice that SQL99 has *a lot* of new specifics on character set support, which prescribe things like CREATE COLLATION... and DROP COLLATION... This means that there is less thinking involved in the syntax, but more work to make those exact commands fit into Postgres. SQL92 left most of this as an exercise for the reader.

I'd be happier if we knew this stuff *could* be implemented by seeing another DB implement it. Are you aware of any that do (besides our own, of course)?

			- Thomas
> > > I think I have a solution for the current code; could someone test its
> > > behavior with MB enabled? It is now committed to the source tree; I know
> > > it compiles, but afaik I am not equipped to test it :(
> > It passed the MB test, but fails the string test. Yes, I know it fails
> > because ILIKE for MB is not implemented (yet). I'm looking forward to
> > implementing the missing part. Is that ok with you, Thomas?
>
> Whew! I'm glad "fails the string test" is because of the ILIKE/tolower()
> issue; I was afraid you would say "... because Thomas' bad code dumps
> core..." :)
>
> Yes, feel free to implement the missing parts. I'm not even sure how to
> do it! Do you think it would be best in the meantime to disable the
> ILIKE tests, or perhaps to separate them out into a different test?

Done. I have committed changes to like.c.

> > Please note that the existing MB implementation does not need such an
> > extra conversion cost except for some MB-aware functions (text_length
> > etc.), regex, like, and the input/output stage. Also, MB stores native
> > encodings 'as is' on disk.
>
> Yes. I am probably getting a skewed view of MB since the LIKE code is an
> edge case which illustrates the difficulties in handling character sets
> in general, no matter what solution is used.

This time I have slightly modified the way MB is supported, so as to eliminate the up-to-4-times memory consumption problem. The regression test has passed (including the strings test) with/without MB on Red Hat Linux 5.2. Tests on other platforms are welcome.
--
Tatsuo Ishii
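One plausible way to avoid the widening copy altogether, shown here only as an illustration of the idea (it is not necessarily what was committed), is to have the matcher advance through the raw multibyte bytes, assuming pg_mblen(ptr) returns the byte length of the character at ptr in the database encoding:

/*
 * Sketch only: advance through a length-delimited multibyte string
 * in place, instead of copying it into a pg_wchar array first.
 */
#include "postgres.h"
#include "mb/pg_wchar.h"

/* Step to the next character: one byte for single-byte encodings,
 * pg_mblen() bytes when MULTIBYTE is enabled. */
#ifdef MULTIBYTE
#define NextChar(p, plen) \
	do { int __l = pg_mblen(p); (p) += __l; (plen) -= __l; } while (0)
#else
#define NextChar(p, plen) ((p)++, (plen)--)
#endif

/* Example use: count characters without allocating anything. */
static int
mb_char_count(const char *p, int plen)
{
	int			n = 0;

	while (plen > 0)
	{
		NextChar(p, plen);
		n++;
	}
	return n;
}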
> > Yes, feel free to implement the missing parts...
> Done. I have committed changes to like.c.
...
> This time I have slightly modified the way MB is supported, so as to
> eliminate the up-to-4-times memory consumption problem.

Great! btw, would it be OK if I took README.mb, README.locale, and README.Charsets and consolidated them into a chapter or two in the main docs? istm that they are more appropriate there than in these isolated README files. I've already gotten a good start on it...

Also, I propose to consolidate and eliminate README.fsync, which duplicates (or will duplicate) info available in the Admin Guide. The fact that it hasn't been touched since 1996, and still refers to Postgres'95, is a clue that some changes are in order.

			- Thomas
> Great! btw, would it be OK if I took README.mb, README.locale, and
> README.Charsets and consolidated them into a chapter or two in the main
> docs? istm that they are more appropriate there than in these isolated
> README files. I've already gotten a good start on it...

Sure, no problem at all for README.mb (also, please feel free to correct any grammatical mistakes in it). I'm not sure about README.locale and README.Charsets, but I guess Oleg would be glad to hear that...

> Also, I propose to consolidate and eliminate README.fsync, which
> duplicates (or will duplicate) info available in the Admin Guide. The
> fact that it hasn't been touched since 1996, and still refers to
> Postgres'95, is a clue that some changes are in order.

Agreed.
--
Tatsuo Ishii
On Tue, 22 Aug 2000, Thomas Lockhart wrote:

> Date: Tue, 22 Aug 2000 06:57:51 +0000
> From: Thomas Lockhart <lockhart@alumni.caltech.edu>
> To: Tatsuo Ishii <t-ishii@sra.co.jp>
> Cc: Inoue@tpf.co.jp, pgsql-hackers@postgreSQL.org
> Subject: Re: [HACKERS] LIKE gripes (and charset docs)
>
> > > Yes, feel free to implement the missing parts...
> > Done. I have committed changes to like.c.
> ...
> > This time I have slightly modified the way MB is supported, so as to
> > eliminate the up-to-4-times memory consumption problem.
>
> Great! btw, would it be OK if I took README.mb, README.locale, and
> README.Charsets and consolidated them into a chapter or two in the main
> docs? istm that they are more appropriate there than in these isolated
> README files. I've already gotten a good start on it...

I think one chapter about internationalization support would be ok.

Regards,
Oleg

> Also, I propose to consolidate and eliminate README.fsync, which
> duplicates (or will duplicate) info available in the Admin Guide. The
> fact that it hasn't been touched since 1996, and still refers to
> Postgres'95, is a clue that some changes are in order.
>
> 			- Thomas

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> Also, I propose to consolidate and eliminate README.fsync, which
> duplicates (or will duplicate) info available in the Admin Guide.

?? Peter did that already. README.fsync was "cvs remove"d over a month ago...

			regards, tom lane