Re: Re: LIKE gripes - Mailing list pgsql-hackers
From | Tatsuo Ishii
---|---
Subject | Re: Re: LIKE gripes
Date |
Msg-id | 20000809214513M.t-ishii@sra.co.jp
In response to | Re: Re: LIKE gripes (Thomas Lockhart <lockhart@alumni.caltech.edu>)
List | pgsql-hackers
> > Where has MULTIBYTE Stuff in like.c gone ? I didn't know that:-)
>
> Uh, I was wondering where it was in the first place! Will fix it asap...
>
> There was some string copying stuff in a middle layer of the like()
> code, but I had thought that it was there only to get a null-terminated
> string. When I rewrote the code to eliminate the need for null
> termination (by using the length attribute of the text data type), the
> need for copying went away. Or so I thought :(
>
> The other piece of the puzzle is that the lowest-level like() support
> routine traversed the strings using the increment operator, so I
> didn't understand that there was any MB support in there. I now see that
> *all* of these strings get stuffed into unsigned int arrays during
> copying; I had (sort of) understood some of the encoding schemes (most
> use a combination of one- to three-byte sequences for each character) and
> didn't realize that this normalization was being done on the fly.
>
> So, this answers some questions I have related to implementing character
> sets:
>
> 1) For each character set, we would need to provide operators for "next
> character" and for boolean comparisons for each character set. Why don't
> we have those now? Answer: because everything is getting promoted to a
> 32-bit internal encoding every time a comparison or traversal is
> required.

MB has something similar to the "next character" function, called
pg_encoding_mblen. It tells you the byte length of the MB character
pointed to, so that you can move forward to the next MB character, etc.

> 2) For each character set, we would need to provide conversion functions
> to other "compatible" character sets, or to a character "superset". Why
> don't we have those conversion functions? Answer: we do! There is an
> internal 32-bit encoding within which all comparisons are done.

Right.
> Anyway, I think it will be pretty easy to put the MB stuff back in, by
> #ifdef'ing some string copying inside each of the routines (such as
> namelike()). The underlying routine no longer requires a null-terminated
> string (using explicit lengths instead), so I'll generate those lengths
> in the same place unless they are already provided by the char->int MB
> support code.

I have not taken a look at your new like code, but I guess you could use

    pg_mbstrlen(const unsigned char *mbstr)

It tells you the number of characters in mbstr (note, however, that
mbstr needs to be null-terminated).

> In the future, I'd like to see us use alternate encodings as-is, or a
> common set like Unicode (16 bits wide afaik), rather than having to do
> this widening to 32 bits on the fly. Then, each supported character set
> can be efficiently manipulated internally, and only converted to another
> encoding when mixing with another character set.

If you are planning to convert everything to Unicode or whatever before
storing it on disk, I'd like to object to that idea. It is not only a
waste of disk space but will also bring serious performance degradation.
For example, each ISO 8859 byte occupies 2 bytes after being converted
to Unicode. I don't think this doubling of disk space consumption is
acceptable.
--
Tatsuo Ishii