Thread: Progress report on locale safe LIKE indexing
I have so far implemented the following: An operator class text_binary_ops that does memcmp()-based comparison of text data. The operators are named $<$ etc. for lack of a better idea. That lack is further illustrated by the idea to name them "binary-<" etc., which wouldn't get through the parser, but it doesn't need to. The system will use such an index for the queries in question if the locale is not "like-safe", in the terminology of the code (I'll end up renaming that a little). If the locale is plain C/POSIX, it will use the "normal" index. (Should it be able to use either one in that case?) Open issues: Currently, this only works on text. Before I go out and duplicate all this code and the catalog entries for varchar and bpchar, is there a Better Way? In match_special_index_operator(), the code looks up the required operators by name (<, >=). In other places we go out of our way to not attach predefined meanings to operator names. (In yet other places we do, of course.) Wouldn't it be better to test whether the candidate index is a btree and then select the operator to use from the btree strategy entries? One uglification factor here is that the comment * We cheat a little by not checking for availability of "=" ... any * index type should support "=", methinks. no longer holds. PS: I wasn't able to find or reconstruct a concrete test case for this in the archives. Naturally, I'd accept this approach on theoretical purity, but if someone remembers a real tough one, let me know. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Peter Eisentraut wrote: > > I have so far implemented the following: > > An operator class text_binary_ops that does memcmp()-based comparison of > text data. The operators are named $<$ etc. for lack of a better idea. > That lack is further illustrated by the idea to name them "binary-<" etc., > which wouldn't get through the parser, but it doesn't need to. > > The system will use such an index for the queries in question if the > locale is not "like-safe", in the terminology of the code (I'll end up > renaming that a little). This depends on the assumption that '=' is equivalent in any locale. Is it guaranteed ? For example, ( 'a' = 'A' ) isn't allowed in any locale ? regards, Hiroshi Inoue
Hiroshi Inoue writes: > > An operator class text_binary_ops that does memcmp()-based comparison of > > text data. The operators are named $<$ etc. for lack of a better idea. > > That lack is further illustrated by the idea to name them "binary-<" etc., > > which wouldn't get through the parser, but it doesn't need to. > > > > The system will use such an index for the queries in question if the > > locale is not "like-safe", in the terminology of the code (I'll end up > > renaming that a little). > > This depends on the assumption that '=' is equivalent in > any locale. Is it guaranteed ? > For example, ( 'a' = 'A' ) isn't allowed in any locale ? The whole point here is not to rely on '='. Instead we use a different opclass which does "locale-safe" comparisons, as said above. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
> -----Original Message----- > From: Peter Eisentraut [mailto:peter_e@gmx.net] > > Hiroshi Inoue writes: > > > > An operator class text_binary_ops that does memcmp()-based > comparison of > > > text data. The operators are named $<$ etc. for lack of a > better idea. > > > That lack is further illustrated by the idea to name them > "binary-<" etc., > > > which wouldn't get through the parser, but it doesn't need to. > > > > > > The system will use such an index for the queries in question if the > > > locale is not "like-safe", in the terminology of the code (I'll end up > > > renaming that a little). > > > > This depends on the assumption that '=' is equivalent in > > any locale. Is it guaranteed ? > > For example, ( 'a' = 'A' ) isn't allowed in any locale ? > > The whole point here is not to rely on '='. Instead we use a different > opclass which does "locale-safe" comparisons, as said above. Isn't 'a' LIKE 'A' if 'a' = 'A' ? LIKE seems to use the collating sequence. regards, Hiroshi Inoue
Hiroshi Inoue writes: > Isn't 'a' LIKE 'A' if 'a' = 'A' ? Yes. But 'a' <> 'A'. > LIKE seems to use the collating sequence. No. The collating sequence defines the order of all possible strings. LIKE doesn't order anything. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
> -----Original Message----- > From: Peter Eisentraut [mailto:peter_e@gmx.net] > > Hiroshi Inoue writes: > > > Isn't 'a' LIKE 'A' if 'a' = 'A' ? > > Yes. But 'a' <> 'A'. Please look at my first question. This depends on the assumption that '=' is equivalent in any locale. Is it guaranteed? For example, ( 'a' = 'A' ) isn't allowed in any locale ?. And your answer was The whole point here is not to rely on '='. Clearly your theory depends on the assumption that If a = b in some locale then a = b in ASCII locale. And where does 'a' <> 'A' come from ? The definition of '=' is a part of collating sequence. > > > LIKE seems to use the collating sequence. > > No. The collating sequence defines the order of all possible strings. > LIKE doesn't order anything. Again where does it come from ? regards, Hiroshi Inoue
Hiroshi Inoue writes: > Please look at my first question. > This depends on the assumption that '=' is equivalent in > any locale. Is it guaranteed ? > For example, ( 'a' = 'A' ) isn't allowed in any locale ?. > > And your answer was > The whole point here is not to rely on '='. > > Clearly your theory depends on the assumption that > If a = b in some locale then a = b in ASCII locale. > > And where does 'a' <> 'A' come from ? > The definition of '=' is a part of collating sequence. > > > > > > LIKE seems to use the collating sequence. > > > > No. The collating sequence defines the order of all possible strings. > > LIKE doesn't order anything. > > Again where does it come from ? Let me elaborate again: We want to be able to use btree indexes for LIKE expressions, under the theory that given the expression col LIKE 'foo%' we can augment the expression col >= 'foo' and col < 'fop', which a btree can handle. Our problem is that this theory was false, because if the operators >= and < are locale-aware they can do just about anything. So my solution was that I implement an extra set of operators >= and < (named $>=$ and $<$ for the heck of it) that are *not* locale-aware so that this whole thing works again. Now, if you look at the code that does the LIKE pattern matching you'll see that it does not use any locale features, it simply compares characters for equality based on their character codes, accounting for the wildcards. Consequentially, this whole operation has nothing to do with locales. It was an error that it did in the first place, that's why we had all these problems. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Peter Eisentraut <peter_e@gmx.net> writes: > Now, if you look at the code that does the LIKE pattern matching you'll > see that it does not use any locale features, it simply compares > characters for equality based on their character codes, accounting for the > wildcards. Consequentially, this whole operation has nothing to do with > locales. But the LIKE code does know about multibyte character sets. Is it safe to assume that memcmp-based sorting will not make any mistakes with multibyte characters? regards, tom lane
> -----Original Message----- > From: Peter Eisentraut > > Hiroshi Inoue writes: > > > Please look at my first question. > > This depends on the assumption that '=' is equivalent in > > any locale. Is it guaranteed ? > > For example, ( 'a' = 'A' ) isn't allowed in any locale ?. > > > > And your answer was > > The whole point here is not to rely on '='. > > > > Clearly your theory depends on the assumption that > > If a = b in some locale then a = b in ASCII locale. > > > > And where does 'a' <> 'A' come from ? > > The definition of '=' is a part of collating sequence. > > > > > > > > > LIKE seems to use the collating sequence. > > > > > > No. The collating sequence defines the order of all possible strings. > > > LIKE doesn't order anything. > > > > Again where does it come from ? > > Let me elaborate again: > > Now, if you look at the code that does the LIKE pattern matching you'll > see that it does not use any locale features, it simply compares > characters for equality based on their character codes, accounting for the > wildcards. Consequentially, this whole operation has nothing to do with > locales. Oh I see your point. Hmm * string1 = string2 * doesn't imply * string1 LIKE string2 * ? Otherwise the current criterion of LIKE matching unwittingly assumes that there's no locale that has the different definition of '=' from that of ASCII locale. I don't think the current implementation is strictly right. regards, Hiroshi Inoue
Hiroshi Inoue writes: > Hmm * string1 = string2 * doesn't imply * string1 LIKE string2 * ? In the current implementation of LIKE, you're right. The SQL standard allows for the possibility that "[d]epending on the collating sequence, two strings may compare as equal even if they are of different lengths or contain different sequences of characters." However, I doubt that this can really happen in practice. For example, in some collating sequences (such as en_US), characters with diacritic marks (accents) are "more equal" than others, but in the end there's always a tie breaker. Or do you know an example where this really happens? > Otherwise the current criterion of LIKE matching unwittingly assumes > that there's no locale that has the different definition of '=' from that of > ASCII locale. I don't think the current implementation is strictly right. Strictly speaking, it isn't. The SQL standard says that literal characters in the pattern must be matched to the characters in the supplied value according to the collating sequence. (See 8.5 GR 3. d) ii) 4).) However, I strongly doubt that that would actually be a good idea. Pattern matching generally doesn't work this way (cf. POSIX regexp), and since locale-aware comparisons are context-sensitive in some cases I don't even want to think about whether this could actually work when faced with wildcards. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Tom Lane writes: > But the LIKE code does know about multibyte character sets. Is it safe > to assume that memcmp-based sorting will not make any mistakes with > multibyte characters? Remember that this memcmp-based sorting is the same sorting that texteq will do when locale is turned off. So if there were a problem it would already exist, but I'm sure there isn't one. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
We're about to start up Postgresql in production, and I am looking for a few tools, so I do not have to reinvent the wheel. I'm looking for tools that: 1) Grep through log files, looking for particulary nasty items, that should get sent to me in email or to a cell phone. I noticed you can use the syslog daemon, which would be great for sorting. Anyone have a sort list? 2) Probably, we'll use Bib Brother or Whats Up Gold to monitor system resources, like cpu, memory, and disk space. Anyone have suggestions? 3) Interesting tools any self-respecting DBA should have, that looks for particular "quirks" in Postgresql. For example, in Informix, we were always checking fragmentation and space issues. Naomi Walker Chief Information Officer Eldorado Computing, Inc. 602-604-3100 nwalker@eldocomp.com
Peter Eisentraut wrote: > > Hiroshi Inoue writes: > > > Hmm * string1 = string2 * doesn't imply * string1 LIKE string2 * ? > > In the current implementation of LIKE, you're right. The SQL standard > allows for the possibility that "[d]epending on the collating sequence, > two strings may compare as equal even if they are of different lengths or > contain different sequences of characters." However, I doubt that this > can really happen in practice. For example, in some collating sequences > (such as en_US), characters with diacritic marks (accents) are "more > equal" than others, but in the end there's always a tie breaker. Or do > you know an example where this really happens? I can see the examples in a documentation M$ SQL Server though I can't try it in reality. For example ignore case(low/high) ignore accents I don't think they are strange as collating sequences. You are establishing a pretty big mechanism and I think you should clarify the assumption. Please tell me the assumption. I can think of the followings. 1) Because the current implementaion of LIKE isn't locale-aware, we should be compatible with it for ever. 2) strcoll(str1, str2) == 0 means strcmp(str1, str2) == 0 in any locale. regards, Hiroshi Inoue
Hiroshi Inoue writes: > 1) Because the current implementaion of LIKE isn't locale-aware, > we should be compatible with it for ever. I'm not sure I intended to say that. The combination of the following factors seems important: a) compatibility is desirable b) no requests to the contrary have been made c) the LIKE locale awareness as defined by SQL is quite brain-dead > 2) strcoll(str1, str2) == 0 means strcmp(str1, str2) == 0 > in any locale. I *think* that is true, but until something happens about 1) it doesn't matter. What is your position: Do you think my solution is short-lived because the LIKE implementation is wrong and should be changed? -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Peter Eisentraut wrote: > > Hiroshi Inoue writes: > > > 1) Because the current implementaion of LIKE isn't locale-aware, > > we should be compatible with it for ever. > > I'm not sure I intended to say that. The combination of the following > factors seems important: > > a) compatibility is desirable > b) no requests to the contrary have been made > c) the LIKE locale awareness as defined by SQL is quite brain-dead > > > 2) strcoll(str1, str2) == 0 means strcmp(str1, str2) == 0 > > in any locale. > > I *think* that is true, but until something happens about 1) it doesn't > matter. > Tatsuo reported a case that strcoll(str1, str2) == 0 but strcmp( str1, str2) != 0 though it seems to be considered as an OS bug. > What is your position: Do you think my solution is short-lived because > the LIKE implementation is wrong and should be changed? Yes I'm afraid of it though I'm not sure. If your solution is short-lived, your change would be not only useless but also harmful. So I expect locale-aware people to confirm that we are in the right direction. I myself don't need the locale support at all. In Japan existent locales themselves seem to be harmful for most people as Tatsuo mentioned already. regards, Hiroshi Inoue
Hiroshi Inoue writes: > If your solution is short-lived, your change would be not > only useless but also harmful. So I expect locale-aware > people to confirm that we are in the right direction. I am a bit confused here. We have tinkered with LIKE indexing at least a year. Now that a solution is found that *works*, it is claimed that it is harmful because LIKE was doing the wrong thing in the first place. OTOH, I have not seen anyone independently claim that LIKE is wrong, nor do I see anyone proposing to actually change it. If we install my fix now we have a system that works better than the previous release with no change in semantics. If someone wants to change LIKE at a later point, reverting my changes would be the least part of that work. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
> > If your solution is short-lived, your change would be not > > only useless but also harmful. So I expect locale-aware > > people to confirm that we are in the right direction. > > I am a bit confused here. We have tinkered with LIKE indexing at least a > year. Now that a solution is found that *works*, it is claimed that it is > harmful because LIKE was doing the wrong thing in the first place. OTOH, > I have not seen anyone independently claim that LIKE is wrong, nor do I > see anyone proposing to actually change it. Because we configure --without-locale ? How should we see or object to wrong behavior in a part of the software we don't need or use ? Andreas
Zeugswetter Andreas SB SD writes: > > I am a bit confused here. We have tinkered with LIKE indexing at > least a > > year. Now that a solution is found that *works*, it is claimed that > it is > > harmful because LIKE was doing the wrong thing in the first place. > OTOH, > > I have not seen anyone independently claim that LIKE is wrong, nor do > I > > see anyone proposing to actually change it. > > Because we configure --without-locale ? How should we see or object to > wrong behavior in a part of the software we don't need or use ? Wrong answer. Read my second sentence again and compare it to your first sentence. -- Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
> -----Original Message----- > From: Peter Eisentraut > > Hiroshi Inoue writes: > > > If your solution is short-lived, your change would be not > > only useless but also harmful. So I expect locale-aware > > people to confirm that we are in the right direction. > > I am a bit confused here. We have tinkered with LIKE indexing at least a > year. Now that a solution is found that *works*, it is claimed that it is > harmful because LIKE was doing the wrong thing in the first place. OTOH, > I have not seen anyone independently claim that LIKE is wrong, nor do I > see anyone proposing to actually change it. Probably no one has used such a (useful) locale that strcoll(str1, str2) == 0 but strcmp(str1, str2) != 0 with PostgreSQL. As long as the vagueness with LIKE is held within utils/adt/like.c I won't complain. > If we install my fix now we > have a system that works better than the previous release with no change > in semantics. If someone wants to change LIKE at a later point, reverting > my changes would be the least part of that work. If your change is useful, the change would be hard to revert. We had better be more careful about the change. For example, you are defining text_binary_ops on text data type but how about introduing a new data type (text collate ASCII) and define text_ascii_ops on the new type ? We could use operators like =, <=, >= instead of $=$, $<=$, $>=$ ... We may be able to to lay the foundation of the collate support for individual column by changing existent text type as (text collate some_collate) . regards, Hiroshi Inoue