Thread: Progress report on locale safe LIKE indexing

Progress report on locale safe LIKE indexing

From
Peter Eisentraut
Date:
I have so far implemented the following:

An operator class text_binary_ops that does memcmp()-based comparison of
text data.  The operators are named $<$ etc. for lack of a better idea.
That lack is further illustrated by the idea to name them "binary-<" etc.,
which wouldn't get through the parser, but it doesn't need to.

The system will use such an index for the queries in question if the
locale is not "like-safe", in the terminology of the code (I'll end up
renaming that a little).  If the locale is plain C/POSIX, it will use the
"normal" index.  (Should it be able to use either one in that case?)

Open issues:

Currently, this only works on text.  Before I go out and duplicate all
this code and the catalog entries for varchar and bpchar, is there a
Better Way?

In match_special_index_operator(), the code looks up the required
operators by name (<, >=).  In other places we go out of our way to not
attach predefined meanings to operator names.  (In yet other places we do,
of course.)  Wouldn't it be better to test whether the candidate index is
a btree and then select the operator to use from the btree strategy
entries?  One uglification factor here is that the comment
  * We cheat a little by not checking for availability of "=" ... any  * index type should support "=", methinks.

no longer holds.


PS: I wasn't able to find or reconstruct a concrete test case for this in
the archives.  Naturally, I'd accept this approach on theoretical purity,
but if someone remembers a real tough one, let me know.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Progress report on locale safe LIKE indexing

From
Hiroshi Inoue
Date:
Peter Eisentraut wrote:
> 
> I have so far implemented the following:
> 
> An operator class text_binary_ops that does memcmp()-based comparison of
> text data.  The operators are named $<$ etc. for lack of a better idea.
> That lack is further illustrated by the idea to name them "binary-<" etc.,
> which wouldn't get through the parser, but it doesn't need to.
> 
> The system will use such an index for the queries in question if the
> locale is not "like-safe", in the terminology of the code (I'll end up
> renaming that a little).

This depends on the assumption that '=' is equivalent in
any locale. Is it guaranteed ?
For example, ( 'a' = 'A' ) isn't allowed in any locale ?

regards,
Hiroshi Inoue


Re: Progress report on locale safe LIKE indexing

From
Peter Eisentraut
Date:
Hiroshi Inoue writes:

> > An operator class text_binary_ops that does memcmp()-based comparison of
> > text data.  The operators are named $<$ etc. for lack of a better idea.
> > That lack is further illustrated by the idea to name them "binary-<" etc.,
> > which wouldn't get through the parser, but it doesn't need to.
> >
> > The system will use such an index for the queries in question if the
> > locale is not "like-safe", in the terminology of the code (I'll end up
> > renaming that a little).
>
> This depends on the assumption that '=' is equivalent in
> any locale. Is it guaranteed ?
> For example, ( 'a' = 'A' ) isn't allowed in any locale ?

The whole point here is not to rely on '='.  Instead we use a different
opclass which does "locale-safe" comparisons, as said above.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



RE: Progress report on locale safe LIKE indexing

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Peter Eisentraut [mailto:peter_e@gmx.net]
>
> Hiroshi Inoue writes:
>
> > > An operator class text_binary_ops that does memcmp()-based
> comparison of
> > > text data.  The operators are named $<$ etc. for lack of a
> better idea.
> > > That lack is further illustrated by the idea to name them
> "binary-<" etc.,
> > > which wouldn't get through the parser, but it doesn't need to.
> > >
> > > The system will use such an index for the queries in question if the
> > > locale is not "like-safe", in the terminology of the code (I'll end up
> > > renaming that a little).
> >
> > This depends on the assumption that '=' is equivalent in
> > any locale. Is it guaranteed ?
> > For example, ( 'a' = 'A' ) isn't allowed in any locale ?
>
> The whole point here is not to rely on '='.  Instead we use a different
> opclass which does "locale-safe" comparisons, as said above.

Isn't 'a' LIKE 'A' if 'a' = 'A' ?
LIKE seems to use the collating sequence.

regards,
Hiroshi  Inoue



RE: Progress report on locale safe LIKE indexing

From
Peter Eisentraut
Date:
Hiroshi Inoue writes:

> Isn't 'a' LIKE 'A' if 'a' = 'A' ?

Yes.  But 'a' <> 'A'.

> LIKE seems to use the collating sequence.

No.  The collating sequence defines the order of all possible strings.
LIKE doesn't order anything.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



RE: Progress report on locale safe LIKE indexing

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Peter Eisentraut [mailto:peter_e@gmx.net]
> 
> Hiroshi Inoue writes:
> 
> > Isn't 'a' LIKE 'A' if 'a' = 'A' ?
> 
> Yes.  But 'a' <> 'A'.

Please look at my first question.  This depends on the assumption that '=' is equivalent in  any locale. Is it
guaranteed?  For example, ( 'a' = 'A' ) isn't allowed in any locale ?. 
 

And your answer was  The whole point here is not to rely on '='. 

Clearly your theory depends on the assumption that  If a = b in some locale then a = b in ASCII locale.

And where does 'a' <> 'A' come from ?
The definition of '=' is a part of collating sequence.

> 
> > LIKE seems to use the collating sequence.
> 
> No.  The collating sequence defines the order of all possible strings.
> LIKE doesn't order anything.

Again where does it come from ?

regards,
Hiroshi Inoue


RE: Progress report on locale safe LIKE indexing

From
Peter Eisentraut
Date:
Hiroshi Inoue writes:

> Please look at my first question.
>    This depends on the assumption that '=' is equivalent in
>    any locale. Is it guaranteed ?
>    For example, ( 'a' = 'A' ) isn't allowed in any locale ?.
>
> And your answer was
>    The whole point here is not to rely on '='.
>
> Clearly your theory depends on the assumption that
>    If a = b in some locale then a = b in ASCII locale.
>
> And where does 'a' <> 'A' come from ?
> The definition of '=' is a part of collating sequence.
>
> >
> > > LIKE seems to use the collating sequence.
> >
> > No.  The collating sequence defines the order of all possible strings.
> > LIKE doesn't order anything.
>
> Again where does it come from ?

Let me elaborate again:

We want to be able to use btree indexes for LIKE expressions, under the
theory that given the expression col LIKE 'foo%' we can augment the
expression col >= 'foo' and col < 'fop', which a btree can handle.  Our
problem is that this theory was false, because if the operators >= and <
are locale-aware they can do just about anything.  So my solution was that
I implement an extra set of operators >= and < (named $>=$ and $<$ for the
heck of it) that are *not* locale-aware so that this whole thing works
again.

Now, if you look at the code that does the LIKE pattern matching you'll
see that it does not use any locale features, it simply compares
characters for equality based on their character codes, accounting for the
wildcards.  Consequentially, this whole operation has nothing to do with
locales.  It was an error that it did in the first place, that's why we
had all these problems.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Progress report on locale safe LIKE indexing

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Now, if you look at the code that does the LIKE pattern matching you'll
> see that it does not use any locale features, it simply compares
> characters for equality based on their character codes, accounting for the
> wildcards.  Consequentially, this whole operation has nothing to do with
> locales.

But the LIKE code does know about multibyte character sets.  Is it safe
to assume that memcmp-based sorting will not make any mistakes with
multibyte characters?
        regards, tom lane


RE: Progress report on locale safe LIKE indexing

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Peter Eisentraut
>
> Hiroshi Inoue writes:
>
> > Please look at my first question.
> >    This depends on the assumption that '=' is equivalent in
> >    any locale. Is it guaranteed ?
> >    For example, ( 'a' = 'A' ) isn't allowed in any locale ?.
> >
> > And your answer was
> >    The whole point here is not to rely on '='.
> >
> > Clearly your theory depends on the assumption that
> >    If a = b in some locale then a = b in ASCII locale.
> >
> > And where does 'a' <> 'A' come from ?
> > The definition of '=' is a part of collating sequence.
> >
> > >
> > > > LIKE seems to use the collating sequence.
> > >
> > > No.  The collating sequence defines the order of all possible strings.
> > > LIKE doesn't order anything.
> >
> > Again where does it come from ?
>
> Let me elaborate again:
>
> Now, if you look at the code that does the LIKE pattern matching you'll
> see that it does not use any locale features, it simply compares
> characters for equality based on their character codes, accounting for the
> wildcards.  Consequentially, this whole operation has nothing to do with
> locales.

Oh I see your point.
Hmm  * string1 = string2 * doesn't imply * string1 LIKE string2 * ?

Otherwise the current criterion of LIKE matching unwittingly assumes
that there's no locale that has the different definition of '=' from that of
ASCII locale.  I don't think the current implementation is strictly right.

regards,
Hiroshi Inoue



RE: Progress report on locale safe LIKE indexing

From
Peter Eisentraut
Date:
Hiroshi Inoue writes:

> Hmm  * string1 = string2 * doesn't imply * string1 LIKE string2 * ?

In the current implementation of LIKE, you're right.  The SQL standard
allows for the possibility that "[d]epending on the collating sequence,
two strings may compare as equal even if they are of different lengths or
contain different sequences of characters."  However, I doubt that this
can really happen in practice.  For example, in some collating sequences
(such as en_US), characters with diacritic marks (accents) are "more
equal" than others, but in the end there's always a tie breaker.  Or do
you know an example where this really happens?

> Otherwise the current criterion of LIKE matching unwittingly assumes
> that there's no locale that has the different definition of '=' from that of
> ASCII locale.  I don't think the current implementation is strictly right.

Strictly speaking, it isn't.  The SQL standard says that literal
characters in the pattern must be matched to the characters in the
supplied value according to the collating sequence.  (See 8.5 GR 3. d) ii)
4).)

However, I strongly doubt that that would actually be a good idea.
Pattern matching generally doesn't work this way (cf. POSIX regexp), and
since locale-aware comparisons are context-sensitive in some cases I don't
even want to think about whether this could actually work when faced with
wildcards.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Progress report on locale safe LIKE indexing

From
Peter Eisentraut
Date:
Tom Lane writes:

> But the LIKE code does know about multibyte character sets.  Is it safe
> to assume that memcmp-based sorting will not make any mistakes with
> multibyte characters?

Remember that this memcmp-based sorting is the same sorting that texteq
will do when locale is turned off.  So if there were a problem it would
already exist, but I'm sure there isn't one.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Tool Search

From
Naomi Walker
Date:
We're about to start up Postgresql in production, and I am looking for a 
few tools, so I do not have to reinvent the wheel.

I'm looking for tools that:
 1) Grep through log files, looking for particulary nasty items, that 
should get sent to me in email or to a cell phone.  I noticed you can use 
the syslog daemon, which would be great for sorting.  Anyone have a sort list?
  2) Probably, we'll use Bib Brother or Whats Up Gold to monitor system 
resources, like cpu, memory, and disk space.  Anyone have suggestions?
  3) Interesting tools any self-respecting DBA should have, that looks for 
particular "quirks" in Postgresql.  For example, in Informix, we were 
always checking fragmentation and space issues.


Naomi Walker
Chief Information Officer
Eldorado Computing, Inc.
602-604-3100
nwalker@eldocomp.com



Re: Progress report on locale safe LIKE indexing

From
Hiroshi Inoue
Date:
Peter Eisentraut wrote:
> 
> Hiroshi Inoue writes:
> 
> > Hmm  * string1 = string2 * doesn't imply * string1 LIKE string2 * ?
> 
> In the current implementation of LIKE, you're right.  The SQL standard
> allows for the possibility that "[d]epending on the collating sequence,
> two strings may compare as equal even if they are of different lengths or
> contain different sequences of characters."  However, I doubt that this
> can really happen in practice.  For example, in some collating sequences
> (such as en_US), characters with diacritic marks (accents) are "more
> equal" than others, but in the end there's always a tie breaker.  Or do
> you know an example where this really happens?

I can see the examples in a documentation M$ SQL Server though
I can't try it in reality.
For example ignore case(low/high) ignore accents 
I don't think they are strange as collating sequences.

You are establishing a pretty big mechanism and I think
you should clarify the assumption.
Please tell me the assumption.
I can think of the followings.

1) Because the current implementaion of LIKE isn't locale-aware,  we should be compatible with it for ever.
2) strcoll(str1, str2) == 0 means strcmp(str1, str2) == 0  in any locale.

regards,
Hiroshi Inoue


Re: Progress report on locale safe LIKE indexing

From
Peter Eisentraut
Date:
Hiroshi Inoue writes:

> 1) Because the current implementaion of LIKE isn't locale-aware,
>    we should be compatible with it for ever.

I'm not sure I intended to say that.  The combination of the following
factors seems important:

a) compatibility is desirable
b) no requests to the contrary have been made
c) the LIKE locale awareness as defined by SQL is quite brain-dead

> 2) strcoll(str1, str2) == 0 means strcmp(str1, str2) == 0
>    in any locale.

I *think* that is true, but until something happens about 1) it doesn't
matter.


What is your position:  Do you think my solution is short-lived because
the LIKE implementation is wrong and should be changed?

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



Re: Progress report on locale safe LIKE indexing

From
Hiroshi Inoue
Date:
Peter Eisentraut wrote:
> 
> Hiroshi Inoue writes:
> 
> > 1) Because the current implementaion of LIKE isn't locale-aware,
> >    we should be compatible with it for ever.
> 
> I'm not sure I intended to say that.  The combination of the following
> factors seems important:
> 
> a) compatibility is desirable
> b) no requests to the contrary have been made
> c) the LIKE locale awareness as defined by SQL is quite brain-dead
> 
> > 2) strcoll(str1, str2) == 0 means strcmp(str1, str2) == 0
> >    in any locale.
> 
> I *think* that is true, but until something happens about 1) it doesn't
> matter.
> 

Tatsuo reported a case that strcoll(str1, str2) == 0 but strcmp(
str1, str2) != 0 though it seems to be considered as an OS bug.

> What is your position:  Do you think my solution is short-lived because
> the LIKE implementation is wrong and should be changed?

Yes I'm afraid of it though I'm not sure.
If your solution is short-lived, your change would be not
only useless but also harmful. So I expect locale-aware
people to confirm that we are in the right direction.
I myself don't need the locale support at all. In Japan
existent locales themselves seem to be harmful for most
people as Tatsuo mentioned already.

regards,
Hiroshi Inoue


Re: Progress report on locale safe LIKE indexing

From
Peter Eisentraut
Date:
Hiroshi Inoue writes:

> If your solution is short-lived, your change would be not
> only useless but also harmful. So I expect locale-aware
> people to confirm that we are in the right direction.

I am a bit confused here.  We have tinkered with LIKE indexing at least a
year.  Now that a solution is found that *works*, it is claimed that it is
harmful because LIKE was doing the wrong thing in the first place.  OTOH,
I have not seen anyone independently claim that LIKE is wrong, nor do I
see anyone proposing to actually change it.  If we install my fix now we
have a system that works better than the previous release with no change
in semantics.  If someone wants to change LIKE at a later point, reverting
my changes would be the least part of that work.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



RE: Progress report on locale safe LIKE indexing

From
"Zeugswetter Andreas SB SD"
Date:
> > If your solution is short-lived, your change would be not
> > only useless but also harmful. So I expect locale-aware
> > people to confirm that we are in the right direction.
> 
> I am a bit confused here.  We have tinkered with LIKE indexing at
least a
> year.  Now that a solution is found that *works*, it is claimed that
it is
> harmful because LIKE was doing the wrong thing in the first place.
OTOH,
> I have not seen anyone independently claim that LIKE is wrong, nor do
I
> see anyone proposing to actually change it.

Because we configure --without-locale ? How should we see or object to 
wrong behavior in a part of the software we don't need or use ?

Andreas


RE: Progress report on locale safe LIKE indexing

From
Peter Eisentraut
Date:
Zeugswetter Andreas SB SD writes:

> > I am a bit confused here.  We have tinkered with LIKE indexing at
> least a
> > year.  Now that a solution is found that *works*, it is claimed that
> it is
> > harmful because LIKE was doing the wrong thing in the first place.
> OTOH,
> > I have not seen anyone independently claim that LIKE is wrong, nor do
> I
> > see anyone proposing to actually change it.
>
> Because we configure --without-locale ? How should we see or object to
> wrong behavior in a part of the software we don't need or use ?

Wrong answer.  Read my second sentence again and compare it to your first
sentence.

-- 
Peter Eisentraut   peter_e@gmx.net   http://funkturm.homeip.net/~peter



RE: Progress report on locale safe LIKE indexing

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Peter Eisentraut
>
> Hiroshi Inoue writes:
>
> > If your solution is short-lived, your change would be not
> > only useless but also harmful. So I expect locale-aware
> > people to confirm that we are in the right direction.
>
> I am a bit confused here.  We have tinkered with LIKE indexing at least a
> year.  Now that a solution is found that *works*, it is claimed that it is
> harmful because LIKE was doing the wrong thing in the first place.  OTOH,
> I have not seen anyone independently claim that LIKE is wrong, nor do I
> see anyone proposing to actually change it.

Probably no one has used such a (useful) locale that
strcoll(str1, str2) == 0 but strcmp(str1, str2) != 0
with PostgreSQL. As long as the vagueness with LIKE
is held within utils/adt/like.c I won't complain.

> If we install my fix now we
> have a system that works better than the previous release with no change
> in semantics.  If someone wants to change LIKE at a later point, reverting
> my changes would be the least part of that work.

If your change is useful, the change would be hard to revert.
We had better be more careful about the change.
For example, you are defining text_binary_ops on text data type
but how about introduing a new data type (text collate ASCII) and
define text_ascii_ops on the new type ? We could use operators
like =, <=, >= instead of $=$, $<=$, $>=$ ...  We may be able to
to lay the foundation of the collate support for individual column
by changing existent text type as (text collate some_collate) .

regards,
Hiroshi Inoue