Thread: LIKE fixed(?) for non-ASCII collation orders

LIKE fixed(?) for non-ASCII collation orders

From
Tom Lane
Date:
I have just committed what I hope is the final solution for the problem
of LIKE index optimization in non-ASCII locales.  indxpath.c now
generates both a lower and upper indexqual in all locales.  For example,x LIKE 'foo%t'
will create indexqual conditionsx >= 'foo' AND x < 'fop'
The "<" condition is omitted only if the code is unable to produce a
string greater than the pattern's constant prefix.

Locale-specific variations in collation order are handled by the
cut-and-try method I suggested a while ago:

> The approach I was considering for fixing the problem was to use a
> loop that would repeatedly try to generate a string greater than the
> prefix string.  The basic loop step would increment the rightmost
> byte as Goran has done (or, if it's already up to the limit, chop
> it off and increment the next character position).  Then test to
> see whether the '<' operator actually believes the result is
> greater than the given prefix, and repeat if not.

Although I believe that the code will work in non-ASCII locales and
MULTIBYTE character sets, I'm not set up to try it easily.  Could
some folks try it out and report back?

The critical subroutine is attached, so if you prefer to eyeball
the code, here it is...
        regards, tom lane

/** Try to generate a string greater than the given string or any string it is* a prefix of.  If successful, return a
palloc'dstring; else return NULL.** To work correctly in non-ASCII locales with weird collation orders,* we cannot
simplyincrement "foo" to "fop" --- we have to check whether* we actually produced a string greater than the given one.
Ifnot,* increment the righthand byte again and repeat.  If we max out the righthand* byte, truncate off the last
characterand start incrementing the next.* For example, if "z" were the last character in the sort order, then we*
couldproduce "foo" as a string greater than "fonz".** This could be rather slow in the worst case, but in most cases we
won't*have to try more than one or two strings before succeeding.** XXX in a sufficiently weird locale, this might
produceincorrect results?* For example, in German I believe "ss" is treated specially --- if we are* given "foos" and
return"foot", will this actually be greater than "fooss"?*/
 
static char *
make_greater_string(const char * str, Oid datatype)
{   char       *workstr;   int         len;
   /* Make a modifiable copy, which will be our return value if successful */   workstr = pstrdup((char *) str);
   while ((len = strlen(workstr)) > 0)   {       unsigned char  *lastchar = (unsigned char *) (workstr + len - 1);
       /*        * Try to generate a larger string by incrementing the last byte.        */       while (*lastchar <
(unsignedchar) 255)       {           (*lastchar)++;           if (string_lessthan(str, workstr, datatype))
 return workstr;            /* Success! */       }       /*        * Truncate off the last character, which might be
morethan 1 byte        * in MULTIBYTE case.        */
 
#ifdef MULTIBYTE       len = pg_mbcliplen((const unsigned char *) workstr, len, len-1);       workstr[len] = '\0';
#else       *lastchar = '\0';
#endif   }
   /* Failed... */   pfree(workstr);   return NULL;
}


Re: [HACKERS] LIKE fixed(?) for non-ASCII collation orders

From
Andreas Degert
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I have just committed what I hope is the final solution for the problem
> of LIKE index optimization in non-ASCII locales.  indxpath.c now
> generates both a lower and upper indexqual in all locales.  For example,
>     x LIKE 'foo%t'
> will create indexqual conditions
>     x >= 'foo' AND x < 'fop'
> The "<" condition is omitted only if the code is unable to produce a
> string greater than the pattern's constant prefix.

the .. >= .. < .. condition will result in addtional matches, like 'fo ot',
so you still have to check with LIKE. I'm using such an expression
(without the additional LIKE) in an application, and it seems to match 
at least everything that LIKE would match too (my users would have
complained about missing matches, but i never did a formal test or
evaluation).


Re: [HACKERS] LIKE fixed(?) for non-ASCII collation orders

From
Tom Lane
Date:
Andreas Degert <ad@papyrus-gmbh.de> writes:
> the .. >= .. < .. condition will result in addtional matches, like 'fo ot',
> so you still have to check with LIKE.

Of course.  The point of all this is just to create a "prefilter" that can
be implemented by scanning an index, so that most of the rows in the
table need not be visited at all when the LIKE pattern has the right
form.  But the original LIKE operator is still executed as part of the
WHERE condition for each row.
        regards, tom lane


Re: [HACKERS] LIKE fixed(?) for non-ASCII collation orders

From
Bruce Momjian
Date:
Yes, we always still match the LIKE.  The additional comparisons just
make the number of rows smaller.


> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > I have just committed what I hope is the final solution for the problem
> > of LIKE index optimization in non-ASCII locales.  indxpath.c now
> > generates both a lower and upper indexqual in all locales.  For example,
> >     x LIKE 'foo%t'
> > will create indexqual conditions
> >     x >= 'foo' AND x < 'fop'
> > The "<" condition is omitted only if the code is unable to produce a
> > string greater than the pattern's constant prefix.
> 
> the .. >= .. < .. condition will result in addtional matches, like 'fo ot',
> so you still have to check with LIKE. I'm using such an expression
> (without the additional LIKE) in an application, and it seems to match 
> at least everything that LIKE would match too (my users would have
> complained about missing matches, but i never did a formal test or
> evaluation).
> 
> ************
> 


--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026