Thread: like/ilike improvements

like/ilike improvements

From
Andrew Dunstan
Date:
Starting from a review of a patch from Itagaki Takahiro to improve LIKE 
performance for UTF8-encoded databases, I have been working on improving 
both efficiency of the LIKE/ILIKE code and the code quality.

The main efficiency improvement comes from some fairly tricky analysis 
and discussion on -patches. Essentially there are two calls that we make 
to advance the text and pattern cursors: NextByte and NextChar. In the 
case of single byte charsets these are in fact the same thing, but in 
multi byte charsets they are obviously not, and in that case NextChar is 
a lot more expensive. It turns out (according to the analysis) that the 
only time we actually need to use NextChar is when we are matching an 
"_" in a like/ilike pattern. It also turns out that there are some 
comparison tests that we can hoist out of a loop and thus avoid 
repeating over and over. Also, some calls can be marked "inline" to 
improve efficiency. Finally, the special case of computing lower(x) on 
the fly for ILIKE comparisons on single byte charset strings turns out 
to have the potential to call lower() O(n^2) times, so it has been 
removed and we now treat foo ILIKE bar as lower(foo) LIKE lower(bar) for 
all charsets uniformly. There will be cases where this approach wins and 
cases where it loses, but the wins are potentially dramatic, whereas the 
losses should be mild.

The current state of this work is at 
http://archives.postgresql.org/pgsql-patches/2007-05/msg00385.php

I've been testing it using a set of 5m rows of random Latin1 data - each 
row is between 100 and 400 chars long, and 20% of them (roughly) have 
the string "foo" randomly located within them. The test platform is 
gcc/fc6/AMD64.

I have loaded the data into both Latin1 and UTF8 encoded databases. (I'm 
not sure if there are other multibyte charsets that are compatible with 
Latin1 client encoding). My test is essentially:
 select count(*) from footable where t like '%_foo%'; select count(*) from footable where t ilike '%_foo%';
 select count(*) from footable where t like '%foo%'; select count(*) from footable where t ilike '%foo%';

Note that the "%_" case is probably the worst for these changes, since 
it involves lots of calls to NextChar() (see above).

The multibyte results show significant improvement. The results are 
about flat or a slight improvement for the singlebyte cases. I'll post 
some numbers on this shortly.

But before I commit this I'd appreciate seeing some more testing, both 
for correctness and performance.

cheers

andrew











Re: like/ilike improvements

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> ... It turns out (according to the analysis) that the 
> only time we actually need to use NextChar is when we are matching an 
> "_" in a like/ilike pattern.

I thought we'd determined that advancing bytewise for "%" was also risky,
in two cases:

1. Multibyte character set that is not UTF8 (more specifically, does not
have a guarantee that first bytes and not-first bytes are distinct)

2. "_" immediately follows the "%".
        regards, tom lane


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> ... It turns out (according to the analysis) that the 
>> only time we actually need to use NextChar is when we are matching an 
>> "_" in a like/ilike pattern.
>>     
>
> I thought we'd determined that advancing bytewise for "%" was also risky,
> in two cases:
>
> 1. Multibyte character set that is not UTF8 (more specifically, does not
> have a guarantee that first bytes and not-first bytes are distinct)
>   

I will review - I thought we had ruled that out.

Which non-UTF8 multi-byte charset would be best to test with?

> 2. "_" immediately follows the "%".
>
>             
>   

The patch in fact calls NextChar in this case.

cheers

andrew


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Andrew Dunstan wrote:
>
>
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>  
>>> ... It turns out (according to the analysis) that the only time we 
>>> actually need to use NextChar is when we are matching an "_" in a 
>>> like/ilike pattern.
>>>     
>>
>> I thought we'd determined that advancing bytewise for "%" was also 
>> risky,
>> in two cases:
>>
>> 1. Multibyte character set that is not UTF8 (more specifically, does not
>> have a guarantee that first bytes and not-first bytes are distinct)

I thought we disposed of the idea that there was a problem with charsets 
that didn't do first byte special.

And Dennis said:

> Tom Lane skrev:
>> You could imagine trying to do
>> % a byte at a time (and indeed that's what I'd been thinking it did)
>> but that gets you out of sync which breaks the _ case.
>
> It is only when you have a pattern like '%_' when this is a problem 
> and we could detect this and do byte by byte when it's not. Now we 
> check (*p == '\\') || (*p == '_') in each iteration when we scan over 
> characters for '%', and we could do it once and have different loops 
> for the two cases.

That's pretty much what the patch does now - It never tries to match a 
single byte when it sees "_", whether or not preceeded by "%".

cheers

andrew





Re: like/ilike improvements

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> I thought we'd determined that advancing bytewise for "%" was also 
>> risky, in two cases:
>> 
>> 1. Multibyte character set that is not UTF8 (more specifically, does not
>> have a guarantee that first bytes and not-first bytes are distinct)

> I thought we disposed of the idea that there was a problem with charsets 
> that didn't do first byte special.

We disposed of that in connection with a version of the patch that had
"%" advancing in NextChar units, so that comparison of ordinary
characters was always safely char-aligned.  Consider 2-byte characters
represented as {AB} etc:
DATA    x{AB}{CD}y
PATTERN    %{BC}%

If "%" advances by bytes then this will find a spurious match.  The
only thing that prevents it is if "B" can't be both a leading and a
trailing byte of validly-encoded MB characters.
        regards, tom lane


Re: like/ilike improvements

From
"Guillaume Smet"
Date:
On 5/22/07, Andrew Dunstan <andrew@dunslane.net> wrote:
> But before I commit this I'd appreciate seeing some more testing, both
> for correctness and performance.

Any chance the patch applies cleanly on a 8.2 code base? I can test it
on a real life 8.2 db but I won't have the time to load the data in a
CVS HEAD one.
If there is no obvious reason for it to fail on 8.2, I'll try to see
if I can apply it.

Thanks.

--
Guillaume


Re: like/ilike improvements

From
Andrew - Supernews
Date:
On 2007-05-22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If "%" advances by bytes then this will find a spurious match.  The
> only thing that prevents it is if "B" can't be both a leading and a
> trailing byte of validly-encoded MB characters.

Which is (by design) true in UTF8, but is not true of most other
multibyte charsets.

The %_ case is also trivially handled in UTF8 by simply ensuring that
_ doesn't match a non-initial octet. This allows % to advance by bytes
without danger of losing sync.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services


Re: like/ilike improvements

From
Tom Lane
Date:
Andrew - Supernews <andrew+nonews@supernews.com> writes:
> On 2007-05-22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If "%" advances by bytes then this will find a spurious match.  The
>> only thing that prevents it is if "B" can't be both a leading and a
>> trailing byte of validly-encoded MB characters.

> Which is (by design) true in UTF8, but is not true of most other
> multibyte charsets.

> The %_ case is also trivially handled in UTF8 by simply ensuring that
> _ doesn't match a non-initial octet. This allows % to advance by bytes
> without danger of losing sync.

Yeah.  It seems we need three comparison functions after all:

1. Single-byte character set: needs NextByte and ByteEq only.

2. Generic multi-byte character set: both % and _ must advance by
characters to ensure we never try an out-of-alignment character
comparison.  But simple character comparison works bytewise given
that.  So primitives are NextChar, NextByte, ByteEq.

3. UTF8: % can advance bytewise.  _ must check it is on a first byte
(else return match failure) and if so do NextChar.  So primitives
are NextChar, NextByte, ByteEq, IsFirstByte.

In no case do we need CharEq.  I'd be inclined to drop ByteEq as a
macro and just use "==", too.
        regards, tom lane


Re: like/ilike improvements

From
mark@mark.mielke.cc
Date:
On Tue, May 22, 2007 at 12:12:51PM -0400, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > ... It turns out (according to the analysis) that the 
> > only time we actually need to use NextChar is when we are matching an 
> > "_" in a like/ilike pattern.
> I thought we'd determined that advancing bytewise for "%" was also risky,
> in two cases:
> 1. Multibyte character set that is not UTF8 (more specifically, does not
> have a guarantee that first bytes and not-first bytes are distinct)
> 2. "_" immediately follows the "%".

Have you considered a two pass approach? First pass - match on bytes.
Only if you find a match with the first pass, start a second pass to
do a 'safe' check?

Are there optimizations to recognize whether the index was created as
lower(field) or upper(field), and translate ILIKE to the appropriate
one?

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Yeah.  It seems we need three comparison functions after all:
>   

Yeah, that was my confusion. I thought we had concluded that we didn't, 
but clearly we do.

> 1. Single-byte character set: needs NextByte and ByteEq only.
>
> 2. Generic multi-byte character set: both % and _ must advance by
> characters to ensure we never try an out-of-alignment character
> comparison.  But simple character comparison works bytewise given
> that.  So primitives are NextChar, NextByte, ByteEq.
>
> 3. UTF8: % can advance bytewise.  _ must check it is on a first byte
> (else return match failure) and if so do NextChar.  So primitives
> are NextChar, NextByte, ByteEq, IsFirstByte.
>
> In no case do we need CharEq.  I'd be inclined to drop ByteEq as a
> macro and just use "==", too.
>
>     
>   

I'll work this up. I think it will be easier if I marry cases 1 and 2, 
with NextChar being the same as NextByte in the single byte case.

cheers

andrew



Re: like/ilike improvements

From
db@zigo.dhs.org
Date:
> And Dennis said:
>
>> It is only when you have a pattern like '%_' when this is a problem
>> and we could detect this and do byte by byte when it's not. Now we
>> check (*p == '\\') || (*p == '_') in each iteration when we scan over
>> characters for '%', and we could do it once and have different loops
>> for the two cases.
>
> That's pretty much what the patch does now - It never tries to match a
> single byte when it sees "_", whether or not preceeded by "%".

My comment was about UTF-8 since I thought we were making a special
version for UTF-8. I don't know what properties other multibyte encodings
have.

/Dennis



Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
>
> 3. UTF8: % can advance bytewise.  _ must check it is on a first byte
> (else return match failure) and if so do NextChar.  So primitives
> are NextChar, NextByte, ByteEq, IsFirstByte.
>
>
>   

We should only be able to get out of step from the "%_" case, I believe, 
so we should only need to do the first-byte test in that case (which is 
in a different code path from the normal "_" case. Does that seem right?

cheers

andrew


Re: like/ilike improvements

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> We should only be able to get out of step from the "%_" case, I believe, 
> so we should only need to do the first-byte test in that case (which is 
> in a different code path from the normal "_" case. Does that seem right?

At least put Assert(IsFirstByte()) in the main path.

I'm a bit suspicious of the separate-path business anyway.  Will it do
the right thing with say "%%%_" ?
        regards, tom lane


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> We should only be able to get out of step from the "%_" case, I believe, 
>> so we should only need to do the first-byte test in that case (which is 
>> in a different code path from the normal "_" case. Does that seem right?
>>     
>
> At least put Assert(IsFirstByte()) in the main path.
>
> I'm a bit suspicious of the separate-path business anyway.  Will it do
> the right thing with say "%%%_" ?
>
>             
>   

Yes:

           /* %% is the same as % according to the SQL standard */           /* Advance past all %'s */           while
((plen> 0) && (*p == '%'))               NextByte(p, plen);
 

cheers

andrew




Re: like/ilike improvements

From
Andrew Dunstan
Date:

Andrew Dunstan wrote:
>
>
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>  
>>> We should only be able to get out of step from the "%_" case, I 
>>> believe, so we should only need to do the first-byte test in that 
>>> case (which is in a different code path from the normal "_" case. 
>>> Does that seem right?
>>>     
>>
>> At least put Assert(IsFirstByte()) in the main path.
>>
>> I'm a bit suspicious of the separate-path business anyway.  Will it do
>> the right thing with say "%%%_" ?
>>
>>            
>>   
>
> Yes:
>
>
>            /* %% is the same as % according to the SQL standard */
>            /* Advance past all %'s */
>            while ((plen > 0) && (*p == '%'))
>                NextByte(p, plen);

I am also wondering if it might be sensible to make this choice once at 
backend startup and store a function pointer, instead of doing it for 
every string processed by like/ilike:
   if (pg_database_encoding_max_length() == 1)       return SB_MatchText(s, slen, p, plen);   else if
(GetDatabaseEncoding()== PG_UTF8)       return UTF8_MatchText(s, slen, p, plen);   else       return MB_MatchText(s,
slen,p, plen);
 

I guess that might make matters harder if we ever got per-column encodings.

cheers

andrew




Re: like/ilike improvements

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I am also wondering if it might be sensible to make this choice once at 
> backend startup and store a function pointer, instead of doing it for 
> every string processed by like/ilike:

>     if (pg_database_encoding_max_length() == 1)
>         return SB_MatchText(s, slen, p, plen);
>     else if (GetDatabaseEncoding() == PG_UTF8)
>         return UTF8_MatchText(s, slen, p, plen);
>     else
>         return MB_MatchText(s, slen, p, plen);

> I guess that might make matters harder if we ever got per-column encodings.

Yeah.  It's not saving much anyway ... I wouldn't bother.
        regards, tom lane


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> We should only be able to get out of step from the "%_" case, I believe,
>> so we should only need to do the first-byte test in that case (which is
>> in a different code path from the normal "_" case. Does that seem right?
>>
>
> At least put Assert(IsFirstByte()) in the main path.
>
> I'm a bit suspicious of the separate-path business anyway.  Will it do
> the right thing with say "%%%_" ?
>
>
>

OK, Here is a patch that I am fairly confident does what's been
discussed, as summarised by Tom.

To answer Guillaume's question - it probably won't apply cleanly to 8.2
sources.

cheers

andrew


Index: src/backend/utils/adt/like.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c    27 Feb 2007 23:48:08 -0000    1.68
--- src/backend/utils/adt/like.c    24 May 2007 00:26:49 -0000
***************
*** 28,48 ****
  #define LIKE_ABORT                        (-1)


! static int    MatchText(char *t, int tlen, char *p, int plen);
! static int    MatchTextIC(char *t, int tlen, char *p, int plen);
! static int    MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);

! static int    MBMatchText(char *t, int tlen, char *p, int plen);
! static int    MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
--- 28,50 ----
  #define LIKE_ABORT                        (-1)


! static int    SB_MatchText(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);

! static int    MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

+ static int    UTF8_MatchText(char *t, int tlen, char *p, int plen);
+
+ static int    GenericMatchText(char *s, int slen, char* p, int plen);
+ static int    Generic_Text_IC_like(text *str, text *pat);
+
  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static inline int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
***************
*** 72,86 ****
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
-  *
-  * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
-  * Is it worth refactoring to avoid duplicated code?  They might become
-  * different again in the future.
   */

  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 74,85 ----
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
   */

+ #define NextByte(p, plen)    ((p)++, (plen)--)
+
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***************
*** 90,122 ****
               *(dst)++ = *(src)++; \
         } while (0)

! #define MatchText    MBMatchText
! #define MatchTextIC MBMatchTextIC
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

- #undef CHAREQ
- #undef ICHAREQ
- #undef NextChar
- #undef CopyAdvChar
- #undef MatchText
- #undef MatchTextIC
- #undef do_like_escape
-
  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*(p1) == *(p2))
! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
! #define NextChar(p, plen) ((p)++, (plen)--)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

  #include "like_match.c"

- /* And some support for BYTEA */
- #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2))
- #define BYTEA_NextChar(p, plen) ((p)++, (plen)--)
- #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)


  /*
   *    interface routines called by the function manager
--- 89,148 ----
               *(dst)++ = *(src)++; \
         } while (0)

! #define MatchText    MB_MatchText
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*p1 == *p2)
! #define NextChar(p, plen) NextByte(p, plen)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

+ #define MatchText    SB_MatchText
+ #define do_like_escape    SB_do_like_escape
+
  #include "like_match.c"


+ /* setup to compile like_match.c for UTF8 encoding */
+
+ #define NextChar(p, plen) \
+     do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
+ #define UTF8_OPT
+ #define MatchText    UTF8_MatchText
+
+ #include "like_match.c"
+
+ static inline int
+ GenericMatchText(char *s, int slen, char* p, int plen)
+ {
+     if (pg_database_encoding_max_length() == 1)
+         return SB_MatchText(s, slen, p, plen);
+     else if (GetDatabaseEncoding() == PG_UTF8)
+         return UTF8_MatchText(s, slen, p, plen);
+     else
+         return MB_MatchText(s, slen, p, plen);
+ }
+
+ static inline int
+ Generic_Text_IC_like(text *str, text *pat)
+ {
+     char       *s,
+                *p;
+     int            slen,
+                 plen;
+
+     /* Force inputs to lower case to achieve case insensitivity */
+     str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str)));
+     pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat)));
+     s = VARDATA(str);
+     slen = (VARSIZE(str) - VARHDRSZ);
+     p = VARDATA(pat);
+     plen = (VARSIZE(pat) - VARHDRSZ);
+
+     return GenericMatchText(s, slen, p, plen);
+ }

  /*
   *    interface routines called by the function manager
***************
*** 138,147 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 164,170 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 162,171 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 185,191 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 186,195 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 206,212 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 210,219 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 227,233 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 234,240 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 248,254 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 255,261 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 269,275 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 270,306 ****
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     char       *s,
!                *p;
!     int            slen,
!                 plen;
!
!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = NameStr(*str);
!         slen = strlen(s);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         text       *strtext;

!         strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 284,294 ----
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     text       *strtext;

!     strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!     result = (Generic_Text_IC_like(strtext, pat) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 311,347 ****
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     char       *s,
!                *p;
!     int            slen,
!                 plen;

!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = NameStr(*str);
!         slen = strlen(s);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         text       *strtext;
!
!         strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 299,309 ----
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     text       *strtext;

!     strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!     result = (Generic_Text_IC_like(strtext, pat) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 352,383 ****
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
-     char       *s,
-                *p;
-     int            slen,
-                 plen;

!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 314,321 ----
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;

!     result = (Generic_Text_IC_like(str, pat) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 388,419 ****
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
-     char       *s,
-                *p;
-     int            slen,
-                 plen;

!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 326,333 ----
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;

!     result = (Generic_Text_IC_like(str, pat) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 430,436 ****
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

--- 344,350 ----
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = SB_do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

***************
*** 446,624 ****
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result;
!     char       *p,
!                *e,
!                *r;
!     int            plen,
!                 elen;
!     bool        afterescape;
!
!     p = VARDATA(pat);
!     plen = (VARSIZE(pat) - VARHDRSZ);
!     e = VARDATA(esc);
!     elen = (VARSIZE(esc) - VARHDRSZ);
!
!     /*
!      * Worst-case pattern growth is 2x --- unlikely, but it's hardly worth
!      * trying to calculate the size more accurately than that.
!      */
!     result = (text *) palloc(plen * 2 + VARHDRSZ);
!     r = VARDATA(result);
!
!     if (elen == 0)
!     {
!         /*
!          * No escape character is wanted.  Double any backslashes in the
!          * pattern to make them act like ordinary characters.
!          */
!         while (plen > 0)
!         {
!             if (*p == '\\')
!                 *r++ = '\\';
!             BYTEA_CopyAdvChar(r, p, plen);
!         }
!     }
!     else
!     {
!         /*
!          * The specified escape must be only a single character.
!          */
!         BYTEA_NextChar(e, elen);
!         if (elen != 0)
!             ereport(ERROR,
!                     (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
!                      errmsg("invalid escape string"),
!                   errhint("Escape string must be empty or one character.")));
!
!         e = VARDATA(esc);
!
!         /*
!          * If specified escape is '\', just copy the pattern as-is.
!          */
!         if (*e == '\\')
!         {
!             memcpy(result, pat, VARSIZE(pat));
!             PG_RETURN_BYTEA_P(result);
!         }
!
!         /*
!          * Otherwise, convert occurrences of the specified escape character to
!          * '\', and double occurrences of '\' --- unless they immediately
!          * follow an escape character!
!          */
!         afterescape = false;
!         while (plen > 0)
!         {
!             if (BYTEA_CHAREQ(p, e) && !afterescape)
!             {
!                 *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = true;
!             }
!             else if (*p == '\\')
!             {
!                 *r++ = '\\';
!                 if (!afterescape)
!                     *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = false;
!             }
!             else
!             {
!                 BYTEA_CopyAdvChar(r, p, plen);
!                 afterescape = false;
!             }
!         }
!     }
!
!     SET_VARSIZE(result, r - ((char *) result));

!     PG_RETURN_BYTEA_P(result);
  }

- /*
-  * Same as above, but specifically for bytea (binary) datatype
-  */
- static int
- MatchBytea(char *t, int tlen, char *p, int plen)
- {
-     /* Fast path for match-everything pattern */
-     if ((plen == 1) && (*p == '%'))
-         return LIKE_TRUE;
-
-     while ((tlen > 0) && (plen > 0))
-     {
-         if (*p == '\\')
-         {
-             /* Next pattern char must match literally, whatever it is */
-             BYTEA_NextChar(p, plen);
-             if ((plen <= 0) || !BYTEA_CHAREQ(t, p))
-                 return LIKE_FALSE;
-         }
-         else if (*p == '%')
-         {
-             /* %% is the same as % according to the SQL standard */
-             /* Advance past all %'s */
-             while ((plen > 0) && (*p == '%'))
-                 BYTEA_NextChar(p, plen);
-             /* Trailing percent matches everything. */
-             if (plen <= 0)
-                 return LIKE_TRUE;
-
-             /*
-              * Otherwise, scan for a text position at which we can match the
-              * rest of the pattern.
-              */
-             while (tlen > 0)
-             {
-                 /*
-                  * Optimization to prevent most recursion: don't recurse
-                  * unless first pattern char might match this text char.
-                  */
-                 if (BYTEA_CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
-                 {
-                     int            matched = MatchBytea(t, tlen, p, plen);
-
-                     if (matched != LIKE_FALSE)
-                         return matched; /* TRUE or ABORT */
-                 }
-
-                 BYTEA_NextChar(t, tlen);
-             }
-
-             /*
-              * End of text with no match, so no point in trying later places
-              * to start matching this pattern.
-              */
-             return LIKE_ABORT;
-         }
-         else if ((*p != '_') && !BYTEA_CHAREQ(t, p))
-         {
-             /*
-              * Not the single-character wildcard and no explicit match? Then
-              * time to quit...
-              */
-             return LIKE_FALSE;
-         }
-
-         BYTEA_NextChar(t, tlen);
-         BYTEA_NextChar(p, plen);
-     }
-
-     if (tlen > 0)
-         return LIKE_FALSE;        /* end of pattern, but not of text */
-
-     /* End of input string.  Do we have matching pattern remaining? */
-     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
-                                          * pattern */
-         BYTEA_NextChar(p, plen);
-     if (plen <= 0)
-         return LIKE_TRUE;
-
-     /*
-      * End of text with no match, so no point in trying later places to start
-      * matching this pattern.
-      */
-     return LIKE_ABORT;
- }    /* MatchBytea() */
--- 360,367 ----
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result = SB_do_like_escape((text *)pat, (text *)esc);

!     PG_RETURN_BYTEA_P((bytea *)result);
  }

Index: src/backend/utils/adt/like_match.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like_match.c,v
retrieving revision 1.15
diff -c -r1.15 like_match.c
*** src/backend/utils/adt/like_match.c    27 Feb 2007 23:48:08 -0000    1.15
--- src/backend/utils/adt/like_match.c    24 May 2007 00:26:49 -0000
***************
*** 8,20 ****
   *
   * Before the inclusion, we need to define following macros:
   *
!  * CHAREQ
!  * ICHAREQ
!  * NextChar
!  * CopyAdvChar
!  * MatchText (MBMatchText)
!  * MatchTextIC (MBMatchTextIC)
!  * do_like_escape (MB_do_like_escape)
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
   *
--- 8,17 ----
   *
   * Before the inclusion, we need to define following macros:
   *
!  * NextChar
!  * MatchText - to name of function wanted
!  * do_like_escape - name of function if wanted - requires CHAREQ and CopyAdvChar
!  * UTF8_OPT if UTF8 version of MatchText wanted
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
   *
***************
*** 57,62 ****
--- 54,63 ----
  **
  */

+ #ifdef UTF8_OPT
+ #define IsFirstByte(c) ((*c & 0xC0) != 0x80)
+ #endif
+

  /*--------------------
   *    Match text and p, return LIKE_TRUE, LIKE_FALSE, or LIKE_ABORT.
***************
*** 81,89 ****
      {
          if (*p == '\\')
          {
!             /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
!             if ((plen <= 0) || !CHAREQ(t, p))
                  return LIKE_FALSE;
          }
          else if (*p == '%')
--- 82,90 ----
      {
          if (*p == '\\')
          {
!             /* Next byte must match literally, whatever it is */
!             NextByte(p, plen);
!             if ((plen <= 0) || *p != *t )
                  return LIKE_FALSE;
          }
          else if (*p == '%')
***************
*** 91,97 ****
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
--- 92,98 ----
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextByte(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
***************
*** 100,206 ****
               * Otherwise, scan for a text position at which we can match the
               * rest of the pattern.
               */
!             while (tlen > 0)
              {
!                 /*
!                  * Optimization to prevent most recursion: don't recurse
!                  * unless first pattern char might match this text char.
!                  */
!                 if (CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
                  {
                      int            matched = MatchText(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                         return matched; /* TRUE or ABORT */
!                 }

!                 NextChar(t, tlen);
              }
!
!             /*
!              * End of text with no match, so no point in trying later places
!              * to start matching this pattern.
!              */
!             return LIKE_ABORT;
!         }
!         else if ((*p != '_') && !CHAREQ(t, p))
!         {
!             /*
!              * Not the single-character wildcard and no explicit match? Then
!              * time to quit...
!              */
!             return LIKE_FALSE;
!         }
!
!         NextChar(t, tlen);
!         NextChar(p, plen);
!     }
!
!     if (tlen > 0)
!         return LIKE_FALSE;        /* end of pattern, but not of text */
!
!     /* End of input string.  Do we have matching pattern remaining? */
!     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
!                                          * pattern */
!         NextChar(p, plen);
!     if (plen <= 0)
!         return LIKE_TRUE;
!
!     /*
!      * End of text with no match, so no point in trying later places to start
!      * matching this pattern.
!      */
!     return LIKE_ABORT;
! }    /* MatchText() */
!
! /*
!  * Same as above, but ignore case
!  */
! static int
! MatchTextIC(char *t, int tlen, char *p, int plen)
! {
!     /* Fast path for match-everything pattern */
!     if ((plen == 1) && (*p == '%'))
!         return LIKE_TRUE;
!
!     while ((tlen > 0) && (plen > 0))
!     {
!         if (*p == '\\')
!         {
!             /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
!             if ((plen <= 0) || !ICHAREQ(t, p))
!                 return LIKE_FALSE;
!         }
!         else if (*p == '%')
!         {
!             /* %% is the same as % according to the SQL standard */
!             /* Advance past all %'s */
!             while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
!             /* Trailing percent matches everything. */
!             if (plen <= 0)
!                 return LIKE_TRUE;
!
!             /*
!              * Otherwise, scan for a text position at which we can match the
!              * rest of the pattern.
!              */
!             while (tlen > 0)
              {
!                 /*
!                  * Optimization to prevent most recursion: don't recurse
!                  * unless first pattern char might match this text char.
!                  */
!                 if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_'))
                  {
!                     int            matched = MatchTextIC(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                         return matched; /* TRUE or ABORT */
                  }

!                 NextChar(t, tlen);
              }

              /*
--- 101,156 ----
               * Otherwise, scan for a text position at which we can match the
               * rest of the pattern.
               */
!             if (*p == '_')
              {
! #ifdef UTF8_OPT
!                 if ( ! IsFirstByte(t))
!                     return LIKE_FALSE;
! #endif
!                 while (tlen > 0)
                  {
                      int            matched = MatchText(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                             return matched; /* TRUE or ABORT */

!                     NextChar(t, tlen);
!                 }
              }
!             else if (*p == '\\')
              {
!                 while (tlen > 0)
                  {
!                     int            matched = MatchText(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                             return matched; /* TRUE or ABORT */
!
!                     NextByte(t, tlen);
                  }
+             }
+             else
+             {

!                 while (tlen > 0)
!                 {
!                     /*
!                      * Optimization to prevent most recursion: don't recurse
!                      * unless first pattern char matches this text char.
!                      */
!                     if (*t == *p)
!                     {
!                         int            matched = MatchText(t, tlen, p, plen);
!
!                         if (matched != LIKE_FALSE)
!                             return matched; /* TRUE or ABORT */
!                     }
! #ifdef UTF8OPT
!                     NextByte(t, tlen);
! #else
!                     NextChar(t, tlen);
! #endif
!                 }
              }

              /*
***************
*** 209,215 ****
               */
              return LIKE_ABORT;
          }
!         else if ((*p != '_') && !ICHAREQ(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
--- 159,176 ----
               */
              return LIKE_ABORT;
          }
!         else if (*p == '_')
!         {
! #ifdef UTF8_OPT
!             /* "_" not following "%" - should not be out of sync */
!             Assert( IsFirstByte(t) );
! #endif
!             NextChar(t, tlen);
!
!             NextByte(p, plen);
!             continue;
!         }
!         else if (*t != *p)
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
***************
*** 217,225 ****
               */
              return LIKE_FALSE;
          }
!
!         NextChar(t, tlen);
!         NextChar(p, plen);
      }

      if (tlen > 0)
--- 178,190 ----
               */
              return LIKE_FALSE;
          }
!         /*
!          * It is safe to use NextByte instead of NextChar here, even for
!          * multi-byte character sets, because we are not following
!          * immediately after a wildcard character.
!          */
!         NextByte(t, tlen);
!         NextByte(p, plen);
      }

      if (tlen > 0)
***************
*** 228,234 ****
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextChar(p, plen);
      if (plen <= 0)
          return LIKE_TRUE;

--- 193,200 ----
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextByte(p, plen);
!
      if (plen <= 0)
          return LIKE_TRUE;

***************
*** 237,248 ****
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchTextIC() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
  static text *
  do_like_escape(text *pat, text *esc)
  {
--- 203,216 ----
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchText() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
+ #ifdef do_like_escape
+
  static text *
  do_like_escape(text *pat, text *esc)
  {
***************
*** 336,338 ****
--- 304,324 ----

      return result;
  }
+ #endif /* do_like_escape */
+
+ #ifdef CHAREQ
+ #undef CHAREQ
+ #endif
+
+ #undef NextChar
+ #undef CopyAdvChar
+ #undef MatchText
+
+ #ifdef do_like_escape
+ #undef do_like_escape
+ #endif
+
+ #ifdef UTF8_OPT
+ #undef IsFirstByte
+ #undef UTF8_OPT
+ #endif

Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> We should only be able to get out of step from the "%_" case, I believe,
>> so we should only need to do the first-byte test in that case (which is
>> in a different code path from the normal "_" case. Does that seem right?
>>
>
> At least put Assert(IsFirstByte()) in the main path.
>
> I'm a bit suspicious of the separate-path business anyway.  Will it do
> the right thing with say "%%%_" ?
>
>
>

OK, Here is a patch that I am fairly confident does what's been
discussed, as summarised by Tom.

To answer Guillaume's question - it probably won't apply cleanly to 8.2
sources.

cheers

andrew


Index: src/backend/utils/adt/like.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c    27 Feb 2007 23:48:08 -0000    1.68
--- src/backend/utils/adt/like.c    24 May 2007 00:26:49 -0000
***************
*** 28,48 ****
  #define LIKE_ABORT                        (-1)


! static int    MatchText(char *t, int tlen, char *p, int plen);
! static int    MatchTextIC(char *t, int tlen, char *p, int plen);
! static int    MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);

! static int    MBMatchText(char *t, int tlen, char *p, int plen);
! static int    MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
--- 28,50 ----
  #define LIKE_ABORT                        (-1)


! static int    SB_MatchText(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);

! static int    MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

+ static int    UTF8_MatchText(char *t, int tlen, char *p, int plen);
+
+ static int    GenericMatchText(char *s, int slen, char* p, int plen);
+ static int    Generic_Text_IC_like(text *str, text *pat);
+
  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static inline int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
***************
*** 72,86 ****
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
-  *
-  * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
-  * Is it worth refactoring to avoid duplicated code?  They might become
-  * different again in the future.
   */

  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 74,85 ----
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
   */

+ #define NextByte(p, plen)    ((p)++, (plen)--)
+
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***************
*** 90,122 ****
               *(dst)++ = *(src)++; \
         } while (0)

! #define MatchText    MBMatchText
! #define MatchTextIC MBMatchTextIC
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

- #undef CHAREQ
- #undef ICHAREQ
- #undef NextChar
- #undef CopyAdvChar
- #undef MatchText
- #undef MatchTextIC
- #undef do_like_escape
-
  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*(p1) == *(p2))
! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
! #define NextChar(p, plen) ((p)++, (plen)--)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

  #include "like_match.c"

- /* And some support for BYTEA */
- #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2))
- #define BYTEA_NextChar(p, plen) ((p)++, (plen)--)
- #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)


  /*
   *    interface routines called by the function manager
--- 89,148 ----
               *(dst)++ = *(src)++; \
         } while (0)

! #define MatchText    MB_MatchText
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*p1 == *p2)
! #define NextChar(p, plen) NextByte(p, plen)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

+ #define MatchText    SB_MatchText
+ #define do_like_escape    SB_do_like_escape
+
  #include "like_match.c"


+ /* setup to compile like_match.c for UTF8 encoding */
+
+ #define NextChar(p, plen) \
+     do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
+ #define UTF8_OPT
+ #define MatchText    UTF8_MatchText
+
+ #include "like_match.c"
+
+ static inline int
+ GenericMatchText(char *s, int slen, char* p, int plen)
+ {
+     if (pg_database_encoding_max_length() == 1)
+         return SB_MatchText(s, slen, p, plen);
+     else if (GetDatabaseEncoding() == PG_UTF8)
+         return UTF8_MatchText(s, slen, p, plen);
+     else
+         return MB_MatchText(s, slen, p, plen);
+ }
+
+ static inline int
+ Generic_Text_IC_like(text *str, text *pat)
+ {
+     char       *s,
+                *p;
+     int            slen,
+                 plen;
+
+     /* Force inputs to lower case to achieve case insensitivity */
+     str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str)));
+     pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat)));
+     s = VARDATA(str);
+     slen = (VARSIZE(str) - VARHDRSZ);
+     p = VARDATA(pat);
+     plen = (VARSIZE(pat) - VARHDRSZ);
+
+     return GenericMatchText(s, slen, p, plen);
+ }

  /*
   *    interface routines called by the function manager
***************
*** 138,147 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 164,170 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 162,171 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 185,191 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 186,195 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 206,212 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 210,219 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 227,233 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 234,240 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 248,254 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 255,261 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 269,275 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 270,306 ****
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     char       *s,
!                *p;
!     int            slen,
!                 plen;
!
!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = NameStr(*str);
!         slen = strlen(s);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         text       *strtext;

!         strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 284,294 ----
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     text       *strtext;

!     strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!     result = (Generic_Text_IC_like(strtext, pat) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 311,347 ****
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     char       *s,
!                *p;
!     int            slen,
!                 plen;

!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = NameStr(*str);
!         slen = strlen(s);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         text       *strtext;
!
!         strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 299,309 ----
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     text       *strtext;

!     strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!     result = (Generic_Text_IC_like(strtext, pat) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 352,383 ****
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
-     char       *s,
-                *p;
-     int            slen,
-                 plen;

!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 314,321 ----
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;

!     result = (Generic_Text_IC_like(str, pat) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 388,419 ****
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
-     char       *s,
-                *p;
-     int            slen,
-                 plen;

!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 326,333 ----
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;

!     result = (Generic_Text_IC_like(str, pat) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 430,436 ****
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

--- 344,350 ----
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = SB_do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

***************
*** 446,624 ****
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result;
!     char       *p,
!                *e,
!                *r;
!     int            plen,
!                 elen;
!     bool        afterescape;
!
!     p = VARDATA(pat);
!     plen = (VARSIZE(pat) - VARHDRSZ);
!     e = VARDATA(esc);
!     elen = (VARSIZE(esc) - VARHDRSZ);
!
!     /*
!      * Worst-case pattern growth is 2x --- unlikely, but it's hardly worth
!      * trying to calculate the size more accurately than that.
!      */
!     result = (text *) palloc(plen * 2 + VARHDRSZ);
!     r = VARDATA(result);
!
!     if (elen == 0)
!     {
!         /*
!          * No escape character is wanted.  Double any backslashes in the
!          * pattern to make them act like ordinary characters.
!          */
!         while (plen > 0)
!         {
!             if (*p == '\\')
!                 *r++ = '\\';
!             BYTEA_CopyAdvChar(r, p, plen);
!         }
!     }
!     else
!     {
!         /*
!          * The specified escape must be only a single character.
!          */
!         BYTEA_NextChar(e, elen);
!         if (elen != 0)
!             ereport(ERROR,
!                     (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
!                      errmsg("invalid escape string"),
!                   errhint("Escape string must be empty or one character.")));
!
!         e = VARDATA(esc);
!
!         /*
!          * If specified escape is '\', just copy the pattern as-is.
!          */
!         if (*e == '\\')
!         {
!             memcpy(result, pat, VARSIZE(pat));
!             PG_RETURN_BYTEA_P(result);
!         }
!
!         /*
!          * Otherwise, convert occurrences of the specified escape character to
!          * '\', and double occurrences of '\' --- unless they immediately
!          * follow an escape character!
!          */
!         afterescape = false;
!         while (plen > 0)
!         {
!             if (BYTEA_CHAREQ(p, e) && !afterescape)
!             {
!                 *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = true;
!             }
!             else if (*p == '\\')
!             {
!                 *r++ = '\\';
!                 if (!afterescape)
!                     *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = false;
!             }
!             else
!             {
!                 BYTEA_CopyAdvChar(r, p, plen);
!                 afterescape = false;
!             }
!         }
!     }
!
!     SET_VARSIZE(result, r - ((char *) result));

!     PG_RETURN_BYTEA_P(result);
  }

- /*
-  * Same as above, but specifically for bytea (binary) datatype
-  */
- static int
- MatchBytea(char *t, int tlen, char *p, int plen)
- {
-     /* Fast path for match-everything pattern */
-     if ((plen == 1) && (*p == '%'))
-         return LIKE_TRUE;
-
-     while ((tlen > 0) && (plen > 0))
-     {
-         if (*p == '\\')
-         {
-             /* Next pattern char must match literally, whatever it is */
-             BYTEA_NextChar(p, plen);
-             if ((plen <= 0) || !BYTEA_CHAREQ(t, p))
-                 return LIKE_FALSE;
-         }
-         else if (*p == '%')
-         {
-             /* %% is the same as % according to the SQL standard */
-             /* Advance past all %'s */
-             while ((plen > 0) && (*p == '%'))
-                 BYTEA_NextChar(p, plen);
-             /* Trailing percent matches everything. */
-             if (plen <= 0)
-                 return LIKE_TRUE;
-
-             /*
-              * Otherwise, scan for a text position at which we can match the
-              * rest of the pattern.
-              */
-             while (tlen > 0)
-             {
-                 /*
-                  * Optimization to prevent most recursion: don't recurse
-                  * unless first pattern char might match this text char.
-                  */
-                 if (BYTEA_CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
-                 {
-                     int            matched = MatchBytea(t, tlen, p, plen);
-
-                     if (matched != LIKE_FALSE)
-                         return matched; /* TRUE or ABORT */
-                 }
-
-                 BYTEA_NextChar(t, tlen);
-             }
-
-             /*
-              * End of text with no match, so no point in trying later places
-              * to start matching this pattern.
-              */
-             return LIKE_ABORT;
-         }
-         else if ((*p != '_') && !BYTEA_CHAREQ(t, p))
-         {
-             /*
-              * Not the single-character wildcard and no explicit match? Then
-              * time to quit...
-              */
-             return LIKE_FALSE;
-         }
-
-         BYTEA_NextChar(t, tlen);
-         BYTEA_NextChar(p, plen);
-     }
-
-     if (tlen > 0)
-         return LIKE_FALSE;        /* end of pattern, but not of text */
-
-     /* End of input string.  Do we have matching pattern remaining? */
-     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
-                                          * pattern */
-         BYTEA_NextChar(p, plen);
-     if (plen <= 0)
-         return LIKE_TRUE;
-
-     /*
-      * End of text with no match, so no point in trying later places to start
-      * matching this pattern.
-      */
-     return LIKE_ABORT;
- }    /* MatchBytea() */
--- 360,367 ----
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result = SB_do_like_escape((text *)pat, (text *)esc);

!     PG_RETURN_BYTEA_P((bytea *)result);
  }

Index: src/backend/utils/adt/like_match.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like_match.c,v
retrieving revision 1.15
diff -c -r1.15 like_match.c
*** src/backend/utils/adt/like_match.c    27 Feb 2007 23:48:08 -0000    1.15
--- src/backend/utils/adt/like_match.c    24 May 2007 00:26:49 -0000
***************
*** 8,20 ****
   *
   * Before the inclusion, we need to define following macros:
   *
!  * CHAREQ
!  * ICHAREQ
!  * NextChar
!  * CopyAdvChar
!  * MatchText (MBMatchText)
!  * MatchTextIC (MBMatchTextIC)
!  * do_like_escape (MB_do_like_escape)
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
   *
--- 8,17 ----
   *
   * Before the inclusion, we need to define following macros:
   *
!  * NextChar
!  * MatchText - to name of function wanted
!  * do_like_escape - name of function if wanted - requires CHAREQ and CopyAdvChar
!  * UTF8_OPT if UTF8 version of MatchText wanted
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
   *
***************
*** 57,62 ****
--- 54,63 ----
  **
  */

+ #ifdef UTF8_OPT
+ #define IsFirstByte(c) ((*c & 0xC0) != 0x80)
+ #endif
+

  /*--------------------
   *    Match text and p, return LIKE_TRUE, LIKE_FALSE, or LIKE_ABORT.
***************
*** 81,89 ****
      {
          if (*p == '\\')
          {
!             /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
!             if ((plen <= 0) || !CHAREQ(t, p))
                  return LIKE_FALSE;
          }
          else if (*p == '%')
--- 82,90 ----
      {
          if (*p == '\\')
          {
!             /* Next byte must match literally, whatever it is */
!             NextByte(p, plen);
!             if ((plen <= 0) || *p != *t )
                  return LIKE_FALSE;
          }
          else if (*p == '%')
***************
*** 91,97 ****
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
--- 92,98 ----
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextByte(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
***************
*** 100,206 ****
               * Otherwise, scan for a text position at which we can match the
               * rest of the pattern.
               */
!             while (tlen > 0)
              {
!                 /*
!                  * Optimization to prevent most recursion: don't recurse
!                  * unless first pattern char might match this text char.
!                  */
!                 if (CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
                  {
                      int            matched = MatchText(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                         return matched; /* TRUE or ABORT */
!                 }

!                 NextChar(t, tlen);
              }
!
!             /*
!              * End of text with no match, so no point in trying later places
!              * to start matching this pattern.
!              */
!             return LIKE_ABORT;
!         }
!         else if ((*p != '_') && !CHAREQ(t, p))
!         {
!             /*
!              * Not the single-character wildcard and no explicit match? Then
!              * time to quit...
!              */
!             return LIKE_FALSE;
!         }
!
!         NextChar(t, tlen);
!         NextChar(p, plen);
!     }
!
!     if (tlen > 0)
!         return LIKE_FALSE;        /* end of pattern, but not of text */
!
!     /* End of input string.  Do we have matching pattern remaining? */
!     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
!                                          * pattern */
!         NextChar(p, plen);
!     if (plen <= 0)
!         return LIKE_TRUE;
!
!     /*
!      * End of text with no match, so no point in trying later places to start
!      * matching this pattern.
!      */
!     return LIKE_ABORT;
! }    /* MatchText() */
!
! /*
!  * Same as above, but ignore case
!  */
! static int
! MatchTextIC(char *t, int tlen, char *p, int plen)
! {
!     /* Fast path for match-everything pattern */
!     if ((plen == 1) && (*p == '%'))
!         return LIKE_TRUE;
!
!     while ((tlen > 0) && (plen > 0))
!     {
!         if (*p == '\\')
!         {
!             /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
!             if ((plen <= 0) || !ICHAREQ(t, p))
!                 return LIKE_FALSE;
!         }
!         else if (*p == '%')
!         {
!             /* %% is the same as % according to the SQL standard */
!             /* Advance past all %'s */
!             while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
!             /* Trailing percent matches everything. */
!             if (plen <= 0)
!                 return LIKE_TRUE;
!
!             /*
!              * Otherwise, scan for a text position at which we can match the
!              * rest of the pattern.
!              */
!             while (tlen > 0)
              {
!                 /*
!                  * Optimization to prevent most recursion: don't recurse
!                  * unless first pattern char might match this text char.
!                  */
!                 if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_'))
                  {
!                     int            matched = MatchTextIC(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                         return matched; /* TRUE or ABORT */
                  }

!                 NextChar(t, tlen);
              }

              /*
--- 101,156 ----
               * Otherwise, scan for a text position at which we can match the
               * rest of the pattern.
               */
!             if (*p == '_')
              {
! #ifdef UTF8_OPT
!                 if ( ! IsFirstByte(t))
!                     return LIKE_FALSE;
! #endif
!                 while (tlen > 0)
                  {
                      int            matched = MatchText(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                             return matched; /* TRUE or ABORT */

!                     NextChar(t, tlen);
!                 }
              }
!             else if (*p == '\\')
              {
!                 while (tlen > 0)
                  {
!                     int            matched = MatchText(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                             return matched; /* TRUE or ABORT */
!
!                     NextByte(t, tlen);
                  }
+             }
+             else
+             {

!                 while (tlen > 0)
!                 {
!                     /*
!                      * Optimization to prevent most recursion: don't recurse
!                      * unless first pattern char matches this text char.
!                      */
!                     if (*t == *p)
!                     {
!                         int            matched = MatchText(t, tlen, p, plen);
!
!                         if (matched != LIKE_FALSE)
!                             return matched; /* TRUE or ABORT */
!                     }
! #ifdef UTF8OPT
!                     NextByte(t, tlen);
! #else
!                     NextChar(t, tlen);
! #endif
!                 }
              }

              /*
***************
*** 209,215 ****
               */
              return LIKE_ABORT;
          }
!         else if ((*p != '_') && !ICHAREQ(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
--- 159,176 ----
               */
              return LIKE_ABORT;
          }
!         else if (*p == '_')
!         {
! #ifdef UTF8_OPT
!             /* "_" not following "%" - should not be out of sync */
!             Assert( IsFirstByte(t) );
! #endif
!             NextChar(t, tlen);
!
!             NextByte(p, plen);
!             continue;
!         }
!         else if (*t != *p)
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
***************
*** 217,225 ****
               */
              return LIKE_FALSE;
          }
!
!         NextChar(t, tlen);
!         NextChar(p, plen);
      }

      if (tlen > 0)
--- 178,190 ----
               */
              return LIKE_FALSE;
          }
!         /*
!          * It is safe to use NextByte instead of NextChar here, even for
!          * multi-byte character sets, because we are not following
!          * immediately after a wildcard character.
!          */
!         NextByte(t, tlen);
!         NextByte(p, plen);
      }

      if (tlen > 0)
***************
*** 228,234 ****
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextChar(p, plen);
      if (plen <= 0)
          return LIKE_TRUE;

--- 193,200 ----
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextByte(p, plen);
!
      if (plen <= 0)
          return LIKE_TRUE;

***************
*** 237,248 ****
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchTextIC() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
  static text *
  do_like_escape(text *pat, text *esc)
  {
--- 203,216 ----
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchText() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
+ #ifdef do_like_escape
+
  static text *
  do_like_escape(text *pat, text *esc)
  {
***************
*** 336,338 ****
--- 304,324 ----

      return result;
  }
+ #endif /* do_like_escape */
+
+ #ifdef CHAREQ
+ #undef CHAREQ
+ #endif
+
+ #undef NextChar
+ #undef CopyAdvChar
+ #undef MatchText
+
+ #ifdef do_like_escape
+ #undef do_like_escape
+ #endif
+
+ #ifdef UTF8_OPT
+ #undef IsFirstByte
+ #undef UTF8_OPT
+ #endif

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Re: like/ilike improvements

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> OK, Here is a patch that I am fairly confident does what's been 
> discussed, as summarised by Tom.

> ! #define CHAREQ(p1, p2) (*p1 == *p2)
> ...
> + #define IsFirstByte(c) ((*c & 0xC0) != 0x80)

These macros are bugs waiting to happen.  Please parenthesize the
arguments.

The header comment for like_match.c needs more love:
* This file is included by like.c *twice*, to provide an optimization* for single-byte encodings.

I'm not sure I believe the new coding for %-matching at all, and I
certainly don't like the 100% lack of comments explaining why the
different cases are necessary and just how they differ.  In particular,
once we've advanced more than one character, why does it still matter
what was immediately after the %?

There should somewhere be a block comment explaining all the reasoning
we've so painfully gone through about why the three cases (SB, MB, UTF8)
are needed and how they must differ.
        regards, tom lane


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> I'm not sure I believe the new coding for %-matching at all, and I
> certainly don't like the 100% lack of comments explaining why the
> different cases are necessary and just how they differ.  In particular,
> once we've advanced more than one character, why does it still matter
> what was immediately after the %?
>
>
>   

I don't understand the question. The % processing looks for a place that 
matches what is immediately after the % and then tries to match the 
remainder using a recursive call - so it never actually does matter. I 
haven't actually changed the fundamental logic AFAIK, I have just 
rearranged and optimised it some.

I admit that it takes some pondering to understand - I certainly intend 
to adjust the comments once we are satisfied the code is right. It's 
going to be next week now before I finish this up :-(

cheers

andrew


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> There should somewhere be a block comment explaining all the reasoning
> we've so painfully gone through about why the three cases (SB, MB, UTF8)
> are needed and how they must differ.
>
>             
>   

I'm working on a detailed description/rationale.

However, I have just about convinced myself that we don't need 
IsFirstByte for matching "_" for UTF8, either preceded by "%" or not, as 
it should always be true. Can anyone come up with a counter example? I 
don't mind keeping it and using it in Assert() though.

cheers

andrew


Re: like/ilike improvements

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> However, I have just about convinced myself that we don't need 
> IsFirstByte for matching "_" for UTF8, either preceded by "%" or not, as 
> it should always be true. Can anyone come up with a counter example?

You have to be on a first byte before you can meaningfully apply
NextChar, and you have to use NextChar or else you don't count
characters correctly (eg "__" must match 2 chars not 2 bytes).
        regards, tom lane


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> However, I have just about convinced myself that we don't need 
>> IsFirstByte for matching "_" for UTF8, either preceded by "%" or not, as 
>> it should always be true. Can anyone come up with a counter example?
>>     
>
> You have to be on a first byte before you can meaningfully apply
> NextChar, and you have to use NextChar or else you don't count
> characters correctly (eg "__" must match 2 chars not 2 bytes).
>
>
>   

Yes, I agree completely. However it looks to me like IsFirstByte will in 
fact always be true when we get to call NextChar for matching "_" for UTF8.

cheers

andrew


Re: like/ilike improvements

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> You have to be on a first byte before you can meaningfully apply
>> NextChar, and you have to use NextChar or else you don't count
>> characters correctly (eg "__" must match 2 chars not 2 bytes).

> Yes, I agree completely. However it looks to me like IsFirstByte will in 
> fact always be true when we get to call NextChar for matching "_" for UTF8.

If that's true, the patch is failing to achieve its goal of treating %
bytewise ...
        regards, tom lane


Re: like/ilike improvements

From
Tom Lane
Date:
I wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Yes, I agree completely. However it looks to me like IsFirstByte will in 
>> fact always be true when we get to call NextChar for matching "_" for UTF8.

> If that's true, the patch is failing to achieve its goal of treating %
> bytewise ...

OK, I studied it a bit more and now see what you're driving at: in this
form of the patch, we treat % bytewise unless it is followed by _, in
which case we treat it char-wise.  That seems a good tradeoff,
considering that such a pattern is probably pretty uncommon --- we
should be willing to handle it a bit slower to simplify other cases.

The patch seems still not right though, because you are advancing by
bytes when \ follows %, and that isn't correct in a non-UTF8 encoding.
The invariant we are actually insisting on here is that at the time of
entry to MatchText(), whether initial or recursive, t and p must be
correctly char-aligned.  I suggest the attached revision of the logic as
a way to clarify that, and maybe save a cycle or two in the inner loop
as well.

Yes, I concur we needn't bother with IsFirstByte except maybe as an
Assert.  If it is an Assert it should be up at the top of the function.
        regards, tom lane
    else if (*p == '%')    {        /* %% is the same as % according to the SQL standard */        /* Advance past all
%'s*/        do {            NextByte(p, plen);        } while (plen > 0 && *p == '%');        /* Trailing percent
matcheseverything. */        if (plen <= 0)            return LIKE_TRUE;
 
        /*         * Otherwise, scan for a text position at which we can match the         * rest of the pattern.
 */        if (*p == '_')        {            /*             * If we have %_ in the pattern, we need to advance
char-wise            * to avoid starting the recursive call on a non-char boundary.             * This could be made
moreefficient, but at the cost of making             * other paths slower; it seems not a common case, so handle
    * it this way.             */            while (tlen > 0)            {                int            matched =
MatchText(t,tlen, p, plen);                                    if (matched != LIKE_FALSE)                        return
matched;/* TRUE or ABORT */
 
                NextChar(t, tlen);            }        }        else        {            /*             * Optimization
toprevent most recursion: don't recurse             * unless first pattern char matches the text char.             */
        char    firstpat;
 
            if (*p == '\\')            {                if (plen < 2)                    return LIKE_FALSE;
  firstpat = p[1];            }            else                firstpat = *p;
 
            while (tlen > 0)            {                if (*t == firstpat)                {                    int
       matched = MatchText(t, tlen, p, plen);                                        if (matched != LIKE_FALSE)
              return matched; /* TRUE or ABORT */                }
 
                /*                 * In UTF8 it's cheaper to advance bytewise and do                 * useless
comparisonsof firstpat to non-first bytes                 * than to invoke pg_mblen.  In other character sets
     * we must advance by chars to avoid spurious matches.                 */
 
#ifdef UTF8OPT                NextByte(t, tlen);
#else                NextChar(t, tlen);
#endif            }        }
        /*         * End of text with no match, so no point in trying later places         * to start matching this
pattern.        */        return LIKE_ABORT;    }
 


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Tom Lane wrote:
>>     
>>> You have to be on a first byte before you can meaningfully apply
>>> NextChar, and you have to use NextChar or else you don't count
>>> characters correctly (eg "__" must match 2 chars not 2 bytes).
>>>       
>
>   
>> Yes, I agree completely. However it looks to me like IsFirstByte will in 
>> fact always be true when we get to call NextChar for matching "_" for UTF8.
>>     
>
> If that's true, the patch is failing to achieve its goal of treating %
> bytewise ...
>   


Let's back up. % processing works by looking for a place in the text 
that might match what follows % in the pattern, and then calling itself 
recursively. For UTF8, if what follows % is _, it does that search by 
repeatedly calling NextChar - otherwise it calls NextByte. But if we're 
not processing a wildcard we have to match an actual complete UTF8 char, 
so the fact that we proceed byte-wise won't get us out of sync. whenever 
we happen to encounter an _. We can't rely on that process for other 
multi-byte charsets because the suffix of one char might be the prefix 
of another, so we could get false matches. That can't happen with UTF8.

cheers

andrew


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Tom Lane wrote:
> OK, I studied it a bit more and now see what you're driving at: in this
> form of the patch, we treat % bytewise unless it is followed by _, in
> which case we treat it char-wise.  That seems a good tradeoff,
> considering that such a pattern is probably pretty uncommon --- we
> should be willing to handle it a bit slower to simplify other cases.
>
> The patch seems still not right though, because you are advancing by
> bytes when \ follows %, and that isn't correct in a non-UTF8 encoding.
> The invariant we are actually insisting on here is that at the time of
> entry to MatchText(), whether initial or recursive, t and p must be
> correctly char-aligned.  I suggest the attached revision of the logic as
> a way to clarify that, and maybe save a cycle or two in the inner loop
> as well.
>   

Good, thanks.
> Yes, I concur we needn't bother with IsFirstByte except maybe as an
> Assert.  If it is an Assert it should be up at the top of the function.
>
>   

Looks like emails crossed. Glad we're on the same page. I'm away for a 
few days, so I'll attend to this next week.

cheers

andrew




Re: like/ilike improvements

From
mark@mark.mielke.cc
Date:
On Thu, May 24, 2007 at 11:20:51PM -0400, Tom Lane wrote:
> I wrote:
> > Andrew Dunstan <andrew@dunslane.net> writes:
> >> Yes, I agree completely. However it looks to me like IsFirstByte will in 
> >> fact always be true when we get to call NextChar for matching "_" for UTF8.
> > If that's true, the patch is failing to achieve its goal of treating %
> > bytewise ...
> OK, I studied it a bit more and now see what you're driving at: in this
> form of the patch, we treat % bytewise unless it is followed by _, in
> which case we treat it char-wise.  That seems a good tradeoff,
> considering that such a pattern is probably pretty uncommon --- we
> should be willing to handle it a bit slower to simplify other cases.

Is it worth the effort to pre-process the pattern?

For example:
   %% -> %   %_ -> _%

If applied recursively, this would automatically cover:
   %_%  -> _%   _%_  -> __%

The 'benefit' would be that the pattern matching code would not
need an inner if statement?

Also - I didn't see a response to my query with regard treating UTF-8
as a two pass match. First pass treating it as bytes. If the first pass
matches, the second pass doing a full analysis. In the case of low
selectivity, this will be a win, as the primary filter would be the
full speed byte-based matching.

I had also asked why the focus would be on high selectivity. Why would
the primary filter criteria for a properly designed select statement by
a like with high selectivity? The only time I have ever used like is
when I expect low selectivity. Is there a reasonable case I am missing?

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: like/ilike improvements

From
"Zeugswetter Andreas ADI SD"
Date:
> > However, I have just about convinced myself that we don't need
> > IsFirstByte for matching "_" for UTF8, either preceded by "%" or
not,
> > as it should always be true. Can anyone come up with a counter
example?
>
> You have to be on a first byte before you can meaningfully
> apply NextChar, and you have to use NextChar or else you
> don't count characters correctly (eg "__" must match 2 chars
> not 2 bytes).

Well, for utf8 NextChar could advance to the next char even if the
current byte
position is in the middle of a multibyte char (skip over all 10xxxxxx).

(Assuming utf16 surrogate pairs are not encoded as 2 x 3bytes, which is
not valid utf8 anyway)

Andreas


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Zeugswetter Andreas ADI SD wrote:
>
>> You have to be on a first byte before you can meaningfully 
>> apply NextChar, and you have to use NextChar or else you 
>> don't count characters correctly (eg "__" must match 2 chars 
>> not 2 bytes).
>>     
>
> Well, for utf8 NextChar could advance to the next char even if the
> current byte
> position is in the middle of a multibyte char (skip over all 10xxxxxx). 
>
>
>   

It doesn't matter - we are satisfied that it won't happen. However, this 
might well be a useful optimisation of NextChar() for the UTF8 case as 
something like
 do { (t)++; (tlen)--}  while ((*(t) & 0xC0) == 0x80 && tlen > 0)

In fact, I'm wondering if that might make the other UTF8 stuff redundant 
- the whole point of what we're doing is to avoid expensive calls to 
NextChar;

cheers

andrew


Re: like/ilike improvements

From
Andrew Dunstan
Date:

mark@mark.mielke.cc wrote:
>
>
> Is it worth the effort to pre-process the pattern?
>
> For example:
>
>     %% -> %
>   

This is already done, required by spec.

>     %_ -> _%
>
> If applied recursively, this would automatically cover:
>
>     %_%  -> _%
>     _%_  -> __%
>
> The 'benefit' would be that the pattern matching code would not
> need an inner if statement?
>   

I doubt it's worth the trouble.

> Also - I didn't see a response to my query with regard treating UTF-8
> as a two pass match. First pass treating it as bytes. If the first pass
> matches, the second pass doing a full analysis. In the case of low
> selectivity, this will be a win, as the primary filter would be the
> full speed byte-based matching.
>   

All matching will now be done byte-wise. CHAREQ is dead.

Advancing will also be done byte-wise except for: . where text matching is against _ for UTF8 . where text matching is
against% or _ for other multi-byte charsets.
 

So two passes doesn't sound like much of a win.
> I had also asked why the focus would be on high selectivity. Why would
> the primary filter criteria for a properly designed select statement by
> a like with high selectivity? The only time I have ever used like is
> when I expect low selectivity. Is there a reasonable case I am missing?
>
>
>   

I think you'd need to show something close to a Pareto improvement: 
nobody worse off and some people better off. If you can do that then 
send in a patch.

However, I'm trying to minimise special case processing for UTF8, not 
create a whole new code path for it. The less special cases we have the 
easier it will be to maintain.


cheers

andrew


Re: like/ilike improvements

From
Tom Lane
Date:
"Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at> writes:
>> You have to be on a first byte before you can meaningfully 
>> apply NextChar, and you have to use NextChar or else you 
>> don't count characters correctly (eg "__" must match 2 chars 
>> not 2 bytes).

> Well, for utf8 NextChar could advance to the next char even if the
> current byte
> position is in the middle of a multibyte char (skip over all 10xxxxxx). 

No doubt the macro could be made to work that way, but would it result
in correct matching behavior?  I doubt it --- you just matched an "_"
to half a character, or some such.
        regards, tom lane


Re: like/ilike improvements

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
>   do { (t)++; (tlen)--}  while ((*(t) & 0xC0) == 0x80 && tlen > 0)

The while *must* test those two conditions in the other order.
(Don't laugh --- we've had reproducible bugs before in which the backend
dumped core because of running off the end of memory due to this type
of mistake.)

> In fact, I'm wondering if that might make the other UTF8 stuff redundant 
> - the whole point of what we're doing is to avoid expensive calls to 
> NextChar;

+1 I think.  This test will be approximately the same expense as what
the outer loop would otherwise be (tlen > 0 and *t != firstpat), and
doing it this way removes an entire layer of intellectual complexity.
Even though the code is hardly different, we are no longer dealing in
misaligned pointers anywhere in the match algorithm.
        regards, tom lane


Re: like/ilike improvements

From
"Guillaume Smet"
Date:
Andrew, All,

> On 5/22/07, Andrew Dunstan <andrew@dunslane.net> wrote:
> > But before I commit this I'd appreciate seeing some more testing, both
> > for correctness and performance.

I finally found some time to test this patch on our data. As our
production database is still using 8.1, I made my tests with 8.1.10
and 8.3devel. As I had very weird results, I tested also 8.2.5.

The patch seems to work as expected in my locale. I didn't notice
problems during the tests I made except for the performance problem I
describe below.

The box is a recent dual core box using CentOS 5. It's a test box
installed specifically to test PostgreSQL 8.3. Every version is
compiled with the same compiler. Locale is fr_FR.UTF-8 and database is
UTF-8 too.
The table used to make the tests fits entirely in RAM.

I tested a simple ILIKE query on our data with 8.3devel and it was far
slower than with 8.1.10 (2 times slower). It was obviously not the
expected result as it should have been faster considering your work.
So I decided to test also with 8.2.5 and it seems a performance
regression was introduced in 8.2 (and not in 8.3 which is in fact a
bit faster than 8.2).

I saw this item in 8.2 release notes:
Allow ILIKE to work for multi-byte encodings (Tom)
Internally, ILIKE now calls lower() and then uses LIKE.
Locale-specific regular expression patterns still do not work in these
encodings.

Could it be responsible of such a slow down?

I attached the results of my tests. If anyone needs more information,
I'll be glad to provide them.

Regards,

--
Guillaume

Attachment

Re: like/ilike improvements

From
Andrew Dunstan
Date:

Guillaume Smet wrote:
> Andrew, All,
>
>   
>> On 5/22/07, Andrew Dunstan <andrew@dunslane.net> wrote:
>>     
>>> But before I commit this I'd appreciate seeing some more testing, both
>>> for correctness and performance.
>>>       
>
> I finally found some time to test this patch on our data. As our
> production database is still using 8.1, I made my tests with 8.1.10
> and 8.3devel. As I had very weird results, I tested also 8.2.5.
>
> The patch seems to work as expected in my locale. I didn't notice
> problems during the tests I made except for the performance problem I
> describe below.
>
> The box is a recent dual core box using CentOS 5. It's a test box
> installed specifically to test PostgreSQL 8.3. Every version is
> compiled with the same compiler. Locale is fr_FR.UTF-8 and database is
> UTF-8 too.
> The table used to make the tests fits entirely in RAM.
>
> I tested a simple ILIKE query on our data with 8.3devel and it was far
> slower than with 8.1.10 (2 times slower). It was obviously not the
> expected result as it should have been faster considering your work.
> So I decided to test also with 8.2.5 and it seems a performance
> regression was introduced in 8.2 (and not in 8.3 which is in fact a
> bit faster than 8.2).
>
> I saw this item in 8.2 release notes:
> Allow ILIKE to work for multi-byte encodings (Tom)
> Internally, ILIKE now calls lower() and then uses LIKE.
> Locale-specific regular expression patterns still do not work in these
> encodings.
>
> Could it be responsible of such a slow down?
>
> I attached the results of my tests. If anyone needs more information,
> I'll be glad to provide them.
>
>   
>

Ugh.

It's at least good to see that the LIKE case has some useful speedup in 
8.3.

Can you run the same set of tests in a single byte encoding like latin1?

We might have to look at doing on-demand lowering, but in a case like 
yours it looks like we'd still end up lowering almost every character 
anyway, so I'm not quite sure what to do. Note that the 8.2 change was a 
bug fix, so we can't just revert it. Maybe we need to look closely at 
the efficiency of lower().

cheers

andrew




Re: like/ilike improvements

From
"Guillaume Smet"
Date:
On 9/19/07, Andrew Dunstan <andrew@dunslane.net> wrote:
> It's at least good to see that the LIKE case has some useful speedup in
> 8.3.

It can be due to your patch or to the varlena header patch. Seqscan is
a bit faster too.

> Can you run the same set of tests in a single byte encoding like latin1?

As discussed on IRC, I'm loading the data in a LATIN1 database for
8.1, 8.2 and 8.3. I'll let you know when I have the results.

> We might have to look at doing on-demand lowering, but in a case like
> yours it looks like we'd still end up lowering almost every character
> anyway, so I'm not quite sure what to do. Note that the 8.2 change was a
> bug fix, so we can't just revert it. Maybe we need to look closely at
> the efficiency of lower().

Yes, I know it's a bug fix but the performance decrease is far from
being negligible in our case.

--
Guillaume


Re: like/ilike improvements

From
"Guillaume Smet"
Date:
On 9/19/07, Andrew Dunstan <andrew@dunslane.net> wrote:
> Can you run the same set of tests in a single byte encoding like latin1?

Here are the results (each query was executed several times before this result):

** 8.1 **
cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
ILIKE '%hocus pocus%'; numeve
-----------900024298    87578
(2 rows)

Time: 135.877 ms

** 8.2 **

cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
ILIKE '%hocus pocus%'; numeve
-----------900024298    87578
(2 rows)

Time: 111.595 ms

** 8.3 **

cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
ILIKE '%hocus pocus%'; numeve
-----------900024298    87578
(2 rows)

Time: 160.582 ms

Results are quite surprising but there's no error, I checked them
several times...

If someone can point me to how I can profile query execution, I can
provide more information.

--
Guillaume


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Guillaume Smet wrote:
> On 9/19/07, Andrew Dunstan <andrew@dunslane.net> wrote:
>   
>> Can you run the same set of tests in a single byte encoding like latin1?
>>     
>
> Here are the results (each query was executed several times before this result):
>
> ** 8.1 **
> cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
> ILIKE '%hocus pocus%';
>   numeve
> -----------
>  900024298
>      87578
> (2 rows)
>
> Time: 135.877 ms
>
> ** 8.2 **
>
> cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
> ILIKE '%hocus pocus%';
>   numeve
> -----------
>  900024298
>      87578
> (2 rows)
>
> Time: 111.595 ms
>
> ** 8.3 **
>
> cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
> ILIKE '%hocus pocus%';
>   numeve
> -----------
>  900024298
>      87578
> (2 rows)
>
> Time: 160.582 ms
>
> Results are quite surprising but there's no error, I checked them
> several times...
>
>
>
>   

No, what this suggests to me is that it might have been a mistake to 
make the single byte case work like the multi-byte case, by pre-lowering 
the string, as we did back in May. It confirms my suspicion that the 
lower() code is the culprit. It should really be lightning fast.

Can you retry both sets of tests but this time in C locale? The lower() 
code works differently in C locale, and it might be that we need to look 
at tweaking just one case.

cheers

andrew


Re: like/ilike improvements

From
"Guillaume Smet"
Date:
On 9/20/07, Andrew Dunstan <andrew@dunslane.net> wrote:
> Can you retry both sets of tests but this time in C locale? The lower()
> code works differently in C locale, and it might be that we need to look
> at tweaking just one case.

Here we go with SQL_ASCII:

** 8.1 **

cityvox_c=# SELECT e.numeve FROM evenement e WHERE e.libgeseve LIKE
'%hocus pocus%';numeve
--------
(0 rows)

Time: 117.485 ms

cityvox_c=# SELECT e.numeve FROM evenement e WHERE e.libgeseve ILIKE
'%hocus pocus%'; numeve
-----------900024298    87578
(2 rows)

Time: 132.823 ms

** 8.2 **

cityvox_c=# SELECT e.numeve FROM evenement e WHERE e.libgeseve LIKE
'%hocus pocus%';numeve
--------
(0 rows)

Time: 100.008 ms
cityvox_c=# SELECT e.numeve FROM evenement e WHERE e.libgeseve ILIKE
'%hocus pocus%'; numeve
-----------900024298    87578
(2 rows)

Time: 113.579 ms

** 8.3 **

cityvox_c=# SELECT e.numeve FROM evenement e WHERE e.libgeseve LIKE
'%hocus pocus%';numeve
--------
(0 rows)

Time: 112.462 ms
cityvox_c=# SELECT e.numeve FROM evenement e WHERE e.libgeseve ILIKE
'%hocus pocus%'; numeve
-----------900024298    87578
(2 rows)

Time: 160.961 ms

--
Guillaume


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Guillaume Smet wrote:

app_hls

> On 9/20/07, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> Can you retry both sets of tests but this time in C locale? The lower()
>> code works differently in C locale, and it might be that we need to look
>> at tweaking just one case.
>>
>
>


Please try the attached patch, which goes back to using a special case
for single-byte ILIKE. I want to make sure that at the very least we
don't cause a performance regression with the code done this release. I
can't see an obvious way around the problem for multi-byte case -
lower() then requires converting to and from wchar, and I don't see a
way of avoiding calling lower(). If this is a major blocker I would
suggest you look at an alternative to using ILIKE for your UTF8 data.

cheers

andrew
Index: src/backend/utils/adt/like.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.69
diff -c -r1.69 like.c
*** src/backend/utils/adt/like.c    2 Jun 2007 02:03:42 -0000    1.69
--- src/backend/utils/adt/like.c    20 Sep 2007 13:12:39 -0000
***************
*** 36,41 ****
--- 36,43 ----

  static int    UTF8_MatchText(char *t, int tlen, char *p, int plen);

+ static int    SB_IMatchText(char *t, int tlen, char *p, int plen);
+
  static int    GenericMatchText(char *s, int slen, char* p, int plen);
  static int    Generic_Text_IC_like(text *str, text *pat);

***************
*** 104,109 ****
--- 106,117 ----

  #include "like_match.c"

+ /* setup to compile like_match.c for single byte case insensitive matches */
+ #define MATCH_LOWER
+ #define NextChar(p, plen) NextByte((p), (plen))
+ #define MatchText SB_IMatchText
+
+ #include "like_match.c"

  /* setup to compile like_match.c for UTF8 encoding, using fast NextChar */

***************
*** 132,146 ****
      int            slen,
                  plen;

!     /* Force inputs to lower case to achieve case insensitivity */
!     str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str)));
!     pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat)));
!     s = VARDATA(str);
!     slen = (VARSIZE(str) - VARHDRSZ);
!     p = VARDATA(pat);
!     plen = (VARSIZE(pat) - VARHDRSZ);

!     return GenericMatchText(s, slen, p, plen);
  }

  /*
--- 140,171 ----
      int            slen,
                  plen;

!     /* For efficiency reasons, in the single byte case we don't call
!      * lower() on the pattern and text, but instead call to_lower on each
!      * character.  In the multi-byte case we don't have much choice :-(
!      */

!     if (pg_database_encoding_max_length() > 1)
!     {
!         pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat)));
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         if (GetDatabaseEncoding() == PG_UTF8)
!             return UTF8_MatchText(s, slen, p, plen);
!         else
!             return MB_MatchText(s, slen, p, plen);
!     }
!     else
!     {
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         return SB_IMatchText(s, slen, p, plen);
!     }
  }

  /*
Index: src/backend/utils/adt/like_match.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like_match.c,v
retrieving revision 1.16
diff -c -r1.16 like_match.c
*** src/backend/utils/adt/like_match.c    2 Jun 2007 02:03:42 -0000    1.16
--- src/backend/utils/adt/like_match.c    20 Sep 2007 13:12:39 -0000
***************
*** 13,18 ****
--- 13,19 ----
   * NextChar
   * MatchText - to name of function wanted
   * do_like_escape - name of function if wanted - needs CHAREQ and CopyAdvChar
+  * MATCH_LOWER - define iff using to_lower on text chars
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
   *
***************
*** 68,73 ****
--- 69,80 ----
   *--------------------
   */

+ #ifdef MATCH_LOWER
+ #define TCHAR(t) tolower((t))
+ #else
+ #define TCHAR(t) (t)
+ #endif
+
  static int
  MatchText(char *t, int tlen, char *p, int plen)
  {
***************
*** 143,155 ****
              else
              {

!                 char firstpat = *p ;

                  if (*p == '\\')
                  {
                      if (plen < 2)
                          return LIKE_FALSE;
!                     firstpat = p[1];
                  }

                  while (tlen > 0)
--- 150,162 ----
              else
              {

!                 char firstpat = TCHAR(*p) ;

                  if (*p == '\\')
                  {
                      if (plen < 2)
                          return LIKE_FALSE;
!                     firstpat = TCHAR(p[1]);
                  }

                  while (tlen > 0)
***************
*** 158,164 ****
                       * Optimization to prevent most recursion: don't recurse
                       * unless first pattern byte matches first text byte.
                       */
!                     if (*t == firstpat)
                      {
                          int            matched = MatchText(t, tlen, p, plen);

--- 165,171 ----
                       * Optimization to prevent most recursion: don't recurse
                       * unless first pattern byte matches first text byte.
                       */
!                     if (TCHAR(*t) == firstpat)
                      {
                          int            matched = MatchText(t, tlen, p, plen);

***************
*** 183,189 ****
              NextByte(p, plen);
              continue;
          }
!         else if (*t != *p)
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
--- 190,196 ----
              NextByte(p, plen);
              continue;
          }
!         else if (TCHAR(*t) != TCHAR(*p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
***************
*** 338,340 ****
--- 345,352 ----
  #undef do_like_escape
  #endif

+ #undef TCHAR
+
+ #ifdef MATCH_LOWER
+ #undef MATCH_LOWER
+ #endif

Re: like/ilike improvements

From
Andrew Dunstan
Date:

I wrote:
>
>
> I can't see an obvious way around the problem for multi-byte case - 
> lower() then requires converting to and from wchar, and I don't see a 
> way of avoiding calling lower().

There is one way we could reduce the use of lower() by up to (almost) 
50% in the common case where the pattern is a constant expression (or a 
literal, as it usually is) - cache the result of lower() on the pattern 
rather than call it for every text the pattern is being compared to. I'm 
not quite sure how to achieve that though.

Anyone have good ideas?

cheers

andrew


Re: like/ilike improvements

From
"Guillaume Smet"
Date:
Andrew,

On 9/20/07, Andrew Dunstan <andrew@dunslane.net> wrote:
> Please try the attached patch, which goes back to using a special case
> for single-byte ILIKE. I want to make sure that at the very least we
> don't cause a performance regression with the code done this release. I
> can't see an obvious way around the problem for multi-byte case -
> lower() then requires converting to and from wchar, and I don't see a
> way of avoiding calling lower(). If this is a major blocker I would
> suggest you look at an alternative to using ILIKE for your UTF8 data.

I tested your patch with latin1 and C encoding.

It's better but still slower than 8.2.

C results:
cityvox_c=# SELECT e.numeve FROM evenement e WHERE e.libgeseve LIKE
'%hocus pocus%';numeve
--------
(0 rows)

Time: 113.655 ms

cityvox_c=# SELECT e.numeve FROM evenement e WHERE e.libgeseve ILIKE
'%hocus pocus%'; numeve
-----------900024298    87578
(2 rows)

Time: 124.829 ms

Latin1 results:
cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
LIKE '%hocus pocus%';numeve
--------
(0 rows)

Time: 113.207 ms

cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
ILIKE '%hocus pocus%'; numeve
-----------900024298    87578
(2 rows)

Time: 123.163 ms

And to answer your IRC question about switching to regexp, it's even
slower than the new UTF-8 ILIKE of 8.3 so I don't think it's the way
to go :).

Regards,

--
Guillaume


Re: like/ilike improvements

From
ITAGAKI Takahiro
Date:
"Guillaume Smet" <guillaume.smet@gmail.com> wrote:

> It's better but still slower than 8.2.

It probablly comes from 'var-varlena' feature in 8.3. Now we store
text fields in a compact format on disks and extract them on access.
It consumes some CPU cycles. If all of data are in buffer cache
and the encoding of database is single-byte encodings, the performance
of LIKE in 8.3 was 10-20% slower than 8.2 on my tests.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: like/ilike improvements

From
Gregory Stark
Date:
"ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> writes:

> "Guillaume Smet" <guillaume.smet@gmail.com> wrote:
>
>> It's better but still slower than 8.2.
>
> It probablly comes from 'var-varlena' feature in 8.3. Now we store
> text fields in a compact format on disks and extract them on access.
> It consumes some CPU cycles. If all of data are in buffer cache
> and the encoding of database is single-byte encodings, the performance
> of LIKE in 8.3 was 10-20% slower than 8.2 on my tests.

Hm, it does seem I missed like.c when I converted all the text operators to
avoid detoasting packed varlenas. I'll send a patch in a few minutes to do
that. I'm surprised it would have such a large effect though.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: like/ilike improvements

From
"Guillaume Smet"
Date:
Gregory,

On 9/21/07, Gregory Stark <stark@enterprisedb.com> wrote:
> Hm, it does seem I missed like.c when I converted all the text operators to
> avoid detoasting packed varlenas. I'll send a patch in a few minutes to do
> that. I'm surprised it would have such a large effect though.

The patch doesn't seem to apply cleanly on head (I have a problem with
oracle_compat.c). I tested it though with latin1 encoding.

The LIKE case is better:
cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
LIKE '%hocus pocus%';numeve
--------
(0 rows)

Time: 98.995 ms

-> it seems to be as fast as 8.2 was, now.

The ILIKE case seems to go into an infinite loop: postmaster takes
100% of CPU and the query never finishes.

--
Guillaume


Re: like/ilike improvements

From
Gregory Stark
Date:
"Guillaume Smet" <guillaume.smet@gmail.com> writes:

> Gregory,
>
> On 9/21/07, Gregory Stark <stark@enterprisedb.com> wrote:
>> Hm, it does seem I missed like.c when I converted all the text operators to
>> avoid detoasting packed varlenas. I'll send a patch in a few minutes to do
>> that. I'm surprised it would have such a large effect though.
>
> The patch doesn't seem to apply cleanly on head (I have a problem with
> oracle_compat.c). I tested it though with latin1 encoding.

Huh, I'll check. You have updated recently right? Because Andrew's changes to
ascii and char and so on just went in very recently.

> The LIKE case is better:
> cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
> LIKE '%hocus pocus%';
>  numeve
> --------
> (0 rows)
>
> Time: 98.995 ms
>
> -> it seems to be as fast as 8.2 was, now.
>
> The ILIKE case seems to go into an infinite loop: postmaster takes
> 100% of CPU and the query never finishes.

Can you send me the test cases you're using? It seems to be working for me and
it passes all the regression tests (no idea if they test ilike though).

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: like/ilike improvements

From
Andrew Dunstan
Date:

Guillaume Smet wrote:
> Gregory,
>
> On 9/21/07, Gregory Stark <stark@enterprisedb.com> wrote:
>   
>> Hm, it does seem I missed like.c when I converted all the text operators to
>> avoid detoasting packed varlenas. I'll send a patch in a few minutes to do
>> that. I'm surprised it would have such a large effect though.
>>     
>
> The patch doesn't seem to apply cleanly on head (I have a problem with
> oracle_compat.c). 
>   

It applied cleanly for me.

cheers

andrew


Re: like/ilike improvements

From
"Guillaume Smet"
Date:
On 9/21/07, Andrew Dunstan <andrew@dunslane.net> wrote:
> It applied cleanly for me.

Yes, it seems something was screwed in my tree. I didn't notice you
commited the patch I applied before Greg's patch.
Anyway, I'm starting with a clean tree containing your fix and what
Tom commited but I have to import the data again due to the catalog
version bump :).

New results coming soon.

-- 
Guillaume


Re: like/ilike improvements

From
"Guillaume Smet"
Date:
On 9/22/07, Guillaume Smet <guillaume.smet@gmail.com> wrote:
> Anyway, I'm starting with a clean tree containing your fix and what
> Tom commited but I have to import the data again due to the catalog
> version bump :).

I have some good news. After Andrew's and Greg's patches, CVS HEAD is
as fast as 8.2 with latin1 encoding:
cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
LIKE '%hocus pocus%';numeve
--------
(0 rows)

Time: 102.731 ms
cityvox_latin1=# SELECT e.numeve FROM evenement e WHERE e.libgeseve
ILIKE '%hocus pocus%'; numeve
-----------900024298    87578
(2 rows)

Time: 120.399 ms

So the only regression left is that from 8.2, ILIKE with UTF-8
encoding is really slower than before but it doesn't seem easy to
solve (if possible).

Regards,

-- 
Guillaume