Re: Boyer-More-Horspool searching LIKE queries - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Boyer-More-Horspool searching LIKE queries
Date
Msg-id b0a868ae-59e3-2410-2f16-cddc31c324a1@iki.fi
Whole thread Raw
In response to Boyer-More-Horspool searching LIKE queries  (Atsushi Ogawa <atsushi.ogawa@gmail.com>)
Responses Re: Boyer-More-Horspool searching LIKE queries
List pgsql-hackers
On 11/01/2022 15:55, Atsushi Ogawa wrote:
> I have created a patch to enable the Boyer-More-Horspool search
> algorithm (B-M-H) for LIKE queries.

Cool!

> The conditions under which B-M-H can be used are as follows.
> 
> (1) B-M-H in LIKE search supports only single-byte character sets and UTF8.
> Multibyte character sets does not support, because it may contain another
> characters in the byte sequence. For UTF-8, it works fine, because in
> UTF-8 the byte sequence of one character cannot contain another character.

You can make it work with any encoding, if you check for that case after 
you find a match. See how text_position() does it.

> (3) The pattern string should be at least 4 characters.
> For example, '%AB%' can use B-M-H.

To be precise, the patch checks that the pattern string is at least 4 
*bytes* long. A pattern like E'%\U0001F418%' would benefit too.

If I'm reading the code correctly, it doesn't account for escapes 
correctly. It will use B-M-H for a pattern like '%\\%', even though 
that's just searching for a single backslash and won't benefit from B-M-H.

> (4) The first and last character of the pattern string should be '%'.

I wonder if we can do better than that. If you have a pattern like 
'%foo%bar', its pretty obvious (to a human) that you can quickly check 
if the string ends in 'bar', and then check if it also contains the 
substring 'foo'. Is there some way to generalize that?

Looking at MatchText() in like.c, there is this piece of code:

>         else if (*p == '%')
>         {
>             char        firstpat;
> 
>             /*
>              * % processing is essentially a search for a text position at
>              * which the remainder of the text matches the remainder of the
>              * pattern, using a recursive call to check each potential match.
>              *
>              * If there are wildcards immediately following the %, we can skip
>              * over them first, using the idea that any sequence of N _'s and
>              * one or more %'s is equivalent to N _'s and one % (ie, it will
>              * match any sequence of at least N text characters).  In this way
>              * we will always run the recursive search loop using a pattern
>              * fragment that begins with a literal character-to-match, thereby
>              * not recursing more than we have to.
>              */
>             NextByte(p, plen);
> 
>             while (plen > 0)
>             {
>                 if (*p == '%')
>                     NextByte(p, plen);
>                 else if (*p == '_')
>                 {
>                     /* If not enough text left to match the pattern, ABORT */
>                     if (tlen <= 0)
>                         return LIKE_ABORT;
>                     NextChar(t, tlen);
>                     NextByte(p, plen);
>                 }
>                 else
>                     break;        /* Reached a non-wildcard pattern char */
>             }

Could we use B-M-H to replace that piece of code?

How does the performance compare with regular expressions? Would it be 
possible to use this for trivial regular expressions too? Or could we 
speed up the regexp engine to take advantage of B-M-H, and use it for 
LIKE? Or something like that?

> I have measured the performance with the following query.

Setting up the B-M-H table adds some initialization overhead, so this 
would be a loss for cases where the LIKE is executed only once, and/or 
the haystack strings are very small. That's probably OK, the overhead is 
probably small, and those cases are probably not performance-critical. 
But would be nice to measure that too.

- Heikki



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Next
From: "Imseih (AWS), Sami"
Date:
Subject: Re: Add index scan progress to pg_stat_progress_vacuum