Home > mailing lists

Thread: Boyer-Moore string searching in LIKE, ILIKE, and/or POSITION?

Boyer-Moore string searching in LIKE, ILIKE, and/or POSITION?

From

"Martin L. Buchanan"

Date:

09 February 2023, 03:04:20

In the PostgreSQL Todo wiki, Boyer-Moore string searching for LIKE is mentioned as an outstanding item.

For the common and simple cases of find this string anywhere in another string:

str LIKE '%foo%'

str ILIKE '%foo%'

position('foo' in str) > 0

Is Boyer-Moore string searching now used by any of these three?

I checked the PG documentation and found no info about this other than what was in the Todo wiki, https://wiki.postgresql.org/wiki/Todo, under Functions. Tom Lane gave a thumbs down to the idea back in 2008, but that was a long time ago: https://www.postgresql.org/message-id/27645.1220635769@sss.pgh.pa.us .

Sincerely,

Martin L Buchanan

senior software engineer

Laramie, WY, USA

Re: Boyer-Moore string searching in LIKE, ILIKE, and/or POSITION?

From

David Rowley

Date:

09 February 2023, 04:30:00

On Thu, 9 Feb 2023 at 13:05, Martin L. Buchanan
<martinlbuchanan@gmail.com> wrote:
> For the common and simple cases of find this string anywhere in another string:
>
> str LIKE '%foo%'
>
> str ILIKE  '%foo%'
>
> position('foo' in str) > 0
>
> Is Boyer-Moore string searching now used by any of these three?

We use a sort of "lossy" Boyer-Moore-Horspool algorithm. See
text_position_setup() and text_position_next() in varlena.c (the lossy
part comes from the skiptables possibly sharing the same entry for
multiple characters as defined by what skiptablemask is set to).

> I checked the PG documentation and found no info about this other than what was in the Todo wiki,
https://wiki.postgresql.org/wiki/Todo,under Functions. Tom Lane gave a thumbs down to the idea back in 2008, but that
wasa long time ago: https://www.postgresql.org/message-id/27645.1220635769@sss.pgh.pa.us . 

Tom's argument seems to think it's impossible, so if you find that
it's definitely not impossible, then you can assume he's wrong about
that.  I've not given it much thought, but I think providing there are
only leading and trailing %'s and no _'s that the LIKE would be
equivalent to checking if position(like_arg2 in like_arg1).

There are probably 2 ways that you could consider going about making this work:

Method 1: See if it's possible to rewrite the query to swap the LIKE
for position().

I'm really unsure where you'd want to put the logic that looks for
such patterns. I don't recall any code that swaps one operator out for
another.  It seems like you could only apply such a transformation if
the like pattern was a Const.  I'm not sure if we'd want to do
anything like introduce some special logic in oper() in parse_oper.c
to do such a rewrite. Given we allow users to define their own
datatypes and operators, it would seem a bit limiting to go and hard
code this transformation. I'm really not sure what it would take to
make such a thing expandable. Maybe some kind of rewrite function
could be defined in pg_operator that can try and fail to come up with
some more efficient alternative OpExpr.

You'd also need to consider that conditions such as: str LIKE 'foo%'
can use an index scan in some cases. If you were to rewrite in that
case you'd likely kill the performance of queries that could exploit
that feature.

Method 2: Adjust like_match.c to look for these patterns and use BMH
when possible.

The problem with this is that you'd need to check the pattern each
call to see if it's compatible. Since you need to check for %'s and
_'s in the middle of the string, then that means adding a complete
preprocessing step that would need to be done on every call to
MatchText(). If that passes then it's maybe going to be faster, but if
it fails then it'll certainly be slower. You might very well struggle
to convince people that this would be a good idea. It sounds like it
would be very easy to demonstrate performance regressions here just by
passing a pattern that always fails that preprocess step. Getting
stuff committed that causes performance regressions in some cases and
wins in other cases can be a pretty difficult and frustrating process.
You have to remember, even if you think the slowdown is some corner
case that only applies ~1% of the time, for some users in the real
world, that might be 100% of their queries.

There are probably other ways you could consider doing this, I just
can't think of them right now.

David

Re: Boyer-Moore string searching in LIKE, ILIKE, and/or POSITION?

From

Tom Lane

Date:

09 February 2023, 04:49:43

David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 9 Feb 2023 at 13:05, Martin L. Buchanan
> <martinlbuchanan@gmail.com> wrote:
>> str LIKE '%foo%'
>> str ILIKE  '%foo%'
>> position('foo' in str) > 0
>> Is Boyer-Moore string searching now used by any of these three?

> We use a sort of "lossy" Boyer-Moore-Horspool algorithm. See
> text_position_setup() and text_position_next() in varlena.c (the lossy
> part comes from the skiptables possibly sharing the same entry for
> multiple characters as defined by what skiptablemask is set to).

Note that that's used only by the functions in that file; thus,
position() yes, but (I)LIKE no.

> Tom's argument seems to think it's impossible, so if you find that
> it's definitely not impossible, then you can assume he's wrong about
> that.

My point was that it seems like you'd need a separate BMH engine for
each %-separated segment of the LIKE pattern.  I'm not quite clear on
whether BMH can handle '_' (single-char wildcard) conveniently by
itself, although my gut feel is that you can probably make that part
work.  Maybe you can even extend the idea to embedded %, but that
seems more difficult.

Given that the fixed substrings of LIKE patterns are usually rather
short, I'm unconvinced that BMH would buy much compared to its
setup costs.  But hey, prove me wrong.

(One way to amortize the setup costs could be to cache pre-built
BMH data structures keyed by the pattern strings, in a similar fashion
to what utils/adt/regexp.c does for compiled regular expressions.)

> ... Getting
> stuff committed that causes performance regressions in some cases and
> wins in other cases can be a pretty difficult and frustrating process.
> You have to remember, even if you think the slowdown is some corner
> case that only applies ~1% of the time, for some users in the real
> world, that might be 100% of their queries.

Yeah, whatever the shape of the preprocessing might be, it seems
likely that it would be a net loss in some use-cases.  We do manage
to get past that --- the position() code didn't have BMH to start
with --- but it definitely requires solid evidence.

            regards, tom lane

Re: Boyer-Moore string searching in LIKE, ILIKE, and/or POSITION?

From

David Rowley

Date:

09 February 2023, 05:06:34

On Thu, 9 Feb 2023 at 14:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > Tom's argument seems to think it's impossible, so if you find that
> > it's definitely not impossible, then you can assume he's wrong about
> > that.
>
> My point was that it seems like you'd need a separate BMH engine for
> each %-separated segment of the LIKE pattern.  I'm not quite clear on
> whether BMH can handle '_' (single-char wildcard) conveniently by
> itself, although my gut feel is that you can probably make that part
> work.  Maybe you can even extend the idea to embedded %, but that
> seems more difficult.

Yeah, I think to make it work with more complex patterns like
'%some%string%' or '%some_string%' you'd need to break that into
multiple independent searches for each portion between a wildcard
character.  For the former pattern, you'd need to do some final check
that ensures that the 2nd pattern was found in some position >= the
position of the 1st pattern + its length.  For the latter, the final
check would need to validate that the 2nd pattern was found at a
position of the first pattern + its length + 1. It's probably also
possible to make those patterns work when they don't contain the
leading and trailing % by checking that the first pattern is found at
position 0 and the end of the 2nd pattern is found at the end of the
search string.

However, I imagine going to the trouble of trying to make it work for
more complex patterns initially would be a bad idea.  I imagine there
are just too many cases where we could demonstrate performance
regressions and that would cause us to reject the patch.

David

Re: Boyer-Moore string searching in LIKE, ILIKE, and/or POSITION?

From

Vladimir Sitnikov

Date:

09 February 2023, 08:41:46

Here's an interesting read on regex improvements in dot net 7

See "Goodbye, Boyer-Moore" where they drop Boyer-Moore and replace it with vectorized search:

https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/#vectorization

Vladimir