Thread: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
"David Rowley"
Date:
I wrote:
> I look forward to receiving feedback on this.

Peter Eisentraut wrote:
> Please send a patch in diff -c format.  And don't put a single patch
> file in a tar file.

Thanks for the pointer. I'm quite new to this.


I've done some more revisions to the patch. This has mostly just involved
tuning the skip table size based on the size of the search. This has
basically involved lots of benchmarks with different strings and calculating
the best size of table to use. The reason for this is to maintain fast
searches for smaller strings. The overhead of initialising a 256 element
array would probably out weigh the cost of the search if this were not done.
The size of the skip table increases with longer strings, or rather the size
that is utilised.

Performance:
For smaller searches performance of the patch and existing version are very
similar. The patched version starts to out perform the existing version when
the needle and haystack become larger.

The patch wins hands down with searches that leads the existing function in
to dead ends, for example:

SELECT STRPOS('A AA AAA AAAA AAAAA AAAAAA AAAAAAA','AAAAAAA');

When searching for very small strings, say just a single character in a
sentence the existing function marginally beats the patched version.

Outside of Postgres I've done benchmarks where I've searched for every
combination of the search string in the search string. Like:

test     | t      |
test     | te     |
test     | tes    |
test     | test   |
test     | e      |
test     | es     |
test     | est    |
test     | s      |
test     | st     |
test     | t      |


I felt this was fair for both versions. The patched version beat the
unpatched version. The test I carried out was a string of 934 characters.

I can upload more benchmarks if required.

I'm quite happy with the patch now so I'm going to submit it (in diff -c
format this time :)

David.




Attachment

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Heikki Linnakangas
Date:
After reading the wikipedia article on Boyer-Moore search algorithm, it
looks to me like this patch actually implements the simpler
Boyer-Moore-Horspool algorithm that only uses one lookup table. That's
probably fine, as it ought to be faster on small needles and haystacks
because it requires less effort to build the lookup tables, even though
the worst-case performance is worse. It should still be faster than what
we have now.

The skip table really should be constructed only once in
text_position_start and stored in TextPositionState. That would make a
big difference to the performance of those functions that call
text_position_next repeatedly: replace_text, split_text and text_to_array.

David Rowley wrote:
> I've done some more revisions to the patch. This has mostly just involved
> tuning the skip table size based on the size of the search. This has
> basically involved lots of benchmarks with different strings and calculating
> the best size of table to use. The reason for this is to maintain fast
> searches for smaller strings. The overhead of initialising a 256 element
> array would probably out weigh the cost of the search if this were not done.
> The size of the skip table increases with longer strings, or rather the size
> that is utilised.
>
> Performance:
> For smaller searches performance of the patch and existing version are very
> similar. The patched version starts to out perform the existing version when
> the needle and haystack become larger.
>
> The patch wins hands down with searches that leads the existing function in
> to dead ends, for example:
>
> SELECT STRPOS('A AA AAA AAAA AAAAA AAAAAA AAAAAAA','AAAAAAA');
>
> When searching for very small strings, say just a single character in a
> sentence the existing function marginally beats the patched version.
>
> Outside of Postgres I've done benchmarks where I've searched for every
> combination of the search string in the search string. Like:
>
> test     | t      |
> test     | te     |
> test     | tes    |
> test     | test   |
> test     | e      |
> test     | es     |
> test     | est    |
> test     | s      |
> test     | st     |
> test     | t      |
>
>
> I felt this was fair for both versions. The patched version beat the
> unpatched version. The test I carried out was a string of 934 characters.


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> After reading the wikipedia article on Boyer-Moore search algorithm, it
> looks to me like this patch actually implements the simpler
> Boyer-Moore-Horspool algorithm that only uses one lookup table. That's
> probably fine, as it ought to be faster on small needles and haystacks
> because it requires less effort to build the lookup tables, even though
> the worst-case performance is worse. It should still be faster than what
> we have now.

Hmm.  B-M-H has worst case search speed O(M*N) (where M = length of
pattern, N = length of search string); whereas full B-M is O(N).
Maybe we should build the second table when M is large?  Although
realistically that is probably gilding the lily, since frankly there
haven't been many real-world complaints about the speed of these
functions anyway ...

> The skip table really should be constructed only once in
> text_position_start and stored in TextPositionState. That would make a
> big difference to the performance of those functions that call
> text_position_next repeatedly: replace_text, split_text and text_to_array.

+1.  The heuristic about big a skip table to make may need some
adjustment as well, since it seems to be considering only a single
search.

            regards, tom lane

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
"David Rowley"
Date:
Heikki Linnakangas wrote:
> After reading the wikipedia article on Boyer-Moore search algorithm, it
> looks to me like this patch actually implements the simpler
> Boyer-Moore-Horspool algorithm that only uses one lookup table. That's
> probably fine, as it ought to be faster on small needles and haystacks
> because it requires less effort to build the lookup tables, even though
> the worst-case performance is worse. It should still be faster than what
> we have now.

Yes, correct, I really didn't want to slow down smaller searches. This
method seemed to fit the bill better. What I didn't realise is that this
method also had a name.

Tom Lane wrote:
> Hmm.  B-M-H has worst case search speed O(M*N) (where M = length of
> pattern, N = length of search string); whereas full B-M is O(N). Maybe we
> should build the second table when M is large?

I'll look into this. If it pays off for longer searches I'll submit a patch.
I won't have the time until after the 15th, so perhaps that's in November's
commit fest?

> The skip table really should be constructed only once in
> text_position_start and stored in TextPositionState. That would make a
> big difference to the performance of those functions that call
> text_position_next repeatedly: replace_text, split_text and text_to_array.

Of course you are right. That will help for replace and the like. I'll
update the patch tonight.

Thanks both for the feedback.

David.



Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Tom Lane
Date:
"David Rowley" <dgrowley@gmail.com> writes:
> Tom Lane wrote:
>> Hmm.  B-M-H has worst case search speed O(M*N) (where M = length of
>> pattern, N = length of search string); whereas full B-M is O(N). Maybe we
>> should build the second table when M is large?

> I'll look into this. If it pays off for longer searches I'll submit a patch.
> I won't have the time until after the 15th, so perhaps that's in November's
> commit fest?

Yes.  Let's get B-M-H in during this fest and then you can look into
whether a follow-on patch is worthwhile.

            regards, tom lane

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> "David Rowley" <dgrowley@gmail.com> writes:
>> Tom Lane wrote:
>>> Hmm.  B-M-H has worst case search speed O(M*N) (where M = length of
>>> pattern, N = length of search string); whereas full B-M is O(N). Maybe we
>>> should build the second table when M is large?
>
>> I'll look into this. If it pays off for longer searches I'll submit a patch.
>> I won't have the time until after the 15th, so perhaps that's in November's
>> commit fest?
>
> Yes.  Let's get B-M-H in during this fest and then you can look into
> whether a follow-on patch is worthwhile.

Agreed.

Also, it would be nice to use B-M(-H) for LIKE as well. That's a
different codepath, but probably more widely used than textpos and
frients. I haven't looked how feasible it would be to do, though.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Also, it would be nice to use B-M(-H) for LIKE as well.

Right offhand, that seems impossible, at least in patterns with %.
Or were you thinking of trying to separate out the fixed substrings
of a pattern and search for them with BMH?

Anyway, it's not material for this patch, since it'd involve pretty
fundamental redesign of the LIKE code.

            regards, tom lane

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
"David Rowley"
Date:
Heikki Linnakangas wrote:
> The skip table really should be constructed only once in
> text_position_start and stored in TextPositionState. That would make a
> big difference to the performance of those functions that call
> text_position_next repeatedly: replace_text, split_text and text_to_array.

I Wrote:
> Of course you are right. That will help for replace and the like. I'll
> update the patch tonight.

I've made and attached the changes Heikki recommended.
Also updated benchmark spreadsheet. Here ->
http://www.unixbeast.com/~fat/8.3_test_v1.2.xls

Previously there was an error with "Test 6", the test that benchmarked
replace(). I kept this one so not to affect the summary result in the new
sheet. I then added the sheet "Replace Test" to show more accurate results.
I had failed to notice that the optimizer was helping me out more than I
wanted it to.

My tested replace() script runs in 91% of the time than the 8.3 version.
I've not tested with the CVS head.

Now that the skip table is a member of TextPositionState, I was not quite
sure if I should #define a size for it. It would certainly look neater, only
the code that defines the skip table size in text_position_start assumes 256
elements.

Any thoughts on this?

David.

Attachment

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Tom Lane
Date:
"David Rowley" <dgrowley@gmail.com> writes:
> I've made and attached the changes Heikki recommended.

I looked this over a bit and was immediately confused by one thing:
the introductory comment says that the skip table size ought to be based
on the length of the haystack, which makes sense to me, but the code is
actually initializing it on the basis of len2, ie, the length of the
needle.  Isn't that a bug?  Was the same bug present in the tests you
made to determine the best table sizes?

Stylistic issues:

I really dislike having two copies of the heuristic size-setting code.
I think it's worth introducing a second test of use_wchar in order to
arrange text_position_setup like this:

    ... existing code ...

    choose skiptablesize

    initialize skip table (this loop doesn't depend on char width)

    if (!state->use_wchar)
        process char needle
    else
        process wide-char needle

Also it might be worth having local-variable copies of skiptablesize and
len2 for this initialization loop:

    for (ai = 0; ai <= state->skiptablesize; ai++)
        state->skiptable[ai] = state->len2;

I'm not at all sure whether a C compiler will think that it's allowed to
avoid fetching these values again on each iteration, seeing that the
loop is storing into other integer members of *state.  Given that this
loop is exactly the overhead we're trying to control by adjusting
skiptablesize, making it as tight as possible seems worthwhile.

> Now that the skip table is a member of TextPositionState, I was not quite
> sure if I should #define a size for it. It would certainly look neater, only
> the code that defines the skip table size in text_position_start assumes 256
> elements.

I tend to agree that a #define is pointless there, since there's no easy
way to make the relevant code do anything with it.  It would be
worthwhile to point out adjacent to the code what the max length is,
though.  I had gotten as far as rewriting your introductory comment like
this

    /*
     * Prepare the skip table for Boyer-Moore-Horspool searching.  First we
     * must determine how big a skip table to use.  The declaration of
     * TextPositionState allows up to 256 elements, but with haystack lengths
     * of only a few characters we don't really want to have to initialize so
     * many elements --- it would take too long in comparison to the actual
     * search time.  So we choose a useful skip table size based on the
     * haystack length.

before noticing that what I was writing wasn't what the code actually
did.  Another point that doesn't seem to be included in your comments is
that the skip table size *must* be a power of 2 because we use bit
masking.  (In fact state->skiptablesize might better be named skiptablemask
since it's really 1 less than the table size.)

BTW, try to avoid introducing so much random vertical space.  IMHO it
does little for readability and much to make routines too long to fit in
one screenful.  Generally the PG code doesn't use double blank lines
anywhere inside a function, nor blank lines before a right brace line.
pg_indent will clean this up to some extent, but it would be better if
the code looked more like the project style in the first place.

            regards, tom lane

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Tom Lane
Date:
I wrote:
> I looked this over a bit and was immediately confused by one thing:
> the introductory comment says that the skip table size ought to be based
> on the length of the haystack, which makes sense to me, but the code is
> actually initializing it on the basis of len2, ie, the length of the
> needle.  Isn't that a bug?  Was the same bug present in the tests you
> made to determine the best table sizes?

BTW, to the extent that you feel like testing a different idea,
I would suggest:

* don't bother initializing the skiptable when len1 < len2

* otherwise, choose its size based on len1 - len2, not just len1 or
len2.  This is (one less than) the maximum number of search loop
consultations of the skip table that can happen, so it seems like a
plausible number, and better than either length alone.

            regards, tom lane

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
"David Rowley"
Date:
Tom Lane wrote:
> I looked this over a bit and was immediately confused by one thing:
> the introductory comment says that the skip table size ought to be based
> on the length of the haystack, which makes sense to me, but the code is
> actually initializing it on the basis of len2, ie, the length of the
> needle.  Isn't that a bug?  Was the same bug present in the tests you
> made to determine the best table sizes?

Yes Bug. That's what I get for making last minute changes. It didn't affect
the benchmark, that was done before moving the code into postgresql for
testing. The function I tested with had an extra parameter to set the skip
table size. Each possible size was tested and the best time was saved along
with the best size.

> BTW, to the extent that you feel like testing a different idea,
> I would suggest:

> * don't bother initializing the skiptable when len1 < len2

Good plan.

> * otherwise, choose its size based on len1 - len2, not just len1 or
> len2.  This is (one less than) the maximum number of search loop
> consultations of the skip table that can happen, so it seems like a
> plausible number, and better than either length alone.

That seems like a better idea. I had considered len1 * len2, giving that's
the worst case for BMH. Of course the lengths would need to be raised quite
a bit. I'll go with len1 - len2

 I'll make the above changes and then run my benchmark again to see if the
skip table size logic should be updated.

I'll also benchmark and update the benchmark spreadsheet to see what's
changed.

David.



Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Also, it would be nice to use B-M(-H) for LIKE as well.
>
> Right offhand, that seems impossible, at least in patterns with %.
> Or were you thinking of trying to separate out the fixed substrings
> of a pattern and search for them with BMH?

Yep, something like that. Even if it only handled the special case of
'%foobar%', that would be nice, because that's a pretty common special case.

> Anyway, it's not material for this patch, since it'd involve pretty
> fundamental redesign of the LIKE code.

Yep.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)

From
"David Rowley"
Date:
Heikki Linnakangas Wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> >> Also, it would be nice to use B-M(-H) for LIKE as well.
> >
> > Right offhand, that seems impossible, at least in patterns with %.
> > Or were you thinking of trying to separate out the fixed substrings
> > of a pattern and search for them with BMH?

> Yep, something like that. Even if it only handled the special case of
> '%foobar%', that would be nice, because that's a pretty common special
> case.

It would be a quick test to check for % at either end. But we'd also need to
ensure there were none in the middle. That might slow it down. I guess while
looping over the inner chars checking for more %'s we could be building a
skiptable. We'd have to abort it if we found any %'s of course

I think with out giving it much thought _'s could be handled by BMH, these
could just be given a skip distance of 2. Only having a lossy skip table
throws that out the window without having a special check for _

David.