Thread: Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

From
"David Rowley"
Date:
<<<Moved from pgsql-patches>>>

Tom Lane wrote:
> I wrote:
> > I looked this over a bit and was immediately confused by one thing:
> > the introductory comment says that the skip table size ought to be based
> > on the length of the haystack, which makes sense to me, but the code is
> > actually initializing it on the basis of len2, ie, the length of the
> > needle.  Isn't that a bug?  Was the same bug present in the tests you
> > made to determine the best table sizes?
>
> BTW, to the extent that you feel like testing a different idea,
> I would suggest:
>
> * don't bother initializing the skiptable when len1 < len2
>
> * otherwise, choose its size based on len1 - len2, not just len1 or
> len2.  This is (one less than) the maximum number of search loop
> consultations of the skip table that can happen, so it seems like a
> plausible number, and better than either length alone.


I've made the discussed changes. Also updated the benchmark results.
http://www.unixbeast.com/~fat/8.3_test_v1.3.xls


On re-benchmarking to determine the best skip table size, the changes to the
logic didn't seem to affect the results enough to have to change the
thresholds. But I'm sure it will perform a little better in cases like when
both the needle and haystack are very long and close to being the same
length. Having a big skip table in this case is a little bit of a waste as
the search won't get to make much use of it.

Attachment

Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Tom Lane
Date:
"David Rowley" <dgrowley@gmail.com> writes:
> I've made the discussed changes. Also updated the benchmark results.
> http://www.unixbeast.com/~fat/8.3_test_v1.3.xls

Applied with revisions; mostly cosmetic except for one point.  I
realized after studying the code a bit more that B-M cannot possibly win
for a single-character pattern (needle), since the skip distance must
always be 1 in that case.  The fact that it seemed to keep up at that
length has to be because the original coding included a strncmp call
inside the innermost loop, which likely prevents the compiler from
optimizing that loop really tightly.  But the strncmp wasn't doing
anything anyway for the case of pattern length = 1.  So what I committed
special-cases pattern length 1 to be a naive search with a *very* tight
inner loop.  I think it's worth troubling over this case because a
common usage is split_to_array and suchlike with single-character
delimiters.
        regards, tom lane


Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

From
"David Rowley"
Date:
Tom Lane Wrote:
> "David Rowley" <dgrowley@gmail.com> writes:
> > I've made the discussed changes. Also updated the benchmark results.
> > http://www.unixbeast.com/~fat/8.3_test_v1.3.xls

> Applied with revisions; mostly cosmetic except for one point.  I
> realized after studying the code a bit more that B-M cannot possibly win
> for a single-character pattern (needle), since the skip distance must
> always be 1 in that case.  The fact that it seemed to keep up at that
> length has to be because the original coding included a strncmp call
> inside the innermost loop, which likely prevents the compiler from
> optimizing that loop really tightly.  But the strncmp wasn't doing
> anything anyway for the case of pattern length = 1. So what I committed
> special-cases pattern length 1 to be a naive search with a *very* tight
> inner loop.  I think it's worth troubling over this case because a
> common usage is split_to_array and suchlike with single-character
> delimiters.

I had thought about this, but failed to think about string_to_array.
Probably worth the extra code for that.

Well simple enough patch. I've learned quite a bit about the postgresql
source even just for working in varlena.c. I've got a very long way to go
though.

Thanks for all the reviews and suggestions.

David.






Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Heikki Linnakangas
Date:
David Rowley wrote:
> Thanks for all the reviews and suggestions.

David, could you re-run the performance tests you ran earlier? I'm 
curious to know what impact switching to the simpler loop for 1-byte 
pattern had.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

From
"David Rowley"
Date:
Heikki Linnakangas wrote:
> David Rowley wrote:
> Thanks for all the reviews and suggestions.

> David, could you re-run the performance tests you ran earlier? I'm
> curious to know what impact switching to the simpler loop for 1-byte
> pattern had.

Sure: http://www.unixbeast.com/~fat/8.3_test_v1.4.xls

I tested with 8.3.3, and applied the patch updated by tom.

The thing that surprised me was that string_to_array didn't perform as well.
I expected single character searches to perform a little better. I can't
think why it would be slower now.

The strpos() test for the single char needle is taking 1.2% longer. I guess
that makes a little sense as we're now doing a new more length tests in this
case, but then we've eliminated the final single strncmp() for once it finds
that match. The strncmp would have check for nulls, check it's not exceeded
the compare length and check the chars actually match. I figured this would
have made up for that 1.2%. Seems not.

David.



-----Original Message-----
From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
Sent: 08 September 2008 08:50
To: David Rowley
Cc: 'Tom Lane'; 'Peter Eisentraut'; pgsql-hackers@postgresql.org
Subject: Re: [PATCHES] [HACKERS] TODO item: Implement Boyer-Moore searching
(First time hacker)


--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)

From
Heikki Linnakangas
Date:
David Rowley wrote:
> The thing that surprised me was that string_to_array didn't perform as well.
> I expected single character searches to perform a little better. I can't
> think why it would be slower now.

Yes, that's strange. I tried to reproduce that here, with a CVS snapshot 
before the patch, and after. With quick testing with psql and \timing 
and the same query you had in that spreadsheet, I couldn't see that kind 
of performance degradation. Oprofile suggests that, on the contrary, 
slightly less time is spent in text_position_next() after the patch, and 
slightly more in text_position_setup(). Together they account for ~10% 
of CPU time in both tests, so a small difference there would be 
insignificant anyway in that test case.

I think we're fine.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com