Thread: Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
From
"David Rowley"
Date:
<<<Moved from pgsql-patches>>> Tom Lane wrote: > I wrote: > > I looked this over a bit and was immediately confused by one thing: > > the introductory comment says that the skip table size ought to be based > > on the length of the haystack, which makes sense to me, but the code is > > actually initializing it on the basis of len2, ie, the length of the > > needle. Isn't that a bug? Was the same bug present in the tests you > > made to determine the best table sizes? > > BTW, to the extent that you feel like testing a different idea, > I would suggest: > > * don't bother initializing the skiptable when len1 < len2 > > * otherwise, choose its size based on len1 - len2, not just len1 or > len2. This is (one less than) the maximum number of search loop > consultations of the skip table that can happen, so it seems like a > plausible number, and better than either length alone. I've made the discussed changes. Also updated the benchmark results. http://www.unixbeast.com/~fat/8.3_test_v1.3.xls On re-benchmarking to determine the best skip table size, the changes to the logic didn't seem to affect the results enough to have to change the thresholds. But I'm sure it will perform a little better in cases like when both the needle and haystack are very long and close to being the same length. Having a big skip table in this case is a little bit of a waste as the search won't get to make much use of it.
Attachment
"David Rowley" <dgrowley@gmail.com> writes: > I've made the discussed changes. Also updated the benchmark results. > http://www.unixbeast.com/~fat/8.3_test_v1.3.xls Applied with revisions; mostly cosmetic except for one point. I realized after studying the code a bit more that B-M cannot possibly win for a single-character pattern (needle), since the skip distance must always be 1 in that case. The fact that it seemed to keep up at that length has to be because the original coding included a strncmp call inside the innermost loop, which likely prevents the compiler from optimizing that loop really tightly. But the strncmp wasn't doing anything anyway for the case of pattern length = 1. So what I committed special-cases pattern length 1 to be a naive search with a *very* tight inner loop. I think it's worth troubling over this case because a common usage is split_to_array and suchlike with single-character delimiters. regards, tom lane
Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
From
"David Rowley"
Date:
Tom Lane Wrote: > "David Rowley" <dgrowley@gmail.com> writes: > > I've made the discussed changes. Also updated the benchmark results. > > http://www.unixbeast.com/~fat/8.3_test_v1.3.xls > Applied with revisions; mostly cosmetic except for one point. I > realized after studying the code a bit more that B-M cannot possibly win > for a single-character pattern (needle), since the skip distance must > always be 1 in that case. The fact that it seemed to keep up at that > length has to be because the original coding included a strncmp call > inside the innermost loop, which likely prevents the compiler from > optimizing that loop really tightly. But the strncmp wasn't doing > anything anyway for the case of pattern length = 1. So what I committed > special-cases pattern length 1 to be a naive search with a *very* tight > inner loop. I think it's worth troubling over this case because a > common usage is split_to_array and suchlike with single-character > delimiters. I had thought about this, but failed to think about string_to_array. Probably worth the extra code for that. Well simple enough patch. I've learned quite a bit about the postgresql source even just for working in varlena.c. I've got a very long way to go though. Thanks for all the reviews and suggestions. David.
Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
From
Heikki Linnakangas
Date:
David Rowley wrote: > Thanks for all the reviews and suggestions. David, could you re-run the performance tests you ran earlier? I'm curious to know what impact switching to the simpler loop for 1-byte pattern had. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
From
"David Rowley"
Date:
Heikki Linnakangas wrote: > David Rowley wrote: > Thanks for all the reviews and suggestions. > David, could you re-run the performance tests you ran earlier? I'm > curious to know what impact switching to the simpler loop for 1-byte > pattern had. Sure: http://www.unixbeast.com/~fat/8.3_test_v1.4.xls I tested with 8.3.3, and applied the patch updated by tom. The thing that surprised me was that string_to_array didn't perform as well. I expected single character searches to perform a little better. I can't think why it would be slower now. The strpos() test for the single char needle is taking 1.2% longer. I guess that makes a little sense as we're now doing a new more length tests in this case, but then we've eliminated the final single strncmp() for once it finds that match. The strncmp would have check for nulls, check it's not exceeded the compare length and check the chars actually match. I figured this would have made up for that 1.2%. Seems not. David. -----Original Message----- From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com] Sent: 08 September 2008 08:50 To: David Rowley Cc: 'Tom Lane'; 'Peter Eisentraut'; pgsql-hackers@postgresql.org Subject: Re: [PATCHES] [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] TODO item: Implement Boyer-Moore searching (First time hacker)
From
Heikki Linnakangas
Date:
David Rowley wrote: > The thing that surprised me was that string_to_array didn't perform as well. > I expected single character searches to perform a little better. I can't > think why it would be slower now. Yes, that's strange. I tried to reproduce that here, with a CVS snapshot before the patch, and after. With quick testing with psql and \timing and the same query you had in that spreadsheet, I couldn't see that kind of performance degradation. Oprofile suggests that, on the contrary, slightly less time is spent in text_position_next() after the patch, and slightly more in text_position_setup(). Together they account for ~10% of CPU time in both tests, so a small difference there would be insignificant anyway in that test case. I think we're fine. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com