Thread: Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
From
"David Rowley"
Date:
I wrote: > I look forward to receiving feedback on this. Peter Eisentraut wrote: > Please send a patch in diff -c format. And don't put a single patch > file in a tar file. Thanks for the pointer. I'm quite new to this. I've done some more revisions to the patch. This has mostly just involved tuning the skip table size based on the size of the search. This has basically involved lots of benchmarks with different strings and calculating the best size of table to use. The reason for this is to maintain fast searches for smaller strings. The overhead of initialising a 256 element array would probably out weigh the cost of the search if this were not done. The size of the skip table increases with longer strings, or rather the size that is utilised. Performance: For smaller searches performance of the patch and existing version are very similar. The patched version starts to out perform the existing version when the needle and haystack become larger. The patch wins hands down with searches that leads the existing function in to dead ends, for example: SELECT STRPOS('A AA AAA AAAA AAAAA AAAAAA AAAAAAA','AAAAAAA'); When searching for very small strings, say just a single character in a sentence the existing function marginally beats the patched version. Outside of Postgres I've done benchmarks where I've searched for every combination of the search string in the search string. Like: test | t | test | te | test | tes | test | test | test | e | test | es | test | est | test | s | test | st | test | t | I felt this was fair for both versions. The patched version beat the unpatched version. The test I carried out was a string of 934 characters. I can upload more benchmarks if required. I'm quite happy with the patch now so I'm going to submit it (in diff -c format this time :) David.
Attachment
Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
From
Heikki Linnakangas
Date:
After reading the wikipedia article on Boyer-Moore search algorithm, it looks to me like this patch actually implements the simpler Boyer-Moore-Horspool algorithm that only uses one lookup table. That's probably fine, as it ought to be faster on small needles and haystacks because it requires less effort to build the lookup tables, even though the worst-case performance is worse. It should still be faster than what we have now. The skip table really should be constructed only once in text_position_start and stored in TextPositionState. That would make a big difference to the performance of those functions that call text_position_next repeatedly: replace_text, split_text and text_to_array. David Rowley wrote: > I've done some more revisions to the patch. This has mostly just involved > tuning the skip table size based on the size of the search. This has > basically involved lots of benchmarks with different strings and calculating > the best size of table to use. The reason for this is to maintain fast > searches for smaller strings. The overhead of initialising a 256 element > array would probably out weigh the cost of the search if this were not done. > The size of the skip table increases with longer strings, or rather the size > that is utilised. > > Performance: > For smaller searches performance of the patch and existing version are very > similar. The patched version starts to out perform the existing version when > the needle and haystack become larger. > > The patch wins hands down with searches that leads the existing function in > to dead ends, for example: > > SELECT STRPOS('A AA AAA AAAA AAAAA AAAAAA AAAAAAA','AAAAAAA'); > > When searching for very small strings, say just a single character in a > sentence the existing function marginally beats the patched version. > > Outside of Postgres I've done benchmarks where I've searched for every > combination of the search string in the search string. Like: > > test | t | > test | te | > test | tes | > test | test | > test | e | > test | es | > test | est | > test | s | > test | st | > test | t | > > > I felt this was fair for both versions. The patched version beat the > unpatched version. The test I carried out was a string of 934 characters. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > After reading the wikipedia article on Boyer-Moore search algorithm, it > looks to me like this patch actually implements the simpler > Boyer-Moore-Horspool algorithm that only uses one lookup table. That's > probably fine, as it ought to be faster on small needles and haystacks > because it requires less effort to build the lookup tables, even though > the worst-case performance is worse. It should still be faster than what > we have now. Hmm. B-M-H has worst case search speed O(M*N) (where M = length of pattern, N = length of search string); whereas full B-M is O(N). Maybe we should build the second table when M is large? Although realistically that is probably gilding the lily, since frankly there haven't been many real-world complaints about the speed of these functions anyway ... > The skip table really should be constructed only once in > text_position_start and stored in TextPositionState. That would make a > big difference to the performance of those functions that call > text_position_next repeatedly: replace_text, split_text and text_to_array. +1. The heuristic about big a skip table to make may need some adjustment as well, since it seems to be considering only a single search. regards, tom lane
Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
From
"David Rowley"
Date:
Heikki Linnakangas wrote: > After reading the wikipedia article on Boyer-Moore search algorithm, it > looks to me like this patch actually implements the simpler > Boyer-Moore-Horspool algorithm that only uses one lookup table. That's > probably fine, as it ought to be faster on small needles and haystacks > because it requires less effort to build the lookup tables, even though > the worst-case performance is worse. It should still be faster than what > we have now. Yes, correct, I really didn't want to slow down smaller searches. This method seemed to fit the bill better. What I didn't realise is that this method also had a name. Tom Lane wrote: > Hmm. B-M-H has worst case search speed O(M*N) (where M = length of > pattern, N = length of search string); whereas full B-M is O(N). Maybe we > should build the second table when M is large? I'll look into this. If it pays off for longer searches I'll submit a patch. I won't have the time until after the 15th, so perhaps that's in November's commit fest? > The skip table really should be constructed only once in > text_position_start and stored in TextPositionState. That would make a > big difference to the performance of those functions that call > text_position_next repeatedly: replace_text, split_text and text_to_array. Of course you are right. That will help for replace and the like. I'll update the patch tonight. Thanks both for the feedback. David.
"David Rowley" <dgrowley@gmail.com> writes: > Tom Lane wrote: >> Hmm. B-M-H has worst case search speed O(M*N) (where M = length of >> pattern, N = length of search string); whereas full B-M is O(N). Maybe we >> should build the second table when M is large? > I'll look into this. If it pays off for longer searches I'll submit a patch. > I won't have the time until after the 15th, so perhaps that's in November's > commit fest? Yes. Let's get B-M-H in during this fest and then you can look into whether a follow-on patch is worthwhile. regards, tom lane
Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
From
Heikki Linnakangas
Date:
Tom Lane wrote: > "David Rowley" <dgrowley@gmail.com> writes: >> Tom Lane wrote: >>> Hmm. B-M-H has worst case search speed O(M*N) (where M = length of >>> pattern, N = length of search string); whereas full B-M is O(N). Maybe we >>> should build the second table when M is large? > >> I'll look into this. If it pays off for longer searches I'll submit a patch. >> I won't have the time until after the 15th, so perhaps that's in November's >> commit fest? > > Yes. Let's get B-M-H in during this fest and then you can look into > whether a follow-on patch is worthwhile. Agreed. Also, it would be nice to use B-M(-H) for LIKE as well. That's a different codepath, but probably more widely used than textpos and frients. I haven't looked how feasible it would be to do, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Also, it would be nice to use B-M(-H) for LIKE as well. Right offhand, that seems impossible, at least in patterns with %. Or were you thinking of trying to separate out the fixed substrings of a pattern and search for them with BMH? Anyway, it's not material for this patch, since it'd involve pretty fundamental redesign of the LIKE code. regards, tom lane
Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
From
"David Rowley"
Date:
Heikki Linnakangas wrote: > The skip table really should be constructed only once in > text_position_start and stored in TextPositionState. That would make a > big difference to the performance of those functions that call > text_position_next repeatedly: replace_text, split_text and text_to_array. I Wrote: > Of course you are right. That will help for replace and the like. I'll > update the patch tonight. I've made and attached the changes Heikki recommended. Also updated benchmark spreadsheet. Here -> http://www.unixbeast.com/~fat/8.3_test_v1.2.xls Previously there was an error with "Test 6", the test that benchmarked replace(). I kept this one so not to affect the summary result in the new sheet. I then added the sheet "Replace Test" to show more accurate results. I had failed to notice that the optimizer was helping me out more than I wanted it to. My tested replace() script runs in 91% of the time than the 8.3 version. I've not tested with the CVS head. Now that the skip table is a member of TextPositionState, I was not quite sure if I should #define a size for it. It would certainly look neater, only the code that defines the skip table size in text_position_start assumes 256 elements. Any thoughts on this? David.
Attachment
"David Rowley" <dgrowley@gmail.com> writes: > I've made and attached the changes Heikki recommended. I looked this over a bit and was immediately confused by one thing: the introductory comment says that the skip table size ought to be based on the length of the haystack, which makes sense to me, but the code is actually initializing it on the basis of len2, ie, the length of the needle. Isn't that a bug? Was the same bug present in the tests you made to determine the best table sizes? Stylistic issues: I really dislike having two copies of the heuristic size-setting code. I think it's worth introducing a second test of use_wchar in order to arrange text_position_setup like this: ... existing code ... choose skiptablesize initialize skip table (this loop doesn't depend on char width) if (!state->use_wchar) process char needle else process wide-char needle Also it might be worth having local-variable copies of skiptablesize and len2 for this initialization loop: for (ai = 0; ai <= state->skiptablesize; ai++) state->skiptable[ai] = state->len2; I'm not at all sure whether a C compiler will think that it's allowed to avoid fetching these values again on each iteration, seeing that the loop is storing into other integer members of *state. Given that this loop is exactly the overhead we're trying to control by adjusting skiptablesize, making it as tight as possible seems worthwhile. > Now that the skip table is a member of TextPositionState, I was not quite > sure if I should #define a size for it. It would certainly look neater, only > the code that defines the skip table size in text_position_start assumes 256 > elements. I tend to agree that a #define is pointless there, since there's no easy way to make the relevant code do anything with it. It would be worthwhile to point out adjacent to the code what the max length is, though. I had gotten as far as rewriting your introductory comment like this /* * Prepare the skip table for Boyer-Moore-Horspool searching. First we * must determine how big a skip table to use. The declaration of * TextPositionState allows up to 256 elements, but with haystack lengths * of only a few characters we don't really want to have to initialize so * many elements --- it would take too long in comparison to the actual * search time. So we choose a useful skip table size based on the * haystack length. before noticing that what I was writing wasn't what the code actually did. Another point that doesn't seem to be included in your comments is that the skip table size *must* be a power of 2 because we use bit masking. (In fact state->skiptablesize might better be named skiptablemask since it's really 1 less than the table size.) BTW, try to avoid introducing so much random vertical space. IMHO it does little for readability and much to make routines too long to fit in one screenful. Generally the PG code doesn't use double blank lines anywhere inside a function, nor blank lines before a right brace line. pg_indent will clean this up to some extent, but it would be better if the code looked more like the project style in the first place. regards, tom lane
I wrote: > I looked this over a bit and was immediately confused by one thing: > the introductory comment says that the skip table size ought to be based > on the length of the haystack, which makes sense to me, but the code is > actually initializing it on the basis of len2, ie, the length of the > needle. Isn't that a bug? Was the same bug present in the tests you > made to determine the best table sizes? BTW, to the extent that you feel like testing a different idea, I would suggest: * don't bother initializing the skiptable when len1 < len2 * otherwise, choose its size based on len1 - len2, not just len1 or len2. This is (one less than) the maximum number of search loop consultations of the skip table that can happen, so it seems like a plausible number, and better than either length alone. regards, tom lane
Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
From
"David Rowley"
Date:
Tom Lane wrote: > I looked this over a bit and was immediately confused by one thing: > the introductory comment says that the skip table size ought to be based > on the length of the haystack, which makes sense to me, but the code is > actually initializing it on the basis of len2, ie, the length of the > needle. Isn't that a bug? Was the same bug present in the tests you > made to determine the best table sizes? Yes Bug. That's what I get for making last minute changes. It didn't affect the benchmark, that was done before moving the code into postgresql for testing. The function I tested with had an extra parameter to set the skip table size. Each possible size was tested and the best time was saved along with the best size. > BTW, to the extent that you feel like testing a different idea, > I would suggest: > * don't bother initializing the skiptable when len1 < len2 Good plan. > * otherwise, choose its size based on len1 - len2, not just len1 or > len2. This is (one less than) the maximum number of search loop > consultations of the skip table that can happen, so it seems like a > plausible number, and better than either length alone. That seems like a better idea. I had considered len1 * len2, giving that's the worst case for BMH. Of course the lengths would need to be raised quite a bit. I'll go with len1 - len2 I'll make the above changes and then run my benchmark again to see if the skip table size logic should be updated. I'll also benchmark and update the benchmark spreadsheet to see what's changed. David.
Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
From
Heikki Linnakangas
Date:
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Also, it would be nice to use B-M(-H) for LIKE as well. > > Right offhand, that seems impossible, at least in patterns with %. > Or were you thinking of trying to separate out the fixed substrings > of a pattern and search for them with BMH? Yep, something like that. Even if it only handled the special case of '%foobar%', that would be nice, because that's a pretty common special case. > Anyway, it's not material for this patch, since it'd involve pretty > fundamental redesign of the LIKE code. Yep. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] TODO item: Implement Boyer-Moore searching (First time hacker)
From
"David Rowley"
Date:
Heikki Linnakangas Wrote: > Tom Lane wrote: > > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >> Also, it would be nice to use B-M(-H) for LIKE as well. > > > > Right offhand, that seems impossible, at least in patterns with %. > > Or were you thinking of trying to separate out the fixed substrings > > of a pattern and search for them with BMH? > Yep, something like that. Even if it only handled the special case of > '%foobar%', that would be nice, because that's a pretty common special > case. It would be a quick test to check for % at either end. But we'd also need to ensure there were none in the middle. That might slow it down. I guess while looping over the inner chars checking for more %'s we could be building a skiptable. We'd have to abort it if we found any %'s of course I think with out giving it much thought _'s could be handled by BMH, these could just be given a skip distance of 2. Only having a lossy skip table throws that out the window without having a special check for _ David.