Re: strpos() && KMP - Mailing list pgsql-patches

From Tom Lane
Subject Re: strpos() && KMP
Date
Msg-id 12142.1186626260@sss.pgh.pa.us
Whole thread Raw
In response to Re: strpos() && KMP  (Pavel Ajtkulov <ajtkulov@acm.org>)
Responses Re: strpos() && KMP  (Pavel Ajtkulov <ajtkulov@acm.org>)
List pgsql-patches
Pavel Ajtkulov <ajtkulov@acm.org> writes:
> Tom Lane writes:
>> The difficulty with B-M is the need for a table indexed by character
>> code, which at first glance looks impractical for wchars.  But it seems
>> to me that we could use "wchar % 256" as the table index, meaning that
>> wchars with the same trailing byte share the same table entry.

> hash table?

I'd think the cost of hashing would render it impractical.  Most of the
papers I've seen on this topic worry about getting single instructions
out of the search loop --- a hash lookup will cost lots more than that.
Moreover, you'd lose the guarantee of not-worse-than-linear time,
because hash lookup can be pathologically bad if you get a lot of hash
collisions.

> The main difficulty with BM is verification and understanding "good
> suffix shift" (the second part of BM) (I don't understand it entirely).

Yeah, there seem to be a bunch of variants of BM (many of them not
guaranteed linear, which I'm sure we don't want) and the earliest
papers had bugs.  But a good implementation would be a lot easier
sell because it would show benefits for a much wider set of use-cases
than KMP.

            regards, tom lane

pgsql-patches by date:

Previous
From: "Brendan Jurd"
Date:
Subject: Re: Default location for docbook stylesheets in Gentoo
Next
From: Tom Lane
Date:
Subject: Re: Default location for docbook stylesheets in Gentoo