On 30.11.2012 13:20, Alexander Korotkov wrote:
> On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas@vmware.com
>> wrote:
>
>> Would it be safe to simply stop short the depth-first search on overflow,
>> and proceed with the graph that was constructed up to that point?
>
> For depth-first it's not. But your proposal naturally makes sense. I've
> changed it to breadth-first search. And then it's safe to mark all
> unprocessed states as final when overflow. It means that we assume every
> unprocessed branch to immediately finish with matching (this can give us
> more false positives but no false negatives).
> For overflow of matrix collection, it's safe to do just OR between all the
> trigrams.
> New version of patch is attached.
Thanks, sounds good.
I've spent quite a long time trying to understand the transformation the
getState/addKeys/addAcrs functions do to the original CNFA graph. I
think that still needs more comments to explain the steps involved in it.
One thing that occurs to me is that it might be better to delay
expanding colors to characters. You could build and transform the graph,
and even create the paths, all while operating on colors. You would end
up with lists of "color trigrams", consisting of sequences of three
colors that must appear in the source string. Only at the last step you
would expand the color trigrams to real character trigrams. I think that
would save a lot of processing while building the graph, if you have
colors that contain many characters. It would allow us to do better than
the fixed small MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS
= 4 in the current patch, it's a shame that we can't do anything with a
fairly simple regexp like "^a[b-g]h$"
- Heikki