Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: WIP: index support for regexp search
Date
Msg-id 50B8C17F.6000507@vmware.com
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: WIP: index support for regexp search
List pgsql-hackers
On 30.11.2012 13:20, Alexander Korotkov wrote:
> On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas@vmware.com
>> wrote:
>
>> Would it be safe to simply stop short the depth-first search on overflow,
>> and proceed with the graph that was constructed up to that point?
>
> For depth-first it's not. But your proposal naturally makes sense. I've
> changed it to breadth-first search. And then it's safe to mark all
> unprocessed states as final when overflow. It means that we assume every
> unprocessed branch to immediately finish with matching (this can give us
> more false positives but no false negatives).
> For overflow of matrix collection, it's safe to do just OR between all the
> trigrams.
> New version of patch is attached.

Thanks, sounds good.

I've spent quite a long time trying to understand the transformation the 
getState/addKeys/addAcrs functions do to the original CNFA graph. I 
think that still needs more comments to explain the steps involved in it.

One thing that occurs to me is that it might be better to delay 
expanding colors to characters. You could build and transform the graph, 
and even create the paths, all while operating on colors. You would end 
up with lists of "color trigrams", consisting of sequences of three 
colors that must appear in the source string. Only at the last step you 
would expand the color trigrams to real character trigrams. I think that 
would save a lot of processing while building the graph, if you have 
colors that contain many characters. It would allow us to do better than 
the fixed small MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS 
= 4 in the current patch, it's a shame that we can't do anything with a 
fairly simple regexp like "^a[b-g]h$"

- Heikki



pgsql-hackers by date:

Previous
From: Kohei KaiGai
Date:
Subject: Re: Review: Extra Daemons / bgworker
Next
From: Markus Wanner
Date:
Subject: Re: Review: Extra Daemons / bgworker