Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: WIP: index support for regexp search
Date
Msg-id CAPpHfdsTEGmx-fmBzaqFED8ghiDX+UA0duph=UL383w80A=oDw@mail.gmail.com
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: WIP: index support for regexp search
List pgsql-hackers
On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 30.11.2012 13:20, Alexander Korotkov wrote:
On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas@vmware.com
wrote:

Would it be safe to simply stop short the depth-first search on overflow,
and proceed with the graph that was constructed up to that point?

For depth-first it's not. But your proposal naturally makes sense. I've
changed it to breadth-first search. And then it's safe to mark all
unprocessed states as final when overflow. It means that we assume every
unprocessed branch to immediately finish with matching (this can give us
more false positives but no false negatives).
For overflow of matrix collection, it's safe to do just OR between all the
trigrams.
New version of patch is attached.

Thanks, sounds good.

I've spent quite a long time trying to understand the transformation the getState/addKeys/addAcrs functions do to the original CNFA graph. I think that still needs more comments to explain the steps involved in it.

One thing that occurs to me is that it might be better to delay expanding colors to characters. You could build and transform the graph, and even create the paths, all while operating on colors. You would end up with lists of "color trigrams", consisting of sequences of three colors that must appear in the source string. Only at the last step you would expand the color trigrams to real character trigrams. I think that would save a lot of processing while building the graph, if you have colors that contain many characters. It would allow us to do better than the fixed small MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current patch, it's a shame that we can't do anything with a fairly simple regexp like "^a[b-g]h$"

Nice idea to delay expanding colors to characters! Obviously, we should delay expanding inly alphanumerical characters. Because non-alphanumberical characters influence graph structure. Trying to implement...

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: proposal: separate databases for contrib module testing
Next
From: Alexander Korotkov
Date:
Subject: Re: WIP: index support for regexp search