Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: WIP: index support for regexp search
Date
Msg-id CAPpHfdvdcOaVx1z0RiQFgzhcSwd7Pkd9Wg3H-C3EbM=McFw8Tg@mail.gmail.com
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: WIP: index support for regexp search
List pgsql-hackers
On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> [ trgm-regexp-0.15.patch.gz ]

I spent the weekend hacking on this, making a number of bug fixes and a
whole lot of cosmetic changes.  I think there are large parts of this
that are in committable shape now, but I still find the actual graph
transformation logic to be mostly unintelligible.  I think what's most
obscure is the distinction between the arcs list and the keys list of
each state in the expanded graph.  I get the impression that the
general idea is for the arcs to represent exactly-known transitions
while the keys represent imprecisely-known transitions ... but there
seems to be at least some leakage between those categories.  Could
you write down a specification for what's supposed to be happening
there?

Here is my try to specify it.
At first some notions. I know, they are already in the comments, but just in order to put it together.

Extended color - any color of source CNFA or one of two special values:
1) Unknown color - may represent any character either alphanumeric or non-alphanumeric.
2) Blank color - may represent any non-alphanumeric character
Prefix is extended colors of last two characters read by CNFA.
Key is pair of CNFA state and prefix. So, key is a extended state which is containing additional information which can influence further trigrams.

So, if you are in some key and traverse some CNFA arc then you moves info another key. But there are two possible cases (or, sometimes, both of them):
1) Your move from one key into another necessary means read of some trigram. Then you create new arc labeled with that trigram.
2) You move into another key, but you doesn't necessary read an useful trigram. For example, arc of source CNFA is labeled by "unextractable" color. Then you add new key into "keys" array. And outgoing arcs from this key will also be processed similarly to source key. Therefore "keys" array is a set of keys which are achievable from "stateKey" without reading of useful trigram.
We could get rid of "keys" array and produce some "empty arcs" in the second case. You can imagine that "keys" array is set of keys which are achivable by "empty arcs" from "stateKey". States connected with "empty arcs" could be merged on the next stage.
However, the reason of having separated addKeys stage is optimization. In addKey function more generals keys absorb less general ones. In many cases resulting graph becomes much simplier because of this. For example, if your regex is not prefix, then engine puts self-referencing arcs of all possible colors to initial state. Straight-forward processing of this could produce enormous output graph. I had similar situation in early version of patch where keys didn't absorb earch other. 

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: "Etsuro Fujita"
Date:
Subject: Re: Patch for removng unused targets
Next
From: Heikki Linnakangas
Date:
Subject: Re: commit dfda6ebaec67 versus wal_keep_segments