On 13.06.2013 23:19, Alexander Korotkov wrote:
> Hackers,
>
> Attached patch contains opclass which demonstrates advantages of GIN
> additional information storing itself without other GIN improvements. It
> implements inversed task of regex indexing. It works so: you create index
> on regexes and search for regexes matched query string. It introduce two
> additional operators |~ and *~ for case-sensetive and case-insensetive
> regex to string matching, and gin_regexp_trgm_ops opclass.
I'm going to step back from this patch and trigrams for a while, and
ponder in general, how one would do indexing of regexps.
The best approach surely depends a lot on the kinds of regexps and query
strings involved. For example, how much common structure do the regexps
share, how easily they can be decomposed into common and differing
parts. Also, how long are the query strings being used, and how many
regexps does it need to work efficiently with.
The first thought that comes to my mind is to build a GiST index, where
the leafs items are the regexps, in a precompiled form. The upper nodes
are also pre-compiled regexps, which match everything that the child
regexps contain. For example, if you have the regexps ".*foobar.*" and
".*foofoo.*" on a leaf page, the upper node might be ".*foo.*". In other
words, the union function forms a regular expression that matches all
the strings that any of the member regexps, and may match more.
Implementing the union-function is probably the most difficult part of
this approach.
In literature, there is the Aho-Corasick algorithm that can be used to
efficiently search for several substrings in a string in one go. I
believe it can be extended to handle regexps. That does not fit nicely
into existing GIN or GiST indexes, however, so that would need to be a
whole new indexam. Also, I'm not sure how it scales as the number of
regexps searched increses, and incremental updates might be tricky.
- Heikki