Re: Index on regexes - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Index on regexes
Date
Msg-id 51C98B4C.7000709@vmware.com
Whole thread Raw
In response to Index on regexes  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
On 13.06.2013 23:19, Alexander Korotkov wrote:
> Hackers,
>
> Attached patch contains opclass which demonstrates advantages of GIN
> additional information storing itself without other GIN improvements. It
> implements inversed task of regex indexing. It works so: you create index
> on regexes and search for regexes matched query string. It introduce two
> additional operators |~ and *~ for case-sensetive and case-insensetive
> regex to string matching, and gin_regexp_trgm_ops opclass.

I'm going to step back from this patch and trigrams for a while, and 
ponder in general, how one would do indexing of regexps.

The best approach surely depends a lot on the kinds of regexps and query 
strings involved. For example, how much common structure do the regexps 
share, how easily they can be decomposed into common and differing 
parts. Also, how long are the query strings being used, and how many 
regexps does it need to work efficiently with.

The first thought that comes to my mind is to build a GiST index, where 
the leafs items are the regexps, in a precompiled form. The upper nodes 
are also pre-compiled regexps, which match everything that the child 
regexps contain. For example, if you have the regexps ".*foobar.*" and 
".*foofoo.*" on a leaf page, the upper node might be ".*foo.*". In other 
words, the union function forms a regular expression that matches all 
the strings that any of the member regexps, and may match more. 
Implementing the union-function is probably the most difficult part of 
this approach.

In literature, there is the Aho-Corasick algorithm that can be used to 
efficiently search for several substrings in a string in one go. I 
believe it can be extended to handle regexps. That does not fit nicely 
into existing GIN or GiST indexes, however, so that would need to be a 
whole new indexam. Also, I'm not sure how it scales as the number of 
regexps searched increses, and incremental updates might be tricky.

- Heikki



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Index on regexes
Next
From: Ronan Dunklau
Date:
Subject: Re: [PATCH] Fix conversion for Decimal arguments in plpython functions