Home > mailing lists

Re: Index on regexes - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: Index on regexes
Date	June 25, 2013 12:21:52
Msg-id	51C98B4C.7000709@vmware.com Whole thread
In response to	Index on regexes (Alexander Korotkov <aekorotkov@gmail.com>)
List	pgsql-hackers

Tree view

On 13.06.2013 23:19, Alexander Korotkov wrote:
> Hackers,
>
> Attached patch contains opclass which demonstrates advantages of GIN
> additional information storing itself without other GIN improvements. It
> implements inversed task of regex indexing. It works so: you create index
> on regexes and search for regexes matched query string. It introduce two
> additional operators |~ and *~ for case-sensetive and case-insensetive
> regex to string matching, and gin_regexp_trgm_ops opclass.

I'm going to step back from this patch and trigrams for a while, and 
ponder in general, how one would do indexing of regexps.

The best approach surely depends a lot on the kinds of regexps and query 
strings involved. For example, how much common structure do the regexps 
share, how easily they can be decomposed into common and differing 
parts. Also, how long are the query strings being used, and how many 
regexps does it need to work efficiently with.

The first thought that comes to my mind is to build a GiST index, where 
the leafs items are the regexps, in a precompiled form. The upper nodes 
are also pre-compiled regexps, which match everything that the child 
regexps contain. For example, if you have the regexps ".*foobar.*" and 
".*foofoo.*" on a leaf page, the upper node might be ".*foo.*". In other 
words, the union function forms a regular expression that matches all 
the strings that any of the member regexps, and may match more. 
Implementing the union-function is probably the most difficult part of 
this approach.

In literature, there is the Aho-Corasick algorithm that can be used to 
efficiently search for several substrings in a string in one go. I 
believe it can be extended to handle regexps. That does not fit nicely 
into existing GIN or GiST indexes, however, so that would need to be a 
whole new indexam. Also, I'm not sure how it scales as the number of 
regexps searched increses, and incremental updates might be tricky.

- Heikki

pgsql-hackers by date:

From: Heikki Linnakangas
Date: 25 June 2013, 12:03:38
Subject: Re: Index on regexes

From: Ronan Dunklau
Date: 25 June 2013, 12:23:51
Subject: Re: [PATCH] Fix conversion for Decimal arguments in plpython functions

Re: Index on regexes - Mailing list pgsql-hackers

Previous

Next