Thread: Index on regexes

Index on regexes

From
Alexander Korotkov
Date:
Hackers,

Attached patch contains opclass which demonstrates advantages of GIN additional information storing itself without other GIN improvements. It implements inversed task of regex indexing. It works so: you create index on regexes and search for regexes matched query string. It introduce two additional operators |~ and *~ for case-sensetive and case-insensetive regex to string matching, and gin_regexp_trgm_ops opclass.

Let's consider some example.

At first, generate some regexes.

CREATE OR REPLACE FUNCTION generate_string(int, int) RETURNS text AS $$
SELECT array_to_string(ARRAY(SELECT chr((97 + random() * 10) :: integer) FROM generate_series(1,($1 + random()*$2)::int)), '');
$$
LANGUAGE sql;

CREATE TABLE test AS select '(' || generate_string(1,4) || '|' || generate_string(1,4) || '|' || generate_string(1,4) || ')' || generate_string(1,2) || '(' || generate_string(1,4) || '|' || generate_string(1,4) || '|' || generate_string(1,4) || ')' AS s FROM generate_series(1,1000000);

I use only 10 characters on alphabet in order to have better chance of matching. It generate some regexes like so:

postgres=# SELECT * FROM test LIMIT 10;
                 s
------------------------------------
 (g|cij|ah)jg(iei|hfc|eef)
 (gbfdb|ehbg|akf)ge(bc|jgee|jidd)
 (jedc|kgc|c)bc(ii|bji|iebc)
 (aa|eie|bgdb)f(fc|he|f)
 (b|ijc|ae)ae(eccb|ie|kjf)
 (bib|igf|kdibf)fij(gcbh|efi|fidj)
 (bkejf|jfdhg|gbfe)bhb(bedj|hh|ggg)
 (kfb|egccd|iefce)jf(kj|jbef|kbc)
 (bhh|c|cd)cb(h|ed|jg)
 (id|j|geg)gc(djif|ai|cjjjc)
(10 rows)

Without index search takes about 10 seconds.

postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..19929.00 rows=5000 width=28) (actual time=172.990..97357.312 rows=438 loops=1)
   Filter: (s |~ 'abcdefghijkl'::text)
   Rows Removed by Filter: 999562
 Total runtime: 97357.490 ms
(4 rows)

And with index it takes only 110 milliseconds.

postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=182.75..7245.94 rows=5000 width=28) (actual time=68.143..110.663 rows=438 loops=1)
   Recheck Cond: (s |~ 'abcdefghijkl'::text)
   ->  Bitmap Index Scan on test_idx  (cost=0.00..181.50 rows=5000 width=0) (actual time=67.929..67.929 rows=438 loops=1)
         Index Cond: (s |~ 'abcdefghijkl'::text)
 Total runtime: 110.870 ms
(5 rows)

------
With best regards,
Alexander Korotkov.
Attachment

Re: Index on regexes

From
"Erik Rijkers"
Date:
On Thu, June 13, 2013 22:19, Alexander Korotkov wrote:

>  [index_on_regexes.1.patch.gz ]

Hi,

Compile of core is OK, but contrib compilation fails:

-- [2013.06.13 23:23:14 idxregex] make contrib
trgm_gin.c: In function ‘gin_regexp_trgm_config’:
trgm_gin.c:410:2: error: unknown type name ‘GinConfig’ GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0); ^
trgm_gin.c:410:26: error: ‘GinConfig’ undeclared (first use in this function) GinConfig *ginConfig = (GinConfig
*)PG_GETARG_POINTER(0);                        ^
 
trgm_gin.c:410:26: note: each undeclared identifier is reported only once for each function it appears in
trgm_gin.c:410:37: error: expected expression before ‘)’ token GinConfig *ginConfig = (GinConfig
*)PG_GETARG_POINTER(0);                                   ^
 
trgm_gin.c:412:11: error: request for member ‘addInfoTypeOid’ in something not a structure or union
ginConfig->addInfoTypeOid= BYTEAOID;          ^
 
make[1]: *** [trgm_gin.o] Error 1
make: *** [all-pg_trgm-recurse] Error 2
trgm_gin.c: In function ‘gin_regexp_trgm_config’:
trgm_gin.c:410:2: error: unknown type name ‘GinConfig’ GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0); ^
trgm_gin.c:410:26: error: ‘GinConfig’ undeclared (first use in this function) GinConfig *ginConfig = (GinConfig
*)PG_GETARG_POINTER(0);                        ^
 
trgm_gin.c:410:26: note: each undeclared identifier is reported only once for each function it appears in
trgm_gin.c:410:37: error: expected expression before ‘)’ token GinConfig *ginConfig = (GinConfig
*)PG_GETARG_POINTER(0);                                   ^
 
trgm_gin.c:412:11: error: request for member ‘addInfoTypeOid’ in something not a structure or union
ginConfig->addInfoTypeOid= BYTEAOID;          ^
 
make[1]: *** [trgm_gin.o] Error 1
make: *** [install-pg_trgm-recurse] Error 2


It is also not entirely clear from your post how this index is created,  I suppose the regex index is only in trgm?

create index test_gin_idx on test (s gin_regexp_trgm_ops);  -->  right?



thanks,

Erik Erijkers





Re: Index on regexes

From
Alexander Korotkov
Date:
On Fri, Jun 14, 2013 at 1:30 AM, Erik Rijkers <er@xs4all.nl> wrote:
On Thu, June 13, 2013 22:19, Alexander Korotkov wrote:

>  [index_on_regexes.1.patch.gz ]

Hi,

Compile of core is OK, but contrib compilation fails:

-- [2013.06.13 23:23:14 idxregex] make contrib
trgm_gin.c: In function ‘gin_regexp_trgm_config’:
trgm_gin.c:410:2: error: unknown type name ‘GinConfig’
  GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
  ^
trgm_gin.c:410:26: error: ‘GinConfig’ undeclared (first use in this function)
  GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
                          ^
trgm_gin.c:410:26: note: each undeclared identifier is reported only once for each function it appears in
trgm_gin.c:410:37: error: expected expression before ‘)’ token
  GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
                                     ^
trgm_gin.c:412:11: error: request for member ‘addInfoTypeOid’ in something not a structure or union
  ginConfig->addInfoTypeOid = BYTEAOID;
           ^
make[1]: *** [trgm_gin.o] Error 1
make: *** [all-pg_trgm-recurse] Error 2
trgm_gin.c: In function ‘gin_regexp_trgm_config’:
trgm_gin.c:410:2: error: unknown type name ‘GinConfig’
  GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
  ^
trgm_gin.c:410:26: error: ‘GinConfig’ undeclared (first use in this function)
  GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
                          ^
trgm_gin.c:410:26: note: each undeclared identifier is reported only once for each function it appears in
trgm_gin.c:410:37: error: expected expression before ‘)’ token
  GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
                                     ^
trgm_gin.c:412:11: error: request for member ‘addInfoTypeOid’ in something not a structure or union
  ginConfig->addInfoTypeOid = BYTEAOID;
           ^
make[1]: *** [trgm_gin.o] Error 1
make: *** [install-pg_trgm-recurse] Error 2

Likely I wasn't explicit enough. You need to apply this patch first:
 
It is also not entirely clear from your post how this index is created,  I suppose the regex index is only in trgm?

create index test_gin_idx on test (s gin_regexp_trgm_ops);  -->  right?

Oh, it was missed. Right.

------
With best regards,
Alexander Korotkov.
 

Re: Index on regexes

From
Heikki Linnakangas
Date:
On 13.06.2013 23:19, Alexander Korotkov wrote:
> Hackers,
>
> Attached patch contains opclass which demonstrates advantages of GIN
> additional information storing itself without other GIN improvements. It
> implements inversed task of regex indexing. It works so: you create index
> on regexes and search for regexes matched query string. It introduce two
> additional operators |~ and *~ for case-sensetive and case-insensetive
> regex to string matching, and gin_regexp_trgm_ops opclass.

Ok, I've read the patch and I think I understand it now. Clever approach.

The idea is to first build an NFA of trigrams from each indexed regular
expression, just like we do in the reverse case where you index strings,
and the query is a single regexp. The extracted trigrams are inserted
into the gin index.

That already helps significantly in certain kinds of applications. In
fact, I wonder if we should do just that, and leave out the information
stored in the additional info. There are pros and cons - you get a
smaller index, but need to recheck more rows.

But here's the clever part: in addition to just indexing all the
trigrams in each regexp, fragments of the trigram graph are stored along
with each item in the posting lists. In the graph, each arc is labeled
with a trigram, and a string can match the regexp only if there is a
path through the graph following only arcs that match trigrams that are
present in the graph. In the posting list of trigram abc, you store all
the arcs in the graph labelled with 'abc'. In the consistent function,
you have access to all the arcs labelled with trigrams that are present
in the string. The function assembles a graph from those arcs, and then
checks if there is a path from initial to final state.

> Let's consider some example.
>
> At first, generate some regexes.
>
> CREATE OR REPLACE FUNCTION generate_string(int, int) RETURNS text AS $$
> SELECT array_to_string(ARRAY(SELECT chr((97 + random() * 10) :: integer)
> FROM generate_series(1,($1 + random()*$2)::int)), '');
> $$
> LANGUAGE sql;
>
> CREATE TABLE test AS select '(' || generate_string(1,4) || '|' ||
> generate_string(1,4) || '|' || generate_string(1,4) || ')' ||
> generate_string(1,2) || '(' || generate_string(1,4) || '|' ||
> generate_string(1,4) || '|' || generate_string(1,4) || ')' AS s FROM
> generate_series(1,1000000);
>
> I use only 10 characters on alphabet in order to have better chance of
> matching. It generate some regexes like so:
>
> postgres=# SELECT * FROM test LIMIT 10;
>                   s
> ------------------------------------
>   (g|cij|ah)jg(iei|hfc|eef)
>   (gbfdb|ehbg|akf)ge(bc|jgee|jidd)
>   (jedc|kgc|c)bc(ii|bji|iebc)
>   (aa|eie|bgdb)f(fc|he|f)
>   (b|ijc|ae)ae(eccb|ie|kjf)
>   (bib|igf|kdibf)fij(gcbh|efi|fidj)
>   (bkejf|jfdhg|gbfe)bhb(bedj|hh|ggg)
>   (kfb|egccd|iefce)jf(kj|jbef|kbc)
>   (bhh|c|cd)cb(h|ed|jg)
>   (id|j|geg)gc(djif|ai|cjjjc)
> (10 rows)
>
> Without index search takes about 10 seconds.
>
> postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
>                                                    QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
>   Seq Scan on test  (cost=0.00..19929.00 rows=5000 width=28) (actual
> time=172.990..97357.312 rows=438 loops=1)
>     Filter: (s |~ 'abcdefghijkl'::text)
>     Rows Removed by Filter: 999562
>   Total runtime: 97357.490 ms
> (4 rows)

That's almost 100 seconds.

> And with index it takes only 110 milliseconds.
>
> postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
>                                                          QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------
>   Bitmap Heap Scan on test  (cost=182.75..7245.94 rows=5000 width=28)
> (actual time=68.143..110.663 rows=438 loops=1)
>     Recheck Cond: (s |~ 'abcdefghijkl'::text)
>     ->   Bitmap Index Scan on test_idx  (cost=0.00..181.50 rows=5000 width=0)
> (actual time=67.929..67.929 rows=438 loops=1)
>           Index Cond: (s |~ 'abcdefghijkl'::text)
>   Total runtime: 110.870 ms
> (5 rows)

That's an impressive result. Now, the next question is, is this a
realistic use of this feature?

Some more numbers from the above test:

* The index in above example is about 180MB in size, while the table is
60MB. Without the graph fragments stored in the "additional info", the
index is about 120MB

* Building the index takes about 370 seconds, or 0.37 ms per row.


I wonder what kind of data sets this feature is likely to be used in
practice? I googled around and found one data set: rules from the Snort
intrusion detection system. I downloaded the community rules from
http://www.snort.org/snort-rules/. They look very different from the
kind of regexps you tested with; the rules contain a lot of strange
characters, so the trigram extraction is not as effective. I had to
modify some of the rules because PostgreSQL regexp engine only supports
up to 255 repetitions with {XXX} syntax, and I also removed all the
modifiers governing case-sensitivity and some other things. I ended up
with 674 regular expressions. Here's a comparison of query speed:

postgres=#  explain analyze select * from t where s |~ 'abcdefghijk';
                                              QUERY PLAN


-------------------------------------------------------------------------------
---------------------
  Seq Scan on t  (cost=0.00..22.43 rows=3 width=123) (actual
time=7534.185..7534
.185 rows=0 loops=1)
    Filter: (s |~ 'abcdefghijk'::text)
    Rows Removed by Filter: 674
  Total runtime: 7534.210 ms
(4 rows)

Time: 7534,620 ms

postgres=#  explain analyze select * from t where s |~ 'abcdefghijk';
                                                        QUERY PLAN


-------------------------------------------------------------------------------
-----------------------------------------
  Bitmap Heap Scan on t  (cost=108.03..115.90 rows=3 width=123) (actual
time=705
9.735..7059.735 rows=0 loops=1)
    Recheck Cond: (s |~ 'abcdefghijk'::text)
    Rows Removed by Index Recheck: 343
    ->  Bitmap Index Scan on t_gin_idx  (cost=0.00..108.03 rows=3
width=0) (actu
al time=37.675..37.675 rows=343 loops=1)
          Index Cond: (s |~ 'abcdefghijk'::text)
  Total runtime: 7059.788 ms
(6 rows)

Time: 7060,306 ms

So, the index doesn't help much in this case. It's eliminating about
half of the rows, leaving 343 rows to be rechecked, but the total
runtime is about the same.


However, sometimes I get a segfault with that query:

#0  0x00007fede3715988 in checkConnectivity (partGraphInfo=0x4157da0)
     at trgm_regexp.c:2262
#1  0x00007fede371286f in gin_regexp_trgm_consistent (fcinfo=0x7fff2a00c230)
     at trgm_gin.c:402
#2  0x000000000083386e in FunctionCall10Coll (flinfo=0x36fd5f0,
     collation=100, arg1=36110032, arg2=5, arg3=37437664, arg4=13,
     arg5=36110288, arg6=36109673, arg7=36110144, arg8=36110000,
     arg9=36112304, arg10=36112448) at fmgr.c:1630
#3  0x0000000000472045 in callConsistentFn (ginstate=0x36fc158,
key=0x226fd10)
     at ginget.c:55
#4  0x0000000000473f8b in keyGetItem (ginstate=0x36fc158,
tempCtx=0x2b24a60,
     key=0x226fd10) at ginget.c:969
#5  0x000000000047416c in scanGetItem (scan=0x4157600,
     advancePast=0x7fff2a00c760, item=0x7fff2a00c760,
     recheck=0x7fff2a00c75f "\001\377\377\377\377\377\377") at ginget.c:1051
#6  0x0000000000475537 in gingetbitmap (fcinfo=0x7fff2a00c7b0)
     at ginget.c:1651

Attached is a dump of the test table I used above. Please note that the
snort-rules are GPLv2-licensed, so they cannot be included in PostgreSQL
e.g in regression tests.

- Heikki

Attachment

Re: Index on regexes

From
Heikki Linnakangas
Date:
On 13.06.2013 23:19, Alexander Korotkov wrote:
> Hackers,
>
> Attached patch contains opclass which demonstrates advantages of GIN
> additional information storing itself without other GIN improvements. It
> implements inversed task of regex indexing. It works so: you create index
> on regexes and search for regexes matched query string. It introduce two
> additional operators |~ and *~ for case-sensetive and case-insensetive
> regex to string matching, and gin_regexp_trgm_ops opclass.

I'm going to step back from this patch and trigrams for a while, and 
ponder in general, how one would do indexing of regexps.

The best approach surely depends a lot on the kinds of regexps and query 
strings involved. For example, how much common structure do the regexps 
share, how easily they can be decomposed into common and differing 
parts. Also, how long are the query strings being used, and how many 
regexps does it need to work efficiently with.

The first thought that comes to my mind is to build a GiST index, where 
the leafs items are the regexps, in a precompiled form. The upper nodes 
are also pre-compiled regexps, which match everything that the child 
regexps contain. For example, if you have the regexps ".*foobar.*" and 
".*foofoo.*" on a leaf page, the upper node might be ".*foo.*". In other 
words, the union function forms a regular expression that matches all 
the strings that any of the member regexps, and may match more. 
Implementing the union-function is probably the most difficult part of 
this approach.

In literature, there is the Aho-Corasick algorithm that can be used to 
efficiently search for several substrings in a string in one go. I 
believe it can be extended to handle regexps. That does not fit nicely 
into existing GIN or GiST indexes, however, so that would need to be a 
whole new indexam. Also, I'm not sure how it scales as the number of 
regexps searched increses, and incremental updates might be tricky.

- Heikki