Thread: Index on regexes
Hackers,
Attached patch contains opclass which demonstrates advantages of GIN additional information storing itself without other GIN improvements. It implements inversed task of regex indexing. It works so: you create index on regexes and search for regexes matched query string. It introduce two additional operators |~ and *~ for case-sensetive and case-insensetive regex to string matching, and gin_regexp_trgm_ops opclass.
Let's consider some example.
At first, generate some regexes.
CREATE OR REPLACE FUNCTION generate_string(int, int) RETURNS text AS $$
SELECT array_to_string(ARRAY(SELECT chr((97 + random() * 10) :: integer) FROM generate_series(1,($1 + random()*$2)::int)), '');
$$
LANGUAGE sql;
CREATE TABLE test AS select '(' || generate_string(1,4) || '|' || generate_string(1,4) || '|' || generate_string(1,4) || ')' || generate_string(1,2) || '(' || generate_string(1,4) || '|' || generate_string(1,4) || '|' || generate_string(1,4) || ')' AS s FROM generate_series(1,1000000);
I use only 10 characters on alphabet in order to have better chance of matching. It generate some regexes like so:
postgres=# SELECT * FROM test LIMIT 10;
s
------------------------------------
(g|cij|ah)jg(iei|hfc|eef)
(gbfdb|ehbg|akf)ge(bc|jgee|jidd)
(jedc|kgc|c)bc(ii|bji|iebc)
(aa|eie|bgdb)f(fc|he|f)
(b|ijc|ae)ae(eccb|ie|kjf)
(bib|igf|kdibf)fij(gcbh|efi|fidj)
(bkejf|jfdhg|gbfe)bhb(bedj|hh|ggg)
(kfb|egccd|iefce)jf(kj|jbef|kbc)
(bhh|c|cd)cb(h|ed|jg)
(id|j|geg)gc(djif|ai|cjjjc)
(10 rows)
Without index search takes about 10 seconds.
postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..19929.00 rows=5000 width=28) (actual time=172.990..97357.312 rows=438 loops=1)
Filter: (s |~ 'abcdefghijkl'::text)
Rows Removed by Filter: 999562
Total runtime: 97357.490 ms
(4 rows)
And with index it takes only 110 milliseconds.
postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=182.75..7245.94 rows=5000 width=28) (actual time=68.143..110.663 rows=438 loops=1)
Recheck Cond: (s |~ 'abcdefghijkl'::text)
-> Bitmap Index Scan on test_idx (cost=0.00..181.50 rows=5000 width=0) (actual time=67.929..67.929 rows=438 loops=1)
Index Cond: (s |~ 'abcdefghijkl'::text)
Total runtime: 110.870 ms
(5 rows)
------
With best regards,
Alexander Korotkov.
With best regards,
Alexander Korotkov.
Attachment
On Thu, June 13, 2013 22:19, Alexander Korotkov wrote: > [index_on_regexes.1.patch.gz ] Hi, Compile of core is OK, but contrib compilation fails: -- [2013.06.13 23:23:14 idxregex] make contrib trgm_gin.c: In function gin_regexp_trgm_config: trgm_gin.c:410:2: error: unknown type name GinConfig GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0); ^ trgm_gin.c:410:26: error: GinConfig undeclared (first use in this function) GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0); ^ trgm_gin.c:410:26: note: each undeclared identifier is reported only once for each function it appears in trgm_gin.c:410:37: error: expected expression before ) token GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0); ^ trgm_gin.c:412:11: error: request for member addInfoTypeOid in something not a structure or union ginConfig->addInfoTypeOid= BYTEAOID; ^ make[1]: *** [trgm_gin.o] Error 1 make: *** [all-pg_trgm-recurse] Error 2 trgm_gin.c: In function gin_regexp_trgm_config: trgm_gin.c:410:2: error: unknown type name GinConfig GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0); ^ trgm_gin.c:410:26: error: GinConfig undeclared (first use in this function) GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0); ^ trgm_gin.c:410:26: note: each undeclared identifier is reported only once for each function it appears in trgm_gin.c:410:37: error: expected expression before ) token GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0); ^ trgm_gin.c:412:11: error: request for member addInfoTypeOid in something not a structure or union ginConfig->addInfoTypeOid= BYTEAOID; ^ make[1]: *** [trgm_gin.o] Error 1 make: *** [install-pg_trgm-recurse] Error 2 It is also not entirely clear from your post how this index is created, I suppose the regex index is only in trgm? create index test_gin_idx on test (s gin_regexp_trgm_ops); --> right? thanks, Erik Erijkers
On Fri, Jun 14, 2013 at 1:30 AM, Erik Rijkers <er@xs4all.nl> wrote:
On Thu, June 13, 2013 22:19, Alexander Korotkov wrote:
> [index_on_regexes.1.patch.gz ]
Hi,
Compile of core is OK, but contrib compilation fails:
-- [2013.06.13 23:23:14 idxregex] make contrib
trgm_gin.c: In function ‘gin_regexp_trgm_config’:
trgm_gin.c:410:2: error: unknown type name ‘GinConfig’
GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
^
trgm_gin.c:410:26: error: ‘GinConfig’ undeclared (first use in this function)
GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
^
trgm_gin.c:410:26: note: each undeclared identifier is reported only once for each function it appears in
trgm_gin.c:410:37: error: expected expression before ‘)’ token
GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
^
trgm_gin.c:412:11: error: request for member ‘addInfoTypeOid’ in something not a structure or union
ginConfig->addInfoTypeOid = BYTEAOID;
^
make[1]: *** [trgm_gin.o] Error 1
make: *** [all-pg_trgm-recurse] Error 2
trgm_gin.c: In function ‘gin_regexp_trgm_config’:
trgm_gin.c:410:2: error: unknown type name ‘GinConfig’
GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
^
trgm_gin.c:410:26: error: ‘GinConfig’ undeclared (first use in this function)
GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
^
trgm_gin.c:410:26: note: each undeclared identifier is reported only once for each function it appears in
trgm_gin.c:410:37: error: expected expression before ‘)’ token
GinConfig *ginConfig = (GinConfig *)PG_GETARG_POINTER(0);
^
trgm_gin.c:412:11: error: request for member ‘addInfoTypeOid’ in something not a structure or union
ginConfig->addInfoTypeOid = BYTEAOID;
^
make[1]: *** [trgm_gin.o] Error 1
make: *** [install-pg_trgm-recurse] Error 2
Likely I wasn't explicit enough. You need to apply this patch first:
It is also not entirely clear from your post how this index is created, I suppose the regex index is only in trgm?
create index test_gin_idx on test (s gin_regexp_trgm_ops); --> right?
Oh, it was missed. Right.
------
With best regards,
Alexander Korotkov.
With best regards,
Alexander Korotkov.
On 13.06.2013 23:19, Alexander Korotkov wrote: > Hackers, > > Attached patch contains opclass which demonstrates advantages of GIN > additional information storing itself without other GIN improvements. It > implements inversed task of regex indexing. It works so: you create index > on regexes and search for regexes matched query string. It introduce two > additional operators |~ and *~ for case-sensetive and case-insensetive > regex to string matching, and gin_regexp_trgm_ops opclass. Ok, I've read the patch and I think I understand it now. Clever approach. The idea is to first build an NFA of trigrams from each indexed regular expression, just like we do in the reverse case where you index strings, and the query is a single regexp. The extracted trigrams are inserted into the gin index. That already helps significantly in certain kinds of applications. In fact, I wonder if we should do just that, and leave out the information stored in the additional info. There are pros and cons - you get a smaller index, but need to recheck more rows. But here's the clever part: in addition to just indexing all the trigrams in each regexp, fragments of the trigram graph are stored along with each item in the posting lists. In the graph, each arc is labeled with a trigram, and a string can match the regexp only if there is a path through the graph following only arcs that match trigrams that are present in the graph. In the posting list of trigram abc, you store all the arcs in the graph labelled with 'abc'. In the consistent function, you have access to all the arcs labelled with trigrams that are present in the string. The function assembles a graph from those arcs, and then checks if there is a path from initial to final state. > Let's consider some example. > > At first, generate some regexes. > > CREATE OR REPLACE FUNCTION generate_string(int, int) RETURNS text AS $$ > SELECT array_to_string(ARRAY(SELECT chr((97 + random() * 10) :: integer) > FROM generate_series(1,($1 + random()*$2)::int)), ''); > $$ > LANGUAGE sql; > > CREATE TABLE test AS select '(' || generate_string(1,4) || '|' || > generate_string(1,4) || '|' || generate_string(1,4) || ')' || > generate_string(1,2) || '(' || generate_string(1,4) || '|' || > generate_string(1,4) || '|' || generate_string(1,4) || ')' AS s FROM > generate_series(1,1000000); > > I use only 10 characters on alphabet in order to have better chance of > matching. It generate some regexes like so: > > postgres=# SELECT * FROM test LIMIT 10; > s > ------------------------------------ > (g|cij|ah)jg(iei|hfc|eef) > (gbfdb|ehbg|akf)ge(bc|jgee|jidd) > (jedc|kgc|c)bc(ii|bji|iebc) > (aa|eie|bgdb)f(fc|he|f) > (b|ijc|ae)ae(eccb|ie|kjf) > (bib|igf|kdibf)fij(gcbh|efi|fidj) > (bkejf|jfdhg|gbfe)bhb(bedj|hh|ggg) > (kfb|egccd|iefce)jf(kj|jbef|kbc) > (bhh|c|cd)cb(h|ed|jg) > (id|j|geg)gc(djif|ai|cjjjc) > (10 rows) > > Without index search takes about 10 seconds. > > postgres=# explain analyze select * from test where s |~ 'abcdefghijkl'; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------- > Seq Scan on test (cost=0.00..19929.00 rows=5000 width=28) (actual > time=172.990..97357.312 rows=438 loops=1) > Filter: (s |~ 'abcdefghijkl'::text) > Rows Removed by Filter: 999562 > Total runtime: 97357.490 ms > (4 rows) That's almost 100 seconds. > And with index it takes only 110 milliseconds. > > postgres=# explain analyze select * from test where s |~ 'abcdefghijkl'; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on test (cost=182.75..7245.94 rows=5000 width=28) > (actual time=68.143..110.663 rows=438 loops=1) > Recheck Cond: (s |~ 'abcdefghijkl'::text) > -> Bitmap Index Scan on test_idx (cost=0.00..181.50 rows=5000 width=0) > (actual time=67.929..67.929 rows=438 loops=1) > Index Cond: (s |~ 'abcdefghijkl'::text) > Total runtime: 110.870 ms > (5 rows) That's an impressive result. Now, the next question is, is this a realistic use of this feature? Some more numbers from the above test: * The index in above example is about 180MB in size, while the table is 60MB. Without the graph fragments stored in the "additional info", the index is about 120MB * Building the index takes about 370 seconds, or 0.37 ms per row. I wonder what kind of data sets this feature is likely to be used in practice? I googled around and found one data set: rules from the Snort intrusion detection system. I downloaded the community rules from http://www.snort.org/snort-rules/. They look very different from the kind of regexps you tested with; the rules contain a lot of strange characters, so the trigram extraction is not as effective. I had to modify some of the rules because PostgreSQL regexp engine only supports up to 255 repetitions with {XXX} syntax, and I also removed all the modifiers governing case-sensitivity and some other things. I ended up with 674 regular expressions. Here's a comparison of query speed: postgres=# explain analyze select * from t where s |~ 'abcdefghijk'; QUERY PLAN ------------------------------------------------------------------------------- --------------------- Seq Scan on t (cost=0.00..22.43 rows=3 width=123) (actual time=7534.185..7534 .185 rows=0 loops=1) Filter: (s |~ 'abcdefghijk'::text) Rows Removed by Filter: 674 Total runtime: 7534.210 ms (4 rows) Time: 7534,620 ms postgres=# explain analyze select * from t where s |~ 'abcdefghijk'; QUERY PLAN ------------------------------------------------------------------------------- ----------------------------------------- Bitmap Heap Scan on t (cost=108.03..115.90 rows=3 width=123) (actual time=705 9.735..7059.735 rows=0 loops=1) Recheck Cond: (s |~ 'abcdefghijk'::text) Rows Removed by Index Recheck: 343 -> Bitmap Index Scan on t_gin_idx (cost=0.00..108.03 rows=3 width=0) (actu al time=37.675..37.675 rows=343 loops=1) Index Cond: (s |~ 'abcdefghijk'::text) Total runtime: 7059.788 ms (6 rows) Time: 7060,306 ms So, the index doesn't help much in this case. It's eliminating about half of the rows, leaving 343 rows to be rechecked, but the total runtime is about the same. However, sometimes I get a segfault with that query: #0 0x00007fede3715988 in checkConnectivity (partGraphInfo=0x4157da0) at trgm_regexp.c:2262 #1 0x00007fede371286f in gin_regexp_trgm_consistent (fcinfo=0x7fff2a00c230) at trgm_gin.c:402 #2 0x000000000083386e in FunctionCall10Coll (flinfo=0x36fd5f0, collation=100, arg1=36110032, arg2=5, arg3=37437664, arg4=13, arg5=36110288, arg6=36109673, arg7=36110144, arg8=36110000, arg9=36112304, arg10=36112448) at fmgr.c:1630 #3 0x0000000000472045 in callConsistentFn (ginstate=0x36fc158, key=0x226fd10) at ginget.c:55 #4 0x0000000000473f8b in keyGetItem (ginstate=0x36fc158, tempCtx=0x2b24a60, key=0x226fd10) at ginget.c:969 #5 0x000000000047416c in scanGetItem (scan=0x4157600, advancePast=0x7fff2a00c760, item=0x7fff2a00c760, recheck=0x7fff2a00c75f "\001\377\377\377\377\377\377") at ginget.c:1051 #6 0x0000000000475537 in gingetbitmap (fcinfo=0x7fff2a00c7b0) at ginget.c:1651 Attached is a dump of the test table I used above. Please note that the snort-rules are GPLv2-licensed, so they cannot be included in PostgreSQL e.g in regression tests. - Heikki
Attachment
On 13.06.2013 23:19, Alexander Korotkov wrote: > Hackers, > > Attached patch contains opclass which demonstrates advantages of GIN > additional information storing itself without other GIN improvements. It > implements inversed task of regex indexing. It works so: you create index > on regexes and search for regexes matched query string. It introduce two > additional operators |~ and *~ for case-sensetive and case-insensetive > regex to string matching, and gin_regexp_trgm_ops opclass. I'm going to step back from this patch and trigrams for a while, and ponder in general, how one would do indexing of regexps. The best approach surely depends a lot on the kinds of regexps and query strings involved. For example, how much common structure do the regexps share, how easily they can be decomposed into common and differing parts. Also, how long are the query strings being used, and how many regexps does it need to work efficiently with. The first thought that comes to my mind is to build a GiST index, where the leafs items are the regexps, in a precompiled form. The upper nodes are also pre-compiled regexps, which match everything that the child regexps contain. For example, if you have the regexps ".*foobar.*" and ".*foofoo.*" on a leaf page, the upper node might be ".*foo.*". In other words, the union function forms a regular expression that matches all the strings that any of the member regexps, and may match more. Implementing the union-function is probably the most difficult part of this approach. In literature, there is the Aho-Corasick algorithm that can be used to efficiently search for several substrings in a string in one go. I believe it can be extended to handle regexps. That does not fit nicely into existing GIN or GiST indexes, however, so that would need to be a whole new indexam. Also, I'm not sure how it scales as the number of regexps searched increses, and incremental updates might be tricky. - Heikki