Re: Index on regexes - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Index on regexes |
Date | |
Msg-id | 51C98700.2020202@vmware.com Whole thread Raw |
In response to | Index on regexes (Alexander Korotkov <aekorotkov@gmail.com>) |
List | pgsql-hackers |
On 13.06.2013 23:19, Alexander Korotkov wrote: > Hackers, > > Attached patch contains opclass which demonstrates advantages of GIN > additional information storing itself without other GIN improvements. It > implements inversed task of regex indexing. It works so: you create index > on regexes and search for regexes matched query string. It introduce two > additional operators |~ and *~ for case-sensetive and case-insensetive > regex to string matching, and gin_regexp_trgm_ops opclass. Ok, I've read the patch and I think I understand it now. Clever approach. The idea is to first build an NFA of trigrams from each indexed regular expression, just like we do in the reverse case where you index strings, and the query is a single regexp. The extracted trigrams are inserted into the gin index. That already helps significantly in certain kinds of applications. In fact, I wonder if we should do just that, and leave out the information stored in the additional info. There are pros and cons - you get a smaller index, but need to recheck more rows. But here's the clever part: in addition to just indexing all the trigrams in each regexp, fragments of the trigram graph are stored along with each item in the posting lists. In the graph, each arc is labeled with a trigram, and a string can match the regexp only if there is a path through the graph following only arcs that match trigrams that are present in the graph. In the posting list of trigram abc, you store all the arcs in the graph labelled with 'abc'. In the consistent function, you have access to all the arcs labelled with trigrams that are present in the string. The function assembles a graph from those arcs, and then checks if there is a path from initial to final state. > Let's consider some example. > > At first, generate some regexes. > > CREATE OR REPLACE FUNCTION generate_string(int, int) RETURNS text AS $$ > SELECT array_to_string(ARRAY(SELECT chr((97 + random() * 10) :: integer) > FROM generate_series(1,($1 + random()*$2)::int)), ''); > $$ > LANGUAGE sql; > > CREATE TABLE test AS select '(' || generate_string(1,4) || '|' || > generate_string(1,4) || '|' || generate_string(1,4) || ')' || > generate_string(1,2) || '(' || generate_string(1,4) || '|' || > generate_string(1,4) || '|' || generate_string(1,4) || ')' AS s FROM > generate_series(1,1000000); > > I use only 10 characters on alphabet in order to have better chance of > matching. It generate some regexes like so: > > postgres=# SELECT * FROM test LIMIT 10; > s > ------------------------------------ > (g|cij|ah)jg(iei|hfc|eef) > (gbfdb|ehbg|akf)ge(bc|jgee|jidd) > (jedc|kgc|c)bc(ii|bji|iebc) > (aa|eie|bgdb)f(fc|he|f) > (b|ijc|ae)ae(eccb|ie|kjf) > (bib|igf|kdibf)fij(gcbh|efi|fidj) > (bkejf|jfdhg|gbfe)bhb(bedj|hh|ggg) > (kfb|egccd|iefce)jf(kj|jbef|kbc) > (bhh|c|cd)cb(h|ed|jg) > (id|j|geg)gc(djif|ai|cjjjc) > (10 rows) > > Without index search takes about 10 seconds. > > postgres=# explain analyze select * from test where s |~ 'abcdefghijkl'; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------- > Seq Scan on test (cost=0.00..19929.00 rows=5000 width=28) (actual > time=172.990..97357.312 rows=438 loops=1) > Filter: (s |~ 'abcdefghijkl'::text) > Rows Removed by Filter: 999562 > Total runtime: 97357.490 ms > (4 rows) That's almost 100 seconds. > And with index it takes only 110 milliseconds. > > postgres=# explain analyze select * from test where s |~ 'abcdefghijkl'; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on test (cost=182.75..7245.94 rows=5000 width=28) > (actual time=68.143..110.663 rows=438 loops=1) > Recheck Cond: (s |~ 'abcdefghijkl'::text) > -> Bitmap Index Scan on test_idx (cost=0.00..181.50 rows=5000 width=0) > (actual time=67.929..67.929 rows=438 loops=1) > Index Cond: (s |~ 'abcdefghijkl'::text) > Total runtime: 110.870 ms > (5 rows) That's an impressive result. Now, the next question is, is this a realistic use of this feature? Some more numbers from the above test: * The index in above example is about 180MB in size, while the table is 60MB. Without the graph fragments stored in the "additional info", the index is about 120MB * Building the index takes about 370 seconds, or 0.37 ms per row. I wonder what kind of data sets this feature is likely to be used in practice? I googled around and found one data set: rules from the Snort intrusion detection system. I downloaded the community rules from http://www.snort.org/snort-rules/. They look very different from the kind of regexps you tested with; the rules contain a lot of strange characters, so the trigram extraction is not as effective. I had to modify some of the rules because PostgreSQL regexp engine only supports up to 255 repetitions with {XXX} syntax, and I also removed all the modifiers governing case-sensitivity and some other things. I ended up with 674 regular expressions. Here's a comparison of query speed: postgres=# explain analyze select * from t where s |~ 'abcdefghijk'; QUERY PLAN ------------------------------------------------------------------------------- --------------------- Seq Scan on t (cost=0.00..22.43 rows=3 width=123) (actual time=7534.185..7534 .185 rows=0 loops=1) Filter: (s |~ 'abcdefghijk'::text) Rows Removed by Filter: 674 Total runtime: 7534.210 ms (4 rows) Time: 7534,620 ms postgres=# explain analyze select * from t where s |~ 'abcdefghijk'; QUERY PLAN ------------------------------------------------------------------------------- ----------------------------------------- Bitmap Heap Scan on t (cost=108.03..115.90 rows=3 width=123) (actual time=705 9.735..7059.735 rows=0 loops=1) Recheck Cond: (s |~ 'abcdefghijk'::text) Rows Removed by Index Recheck: 343 -> Bitmap Index Scan on t_gin_idx (cost=0.00..108.03 rows=3 width=0) (actu al time=37.675..37.675 rows=343 loops=1) Index Cond: (s |~ 'abcdefghijk'::text) Total runtime: 7059.788 ms (6 rows) Time: 7060,306 ms So, the index doesn't help much in this case. It's eliminating about half of the rows, leaving 343 rows to be rechecked, but the total runtime is about the same. However, sometimes I get a segfault with that query: #0 0x00007fede3715988 in checkConnectivity (partGraphInfo=0x4157da0) at trgm_regexp.c:2262 #1 0x00007fede371286f in gin_regexp_trgm_consistent (fcinfo=0x7fff2a00c230) at trgm_gin.c:402 #2 0x000000000083386e in FunctionCall10Coll (flinfo=0x36fd5f0, collation=100, arg1=36110032, arg2=5, arg3=37437664, arg4=13, arg5=36110288, arg6=36109673, arg7=36110144, arg8=36110000, arg9=36112304, arg10=36112448) at fmgr.c:1630 #3 0x0000000000472045 in callConsistentFn (ginstate=0x36fc158, key=0x226fd10) at ginget.c:55 #4 0x0000000000473f8b in keyGetItem (ginstate=0x36fc158, tempCtx=0x2b24a60, key=0x226fd10) at ginget.c:969 #5 0x000000000047416c in scanGetItem (scan=0x4157600, advancePast=0x7fff2a00c760, item=0x7fff2a00c760, recheck=0x7fff2a00c75f "\001\377\377\377\377\377\377") at ginget.c:1051 #6 0x0000000000475537 in gingetbitmap (fcinfo=0x7fff2a00c7b0) at ginget.c:1651 Attached is a dump of the test table I used above. Please note that the snort-rules are GPLv2-licensed, so they cannot be included in PostgreSQL e.g in regression tests. - Heikki
Attachment
pgsql-hackers by date: