Re: WIP: index support for regexp search - Mailing list pgsql-hackers
From | Erik Rijkers |
---|---|
Subject | Re: WIP: index support for regexp search |
Date | |
Msg-id | 1ce6dc4baa5e8293bd44ee1458b98d2c.squirrel@webmail.xs4all.nl Whole thread Raw |
In response to | Re: WIP: index support for regexp search (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>) |
Responses |
Re: WIP: index support for regexp search
Re: WIP: index support for regexp search |
List | pgsql-hackers |
On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote: > On 22.11.2011 21:38, Alexander Korotkov wrote: >> WIP patch with index support for regexp search for pg_trgm contrib is >> attached. >> In spite of techniques which extracts continuous text parts from regexp, >> this patch presents technique of automatum transformation. That allows more >> comprehensive trigrams extraction. > > Nice! > Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I found. The patch yields spectacular speedups with small, simple-enough regexen. But it does not do a good enough job when guessing where to use the index and where fall back to Seq Scan. This can lead to (also spectacular) slow-downs, compared to Seq Scan. I used the following to generate 3 test tables with lines of 80 random chars (just in case it's handy for others to use): $ cat create_data.sh #!/bin/sh for power in 4 5 6 do table=azjunk${power} index=${table}_trgmrgx_txt_01_idx echo "-- generating table $table with index $index"; time perl-E' sub ss{ join"",@_[ map{rand @_} 1 .. shift ] }; say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for1 .. 1e'"${power};" \ | psql -aqXc "drop table if exists $table; create table $table(txt text); copy $table from stdin;-- set session maintenance_work_mem='20GB'; create index $index on $table using gin(txt gin_trgm_ops); analyze $table;"; done \dt+ public.azjunk* List of relationsSchema | Name | Type | Owner | Size | Description --------+---------+-------+-----------+---------+-------------public | azjunk4 | table | breinbaas | 1152 kB |public | azjunk5| table | breinbaas | 11 MB |public | azjunk6 | table | breinbaas | 112 MB | (3 rows) I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that true? I can understand you want that value to be low to limit the above risk, but now it reduces the usability of the feature a bit: one has to split up larger char-classes into several smaller ones to make a statement use the index: i.e.: txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n' instead of txt ~ 'f[aeiouy]n' I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and sure enough they used the index for larger classes such as the above, but of course also got into problems easier when quantifiers are added (*, +, {n,m}). A better calculation to decide index-use seems necessary, and ideally one that allows for a larger MAX_COLOR_CHARS than 4. Especially quantifiers could perhaps be inspected wrt that decision. IMHO, the functionality would still be very useful when only very simple regexen were considered. Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is normally necessary for perfomance reasons, but it would be useful te be able to compile a test version that allows it. I don't know how hard that would be. There is also a minor bug, I think, when running with 'set enable_seqscan=off' in combination with a too-large regex: $ cat fail.sql: set enable_bitmapscan=on; set enable_seqscan=on; explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- OK set enable_bitmapscan=on; set enable_seqscan=off; explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- crashes $ psql -f fail.sql QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=1.011..5.291 rows=131 loops=1) Recheck Cond: (txt ~ 'f[aeio]n'::text) -> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01rows=1 width=0) (actual time=0.880..0.880 rows=131 loops=1) Index Cond: (txt ~ 'f[aeio]n'::text)Total runtime: 5.700 ms (5 rows) QUERY PLAN -------------------------------------------------------------------------------------------------------Seq Scan on azjunk4 (cost=0.00..268.00 rows=1 width=81) (actual time=1.491..36.049 rows=164 loops=1) Filter: (txt ~ 'f[aeiou]n'::text) Rows Removed by Filter: 9836Total runtime: 36.112 ms (4 rows) QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------Bitmap HeapScan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=0.346..0.927 rows=131 loops=1) Recheck Cond: (txt ~ 'f[aeio]n'::text) -> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01rows=1 width=0) (actual time=0.316..0.316 rows=131 loops=1) Index Cond: (txt ~ 'f[aeio]n'::text)Total runtime: 0.996 ms (5 rows) psql:fail.sql:24: connection to server was lost Sorry: I found the code, esp. the regex engine hard to understand sofar, so all the above are somewhat inconclusive remarks; I hope nonetheless they are useful. Thanks, Erik Rijkers
pgsql-hackers by date: