Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Erik Rijkers
Subject Re: WIP: index support for regexp search
Date
Msg-id 1ce6dc4baa5e8293bd44ee1458b98d2c.squirrel@webmail.xs4all.nl
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: WIP: index support for regexp search  (Marti Raudsepp <marti@juffo.org>)
Re: WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote:
> On 22.11.2011 21:38, Alexander Korotkov wrote:
>> WIP patch with index support for regexp search for pg_trgm contrib is
>> attached.
>> In spite of techniques which extracts continuous text parts from regexp,
>> this patch presents technique of automatum transformation. That allows more
>> comprehensive trigrams extraction.
>
> Nice!
>

Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I found.

The patch yields spectacular speedups with small, simple-enough regexen.  But it does not do a
good enough job when guessing where to use the index and where fall back to Seq Scan.  This can
lead to (also spectacular) slow-downs, compared to Seq Scan.

I used the following to generate 3 test tables with lines of 80 random chars (just in case it's
handy for others to use):

$ cat create_data.sh
#!/bin/sh

for power in 4 5 6
do table=azjunk${power} index=${table}_trgmrgx_txt_01_idx echo "-- generating table $table with index $index"; time
perl-E' sub ss{ join"",@_[ map{rand @_} 1 .. shift ] }; say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z"))
for1 .. 1e'"${power};" \ | psql -aqXc "drop table if exists $table; create table $table(txt text); copy $table from
stdin;-- set session maintenance_work_mem='20GB'; create index $index on $table using gin(txt gin_trgm_ops); analyze
$table;";
done


\dt+ public.azjunk*                     List of relationsSchema |  Name   | Type  |   Owner   |  Size   | Description
--------+---------+-------+-----------+---------+-------------public | azjunk4 | table | breinbaas | 1152 kB |public |
azjunk5| table | breinbaas | 11 MB   |public | azjunk6 | table | breinbaas | 112 MB  |
 
(3 rows)



I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that
true?   I can understand you want that value to be low to limit the above risk, but now it reduces
the usability of the feature a bit: one has to split up larger char-classes into several smaller
ones to make a statement use the index: i.e.:            txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n'
instead of            txt ~ 'f[aeiouy]n'



I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and sure enough they
used the index for larger classes such as the above, but of course also got into problems easier
when quantifiers are added (*, +, {n,m}).

A better calculation to decide index-use seems necessary, and ideally one that allows for a larger
MAX_COLOR_CHARS than 4.  Especially quantifiers could perhaps be inspected wrt that decision.
IMHO, the functionality would still be very useful when only very simple regexen were considered.

Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to compile a test
version that allows it.  I don't know how hard that would be.




There is also a minor bug, I think, when running with  'set enable_seqscan=off'  in combination
with a too-large regex:

$ cat fail.sql:

set enable_bitmapscan=on;
set enable_seqscan=on;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- OK

set enable_bitmapscan=on;
set enable_seqscan=off;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- crashes

$ psql -f fail.sql                                                             QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual time=1.011..5.291
 
rows=131 loops=1)  Recheck Cond: (txt ~ 'f[aeio]n'::text)  ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
(cost=0.00..52.01rows=1 width=0) (actual
 
time=0.880..0.880 rows=131 loops=1)        Index Cond: (txt ~ 'f[aeio]n'::text)Total runtime: 5.700 ms
(5 rows)
                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------Seq Scan on
azjunk4 (cost=0.00..268.00 rows=1 width=81) (actual time=1.491..36.049 rows=164
 
loops=1)  Filter: (txt ~ 'f[aeiou]n'::text)  Rows Removed by Filter: 9836Total runtime: 36.112 ms
(4 rows)
                                                             QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual time=0.346..0.927
 
rows=131 loops=1)  Recheck Cond: (txt ~ 'f[aeio]n'::text)  ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
(cost=0.00..52.01rows=1 width=0) (actual
 
time=0.316..0.316 rows=131 loops=1)        Index Cond: (txt ~ 'f[aeio]n'::text)Total runtime: 0.996 ms
(5 rows)

psql:fail.sql:24: connection to server was lost


Sorry: I found the code, esp. the regex engine hard to understand sofar, so all the above are
somewhat inconclusive remarks; I hope nonetheless they are useful.


Thanks,

Erik Rijkers






pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: WIP -- renaming implicit sequences
Next
From: Robert Haas
Date:
Subject: Re: WIP -- renaming implicit sequences