Home > mailing lists

Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From	Erik Rijkers
Subject	Re: WIP: index support for regexp search
Date	January 19, 2012 19:34:19
Msg-id	1ce6dc4baa5e8293bd44ee1458b98d2c.squirrel@webmail.xs4all.nl Whole thread Raw
In response to	Re: WIP: index support for regexp search (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses	Re: WIP: index support for regexp search Re: WIP: index support for regexp search
List	pgsql-hackers

Tree view

On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote:
> On 22.11.2011 21:38, Alexander Korotkov wrote:
>> WIP patch with index support for regexp search for pg_trgm contrib is
>> attached.
>> In spite of techniques which extracts continuous text parts from regexp,
>> this patch presents technique of automatum transformation. That allows more
>> comprehensive trigrams extraction.
>
> Nice!
>

Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I found.

The patch yields spectacular speedups with small, simple-enough regexen.  But it does not do a
good enough job when guessing where to use the index and where fall back to Seq Scan.  This can
lead to (also spectacular) slow-downs, compared to Seq Scan.

I used the following to generate 3 test tables with lines of 80 random chars (just in case it's
handy for others to use):

$ cat create_data.sh
#!/bin/sh

for power in 4 5 6
do table=azjunk${power} index=${table}_trgmrgx_txt_01_idx echo "-- generating table $table with index $index"; time
perl-E' sub ss{ join"",@_[ map{rand @_} 1 .. shift ] }; say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z"))
for1 .. 1e'"${power};" \ | psql -aqXc "drop table if exists $table; create table $table(txt text); copy $table from
stdin;-- set session maintenance_work_mem='20GB'; create index $index on $table using gin(txt gin_trgm_ops); analyze
$table;";
done

\dt+ public.azjunk*                     List of relationsSchema |  Name   | Type  |   Owner   |  Size   | Description
--------+---------+-------+-----------+---------+-------------public | azjunk4 | table | breinbaas | 1152 kB |public |
azjunk5| table | breinbaas | 11 MB   |public | azjunk6 | table | breinbaas | 112 MB  |

(3 rows)

I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that
true?   I can understand you want that value to be low to limit the above risk, but now it reduces
the usability of the feature a bit: one has to split up larger char-classes into several smaller
ones to make a statement use the index: i.e.:            txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n'
instead of            txt ~ 'f[aeiouy]n'

I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and sure enough they
used the index for larger classes such as the above, but of course also got into problems easier
when quantifiers are added (*, +, {n,m}).

A better calculation to decide index-use seems necessary, and ideally one that allows for a larger
MAX_COLOR_CHARS than 4.  Especially quantifiers could perhaps be inspected wrt that decision.
IMHO, the functionality would still be very useful when only very simple regexen were considered.

Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to compile a test
version that allows it.  I don't know how hard that would be.

There is also a minor bug, I think, when running with  'set enable_seqscan=off'  in combination
with a too-large regex:

$ cat fail.sql:

set enable_bitmapscan=on;
set enable_seqscan=on;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- OK

set enable_bitmapscan=on;
set enable_seqscan=off;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- crashes

$ psql -f fail.sql                                                             QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual time=1.011..5.291

rows=131 loops=1)  Recheck Cond: (txt ~ 'f[aeio]n'::text)  ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
(cost=0.00..52.01rows=1 width=0) (actual

time=0.880..0.880 rows=131 loops=1)        Index Cond: (txt ~ 'f[aeio]n'::text)Total runtime: 5.700 ms
(5 rows)
                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------Seq Scan on
azjunk4 (cost=0.00..268.00 rows=1 width=81) (actual time=1.491..36.049 rows=164

loops=1)  Filter: (txt ~ 'f[aeiou]n'::text)  Rows Removed by Filter: 9836Total runtime: 36.112 ms
(4 rows)
                                                             QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual time=0.346..0.927

rows=131 loops=1)  Recheck Cond: (txt ~ 'f[aeio]n'::text)  ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
(cost=0.00..52.01rows=1 width=0) (actual

time=0.316..0.316 rows=131 loops=1)        Index Cond: (txt ~ 'f[aeio]n'::text)Total runtime: 0.996 ms
(5 rows)

psql:fail.sql:24: connection to server was lost

Sorry: I found the code, esp. the regex engine hard to understand sofar, so all the above are
somewhat inconclusive remarks; I hope nonetheless they are useful.

Thanks,

Erik Rijkers

pgsql-hackers by date:

From: Alvaro Herrera
Date: 19 January 2012, 19:30:29
Subject: Re: WIP -- renaming implicit sequences

From: Robert Haas
Date: 19 January 2012, 19:53:59
Subject: Re: WIP -- renaming implicit sequences

Re: WIP: index support for regexp search - Mailing list pgsql-hackers

Previous

Next