Thread: WIP: index support for regexp search

WIP: index support for regexp search

From
Alexander Korotkov
Date:
Hackers,

WIP patch with index support for regexp search for pg_trgm contrib is attached.
In spite of techniques which extracts continuous text parts from regexp, this patch presents technique of automatum transformation. That allows more comprehensive trigrams extraction.

A little example of possible perfomance benefit.

test=# explain analyze select * from words where s ~ 'a[bc]+[de]';
                                              QUERY PLAN                                            
   
----------------------------------------------------------------------------------------------------
---
 Seq Scan on words  (cost=0.00..1703.11 rows=10 width=9) (actual time=3.092..242.303 rows=662 loops=1)
   Filter: (s ~ 'a[bc]+[de]'::text)
   Rows Removed by Filter: 97907
 Total runtime: 243.213 ms
(4 rows)

test=# explain analyze select * from words where s ~ 'a[bc]+[de]';
                                                         QUERY PLAN                                 
                        
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on words  (cost=260.08..295.83 rows=10 width=9) (actual time=4.166..7.506 rows=662 loops=1)
   Recheck Cond: (s ~ 'a[bc]+[de]'::text)
   Rows Removed by Index Recheck: 18
   ->  Bitmap Index Scan on words_trgm_idx  (cost=0.00..260.07 rows=10 width=0) (actual time=4.076..4.076 rows=680 loops=1)
         Index Cond: (s ~ 'a[bc]+[de]'::text)
 Total runtime: 8.424 ms
(6 rows)

Current version of patch have some limitations:
1) Algorithm of logical expression extraction on trigrams have high computational complexity. So, it can become really slow on regexp with many branches. Probably, improvements of this algorithm is possible.
2) Surely, no perfomance benefit if no trigrams can be extracted from regexp. It's inevitably.
3) Currently, only GIN index is supported. There are no serious problems, GiST code for it just not written yet.
4) It appear to be some kind of problem to extract multibyte encoded character from pg_wchar. I've posted question about it here:
While I've hardcoded some dirty solution. So PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not supported yet.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
Robert Haas
Date:
On Tue, Nov 22, 2011 at 2:38 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> WIP patch with index support for regexp search for pg_trgm contrib is
> attached.
> In spite of techniques which extracts continuous text parts from regexp,
> this patch presents technique of automatum transformation. That allows more
> comprehensive trigrams extraction.

Please add this patch here so it does not get lost in the shuffle:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Thu, Dec 1, 2011 at 12:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Please add this patch here so it does not get lost in the shuffle:

https://commitfest.postgresql.org/action/commitfest_view/open

Done.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Heikki Linnakangas
Date:
On 22.11.2011 21:38, Alexander Korotkov wrote:
> WIP patch with index support for regexp search for pg_trgm contrib is
> attached.
> In spite of techniques which extracts continuous text parts from regexp,
> this patch presents technique of automatum transformation. That allows more
> comprehensive trigrams extraction.

Nice!

> Current version of patch have some limitations:
> 1) Algorithm of logical expression extraction on trigrams have high
> computational complexity. So, it can become really slow on regexp with many
> branches. Probably, improvements of this algorithm is possible.
> 2) Surely, no perfomance benefit if no trigrams can be extracted from
> regexp. It's inevitably.
> 3) Currently, only GIN index is supported. There are no serious problems,
> GiST code for it just not written yet.
> 4) It appear to be some kind of problem to extract multibyte encoded
> character from pg_wchar. I've posted question about it here:
> http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
> While I've hardcoded some dirty solution. So
> PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
> supported yet.

This is pretty far from being in committable state, so I'm going to mark 
this as "returned with feedback" in the commitfest app. The feedback:

The code badly needs comments. There is no explanation of how the 
trigram extraction code in trgm_regexp.c works. Guessing from the 
variable names, it seems to be some sort of a coloring algorithm that 
works on a graph, but that all needs to be explained. Can this algorithm 
be found somewhere in literature, perhaps? A link to a paper would be nice.

Apart from that, the multibyte issue seems like the big one. Any way 
around that?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
The code badly needs comments. There is no explanation of how the trigram extraction code in trgm_regexp.c works.
Sure. I hoped to find a time for comments before commitfest starts. Unfortunately I didn't, sorry.
 
Guessing from the variable names, it seems to be some sort of a coloring algorithm that works on a graph, but that all needs to be explained. Can this algorithm be found somewhere in literature, perhaps? A link to a paper would be nice.
I hope it's truly novel. At least application to regular expressions. I'm going to write a paper about it.
 
Apart from that, the multibyte issue seems like the big one. Any way around that?
Conversion of pg_wchar to multibyte character is the only way I found to avoid serious hacking of existing regexp code. Do you think additional function in pg_wchar_tbl which converts pg_wchar back to multibyte character is possible solution?

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
I also have a question about pg_wchar.

/*
 *-------------------------------------------------------------------
 * encoding info table
 * XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
 *-------------------------------------------------------------------
 */
pg_wchar_tbl pg_wchar_table[] = {
{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen, pg_ascii_verifier, 1}, /* PG_SQL_ASCII */
{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JP */
{pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2}, /* PG_EUC_CN */
{pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3}, /* PG_EUC_KR */
{pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4}, /* PG_EUC_TW */
{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */
{pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /* PG_UTF8 */
{pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4}, /* PG_MULE_INTERNAL */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN1 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN2 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN3 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN4 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN5 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN6 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN7 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN8 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN9 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_LATIN10 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1256 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1258 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN866 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN874 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8R */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1251 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1252 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-5 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-6 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-7 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* ISO-8859-8 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1250 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1253 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1254 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1255 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_WIN1257 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* PG_KOI8U */
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */
{0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */
{0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* PG_GBK */
{0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* PG_UHC */
{0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /* PG_GB18030 */
{0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* PG_JOHAB */
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /* PG_SHIFT_JIS_2004 */
};

What does last 7 zeros in the first column means? No conversion to pg_wchar is possible from these encodings?

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Fri, Jan 20, 2012 at 1:07 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
What does last 7 zeros in the first column means? No conversion to pg_wchar is possible from these encodings?
Uh, I see. These encodings is not supported as server encodings. 

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
"Erik Rijkers"
Date:
On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote:
> On 22.11.2011 21:38, Alexander Korotkov wrote:
>> WIP patch with index support for regexp search for pg_trgm contrib is
>> attached.
>> In spite of techniques which extracts continuous text parts from regexp,
>> this patch presents technique of automatum transformation. That allows more
>> comprehensive trigrams extraction.
>
> Nice!
>

Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I found.

The patch yields spectacular speedups with small, simple-enough regexen.  But it does not do a
good enough job when guessing where to use the index and where fall back to Seq Scan.  This can
lead to (also spectacular) slow-downs, compared to Seq Scan.

I used the following to generate 3 test tables with lines of 80 random chars (just in case it's
handy for others to use):

$ cat create_data.sh
#!/bin/sh

for power in 4 5 6
do table=azjunk${power} index=${table}_trgmrgx_txt_01_idx echo "-- generating table $table with index $index"; time
perl-E' sub ss{ join"",@_[ map{rand @_} 1 .. shift ] }; say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z"))
for1 .. 1e'"${power};" \ | psql -aqXc "drop table if exists $table; create table $table(txt text); copy $table from
stdin;-- set session maintenance_work_mem='20GB'; create index $index on $table using gin(txt gin_trgm_ops); analyze
$table;";
done


\dt+ public.azjunk*                     List of relationsSchema |  Name   | Type  |   Owner   |  Size   | Description
--------+---------+-------+-----------+---------+-------------public | azjunk4 | table | breinbaas | 1152 kB |public |
azjunk5| table | breinbaas | 11 MB   |public | azjunk6 | table | breinbaas | 112 MB  |
 
(3 rows)



I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that
true?   I can understand you want that value to be low to limit the above risk, but now it reduces
the usability of the feature a bit: one has to split up larger char-classes into several smaller
ones to make a statement use the index: i.e.:            txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n'
instead of            txt ~ 'f[aeiouy]n'



I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and sure enough they
used the index for larger classes such as the above, but of course also got into problems easier
when quantifiers are added (*, +, {n,m}).

A better calculation to decide index-use seems necessary, and ideally one that allows for a larger
MAX_COLOR_CHARS than 4.  Especially quantifiers could perhaps be inspected wrt that decision.
IMHO, the functionality would still be very useful when only very simple regexen were considered.

Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to compile a test
version that allows it.  I don't know how hard that would be.




There is also a minor bug, I think, when running with  'set enable_seqscan=off'  in combination
with a too-large regex:

$ cat fail.sql:

set enable_bitmapscan=on;
set enable_seqscan=on;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- OK

set enable_bitmapscan=on;
set enable_seqscan=off;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- crashes

$ psql -f fail.sql                                                             QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual time=1.011..5.291
 
rows=131 loops=1)  Recheck Cond: (txt ~ 'f[aeio]n'::text)  ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
(cost=0.00..52.01rows=1 width=0) (actual
 
time=0.880..0.880 rows=131 loops=1)        Index Cond: (txt ~ 'f[aeio]n'::text)Total runtime: 5.700 ms
(5 rows)
                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------Seq Scan on
azjunk4 (cost=0.00..268.00 rows=1 width=81) (actual time=1.491..36.049 rows=164
 
loops=1)  Filter: (txt ~ 'f[aeiou]n'::text)  Rows Removed by Filter: 9836Total runtime: 36.112 ms
(4 rows)
                                                             QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual time=0.346..0.927
 
rows=131 loops=1)  Recheck Cond: (txt ~ 'f[aeio]n'::text)  ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx
(cost=0.00..52.01rows=1 width=0) (actual
 
time=0.316..0.316 rows=131 loops=1)        Index Cond: (txt ~ 'f[aeio]n'::text)Total runtime: 0.996 ms
(5 rows)

psql:fail.sql:24: connection to server was lost


Sorry: I found the code, esp. the regex engine hard to understand sofar, so all the above are
somewhat inconclusive remarks; I hope nonetheless they are useful.


Thanks,

Erik Rijkers






Re: WIP: index support for regexp search

From
Marti Raudsepp
Date:
On Fri, Jan 20, 2012 at 01:33, Erik Rijkers <er@xs4all.nl> wrote:
> Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
> normally necessary for perfomance reasons, but it would be useful te be able to compile a test
> version that allows it.

I believe being interruptible is a requirement for the patch to be accepted.

CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops.

Regards,
Marti


Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
Hi!

Thank you for your feedback!

On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers <er@xs4all.nl> wrote:
The patch yields spectacular speedups with small, simple-enough regexen.  But it does not do a
good enough job when guessing where to use the index and where fall back to Seq Scan.  This can
lead to (also spectacular) slow-downs, compared to Seq Scan.
Could you give some examples of regexes where index scan becomes slower than seq scan?
 
I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that
true?   I can understand you want that value to be low to limit the above risk, but now it reduces
the usability of the feature a bit: one has to split up larger char-classes into several smaller
ones to make a statement use the index: i.e.:
Yes, MAX_COLOR_CHARS is number of maximum character in automata color when that color is divided to a separated characters. And it's likely there could be better solution than just have this hard limit.
 
Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to compile a test
version that allows it.  I don't know how hard that would be.
I seems that Ctrl-C was impossible because procedure of trigrams exctraction becomes so long while it is not breakable. It's not difficult to make this procedure breakable, but actually it just shouldn't take so long.
 
There is also a minor bug, I think, when running with  'set enable_seqscan=off'  in combination
with a too-large regex:
Thanks for pointing. Will be fixed.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Fri, Jan 20, 2012 at 8:45 PM, Marti Raudsepp <marti@juffo.org> wrote:
On Fri, Jan 20, 2012 at 01:33, Erik Rijkers <er@xs4all.nl> wrote:
> Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
> normally necessary for perfomance reasons, but it would be useful te be able to compile a test
> version that allows it.

I believe being interruptible is a requirement for the patch to be accepted.

CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops.
Sure. It's easy to fix. But it seems that in this case gin extract_query method becomes slow (because index scan itself is breakable). So, it just shouldn't work so long. 

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Fri, Jan 20, 2012 at 12:54 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
Apart from that, the multibyte issue seems like the big one. Any way around that?
Conversion of pg_wchar to multibyte character is the only way I found to avoid serious hacking of existing regexp code. Do you think additional function in pg_wchar_tbl which converts pg_wchar back to multibyte character is possible solution?
Do you have any notes on it? I could make the patch which adds such function into core.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
"Erik Rijkers"
Date:
On Sat, January 21, 2012 06:26, Alexander Korotkov wrote:
> Hi!
>
> Thank you for your feedback!
>
> On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers <er@xs4all.nl> wrote:
>
>> The patch yields spectacular speedups with small, simple-enough regexen.
>>  But it does not do a
>> good enough job when guessing where to use the index and where fall back
>> to Seq Scan.  This can
>> lead to (also spectacular) slow-downs, compared to Seq Scan.
>>
> Could you give some examples of regexes where index scan becomes slower
> than seq scan?
>


x[aeio]+q takes many minutes, uninterruptible.  It's now running for almost 30 minutes, I'm sure
it will come back eventually, but I think I'll kill it later on; I suppose you get the point  ;-)

Even with {n,m} quantifiers it's easy to hit slowdowns:
  (table azjunk6 is 112 MB, the index 693 MB.)  (MAX_COLOR_CHARS=4  <= your original patch)


instance           table    regex                  plan               time

HEAD               azjunk6  x[aeio]{1,3}q          Seq Scan           3566.088 ms
HEAD               azjunk6  x[aeio]{1,3}q          Seq Scan           3540.606 ms
HEAD               azjunk6  x[aeio]{1,3}q          Seq Scan           3495.034 ms
HEAD               azjunk6  x[aeio]{1,3}q          Seq Scan           3510.403 ms
trgm_regex         azjunk6  x[aeio]{1,3}q          Bitmap Heap Scan   3724.131 ms
trgm_regex         azjunk6  x[aeio]{1,3}q          Bitmap Heap Scan   3844.999 ms
trgm_regex         azjunk6  x[aeio]{1,3}q          Bitmap Heap Scan   3835.190 ms
trgm_regex         azjunk6  x[aeio]{1,3}q          Bitmap Heap Scan   3724.016 ms


HEAD               azjunk6  x[aeio]{1,4}q          Seq Scan           3617.997 ms
HEAD               azjunk6  x[aeio]{1,4}q          Seq Scan           3644.215 ms
HEAD               azjunk6  x[aeio]{1,4}q          Seq Scan           3636.976 ms
HEAD               azjunk6  x[aeio]{1,4}q          Seq Scan           3625.493 ms
trgm_regex         azjunk6  x[aeio]{1,4}q          Bitmap Heap Scan   7885.247 ms
trgm_regex         azjunk6  x[aeio]{1,4}q          Bitmap Heap Scan   8799.082 ms
trgm_regex         azjunk6  x[aeio]{1,4}q          Bitmap Heap Scan   7754.152 ms
trgm_regex         azjunk6  x[aeio]{1,4}q          Bitmap Heap Scan   7721.332 ms


This is with your patch as is; in instances compiled with higher MAX_COLOR_CHARS (I did 6 and 9),
it is of course even easier to dream up a slow regex...



Erik Rijkers







Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
Hi!

New version of patch is attached. Changes are following:
1) Right way to convert from pg_wchar to multibyte.
2) Optimization of producing CFNA-like graph on trigrams (produce smaller, but equivalent, graphs in less time).
3) Comments and refactoring.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
"Erik Rijkers"
Date:
On Mon, November 19, 2012 22:58, Alexander Korotkov wrote:

> New version of patch is attached.

Hi Alexander,

I get some compile-errors:

(Centos 6.3, Linux 2.6.32-279.14.1.el6.x86_64 GNU/Linux, gcc (GCC) 4.7.2)

make contrib
trgm_regexp.c:73:2: error: unknown type name ‘TrgmStateKey’
make[1]: *** [trgm_regexp.o] Error 1
make: *** [all-pg_trgm-recurse] Error 2
trgm_regexp.c:73:2: error: unknown type name ‘TrgmStateKey’
make[1]: *** [trgm_regexp.o] Error 1
make: *** [install-pg_trgm-recurse] Error 2


Did I forget something?

Thanks,

Erik Rijkers








Re: WIP: index support for regexp search

From
Tomas Vondra
Date:
On 19.11.2012 22:58, Alexander Korotkov wrote:
> Hi!
>
> New version of patch is attached. Changes are following:
> 1) Right way to convert from pg_wchar to multibyte.
> 2) Optimization of producing CFNA-like graph on trigrams (produce
> smaller, but equivalent, graphs in less time).
> 3) Comments and refactoring.

Hi,

thanks for the updated message-id. I've done the initial review:

1) Patch applies fine to the master.

2) It's common to use upper-case names for macros, but trgm.h defines  macro "iswordchr" - I see it's moved from
trgm_op.cbut maybe we  could make it a bit more correct? 

3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each  other in trgm.h - maybe it'd be a good idea to
jointhem? 

4) The two new method prototypes at the end of trgm.h use different  indendation than the rest (spaces only instead of
tabs).

5) There are no regression tests / updated docs (yet).

6) It does not compile - I do get a bunch of errors like this

trgm_regexp.c:73:2: error: expected specifier-qualifier-list before
‘TrgmStateKey’
trgm_regexp.c: In function ‘addKeys’:
trgm_regexp.c:291:24: error: ‘TrgmState’ has no member named ‘keys’
trgm_regexp.c:304:10: error: ‘TrgmState’ has no member named ‘keys’
...

It seems this is cause by the order of typedefs in trgm_regexp.c, namely
TrgmState referencing TrgmStateKey before it's defined. Moving the
TrgmStateKey before TrgmState fixed the issue (I'm using gcc-4.5.4  but
I think it's not compiler-dependent.)

7) Once fixed, it seems to work

CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test      SELECT md5(i::text) FROM generate_series(1,1000000) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;

EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';

                          QUERY PLAN
---------------------------------------------------------------------Bitmap Heap Scan on test  (cost=16.77..385.16
rows=100width=33)  Recheck Cond: (val ~ '.*qqq.*'::text)  ->  Bitmap Index Scan on trgm_idx  (cost=0.00..16.75 rows=100
                                    width=0)        Index Cond: (val ~ '.*qqq.*'::text) 
(4 rows)

but I do get a bunch of NOTICE messages with debugging info (no matter
if the GIN index is used or not, so it's somewhere in the common regexp
code). But I guess that's due to WIP status.

regards
Tomas



Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
Some quick comments.

On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
6) It does not compile - I do get a bunch of errors like this
Fixed. 

7) Once fixed, it seems to work

CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
       SELECT md5(i::text) FROM generate_series(1,1000000) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;

EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';


                           QUERY PLAN
---------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=16.77..385.16 rows=100 width=33)
   Recheck Cond: (val ~ '.*qqq.*'::text)
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..16.75 rows=100
                                       width=0)
         Index Cond: (val ~ '.*qqq.*'::text)
(4 rows)

but I do get a bunch of NOTICE messages with debugging info (no matter
if the GIN index is used or not, so it's somewhere in the common regexp
code). But I guess that's due to WIP status.
It's due to TRGM_REGEXP_DEBUG macro. I disabled it by default. But I think pieces of code hidden by that macro could be useful for debug even after WIP status.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
Heikki Linnakangas
Date:
Glad to see this patch hasn't been totally forgotten. Being able to use 
indexes for regular expressions would be really cool!

Back in January, I asked for some high-level description of how the 
algorithm works 
(http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com). 
That's still sorely needed. Googling around, I found the slides for your 
presentation on this from PGConf.EU - it would be great to have the 
information from that presentation included in the patch.

- Heikki



Re: WIP: index support for regexp search

From
Pavel Stehule
Date:
Hello

I tested it now and it working very well!

tested utf8, case sensitive, case insensitive

Regards

Pavel Stehule


2012/11/20 Heikki Linnakangas <hlinnakangas@vmware.com>:
> Glad to see this patch hasn't been totally forgotten. Being able to use
> indexes for regular expressions would be really cool!
>
> Back in January, I asked for some high-level description of how the
> algorithm works
> (http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com).
> That's still sorely needed. Googling around, I found the slides for your
> presentation on this from PGConf.EU - it would be great to have the
> information from that presentation included in the patch.
>
> - Heikki
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
2) It's common to use upper-case names for macros, but trgm.h defines
   macro "iswordchr" - I see it's moved from trgm_op.c but maybe we
   could make it a bit more correct?

3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each
   other in trgm.h - maybe it'd be a good idea to join them?

4) The two new method prototypes at the end of trgm.h use different
   indendation than the rest (spaces only instead of tabs).
These issues are fixed in attached patch.
Additionally 3 new macros are introduced: MAX_RESULT_STATES, MAX_RESULT_ARCS, MAX_RESULT_PATHS. They are limiting resources usage during regex processing.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
Pavel Stehule
Date:
hello

do you plan to support GiST?

Regards

Pavel

2012/11/20 Alexander Korotkov <aekorotkov@gmail.com>:
> On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>> 2) It's common to use upper-case names for macros, but trgm.h defines
>>    macro "iswordchr" - I see it's moved from trgm_op.c but maybe we
>>    could make it a bit more correct?
>>
>> 3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each
>>    other in trgm.h - maybe it'd be a good idea to join them?
>>
>> 4) The two new method prototypes at the end of trgm.h use different
>>    indendation than the rest (spaces only instead of tabs).
>
> These issues are fixed in attached patch.
> Additionally 3 new macros are introduced: MAX_RESULT_STATES,
> MAX_RESULT_ARCS, MAX_RESULT_PATHS. They are limiting resources usage during
> regex processing.
>
> ------
> With best regards,
> Alexander Korotkov.
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:

On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Glad to see this patch hasn't been totally forgotten. Being able to use indexes for regular expressions would be really cool!

Back in January, I asked for some high-level description of how the algorithm works (http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com). That's still sorely needed. Googling around, I found the slides for your presentation on this from PGConf.EU - it would be great to have the information from that presentation included in the patch.

New version of patch is attached. The changes are following:
1) A big comment with high-level description of what is going on.
2) Regression tests.
3) Documetation update.
4) Some more refactoring.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
Hi!

On Wed, Nov 21, 2012 at 12:51 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
do you plan to support GiST?

At first, I would note that pg_trgm GiST opclass is quite ridiculous for support regex search (and, actually for LIKE/ILIKE search which is already implemented too). Because in GiST opclass we store set of trigrams in leaf pages. In was designed for trigram similarity search and have sense for it because of elimination of trigram set computation. But for regex or LIKE/ILIKE search this representation is both lossy and bigger than just original string. Probably we could think about another opclass for GiST focusing on regex and LIKE/ILIKE search?

However, amyway I can create additional patch for current GiST opclass.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Heikki Linnakangas
Date:
On 25.11.2012 22:55, Alexander Korotkov wrote:
> On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas<hlinnakangas@vmware.com
>> wrote:
>
>> Glad to see this patch hasn't been totally forgotten. Being able to use
>> indexes for regular expressions would be really cool!
>>
>> Back in January, I asked for some high-level description of how the
>> algorithm works (http://archives.postgresql.**
>>
org/message-id/4F187D5C.30701@**enterprisedb.com<http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com>).
>> That's still sorely needed. Googling around, I found the slides for your
>> presentation on this from PGConf.EU - it would be great to have the
>> information from that presentation included in the patch.
>
>
> New version of patch is attached. The changes are following:
> 1) A big comment with high-level description of what is going on.
> 2) Regression tests.
> 3) Documetation update.
> 4) Some more refactoring.

Great, that top-level comment helped tremendously! I feel enlightened.

I fixed some spelling, formatting etc. trivial stuff while reading
through the patch, see attached. Below is some feedback on the details:

* I don't like the PG_TRY/CATCH trick. It's not generally safe to catch
an error, without propagating it further or rolling back the whole
(sub)transation. It might work in this case, as you're only suppressing
errors with the special sqlcode that are used in the same file, but it
nevertheless feels naughty. I believe none of the limits that are being
checked are strict; it's OK to exceed the limits somewhat, as long as
you terminate the processing in a reasonable time, in case of
pathological input. I'd suggest putting an explicit check for the limits
somewhere, and not rely on ereport(). Something like this, in the code
that recurses:

if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
     hash_get_num_entries(trgmCNFA->states) > MAX_RESULT_STATES)
{
    trgmCNFA->overflowed = true;
    return;
}

And then check for the overflowed flag at the top level.

* This part of the high-level comment was not clear to me:

>  * States of the graph produced in the first stage are marked with "keys". Key is a pair
>  * of a "prefix" and a state of the original automaton. "Prefix" is a last
>  * characters. So, knowing the prefix is enough to know what is a trigram when we read some new
>  * character. However, we can know single character of prefix or don't know any
>  * characters of it. Each state of resulting graph have an "enter key" (with that
>  * key we've entered this state) and a set of keys which are reachable without
>  * reading any predictable trigram. The algorithm of processing each state
>  * of resulting graph are so:
>  * 1) Add all keys which achievable without reading of any predictable trigram.
>  * 2) Add outgoing arcs labeled with trigrams.
>  * Step 2 leads to creation of new states and recursively processing them. So,
>  * we use depth-first algorithm.

I didn't understand that. Can you elaborate? It might help to work
through an example, with some ascii art depicting the graph.

* It would be nice to add some comments to TrgmCNFA struct, explaining
which fields are valid at which stages. For example, it seems that
'trgms' array is calculated only after building the CNFA, by
getTrgmVector() function, while arcsCount is updated on the fly, while
recursing in the getState() function.

* What is the representation used for the path matrix? Needs a comment.

* What do the getColorinfo() and scanColorMap() functions do? What
exactly does a color represent? What's the tradeoff in choosing
MAX_COLOR_CHARS?

- Heikki

Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Mon, Nov 26, 2012 at 4:55 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Great, that top-level comment helped tremendously! I feel enlightened.

I fixed some spelling, formatting etc. trivial stuff while reading through the patch, see attached. Below is some feedback on the details:

* I don't like the PG_TRY/CATCH trick. It's not generally safe to catch an error, without propagating it further or rolling back the whole (sub)transation. It might work in this case, as you're only suppressing errors with the special sqlcode that are used in the same file, but it nevertheless feels naughty. I believe none of the limits that are being checked are strict; it's OK to exceed the limits somewhat, as long as you terminate the processing in a reasonable time, in case of pathological input. I'd suggest putting an explicit check for the limits somewhere, and not rely on ereport(). Something like this, in the code that recurses:

if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
    hash_get_num_entries(trgmCNFA->states) > MAX_RESULT_STATES)
{
        trgmCNFA->overflowed = true;
        return;
}

PG_TRY/CATCH trick is replaced with some number of if/return. Code becomes a bit more complicated, but your notes does matter.

And then check for the overflowed flag at the top level.

* This part of the high-level comment was not clear to me:

 * States of the graph produced in the first stage are marked with "keys". Key is a pair
 * of a "prefix" and a state of the original automaton. "Prefix" is a last
 * characters. So, knowing the prefix is enough to know what is a trigram when we read some new
 * character. However, we can know single character of prefix or don't know any
 * characters of it. Each state of resulting graph have an "enter key" (with that
 * key we've entered this state) and a set of keys which are reachable without
 * reading any predictable trigram. The algorithm of processing each state
 * of resulting graph are so:
 * 1) Add all keys which achievable without reading of any predictable trigram.
 * 2) Add outgoing arcs labeled with trigrams.
 * Step 2 leads to creation of new states and recursively processing them. So,
 * we use depth-first algorithm.

I didn't understand that. Can you elaborate? It might help to work through an example, with some ascii art depicting the graph.

This comment is extended with example.
 
* It would be nice to add some comments to TrgmCNFA struct, explaining which fields are valid at which stages. For example, it seems that 'trgms' array is calculated only after building the CNFA, by getTrgmVector() function, while arcsCount is updated on the fly, while recursing in the getState() function.

TrgmCNFA comment is extended with this. 
 
* What is the representation used for the path matrix? Needs a comment.

Comments are added to PackedTrgmPaths and TrgmStatePath.
 
* What do the getColorinfo()
 
getColorInfo comment now references to ColorInfo struct which have comment. 

and scanColorMap() functions do?

 scanColorMap comment now have description of colormap structure.
 
What exactly does a color represent?

This is added to the top comment.
 
What's the tradeoff in choosing MAX_COLOR_CHARS?
 
Comment is added to the macro.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
"er"
Date:
On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:

> trgm-regexp-0.6.patch.gz

I ran the simple-minded tests against generated data (similar to the ones I did in January 2012). 
The problems of that older version seem pretty much all removed. (although I didn't do much work
on it -- just reran these tests).

I used two 2 instances, 'HEAD' and 'trgm_regex', which were both compiled with
    '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'

Tables used:
           rowcount     size table        size index (trgm)
azjunk4   10^4 rows      1,171,456 |       9,781,248azjunk5   10^5 rows     11,706,368 |      65,093,632azjunk6   10^6
rows   117,030,912 |     726,310,912azjunk7   10^7 rows  1,170,292,736 |   4,976,189,440
 
  (See my previous emails for a generating script)

Tables contain random generated text:

table azjunk7 limit 5;                                      txt
----------------------------------------------------------------------------------i kzzhv ssaa zv x  xlepzxsgbdkxev v
wndmvqkuwb qxkyvgab gpidaosaqbewqimmai  jxjbvwn zbevtzyhibbn hoctxurutn pvlatjjyxf f runa owpltbcunrbq ux peoook
rxwoscbytzbbjlbbhhkivjivklgbhtvapzogh rj ky ahvgkvvlfudotvqapznludohdoyqrp kvothyclbckbxuhvic gomewbp
izsjifqggyqgzcghdatlb kud ltfqaxqxjjom qkw wqggikgvph   yi sftmbjtedbjfl vtcasudjpgfgjaf swooxygsse flnqd
pxzsdmesqhqbzgirswysotemuakq agk p w uq
 
(5 rows)

with index on column 'txt':
 create index az7_idx on azjunk7 using gin (txt gin_trgm_ops);

Queries were of the form:
  explain analyze select txt from azjunkXX where txt ~ '$REGEX';

The main problem with the January version was that it chose to use the trgm index even when it
could take a long time (hours).  This has been resolved as far as I can see, and the results are
now very attractive.

(There does seem to be a very slight regression on the seqscan, but it's so small that I'm not yet
sure it's not noise)


Hardware: AMD FX-8120 with Linux 2.6.32-279.14.1.el6.x86_64  x86_64  GNU/Linux

PostgreSQL 9.3devel-trgm_regex-20121127_2353-e78d288c895bd296e3cb1ca29c7fe2431eef3fcd on
x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.7.2, 64-bit

port instance    table    regex                 rows method            expl.analyze timing
6543 HEAD        azjunk4  x[ae]q                  46 Seq Scan             12.962 ms6554 trgm_regex  azjunk4  x[ae]q
            46 Bitmap Heap Scan      0.800 ms
 
6543 HEAD        azjunk4  x[ae]{1}q               46 Seq Scan             12.487 ms6554 trgm_regex  azjunk4  x[ae]{1}q
            46 Bitmap Heap Scan      0.209 ms
 
6543 HEAD        azjunk4  x[ae]{1,1}q             46 Seq Scan             12.266 ms6554 trgm_regex  azjunk4
x[ae]{1,1}q            46 Bitmap Heap Scan      0.210 ms
 
6543 HEAD        azjunk4  x[ae]{,2}q               0 Seq Scan             14.322 ms6554 trgm_regex  azjunk4  x[ae]{,2}q
             0 Bitmap Heap Scan      0.610 ms
 
6543 HEAD        azjunk4  x[ae]{,10}q              0 Seq Scan             20.503 ms6554 trgm_regex  azjunk4
x[ae]{,10}q             0 Bitmap Heap Scan      0.511 ms
 
6543 HEAD        azjunk4  x[ae]{1,2}q             49 Seq Scan             13.060 ms6554 trgm_regex  azjunk4
x[ae]{1,2}q            49 Bitmap Heap Scan      0.429 ms
 
6543 HEAD        azjunk4  x[aei]q                 81 Seq Scan             12.487 ms6554 trgm_regex  azjunk4  x[aei]q
            81 Bitmap Heap Scan      0.367 ms
 
6543 HEAD        azjunk4  x[aei]{1}q              81 Seq Scan             12.132 ms6554 trgm_regex  azjunk4  x[aei]{1}q
            81 Bitmap Heap Scan      0.336 ms
 
6543 HEAD        azjunk4  x[aei]{1,1}q            81 Seq Scan             12.168 ms6554 trgm_regex  azjunk4
x[aei]{1,1}q           81 Bitmap Heap Scan      0.319 ms
 
6543 HEAD        azjunk4  x[aei]{,2}q              0 Seq Scan             14.586 ms6554 trgm_regex  azjunk4
x[aei]{,2}q             0 Bitmap Heap Scan      0.621 ms
 
6543 HEAD        azjunk4  x[aei]{,10}q             0 Seq Scan             20.134 ms6554 trgm_regex  azjunk4
x[aei]{,10}q            0 Bitmap Heap Scan      0.552 ms
 
6543 HEAD        azjunk4  x[aei]{1,2}q            89 Seq Scan             12.553 ms6554 trgm_regex  azjunk4
x[aei]{1,2}q           89 Bitmap Heap Scan      0.916 ms
 
6543 HEAD        azjunk4  x[aei]{1,3}q            89 Seq Scan             13.055 ms6554 trgm_regex  azjunk4
x[aei]{1,3}q           89 Seq Scan             13.064 ms
 
6543 HEAD        azjunk4  x[aei]q                 81 Seq Scan             11.856 ms6554 trgm_regex  azjunk4  x[aei]q
            81 Bitmap Heap Scan      0.398 ms
 
6543 HEAD        azjunk4  x[aei]{1}q              81 Seq Scan             11.951 ms6554 trgm_regex  azjunk4  x[aei]{1}q
            81 Bitmap Heap Scan      0.369 ms
 
6543 HEAD        azjunk4  x[aei]{1,1}q            81 Seq Scan             12.750 ms6554 trgm_regex  azjunk4
x[aei]{1,1}q           81 Bitmap Heap Scan      0.355 ms
 
6543 HEAD        azjunk4  x[aei]{,2}q              0 Seq Scan             14.032 ms6554 trgm_regex  azjunk4
x[aei]{,2}q             0 Bitmap Heap Scan      0.540 ms
 
6543 HEAD        azjunk4  x[aei]{,10}q             0 Seq Scan             20.377 ms6554 trgm_regex  azjunk4
x[aei]{,10}q            0 Bitmap Heap Scan      0.550 ms
 
6543 HEAD        azjunk4  x[aei]{1,2}q            89 Seq Scan             12.706 ms6554 trgm_regex  azjunk4
x[aei]{1,2}q           89 Bitmap Heap Scan      0.969 ms
 
6543 HEAD        azjunk4  x[aei]{1,3}q            89 Seq Scan             13.127 ms6554 trgm_regex  azjunk4
x[aei]{1,3}q           89 Seq Scan             13.025 ms
 
6543 HEAD        azjunk4  x[aeio]q               105 Seq Scan             12.533 ms6554 trgm_regex  azjunk4  x[aeio]q
           105 Bitmap Heap Scan      0.391 ms
 
6543 HEAD        azjunk4  x[aeio]{1}q            105 Seq Scan             12.532 ms6554 trgm_regex  azjunk4
x[aeio]{1}q           105 Bitmap Heap Scan      0.362 ms
 
6543 HEAD        azjunk4  x[aeio]{1,1}q          105 Seq Scan             12.323 ms6554 trgm_regex  azjunk4
x[aeio]{1,1}q         105 Bitmap Heap Scan      0.449 ms
 
6543 HEAD        azjunk4  x[aeio]{,2}q             0 Seq Scan             14.417 ms6554 trgm_regex  azjunk4
x[aeio]{,2}q            0 Bitmap Heap Scan      0.844 ms
 
6543 HEAD        azjunk4  x[aeio]{,10}q            0 Seq Scan             23.056 ms6554 trgm_regex  azjunk4
x[aeio]{,10}q           0 Bitmap Heap Scan      0.668 ms
 
6543 HEAD        azjunk4  x[aeio]{1,2}q          121 Seq Scan             13.072 ms6554 trgm_regex  azjunk4
x[aeio]{1,2}q         121 Seq Scan             13.750 ms
 
6543 HEAD        azjunk4  x[aeio]{1,3}q          123 Seq Scan             12.916 ms6554 trgm_regex  azjunk4
x[aeio]{1,3}q         123 Seq Scan             13.078 ms
 
6543 HEAD        azjunk4  x[aeio]{1,4}q          124 Seq Scan             13.478 ms6554 trgm_regex  azjunk4
x[aeio]{1,4}q         124 Seq Scan             14.334 ms
 
6543 HEAD        azjunk4  x[aeio]{2,4}q           19 Seq Scan             13.922 ms6554 trgm_regex  azjunk4
x[aeio]{2,4}q          19 Seq Scan             13.503 ms
 
6543 HEAD        azjunk4  x[aeio]{3,4}q            3 Seq Scan             14.325 ms6554 trgm_regex  azjunk4
x[aeio]{3,4}q           3 Seq Scan             13.429 ms
 
6543 HEAD        azjunk4  x[aeiou]q              134 Seq Scan             12.356 ms6554 trgm_regex  azjunk4  x[aeiou]q
           134 Seq Scan             13.215 ms
 
6543 HEAD        azjunk4  x[aeiou]{1}q           134 Seq Scan             13.005 ms6554 trgm_regex  azjunk4
x[aeiou]{1}q          134 Seq Scan             12.893 ms
 
6543 HEAD        azjunk4  x[aeiou]{1,1}q         134 Seq Scan             12.430 ms6554 trgm_regex  azjunk4
x[aeiou]{1,1}q        134 Seq Scan             13.108 ms
 
6543 HEAD        azjunk4  x[aeiou]{,2}q            0 Seq Scan             14.486 ms6554 trgm_regex  azjunk4
x[aeiou]{,2}q           0 Bitmap Heap Scan      0.349 ms 
6543 HEAD        azjunk4  x[aeiou]{,10}q           0 Seq Scan             21.597 ms6554 trgm_regex  azjunk4
x[aeiou]{,10}q          0 Bitmap Heap Scan      0.363 ms
 
6543 HEAD        azjunk4  x[aeiou]{1,2}q         156 Seq Scan             13.069 ms6554 trgm_regex  azjunk4
x[aeiou]{1,2}q        156 Seq Scan             13.879 ms
 
6543 HEAD        azjunk4  x[aeiou]{1,3}q         160 Seq Scan             13.005 ms6554 trgm_regex  azjunk4
x[aeiou]{1,3}q        160 Seq Scan             14.016 ms
 
6543 HEAD        azjunk4  x[aeiou]{1,4}q         161 Seq Scan             13.603 ms6554 trgm_regex  azjunk4
x[aeiou]{1,4}q        161 Seq Scan             14.667 ms
 
6543 HEAD        azjunk4  x[aeiou]{2,4}q          27 Seq Scan             13.656 ms6554 trgm_regex  azjunk4
x[aeiou]{2,4}q         27 Seq Scan             14.113 ms
 
6543 HEAD        azjunk4  x[aeiou]{3,4}q           5 Seq Scan             13.541 ms6554 trgm_regex  azjunk4
x[aeiou]{3,4}q          5 Seq Scan             14.265 ms
 
6543 HEAD        azjunk4  x[aeiou]{1,5}q         162 Seq Scan             13.750 ms6554 trgm_regex  azjunk4
x[aeiou]{1,5}q        162 Seq Scan             13.858 ms
 
6543 HEAD        azjunk4  x[aeiou]{2,5}q          28 Seq Scan             13.745 ms6554 trgm_regex  azjunk4
x[aeiou]{2,5}q         28 Seq Scan             13.680 ms
 
6543 HEAD        azjunk4  x[aeiou]{4,5}q           2 Seq Scan             13.577 ms6554 trgm_regex  azjunk4
x[aeiou]{4,5}q          2 Seq Scan             13.755 ms
 
6543 HEAD        azjunk4  x[aeiouy]q             173 Seq Scan             12.106 ms6554 trgm_regex  azjunk4  x[aeiouy]q
           173 Seq Scan             12.655 ms
 
6543 HEAD        azjunk4  x[aeiouy]{1}q          173 Seq Scan             12.488 ms6554 trgm_regex  azjunk4
x[aeiouy]{1}q         173 Seq Scan             12.723 ms
 
6543 HEAD        azjunk4  x[aeiouy]{1,1}q        173 Seq Scan             12.525 ms6554 trgm_regex  azjunk4
x[aeiouy]{1,1}q       173 Seq Scan             13.957 ms
 
6543 HEAD        azjunk4  x[aeiouy]{,2}q           0 Seq Scan             14.574 ms6554 trgm_regex  azjunk4
x[aeiouy]{,2}q          0 Bitmap Heap Scan      0.295 ms
 
6543 HEAD        azjunk4  x[aeiouy]{,10}q          0 Seq Scan             20.916 ms6554 trgm_regex  azjunk4
x[aeiouy]{,10}q         0 Bitmap Heap Scan      0.311 ms
 
6543 HEAD        azjunk4  x[aeiouy]{1,2}q        204 Seq Scan             12.730 ms6554 trgm_regex  azjunk4
x[aeiouy]{1,2}q       204 Seq Scan             13.392 ms
 
6543 HEAD        azjunk4  x[aeiouy]{1,3}q        215 Seq Scan             12.824 ms6554 trgm_regex  azjunk4
x[aeiouy]{1,3}q       215 Seq Scan             13.083 ms
 
6543 HEAD        azjunk4  x[aeiouy]{1,4}q        218 Seq Scan             13.985 ms6554 trgm_regex  azjunk4
x[aeiouy]{1,4}q       218 Seq Scan             13.890 ms
 
6543 HEAD        azjunk4  x[aeiouy]{2,4}q         46 Seq Scan             13.735 ms6554 trgm_regex  azjunk4
x[aeiouy]{2,4}q        46 Seq Scan             13.724 ms
 
6543 HEAD        azjunk4  x[aeiouy]{3,4}q         14 Seq Scan             13.470 ms6554 trgm_regex  azjunk4
x[aeiouy]{3,4}q        14 Seq Scan             14.046 ms
 
6543 HEAD        azjunk4  x[aeiouy]{1,5}q        219 Seq Scan             14.245 ms6554 trgm_regex  azjunk4
x[aeiouy]{1,5}q       219 Seq Scan             14.370 ms
 
6543 HEAD        azjunk4  x[aeiouy]{2,5}q         47 Seq Scan             13.483 ms6554 trgm_regex  azjunk4
x[aeiouy]{2,5}q        47 Seq Scan             15.065 ms
 
6543 HEAD        azjunk4  x[aeiouy]{4,5}q          4 Seq Scan             13.394 ms6554 trgm_regex  azjunk4
x[aeiouy]{4,5}q         4 Seq Scan             14.158 ms
 
6543 HEAD        azjunk5  x[ae]q                 677 Seq Scan            114.862 ms6554 trgm_regex  azjunk5  x[ae]q
           677 Bitmap Heap Scan      2.213 ms
 
6543 HEAD        azjunk5  x[ae]{1}q              677 Seq Scan            119.024 ms6554 trgm_regex  azjunk5  x[ae]{1}q
           677 Bitmap Heap Scan      1.800 ms
 
6543 HEAD        azjunk5  x[ae]{1,1}q            677 Seq Scan            118.500 ms6554 trgm_regex  azjunk5
x[ae]{1,1}q           677 Bitmap Heap Scan      1.788 ms
 
6543 HEAD        azjunk5  x[ae]{,2}q               0 Seq Scan            138.023 ms6554 trgm_regex  azjunk5  x[ae]{,2}q
             0 Bitmap Heap Scan      9.822 ms
 
6543 HEAD        azjunk5  x[ae]{,10}q              0 Seq Scan            223.479 ms6554 trgm_regex  azjunk5
x[ae]{,10}q             0 Bitmap Heap Scan      9.141 ms
 
6543 HEAD        azjunk5  x[ae]{1,2}q            723 Seq Scan            122.973 ms6554 trgm_regex  azjunk5
x[ae]{1,2}q           723 Bitmap Heap Scan      3.865 ms
 
6543 HEAD        azjunk5  x[aei]q                982 Seq Scan            121.424 ms6554 trgm_regex  azjunk5  x[aei]q
           982 Bitmap Heap Scan      2.639 ms
 
6543 HEAD        azjunk5  x[aei]{1}q             982 Seq Scan            119.213 ms6554 trgm_regex  azjunk5  x[aei]{1}q
           982 Bitmap Heap Scan      2.769 ms
 
6543 HEAD        azjunk5  x[aei]{1,1}q           982 Seq Scan            121.673 ms6554 trgm_regex  azjunk5
x[aei]{1,1}q          982 Bitmap Heap Scan      2.657 ms
 
6543 HEAD        azjunk5  x[aei]{,2}q              0 Seq Scan            142.256 ms6554 trgm_regex  azjunk5
x[aei]{,2}q             0 Bitmap Heap Scan      5.588 ms
 
6543 HEAD        azjunk5  x[aei]{,10}q             0 Seq Scan            214.769 ms6554 trgm_regex  azjunk5
x[aei]{,10}q            0 Bitmap Heap Scan      9.007 ms
 
6543 HEAD        azjunk5  x[aei]{1,2}q          1075 Seq Scan            128.672 ms6554 trgm_regex  azjunk5
x[aei]{1,2}q         1075 Bitmap Heap Scan      8.079 ms
 
6543 HEAD        azjunk5  x[aei]{1,3}q          1086 Seq Scan            127.069 ms6554 trgm_regex  azjunk5
x[aei]{1,3}q         1086 Bitmap Heap Scan     27.654 ms
 
6543 HEAD        azjunk5  x[aei]q                982 Seq Scan            121.431 ms6554 trgm_regex  azjunk5  x[aei]q
           982 Bitmap Heap Scan      2.782 ms
 
6543 HEAD        azjunk5  x[aei]{1}q             982 Seq Scan            121.270 ms6554 trgm_regex  azjunk5  x[aei]{1}q
           982 Bitmap Heap Scan      2.603 ms
 
6543 HEAD        azjunk5  x[aei]{1,1}q           982 Seq Scan            120.032 ms6554 trgm_regex  azjunk5
x[aei]{1,1}q          982 Bitmap Heap Scan      2.627 ms
 
6543 HEAD        azjunk5  x[aei]{,2}q              0 Seq Scan            143.379 ms6554 trgm_regex  azjunk5
x[aei]{,2}q             0 Bitmap Heap Scan      4.906 ms
 
6543 HEAD        azjunk5  x[aei]{,10}q             0 Seq Scan            196.212 ms6554 trgm_regex  azjunk5
x[aei]{,10}q            0 Bitmap Heap Scan      4.707 ms
 
6543 HEAD        azjunk5  x[aei]{1,2}q          1075 Seq Scan            127.050 ms6554 trgm_regex  azjunk5
x[aei]{1,2}q         1075 Bitmap Heap Scan      8.474 ms
 
6543 HEAD        azjunk5  x[aei]{1,3}q          1086 Seq Scan            127.090 ms6554 trgm_regex  azjunk5
x[aei]{1,3}q         1086 Bitmap Heap Scan     27.646 ms
 
6543 HEAD        azjunk5  x[aeio]q              1292 Seq Scan            119.951 ms6554 trgm_regex  azjunk5  x[aeio]q
          1292 Bitmap Heap Scan      3.881 ms
 
6543 HEAD        azjunk5  x[aeio]{1}q           1292 Seq Scan            123.444 ms6554 trgm_regex  azjunk5
x[aeio]{1}q          1292 Bitmap Heap Scan      3.346 ms
 
6543 HEAD        azjunk5  x[aeio]{1,1}q         1292 Seq Scan            124.024 ms6554 trgm_regex  azjunk5
x[aeio]{1,1}q        1292 Bitmap Heap Scan      3.681 ms
 
6543 HEAD        azjunk5  x[aeio]{,2}q             0 Seq Scan            152.181 ms6554 trgm_regex  azjunk5
x[aeio]{,2}q            0 Bitmap Heap Scan      5.774 ms
 
6543 HEAD        azjunk5  x[aeio]{,10}q            0 Seq Scan            214.168 ms6554 trgm_regex  azjunk5
x[aeio]{,10}q           0 Bitmap Heap Scan     12.074 ms
 
6543 HEAD        azjunk5  x[aeio]{1,2}q         1441 Seq Scan            128.491 ms6554 trgm_regex  azjunk5
x[aeio]{1,2}q        1441 Bitmap Heap Scan     22.538 ms
 
6543 HEAD        azjunk5  x[aeio]{1,3}q         1461 Seq Scan            132.987 ms6554 trgm_regex  azjunk5
x[aeio]{1,3}q        1461 Seq Scan            125.682 ms
 
6543 HEAD        azjunk5  x[aeio]{1,4}q         1464 Seq Scan            132.729 ms6554 trgm_regex  azjunk5
x[aeio]{1,4}q        1464 Seq Scan            133.625 ms
 
6543 HEAD        azjunk5  x[aeio]{2,4}q          175 Seq Scan            135.328 ms6554 trgm_regex  azjunk5
x[aeio]{2,4}q         175 Seq Scan            134.194 ms
 
6543 HEAD        azjunk5  x[aeio]{3,4}q           23 Seq Scan            131.590 ms6554 trgm_regex  azjunk5
x[aeio]{3,4}q          23 Seq Scan            135.435 ms
 
6543 HEAD        azjunk5  x[aeiou]q             1598 Seq Scan            124.063 ms6554 trgm_regex  azjunk5  x[aeiou]q
          1598 Seq Scan            124.983 ms
 
6543 HEAD        azjunk5  x[aeiou]{1}q          1598 Seq Scan            134.563 ms6554 trgm_regex  azjunk5
x[aeiou]{1}q         1598 Seq Scan            128.089 ms
 
6543 HEAD        azjunk5  x[aeiou]{1,1}q        1598 Seq Scan            124.158 ms6554 trgm_regex  azjunk5
x[aeiou]{1,1}q       1598 Seq Scan            128.355 ms
 
6543 HEAD        azjunk5  x[aeiou]{,2}q            0 Seq Scan            144.541 ms6554 trgm_regex  azjunk5
x[aeiou]{,2}q           0 Bitmap Heap Scan      2.369 ms
 
6543 HEAD        azjunk5  x[aeiou]{,10}q           0 Seq Scan            208.091 ms6554 trgm_regex  azjunk5
x[aeiou]{,10}q          0 Bitmap Heap Scan      2.528 ms
 
6543 HEAD        azjunk5  x[aeiou]{1,2}q        1838 Seq Scan            130.474 ms6554 trgm_regex  azjunk5
x[aeiou]{1,2}q       1838 Seq Scan            130.433 ms
 
6543 HEAD        azjunk5  x[aeiou]{1,3}q        1886 Seq Scan            134.002 ms6554 trgm_regex  azjunk5
x[aeiou]{1,3}q       1886 Seq Scan            134.786 ms
 
6543 HEAD        azjunk5  x[aeiou]{1,4}q        1892 Seq Scan            137.588 ms6554 trgm_regex  azjunk5
x[aeiou]{1,4}q       1892 Seq Scan            145.194 ms
 
6543 HEAD        azjunk5  x[aeiou]{2,4}q         299 Seq Scan            136.125 ms6554 trgm_regex  azjunk5
x[aeiou]{2,4}q        299 Seq Scan            138.212 ms
 
6543 HEAD        azjunk5  x[aeiou]{3,4}q          54 Seq Scan            135.205 ms6554 trgm_regex  azjunk5
x[aeiou]{3,4}q         54 Seq Scan            134.146 ms
 
6543 HEAD        azjunk5  x[aeiou]{1,5}q        1895 Seq Scan            137.151 ms6554 trgm_regex  azjunk5
x[aeiou]{1,5}q       1895 Seq Scan            140.986 ms
 
6543 HEAD        azjunk5  x[aeiou]{2,5}q         302 Seq Scan            142.189 ms6554 trgm_regex  azjunk5
x[aeiou]{2,5}q        302 Seq Scan            137.368 ms
 
6543 HEAD        azjunk5  x[aeiou]{4,5}q           9 Seq Scan            138.165 ms6554 trgm_regex  azjunk5
x[aeiou]{4,5}q          9 Seq Scan            137.122 ms
 
6543 HEAD        azjunk5  x[aeiouy]q            1913 Seq Scan            126.283 ms6554 trgm_regex  azjunk5  x[aeiouy]q
          1913 Seq Scan            130.424 ms
 
6543 HEAD        azjunk5  x[aeiouy]{1}q         1913 Seq Scan            125.947 ms6554 trgm_regex  azjunk5
x[aeiouy]{1}q        1913 Seq Scan            131.957 ms
 
6543 HEAD        azjunk5  x[aeiouy]{1,1}q       1913 Seq Scan            126.529 ms6554 trgm_regex  azjunk5
x[aeiouy]{1,1}q      1913 Seq Scan            130.958 ms
 
6543 HEAD        azjunk5  x[aeiouy]{,2}q           0 Seq Scan            147.704 ms6554 trgm_regex  azjunk5
x[aeiouy]{,2}q          0 Bitmap Heap Scan      2.331 ms
 
6543 HEAD        azjunk5  x[aeiouy]{,10}q          0 Seq Scan            221.774 ms6554 trgm_regex  azjunk5
x[aeiouy]{,10}q         0 Bitmap Heap Scan      2.522 ms
 
6543 HEAD        azjunk5  x[aeiouy]{1,2}q       2275 Seq Scan            134.044 ms6554 trgm_regex  azjunk5
x[aeiouy]{1,2}q      2275 Seq Scan            136.827 ms
 
6543 HEAD        azjunk5  x[aeiouy]{1,3}q       2358 Seq Scan            135.599 ms6554 trgm_regex  azjunk5
x[aeiouy]{1,3}q      2358 Seq Scan            134.196 ms
 
6543 HEAD        azjunk5  x[aeiouy]{1,4}q       2376 Seq Scan            138.685 ms6554 trgm_regex  azjunk5
x[aeiouy]{1,4}q      2376 Seq Scan            141.408 ms
 
6543 HEAD        azjunk5  x[aeiouy]{2,4}q        474 Seq Scan            142.223 ms6554 trgm_regex  azjunk5
x[aeiouy]{2,4}q       474 Seq Scan            143.439 ms
 
6543 HEAD        azjunk5  x[aeiouy]{3,4}q        103 Seq Scan            138.690 ms6554 trgm_regex  azjunk5
x[aeiouy]{3,4}q       103 Seq Scan            136.192 ms
 
6543 HEAD        azjunk5  x[aeiouy]{1,5}q       2381 Seq Scan            140.836 ms6554 trgm_regex  azjunk5
x[aeiouy]{1,5}q      2381 Seq Scan            143.374 ms
 
6543 HEAD        azjunk5  x[aeiouy]{2,5}q        479 Seq Scan            140.223 ms6554 trgm_regex  azjunk5
x[aeiouy]{2,5}q       479 Seq Scan            139.995 ms
 
6543 HEAD        azjunk5  x[aeiouy]{4,5}q         23 Seq Scan            139.976 ms6554 trgm_regex  azjunk5
x[aeiouy]{4,5}q        23 Seq Scan            138.114 ms
 
6543 HEAD        azjunk6  x[ae]q                6448 Seq Scan           1219.490 ms6554 trgm_regex  azjunk6  x[ae]q
          6448 Bitmap Heap Scan     23.452 ms
 
6543 HEAD        azjunk6  x[ae]{1}q             6448 Seq Scan           1153.371 ms6554 trgm_regex  azjunk6  x[ae]{1}q
          6448 Bitmap Heap Scan     18.492 ms
 
6543 HEAD        azjunk6  x[ae]{1,1}q           6448 Seq Scan           1189.951 ms6554 trgm_regex  azjunk6
x[ae]{1,1}q          6448 Bitmap Heap Scan     24.596 ms
 
6543 HEAD        azjunk6  x[ae]{,2}q               0 Seq Scan           1423.474 ms6554 trgm_regex  azjunk6  x[ae]{,2}q
             0 Bitmap Heap Scan     41.593 ms
 
6543 HEAD        azjunk6  x[ae]{,10}q              0 Seq Scan           1957.142 ms6554 trgm_regex  azjunk6
x[ae]{,10}q             0 Bitmap Heap Scan     45.238 ms
 
6543 HEAD        azjunk6  x[ae]{1,2}q           6886 Seq Scan           1253.761 ms6554 trgm_regex  azjunk6
x[ae]{1,2}q          6886 Bitmap Heap Scan     31.247 ms
 
6543 HEAD        azjunk6  x[aei]q               9600 Seq Scan           1203.022 ms6554 trgm_regex  azjunk6  x[aei]q
          9600 Bitmap Heap Scan     31.467 ms
 
6543 HEAD        azjunk6  x[aei]{1}q            9600 Seq Scan           1213.834 ms6554 trgm_regex  azjunk6  x[aei]{1}q
          9600 Bitmap Heap Scan     26.008 ms
 
6543 HEAD        azjunk6  x[aei]{1,1}q          9600 Seq Scan           1244.158 ms6554 trgm_regex  azjunk6
x[aei]{1,1}q         9600 Bitmap Heap Scan     25.997 ms
 
6543 HEAD        azjunk6  x[aei]{,2}q              0 Seq Scan           1432.935 ms6554 trgm_regex  azjunk6
x[aei]{,2}q             0 Bitmap Heap Scan     44.843 ms
 
6543 HEAD        azjunk6  x[aei]{,10}q             0 Seq Scan           1940.611 ms6554 trgm_regex  azjunk6
x[aei]{,10}q            0 Bitmap Heap Scan     45.838 ms
 
6543 HEAD        azjunk6  x[aei]{1,2}q         10604 Seq Scan           1235.913 ms6554 trgm_regex  azjunk6
x[aei]{1,2}q        10604 Bitmap Heap Scan     78.764 ms
 
6543 HEAD        azjunk6  x[aei]{1,3}q         10704 Seq Scan           1244.960 ms6554 trgm_regex  azjunk6
x[aei]{1,3}q        10704 Bitmap Heap Scan    272.049 ms
 
6543 HEAD        azjunk6  x[aei]q               9600 Seq Scan           1211.965 ms6554 trgm_regex  azjunk6  x[aei]q
          9600 Bitmap Heap Scan     26.230 ms
 
6543 HEAD        azjunk6  x[aei]{1}q            9600 Seq Scan           1218.431 ms6554 trgm_regex  azjunk6  x[aei]{1}q
          9600 Bitmap Heap Scan     25.462 ms
 
6543 HEAD        azjunk6  x[aei]{1,1}q          9600 Seq Scan           1250.050 ms6554 trgm_regex  azjunk6
x[aei]{1,1}q         9600 Bitmap Heap Scan     25.711 ms
 
6543 HEAD        azjunk6  x[aei]{,2}q              0 Seq Scan           1457.725 ms6554 trgm_regex  azjunk6
x[aei]{,2}q             0 Bitmap Heap Scan     43.491 ms 
6543 HEAD        azjunk6  x[aei]{,10}q             0 Seq Scan           2034.895 ms6554 trgm_regex  azjunk6
x[aei]{,10}q            0 Bitmap Heap Scan     46.139 ms
 
6543 HEAD        azjunk6  x[aei]{1,2}q         10604 Seq Scan           1250.820 ms6554 trgm_regex  azjunk6
x[aei]{1,2}q        10604 Bitmap Heap Scan     78.067 ms
 
6543 HEAD        azjunk6  x[aei]{1,3}q         10704 Seq Scan           1265.146 ms6554 trgm_regex  azjunk6
x[aei]{1,3}q        10704 Bitmap Heap Scan    274.109 ms
 
6543 HEAD        azjunk6  x[aeio]q             12784 Seq Scan           1235.647 ms6554 trgm_regex  azjunk6  x[aeio]q
         12784 Bitmap Heap Scan     35.613 ms
 
6543 HEAD        azjunk6  x[aeio]{1}q          12784 Seq Scan           1206.185 ms6554 trgm_regex  azjunk6
x[aeio]{1}q         12784 Bitmap Heap Scan     39.618 ms
 
6543 HEAD        azjunk6  x[aeio]{1,1}q        12784 Seq Scan           1210.467 ms6554 trgm_regex  azjunk6
x[aeio]{1,1}q       12784 Bitmap Heap Scan     34.513 ms
 
6543 HEAD        azjunk6  x[aeio]{,2}q             0 Seq Scan           1457.918 ms6554 trgm_regex  azjunk6
x[aeio]{,2}q            0 Bitmap Heap Scan     55.732 ms
 
6543 HEAD        azjunk6  x[aeio]{,10}q            0 Seq Scan           2104.860 ms6554 trgm_regex  azjunk6
x[aeio]{,10}q           0 Bitmap Heap Scan     62.129 ms
 
6543 HEAD        azjunk6  x[aeio]{1,2}q        14538 Seq Scan           1286.881 ms6554 trgm_regex  azjunk6
x[aeio]{1,2}q       14538 Bitmap Heap Scan    182.161 ms
 
6543 HEAD        azjunk6  x[aeio]{1,3}q        14761 Seq Scan           1291.199 ms6554 trgm_regex  azjunk6
x[aeio]{1,3}q       14761 Bitmap Heap Scan   1445.593 ms
 
6543 HEAD        azjunk6  x[aeio]{1,4}q        14791 Seq Scan           1331.960 ms6554 trgm_regex  azjunk6
x[aeio]{1,4}q       14791 Seq Scan           1340.845 ms
 
6543 HEAD        azjunk6  x[aeio]{2,4}q         2024 Seq Scan           1337.631 ms6554 trgm_regex  azjunk6
x[aeio]{2,4}q        2024 Seq Scan           1354.844 ms
 
6543 HEAD        azjunk6  x[aeio]{3,4}q          257 Seq Scan           1321.271 ms6554 trgm_regex  azjunk6
x[aeio]{3,4}q         257 Seq Scan           1335.737 ms
 
6543 HEAD        azjunk6  x[aeiou]q            15976 Seq Scan           1237.313 ms6554 trgm_regex  azjunk6  x[aeiou]q
         15976 Seq Scan           1268.531 ms
 
6543 HEAD        azjunk6  x[aeiou]{1}q         15976 Seq Scan           1251.777 ms6554 trgm_regex  azjunk6
x[aeiou]{1}q        15976 Seq Scan           1268.431 ms
 
6543 HEAD        azjunk6  x[aeiou]{1,1}q       15976 Seq Scan           1243.416 ms6554 trgm_regex  azjunk6
x[aeiou]{1,1}q      15976 Seq Scan           1263.152 ms
 
6543 HEAD        azjunk6  x[aeiou]{,2}q            0 Seq Scan           1476.587 ms6554 trgm_regex  azjunk6
x[aeiou]{,2}q           0 Bitmap Heap Scan     19.583 ms
 
6543 HEAD        azjunk6  x[aeiou]{,10}q           0 Seq Scan           2084.845 ms6554 trgm_regex  azjunk6
x[aeiou]{,10}q          0 Bitmap Heap Scan     21.377 ms
 
6543 HEAD        azjunk6  x[aeiou]{1,2}q       18692 Seq Scan           1302.585 ms6554 trgm_regex  azjunk6
x[aeiou]{1,2}q      18692 Seq Scan           1330.683 ms
 
6543 HEAD        azjunk6  x[aeiou]{1,3}q       19128 Seq Scan           1290.309 ms6554 trgm_regex  azjunk6
x[aeiou]{1,3}q      19128 Seq Scan           1317.831 ms
 
6543 HEAD        azjunk6  x[aeiou]{1,4}q       19202 Seq Scan           1347.727 ms6554 trgm_regex  azjunk6
x[aeiou]{1,4}q      19202 Seq Scan           1361.307 ms
 
6543 HEAD        azjunk6  x[aeiou]{2,4}q        3268 Seq Scan           1362.704 ms6554 trgm_regex  azjunk6
x[aeiou]{2,4}q       3268 Seq Scan           1372.468 ms
 
6543 HEAD        azjunk6  x[aeiou]{3,4}q         523 Seq Scan           1321.774 ms6554 trgm_regex  azjunk6
x[aeiou]{3,4}q        523 Seq Scan           1346.200 ms
 
6543 HEAD        azjunk6  x[aeiou]{1,5}q       19214 Seq Scan           1367.949 ms6554 trgm_regex  azjunk6
x[aeiou]{1,5}q      19214 Seq Scan           1428.444 ms
 
6543 HEAD        azjunk6  x[aeiou]{2,5}q        3280 Seq Scan           1349.375 ms6554 trgm_regex  azjunk6
x[aeiou]{2,5}q       3280 Seq Scan           1375.887 ms
 
6543 HEAD        azjunk6  x[aeiou]{4,5}q          88 Seq Scan           1324.008 ms6554 trgm_regex  azjunk6
x[aeiou]{4,5}q         88 Seq Scan           1394.067 ms
 
6543 HEAD        azjunk6  x[aeiouy]q           19168 Seq Scan           1262.363 ms6554 trgm_regex  azjunk6  x[aeiouy]q
         19168 Seq Scan           1248.167 ms
 
6543 HEAD        azjunk6  x[aeiouy]{1}q        19168 Seq Scan           1257.760 ms6554 trgm_regex  azjunk6
x[aeiouy]{1}q       19168 Seq Scan           1276.502 ms
 
6543 HEAD        azjunk6  x[aeiouy]{1,1}q      19168 Seq Scan           1282.770 ms6554 trgm_regex  azjunk6
x[aeiouy]{1,1}q     19168 Seq Scan           1284.173 ms
 
6543 HEAD        azjunk6  x[aeiouy]{,2}q           0 Seq Scan           1483.940 ms6554 trgm_regex  azjunk6
x[aeiouy]{,2}q          0 Bitmap Heap Scan     20.634 ms
 
6543 HEAD        azjunk6  x[aeiouy]{,10}q          0 Seq Scan           2058.701 ms6554 trgm_regex  azjunk6
x[aeiouy]{,10}q         0 Bitmap Heap Scan     21.596 ms
 
6543 HEAD        azjunk6  x[aeiouy]{1,2}q      23069 Seq Scan           1340.593 ms6554 trgm_regex  azjunk6
x[aeiouy]{1,2}q     23069 Seq Scan           1322.919 ms
 
6543 HEAD        azjunk6  x[aeiouy]{1,3}q      23844 Seq Scan           1321.853 ms6554 trgm_regex  azjunk6
x[aeiouy]{1,3}q     23844 Seq Scan           1333.974 ms
 
6543 HEAD        azjunk6  x[aeiouy]{1,4}q      23993 Seq Scan           1377.787 ms6554 trgm_regex  azjunk6
x[aeiouy]{1,4}q     23993 Seq Scan           1389.073 ms
 
6543 HEAD        azjunk6  x[aeiouy]{2,4}q       4903 Seq Scan           1392.936 ms6554 trgm_regex  azjunk6
x[aeiouy]{2,4}q      4903 Seq Scan           1399.154 ms
 
6543 HEAD        azjunk6  x[aeiouy]{3,4}q        944 Seq Scan           1342.379 ms6554 trgm_regex  azjunk6
x[aeiouy]{3,4}q       944 Seq Scan           1375.420 ms
 
6543 HEAD        azjunk6  x[aeiouy]{1,5}q      24028 Seq Scan           1402.588 ms6554 trgm_regex  azjunk6
x[aeiouy]{1,5}q     24028 Seq Scan           1482.936 ms
 
6543 HEAD        azjunk6  x[aeiouy]{2,5}q       4938 Seq Scan           1378.311 ms6554 trgm_regex  azjunk6
x[aeiouy]{2,5}q      4938 Seq Scan           1402.020 ms
 
6543 HEAD        azjunk6  x[aeiouy]{4,5}q        189 Seq Scan           1348.171 ms6554 trgm_regex  azjunk6
x[aeiouy]{4,5}q       189 Seq Scan           1392.002 ms
 
6543 HEAD        azjunk7  x[ae]q               63781 Seq Scan          11722.978 ms6554 trgm_regex  azjunk7  x[ae]q
         63781 Bitmap Heap Scan    418.407 ms
 
6543 HEAD        azjunk7  x[ae]{1}q            63781 Seq Scan          11787.311 ms6554 trgm_regex  azjunk7  x[ae]{1}q
         63781 Bitmap Heap Scan    423.027 ms
 
6543 HEAD        azjunk7  x[ae]{1,1}q          63781 Seq Scan          11902.061 ms6554 trgm_regex  azjunk7
x[ae]{1,1}q         63781 Bitmap Heap Scan    420.819 ms
 
6543 HEAD        azjunk7  x[ae]{,2}q               0 Seq Scan          14144.148 ms6554 trgm_regex  azjunk7  x[ae]{,2}q
             0 Bitmap Heap Scan    343.806 ms
 
6543 HEAD        azjunk7  x[ae]{,10}q              0 Seq Scan          20390.872 ms6554 trgm_regex  azjunk7
x[ae]{,10}q             0 Bitmap Heap Scan    370.856 ms
 
6543 HEAD        azjunk7  x[ae]{1,2}q          68145 Seq Scan          12569.198 ms6554 trgm_regex  azjunk7
x[ae]{1,2}q         68145 Bitmap Heap Scan    571.570 ms
 
6543 HEAD        azjunk7  x[aei]q              95281 Seq Scan          12027.646 ms6554 trgm_regex  azjunk7  x[aei]q
         95281 Bitmap Heap Scan    579.807 ms
 
6543 HEAD        azjunk7  x[aei]{1}q           95281 Seq Scan          12213.674 ms6554 trgm_regex  azjunk7  x[aei]{1}q
         95281 Bitmap Heap Scan    581.085 ms
 
6543 HEAD        azjunk7  x[aei]{1,1}q         95281 Seq Scan          12121.898 ms6554 trgm_regex  azjunk7
x[aei]{1,1}q        95281 Bitmap Heap Scan    587.568 ms
 
6543 HEAD        azjunk7  x[aei]{,2}q              0 Seq Scan          14519.020 ms6554 trgm_regex  azjunk7
x[aei]{,2}q             0 Bitmap Heap Scan    440.596 ms
 
6543 HEAD        azjunk7  x[aei]{,10}q             0 Seq Scan          20829.970 ms6554 trgm_regex  azjunk7
x[aei]{,10}q            0 Bitmap Heap Scan    443.136 ms
 
6543 HEAD        azjunk7  x[aei]{1,2}q        105054 Seq Scan          12967.634 ms6554 trgm_regex  azjunk7
x[aei]{1,2}q       105054 Bitmap Heap Scan   1151.202 ms
 
6543 HEAD        azjunk7  x[aei]{1,3}q        106031 Seq Scan          12601.485 ms6554 trgm_regex  azjunk7
x[aei]{1,3}q       106031 Bitmap Heap Scan   3084.092 ms
 
6543 HEAD        azjunk7  x[aei]q              95281 Seq Scan          12251.805 ms6554 trgm_regex  azjunk7  x[aei]q
         95281 Bitmap Heap Scan    579.398 ms
 
6543 HEAD        azjunk7  x[aei]{1}q           95281 Seq Scan          12251.196 ms6554 trgm_regex  azjunk7  x[aei]{1}q
         95281 Bitmap Heap Scan    579.351 ms
 
6543 HEAD        azjunk7  x[aei]{1,1}q         95281 Seq Scan          12176.216 ms6554 trgm_regex  azjunk7
x[aei]{1,1}q        95281 Bitmap Heap Scan    577.931 ms
 
6543 HEAD        azjunk7  x[aei]{,2}q              0 Seq Scan          14632.855 ms6554 trgm_regex  azjunk7
x[aei]{,2}q             0 Bitmap Heap Scan    434.758 ms
 
6543 HEAD        azjunk7  x[aei]{,10}q             0 Seq Scan          20637.829 ms6554 trgm_regex  azjunk7
x[aei]{,10}q            0 Bitmap Heap Scan    440.237 ms
 
6543 HEAD        azjunk7  x[aei]{1,2}q        105054 Seq Scan          12967.108 ms6554 trgm_regex  azjunk7
x[aei]{1,2}q       105054 Bitmap Heap Scan   1166.260 ms
 
6543 HEAD        azjunk7  x[aei]{1,3}q        106031 Seq Scan          12820.629 ms6554 trgm_regex  azjunk7
x[aei]{1,3}q       106031 Bitmap Heap Scan   3079.662 ms
 
6543 HEAD        azjunk7  x[aeio]q            126868 Seq Scan          12535.441 ms6554 trgm_regex  azjunk7  x[aeio]q
        126868 Bitmap Heap Scan    737.634 ms
 
6543 HEAD        azjunk7  x[aeio]{1}q         126868 Seq Scan          12338.188 ms6554 trgm_regex  azjunk7
x[aeio]{1}q        126868 Bitmap Heap Scan    749.844 ms
 
6543 HEAD        azjunk7  x[aeio]{1,1}q       126868 Seq Scan          12579.271 ms6554 trgm_regex  azjunk7
x[aeio]{1,1}q      126868 Bitmap Heap Scan    731.667 ms
 
6543 HEAD        azjunk7  x[aeio]{,2}q             0 Seq Scan          14806.573 ms6554 trgm_regex  azjunk7
x[aeio]{,2}q            0 Bitmap Heap Scan    555.302 ms
 
6543 HEAD        azjunk7  x[aeio]{,10}q            0 Seq Scan          22000.135 ms6554 trgm_regex  azjunk7
x[aeio]{,10}q           0 Bitmap Heap Scan    560.526 ms
 
6543 HEAD        azjunk7  x[aeio]{1,2}q       144180 Seq Scan          12919.840 ms6554 trgm_regex  azjunk7
x[aeio]{1,2}q      144180 Bitmap Heap Scan   2245.885 ms
 
6543 HEAD        azjunk7  x[aeio]{1,3}q       146526 Seq Scan          12807.513 ms6554 trgm_regex  azjunk7
x[aeio]{1,3}q      146526 Bitmap Heap Scan  15261.582 ms
 
6543 HEAD        azjunk7  x[aeio]{1,4}q       146834 Seq Scan          13179.285 ms6554 trgm_regex  azjunk7
x[aeio]{1,4}q      146834 Seq Scan          13874.164 ms
 
6543 HEAD        azjunk7  x[aeio]{2,4}q        20220 Seq Scan          13365.779 ms6554 trgm_regex  azjunk7
x[aeio]{2,4}q       20220 Seq Scan          13544.404 ms
 
6543 HEAD        azjunk7  x[aeio]{3,4}q         2697 Seq Scan          13224.408 ms6554 trgm_regex  azjunk7
x[aeio]{3,4}q        2697 Seq Scan          13699.898 ms
 
6543 HEAD        azjunk7  x[aeiou]q           158778 Seq Scan          12753.739 ms6554 trgm_regex  azjunk7  x[aeiou]q
        158778 Seq Scan          12736.813 ms
 
6543 HEAD        azjunk7  x[aeiou]{1}q        158778 Seq Scan          12385.108 ms6554 trgm_regex  azjunk7
x[aeiou]{1}q       158778 Seq Scan          12852.739 ms
 
6543 HEAD        azjunk7  x[aeiou]{1,1}q      158778 Seq Scan          12665.614 ms6554 trgm_regex  azjunk7
x[aeiou]{1,1}q     158778 Seq Scan          12482.476 ms
 
6543 HEAD        azjunk7  x[aeiou]{,2}q            0 Seq Scan          14886.647 ms6554 trgm_regex  azjunk7
x[aeiou]{,2}q           0 Bitmap Heap Scan    197.807 ms
 
6543 HEAD        azjunk7  x[aeiou]{,10}q           0 Seq Scan          21428.416 ms6554 trgm_regex  azjunk7
x[aeiou]{,10}q          0 Bitmap Heap Scan    210.265 ms
 
6543 HEAD        azjunk7  x[aeiou]{1,2}q      185669 Seq Scan          12896.338 ms6554 trgm_regex  azjunk7
x[aeiou]{1,2}q     185669 Seq Scan          13354.702 ms
 
6543 HEAD        azjunk7  x[aeiou]{1,3}q      190274 Seq Scan          12730.517 ms6554 trgm_regex  azjunk7
x[aeiou]{1,3}q     190274 Seq Scan          13026.644 ms
 
6543 HEAD        azjunk7  x[aeiou]{1,4}q      191017 Seq Scan          13664.473 ms6554 trgm_regex  azjunk7
x[aeiou]{1,4}q     191017 Seq Scan          13743.875 ms
 
6543 HEAD        azjunk7  x[aeiou]{2,4}q       32732 Seq Scan          13360.429 ms6554 trgm_regex  azjunk7
x[aeiou]{2,4}q      32732 Seq Scan          13804.770 ms
 
6543 HEAD        azjunk7  x[aeiou]{3,4}q        5449 Seq Scan          13170.928 ms6554 trgm_regex  azjunk7
x[aeiou]{3,4}q       5449 Seq Scan          13394.707 ms
 
6543 HEAD        azjunk7  x[aeiou]{1,5}q      191160 Seq Scan          13591.866 ms6554 trgm_regex  azjunk7
x[aeiou]{1,5}q     191160 Seq Scan          14158.325 ms
 
6543 HEAD        azjunk7  x[aeiou]{2,5}q       32878 Seq Scan          13507.736 ms6554 trgm_regex  azjunk7
x[aeiou]{2,5}q      32878 Seq Scan          13687.159 ms
 
6543 HEAD        azjunk7  x[aeiou]{4,5}q         903 Seq Scan          13329.291 ms6554 trgm_regex  azjunk7
x[aeiou]{4,5}q        903 Seq Scan          13645.331 ms
 
6543 HEAD        azjunk7  x[aeiouy]q          190245 Seq Scan          12331.375 ms6554 trgm_regex  azjunk7  x[aeiouy]q
        190245 Seq Scan          12726.390 ms
 
6543 HEAD        azjunk7  x[aeiouy]{1}q       190245 Seq Scan          12752.629 ms6554 trgm_regex  azjunk7
x[aeiouy]{1}q      190245 Seq Scan          12712.805 ms
 
6543 HEAD        azjunk7  x[aeiouy]{1,1}q     190245 Seq Scan          12500.269 ms6554 trgm_regex  azjunk7
x[aeiouy]{1,1}q    190245 Seq Scan          12863.557 ms
 
6543 HEAD        azjunk7  x[aeiouy]{,2}q           0 Seq Scan          14746.988 ms6554 trgm_regex  azjunk7
x[aeiouy]{,2}q          0 Bitmap Heap Scan    194.024 ms
 
6543 HEAD        azjunk7  x[aeiouy]{,10}q          0 Seq Scan          21648.192 ms6554 trgm_regex  azjunk7
x[aeiouy]{,10}q         0 Bitmap Heap Scan    208.955 ms
 
6543 HEAD        azjunk7  x[aeiouy]{1,2}q     228677 Seq Scan          13359.817 ms6554 trgm_regex  azjunk7
x[aeiouy]{1,2}q    228677 Seq Scan          13358.769 ms
 
6543 HEAD        azjunk7  x[aeiouy]{1,3}q     236512 Seq Scan          13191.587 ms6554 trgm_regex  azjunk7
x[aeiouy]{1,3}q    236512 Seq Scan          13504.745 ms
 
6543 HEAD        azjunk7  x[aeiouy]{1,4}q     238061 Seq Scan          13756.733 ms6554 trgm_regex  azjunk7
x[aeiouy]{1,4}q    238061 Seq Scan          13929.557 ms
 
6543 HEAD        azjunk7  x[aeiouy]{2,4}q      48681 Seq Scan          13766.984 ms6554 trgm_regex  azjunk7
x[aeiouy]{2,4}q     48681 Seq Scan          14135.231 ms
 
6543 HEAD        azjunk7  x[aeiouy]{3,4}q       9602 Seq Scan          13429.259 ms6554 trgm_regex  azjunk7
x[aeiouy]{3,4}q      9602 Seq Scan          13539.163 ms
 
6543 HEAD        azjunk7  x[aeiouy]{1,5}q     238407 Seq Scan          13863.009 ms6554 trgm_regex  azjunk7
x[aeiouy]{1,5}q    238407 Seq Scan          14091.161 ms
 
6543 HEAD        azjunk7  x[aeiouy]{2,5}q      49031 Seq Scan          14006.522 ms6554 trgm_regex  azjunk7
x[aeiouy]{2,5}q     49031 Seq Scan          14105.039 ms
 
6543 HEAD        azjunk7  x[aeiouy]{4,5}q       1935 Seq Scan          13718.130 ms6554 trgm_regex  azjunk7
x[aeiouy]{4,5}q      1935 Seq Scan          14032.751 ms 


(You asked also for testing against real text, I'll probably some of that too (although I do not
expect all that many differences).


Thanks, great work!

Erik Rijkers





Re: WIP: index support for regexp search

From
Heikki Linnakangas
Date:
One thing that bothers me with this algoritm is that the overflow 
mechanism is all-or-nothing. In many cases, even when there is a huge 
number of states in the diagram, you could still extract at least a few 
trigrams that must be present in any matching string, with little 
effort. At least, it seems like that to a human :-).

For example, consider this:

explain analyze select count(*) from azjunk4 where txt ~ 

('^aabaacaadaaeaafaagaahaaiaajaakaalaamaanaaoaapaaqaaraasaataauaavaawaaxaayaazabaabbabcabdabeabfabgabhabiabjabkablabmabnaboabpabqabrabsabtabuabvabwabxabyabzacaacbaccacdaceacfacgachaciacjackaclacmacnacoacpacqacracsactacuacvacwacxacyaczadaadbadcaddadeadfadgadhadiadjadkadladmadnadoadpadqadradsadtaduadvadwadxadyadzaeaaebaecaedaeeaefaegaehaeiaejaekaelaemaenaeoaepaeqaeraesaetaeuaevaewaexaeyaezafaafbafcafdafeaffafgafhafiafjafkaflafmafnafoafpafqafrafsaftafuafvafwafxafyafzagaagbagcagdageagfaggaghagiagjagkaglagmagnagoagpagqagragsagtaguagvagwagxagyagzahaahbahcahdaheahfahgahhahiahjahkahlahmahnahoahpahqahrahs$');

you get a query plan like this (the long regexp string edited out):
 Aggregate  (cost=228148.02..228148.03 rows=1 width=0) (actual 
time=131.100..131
.101 rows=1 loops=1)   ->  Bitmap Heap Scan on azjunk4  (cost=228144.01..228148.02 rows=1 
width=0) (
actual time=131.096..131.096 rows=0 loops=1)         Recheck Cond: (txt ~ <ridiculously long regexp>)         Rows
Removedby Index Recheck: 10000         ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx 
 
(cost=0.00..228144
.01 rows=1 width=0) (actual time=82.914..82.914 rows=10000 loops=1)               Index Cond: (txt ~ <ridiculously long
regexp>)Total runtime: 131.230 ms
 
(7 rows)

That ridiculously long string exceeds the number of states (I think, 
could be number of paths or arcs too), and the algorithm gives up, 
resorting to scanning the whole index as can be seen by the "Rows 
Removed by Index Recheck" line. However, it's easy to see that any 
matching string must contain *any* of the possible trigrams the 
algorithm extracts. If it could safely return just a few of them, say 
"aab" and "abz", and discard the rest, that would already be much better 
than a full index scan.

Would it be safe to simply stop short the depth-first search on 
overflow, and proceed with the graph that was constructed up to that point?

- Heikki



Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
One thing that bothers me with this algoritm is that the overflow mechanism is all-or-nothing. In many cases, even when there is a huge number of states in the diagram, you could still extract at least a few trigrams that must be present in any matching string, with little effort. At least, it seems like that to a human :-).

For example, consider this:

explain analyze select count(*) from azjunk4 where txt ~ ('^aabaacaadaaeaafaagaahaaiaajaakaalaamaanaaoaapaaqaaraasaataauaavaawaaxaayaazabaabbabcabdabeabfabgabhabiabjabkablabmabnaboabpabqabrabsabtabuabvabwabxabyabzacaacbaccacdaceacfacgachaciacjackaclacmacnacoacpacqacracsactacuacvacwacxacyaczadaadbadcaddadeadfadgadhadiadjadkadladmadnadoadpadqadradsadtaduadvadwadxadyadzaeaaebaecaedaeeaefaegaehaeiaejaekaelaemaenaeoaepaeqaeraesaetaeuaevaewaexaeyaezafaafbafcafdafeaffafgafhafiafjafkaflafmafnafoafpafqafrafsaftafuafvafwafxafyafzagaagbagcagdageagfaggaghagiagjagkaglagmagnagoagpagqagragsagtaguagvagwagxagyagzahaahbahcahdaheahfahgahhahiahjahkahlahmahnahoahpahqahrahs$');

you get a query plan like this (the long regexp string edited out):

 Aggregate  (cost=228148.02..228148.03 rows=1 width=0) (actual time=131.100..131
.101 rows=1 loops=1)
   ->  Bitmap Heap Scan on azjunk4  (cost=228144.01..228148.02 rows=1 width=0) (
actual time=131.096..131.096 rows=0 loops=1)
         Recheck Cond: (txt ~ <ridiculously long regexp>)
         Rows Removed by Index Recheck: 10000
         ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..228144
.01 rows=1 width=0) (actual time=82.914..82.914 rows=10000 loops=1)
               Index Cond: (txt ~ <ridiculously long regexp>)
 Total runtime: 131.230 ms
(7 rows)

That ridiculously long string exceeds the number of states (I think, could be number of paths or arcs too), and the algorithm gives up, resorting to scanning the whole index as can be seen by the "Rows Removed by Index Recheck" line. However, it's easy to see that any matching string must contain *any* of the possible trigrams the algorithm extracts. If it could safely return just a few of them, say "aab" and "abz", and discard the rest, that would already be much better than a full index scan.

Would it be safe to simply stop short the depth-first search on overflow, and proceed with the graph that was constructed up to that point?

For depth-first it's not. But your proposal naturally makes sense. I've changed it to breadth-first search. And then it's safe to mark all unprocessed states as final when overflow. It means that we assume every unprocessed branch to immediately finish with matching (this can give us more false positives but no false negatives).
For overflow of matrix collection, it's safe to do just OR between all the trigrams.
New version of patch is attached.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
Hi!

On Thu, Nov 29, 2012 at 12:58 PM, er <er@xs4all.nl> wrote:
On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:

> trgm-regexp-0.6.patch.gz

I ran the simple-minded tests against generated data (similar to the ones I did in January 2012).
The problems of that older version seem pretty much all removed. (although I didn't do much work
on it -- just reran these tests).

Thanks a lot for testing! Could you repeat for 0.7 version of patch which has new overflow handling?

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Fri, Nov 30, 2012 at 3:20 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
For depth-first it's not. 

Oh, I didn't explained it. 
In order to stop graph processing we need to be sure that we put all outgoing arcs from state or assume that state to be final. In DFS we can be in the final part of graph producing but still didn't add some arc (with new trigram) from initial state directly to the final state. It obviously leads to false negatives.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Heikki Linnakangas
Date:
On 30.11.2012 13:20, Alexander Korotkov wrote:
> On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas@vmware.com
>> wrote:
>
>> Would it be safe to simply stop short the depth-first search on overflow,
>> and proceed with the graph that was constructed up to that point?
>
> For depth-first it's not. But your proposal naturally makes sense. I've
> changed it to breadth-first search. And then it's safe to mark all
> unprocessed states as final when overflow. It means that we assume every
> unprocessed branch to immediately finish with matching (this can give us
> more false positives but no false negatives).
> For overflow of matrix collection, it's safe to do just OR between all the
> trigrams.
> New version of patch is attached.

Thanks, sounds good.

I've spent quite a long time trying to understand the transformation the 
getState/addKeys/addAcrs functions do to the original CNFA graph. I 
think that still needs more comments to explain the steps involved in it.

One thing that occurs to me is that it might be better to delay 
expanding colors to characters. You could build and transform the graph, 
and even create the paths, all while operating on colors. You would end 
up with lists of "color trigrams", consisting of sequences of three 
colors that must appear in the source string. Only at the last step you 
would expand the color trigrams to real character trigrams. I think that 
would save a lot of processing while building the graph, if you have 
colors that contain many characters. It would allow us to do better than 
the fixed small MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS 
= 4 in the current patch, it's a shame that we can't do anything with a 
fairly simple regexp like "^a[b-g]h$"

- Heikki



Re: WIP: index support for regexp search

From
"Erik Rijkers"
Date:
On Fri, November 30, 2012 12:22, Alexander Korotkov wrote:
> Hi!
>
> On Thu, Nov 29, 2012 at 12:58 PM, er <er@xs4all.nl> wrote:
>
>> On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:
>>
>>
>> I ran the simple-minded tests against generated data (similar to the ones
>> I did in January 2012).
>> The problems of that older version seem pretty much all removed. (although
>> I didn't do much work
>> on it -- just reran these tests).
>>
>
> Thanks a lot for testing! Could you repeat for 0.7 version of patch which
> has new overflow handling?
>

I've attached a similar test re-run that compares HEAD with patch versions 0.6, and 0.7.


Erik Rijkers

Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 30.11.2012 13:20, Alexander Korotkov wrote:
On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas@vmware.com
wrote:

Would it be safe to simply stop short the depth-first search on overflow,
and proceed with the graph that was constructed up to that point?

For depth-first it's not. But your proposal naturally makes sense. I've
changed it to breadth-first search. And then it's safe to mark all
unprocessed states as final when overflow. It means that we assume every
unprocessed branch to immediately finish with matching (this can give us
more false positives but no false negatives).
For overflow of matrix collection, it's safe to do just OR between all the
trigrams.
New version of patch is attached.

Thanks, sounds good.

I've spent quite a long time trying to understand the transformation the getState/addKeys/addAcrs functions do to the original CNFA graph. I think that still needs more comments to explain the steps involved in it.

One thing that occurs to me is that it might be better to delay expanding colors to characters. You could build and transform the graph, and even create the paths, all while operating on colors. You would end up with lists of "color trigrams", consisting of sequences of three colors that must appear in the source string. Only at the last step you would expand the color trigrams to real character trigrams. I think that would save a lot of processing while building the graph, if you have colors that contain many characters. It would allow us to do better than the fixed small MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current patch, it's a shame that we can't do anything with a fairly simple regexp like "^a[b-g]h$"

Nice idea to delay expanding colors to characters! Obviously, we should delay expanding inly alphanumerical characters. Because non-alphanumberical characters influence graph structure. Trying to implement...

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Sat, Dec 1, 2012 at 3:22 PM, Erik Rijkers <er@xs4all.nl> wrote:
On Fri, November 30, 2012 12:22, Alexander Korotkov wrote:
> Hi!
>
> On Thu, Nov 29, 2012 at 12:58 PM, er <er@xs4all.nl> wrote:
>
>> On Mon, November 26, 2012 20:49, Alexander Korotkov wrote:
>>
>>
>> I ran the simple-minded tests against generated data (similar to the ones
>> I did in January 2012).
>> The problems of that older version seem pretty much all removed. (although
>> I didn't do much work
>> on it -- just reran these tests).
>>
>
> Thanks a lot for testing! Could you repeat for 0.7 version of patch which
> has new overflow handling?
>

I've attached a similar test re-run that compares HEAD with patch versions 0.6, and 0.7.

Thanks! Did you write scripts for automated testing? I would be nice if you share them.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> Nice idea to delay expanding colors to characters! Obviously, we should
> delay expanding inly alphanumerical characters. Because non-alphanumberical
> characters influence graph structure. Trying to implement...

Uh, why would that be?  Colors are colors.  The regexp machinery doesn't
care whether they represent alphanumerics or not.  (Or to be more
precise, if there is a situation where it makes a difference, separate
colors will have been created for each set of characters that need to be
distinguished.)
        regards, tom lane



Re: WIP: index support for regexp search

From
"Erik Rijkers"
Date:
On Sun, December 2, 2012 19:07, Alexander Korotkov wrote:
>>
>> I've attached a similar test re-run that compares HEAD with patch versions
>> 0.6, and 0.7.
>>
>
> Thanks! Did you write scripts for automated testing? I would be nice if you
> share them.
>

Sure, here they are.

The perl program does depend a bit on my particular setup (it reads the port from the
postgresql.conf of each instance, and the script knows the data_dir locations), but I suppose it's
easy enough to remove that.

(Just hardcode the %instances hash, instead of calling instances().)

If you need help to get them to run I can 'generalise' them, but for now I'll send them as they are.

Erik Rijkers



Attachment

Re: WIP: index support for regexp search

From
Heikki Linnakangas
Date:
On 02.12.2012 20:19, Tom Lane wrote:
> Alexander Korotkov<aekorotkov@gmail.com>  writes:
>> Nice idea to delay expanding colors to characters! Obviously, we should
>> delay expanding inly alphanumerical characters. Because non-alphanumberical
>> characters influence graph structure. Trying to implement...
>
> Uh, why would that be?  Colors are colors.  The regexp machinery doesn't
> care whether they represent alphanumerics or not.  (Or to be more
> precise, if there is a situation where it makes a difference, separate
> colors will have been created for each set of characters that need to be
> distinguished.)

The regexp machinery doesn't care, but the trigrams that pg_trgm 
extracts only contain alphanumeric characters. So if by looking at the 
CNFA graph produced by the regexp machinery you conclude that any 
matching strings must contain three-letter sequences "%oo" and "#oo", 
you can just club them together into " oo" trigram.

I think you can run a pre-processing step to the colors, and merge 
colors that are equivalent as far as trigrams are considered. For 
example, if you have a color that contains only character '%', and 
another that contains character '#', you can treat them as the same 
hcolor. You might then be able to simplify the CNFA. Actually, it would 
be even better if you could apply the pre-processing to the regexp 
before the regexp machinery turns it into a CNFA. Not sure how easy it 
would be to do such pre-processing.

BTW, why create the path matrix? You could check the "check" array of 
trigrams in the consistent function directly against the graph. 
Consistent should return true, iff there is a path through the graph 
following only arcs that contain trigrams present in the check array. 
Finding a path through a complex graph could be expensive, O(|E|), but 
if the path is complex, the path matrix would be large as well, and 
checking against a large matrix isn't exactly free either. It would 
allow you to avoid "overflows" caused by having too many paths through 
the graph.

- Heikki



Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Mon, Dec 3, 2012 at 2:05 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 02.12.2012 20:19, Tom Lane wrote:
Alexander Korotkov<aekorotkov@gmail.com>  writes:
Nice idea to delay expanding colors to characters! Obviously, we should
delay expanding inly alphanumerical characters. Because non-alphanumberical
characters influence graph structure. Trying to implement...

Uh, why would that be?  Colors are colors.  The regexp machinery doesn't
care whether they represent alphanumerics or not.  (Or to be more
precise, if there is a situation where it makes a difference, separate
colors will have been created for each set of characters that need to be
distinguished.)

The regexp machinery doesn't care, but the trigrams that pg_trgm extracts only contain alphanumeric characters. So if by looking at the CNFA graph produced by the regexp machinery you conclude that any matching strings must contain three-letter sequences "%oo" and "#oo", you can just club them together into " oo" trigram.

I think you can run a pre-processing step to the colors, and merge colors that are equivalent as far as trigrams are considered. For example, if you have a color that contains only character '%', and another that contains character '#', you can treat them as the same hcolor. You might then be able to simplify the CNFA. Actually, it would be even better if you could apply the pre-processing to the regexp before the regexp machinery turns it into a CNFA. Not sure how easy it would be to do such pre-processing.
 
Treating colors as same should be possible only for colors which has no alphanumeric characters, because colors are non-overlapping. However, this optimization could be significant in some cases.

BTW, why create the path matrix? You could check the "check" array of trigrams in the consistent function directly against the graph. Consistent should return true, iff there is a path through the graph following only arcs that contain trigrams present in the check array. Finding a path through a complex graph could be expensive, O(|E|), but if the path is complex, the path matrix would be large as well, and checking against a large matrix isn't exactly free either. It would allow you to avoid "overflows" caused by having too many paths through the graph.

Actually, I generally dislike path matrix for same reasons. But:
1) Output graphs could contain trigrams which are completely useless for search. For example, for regex /(abcdefgh)*ijk/ we need only "ijk" trigram while graph would contain much more.Path matrix is a method to get rid of all of them.
2) If we use color trigrams then we need some criteria for which color trigrams to expand into trigrams. Simultaneously, we shouldn't allow path from initial state to the final by unexpanded trigrams. It seems much harder to do with graph than with matrix.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Actually, I generally dislike path matrix for same reasons. But:
1) Output graphs could contain trigrams which are completely useless for search. For example, for regex /(abcdefgh)*ijk/ we need only "ijk" trigram while graph would contain much more.Path matrix is a method to get rid of all of them.
2) If we use color trigrams then we need some criteria for which color trigrams to expand into trigrams. Simultaneously, we shouldn't allow path from initial state to the final by unexpanded trigrams. It seems much harder to do with graph than with matrix.

Now, I have an idea about doing some not comprehensive but simple and fast simplification of graph. I'm doing experiments now. In case of success we could get rid of path matrix.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Fri, Dec 14, 2012 at 1:34 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, Dec 3, 2012 at 4:31 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Actually, I generally dislike path matrix for same reasons. But:
1) Output graphs could contain trigrams which are completely useless for search. For example, for regex /(abcdefgh)*ijk/ we need only "ijk" trigram while graph would contain much more.Path matrix is a method to get rid of all of them.
2) If we use color trigrams then we need some criteria for which color trigrams to expand into trigrams. Simultaneously, we shouldn't allow path from initial state to the final by unexpanded trigrams. It seems much harder to do with graph than with matrix.

Now, I have an idea about doing some not comprehensive but simple and fast simplification of graph. I'm doing experiments now. In case of success we could get rid of path matrix.

Attached patch have following changes:
1) Postphone expansion of colors. Graph are building on color trigrams.
2) Selective expansion of color trigrams into simple trigrams. All non-expanded color trigrams are removed. Such removal leads to union of all states pairs connected with corresponding arcs. Surely, this must no lead to union of initial and final states: that could do all previous work senseless.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
"Erik Rijkers"
Date:
On Sun, December 16, 2012 22:25, Alexander Korotkov wrote:

> trgm-regexp-0.8.patch.gz   22 k

Hi Alexander,

I gave this a quick try; the patch works when compiled for DEBUG, but crashes as a
'speed'-compiled binary:

Compile for speed:

$ pg_config --configure
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8' '--with-pgport=6556'
'--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'

$ psql
psql (9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761)
Type "help" for help.

testdb=# explain analyze select txt from azjunk5 where txt ~ 'x[aeiouy]{2,5}q';
The connection to the server was lost. Attempting reset: Failed.
!> \q


log after such:
-----------------
2012-12-17 09:31:02.337 CET 15801 LOG:  server process (PID 16903) was terminated by signal 11:
Segmentation fault
2012-12-17 09:31:02.337 CET 15801 DETAIL:  Failed process was running: explain analyze select txt
from azjunk5 where txt ~ 'x[aeiouy]{2,5}q';
2012-12-17 09:31:02.347 CET 15801 LOG:  terminating any other active server processes
2012-12-17 09:31:02.348 CET 17049 FATAL:  the database system is in recovery mode
2012-12-17 09:31:02.722 CET 15801 LOG:  all server processes terminated; reinitializing
2012-12-17 09:31:03.506 CET 17052 LOG:  database system was interrupted; last known up at
2012-12-17 09:26:00 CET
2012-12-17 09:31:03.693 CET 17052 LOG:  database system was not properly shut down; automatic
recovery in progress
2012-12-17 09:31:04.493 CET 17052 LOG:  record with zero length at 2/7E3C7588
2012-12-17 09:31:04.494 CET 17052 LOG:  redo is not required
2012-12-17 09:31:06.940 CET 15801 LOG:  database system is ready to accept connections
----------------





A debug-compile with below options runs OK (so far):

Compile for debug:

$ pg_config --configure
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8b' '--with-pgport=6560'
'--enable-depend' '--enable-cassert' '--enable-debug' '--with-openssl' '--with-perl'
'--with-libxml'

which does show some speed gain, I think.  When I have time I'll post comparisons between HEAD,
versions 6, 7, 8.  (BTW, is v6 still interesting?)


Thanks,

Erik Rijkers









Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
Hi!

On Mon, Dec 17, 2012 at 12:54 PM, Erik Rijkers <er@xs4all.nl> wrote:
On Sun, December 16, 2012 22:25, Alexander Korotkov wrote:

> trgm-regexp-0.8.patch.gz   22 k

Hi Alexander,

I gave this a quick try; the patch works when compiled for DEBUG, but crashes as a
'speed'-compiled binary:

Compile for speed:

$ pg_config --configure
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.trgm_regex8' '--with-pgport=6556'
'--enable-depend' '--with-openssl' '--with-perl' '--with-libxml'

$ psql
psql (9.3devel-trgm_regex8-20121216_2336-c299477229559d4ee7db68720d86d3fb391db761)
Type "help" for help.

testdb=# explain analyze select txt from azjunk5 where txt ~ 'x[aeiouy]{2,5}q';
The connection to the server was lost. Attempting reset: Failed.
!> \q

Didn't reproduce it yet. Can you retry it with this line uncommented:
#define TRGM_REGEXP_DEBUG
Then we can see which stage it fails.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Mon, Dec 17, 2012 at 1:16 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Didn't reproduce it yet. Can you retry it with this line uncommented:
#define TRGM_REGEXP_DEBUG
Then we can see which stage it fails.

Bug is found and fixed in attached patch.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
"Erik Rijkers"
Date:
On Tue, December 18, 2012 08:04, Alexander Korotkov wrote:

> trgm-regexp-0.9.patch.gz   22 k

Hi.

I ran the same test again: HEAD versus trgm_regex v6, 7 and 9.  In v9 there is some gain but also
some regression.

It remains a difficult problem...

If I get some time in the holidays I'll try to diversify the test program; it is now too simple.


Thanks,

Erik Rijkers
Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Tue, Dec 18, 2012 at 11:45 AM, Erik Rijkers <er@xs4all.nl> wrote:
On Tue, December 18, 2012 08:04, Alexander Korotkov wrote:
I ran the same test again: HEAD versus trgm_regex v6, 7 and 9.  In v9 there is some gain but also
some regression.

It remains a difficult problem...

If I get some time in the holidays I'll try to diversify the test program; it is now too simple.

Note, that regexes which contains {,n} are likely not what do you expect.

test=# select 'xq' ~ 'x[aeiou]{,2}q';
 ?column? 
----------
 f
(1 row)

test=# select 'xa{,2}q' ~ 'x[aeiou]{,2}q';
 ?column? 
----------
 t
(1 row)

You should use {0,n} to express from 0 to n occurences.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
"Erik Rijkers"
Date:
On Tue, December 18, 2012 09:45, Alexander Korotkov wrote:
>
> You should use {0,n} to express from 0 to n occurences.
>


Thanks, but I know that of course.  It's a testing program; and in the end robustness with
unexpected or even wrong input is as important as performance.  (to put it bluntly, I am also
trying to get your patch to fall over ;-))

Thanks,

Erik Rijkers




Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Tue, Dec 18, 2012 at 12:51 PM, Erik Rijkers <er@xs4all.nl> wrote:
On Tue, December 18, 2012 09:45, Alexander Korotkov wrote:
>
> You should use {0,n} to express from 0 to n occurences.
>


Thanks, but I know that of course.  It's a testing program; and in the end robustness with
unexpected or even wrong input is as important as performance.  (to put it bluntly, I am also
trying to get your patch to fall over ;-))

I found most of regressions in 0.9 version to be in {,n} cases. New version of patch use more of trigrams than previous versions.
For example for regex 'x[aeiou]{,2}q'.
In 0.7 version we use trigrams '__2', '_2_' and '__q'.
In 0.9 version we use trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_', '__2', '_2_' and '__q'.

But, actually trigram '__2' or '_2_' never occurs. It enough to have one of them, all others are just causing a slowdown. Simultaneously, we can't decide reasonably which trigrams to use without knowing their frequencies. For example, if trigrams 'xa_', 'xe_', 'xi_', 'xo_', 'xu_' were altogether more rare than '__2', newer version of patch would be faster.


------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Heikki Linnakangas
Date:
On 18.12.2012 09:04, Alexander Korotkov wrote:
> Bug is found and fixed in attached patch.

I finally got around to look at this. I like this new version, without
the path matrix, much better.

I spend quite some time honing the code and comments, trying to organize
it so that it's easier to understand. In particular, I divided the
processing more clearly into four separate stages, and added comments
indicating which functions and which fields in the structs are needed in
which state.

I understand the other stages fairly well now, but the transformation
from the source CNFA form into the transformed graph is still a black
box to me. The addKeys/addArcs functions still need more explanation.
Can you come up with some extra comments or refactoring to clarify those?

I'd like to see a few more regression test cases, to cover the various
overflow cases. In particular, I built with --enable-coverage and ran
"make installcheck", and it looks like the state merging code isn't
exercised at all. Report attached.

To visualize the graphs, I rewrote the debugging print* functions to
write the graphs in graphviz .dot format. That helped a lot. See
attached graphs, generated from the regexp '^(abc|def)(ghi|jk[lmn])$'.

There's still a lot of cleanup to do, I'm going to continue working on
this tomorrow, but wanted to shared what I have this far.

- Heikki

Attachment

Re: WIP: index support for regexp search

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> I finally got around to look at this. I like this new version, without 
> the path matrix, much better.

I looked through this version too.  I have some notes/issues:

The biggest problem is that I really don't care for the idea of
contrib/pg_trgm being this cozy with the innards of regex_t.  Sooner
or later we are going to want to split the regex code off again as a
standalone library, and we *must* have a cleaner division of labor if
that is ever to happen.  Not sure what a suitable API would look like
though.

I think the assumption that all MB characters fit in 4 bytes is
unacceptable; someday we'll want to support wider Unicode characters
than we do now, and this code seems utterly unable to handle it.  It's
especially bad that the code isn't even bothering to defend itself
against the possibility of wider characters.

Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version
and an upgrade script.

Comments and documentation still need a lot of copy-editing, also I
think a lot of the comment blocks will not survive pg_indent.  It'd be
a good idea to run trgm_regexp.c through pg_indent as soon as you have
that fixed.

New file trgm_regexp.c lacks a copyright notice

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
enough; need to pass through the actual collation for the regex
operator.

How deep can the recursion in activateState() go?  Is this a practical
production approach at all?  Do we need a stack depth check there?
addKeys is also recursive, same questions.  (But on the other hand, the
check_stack_depth in scanColorMap seems useless, since its recursion
depth is fixed.)

Not too happy with convertPgWchar: aside from hard-wired, unchecked
assumption about maximum length of pg_wchar2mb_with_len result, why is
it that this is doing a lowercase conversion?  Surely the regex stuff
dealt with that already?

I'm suspicious of the code in addKeys() that is special-casing bos[1]
and eos[1] but not the other cases (BOL/EOL).  My recollection from
working on the fixed-prefix stuff is that it's not always obvious which
of those states gets used.

strnlen() used in fillTrgm() is probably not portable, and is definitely
not used anywhere else in Postgres.

This isn't a complete review, just some stuff I happened to notice in
one quick pass over the code.
        regards, tom lane



Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
Hi!

Some quick answers to the part of notes/issues. I will provide rest of answers soon.

On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The biggest problem is that I really don't care for the idea of
contrib/pg_trgm being this cozy with the innards of regex_t.  Sooner
or later we are going to want to split the regex code off again as a
standalone library, and we *must* have a cleaner division of labor if
that is ever to happen.  Not sure what a suitable API would look like
though.

The only option I see now is to provide a method like "export_cnfa" which would export corresponding CNFA in fixed format.
 
I think the assumption that all MB characters fit in 4 bytes is
unacceptable; someday we'll want to support wider Unicode characters
than we do now, and this code seems utterly unable to handle it.  It's
especially bad that the code isn't even bothering to defend itself
against the possibility of wider characters.

In attached patch I introduce MAX_MULTIBYTE_CHARACTER_LENGTH macro and use it in type definition. Is this way ok? 
 
Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version
and an upgrade script.

Fixed.
 
Comments and documentation still need a lot of copy-editing, also I
think a lot of the comment blocks will not survive pg_indent.  It'd be
a good idea to run trgm_regexp.c through pg_indent as soon as you have
that fixed.

Fixed.
 
New file trgm_regexp.c lacks a copyright notice

Fixed. 

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
enough; need to pass through the actual collation for the regex
operator.

We have collation passed to gin_extract_query in ginscan.c. I noticed that gincost_pattern don't pass collation to gin_extract_query. Is it a bug? Anyway this is not collation we need. We need collation used in operator clause. In attached patch I introduce additional argument to gin_extract_query which represent collation of operator clause. Do you think it is reasonable change in GIN interface? If so, I will provide it as separate patch.
 
Not too happy with convertPgWchar: aside from hard-wired, unchecked
assumption about maximum length of pg_wchar2mb_with_len result, why is
it that this is doing a lowercase conversion?  Surely the regex stuff
dealt with that already?

Trigrams are already lowercased. We can simplify our calculations by excluding uppercased characters from consideration. 

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
Heikki Linnakangas
Date:
On 23.01.2013 09:36, Alexander Korotkov wrote:
> Hi!
>
> Some quick answers to the part of notes/issues. I will provide rest of
> answers soon.
>
> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>
>> The biggest problem is that I really don't care for the idea of
>> contrib/pg_trgm being this cozy with the innards of regex_t.  Sooner
>> or later we are going to want to split the regex code off again as a
>> standalone library, and we *must* have a cleaner division of labor if
>> that is ever to happen.  Not sure what a suitable API would look like
>> though.
>
> The only option I see now is to provide a method like "export_cnfa" which
> would export corresponding CNFA in fixed format.

Yeah, I think that makes sense. The transformation code in trgm_regexp.c 
would probably be more readable too, if it didn't have to deal with the 
regex guts representation of the CNFA. Also, once you have intermediate 
representation of the original CNFA, you could do some of the 
transformation work on that representation, before building the 
"tranformed graph" containing trigrams. You could eliminate any 
non-alphanumeric characters, joining states connected by arcs with 
non-alphanumeric characters, for example.

- Heikki



Re: WIP: index support for regexp search

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 23.01.2013 09:36, Alexander Korotkov wrote:
>> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>> The biggest problem is that I really don't care for the idea of
>>> contrib/pg_trgm being this cozy with the innards of regex_t.

>> The only option I see now is to provide a method like "export_cnfa" which
>> would export corresponding CNFA in fixed format.

> Yeah, I think that makes sense. The transformation code in trgm_regexp.c 
> would probably be more readable too, if it didn't have to deal with the 
> regex guts representation of the CNFA. Also, once you have intermediate 
> representation of the original CNFA, you could do some of the 
> transformation work on that representation, before building the 
> "tranformed graph" containing trigrams. You could eliminate any 
> non-alphanumeric characters, joining states connected by arcs with 
> non-alphanumeric characters, for example.

It's not just the CNFA though; the other big API problem is with mapping
colors back to characters.  Right now, that not only knows way too much
about a part of the regex internals we have ambitions to change soon,
but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
which should be known to the regex library IMO.  So I'm not sure how we
divvy that up sanely.  To be clear: I'm not going to insist that we have
to have a clean API factorization before we commit this at all.  But it
worries me if we don't even know how we could get to that, because we
are going to need it eventually.

Anyway, I had another thought in the shower this morning, which is that
solving this problem in terms of color trigrams is really the Wrong
Thing.  ISTM it'd be better to think of the CNFA traversal logic as
looking for required sequences of colors of unspecified length, which
we'd then chop into trigrams after the fact.  This might produce
slightly better (more complete) trigram sets, but the real reason I'm
suggesting it is that I think the CNFA traversal code might be subject
to Polya's Inventor's Paradox: "the more general problem may be easier
to solve".  It seems like casting the goal of that code as being to
find variable-length sequences, rather than exactly trigrams, might
lead to simpler data structures and more readable algorithms.  The
still-to-be-designed regex API for this also seems like it would be
better off if decoupled from the notion of trigrams.

It's quite possible that this idea is all wet and no meaningful
improvement can be gotten this way.  But I offer it for your
consideration.
        regards, tom lane



Re: WIP: index support for regexp search

From
"Erik Rijkers"
Date:
On Wed, January 23, 2013 08:36, Alexander Korotkov wrote:
> Hi!
>
> Some quick answers to the part of notes/issues. I will provide rest of
> answers soon.
>
[...]
> trgm-regexp-0.10.patch.gz    27 k

Trying to build this I get, after 'make install' in contrib/ :

/usr/bin/install: cannot stat `./pg_trgm--1.1.sql': No such file or directory
/usr/bin/install: cannot stat `./pg_trgm--1.0--1.1.sql': No such file or directory
make[1]: *** [install] Error 1
make: *** [install-pg_trgm-recurse] Error 2













Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:

On Fri, Jan 25, 2013 at 11:47 AM, Erik Rijkers <er@xs4all.nl> wrote:
On Wed, January 23, 2013 08:36, Alexander Korotkov wrote:
> Hi!
>
> Some quick answers to the part of notes/issues. I will provide rest of
> answers soon.
>
[...]
> trgm-regexp-0.10.patch.gz    27 k

Trying to build this I get, after 'make install' in contrib/ :

/usr/bin/install: cannot stat `./pg_trgm--1.1.sql': No such file or directory
/usr/bin/install: cannot stat `./pg_trgm--1.0--1.1.sql': No such file or directory
make[1]: *** [install] Error 1
make: *** [install-pg_trgm-recurse] Error 2

Forgot to include these files into patch.

Another changes in new version of patch:
1) Get rid of recursion in trigramsMatchGraph and addKeys.
2) Both bos[0] and eos[0] are also included into checks.
3) Get rid of strnlen.
4) I found the way to fix work with collation in previous version of patch to be wrong. Collation of operator must match collation of indexed column for index scan. Only thing to fix is passing collation in gincost_pattern.
5) Some tests were added which improves the coverage.

Now I'm working on additional comments.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Sun, Jan 27, 2013 at 10:40 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Now I'm working on additional comments.

Some comments were added for addKey and addArc(s). I hope they clarify something.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 23.01.2013 09:36, Alexander Korotkov wrote:
>> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>> The biggest problem is that I really don't care for the idea of
>>> contrib/pg_trgm being this cozy with the innards of regex_t.

>> The only option I see now is to provide a method like "export_cnfa" which
>> would export corresponding CNFA in fixed format.

> Yeah, I think that makes sense. The transformation code in trgm_regexp.c
> would probably be more readable too, if it didn't have to deal with the
> regex guts representation of the CNFA. Also, once you have intermediate
> representation of the original CNFA, you could do some of the
> transformation work on that representation, before building the
> "tranformed graph" containing trigrams. You could eliminate any
> non-alphanumeric characters, joining states connected by arcs with
> non-alphanumeric characters, for example.

It's not just the CNFA though; the other big API problem is with mapping
colors back to characters.  Right now, that not only knows way too much
about a part of the regex internals we have ambitions to change soon,
but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
which should be known to the regex library IMO.  So I'm not sure how we
divvy that up sanely.  To be clear: I'm not going to insist that we have
to have a clean API factorization before we commit this at all.  But it
worries me if we don't even know how we could get to that, because we
are going to need it eventually.

Now, we probably don't have enough of time before 9.3 to solve an API problem :(. It's likely we have to choose either commit to 9.3 without clean API factorization or postpone it to 9.4.

------
With best regards,
Alexander Korotkov. 

Re: WIP: index support for regexp search

From
Stephen Frost
Date:
* Alexander Korotkov (aekorotkov@gmail.com) wrote:
> Now, we probably don't have enough of time before 9.3 to solve an API
> problem :(. It's likely we have to choose either commit to 9.3 without
> clean API factorization or postpone it to 9.4.

As much as I'd like this to get in, I don't think there's a question
here- it should be punted to 9.4.  These don't look like small issues to
me.
Thanks,
    Stephen

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 23.01.2013 09:36, Alexander Korotkov wrote:
>> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>> The biggest problem is that I really don't care for the idea of
>>> contrib/pg_trgm being this cozy with the innards of regex_t.

>> The only option I see now is to provide a method like "export_cnfa" which
>> would export corresponding CNFA in fixed format.

> Yeah, I think that makes sense. The transformation code in trgm_regexp.c
> would probably be more readable too, if it didn't have to deal with the
> regex guts representation of the CNFA. Also, once you have intermediate
> representation of the original CNFA, you could do some of the
> transformation work on that representation, before building the
> "tranformed graph" containing trigrams. You could eliminate any
> non-alphanumeric characters, joining states connected by arcs with
> non-alphanumeric characters, for example.

It's not just the CNFA though; the other big API problem is with mapping
colors back to characters.  Right now, that not only knows way too much
about a part of the regex internals we have ambitions to change soon,
but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
which should be known to the regex library IMO.  So I'm not sure how we
divvy that up sanely.  To be clear: I'm not going to insist that we have
to have a clean API factorization before we commit this at all.  But it
worries me if we don't even know how we could get to that, because we
are going to need it eventually.

Now I have following idea about API.
Put code of stage 2 (transform the original CNFA into an automaton-like graph) into regex engine. It would use API which describes what exactly are we going to extract from CNFA. This API could look like this.

typedef char *Prefix;
typedef char *ArcLabel;

typedef struct
{
Prefix newPrefix;
ArcLabel label;
} ArcInfo;

typedef struct
{
Prefix (*getInitialPrefix) ();
bool (*prefixContains) (Prefix prefix1, Prefix prefix2);
Prefix * (*getPrefixes) (Prefix prefix, color c, int *n);
ArcInfo * (*getArcs) (Prefix prefix, color c, int *n);
void (*freePrefix) (Prefix prefix);
void (*freeArcLabel) (ArcLabel arcLabel);
} CFNATransformAPI;

getInitialPrefix returns initial prefix value like now this code does:
> initkey.prefix.colors[0] = UNKNOWN_COLOR;
> initkey.prefix.colors[1] = UNKNOWN_COLOR;
prefixContains are exactly same as function with this name.
getPrefixes and getArcs cycle step work of addKeys an addArcs.
freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel strutures.

Additionally regex engine should provide correct way to examine colormap.
int getColorCharsCount(colormap *cm, color c);
pg_wchar *getColorChars(colormap *cm, color c);
getColorCharsCount would return -1 if this color should be considered as unexpandable.

Any thoughts?

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Thu, Mar 14, 2013 at 9:40 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 23.01.2013 09:36, Alexander Korotkov wrote:
>> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>> The biggest problem is that I really don't care for the idea of
>>> contrib/pg_trgm being this cozy with the innards of regex_t.

>> The only option I see now is to provide a method like "export_cnfa" which
>> would export corresponding CNFA in fixed format.

> Yeah, I think that makes sense. The transformation code in trgm_regexp.c
> would probably be more readable too, if it didn't have to deal with the
> regex guts representation of the CNFA. Also, once you have intermediate
> representation of the original CNFA, you could do some of the
> transformation work on that representation, before building the
> "tranformed graph" containing trigrams. You could eliminate any
> non-alphanumeric characters, joining states connected by arcs with
> non-alphanumeric characters, for example.

It's not just the CNFA though; the other big API problem is with mapping
colors back to characters.  Right now, that not only knows way too much
about a part of the regex internals we have ambitions to change soon,
but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
which should be known to the regex library IMO.  So I'm not sure how we
divvy that up sanely.  To be clear: I'm not going to insist that we have
to have a clean API factorization before we commit this at all.  But it
worries me if we don't even know how we could get to that, because we
are going to need it eventually.

Now I have following idea about API.
Put code of stage 2 (transform the original CNFA into an automaton-like graph) into regex engine. It would use API which describes what exactly are we going to extract from CNFA. This API could look like this.

typedef char *Prefix;
typedef char *ArcLabel;

typedef struct
{
Prefix newPrefix;
ArcLabel label;
} ArcInfo;

typedef struct
{
Prefix (*getInitialPrefix) ();
bool (*prefixContains) (Prefix prefix1, Prefix prefix2);
Prefix * (*getPrefixes) (Prefix prefix, color c, int *n);
ArcInfo * (*getArcs) (Prefix prefix, color c, int *n);
void (*freePrefix) (Prefix prefix);
void (*freeArcLabel) (ArcLabel arcLabel);
} CFNATransformAPI;

getInitialPrefix returns initial prefix value like now this code does:
> initkey.prefix.colors[0] = UNKNOWN_COLOR;
> initkey.prefix.colors[1] = UNKNOWN_COLOR;
prefixContains are exactly same as function with this name.
getPrefixes and getArcs cycle step work of addKeys an addArcs.
freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel strutures.

Additionally regex engine should provide correct way to examine colormap.
int getColorCharsCount(colormap *cm, color c);
pg_wchar *getColorChars(colormap *cm, color c);
getColorCharsCount would return -1 if this color should be considered as unexpandable.

Now I have working implemetation of this API. Comments still need rework. Could you give me any feedback?

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> Now I have working implemetation of this API. Comments still need rework.
> Could you give me any feedback?

I looked at this a little bit, but it's not very far along at all
towards resolving my API worries.  The basic point that I'm concerned
about is that we would like to split off the regex library (ie,
backend/regex/) as a standalone project someday.  There are already
some problems to be resolved to make that possible, and I don't want
this patch to introduce new ones.  From that standpoint, the proposed
regextract.c file is a disaster, because it depends on a boatload of
Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
to say nothing of palloc).  We can't consider that to be on the regex
side of the fence.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
into regex/regex.h is completely not the right thing for preserving a
clear library boundary (even positing that we want to expose that
function outside adt/regexp.c, which I'd rather we didn't).

Given that you've already got a notion of callbacks provided by
contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
into those callbacks, so that the heavy-duty data structures like the
hash table live over there and the API exposed by backend/regex/ is at
a much simpler level than what you have here.  But right now I don't
see any usable library API here.

Perhaps we could avoid these issues by defining a library API that
provides accessors like these for the opaque regex_t struct:

* get the number of states in the CNFA

* get the numbers of the initial and final states

* get the number of out-arcs for the N'th state

* get the out-arcs for the N'th state into a caller-provided array
(sized using the previous function), where each out-arc is represented
by a color and an end-state

* get the number of character codes represented by color C

* get the wchar codes for color C into a caller-provided array

(The reason for letting the caller allocate the result arrays is so we
can use palloc for that; if we allocate it in backend/regex/ we must
use malloc, which will greatly increase the risk of leakages.  Also,
as far as the color API goes, the above lets the caller decide how
many characters is "too many" to bother with.)
        regards, tom lane



Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> Now I have working implemetation of this API. Comments still need rework.
> Could you give me any feedback?

I looked at this a little bit, but it's not very far along at all
towards resolving my API worries.  The basic point that I'm concerned
about is that we would like to split off the regex library (ie,
backend/regex/) as a standalone project someday.  There are already
some problems to be resolved to make that possible, and I don't want
this patch to introduce new ones.  From that standpoint, the proposed
regextract.c file is a disaster, because it depends on a boatload of
Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
to say nothing of palloc).  We can't consider that to be on the regex
side of the fence.
 
Now I can see your position about API much more clearly. Previously I thought only about algorithmic side of things.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
into regex/regex.h is completely not the right thing for preserving a
clear library boundary (even positing that we want to expose that
function outside adt/regexp.c, which I'd rather we didn't).

Given that you've already got a notion of callbacks provided by
contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
into those callbacks, so that the heavy-duty data structures like the
hash table live over there and the API exposed by backend/regex/ is at
a much simpler level than what you have here.  But right now I don't
see any usable library API here.

Perhaps we could avoid these issues by defining a library API that
provides accessors like these for the opaque regex_t struct:

* get the number of states in the CNFA

* get the numbers of the initial and final states

* get the number of out-arcs for the N'th state

* get the out-arcs for the N'th state into a caller-provided array
(sized using the previous function), where each out-arc is represented
by a color and an end-state

* get the number of character codes represented by color C

* get the wchar codes for color C into a caller-provided array

(The reason for letting the caller allocate the result arrays is so we
can use palloc for that; if we allocate it in backend/regex/ we must
use malloc, which will greatly increase the risk of leakages.  Also,
as far as the color API goes, the above lets the caller decide how
many characters is "too many" to bother with.)

I like the this idea. Seems like clear and not over-engineered API. I can implement it. Could you propose something particular to do with RE_compile_and_cache in this patch? With this API we still need a way to get regex_t or equivalent from string.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Similarly, pushing PG-specific declarations like RE_compile_and_cache()
>> into regex/regex.h is completely not the right thing for preserving a
>> clear library boundary (even positing that we want to expose that
>> function outside adt/regexp.c, which I'd rather we didn't).
>> 
>> Perhaps we could avoid these issues by defining a library API that
>> provides accessors like these for the opaque regex_t struct:
>> 
>> * get the number of states in the CNFA
>> 
>> * get the numbers of the initial and final states
>> 
>> * get the number of out-arcs for the N'th state
>> 
>> * get the out-arcs for the N'th state into a caller-provided array
>> (sized using the previous function), where each out-arc is represented
>> by a color and an end-state
>> 
>> * get the number of character codes represented by color C
>> 
>> * get the wchar codes for color C into a caller-provided array
>> 
>> (The reason for letting the caller allocate the result arrays is so we
>> can use palloc for that; if we allocate it in backend/regex/ we must
>> use malloc, which will greatly increase the risk of leakages.  Also,
>> as far as the color API goes, the above lets the caller decide how
>> many characters is "too many" to bother with.)

> I like the this idea. Seems like clear and not over-engineered API. I can
> implement it. Could you propose something particular to do with
> RE_compile_and_cache in this patch? With this API we still need a way to
> get regex_t or equivalent from string.

Well, the brute force answer is for pg_trgm to not go through
RE_compile_and_cache at all, but just call the regex library for itself.
That would lose the ability to cache regex compilations, but I'm not
sure we care.  The motivation for RE_compile_and_cache is mainly to
prevent having to compile the regex again for *every row*, which is what
would otherwise happen in a simple SELECT using a regex function.
Unless I'm misunderstanding something, pg_trgm would only need to
compile the regex once per indexscan, which is probably tolerable.

Another approach we could take is for adt/regexp.c to expose a function
that collects the deconstructed CNFA data into a palloc'd structure
using the above-proposed functions.  This'd be a reasonable way to go if
we decide the caching behavior is actually needed for this purpose; but
otherwise it seems like it's just involving one more module than
necessary.

BTW, one reason I don't like exposing RE_compile_and_cache directly is
that then you have the question of how long is the regex_t it hands back
good for.  The caller doesn't own that struct, and at some point
regexp.c is likely to recycle it, so it's a bit risky to just assume
it'll stay around "long enough" without any further interaction with
regexp.c.
        regards, tom lane



Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Mon, Mar 25, 2013 at 2:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Similarly, pushing PG-specific declarations like RE_compile_and_cache()
>> into regex/regex.h is completely not the right thing for preserving a
>> clear library boundary (even positing that we want to expose that
>> function outside adt/regexp.c, which I'd rather we didn't).
>>
>> Perhaps we could avoid these issues by defining a library API that
>> provides accessors like these for the opaque regex_t struct:
>>
>> * get the number of states in the CNFA
>>
>> * get the numbers of the initial and final states
>>
>> * get the number of out-arcs for the N'th state
>>
>> * get the out-arcs for the N'th state into a caller-provided array
>> (sized using the previous function), where each out-arc is represented
>> by a color and an end-state
>>
>> * get the number of character codes represented by color C
>>
>> * get the wchar codes for color C into a caller-provided array
>>
>> (The reason for letting the caller allocate the result arrays is so we
>> can use palloc for that; if we allocate it in backend/regex/ we must
>> use malloc, which will greatly increase the risk of leakages.  Also,
>> as far as the color API goes, the above lets the caller decide how
>> many characters is "too many" to bother with.)

> I like the this idea. Seems like clear and not over-engineered API. I can
> implement it. Could you propose something particular to do with
> RE_compile_and_cache in this patch? With this API we still need a way to
> get regex_t or equivalent from string.

Well, the brute force answer is for pg_trgm to not go through
RE_compile_and_cache at all, but just call the regex library for itself.
That would lose the ability to cache regex compilations, but I'm not
sure we care.  The motivation for RE_compile_and_cache is mainly to
prevent having to compile the regex again for *every row*, which is what
would otherwise happen in a simple SELECT using a regex function.
Unless I'm misunderstanding something, pg_trgm would only need to
compile the regex once per indexscan, which is probably tolerable.

For me it makes sense.
Patch with proposed API implementation is attached.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
"Erikjan Rijkers"
Date:
On Mon, April 1, 2013 23:15, Alexander Korotkov wrote:

[trgm-regexp-0.14.patch.gz]

Hi Alexander,

Something went wrong in this version of the patch: many (most) queries that were earlier
spectacularly fast have become slow, often slower than a seqscan or only marginally faster. See
the attached numbers; it compares head(seqscan) with trgm-regex patch versions 13 and 14.

I did not even complete the test-run because version 14 is so clearly inferior to 13 (and earlier,
as far as I can remember).

(let me know if you want the whole result, I can run it overnight)


Thanks,


Erik Rijkers

Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Wed, Apr 3, 2013 at 12:36 AM, Erikjan Rijkers <er@xs4all.nl> wrote:
On Mon, April 1, 2013 23:15, Alexander Korotkov wrote:

[trgm-regexp-0.14.patch.gz]

Hi Alexander,

Hi Erik!
 
Something went wrong in this version of the patch: many (most) queries that were earlier
spectacularly fast have become slow, often slower than a seqscan or only marginally faster. See
the attached numbers; it compares head(seqscan) with trgm-regex patch versions 13 and 14.

I did not even complete the test-run because version 14 is so clearly inferior to 13 (and earlier,
as far as I can remember).

(let me know if you want the whole result, I can run it overnight)

Yes, there was bug in new color map scan implementation. It was fixed in attached version. Could you re-run tests please?
Attached version of patch also addresses problem that we now using API it's not correct to use special values of color number. It introduces additional structure ExtendedColor which contain ExtendedColorClass field for handling additional "virtual" colors introduced by analysis.

------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
"Erikjan Rijkers"
Date:
On Tue, April 2, 2013 23:54, Alexander Korotkov wrote:

> [trgm-regexp-0.15.patch.gz]

Yes, it does look good now; Attached a list of measurements. Most of the searches that I put in
that test-program are now speeded up very much.

There still are a few regressions, for example:

HEAD          azjunk6  x[aeiou]{4,5}q          83  Seq Scan          1393.465 ms
trgm_regex15  azjunk6  x[aeiou]{4,5}q          83  Bitmap Heap Scan  1728.319 ms

HEAD          azjunk7  x[aeiou]{1,3}q      190031  Seq Scan         16819.555 ms
trgm_regex15  azjunk7  x[aeiou]{1,3}q      190031  Bitmap Heap Scan 21286.804 ms

Not exactly negligible, and ideally those regressions would be removed but with the huge
advantages for other cases I'd say it's worth it.

hth,

Erik Rijkers




Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Wed, Apr 3, 2013 at 11:10 AM, Erikjan Rijkers <er@xs4all.nl> wrote:
On Tue, April 2, 2013 23:54, Alexander Korotkov wrote:

> [trgm-regexp-0.15.patch.gz]

Yes, it does look good now; Attached a list of measurements. Most of the searches that I put in
that test-program are now speeded up very much.

There still are a few regressions, for example:

HEAD          azjunk6  x[aeiou]{4,5}q          83  Seq Scan          1393.465 ms
trgm_regex15  azjunk6  x[aeiou]{4,5}q          83  Bitmap Heap Scan  1728.319 ms

HEAD          azjunk7  x[aeiou]{1,3}q      190031  Seq Scan         16819.555 ms
trgm_regex15  azjunk7  x[aeiou]{1,3}q      190031  Bitmap Heap Scan 21286.804 ms

Not exactly negligible, and ideally those regressions would be removed but with the huge
advantages for other cases I'd say it's worth it.

Thank you for testing!
Exploring results more detail I found version 13 to be buggy. This version is a dead end, we have quite different API now. Could you use v12 instead of v13 in comparison, please?
Sometimes we have regression in comparison with head in two reasons:
1) We select index scan in both cases but with patch we spent more time for analysis. It's inevitable disadvantage of any index. We can only take care of analysis doesn't take too long. Current testing results don't show this reason to be significant.
2) Sometimes we select index scan while sequential scan would be faster. It's also inevitable disadvantage until we have a relevant statistics. We now have similar situation, for example, with in-core geometrical search and LIKE/ILIKE search in pg_trgm. However,  probably, situation could be improved somehow even without such statistics. But I think we can do such conclusion based on synthetical testing, because improvements for synthetical cases could appear to be an worsening for real-life cases.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> [ trgm-regexp-0.15.patch.gz ]

I spent the weekend hacking on this, making a number of bug fixes and a
whole lot of cosmetic changes.  I think there are large parts of this
that are in committable shape now, but I still find the actual graph
transformation logic to be mostly unintelligible.  I think what's most
obscure is the distinction between the arcs list and the keys list of
each state in the expanded graph.  I get the impression that the
general idea is for the arcs to represent exactly-known transitions
while the keys represent imprecisely-known transitions ... but there
seems to be at least some leakage between those categories.  Could
you write down a specification for what's supposed to be happening
there?

            regards, tom lane


Attachment

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> [ trgm-regexp-0.15.patch.gz ]

I spent the weekend hacking on this, making a number of bug fixes and a
whole lot of cosmetic changes.  I think there are large parts of this
that are in committable shape now, but I still find the actual graph
transformation logic to be mostly unintelligible.  I think what's most
obscure is the distinction between the arcs list and the keys list of
each state in the expanded graph.  I get the impression that the
general idea is for the arcs to represent exactly-known transitions
while the keys represent imprecisely-known transitions ... but there
seems to be at least some leakage between those categories.  Could
you write down a specification for what's supposed to be happening
there?

Here is my try to specify it.
At first some notions. I know, they are already in the comments, but just in order to put it together.

Extended color - any color of source CNFA or one of two special values:
1) Unknown color - may represent any character either alphanumeric or non-alphanumeric.
2) Blank color - may represent any non-alphanumeric character
Prefix is extended colors of last two characters read by CNFA.
Key is pair of CNFA state and prefix. So, key is a extended state which is containing additional information which can influence further trigrams.

So, if you are in some key and traverse some CNFA arc then you moves info another key. But there are two possible cases (or, sometimes, both of them):
1) Your move from one key into another necessary means read of some trigram. Then you create new arc labeled with that trigram.
2) You move into another key, but you doesn't necessary read an useful trigram. For example, arc of source CNFA is labeled by "unextractable" color. Then you add new key into "keys" array. And outgoing arcs from this key will also be processed similarly to source key. Therefore "keys" array is a set of keys which are achievable from "stateKey" without reading of useful trigram.
We could get rid of "keys" array and produce some "empty arcs" in the second case. You can imagine that "keys" array is set of keys which are achivable by "empty arcs" from "stateKey". States connected with "empty arcs" could be merged on the next stage.
However, the reason of having separated addKeys stage is optimization. In addKey function more generals keys absorb less general ones. In many cases resulting graph becomes much simplier because of this. For example, if your regex is not prefix, then engine puts self-referencing arcs of all possible colors to initial state. Straight-forward processing of this could produce enormous output graph. I had similar situation in early version of patch where keys didn't absorb earch other. 

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I spent the weekend hacking on this, making a number of bug fixes and a
>> whole lot of cosmetic changes.  I think there are large parts of this
>> that are in committable shape now, but I still find the actual graph
>> transformation logic to be mostly unintelligible.  I think what's most
>> obscure is the distinction between the arcs list and the keys list of
>> each state in the expanded graph.  I get the impression that the
>> general idea is for the arcs to represent exactly-known transitions
>> while the keys represent imprecisely-known transitions ... but there
>> seems to be at least some leakage between those categories.  Could
>> you write down a specification for what's supposed to be happening
>> there?

> Here is my try to specify it.

Thanks.  I hacked on this some more and committed it.  I found a number
of bugs along the way with respect to handling of word boundaries
(partially-blank transition trigrams) and EOL-color ($) handling.
I think it's all fixed now but it could definitely use some more
study and testing.

One issue that bothered me is that the regression tests really don't
provide much visibility into what the code is doing.  Some of the bugs
had to do with failing to generate expected trigrams, for instance
col ~ 'foo bar' only generating trigram "foo" and not "bar".  This still
led to getting the right answer, so the error was invisible as far as the
tests were concerned.  Is it worth thinking of a way to expose what the
extract function did at SQL level, so we could test more carefully?
        regards, tom lane



Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
On Tue, Apr 9, 2013 at 9:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Mon, Apr 8, 2013 at 9:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I spent the weekend hacking on this, making a number of bug fixes and a
>> whole lot of cosmetic changes.  I think there are large parts of this
>> that are in committable shape now, but I still find the actual graph
>> transformation logic to be mostly unintelligible.  I think what's most
>> obscure is the distinction between the arcs list and the keys list of
>> each state in the expanded graph.  I get the impression that the
>> general idea is for the arcs to represent exactly-known transitions
>> while the keys represent imprecisely-known transitions ... but there
>> seems to be at least some leakage between those categories.  Could
>> you write down a specification for what's supposed to be happening
>> there?

> Here is my try to specify it.

Thanks.  I hacked on this some more and committed it.  I found a number
of bugs along the way with respect to handling of word boundaries
(partially-blank transition trigrams) and EOL-color ($) handling.
I think it's all fixed now but it could definitely use some more
study and testing.
 
Great, thanks! I also will do some more testing.

One issue that bothered me is that the regression tests really don't
provide much visibility into what the code is doing.  Some of the bugs
had to do with failing to generate expected trigrams, for instance
col ~ 'foo bar' only generating trigram "foo" and not "bar".  This still
led to getting the right answer, so the error was invisible as far as the
tests were concerned.  Is it worth thinking of a way to expose what the
extract function did at SQL level, so we could test more carefully?

Yes, I also had similar idea. But, I think we need some relatively stable representation of resulting graph in order to expose it. There could be a lot of equivalent graphs. Some changes in implementation could lead to change from one equivalent graph to another. It would be better to not rewrite tests in this case. Ideally, we should expose some representation which is the same for all equivalent graphs. However, it doesn't seem to be realistic. But, I think we could at least make it stable to order sequence of states and color trigrams. Another option I see is to expose just set of trigrams. It doesn't have completeness of information, but it is quite stable.

------
With best regards,
Alexander Korotkov.

Re: WIP: index support for regexp search

From
Alexander Korotkov
Date:
I found you committed GiST index implementation. That's cool.
I found an easy way to optimize it. We can also use trigramsMatchGraph for signatures. Attached patch contains implementation.
Simple example in order to demonstrate it:

Before the patch:

test=# explain (analyze, buffers) select * from words where s ~ '[abc]def';
                                                        QUERY PLAN                                  
                      
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on words  (cost=4.36..40.24 rows=10 width=9) (actual time=17.189..17.193 rows=3 loops=1)
   Recheck Cond: (s ~ '[abc]def'::text)
   Buffers: shared hit=858
   ->  Bitmap Index Scan on words_trgm_idx  (cost=0.00..4.36 rows=10 width=0) (actual time=17.172..17.172 rows=3 loops=1)
         Index Cond: (s ~ '[abc]def'::text)
         Buffers: shared hit=857
 Total runtime: 17.224 ms
(7 rows)

After the patch:

test=# explain (analyze, buffers) select * from words where s ~ '[abc]def';
                                                        QUERY PLAN                                  
                      
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on words  (cost=4.36..40.24 rows=10 width=9) (actual time=13.718..13.721 rows=3 loops=1)
   Recheck Cond: (s ~ '[abc]def'::text)
   Buffers: shared hit=498
   ->  Bitmap Index Scan on words_trgm_idx  (cost=0.00..4.36 rows=10 width=0) (actual time=13.701..13.701 rows=3 loops=1)
         Index Cond: (s ~ '[abc]def'::text)
         Buffers: shared hit=497
 Total runtime: 13.786 ms
(7 rows)


------
With best regards,
Alexander Korotkov.
Attachment

Re: WIP: index support for regexp search

From
David Fetter
Date:
On Mon, Apr 15, 2013 at 05:53:41PM +0400, Alexander Korotkov wrote:
> I found you committed GiST index implementation. That's cool.
> I found an easy way to optimize it. We can also use trigramsMatchGraph for
> signatures. Attached patch contains implementation.
> Simple example in order to demonstrate it:
> 
> Before the patch:
>          Buffers: shared hit=*857*
> After the patch:
>          Buffers: shared hit=*497*

Neato!

Inside the patch, s/monotonous/monotone/, I think.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: WIP: index support for regexp search

From
Tom Lane
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> I found you committed GiST index implementation. That's cool.
> I found an easy way to optimize it. We can also use trigramsMatchGraph for
> signatures. Attached patch contains implementation.

Good idea, committed.
        regards, tom lane