Thread: Wildcard search support for pg_trgm

Wildcard search support for pg_trgm

From
Alexander Korotkov
Date:
Hackers,

Here is first version of patch, which enable index support of wildcard search in pg_trgm contrib module. The idea of the patch is to extract from wildcard trigrams which should occurs in wildcard matching string. For example, for '%sector%' wildcard such trigrams would be: 'sec', 'ect', 'tor'.

create table words (word text);
copy words from '/usr/share/dict/american-english';

test=# explain analyze select * from words where word ilike '%independ%';
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on words  (cost=0.00..1703.11 rows=10 width=9) (actual time=18.818..174.146 rows=7 loops=1)
   Filter: (word ~~* '%independ%'::text)
 Total runtime: 174.200 ms
(3 rows)

CREATE INDEX trgm_idx ON words USING gist (word gist_trgm_ops);

test=# explain analyze select * from words where word ilike '%independ%';
                                                    QUERY PLAN                                                
    
------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on words  (cost=4.36..40.11 rows=10 width=9) (actual time=2.445..2.529 rows=7 loops=1)
   Recheck Cond: (word ~~* '%independ%'::text)
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..4.35 rows=10 width=0) (actual time=2.406..2.406 rows=7 loops=1)
         Index Cond: (word ~~* '%independ%'::text)
 Total runtime: 2.612 ms
(5 rows)

CREATE INDEX trgm_idx ON words USING gin (word gin_trgm_ops);

test=# explain analyze select * from words where word ilike '%independ%';
                                                    QUERY PLAN                                                
     
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on words  (cost=76.08..111.83 rows=10 width=9) (actual time=2.675..2.755 rows=7 loops=1)
   Recheck Cond: (word ~~* '%independ%'::text)
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..76.07 rows=10 width=0) (actual time=2.642..2.642 rows=7 loops=1)
         Index Cond: (word ~~* '%independ%'::text)
 Total runtime: 2.839 ms
(5 rows)

I've encountered with following problems:
1) Indexing support for ilike is possible only with case-insensetive wildcards, e.g. when IGNORECASE macro is enabled. But I can't use this macro in pg_trgm.sql.in, where list of operators is defined. Probably, is it enuogh to put comment near IGNORECASE, which tells that if one disable this macro he should also remove oparators from pg_trgm.sql.in?
2) I found gist index not very useful with default SIGLENINT = 3. I've changed this value to 15 and I found gist index performs very good on dictionary. But on longer strings greater values of SIGLENINT may be required (probably even SIGLENINT > 122 will give benefit in some cases in spite of TOAST).

----
With best regards,
Alexander Korotkov.
Attachment

Re: Wildcard search support for pg_trgm

From
Alexander Korotkov
Date:
I found another problem. GIN index suffers from "GIN indexes do not support whole-index scans" when no trigram can be extracted from pattern.

----
With best regards,
Alexander Korotkov.

Re: Wildcard search support for pg_trgm

From
Dimitri Fontaine
Date:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> Here is first version of patch, which enable index support of wildcard
> search in pg_trgm contrib module.

How different (and better) is it from wildspeed?
 http://www.sai.msu.su/~megera/wiki/wildspeed

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Wildcard search support for pg_trgm

From
Alexander Korotkov
Date:
On Mon, Dec 13, 2010 at 12:14 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
How different (and better) is it from wildspeed?

The general advantage is possibility of usage wildcard search and trigram similarity search using the same index.
I expect that GIN trigram index is slightly less space demanding, but slightly slower on search than wildspeed. Also, I expect GiST trigram index to be slower on search, but faster on updates. While I didn't check these assumptions in details.
I've lack of test datasets for sufficient testing, and I would like to ask community to help me with it. Because testing on dictionaries is good, but obviously not enough.

----
With best regards,
Alexander Korotkov.

Re: Wildcard search support for pg_trgm

From
Alexander Korotkov
Date:
I updated my patch to make it use full index scan in GIN index which is possible thanks to recent Tom Lane patch. Now wildcard, where no trigram can be extracted from, invokes full index scan, which is slow but correct.

test=# explain (analyze, buffers) select * from words where word ilike '%in%';
                                                 QUERY PLAN                                         
        
------------------------------------------------------------------------------------------------------------
 Seq Scan on words  (cost=0.00..1703.11 rows=15930 width=9) (actual time=0.333..225.817 rows=16558 loops=1)
   Filter: (word ~~* '%in%'::text)
   Buffers: shared read=471
 Total runtime: 248.207 ms
(4 rows)

test=# set enable_seqscan = off;
SET
test=# explain (analyze, buffers) select * from words where word ilike '%in%';
                                                           QUERY PLAN                               
                            
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on words  (cost=2287.46..2957.59 rows=15930 width=9) (actual time=122.239..331.993
 rows=16558 loops=1)
   Recheck Cond: (word ~~* '%in%'::text)
   Buffers: shared hit=472 read=1185
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..2283.48 rows=15930 width=0) (actual time=122.022..122.022 rows=98569 loops=1)
         Index Cond: (word ~~* '%in%'::text)
         Buffers: shared hit=1 read=1185
 Total runtime: 354.409 ms
(7 rows)

As an alternative solution I can propose to extract null item from every string and ivoke scan on that item instead of full index scan. It requires to store additional item per each string but it makes full scan fast. 

Also I found a segfault when execute the query above and switch enable_seqscan few times on line "*searchMode = GIN_SEARCH_MODE_ALL;". Is it a bug in GIN or I'm missing something?

Here goes backtrace from gdb:
#0  0xb4ead070 in gin_extract_query_trgm (fcinfo=0xbfcd8da8) at trgm_gin.c:112
#1  0x08323a84 in OidFunctionCall5 (functionId=32802, arg1=161269768, arg2=3217920208, arg3=4, 
    arg4=3217920204, arg5=3217920200) at fmgr.c:1687
#2  0x082c5654 in gincostestimate (fcinfo=0xbfcd9148) at selfuncs.c:6466
#3  0x083235d8 in OidFunctionCall9 (functionId=2741, arg1=161270176, arg2=161271296, 
    arg3=161824624, arg4=0, arg5=0, arg6=3217921064, arg7=3217921056, arg8=3217921048, 
    arg9=3217921040) at fmgr.c:1840
#4  0x081f3397 in cost_index (path=0x9a55050, root=0x99cc9a0, index=0x99cce00, 
    indexQuals=0x9a53f70, indexOrderBys=0x0, outer_rel=0x0) at costsize.c:268
#5  0x08216b66 in create_index_path (root=0x99cc9a0, index=0x99cce00, clause_groups=0x9a53f88, 
    indexorderbys=0x0, pathkeys=0x0, indexscandir=NoMovementScanDirection, outer_rel=0x0)
    at pathnode.c:511
#6  0x081f7ef5 in find_usable_indexes (root=<value optimized out>, rel=<value optimized out>, 
    clauses=<value optimized out>, outer_clauses=0x0, istoplevel=1 '\001', outer_rel=0x0, 
    saop_control=SAOP_FORBID, scantype=ST_ANYSCAN) at indxpath.c:422
#7  0x081f8e38 in create_index_paths (root=0x99cc9a0, rel=0x99ccc30) at indxpath.c:176
#8  0x081eec22 in set_plain_rel_pathlist (root=<value optimized out>, rel=<value optimized out>, 
    rti=<value optimized out>, rte=0x99cc650) at allpaths.c:262
#9  set_rel_pathlist (root=<value optimized out>, rel=<value optimized out>, 
    rti=<value optimized out>, rte=0x99cc650) at allpaths.c:202
#10 0x081efa55 in set_base_rel_pathlists (root=0x99cc9a0, joinlist=0x99ccde8) at allpaths.c:158
#11 make_one_rel (root=0x99cc9a0, joinlist=0x99ccde8) at allpaths.c:94
#12 0x08203ef7 in query_planner (root=0x99cc9a0, tlist=0x99ccb00, tuple_fraction=0, 
    limit_tuples=-1, cheapest_path=0xbfcd98cc, sorted_path=0xbfcd98c8, num_groups=0xbfcd98c0)
    at planmain.c:271
#13 0x08205b86 in grouping_planner (root=0x99cc9a0, tuple_fraction=0) at planner.c:1182
#14 0x08207609 in subquery_planner (glob=0x99cc910, parse=0x99cc5a0, parent_root=0x0, 
    hasRecursion=0 '\000', tuple_fraction=0, subroot=0xbfcd9a7c) at planner.c:536
#15 0x08207ca6 in standard_planner (parse=0x99cc5a0, cursorOptions=0, boundParams=0x0)
    at planner.c:201
#16 0x0825db11 in pg_plan_query (querytree=0x99cc5a0, cursorOptions=0, boundParams=0x0)
    at postgres.c:764
#17 0x0815a824 in ExplainOneQuery (stmt=0x9a258e0, 
    queryString=0x9a24c60 "explain (analyze, buffers) select * from words where word ilike '%in%';",---Type <return> to continue, or q <return> to quit---
 params=0x0, dest=0x9a32330) at explain.c:300
#18 ExplainQuery (stmt=0x9a258e0, 
    queryString=0x9a24c60 "explain (analyze, buffers) select * from words where word ilike '%in%';", params=0x0, dest=0x9a32330) at explain.c:209
#19 0x08261266 in PortalRunUtility (portal=0x9a4d6a8, utilityStmt=0x9a258e0, 
    isTopLevel=<value optimized out>, dest=0x9a32330, completionTag=0xbfcd9bcc "") at pquery.c:1191
#20 0x082622a4 in FillPortalStore (portal=0x9a4d6a8, isTopLevel=32 ' ') at pquery.c:1065
#21 0x0826281a in PortalRun (portal=0x9a4d6a8, count=2147483647, isTopLevel=-56 '\310', 
    dest=0x9a25ee8, altdest=0x9a25ee8, completionTag=0xbfcd9d9c "") at pquery.c:791
#22 0x0825ec78 in exec_simple_query (query_string=<value optimized out>) at postgres.c:1059
#23 0x0825fe01 in PostgresMain (argc=2, argv=0x99aeb68, username=0x99aead8 "smagen")
    at postgres.c:3939
#24 0x08229250 in BackendRun () at postmaster.c:3577
#25 BackendStartup () at postmaster.c:3262
#26 ServerLoop () at postmaster.c:1448
#27 0x0822be12 in PostmasterMain (argc=3, argv=0x99acc58) at postmaster.c:1109
#28 0x081ce3b7 in main (argc=3, argv=0x99acc58) at main.c:199

----
With best regards,
Alexander Korotkov.
Attachment

Re: Wildcard search support for pg_trgm

From
Jan Urbański
Date:
On 08/01/11 23:37, Alexander Korotkov wrote:
> I updated my patch to make it use full index scan in GIN index which is
> possible thanks to recent Tom Lane patch. Now wildcard, where no trigram can
> be extracted from, invokes full index scan, which is slow but correct.

Hi,

unfortunately, this change made the patch not apply:
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=be0c3ea2d30ba225f0249ae88d6b0bdf3b753162

I'm getting rejects in trgm_gin.c. Could you update the patch please?

Cheers,
Jan


Re: Wildcard search support for pg_trgm

From
Alexander Korotkov
Date:
Hi,

Here is updated version of this patch.

----
With best regards,
Alexander Korotkov.
Attachment