Re: Allow ILIKE forward matching to use btree index - Mailing list pgsql-hackers

From Yugo Nagata
Subject Re: Allow ILIKE forward matching to use btree index
Date
Msg-id 20241224160442.5852d16fcc64cdfc8238c325@sraoss.co.jp
Whole thread Raw
List pgsql-hackers
On Fri, 20 Dec 2024 03:22:26 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> Hi,
> 
> Currently, btree indexes cannot used for ILIKE (~~*) operator if the pattern
> has case-varying characters although LIKE (~~) expression can be converted
> to indexable clauses by the planner support function (if the collation
> is "C" or operator class 'text_pattern_ops' is used).
> 
> For example, "t ~~ '123foo%'" is converted to "(t >= '123foo' AND t < '123fop')" 
> and index scan can be used for this condition. On the other hand, "t ~~* '123foo'"
> cannot be converted and sequential scan is used. 
> 
> Even in this case, we can use a bitmap index scan for the condition 
> "(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')" followed by
> recheck by the original condition "t ~~* '123foo'", and this could be faster
> than seqscan. 
> 
> However, even if the support function would return OR clauses, the current
> planner implementation cannot not build bitmap scan paths using them. 
> 
> 
> The attached patches allow to ILIKE (~~*) forward matching to use btree index. 
> 
> The patch 0001 enables get_index_clause_from_support to receive OR clauses
> from support functions and use them to build bitmap index scan later. OR clauses
> returned by support functions are collected in the new argument 'orclauses" of
> match_restriction_clauses_to_index(), and they are passed to
> generate_bitmap_or_paths() later to build bitmap scan paths.
> 
> The patch 0002 modifies the support function to return OR clauses as an example
> above when ILIKE's pattern has case-varying characters in forward matching. The
> OR clauses contains two index conditions for the upper and the lower letter of
> the first case-varying character in the pattern respectively. Although the
> subsequent characters are not considered in the index scans, it could enough be
> faster then sequent scan.
> 
> Here is an example. 
> 
> 1. Create a table with random text records
> 
> =# CREATE TABLE tbl (t text);
> =# INSERT INTO tbl SELECT CASE WHEN i%2=1 THEN upper(x) ELSE x END
>                   FROM (SELECT i, md5(i::text) x FROM generate_series(1,5000000) i);
> 
> 2. Create an index
> =# CREATE INDEX ON tbl (t text_pattern_ops);
> 
> 3. Before applying patch: Seq Scan was selected. It takes about 4 sec.
> 
> =# EXPLIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';

I am sorry, the example in my previous main was wrong. I showed the plan
with enable_index_scan = off. Indead, if the pattern starts with
case-insensitive characters like '12', an index scan can be used even with
ILIKE. 

postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
                                                       QUERY PLAN

 

-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..69129.61 rows=500 width=33) (actual time=36.317..4034.770 rows=1188 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=99 read=41939
   ->  Parallel Seq Scan on tbl  (cost=0.00..68079.61 rows=208 width=33) (actual time=19.908..4023.668 rows=396
loops=3)
         Filter: (t ~~* 'abc%'::text)
         Rows Removed by Filter: 1666271
         Buffers: shared hit=99 read=41939
 Planning Time: 0.726 ms
 Execution Time: 4035.101 ms
(10 rows)

4. After applying patch: Bitmap Index Scan was selected. 

postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
                                                                          QUERY PLAN
                                     
 

--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=12563.66..58314.53 rows=500 width=33) (actual time=156.485..1266.789 rows=1188
loops=1)
   Recheck Cond: (((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text) AND (t ~~* 'abc%'::text)) OR ((t ~>=~ 'a'::text) AND (t ~<~
'b'::text)AND (t ~~* 'abc%'::text)))
 
   Filter: (t ~~* 'abc%'::text)
   Rows Removed by Filter: 311473
   Heap Blocks: exact=42010
   Buffers: shared hit=1 read=44600
   ->  BitmapOr  (cost=12563.66..12563.66 rows=297029 width=0) (actual time=136.979..136.980 rows=0 loops=1)
         Buffers: shared hit=1 read=2590
         ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71 rows=148515 width=0) (actual time=80.548..80.549
rows=158375loops=1)
 
               Index Cond: ((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text))
               Buffers: shared read=1272
         ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71 rows=148515 width=0) (actual time=56.427..56.427
rows=157042loops=1)
 
               Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text))
               Buffers: shared hit=1 read=1318
 Planning Time: 0.592 ms
 Execution Time: 1267.162 ms
(16 rows)

 
> What do you think about it?
> 
> I think another application of OR-clause returning support function might be
> allowing to use an index for NOT LIKE (!~~) expression  because, for example, 
> "t !~~ '123foo%'" could be converted to "(t < '123foo' OR t >= '123fooz')". 
> (The second condition is a bit loose but this would be safe and not problem
> since tuples are filtered by the original condition after the index scan.) 
> However, it would not very useful unless the distribution is much skew so that
> NOT LIKE expression's selectivity is enough small.


Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



pgsql-hackers by date:

Previous
From: wenhui qiu
Date:
Subject: adjust_limit_rows_costs algorithm
Next
From: Richard Guo
Date:
Subject: ERROR: corrupt MVNDistinct entry