Thread: Trigram (pg_trgm) GIN index not used

Trigram (pg_trgm) GIN index not used

From

Ivan Voras

Date:

21 February 2013, 14:52:15

Hello,

I have a table with the following structure:

nn=3D> \d documents
                             Table "public.documents"
    Column     |   Type   |                       Modifiers

---------------+----------+----------------------------------------------=
----------
 id            | integer  | not null default
nextval('documents_id_seq'::regclass)
 ctime         | integer  | not null default unix_ts(now())
 dtime         | integer  | not null
 title         | citext   | not null
 html_filename | text     | not null
 raw_data      | citext   | not null
 fts_data      | tsvector | not null
 tags          | text[]   |
 flags         | integer  | not null default 0
 dtype         | integer  | not null default 0
Indexes:
    "documents_pkey" PRIMARY KEY, btree (id)
    "documents_html_filename" UNIQUE, btree (html_filename)
    "documents_raw_data_trgm" gin (raw_data gin_trgm_ops)
    "documents_title_trgm" gin (title gin_trgm_ops)

I'd like to use pg_trgm for matching substrings case-insensitively, but
it doesn't seem to use the index:

nn=3D> explain select id,title from documents where raw_data ilike '%zagr=
eb%';
                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on documents  (cost=3D0.00..6648.73 rows=3D180 width=3D98)
   Filter: (raw_data ~~* '%zagreb%'::citext)
(2 rows)

nn=3D> explain select id,title from documents where raw_data like '%zagre=
b%';
                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on documents  (cost=3D0.00..6692.71 rows=3D181 width=3D98)
   Filter: (raw_data ~~ '%zagreb%'::citext)
(2 rows)

When I try to create a GIST index as advised by the comment at:
http://www.postgresonline.com/journal/archives/212-PostgreSQL-9.1-Trigram=
s-teaching-LIKE-and-ILIKE-new-tricks.html

I get the following error:
ERROR:  index row requires 10488 bytes, maximum size is 8191

What am I doing wrong?

Re: Trigram (pg_trgm) GIN index not used

From

Ivan Voras

Date:

21 February 2013, 15:06:27

On 21/02/2013 12:52, Ivan Voras wrote:

> I'd like to use pg_trgm for matching substrings case-insensitively, but=

> it doesn't seem to use the index:


As a sort-of followup, the '%' operator kind of works but takes
incredibly long time, and the selectivity estimates are completely wrong:=


nn=3D> vacuum analyze documents;
VACUUM
nn=3D> explain select id,title from documents where raw_data % 'zagreb';
                                       QUERY PLAN

-------------------------------------------------------------------------=
----------------
 Bitmap Heap Scan on documents  (cost=3D128.42..330.87 rows=3D54 width=3D=
108)
   Recheck Cond: ((raw_data)::text % 'zagreb'::text)
   ->  Bitmap Index Scan on documents_raw_data_trgm  (cost=3D0.00..128.40=

rows=3D54 width=3D0)
         Index Cond: ((raw_data)::text % 'zagreb'::text)
(4 rows)

nn=3D> explain analyze select id,title from documents where raw_data %
'zagreb';
                                                               QUERY
PLAN
-------------------------------------------------------------------------=
----------------------------------------------------------------
 Bitmap Heap Scan on documents  (cost=3D128.42..330.87 rows=3D54 width=3D=
108)
(actual time=3D98750.283..98750.283 rows=3D0 loops=3D1)
   Recheck Cond: ((raw_data)::text % 'zagreb'::text)
   ->  Bitmap Index Scan on documents_raw_data_trgm  (cost=3D0.00..128.40=

rows=3D54 width=3D0) (actual time=3D26.748..26.748 rows=3D51874 loops=3D1=
)
         Index Cond: ((raw_data)::text % 'zagreb'::text)
 Total runtime: 98750.623 ms
(5 rows)


There is no IO load during this query.

Re: Trigram (pg_trgm) GIN index not used

From

Merlin Moncure

Date:

21 February 2013, 17:23:38

On Thu, Feb 21, 2013 at 6:06 AM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 21/02/2013 12:52, Ivan Voras wrote:
>
>> I'd like to use pg_trgm for matching substrings case-insensitively, but
>> it doesn't seem to use the index:
>
>
> As a sort-of followup, the '%' operator kind of works but takes
> incredibly long time, and the selectivity estimates are completely wrong:
>
> nn=> vacuum analyze documents;
> VACUUM
> nn=> explain select id,title from documents where raw_data % 'zagreb';
>                                        QUERY PLAN
>
> -----------------------------------------------------------------------------------------
>  Bitmap Heap Scan on documents  (cost=128.42..330.87 rows=54 width=108)
>    Recheck Cond: ((raw_data)::text % 'zagreb'::text)
>    ->  Bitmap Index Scan on documents_raw_data_trgm  (cost=0.00..128.40
> rows=54 width=0)
>          Index Cond: ((raw_data)::text % 'zagreb'::text)
> (4 rows)
>
> nn=> explain analyze select id,title from documents where raw_data %
> 'zagreb';
>                                                                QUERY
> PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on documents  (cost=128.42..330.87 rows=54 width=108)
> (actual time=98750.283..98750.283 rows=0 loops=1)
>    Recheck Cond: ((raw_data)::text % 'zagreb'::text)
>    ->  Bitmap Index Scan on documents_raw_data_trgm  (cost=0.00..128.40
> rows=54 width=0) (actual time=26.748..26.748 rows=51874 loops=1)
>          Index Cond: ((raw_data)::text % 'zagreb'::text)
>  Total runtime: 98750.623 ms
> (5 rows)
>
>
> There is no IO load during this query.

pg_trgm is not really designed for indexing large documents, but for
fuzzy simple string (company name, address, etc) matching.   probably
better off with full text search.

merlin

Re: Trigram (pg_trgm) GIN index not used

From

Tom Lane

Date:

21 February 2013, 18:57:54

Ivan Voras <ivoras@freebsd.org> writes:
> I have a table with the following structure:
> ...
>  raw_data      | citext   | not null
> ...
>     "documents_raw_data_trgm" gin (raw_data gin_trgm_ops)

> I'd like to use pg_trgm for matching substrings case-insensitively, but
> it doesn't seem to use the index:

You're outsmarting yourself by using citext as the column datatype.
That causes "ilike" to be interpreted as a citext-specific operator,
which is not a member of the gin_trgm_ops operator class, so it doesn't
match this index.

I wonder whether we really need that citext-specific operator at all
... but in the meantime, if you need the column to be citext for some
other reason, I'd suggest making a gin index on raw_data::text and
then writing the query as raw_data::text ilike '%zagreb%'.

            regards, tom lane

Re: Trigram (pg_trgm) GIN index not used

From

Merlin Moncure

Date:

21 February 2013, 19:07:28

On Thu, Feb 21, 2013 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ivan Voras <ivoras@freebsd.org> writes:
>> I have a table with the following structure:
>> ...
>>  raw_data      | citext   | not null
>> ...
>>     "documents_raw_data_trgm" gin (raw_data gin_trgm_ops)
>
>> I'd like to use pg_trgm for matching substrings case-insensitively, but
>> it doesn't seem to use the index:
>
> You're outsmarting yourself by using citext as the column datatype.
> That causes "ilike" to be interpreted as a citext-specific operator,
> which is not a member of the gin_trgm_ops operator class, so it doesn't
> match this index.
>
> I wonder whether we really need that citext-specific operator at all
> ... but in the meantime, if you need the column to be citext for some
> other reason, I'd suggest making a gin index on raw_data::text and
> then writing the query as raw_data::text ilike '%zagreb%'.

hm, one more data point that citext implementation didn't succeed in
terms of abstracting you from case sensitivity issues.

merlin

Re: Trigram (pg_trgm) GIN index not used

From

Tom Lane

Date:

21 February 2013, 19:27:59

Merlin Moncure <mmoncure@gmail.com> writes:
> On Thu, Feb 21, 2013 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder whether we really need that citext-specific operator at all
>> ... but in the meantime, if you need the column to be citext for some
>> other reason, I'd suggest making a gin index on raw_data::text and
>> then writing the query as raw_data::text ilike '%zagreb%'.

> hm, one more data point that citext implementation didn't succeed in
> terms of abstracting you from case sensitivity issues.

I think this is just a bug and not a fundamental design flaw: it looks
to me like simply removing the citext-specific declarations of the regex
operators would fix the problem (because then it'd fall back on the
standard operators with an implicit cast to text, and that would match
the index).  Might cause a problem though for anybody who's got those
operators embedded in views.

            regards, tom lane