Thread: Trigram (pg_trgm) GIN index not used
Hello, I have a table with the following structure: nn=3D> \d documents Table "public.documents" Column | Type | Modifiers ---------------+----------+----------------------------------------------= ---------- id | integer | not null default nextval('documents_id_seq'::regclass) ctime | integer | not null default unix_ts(now()) dtime | integer | not null title | citext | not null html_filename | text | not null raw_data | citext | not null fts_data | tsvector | not null tags | text[] | flags | integer | not null default 0 dtype | integer | not null default 0 Indexes: "documents_pkey" PRIMARY KEY, btree (id) "documents_html_filename" UNIQUE, btree (html_filename) "documents_raw_data_trgm" gin (raw_data gin_trgm_ops) "documents_title_trgm" gin (title gin_trgm_ops) I'd like to use pg_trgm for matching substrings case-insensitively, but it doesn't seem to use the index: nn=3D> explain select id,title from documents where raw_data ilike '%zagr= eb%'; QUERY PLAN --------------------------------------------------------------- Seq Scan on documents (cost=3D0.00..6648.73 rows=3D180 width=3D98) Filter: (raw_data ~~* '%zagreb%'::citext) (2 rows) nn=3D> explain select id,title from documents where raw_data like '%zagre= b%'; QUERY PLAN --------------------------------------------------------------- Seq Scan on documents (cost=3D0.00..6692.71 rows=3D181 width=3D98) Filter: (raw_data ~~ '%zagreb%'::citext) (2 rows) When I try to create a GIST index as advised by the comment at: http://www.postgresonline.com/journal/archives/212-PostgreSQL-9.1-Trigram= s-teaching-LIKE-and-ILIKE-new-tricks.html I get the following error: ERROR: index row requires 10488 bytes, maximum size is 8191 What am I doing wrong?
On 21/02/2013 12:52, Ivan Voras wrote: > I'd like to use pg_trgm for matching substrings case-insensitively, but= > it doesn't seem to use the index: As a sort-of followup, the '%' operator kind of works but takes incredibly long time, and the selectivity estimates are completely wrong:= nn=3D> vacuum analyze documents; VACUUM nn=3D> explain select id,title from documents where raw_data % 'zagreb'; QUERY PLAN -------------------------------------------------------------------------= ---------------- Bitmap Heap Scan on documents (cost=3D128.42..330.87 rows=3D54 width=3D= 108) Recheck Cond: ((raw_data)::text % 'zagreb'::text) -> Bitmap Index Scan on documents_raw_data_trgm (cost=3D0.00..128.40= rows=3D54 width=3D0) Index Cond: ((raw_data)::text % 'zagreb'::text) (4 rows) nn=3D> explain analyze select id,title from documents where raw_data % 'zagreb'; QUERY PLAN -------------------------------------------------------------------------= ---------------------------------------------------------------- Bitmap Heap Scan on documents (cost=3D128.42..330.87 rows=3D54 width=3D= 108) (actual time=3D98750.283..98750.283 rows=3D0 loops=3D1) Recheck Cond: ((raw_data)::text % 'zagreb'::text) -> Bitmap Index Scan on documents_raw_data_trgm (cost=3D0.00..128.40= rows=3D54 width=3D0) (actual time=3D26.748..26.748 rows=3D51874 loops=3D1= ) Index Cond: ((raw_data)::text % 'zagreb'::text) Total runtime: 98750.623 ms (5 rows) There is no IO load during this query.
On Thu, Feb 21, 2013 at 6:06 AM, Ivan Voras <ivoras@freebsd.org> wrote: > On 21/02/2013 12:52, Ivan Voras wrote: > >> I'd like to use pg_trgm for matching substrings case-insensitively, but >> it doesn't seem to use the index: > > > As a sort-of followup, the '%' operator kind of works but takes > incredibly long time, and the selectivity estimates are completely wrong: > > nn=> vacuum analyze documents; > VACUUM > nn=> explain select id,title from documents where raw_data % 'zagreb'; > QUERY PLAN > > ----------------------------------------------------------------------------------------- > Bitmap Heap Scan on documents (cost=128.42..330.87 rows=54 width=108) > Recheck Cond: ((raw_data)::text % 'zagreb'::text) > -> Bitmap Index Scan on documents_raw_data_trgm (cost=0.00..128.40 > rows=54 width=0) > Index Cond: ((raw_data)::text % 'zagreb'::text) > (4 rows) > > nn=> explain analyze select id,title from documents where raw_data % > 'zagreb'; > QUERY > PLAN > ----------------------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on documents (cost=128.42..330.87 rows=54 width=108) > (actual time=98750.283..98750.283 rows=0 loops=1) > Recheck Cond: ((raw_data)::text % 'zagreb'::text) > -> Bitmap Index Scan on documents_raw_data_trgm (cost=0.00..128.40 > rows=54 width=0) (actual time=26.748..26.748 rows=51874 loops=1) > Index Cond: ((raw_data)::text % 'zagreb'::text) > Total runtime: 98750.623 ms > (5 rows) > > > There is no IO load during this query. pg_trgm is not really designed for indexing large documents, but for fuzzy simple string (company name, address, etc) matching. probably better off with full text search. merlin
Ivan Voras <ivoras@freebsd.org> writes: > I have a table with the following structure: > ... > raw_data | citext | not null > ... > "documents_raw_data_trgm" gin (raw_data gin_trgm_ops) > I'd like to use pg_trgm for matching substrings case-insensitively, but > it doesn't seem to use the index: You're outsmarting yourself by using citext as the column datatype. That causes "ilike" to be interpreted as a citext-specific operator, which is not a member of the gin_trgm_ops operator class, so it doesn't match this index. I wonder whether we really need that citext-specific operator at all ... but in the meantime, if you need the column to be citext for some other reason, I'd suggest making a gin index on raw_data::text and then writing the query as raw_data::text ilike '%zagreb%'. regards, tom lane
On Thu, Feb 21, 2013 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Ivan Voras <ivoras@freebsd.org> writes: >> I have a table with the following structure: >> ... >> raw_data | citext | not null >> ... >> "documents_raw_data_trgm" gin (raw_data gin_trgm_ops) > >> I'd like to use pg_trgm for matching substrings case-insensitively, but >> it doesn't seem to use the index: > > You're outsmarting yourself by using citext as the column datatype. > That causes "ilike" to be interpreted as a citext-specific operator, > which is not a member of the gin_trgm_ops operator class, so it doesn't > match this index. > > I wonder whether we really need that citext-specific operator at all > ... but in the meantime, if you need the column to be citext for some > other reason, I'd suggest making a gin index on raw_data::text and > then writing the query as raw_data::text ilike '%zagreb%'. hm, one more data point that citext implementation didn't succeed in terms of abstracting you from case sensitivity issues. merlin
Merlin Moncure <mmoncure@gmail.com> writes: > On Thu, Feb 21, 2013 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wonder whether we really need that citext-specific operator at all >> ... but in the meantime, if you need the column to be citext for some >> other reason, I'd suggest making a gin index on raw_data::text and >> then writing the query as raw_data::text ilike '%zagreb%'. > hm, one more data point that citext implementation didn't succeed in > terms of abstracting you from case sensitivity issues. I think this is just a bug and not a fundamental design flaw: it looks to me like simply removing the citext-specific declarations of the regex operators would fix the problem (because then it'd fall back on the standard operators with an implicit cast to text, and that would match the index). Might cause a problem though for anybody who's got those operators embedded in views. regards, tom lane