Thread: Question on Trigram GIST indexes

Question on Trigram GIST indexes

From
ERR ORR
Date:

@Moderators: I am reposting this because the original from 22 December apparently didn't arrive on the list.

I was trying to make Postgresql use a trigram gist index on a varchar field, but to no avail.

Specifically, I was trying to replicate what is done in this blog post:
http://www.postgresonline.com/journal/archives/212-PostgreSQL-9.1-Trigrams-teaching-LIKE-and-ILIKE-new-tricks.html


I use Postgresql 9.1.7 on Linux FC17 64bit, my locale is UTF8.

My full table definition is

CREATE TABLE "TEST"
(
  "RECID" bigint NOT NULL DEFAULT next_id(),
  "TST_PAYLOAD" character varying(255),
  CONSTRAINT "PK_TEST" PRIMARY KEY ("RECID")
  USING INDEX TABLESPACE local
)
WITH (
  OIDS=FALSE
);

CREATE INDEX "TEST_PAYLOAD_PATTERN_1_IDX"
  ON "TEST"
  USING btree
  ("TST_PAYLOAD" COLLATE pg_catalog."default" varchar_pattern_ops)
TABLESPACE local;

CREATE INDEX "TEST_PAYLOAD_TRIGRAM_GIST_1_IDX"
  ON "TEST"
  USING gist
  ("TST_PAYLOAD" COLLATE pg_catalog."default" gist_trgm_ops)
TABLESPACE local;

CREATE INDEX "TEST_PAYLOAD_TRIGRAM_GIN_1_IDX"
  ON "TEST"
  USING gin
  ("TST_PAYLOAD" COLLATE pg_catalog."default" gin_trgm_ops)
TABLESPACE local;


The COLLATE pg_catalog."default" clause is inserted by the DB (default is "Unicode"). I also tried to define the Trigram index with COLLATE pg_catalog."C"  but the behavior did not change. I did vacuum and analyze after creating each index.

The field "TST_PAYLOAD" contains 26389 names of cities, all in uppercase.

I have pg_tgrm installed - actually all extensions are present.

Queries which use "WHERE "TST_PAYLOAD" LIKE 'SEAT%'" go to the btree index as it should.
Queries which use "WHERE "TST_PAYLOAD" LIKE '%EAT%'" *should* use the GIST index but do a full table scan instead.
(I am looking for names like 'SEATTLE' in this example)

I also tried dropping the btree index but that has no influence on the behavior.

I have texts/strings in different languages/charsets, so UTF8 looked like the best decision to me, instead of, say, ISO-8859-15, which is limited to just some European charsets. Specifically I am storing strings in European languages (corresponding to the ISO-8859 series) including diacrites line äöüñáéíóú ..., Russian, Arabic, Chinese etc. in one column instead of making different columns/tables and using them via a view because that's my use case and UTF8 should accommodate that IMHO (or is that an abuse of the DB?)

Would it help to `ALTER DATABASE set lc_collate = 'C'`,supposing that is possible? (Oracle doesn't allow that iirc).

Thanks for any insights, pointers ...

I'd be grateful if anybody could explain to me what I am doing wrong.

Thanks in advance.

Re: Question on Trigram GIST indexes

From
Johann Spies
Date:
On 5 January 2013 20:20, ERR ORR <rd0002@gmail.com> wrote:



Queries which use "WHERE "TST_PAYLOAD" LIKE 'SEAT%'" go to the btree index as it should.
Queries which use "WHERE "TST_PAYLOAD" LIKE '%EAT%'" *should* use the GIST index but do a full table scan instead.
(I am looking for names like 'SEATTLE' in this example)

I also tried dropping the btree index but that has no influence on the behavior.



I have the same problem:

Index:

CREATE INDEX akb_art_abstract_trgm
  ON akb_articles
  USING gin
  (abstract gin_trgm_ops);

and

 SELECT title, SIMILARITY(abstract, 'skeef') FROM akb_articles WHERE SIMILARITY (abstract, 'water') > 0

 results in a full sequential scan:

"Seq Scan on public.akb_articles  (cost=0.00..45751.67 rows=107025 width=666) (actual time=0.236..63153.268 rows=169265 loops=1)"
"  Output: title, similarity(abstract, 'skeef'::text)"
"  Filter: (similarity(akb_articles.abstract, 'water'::text) > 0::double precision)"
"  Buffers: shared hit=39000 read=46460"
"Total runtime: 63173.663 ms"

Regards
Johann

--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

Re: Question on Trigram GIST indexes

From
Merlin Moncure
Date:
On Sat, Jan 5, 2013 at 12:20 PM, ERR ORR <rd0002@gmail.com> wrote:
>
> @Moderators: I am reposting this because the original from 22 December
> apparently didn't arrive on the list.
>
> I was trying to make Postgresql use a trigram gist index on a varchar field,
> but to no avail.
>
> Specifically, I was trying to replicate what is done in this blog post:
> http://www.postgresonline.com/journal/archives/212-PostgreSQL-9.1-Trigrams-teaching-LIKE-and-ILIKE-new-tricks.html
>
>
> I use Postgresql 9.1.7 on Linux FC17 64bit, my locale is UTF8.
>
> My full table definition is
>
> CREATE TABLE "TEST"
> (
>   "RECID" bigint NOT NULL DEFAULT next_id(),
>   "TST_PAYLOAD" character varying(255),
>   CONSTRAINT "PK_TEST" PRIMARY KEY ("RECID")
>   USING INDEX TABLESPACE local
> )
> WITH (
>   OIDS=FALSE
> );
>
> CREATE INDEX "TEST_PAYLOAD_PATTERN_1_IDX"
>   ON "TEST"
>   USING btree
>   ("TST_PAYLOAD" COLLATE pg_catalog."default" varchar_pattern_ops)
> TABLESPACE local;
>
> CREATE INDEX "TEST_PAYLOAD_TRIGRAM_GIST_1_IDX"
>   ON "TEST"
>   USING gist
>   ("TST_PAYLOAD" COLLATE pg_catalog."default" gist_trgm_ops)
> TABLESPACE local;
>
> CREATE INDEX "TEST_PAYLOAD_TRIGRAM_GIN_1_IDX"
>   ON "TEST"
>   USING gin
>   ("TST_PAYLOAD" COLLATE pg_catalog."default" gin_trgm_ops)
> TABLESPACE local;
>
>
> The COLLATE pg_catalog."default" clause is inserted by the DB (default is
> "Unicode"). I also tried to define the Trigram index with COLLATE
> pg_catalog."C"  but the behavior did not change. I did vacuum and analyze
> after creating each index.
>
> The field "TST_PAYLOAD" contains 26389 names of cities, all in uppercase.
>
> I have pg_tgrm installed - actually all extensions are present.
>
> Queries which use "WHERE "TST_PAYLOAD" LIKE 'SEAT%'" go to the btree index
> as it should.
> Queries which use "WHERE "TST_PAYLOAD" LIKE '%EAT%'" *should* use the GIST
> index but do a full table scan instead.
> (I am looking for names like 'SEATTLE' in this example)

where did you determine that pg_trgm should optimize like expressions?
pg_trgm provides new operators that are used to index on string
similarity...

merlin


Re: Question on Trigram GIST indexes

From
Merlin Moncure
Date:
On Tue, Jan 22, 2013 at 8:07 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sat, Jan 5, 2013 at 12:20 PM, ERR ORR <rd0002@gmail.com> wrote:
>>
>> @Moderators: I am reposting this because the original from 22 December
>> apparently didn't arrive on the list.
>>
>> I was trying to make Postgresql use a trigram gist index on a varchar field,
>> but to no avail.
>>
>> Specifically, I was trying to replicate what is done in this blog post:
>> http://www.postgresonline.com/journal/archives/212-PostgreSQL-9.1-Trigrams-teaching-LIKE-and-ILIKE-new-tricks.html
>>
>>
>> I use Postgresql 9.1.7 on Linux FC17 64bit, my locale is UTF8.
>>
>> My full table definition is
>>
>> CREATE TABLE "TEST"
>> (
>>   "RECID" bigint NOT NULL DEFAULT next_id(),
>>   "TST_PAYLOAD" character varying(255),
>>   CONSTRAINT "PK_TEST" PRIMARY KEY ("RECID")
>>   USING INDEX TABLESPACE local
>> )
>> WITH (
>>   OIDS=FALSE
>> );
>>
>> CREATE INDEX "TEST_PAYLOAD_PATTERN_1_IDX"
>>   ON "TEST"
>>   USING btree
>>   ("TST_PAYLOAD" COLLATE pg_catalog."default" varchar_pattern_ops)
>> TABLESPACE local;
>>
>> CREATE INDEX "TEST_PAYLOAD_TRIGRAM_GIST_1_IDX"
>>   ON "TEST"
>>   USING gist
>>   ("TST_PAYLOAD" COLLATE pg_catalog."default" gist_trgm_ops)
>> TABLESPACE local;
>>
>> CREATE INDEX "TEST_PAYLOAD_TRIGRAM_GIN_1_IDX"
>>   ON "TEST"
>>   USING gin
>>   ("TST_PAYLOAD" COLLATE pg_catalog."default" gin_trgm_ops)
>> TABLESPACE local;
>>
>>
>> The COLLATE pg_catalog."default" clause is inserted by the DB (default is
>> "Unicode"). I also tried to define the Trigram index with COLLATE
>> pg_catalog."C"  but the behavior did not change. I did vacuum and analyze
>> after creating each index.
>>
>> The field "TST_PAYLOAD" contains 26389 names of cities, all in uppercase.
>>
>> I have pg_tgrm installed - actually all extensions are present.
>>
>> Queries which use "WHERE "TST_PAYLOAD" LIKE 'SEAT%'" go to the btree index
>> as it should.
>> Queries which use "WHERE "TST_PAYLOAD" LIKE '%EAT%'" *should* use the GIST
>> index but do a full table scan instead.
>> (I am looking for names like 'SEATTLE' in this example)
>
> where did you determine that pg_trgm should optimize like expressions?
> pg_trgm provides new operators that are used to index on string
> similarity...

oops -- heh -- I guess you *can* do that (after further documentation
review).  hm...it works for me...

merlin