Re: pg_trgm partial-match - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: pg_trgm partial-match
Date
Msg-id CAPpHfdtTc2UqLXu98LMxPNdOhuxnSZrPPv2xv8i10SY+CGnaFg@mail.gmail.com
In response to Re: pg_trgm partial-match  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: pg_trgm partial-match
List pgsql-hackers
On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
i.e., if query key contains multibyte characters. In this case, byte length of
the trigram string might be larger than three, and its CRC is used as a
trigram key instead of the trigram string itself. Because of using CRC, we
cannot do a partial-match. Attached patch extends pg_trgm so that it
compares a partial-match query key only when KEEPONLYALNUM is
enabled.

I didn't get this point. How does KEEPONLYALNUM guarantee that each character of a trigram is single-byte?

CREATE TABLE test (val TEXT);
INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
test=# SELECT * FROM test WHERE val LIKE '%aa%';
 val  
------
 aa
 aaa
 шaaш
(3 rows)
test=# set enable_seqscan = off;
SET
test=# SELECT * FROM test WHERE val LIKE '%aa%';
 val 
-----
 aa
 aaa
(2 rows)

I think we can use partial match only for single-byte encodings or, at most, in cases where all alphanumeric characters are single-byte (I have no idea how to check this).

Actually, I was also fiddling with the idea of partial match on trigrams when I was working on the initial LIKE patch. But I concluded that we would need a separate opclass which always keeps the full trigram in the entry.
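
To illustrate why the 'шaaш' row disappears in the index scan above, here is a minimal Python sketch of the key problem. It is not pg_trgm code: the function names are hypothetical, the padding (two spaces before the word, one after) and the "stored verbatim only if exactly 3 bytes" rule are a simplified model of what pg_trgm does, and UTF-8 is assumed as the database encoding. Hashed keys are treated as opaque, since a hash does not preserve the prefix of the original trigram.

```python
def trigrams(word):
    """Pad the word (two spaces before, one after) and slide a 3-character window."""
    padded = "  " + word + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def is_stored_verbatim(trigram):
    """Assumption: a trigram is kept as-is only when it is exactly 3 bytes long.
    Longer (multibyte) trigrams are replaced by a hash in the index."""
    return len(trigram.encode("utf-8")) == 3


def verbatim_keys(word):
    """Index keys of the word that still carry the original trigram bytes."""
    return [t.encode("utf-8") for t in trigrams(word) if is_stored_verbatim(t)]


def prefix_match_possible(word, prefix):
    """A partial (prefix) match can only succeed on keys stored verbatim."""
    p = prefix.encode("utf-8")
    return any(key.startswith(p) for key in verbatim_keys(word))


for value in ("aa", "aaa", "шaaш"):
    sizes = [len(t.encode("utf-8")) for t in trigrams(value)]
    print(value, trigrams(value), sizes, prefix_match_possible(value, "aa"))
```

In this model every trigram of 'шaaш' contains the two-byte character 'ш', so none of them is stored verbatim and the prefix 'aa' has nothing to match against, while 'aa' and 'aaa' each keep a 3-byte key starting with 'aa'.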
 
------
With best regards,
Alexander Korotkov.
