Thread: pg_trgm partial-match
Hi, I'd like to propose to extend pg_trgm so that it can compare a partial-match query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial' GIN method for pg_trgm. Currently, when the query key is less than three characters, we cannot use a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a partial-match method. In this case, seq scan or index full scan would be executed, and its response time would be very slow. I'd like to alleviate this problem. Note that we cannot do a partial-match if KEEPONLYALNUM is disabled, i.e., if query key contains multibyte characters. In this case, byte length of the trigram string might be larger than three, and its CRC is used as a trigram key instead of the trigram string itself. Because of using CRC, we cannot do a partial-match. Attached patch extends pg_trgm so that it compares a partial-match query key only when KEEPONLYALNUM is enabled. Attached patch is WIP yet. What I should do next is: * version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc. * write the regression test Comments? Review? Objection? Regards, -- Fujii Masao
Attachment
On 15.11.2012 20:39, Fujii Masao wrote: > Hi, > > I'd like to propose to extend pg_trgm so that it can compare a partial-match > query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial' > GIN method for pg_trgm. > > Currently, when the query key is less than three characters, we cannot use > a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a > partial-match method. In this case, seq scan or index full scan would be > executed, and its response time would be very slow. I'd like to alleviate this > problem. > > Note that we cannot do a partial-match if KEEPONLYALNUM is disabled, > i.e., if query key contains multibyte characters. In this case, byte length of > the trigram string might be larger than three, and its CRC is used as a > trigram key instead of the trigram string itself. Because of using CRC, we > cannot do a partial-match. Attached patch extends pg_trgm so that it > compares a partial-match query key only when KEEPONLYALNUM is > enabled. > > Attached patch is WIP yet. What I should do next is: > > * version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc. > * write the regression test > > Comments? Review? Objection? Hi, I've done a quick review of the current patch: (1) It applies cleanly on the current master and builds fine. (2) In pg_trgm--1.0.sql the gin_trgm_compare_partial is indented differently (using tabs instead of spaces). (3) In trgm_gin.c, function gin_extract_value_trgm contains #ifdef KEEPONLYALNUM, although trgm_pmatch is not used at all. (4) The patch removes some commented-out variables, but there still remain various commented-out variables. Will this becleaned too? (5) I've done basic functionality of the patch, it really seems to work: CREATE EXTENSION pg_trgm ; CREATE TABLE TEST (val TEXT); INSERT INTO test SELECT md5(i::text) FROM generate_series(1,1000000) s(i); CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops); ANALYZE test; EXPLAIN SELECT * FROM test WHERE val LIKE '%aa%'; QUERY PLAN ------------------------------------------------------------------------Bitmap Heap Scan on test (cost=1655.96..11757.63rows=141414 width=33) Recheck Cond: (val ~~ '%aa%'::text) -> Bitmap Index Scan on trgm_idx (cost=0.00..1620.61rows=141414 width=0) Index Cond: (val ~~ '%aa%'::text) (4 rows) Without the patch, this gives a seq scan (as expected). Do you expect to update the docs too? IMHO it's worth mentioning that the pg_trgm can handle even patterns shorter than 2 chars ... regards Tomas Vondra
Hi!
On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
------
With best regards,
Alexander Korotkov.
Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
i.e., if query key contains multibyte characters. In this case, byte length of
the trigram string might be larger than three, and its CRC is used as a
trigram key instead of the trigram string itself. Because of using CRC, we
cannot do a partial-match. Attached patch extends pg_trgm so that it
compares a partial-match query key only when KEEPONLYALNUM is
enabled.
Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram character is singlebyte?
CREATE TABLE test (val TEXT);
INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
------
aa
aaa
шaaш
(3 rows)
test=# set enable_seqscan = off;
SET
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
-----
aa
aaa
(2 rows)
I think we can use partial match only for singlebyte encodings. Or, at most, in cases when all alpha-numeric characters are singlebyte (have no idea how to check this).
With best regards,
Alexander Korotkov.
On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
------
With best regards,
Alexander Korotkov.
On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao.fujii@gmail.com> wrote:Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
i.e., if query key contains multibyte characters. In this case, byte length of
the trigram string might be larger than three, and its CRC is used as a
trigram key instead of the trigram string itself. Because of using CRC, we
cannot do a partial-match. Attached patch extends pg_trgm so that it
compares a partial-match query key only when KEEPONLYALNUM is
enabled.Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram character is singlebyte?CREATE TABLE test (val TEXT);INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);ANALYZE test;test=# SELECT * FROM test WHERE val LIKE '%aa%';val------aaaaaшaaш(3 rows)test=# set enable_seqscan = off;SETtest=# SELECT * FROM test WHERE val LIKE '%aa%';val-----aaaaa(2 rows)I think we can use partial match only for singlebyte encodings. Or, at most, in cases when all alpha-numeric characters are singlebyte (have no idea how to check this).
Actually, I also was fiddling around idea of partial match on trigrams when I was working on initial LIKE patch. But, I concluded that we would need a separate opclass which always keeps full trigram in entry.
With best regards,
Alexander Korotkov.
On Mon, Nov 19, 2012 at 7:55 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov <aekorotkov@gmail.com> > wrote: >> >> On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao.fujii@gmail.com> >> wrote: >>> >>> Note that we cannot do a partial-match if KEEPONLYALNUM is disabled, >>> i.e., if query key contains multibyte characters. In this case, byte >>> length of >>> the trigram string might be larger than three, and its CRC is used as a >>> trigram key instead of the trigram string itself. Because of using CRC, >>> we >>> cannot do a partial-match. Attached patch extends pg_trgm so that it >>> compares a partial-match query key only when KEEPONLYALNUM is >>> enabled. >> >> >> Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram >> character is singlebyte? >> >> CREATE TABLE test (val TEXT); >> INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш'); >> CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops); >> ANALYZE test; >> test=# SELECT * FROM test WHERE val LIKE '%aa%'; >> val >> ------ >> aa >> aaa >> шaaш >> (3 rows) >> test=# set enable_seqscan = off; >> SET >> test=# SELECT * FROM test WHERE val LIKE '%aa%'; >> val >> ----- >> aa >> aaa >> (2 rows) >> >> I think we can use partial match only for singlebyte encodings. Or, at >> most, in cases when all alpha-numeric characters are singlebyte (have no >> idea how to check this). Good catch! You're right. > Actually, I also was fiddling around idea of partial match on trigrams when > I was working on initial LIKE patch. But, I concluded that we would need a > separate opclass which always keeps full trigram in entry. Agreed. My goal is to enable pg_trgm's full-text search using the keyword which consists of one or two Japanese characters to use an index efficiently. I first implemented the partial-match patch, and was planning to introduce new opclass next CommitFest to store the multibyte characters to the text data instead of int4. But obviously the order of development should have been the opposite. I will work on the latter development first, and add new opclass like gin_trgm_mb_ops (mb means multibyte) which ignores KEEPONLYALNUM and stores the GIN index key as text value. Regards, -- Fujii Masao
On Mon, Nov 19, 2012 at 10:56 AM, Tomas Vondra <tv@fuzzy.cz> wrote: > I've done a quick review of the current patch: Thanks for the commit! As Alexander pointed out upthread, another infrastructure patch is required before applying this patch. So I will implement the infra patch first. Regards, -- Fujii Masao
On Fri, Nov 23, 2012 at 2:11 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Nov 19, 2012 at 10:56 AM, Tomas Vondra <tv@fuzzy.cz> wrote: >> I've done a quick review of the current patch: > > Thanks for the commit! > > As Alexander pointed out upthread, another infrastructure patch is required > before applying this patch. So I will implement the infra patch first. I marked this patch as "Returned with Feedback" because unfortunately I don't have enough time to revise the patch. I will retry this maybe in 9.4dev phase. Regards, -- Fujii Masao