Thread: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries
Hello, I have been trying to understand how pg_trgm works. As part of that, I was looking at gin_extract_query_trgm(), which I think, extracts trigrams from a search query string. So, I debugged for 3 cases: 1) column_name LIKE '%緊急%' in this case, inside gin_extract_query_trgm(), after a call to generate_wildcard_trgm(), returned trglen is 0, hence GIN_SEARCH_MODE_ALL search mode is used. 2) column_name LIKE '%os%' same as in case (1) 3) column_name LIKE '%ost%' returned trglen is > 0, things proceed differently. May be, trigrams have been generated and cane be used for index search. I later commented out #define KEEPONLYALNUM from contrib/pg_trgm/trgm.h (following from a related discussion on -hackers viz. http://www.postgresql.org/message-id/flat/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com#CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com), but things didn't change. So, it appears, for search strings consisting of 2 (or < 3) characters, trigrams can not be utilized. No? NOTE: Using the master branch. The indexed column is a text field and data consists of mix of Japanese, alphanumeric characters. -- Amit Langote
On Wed, May 29, 2013 at 10:51 PM, Amit Langote <amitlangote09@gmail.com> wrote: > So, it appears, for search strings consisting of 2 (or < 3) > characters, trigrams can not be utilized. No? I think that's right. "trigram" means a sequence of three characters, and what's stored in the indexes are three-character sequences from the original text. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 30, 2013 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 29, 2013 at 10:51 PM, Amit Langote <amitlangote09@gmail.com> wrote: >> So, it appears, for search strings consisting of 2 (or < 3) >> characters, trigrams can not be utilized. No? > > I think that's right. "trigram" means a sequence of three characters, > and what's stored in the indexes are three-character sequences from > the original text. > Was there any improvement to pg_trgm in recent past that could make it better for partial matching (the case in question I suppose) or is partial-matching a different thing altogether? -- Amit Langote
On Thu, May 30, 2013 at 11:00 AM, Amit Langote <amitlangote09@gmail.com> wrote: > Was there any improvement to pg_trgm in recent past that could make it > better for partial matching (the case in question I suppose) or is > partial-matching a different thing altogether? Sorry, I am not sure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2013/5/31 Amit Langote <amitlangote09@gmail.com> > > On Thu, May 30, 2013 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 29, 2013 at 10:51 PM, Amit Langote <amitlangote09@gmail.com> wrote: > >> So, it appears, for search strings consisting of 2 (or < 3) > >> characters, trigrams can not be utilized. No? > > > > I think that's right. "trigram" means a sequence of three characters, > > and what's stored in the indexes are three-character sequences from > > the original text. > > > > Was there any improvement to pg_trgm in recent past that could make it > better for partial matching (the case in question I suppose) or is > partial-matching a different thing altogether? > Hi Amit, following emails are discussed about partial match of pg_trgm. I hope will this help. <http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com> as you may know, if search string contains multibyte characters trigram key is converted to CRC of 4 byte and it is used as key. (but only use upper 3 byte from CRC) so we can do partial matching if KEEPONLYALNUM is enabled. Regards, -------- Masahiko Sawada
On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
------
With best regards,
Alexander Korotkov.
following emails are discussed about partial match of pg_trgm. I hopewill this help.
<http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com>
as you may know, if search string contains multibyte characters
trigram key is converted to CRC of 4 byte and it is used as key.
(but only use upper 3 byte from CRC)
so we can do partial matching if KEEPONLYALNUM is enabled.
Please, read the further discussion on that thread. We can't do partial matching because of CRC independently of KEEPONLYALNUM.
With best regards,
Alexander Korotkov.
On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com> > wrote: >> >> following emails are discussed about partial match of pg_trgm. I hope >> will this help. >> >> <http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com> >> as you may know, if search string contains multibyte characters >> trigram key is converted to CRC of 4 byte and it is used as key. >> (but only use upper 3 byte from CRC) >> so we can do partial matching if KEEPONLYALNUM is enabled. > > > Please, read the further discussion on that thread. We can't do partial > matching because of CRC independently of KEEPONLYALNUM. > Thank you Sawada-san and Alexander. I think the idea of using trigram "text" itself rather than its CRC (due to its problems in partial matching) as GIN key (?) has not been implemented into pg_trgm yet, right? And even though, such a facility would be added, we would still need to handle multibyte characters case differently (even for partial matching), is that right? -- Amit Langote
On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com> > wrote: >> >> following emails are discussed about partial match of pg_trgm. I hope >> will this help. >> >> <http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com> >> as you may know, if search string contains multibyte characters >> trigram key is converted to CRC of 4 byte and it is used as key. >> (but only use upper 3 byte from CRC) >> so we can do partial matching if KEEPONLYALNUM is enabled. > > > Please, read the further discussion on that thread. We can't do partial > matching because of CRC independently of KEEPONLYALNUM. > Also, a few more questions: 1) When building a trgm index, are there any differences for multi-byte character strings. For example, would a 2 character Japanese string (multi-byte offcourse) produce exactly 3 trigrams to be stored in the index which would later be used while look-up? 2) And if that is so, is there problem in gin_extract_query_trgm(), that is while generating trigrams from a query search term that causes trigrams (stored in the index if answer to (1) is yes) NOT to be used in such a partial matching case? -- Amit Langote
On Fri, May 31, 2013 at 11:16 AM, Amit Langote <amitlangote09@gmail.com> wrote: > On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov > <aekorotkov@gmail.com> wrote: >> On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com> >> wrote: >>> >>> following emails are discussed about partial match of pg_trgm. I hope >>> will this help. >>> >>> <http://www.postgresql.org/message-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com> >>> as you may know, if search string contains multibyte characters >>> trigram key is converted to CRC of 4 byte and it is used as key. >>> (but only use upper 3 byte from CRC) >>> so we can do partial matching if KEEPONLYALNUM is enabled. >> >> >> Please, read the further discussion on that thread. We can't do partial >> matching because of CRC independently of KEEPONLYALNUM. >> > > Also, a few more questions: > > 1) When building a trgm index, are there any differences for > multi-byte character strings. For example, would a 2 character > Japanese string (multi-byte offcourse) produce exactly 3 trigrams to > be stored in the index which would later be used while look-up? in above case a 2 character multibyte string produce 3 trigrams of CRC. (because these was larger than 3 byte) and these are used while look-up. > 2) And if that is so, is there problem in gin_extract_query_trgm(), > that is while generating trigrams from a query search term that causes > trigrams (stored in the index if answer to (1) is yes) NOT to be used > in such a partial matching case? it means that we can't use trigrams in case of partial matching because trigrams (stored in index) are converted to different value(CRC). right? Regards, -- ------- Sawada Masahiko
On Sat, Jun 1, 2013 at 1:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote: > On Fri, May 31, 2013 at 11:16 AM, Amit Langote <amitlangote09@gmail.com> wrote: >> 2) And if that is so, is there problem in gin_extract_query_trgm(), >> that is while generating trigrams from a query search term that causes >> trigrams (stored in the index if answer to (1) is yes) NOT to be used >> in such a partial matching case? > > it means that we can't use trigrams in case of partial matching > because trigrams (stored in index) are converted to different > value(CRC). > right? > When I debugged a partial match case such as " column like '%st%' ", it appears that get_wildcard_trigrams return no trigrams for wildcard part 'st' since charlen < 3. Hence, GIN_SEARCH_MODE_ALL mode is used and results in full index scan instead of trigrams being used. This happens for multibyte case too. The problem is that for wildcard part consisting of less than 3 characters, get_wildcard_trigrams return nothing. -- Amit Langote
On Fri, May 31, 2013 at 9:41 PM, Amit Langote <amitlangote09@gmail.com> wrote:
------
With best regards,
Alexander Korotkov.
On Sat, Jun 1, 2013 at 1:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, May 31, 2013 at 11:16 AM, Amit Langote <amitlangote09@gmail.com> wrote:>> 2) And if that is so, is there problem in gin_extract_query_trgm(),When I debugged a partial match case such as " column like '%st%'
>> that is while generating trigrams from a query search term that causes
>> trigrams (stored in the index if answer to (1) is yes) NOT to be used
>> in such a partial matching case?
>
> it means that we can't use trigrams in case of partial matching
> because trigrams (stored in index) are converted to different
> value(CRC).
> right?
>
", it appears that get_wildcard_trigrams return no trigrams for
wildcard part 'st' since charlen < 3. Hence, GIN_SEARCH_MODE_ALL mode
is used and results in full index scan instead of trigrams being
used. This happens for multibyte case too. The problem is that for
wildcard part consisting of less than 3 characters,
get_wildcard_trigrams return nothing.
Partial match for LIKE queries is not implemented at all. With current index structure it could be possible to implement it with guarantee that all indexed strings don't contain multibyte characters (i.e. single-byte encoding).
Also, at it was mentioned, it's possible to implement operator class with text storage type which would support partial match in all the cases. However, I doubt it's reasonable to implement it based on pg_trgm, because of many hard-wired assumptions that trigram is fixed length and compatibility.
With best regards,
Alexander Korotkov.