Thread: Can pg_trgm handle non-alphanumeric characters?
Hello, This question may be appropriate for pgsql-general, but let me ask here because the only relevant discussion seems to have been done on pgsql-hackers: http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php Can pg_trgm support non-alphanumeric characters by undefining KEEPONLYALNUM in contrib/pg_trgm/trgm.h? If no, what kind of problems would happen? Regards MauMau
On Wed, May 9, 2012 at 9:10 PM, MauMau <maumau307@gmail.com> wrote: > Hello, > > This question may be appropriate for pgsql-general, but let me ask here > because the only relevant discussion seems to have been done on > pgsql-hackers: > > http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php > > Can pg_trgm support non-alphanumeric characters by undefining KEEPONLYALNUM > in contrib/pg_trgm/trgm.h? Yes unless I'm missing something. Regards, -- Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com> > On Wed, May 9, 2012 at 9:10 PM, MauMau <maumau307@gmail.com> wrote: >> This question may be appropriate for pgsql-general, but let me ask here >> because the only relevant discussion seems to have been done on >> pgsql-hackers: >> >> http://archives.postgresql.org/pgsql-hackers/2011-09/msg00169.php >> >> Can pg_trgm support non-alphanumeric characters by undefining >> KEEPONLYALNUM >> in contrib/pg_trgm/trgm.h? > > Yes unless I'm missing something. Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would it cause any problems? If no, I wish that, because it eliminates the need to do the removal every time the users applies minor releases. Regards MauMau
On 09-05-2012 19:17, MauMau wrote: > Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would it > cause any problems? If no, I wish that, because it eliminates the need to do > the removal every time the users applies minor releases. > If you do so, you'll break minor versions. IMHO the default is the desirable behavior for almost all use cases (you are the first one that complain about it). Maybe in the future, we should be able to flip this flag without rebuilding binaries. -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte24x7 e Treinamento
On Thu, May 10, 2012 at 8:18 AM, Euler Taveira <euler@timbira.com> wrote: > On 09-05-2012 19:17, MauMau wrote: >> Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? Would it >> cause any problems? If no, I wish that, because it eliminates the need to do >> the removal every time the users applies minor releases. >> > If you do so, you'll break minor versions. Right. And removing KEEPONLYALNUM is a feature change rather than bug fix, so that should be proposed during major version development cycle. > IMHO the default is the desirable > behavior for almost all use cases (you are the first one that complain about > it). Really? I was thinking non-English users (including me) basicaly would not be satisfied with the default because they cannot use pg_trgm for N-gram full text search of non-English text. Though I agree some users would prefer the default. > Maybe in the future, we should be able to flip this flag without > rebuilding binaries. Agreed. Regards, -- Fujii Masao
From: "Fujii Masao" <masao.fujii@gmail.com> > On Thu, May 10, 2012 at 8:18 AM, Euler Taveira <euler@timbira.com> wrote: >> On 09-05-2012 19:17, MauMau wrote: >>> Then, does it make sense to remove "#define KEEPONLYALNUM" in 9.1.4? >>> Would it >>> cause any problems? If no, I wish that, because it eliminates the need >>> to do >>> the removal every time the users applies minor releases. >>> >> If you do so, you'll break minor versions. > > Right. And removing KEEPONLYALNUM is a feature change rather than bug fix, > so that should be proposed during major version development cycle. For information, what kind of breakage would occur? Is it performance degradation, extra index storage consumption, or undesirable query results? I imagined removing KEEPONLYALNUM would just accept non-alphanumeric characters and cause no harm to those who use only alphanumeric characters. Regards MauMau
"MauMau" <maumau307@gmail.com> wrote: >>> On 09-05-2012 19:17, MauMau wrote: >>>> Then, does it make sense to remove "#define KEEPONLYALNUM" in >>>> 9.1.4? Would it cause any problems? Yes, it will cause problems. > For information, what kind of breakage would occur? > I imagined removing KEEPONLYALNUM would just accept > non-alphanumeric characters and cause no harm to those who use > only alphanumeric characters. This would break our current usages because of the handling of trigrams at the "edges" of groups of qualifying characters. It would make similarity (and distance) values less useful for our current name searches using it. To simulate the effect, I used an '8' in place of a comma instead of recompiling with the suggested change. test=# select show_trgm('smith,john'); show_trgm -----------------------------------------------------------{" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "} (1 row) test=# select show_trgm('smith8john'); show_trgm -----------------------------------------------------{" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8} (1 row) test=# select similarity('smith,john', 'jon smith');similarity ------------ 0.615385 (1 row) test=# select similarity('smith8john', 'jon smith');similarity ------------ 0.3125 (1 row) So making the proposed change unconditionally could indeed hurt current users of the technique. On the other hand, if there was fine-grained control of this, it might make trigrams useful for searching statute cites (using all characters) as well as names (using the current character set); so I wouldn't want it to just be controlled by a global GUC. -Kevin
From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov> > "MauMau" <maumau307@gmail.com> wrote: >> For information, what kind of breakage would occur? > >> I imagined removing KEEPONLYALNUM would just accept >> non-alphanumeric characters and cause no harm to those who use >> only alphanumeric characters. > > This would break our current usages because of the handling of > trigrams at the "edges" of groups of qualifying characters. It > would make similarity (and distance) values less useful for our > current name searches using it. To simulate the effect, I used an > '8' in place of a comma instead of recompiling with the suggested > change. > > test=# select show_trgm('smith,john'); > show_trgm > ----------------------------------------------------------- > {" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "} > (1 row) > > test=# select show_trgm('smith8john'); > show_trgm > ----------------------------------------------------- > {" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8} > (1 row) > > test=# select similarity('smith,john', 'jon smith'); > similarity > ------------ > 0.615385 > (1 row) > > test=# select similarity('smith8john', 'jon smith'); > similarity > ------------ > 0.3125 > (1 row) > > So making the proposed change unconditionally could indeed hurt > current users of the technique. On the other hand, if there was > fine-grained control of this, it might make trigrams useful for > searching statute cites (using all characters) as well as names > (using the current character set); so I wouldn't want it to just be > controlled by a global GUC. Thanks for your explanation. Although I haven't understood it well yet, I'll consider what you taught. And I'll consider if the tentative measure of removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm against Japanese text. Regards MauMau
On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307@gmail.com> wrote: > Thanks for your explanation. Although I haven't understood it well yet, I'll > consider what you taught. And I'll consider if the tentative measure of > removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm > against Japanese text. In Japanese, it's common to do a text search with two characters keyword. But since pg_trgm is 3-gram, you basically would not be able to use index for such text search. So you might need something like pg_bigm or pg_unigm for Japanese text search. Regards, -- Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes: > On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307@gmail.com> wrote: >> Thanks for your explanation. Although I haven't understood it well yet, I'll >> consider what you taught. And I'll consider if the tentative measure of >> removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm >> against Japanese text. > In Japanese, it's common to do a text search with two characters keyword. > But since pg_trgm is 3-gram, you basically would not be able to use index > for such text search. So you might need something like pg_bigm or pg_unigm > for Japanese text search. I believe the trigrams are three *bytes* not three characters. So a couple of kanji should work just fine for this. regards, tom lane
From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov> > "MauMau" <maumau307@gmail.com> wrote: >> For information, what kind of breakage would occur? > >> I imagined removing KEEPONLYALNUM would just accept >> non-alphanumeric characters and cause no harm to those who use >> only alphanumeric characters. > > This would break our current usages because of the handling of > trigrams at the "edges" of groups of qualifying characters. It > would make similarity (and distance) values less useful for our > current name searches using it. To simulate the effect, I used an > '8' in place of a comma instead of recompiling with the suggested > change. > > test=# select show_trgm('smith,john'); > show_trgm > ----------------------------------------------------------- > {" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "} > (1 row) > > test=# select show_trgm('smith8john'); > show_trgm > ----------------------------------------------------- > {" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8} > (1 row) > > test=# select similarity('smith,john', 'jon smith'); > similarity > ------------ > 0.615385 > (1 row) > > test=# select similarity('smith8john', 'jon smith'); > similarity > ------------ > 0.3125 > (1 row) > > So making the proposed change unconditionally could indeed hurt > current users of the technique. On the other hand, if there was > fine-grained control of this, it might make trigrams useful for > searching statute cites (using all characters) as well as names > (using the current character set); so I wouldn't want it to just be > controlled by a global GUC. Thank you for your concise examples. I probably got it. From your examples, I thought KEEPONLYALNUM controls whether non-alphanumeric characters are included in trigrams, though I haven't read the code of pg_trgm. So, removing KEEPONLYALNUM definition produces trigrams unnecessary for users who handle only alphanumeric text. That would lead to undesirable query results. Then, I wonder what would be the ideal specification...to add alphanumeric/non-alphanumeric boolean switch to similarity() function, add non-alphanumeric version of operators (ex. %* and <->*) and non-alphanumeric version of operator classes (ex. gin_allchars_trgm_ops)? At least, I understood the fix is not appropriate for minor releases. Regards MauMau
On Fri, May 11, 2012 at 4:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> On Fri, May 11, 2012 at 12:07 AM, MauMau <maumau307@gmail.com> wrote: >>> Thanks for your explanation. Although I haven't understood it well yet, I'll >>> consider what you taught. And I'll consider if the tentative measure of >>> removing KEEPONLYALNUM is correct for someone who wants to use pg_trgm >>> against Japanese text. > >> In Japanese, it's common to do a text search with two characters keyword. >> But since pg_trgm is 3-gram, you basically would not be able to use index >> for such text search. So you might need something like pg_bigm or pg_unigm >> for Japanese text search. Even if an index can be used for two characters text search, bitmap index scan picks up all rows, so it's too slow. > I believe the trigrams are three *bytes* not three characters. So a > couple of kanji should work just fine for this. Really? As far as I read the code of pg_trgm, the trigram is three characters and its CRC32 is used as an index key if its size is more than three bytes. Regards, -- Fujii Masao