Configuring Text Search parser? - Mailing list pgsql-hackers

From jesper@krogh.cc
Subject Configuring Text Search parser?
Date
Msg-id 1a26550c0b55c0a0af0dcbd8e080bc82.squirrel@shrek.krogh.cc
Responses Re: Configuring Text Search parser?  (Sushant Sinha <sushant354@gmail.com>)
List pgsql-hackers
Hi.

I'm trying to migrate an application off an existing full-text search engine
and onto PostgreSQL. One of my main (remaining) headaches is the fact that
PostgreSQL treats _ as a separator character, whereas the existing
behaviour is to "not split". That means:

testdb=# select ts_debug('database_tag_number_999');
                                   ts_debug
------------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
 (blank,"Space symbols",_,{},,)
 (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
 (blank,"Space symbols",_,{},,)
 (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
 (blank,"Space symbols",_,{},,)
 (uint,"Unsigned integer",999,{simple},simple,{999})
(7 rows)

The incoming data, by design, contains a set of tags that include _, and
each tag is expected to be one "lexeme".

I've tried to work around this with the following patch:

$ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
*** src/backend/tsearch/wparser_def.c.orig    2010-09-20 15:58:37.033336460 +0200
--- src/backend/tsearch/wparser_def.c    2010-09-20 15:58:41.193335577 +0200
***************
*** 967,986 ****
--- 967,988 ----
  static const TParserStateActionItem actionTPS_InNumWord[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
      {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
      {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
      {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
      {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
      {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
  };
  
  static const TParserStateActionItem actionTPS_InAsciiWord[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
      {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
      {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
      {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
      {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
***************
*** 995,1004 ****
--- 997,1007 ----
  static const TParserStateActionItem actionTPS_InWord[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
      {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
      {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
      {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
  };
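
To make the intended effect of the patch concrete, here is a minimal
standalone sketch. It is not PostgreSQL code and does not use the parser's
state tables: the function count_tokens and its keep_underscore flag are
hypothetical names invented for illustration. It only shows how treating _
as an in-word character changes the number of tokens found in a tag.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Toy tokenizer, NOT PostgreSQL's parser. Counts the tokens in s, where a
 * token is a maximal run of "word" characters. ASCII letters and digits
 * are always word characters; '_' counts as one only when keep_underscore
 * is true (the behaviour the patch above is aiming for).
 */
int
count_tokens(const char *s, bool keep_underscore)
{
    int  ntokens = 0;
    bool in_token = false;

    for (; *s != '\0'; s++)
    {
        unsigned char c = (unsigned char) *s;
        bool is_word = isalnum(c) || (keep_underscore && c == '_');

        if (is_word && !in_token)
            ntokens++;          /* a new token starts at this character */
        in_token = is_word;
    }
    return ntokens;
}
```

With the input 'database_tag_number_999', count_tokens(..., false) returns 4
(database, tag, number, 999), while count_tokens(..., true) returns 1. A
boolean like this is only one conceivable shape for a configuration knob;
the real parser would need the choice threaded through its state tables.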
 



This will obviously break other people's applications, so my question is:
if this were to be made configurable, how should it be done?

As a side note: Xapian doesn't split on _, whereas Lucene does.

Thanks.

-- 
Jesper


