Re: Configuring Text Search parser? - Mailing list pgsql-hackers
From | Sushant Sinha |
---|---|
Subject | Re: Configuring Text Search parser? |
Date | |
Msg-id | 1285090712.4454.70.camel@yoffice |
In response to | Configuring Text Search parser? (jesper@krogh.cc) |
List | pgsql-hackers |
Your changes are fine as far as they go: they will give you tokens that contain "_" characters. However, it is not a good idea to fold the new token into an existing type such as NUMWORD. Give the new type of token its own name, perhaps UnderscoreWord. Then, on seeing "_", move to a state that can identify the new token, and output the token once it has been fully recognized.

To extract the parts of the newly created token as well, you can write a special handler for it that resets the parser position to the start of the token. Then modify the state machine to output each part-token before going into the state that leads to the full token identified earlier.

Look at these changes to the text parser as well:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg00004.php

-Sushant.

On Mon, 2010-09-20 at 16:01 +0200, jesper@krogh.cc wrote:
> Hi.
>
> I'm trying to migrate an application off an existing Full Text Search engine
> and onto PostgreSQL .. one of my main (remaining) headaches is the
> fact that PostgreSQL treats _ as a separation character whereas the existing
> behaviour is to "not split". That means:
>
> testdb=# select ts_debug('database_tag_number_999');
>                                    ts_debug
> ------------------------------------------------------------------------------
>  (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
>  (blank,"Space symbols",_,{},,)
>  (uint,"Unsigned integer",999,{simple},simple,{999})
> (7 rows)
>
> Where the incoming data, by design, contains a set of tags which include _
> and are expected to be one "lexeme".
>
> I've tried patching my way out of this using this patch.
>
> $ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
> *** src/backend/tsearch/wparser_def.c.orig 2010-09-20 15:58:37.033336460 +0200
> --- src/backend/tsearch/wparser_def.c 2010-09-20 15:58:41.193335577 +0200
> ***************
> *** 967,986 ****
> --- 967,988 ----
>
>   static const TParserStateActionItem actionTPS_InNumWord[] = {
>       {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
>       {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
>       {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> +     {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
>       {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
>       {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
>       {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
>       {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
>   };
>
>   static const TParserStateActionItem actionTPS_InAsciiWord[] = {
>       {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
>       {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
> +     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
>       {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
>       {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
>       {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
> ***************
> *** 995,1004 ****
> --- 997,1007 ----
>
>   static const TParserStateActionItem actionTPS_InWord[] = {
>       {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
>       {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
>       {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
> +     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
>       {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
>       {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
>   };
>
>
> This will obviously break other people's applications, so my question would
> be: if this should be made configurable, how should it be done?
>
> As a sidenote... Xapian doesn't split on _ .. Lucene does.
>
> Thanks.
>
> --
> Jesper
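
The approach suggested in the reply (a separately named token type, a dedicated state entered on "_", and a re-scan from the token start to emit its parts) can be sketched roughly as a standalone toy scanner. This is only an illustration under stated assumptions, not PostgreSQL code: the names `TOK_UNDERSCOREWORD`, `ST_UWORD`, `Token` and `scan_tokens` are hypothetical, and the real change would go into the state tables in wparser_def.c instead. For simplicity the toy treats digits as word characters and has no separate integer type.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types; TOK_UNDERSCOREWORD is not a real PostgreSQL type. */
typedef enum { TOK_WORD, TOK_UNDERSCOREWORD } TokType;

typedef struct {
    TokType     type;
    const char *start;  /* points into the input; not NUL-terminated */
    size_t      len;
} Token;

/* ST_UWORD is the extra state entered after seeing '_' inside a word. */
typedef enum { ST_BASE, ST_WORD, ST_UWORD } State;

static int is_word_char(char c) { return isalnum((unsigned char) c); }

/* Store a token if there is room; always return the new token count. */
static size_t emit(Token *out, size_t n, size_t max,
                   TokType type, const char *start, size_t len)
{
    if (n < max) {
        out[n].type = type;
        out[n].start = start;
        out[n].len = len;
    }
    return n + 1;
}

/*
 * Scan text and write up to max tokens into out. Returns the number of
 * tokens found (which may exceed max if out is too small).
 */
size_t scan_tokens(const char *text, Token *out, size_t max)
{
    State       state = ST_BASE;
    const char *p = text;
    const char *start = NULL;
    size_t      n = 0;

    assert(text != NULL);
    for (;;) {
        char c = *p;

        switch (state) {
        case ST_BASE:
            if (c == '\0')
                return n;
            if (is_word_char(c)) {
                start = p;
                state = ST_WORD;
            }
            p++;
            break;
        case ST_WORD:
            if (is_word_char(c)) {
                p++;
            } else if (c == '_' && is_word_char(p[1])) {
                p++;
                state = ST_UWORD;  /* switch to the new token type */
            } else {
                n = emit(out, n, max, TOK_WORD, start, (size_t) (p - start));
                state = ST_BASE;
            }
            break;
        case ST_UWORD:
            if (is_word_char(c) || (c == '_' && is_word_char(p[1]))) {
                p++;
            } else {
                const char *end = p;

                /* Output the whole token first ... */
                n = emit(out, n, max, TOK_UNDERSCOREWORD, start,
                         (size_t) (end - start));
                /* ... then reset to its start and output each part. */
                for (p = start; p < end;) {
                    const char *part = p;

                    while (p < end && *p != '_')
                        p++;
                    n = emit(out, n, max, TOK_WORD, part, (size_t) (p - part));
                    if (p < end)
                        p++;  /* skip the underscore */
                }
                state = ST_BASE;
            }
            break;
        }
    }
}
```

Called on 'database_tag_number_999', this toy returns five tokens: the whole tag as one TOK_UNDERSCOREWORD, followed by database, tag, number and 999 as part-tokens. Keeping the new type separate from the existing ones means a text search configuration could map it to its own dictionaries without changing how ordinary words are handled.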