Your changes are mostly fine: they will give you tokens that contain "_"
characters. However, it is not a good idea to fold the new kind of token
into an existing type such as NUMWORD. Give the new type of token its own
name, perhaps UnderscoreWord. Then, on seeing "_", move to a state that
can recognize the new token, and output it once it has been fully
recognized.
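
To make the idea concrete, here is a minimal standalone sketch of such a
state machine. It is not code from wparser_def.c: the names TokType, State,
TOK_UNDERSCORE_WORD and next_token() are all made up for illustration, and
the real parser is table-driven rather than a switch. The point is only that
"_" leads to a tentative state, and the scanner falls back to the token it
had already recognized if no word character follows.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types; only the idea of a separate type for
 * underscore-joined words comes from the discussion above. */
typedef enum { TOK_NONE, TOK_WORD, TOK_UNDERSCORE_WORD } TokType;

/* Hypothetical states, loosely modelled on the parser's TPS_* states. */
typedef enum { ST_BASE, ST_IN_WORD, ST_AFTER_UNDERSCORE, ST_IN_USWORD } State;

/*
 * Scan one token starting at text[*pos]. On success *start and *len
 * describe the token and *pos is advanced past it. Returns TOK_NONE at
 * end of input (then *start and *len are left untouched).
 */
static TokType
next_token(const char *text, size_t *pos, size_t *start, size_t *len)
{
    State   state = ST_BASE;
    TokType fallback_type = TOK_NONE; /* token to emit if the "_" path fails */
    size_t  fallback_end = 0;         /* where that token would end */
    size_t  i;

    assert(text != NULL && pos != NULL && start != NULL && len != NULL);

    i = *pos;
    for (;;)
    {
        unsigned char c = (unsigned char) text[i];

        switch (state)
        {
            case ST_BASE:
                if (c == '\0')
                {
                    *pos = i;
                    return TOK_NONE;
                }
                if (isalnum(c))
                {
                    *start = i;
                    state = ST_IN_WORD;
                }
                i++;            /* consume the character, or skip a separator */
                break;

            case ST_IN_WORD:
            case ST_IN_USWORD:
                if (isalnum(c))
                    i++;
                else if (c == '_')
                {
                    /* Remember what we have so far, then try to extend it. */
                    fallback_type = (state == ST_IN_WORD)
                        ? TOK_WORD : TOK_UNDERSCORE_WORD;
                    fallback_end = i;
                    state = ST_AFTER_UNDERSCORE;
                    i++;
                }
                else
                {
                    *len = i - *start;
                    *pos = i;
                    return (state == ST_IN_WORD)
                        ? TOK_WORD : TOK_UNDERSCORE_WORD;
                }
                break;

            case ST_AFTER_UNDERSCORE:
                if (isalnum(c))
                {
                    state = ST_IN_USWORD;   /* "_" really joins two words */
                    i++;
                }
                else
                {
                    /* "_" not followed by a word character: roll back. */
                    *len = fallback_end - *start;
                    *pos = fallback_end;
                    return fallback_type;
                }
                break;
        }
    }
}
```

Calling next_token() repeatedly on "database_tag_number_999 foo" should give
one underscore token followed by one plain word, so existing word types are
left alone and only the new type carries the "_" behaviour.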
To also extract the parts of the newly created token, you can write a
special handler for it that resets the parser position to the start of
the token. Then modify the state machine so that it outputs each
part-token before it goes into the state that leads to the whole token
that was identified earlier.
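
A rough standalone sketch of the "rewind and emit the parts" step, again
with made-up names (Part, split_underscore_token) and no claim to match the
parser's internals: once the whole token is known, go back to its start and
walk it a second time, collecting the pieces between the "_" characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical part descriptor: offset and length within the input text. */
typedef struct
{
    size_t start;
    size_t len;
} Part;

/*
 * Given an already recognized token text[tok_start .. tok_start + tok_len),
 * rewind to its start and record the pieces separated by "_".
 * Returns the number of parts stored (at most max_parts).
 */
static size_t
split_underscore_token(const char *text, size_t tok_start, size_t tok_len,
                       Part *parts, size_t max_parts)
{
    size_t pos = tok_start;             /* reset position to the token start */
    size_t end = tok_start + tok_len;
    size_t n = 0;

    assert(text != NULL && parts != NULL);

    while (pos < end && n < max_parts)
    {
        size_t part_start;

        while (pos < end && text[pos] == '_')   /* skip separators */
            pos++;
        part_start = pos;
        while (pos < end && text[pos] != '_')   /* consume one part */
            pos++;

        if (pos > part_start)
        {
            parts[n].start = part_start;
            parts[n].len = pos - part_start;
            n++;
        }
    }
    return n;
}
```

For "database_tag_number_999" this yields the four parts database, tag,
number and 999, which could then be reported next to the whole token, much
as ts_debug reports a hyphenated word together with its parts.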
Look at these changes to the text parser as well:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00004.php
-Sushant.
On Mon, 2010-09-20 at 16:01 +0200, jesper@krogh.cc wrote:
> Hi.
>
> I'm trying to migrate an application off an existing Full Text Search engine
> and onto PostgreSQL .. one of my main (remaining) headaches is the
> fact that PostgreSQL treats _ as a separator character whereas the existing
> behaviour is to "not split". That means:
>
> testdb=# select ts_debug('database_tag_number_999');
> ts_debug
> ------------------------------------------------------------------------------
> (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
> (blank,"Space symbols",_,{},,)
> (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
> (blank,"Space symbols",_,{},,)
> (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
> (blank,"Space symbols",_,{},,)
> (uint,"Unsigned integer",999,{simple},simple,{999})
> (7 rows)
>
> Where the incoming data, by design, contains a set of tags which include _
> and are each expected to be one "lexeme".
>
> I've tried patching my way out of this using this patch.
>
> $ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
> *** src/backend/tsearch/wparser_def.c.orig 2010-09-20 15:58:37.033336460 +0200
> --- src/backend/tsearch/wparser_def.c 2010-09-20 15:58:41.193335577 +0200
> ***************
> *** 967,986 ****
> --- 967,988 ----
>
> static const TParserStateActionItem actionTPS_InNumWord[] = {
> {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
> {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> + {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
> {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
> {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
> {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
> {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
> {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
> };
>
> static const TParserStateActionItem actionTPS_InAsciiWord[] = {
> {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
> {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
> + {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
> {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
> {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
> {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
> {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
> {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
> ***************
> *** 995,1004 ****
> --- 997,1007 ----
>
> static const TParserStateActionItem actionTPS_InWord[] = {
> {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
> {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
> {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
> + {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
> {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
> {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
> };
>
>
>
> This will obviously break other people's applications, so my question would
> be: if this should be made configurable, how should it be done?
>
> As a sidenote... Xapian doesn't split on _ .. Lucene does.
>
> Thanks.
>
> --
> Jesper
>
>