Re: Configuring Text Search parser? - Mailing list pgsql-hackers

From Sushant Sinha
Subject Re: Configuring Text Search parser?
Msg-id 1285090712.4454.70.camel@yoffice
In response to Configuring Text Search parser?  (jesper@krogh.cc)
List pgsql-hackers
Your changes are fine as far as they go: they will give you tokens that
contain "_" characters. However, it is not a good idea to fold your new
token into an existing token type such as NUMWORD. Give the new token
type its own name, perhaps UnderscoreWord. Then, on seeing "_", move to
a state that can identify the new token, and output the token once it
has been fully recognized.
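
To make the idea concrete, here is a minimal standalone sketch of a
state machine with a dedicated state that is entered on "_". It is not
the wparser_def.c code: the state names, the TOK_UNDERSCOREWORD type and
the next_token() function are all hypothetical and only loosely modelled
on the TPS_* states. A trailing "_" that is not followed by a letter or
digit falls back to the last token recognized so far.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types for this sketch (not the real parser's list). */
typedef enum { TOK_NONE, TOK_ASCIIWORD, TOK_UNDERSCOREWORD } TokType;

/* Toy states; ST_AFTER_UNDERSCORE is the state entered on seeing '_'. */
typedef enum {
    ST_BASE,
    ST_IN_ASCIIWORD,
    ST_AFTER_UNDERSCORE,
    ST_IN_UNDERSCOREWORD
} State;

/*
 * Scan one token starting at s. Stores the token length in *len and
 * returns its type. Returns TOK_NONE with *len == 0 if s does not start
 * with an ASCII letter.
 */
TokType next_token(const char *s, size_t *len)
{
    State   st = ST_BASE;
    size_t  i = 0;
    size_t  last_good = 0;      /* length of the last complete token seen */
    TokType type = TOK_NONE;

    assert(s != NULL && len != NULL);

    for (;;) {
        unsigned char c = (unsigned char) s[i];

        switch (st) {
        case ST_BASE:
            if (isalpha(c)) {
                st = ST_IN_ASCIIWORD;
                type = TOK_ASCIIWORD;
                last_good = ++i;
                continue;
            }
            goto done;
        case ST_IN_ASCIIWORD:
            if (isalpha(c)) { last_good = ++i; continue; }
            if (c == '_')   { st = ST_AFTER_UNDERSCORE; i++; continue; }
            goto done;
        case ST_AFTER_UNDERSCORE:
            /* Only now do we know this really is the new token type. */
            if (isalnum(c)) {
                st = ST_IN_UNDERSCOREWORD;
                type = TOK_UNDERSCOREWORD;
                last_good = ++i;
                continue;
            }
            goto done;          /* dangling '_': keep the earlier token */
        case ST_IN_UNDERSCOREWORD:
            if (isalnum(c)) { last_good = ++i; continue; }
            if (c == '_')   { st = ST_AFTER_UNDERSCORE; i++; continue; }
            goto done;
        }
    }
done:
    *len = last_good;
    return type;
}
```

Calling next_token() on "database_tag_number_999" consumes the whole
string as one TOK_UNDERSCOREWORD, while "tag_ more" yields only the
ASCIIWORD "tag" because nothing usable follows the underscore.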

To extract portions of the newly created token, you can write a special
handler for the token that resets the parser position to the start of
the token, so that its parts can be scanned again. Then modify the state
machine to output each part-token before it enters the state that leads
to the full token identified earlier.
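
As a rough illustration of the rescanning step, the sketch below takes
an already recognized token, goes back to its start, and records each
underscore-separated piece. It is a standalone toy, not a handler for
the real parser: the Part struct and split_underscore_token() are
hypothetical names, and the real state machine would emit the parts
through its own actions rather than filling an array.

```c
#include <assert.h>
#include <stddef.h>

/* Offset and length of one part inside the original token. */
typedef struct {
    size_t start;
    size_t len;
} Part;

/*
 * Re-scan the recognized token tok[0..len) from its beginning and store
 * every non-empty piece between '_' characters in out[]. Returns the
 * number of parts written, which is at most max.
 */
size_t split_underscore_token(const char *tok, size_t len,
                              Part *out, size_t max)
{
    size_t n = 0;
    size_t pos = 0;             /* position reset to the token start */

    assert(tok != NULL && out != NULL);

    while (pos < len && n < max) {
        size_t begin = pos;

        /* Advance to the next '_' or to the end of the token. */
        while (pos < len && tok[pos] != '_')
            pos++;

        if (pos > begin) {
            out[n].start = begin;
            out[n].len = pos - begin;
            n++;
        }
        if (pos < len)
            pos++;              /* step over the '_' itself */
    }
    return n;
}
```

For "database_tag_number_999" this produces four parts: "database",
"tag", "number" and "999", which matches the pieces the default parser
reports today, so both the whole token and its parts could be indexed.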


Look at these changes to the text parser as well:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg00004.php

-Sushant.


On Mon, 2010-09-20 at 16:01 +0200, jesper@krogh.cc wrote:
> Hi.
> 
> I'm trying to migrate an application off an existing Full Text Search engine
> and onto PostgreSQL. One of my main (remaining) headaches is the
> fact that PostgreSQL treats _ as a separator character, whereas the existing
> behaviour is to "not split". That means:
> 
> testdb=# select ts_debug('database_tag_number_999');
>                                    ts_debug
> ------------------------------------------------------------------------------
>  (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
>  (blank,"Space symbols",_,{},,)
>  (uint,"Unsigned integer",999,{simple},simple,{999})
> (7 rows)
> 
> The incoming data, by design, contains a set of tags which include _
> and are each expected to be one "lexeme".
> 
> I've tried patching my way out of this with the following patch:
> 
> $ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
> *** src/backend/tsearch/wparser_def.c.orig    2010-09-20 15:58:37.033336460 +0200
> --- src/backend/tsearch/wparser_def.c    2010-09-20 15:58:41.193335577 +0200
> ***************
> *** 967,986 ****
> --- 967,988 ----
> 
>   static const TParserStateActionItem actionTPS_InNumWord[] = {
>       {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
>       {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
>       {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> +     {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
>       {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
>       {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
>       {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
>       {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
>   };
> 
>   static const TParserStateActionItem actionTPS_InAsciiWord[] = {
>       {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
>       {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
> +     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
>       {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
>       {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
>       {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
> ***************
> *** 995,1004 ****
> --- 997,1007 ----
> 
>   static const TParserStateActionItem actionTPS_InWord[] = {
>       {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
>       {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
>       {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
> +     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
>       {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
>       {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
>   };
> 
> 
> 
> This will obviously break other people's applications, so my question is:
> if this were to be made configurable, how should it be done?
> 
> As a side note: Xapian doesn't split on _, but Lucene does.
> 
> Thanks.
> 
> -- 
> Jesper
> 
> 



