Re: Configuring Text Search parser? - Mailing list pgsql-hackers

From	Sushant Sinha
Subject	Re: Configuring Text Search parser?
Date	September 21, 2010 14:38:53
Msg-id	1285090712.4454.70.camel@yoffice
In response to	Configuring Text Search parser? (jesper@krogh.cc)
List	pgsql-hackers

Your changes are mostly fine: they will give you tokens that contain "_"
characters. However, it is better not to mix your new token with an
existing token type like NUMWORD. Give the new type of token its own
name, perhaps UnderscoreWord. Then, on seeing "_", move to a state that
can identify the new token, and output it once it has been fully
recognized.
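
As a rough illustration of the idea above, here is a minimal, self-contained
sketch of a tokenizer with a separate state and token type for underscore
words. It is a toy, not the table-driven code in wparser_def.c: the names
TOK_UNDERSCOREWORD, ST_IN_UNDERSCORE_WORD and next_token are hypothetical, and
it assumes plain ASCII input and that "_" only joins a word when an
alphanumeric character follows it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types; TOK_UNDERSCOREWORD is the new, separate one. */
typedef enum { TOK_EOF, TOK_BLANK, TOK_WORD, TOK_UNDERSCOREWORD } TokType;

typedef struct {
    TokType     type;
    const char *start;   /* first character of the token */
    size_t      len;     /* token length in bytes */
} Token;

/* Toy parser states; ST_IN_UNDERSCORE_WORD is entered on seeing "_". */
typedef enum { ST_BASE, ST_IN_WORD, ST_IN_UNDERSCORE_WORD, ST_IN_BLANK } State;

/* Read one token starting at *pos and advance *pos past it. */
static Token next_token(const char **pos)
{
    const char *p = *pos;
    Token       tok = { TOK_EOF, p, 0 };
    State       st = ST_BASE;

    assert(pos != NULL && *pos != NULL);

    for (;; p++) {
        unsigned char c = (unsigned char) *p;

        switch (st) {
        case ST_BASE:
            if (c == '\0')
                return tok;                 /* end of input */
            st = isalnum(c) ? ST_IN_WORD : ST_IN_BLANK;
            break;
        case ST_IN_WORD:
            if (isalnum(c))
                break;
            if (c == '_' && isalnum((unsigned char) p[1])) {
                st = ST_IN_UNDERSCORE_WORD; /* switch to the new state */
                break;
            }
            tok.type = TOK_WORD;
            goto done;
        case ST_IN_UNDERSCORE_WORD:
            if (isalnum(c) || (c == '_' && isalnum((unsigned char) p[1])))
                break;
            tok.type = TOK_UNDERSCOREWORD;  /* recognized: output it */
            goto done;
        case ST_IN_BLANK:
            if (c != '\0' && !isalnum(c))
                break;
            tok.type = TOK_BLANK;
            goto done;
        }
    }
done:
    tok.len = (size_t) (p - tok.start);
    *pos = p;
    return tok;
}
```

Keeping the underscore word in its own state means plain words and
number-words are still reported under their existing types, so a text search
configuration could map the new type to a dictionary independently.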

To extract portions of the newly created token, you can write a special
handler for it that resets the parser position to the start of the token
and re-reads it to get its parts. Then modify the state machine to output
each part-token before it enters the state that leads to the whole token
identified earlier.
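
The rewind-and-split step can be sketched as follows. This is only an
illustration under assumptions: Span and underscore_parts are hypothetical
names, the function works on an already-recognized token rather than inside
the real parser's state tables, and empty parts (from doubled underscores) are
skipped. The default parser does something comparable for hyphenated words,
reporting the whole word and then its parts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A slice of the input text: start pointer plus length in bytes. */
typedef struct {
    const char *start;
    size_t      len;
} Span;

/*
 * Hypothetical handler: given the span of a recognized underscore word,
 * go back to its start and collect the pieces between underscores.
 * Writes at most max_parts entries and returns how many were written.
 */
static size_t underscore_parts(Span whole, Span *parts, size_t max_parts)
{
    const char *p = whole.start;              /* reset position to token start */
    const char *end = whole.start + whole.len;
    size_t      n = 0;

    assert(parts != NULL);

    while (p < end && n < max_parts) {
        const char *sep = memchr(p, '_', (size_t) (end - p));
        const char *stop = sep ? sep : end;

        if (stop > p) {                       /* skip empty parts */
            parts[n].start = p;
            parts[n].len = (size_t) (stop - p);
            n++;
        }
        p = sep ? sep + 1 : end;              /* continue after the underscore */
    }
    return n;
}
```

A caller would emit the whole token first and then each returned part, so
both "database_tag_number_999" and "database", "tag", "number", "999" become
searchable.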


Look at these changes to the text parser as well:

http://archives.postgresql.org/pgsql-hackers/2010-09/msg00004.php

-Sushant.


On Mon, 2010-09-20 at 16:01 +0200, jesper@krogh.cc wrote:
> Hi.
> 
> I'm trying to migrate an application off an existing Full Text Search engine
> and onto PostgreSQL .. one of my main (remaining) headaches is the
> fact that PostgreSQL treats _ as a separation character whereas the existing
> behaviour is to "not split". That means:
> 
> testdb=# select ts_debug('database_tag_number_999');
>                                    ts_debug
> ------------------------------------------------------------------------------
>  (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
>  (blank,"Space symbols",_,{},,)
>  (uint,"Unsigned integer",999,{simple},simple,{999})
> (7 rows)
> 
> The incoming data, by design, contains a set of tags which include _
> and are expected to be one "lexeme".
> 
> I've tried patching my way out of this using this patch.
> 
> $ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
> *** src/backend/tsearch/wparser_def.c.orig    2010-09-20 15:58:37.033336460 +0200
> --- src/backend/tsearch/wparser_def.c    2010-09-20 15:58:41.193335577 +0200
> ***************
> *** 967,986 ****
> --- 967,988 ----
> 
>   static const TParserStateActionItem actionTPS_InNumWord[] = {
>       {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
>       {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
>       {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
> +     {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
>       {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
>       {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
>       {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
>       {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
>   };
> 
>   static const TParserStateActionItem actionTPS_InAsciiWord[] = {
>       {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
>       {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
> +     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
>       {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
>       {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
>       {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
> ***************
> *** 995,1004 ****
> --- 997,1007 ----
> 
>   static const TParserStateActionItem actionTPS_InWord[] = {
>       {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
>       {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
>       {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
> +     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
>       {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
>       {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
>       {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
>   };
> 
> 
> 
> This will obviously break other people's applications, so my question would
> be: if this should be made configurable, how should it be done?
> 
> As a sidenote... Xapian doesn't split on _ .. Lucene does.
> 
> Thanks.
> 
> -- 
> Jesper
> 
>