Fix for stop words in thesaurus file - Mailing list pgsql-patches

From Bruce Momjian
Subject Fix for stop words in thesaurus file
Date
Msg-id 200711090232.lA92W9614295@momjian.us
Responses Re: Fix for stop words in thesaurus file  (Simon Riggs <simon@2ndquadrant.com>)
Re: Fix for stop words in thesaurus file  (Bruce Momjian <bruce@momjian.us>)
List pgsql-patches
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> One possible real solution would be to tweak the dictionary APIs so
> >> that the dictionaries can find out whether this is the first load during
> >> a session, or a reload, and emit notices only in the first case.
>
> > Yea, that would work too.  Or just throw an error for a stop word in the
> > file and then you never get a reload (use "*" instead).
>
> Hm, that's a thought --- it'd be a way to solve the problem without an
> API change for dictionaries, which is something to avoid at this late
> stage of the 8.3 cycle.  Come to think of it, does the ts_cache stuff
> work properly when an error is thrown in dictionary load (ie, is the
> cache entry left in a sane state)?

I have developed the attached patch which uses "?" to mark stop words in
the thesaurus file.  ("*" was already in use in the file.)  I updated
the docs to use "?", which makes the documentation clearer too.

The patch also re-enables testing of stop words in the thesaurus file.

FYI, there is no longer a NOTICE for stop words in the thesaurus file;
it throws an error now, and says to use "?" instead.
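
To make the new load-time rule easier to follow, here is a minimal standalone sketch of the decision the patch makes for each sample word. It is not the patch itself: SampleKind, toy_subdict_lookup() and classify_sample_word() are hypothetical names made up for illustration, and the toy word lists stand in for a real subdictionary, assuming "a" and "the" are its stop words.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome codes for illustration; not part of PostgreSQL. */
typedef enum
{
    SAMPLE_PLACEHOLDER,     /* "?": any stop word may appear here */
    SAMPLE_LEXEME,          /* ordinary word known to the subdictionary */
    SAMPLE_ERR_UNKNOWN,     /* subdictionary does not recognize the word */
    SAMPLE_ERR_STOPWORD     /* literal stop word; "?" must be used instead */
} SampleKind;

/*
 * Toy stand-in for the subdictionary (an assumption, not a real API).
 * Returns 1 for a stop word, 0 for a known word, -1 for an unknown word.
 */
int
toy_subdict_lookup(const char *word)
{
    static const char *stop[] = {"a", "the", NULL};
    static const char *known[] = {"one", "two", "booking", "tickets", NULL};
    int         i;

    for (i = 0; stop[i] != NULL; i++)
        if (strcmp(word, stop[i]) == 0)
            return 1;
    for (i = 0; known[i] != NULL; i++)
        if (strcmp(word, known[i]) == 0)
            return 0;
    return -1;
}

/*
 * Classify one word from the left-hand side of a thesaurus rule.
 * The "?" marker is checked first and never reaches the subdictionary;
 * a literal stop word is reported as an error rather than a notice.
 */
SampleKind
classify_sample_word(const char *word)
{
    int         r;

    assert(word != NULL);

    if (strcmp(word, "?") == 0)
        return SAMPLE_PLACEHOLDER;

    r = toy_subdict_lookup(word);
    if (r < 0)
        return SAMPLE_ERR_UNKNOWN;
    if (r > 0)
        return SAMPLE_ERR_STOPWORD;
    return SAMPLE_LEXEME;
}
```

With these toy lists, the words of "booking ? tickets" classify as lexeme, placeholder, lexeme, while "booking the tickets" hits SAMPLE_ERR_STOPWORD on "the". In the real code, that case is where the new elog(ERROR) fires.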

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.30
diff -c -c -r1.30 textsearch.sgml
*** doc/src/sgml/textsearch.sgml    5 Nov 2007 15:55:53 -0000    1.30
--- doc/src/sgml/textsearch.sgml    9 Nov 2007 02:26:17 -0000
***************
*** 2258,2277 ****
     </para>

     <para>
!     Stop words recognized by the subdictionary are replaced by a <quote>stop
!     word placeholder</quote> to record their position. To illustrate this,
!     consider these phrases:

  <programlisting>
! a one the two : swsw
! the one a two : swsw2
  </programlisting>

!     Assuming that <literal>a</> and <literal>the</> are stop words according
!     to the subdictionary, these two phrases are identical to the thesaurus:
!     they both look like <replaceable>stopword</> <literal>one</>
!     <replaceable>stopword</> <literal>two</>.  Input matching this pattern
!     will be replaced by <literal>swsw2</>, according to the tie-breaking rule.
     </para>

     <para>
--- 2258,2274 ----
     </para>

     <para>
!     Specific stop words recognized by the subdictionary cannot be
!     specified;  instead use <literal>?</> to mark the location where any
!     stop word can appear.  For example, assuming that <literal>a</> and
!     <literal>the</> are stop words according to the subdictionary:

  <programlisting>
! ? one ? two : swsw
  </programlisting>

!     matches <literal>a one the two</> and <literal>the one a two</>;
!     both would be replaced by <literal>swsw</>.
     </para>

     <para>
Index: src/backend/tsearch/dict_thesaurus.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/dict_thesaurus.c,v
retrieving revision 1.5
diff -c -c -r1.5 dict_thesaurus.c
*** src/backend/tsearch/dict_thesaurus.c    9 Nov 2007 01:32:22 -0000    1.5
--- src/backend/tsearch/dict_thesaurus.c    9 Nov 2007 02:26:17 -0000
***************
*** 412,458 ****
      {
          TSLexeme   *ptr;

!         ptr = (TSLexeme *) DatumGetPointer(FunctionCall4(&(d->subdict->lexize),
!                                        PointerGetDatum(d->subdict->dictData),
!                                           PointerGetDatum(d->wrds[i].lexeme),
!                                     Int32GetDatum(strlen(d->wrds[i].lexeme)),
!                                                      PointerGetDatum(NULL)));
!
!         if (!ptr)
!             elog(ERROR, "thesaurus word-sample \"%s\" isn't recognized by subdictionary (rule %d)",
!                  d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
!         else if (!(ptr->lexeme))
!         {
!             elog(NOTICE, "thesaurus word-sample \"%s\" is recognized as stop-word, assign any stop-word (rule %d)",
!                  d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
!
              newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, NULL, d->wrds[i].entries, 0);
-         }
          else
          {
!             while (ptr->lexeme)
              {
!                 TSLexeme   *remptr = ptr + 1;
!                 int            tnvar = 1;
!                 int            curvar = ptr->nvariant;
!
!                 /* compute n words in one variant */
!                 while (remptr->lexeme)
                  {
!                     if (remptr->nvariant != (remptr - 1)->nvariant)
!                         break;
!                     tnvar++;
!                     remptr++;
!                 }
!
!                 remptr = ptr;
!                 while (remptr->lexeme && remptr->nvariant == curvar)
!                 {
!                     newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, remptr, d->wrds[i].entries, tnvar);
!                     remptr++;
                  }
-
-                 ptr = remptr;
              }
          }

--- 412,459 ----
      {
          TSLexeme   *ptr;

!         if (strcmp(d->wrds[i].lexeme, "?") == 0)    /* Is stop word marker? */
              newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, NULL, d->wrds[i].entries, 0);
          else
          {
!             ptr = (TSLexeme *) DatumGetPointer(FunctionCall4(&(d->subdict->lexize),
!                                            PointerGetDatum(d->subdict->dictData),
!                                               PointerGetDatum(d->wrds[i].lexeme),
!                                         Int32GetDatum(strlen(d->wrds[i].lexeme)),
!                                                          PointerGetDatum(NULL)));
!
!             if (!ptr)
!                 elog(ERROR, "thesaurus word-sample \"%s\" isn't recognized by subdictionary (rule %d)",
!                      d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
!             else if (!(ptr->lexeme))
!                 elog(ERROR, "thesaurus word-sample \"%s\" is recognized as stop-word, use \"?\" for stop words instead(rule %d)",
!                      d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
!             else
              {
!                 while (ptr->lexeme)
                  {
!                     TSLexeme   *remptr = ptr + 1;
!                     int            tnvar = 1;
!                     int            curvar = ptr->nvariant;
!
!                     /* compute n words in one variant */
!                     while (remptr->lexeme)
!                     {
!                         if (remptr->nvariant != (remptr - 1)->nvariant)
!                             break;
!                         tnvar++;
!                         remptr++;
!                     }
!
!                     remptr = ptr;
!                     while (remptr->lexeme && remptr->nvariant == curvar)
!                     {
!                         newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, remptr, d->wrds[i].entries, tnvar);
!                         remptr++;
!                     }
!
!                     ptr = remptr;
                  }
              }
          }

Index: src/backend/tsearch/thesaurus_sample.ths
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/thesaurus_sample.ths,v
retrieving revision 1.2
diff -c -c -r1.2 thesaurus_sample.ths
*** src/backend/tsearch/thesaurus_sample.ths    23 Sep 2007 15:58:58 -0000    1.2
--- src/backend/tsearch/thesaurus_sample.ths    9 Nov 2007 02:26:17 -0000
***************
*** 14,17 ****
  supernovae stars : *sn
  supernovae : *sn
  booking tickets : order invitation cards
! # booking the tickets : order invitation Cards
--- 14,18 ----
  supernovae stars : *sn
  supernovae : *sn
  booking tickets : order invitation cards
! booking ? tickets : order invitation Cards
!
Index: src/test/regress/expected/tsdicts.out
===================================================================
RCS file: /cvsroot/pgsql/src/test/regress/expected/tsdicts.out,v
retrieving revision 1.3
diff -c -c -r1.3 tsdicts.out
*** src/test/regress/expected/tsdicts.out    23 Oct 2007 20:46:12 -0000    1.3
--- src/test/regress/expected/tsdicts.out    9 Nov 2007 02:26:20 -0000
***************
*** 311,318 ****
  (1 row)

  SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets');
!                              to_tsvector
! ---------------------------------------------------------------------
!  'book':8 'card':3 'like':6 'look':5 'invit':2 'order':1 'ticket':10
  (1 row)

--- 311,318 ----
  (1 row)

  SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets');
!                       to_tsvector
! -------------------------------------------------------
!  'card':3,10 'like':6 'look':5 'invit':2,9 'order':1,8
  (1 row)

