Thread: Fix for stop words in thesaurus file

Fix for stop words in thesaurus file

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> One possible real solution would be to tweak the dictionary APIs so
> >> that the dictionaries can find out whether this is the first load during
> >> a session, or a reload, and emit notices only in the first case.
>
> > Yea, that would work too.  Or just throw an error for a stop word in the
> > file and then you never get a reload (use "*" instead).
>
> Hm, that's a thought --- it'd be a way to solve the problem without an
> API change for dictionaries, which is something to avoid at this late
> stage of the 8.3 cycle.  Come to think of it, does the ts_cache stuff
> work properly when an error is thrown in dictionary load (ie, is the
> cache entry left in a sane state)?

I have developed the attached patch which uses "?" to mark stop words in
the thesaurus file.  ("*" was already in use in the file.)  I updated
the docs to use "?", which makes the documentation clearer too.

The patch also reenables testing of stop words in the thesuarus file.

FYI, there is no longer a NOTICE for stop words in the thesaurus file;
it throws an error now, and says to use "?" instead.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.30
diff -c -c -r1.30 textsearch.sgml
*** doc/src/sgml/textsearch.sgml    5 Nov 2007 15:55:53 -0000    1.30
--- doc/src/sgml/textsearch.sgml    9 Nov 2007 02:26:17 -0000
***************
*** 2258,2277 ****
     </para>

     <para>
!     Stop words recognized by the subdictionary are replaced by a <quote>stop
!     word placeholder</quote> to record their position. To illustrate this,
!     consider these phrases:

  <programlisting>
! a one the two : swsw
! the one a two : swsw2
  </programlisting>

!     Assuming that <literal>a</> and <literal>the</> are stop words according
!     to the subdictionary, these two phrases are identical to the thesaurus:
!     they both look like <replaceable>stopword</> <literal>one</>
!     <replaceable>stopword</> <literal>two</>.  Input matching this pattern
!     will be replaced by <literal>swsw2</>, according to the tie-breaking rule.
     </para>

     <para>
--- 2258,2274 ----
     </para>

     <para>
!     Specific stop words recognized by the subdictionary cannot be
!     specified;  instead use <literal>?</> to mark the location where any
!     stop word can appear.  For example, assuming that <literal>a</> and
!     <literal>the</> are stop words according to the subdictionary:

  <programlisting>
! ? one ? two : swsw
  </programlisting>

!     matches <literal>a one the two</> and <literal>the one a two</>;
!     both would be replaced by <literal>swsw</>.
     </para>

     <para>
Index: src/backend/tsearch/dict_thesaurus.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/dict_thesaurus.c,v
retrieving revision 1.5
diff -c -c -r1.5 dict_thesaurus.c
*** src/backend/tsearch/dict_thesaurus.c    9 Nov 2007 01:32:22 -0000    1.5
--- src/backend/tsearch/dict_thesaurus.c    9 Nov 2007 02:26:17 -0000
***************
*** 412,458 ****
      {
          TSLexeme   *ptr;

!         ptr = (TSLexeme *) DatumGetPointer(FunctionCall4(&(d->subdict->lexize),
!                                        PointerGetDatum(d->subdict->dictData),
!                                           PointerGetDatum(d->wrds[i].lexeme),
!                                     Int32GetDatum(strlen(d->wrds[i].lexeme)),
!                                                      PointerGetDatum(NULL)));
!
!         if (!ptr)
!             elog(ERROR, "thesaurus word-sample \"%s\" isn't recognized by subdictionary (rule %d)",
!                  d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
!         else if (!(ptr->lexeme))
!         {
!             elog(NOTICE, "thesaurus word-sample \"%s\" is recognized as stop-word, assign any stop-word (rule %d)",
!                  d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
!
              newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, NULL, d->wrds[i].entries, 0);
-         }
          else
          {
!             while (ptr->lexeme)
              {
!                 TSLexeme   *remptr = ptr + 1;
!                 int            tnvar = 1;
!                 int            curvar = ptr->nvariant;
!
!                 /* compute n words in one variant */
!                 while (remptr->lexeme)
                  {
!                     if (remptr->nvariant != (remptr - 1)->nvariant)
!                         break;
!                     tnvar++;
!                     remptr++;
!                 }
!
!                 remptr = ptr;
!                 while (remptr->lexeme && remptr->nvariant == curvar)
!                 {
!                     newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, remptr, d->wrds[i].entries, tnvar);
!                     remptr++;
                  }
-
-                 ptr = remptr;
              }
          }

--- 412,459 ----
      {
          TSLexeme   *ptr;

!         if (strcmp(d->wrds[i].lexeme, "?") == 0)    /* Is stop word marker? */
              newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, NULL, d->wrds[i].entries, 0);
          else
          {
!             ptr = (TSLexeme *) DatumGetPointer(FunctionCall4(&(d->subdict->lexize),
!                                            PointerGetDatum(d->subdict->dictData),
!                                               PointerGetDatum(d->wrds[i].lexeme),
!                                         Int32GetDatum(strlen(d->wrds[i].lexeme)),
!                                                          PointerGetDatum(NULL)));
!
!             if (!ptr)
!                 elog(ERROR, "thesaurus word-sample \"%s\" isn't recognized by subdictionary (rule %d)",
!                      d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
!             else if (!(ptr->lexeme))
!                 elog(ERROR, "thesaurus word-sample \"%s\" is recognized as stop-word, use \"?\" for stop words
instead(rule %d)", 
!                      d->wrds[i].lexeme, d->wrds[i].entries->idsubst + 1);
!             else
              {
!                 while (ptr->lexeme)
                  {
!                     TSLexeme   *remptr = ptr + 1;
!                     int            tnvar = 1;
!                     int            curvar = ptr->nvariant;
!
!                     /* compute n words in one variant */
!                     while (remptr->lexeme)
!                     {
!                         if (remptr->nvariant != (remptr - 1)->nvariant)
!                             break;
!                         tnvar++;
!                         remptr++;
!                     }
!
!                     remptr = ptr;
!                     while (remptr->lexeme && remptr->nvariant == curvar)
!                     {
!                         newwrds = addCompiledLexeme(newwrds, &nnw, &tnm, remptr, d->wrds[i].entries, tnvar);
!                         remptr++;
!                     }
!
!                     ptr = remptr;
                  }
              }
          }

Index: src/backend/tsearch/thesaurus_sample.ths
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/thesaurus_sample.ths,v
retrieving revision 1.2
diff -c -c -r1.2 thesaurus_sample.ths
*** src/backend/tsearch/thesaurus_sample.ths    23 Sep 2007 15:58:58 -0000    1.2
--- src/backend/tsearch/thesaurus_sample.ths    9 Nov 2007 02:26:17 -0000
***************
*** 14,17 ****
  supernovae stars : *sn
  supernovae : *sn
  booking tickets : order invitation cards
! # booking the tickets : order invitation Cards
--- 14,18 ----
  supernovae stars : *sn
  supernovae : *sn
  booking tickets : order invitation cards
! booking ? tickets : order invitation Cards
!
Index: src/test/regress/expected/tsdicts.out
===================================================================
RCS file: /cvsroot/pgsql/src/test/regress/expected/tsdicts.out,v
retrieving revision 1.3
diff -c -c -r1.3 tsdicts.out
*** src/test/regress/expected/tsdicts.out    23 Oct 2007 20:46:12 -0000    1.3
--- src/test/regress/expected/tsdicts.out    9 Nov 2007 02:26:20 -0000
***************
*** 311,318 ****
  (1 row)

  SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets');
!                              to_tsvector
! ---------------------------------------------------------------------
!  'book':8 'card':3 'like':6 'look':5 'invit':2 'order':1 'ticket':10
  (1 row)

--- 311,318 ----
  (1 row)

  SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets');
!                       to_tsvector
! -------------------------------------------------------
!  'card':3,10 'like':6 'look':5 'invit':2,9 'order':1,8
  (1 row)


Re: Fix for stop words in thesaurus file

From
Simon Riggs
Date:
On Thu, 2007-11-08 at 21:32 -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Tom Lane wrote:
> > >> One possible real solution would be to tweak the dictionary APIs so
> > >> that the dictionaries can find out whether this is the first load during
> > >> a session, or a reload, and emit notices only in the first case.
> >
> > > Yea, that would work too.  Or just throw an error for a stop word in the
> > > file and then you never get a reload (use "*" instead).
> >
> > Hm, that's a thought --- it'd be a way to solve the problem without an
> > API change for dictionaries, which is something to avoid at this late
> > stage of the 8.3 cycle.  Come to think of it, does the ts_cache stuff
> > work properly when an error is thrown in dictionary load (ie, is the
> > cache entry left in a sane state)?
>
> I have developed the attached patch which uses "?" to mark stop words in
> the thesaurus file.  ("*" was already in use in the file.)  I updated
> the docs to use "?", which makes the documentation clearer too.
>
> The patch also reenables testing of stop words in the thesuarus file.
>
> FYI, there is no longer a NOTICE for stop words in the thesaurus file;
> it throws an error now, and says to use "?" instead.

So this fix requires people to have a different dictionary file from 8.2
to 8.3, and to manually edit the file? That makes upgrade even harder
and more error prone.

Seems easier to do it the way Tom suggested and only emit notices in the
first case.

I notice there's still a placeholder in the docs for how to upgrade.
Perhaps if we wrote that now it would make it clearer where the
difficulties lie?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Fix for stop words in thesaurus file

From
Bruce Momjian
Date:
Simon Riggs wrote:
> > I have developed the attached patch which uses "?" to mark stop words in
> > the thesaurus file.  ("*" was already in use in the file.)  I updated
> > the docs to use "?", which makes the documentation clearer too.
> >
> > The patch also reenables testing of stop words in the thesuarus file.
> >
> > FYI, there is no longer a NOTICE for stop words in the thesaurus file;
> > it throws an error now, and says to use "?" instead.
>
> So this fix requires people to have a different dictionary file from 8.2
> to 8.3, and to manually edit the file? That makes upgrade even harder
> and more error prone.
>
> Seems easier to do it the way Tom suggested and only emit notices in the
> first case.

There is no easy way to do that so I have not heard that proposed as an
actual solution.

> I notice there's still a placeholder in the docs for how to upgrade.
> Perhaps if we wrote that now it would make it clearer where the
> difficulties lie?

Yes, this would be one of the many changes in the API.  Of course, if
you did use a stop word you would now get an error and it would tell you
explicitly to use "?" for stop words, so I don't see a problem here.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Fix for stop words in thesaurus file

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Tom Lane wrote:
> > >> One possible real solution would be to tweak the dictionary APIs so
> > >> that the dictionaries can find out whether this is the first load during
> > >> a session, or a reload, and emit notices only in the first case.
> >
> > > Yea, that would work too.  Or just throw an error for a stop word in the
> > > file and then you never get a reload (use "*" instead).
> >
> > Hm, that's a thought --- it'd be a way to solve the problem without an
> > API change for dictionaries, which is something to avoid at this late
> > stage of the 8.3 cycle.  Come to think of it, does the ts_cache stuff
> > work properly when an error is thrown in dictionary load (ie, is the
> > cache entry left in a sane state)?
>
> I have developed the attached patch which uses "?" to mark stop words in
> the thesaurus file.  ("*" was already in use in the file.)  I updated
> the docs to use "?", which makes the documentation clearer too.
>
> The patch also reenables testing of stop words in the thesuarus file.
>
> FYI, there is no longer a NOTICE for stop words in the thesaurus file;
> it throws an error now, and says to use "?" instead.

Patch applied.  I added documentation on this API change too.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +