Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords - Mailing list pgsql-patches

From Jan Urbański
Subject Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Date
Msg-id 473A61E2.4070908@students.mimuw.edu.pl
In response to Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords  (Jan Urbański <j.urbanski@students.mimuw.edu.pl>)
List pgsql-patches
Jan Urbański wrote:
> Great, I didn't know the API was that convenient in 8.3. I'll try
> posting a working patch for 8.3 during the weekend.

Here's the patch for 8.3beta2. As suggested, I added a configuration
parameter called AcceptAll to the 'simple' dictionary, so it can now work
in two modes: either accept everything (the default) or recognize
nothing (return NULL). Of course, stopwords are still weeded out in
both modes.
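
The two modes described above could be modelled roughly as follows. This
is a standalone sketch, not the patched dict_simple.c: LexOutcome,
SimpleDictModel and simple_lexize_model are made-up names, the stop list
is a plain array rather than a StopList, and lower-casing is ASCII-only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names exist in PostgreSQL. */
typedef enum
{
    LEX_UNRECOGNIZED,   /* models a NULL result: token is not recognized */
    LEX_STOPWORD,       /* models an empty array: token is a stop word */
    LEX_ACCEPTED        /* models a one-element array holding the lexeme */
} LexOutcome;

typedef struct
{
    const char *const *stopwords;   /* stop words, already lower-cased */
    size_t      nstopwords;
    bool        accept_all;         /* mirrors the proposed AcceptAll option */
} SimpleDictModel;

/*
 * Lower-case token into out (truncating to outlen - 1 bytes) and classify it.
 * Assumption: single-byte ASCII input; the real code uses lowerstr_with_len.
 */
static LexOutcome
simple_lexize_model(const SimpleDictModel *d, const char *token,
                    char *out, size_t outlen)
{
    size_t      i;

    assert(d != NULL && token != NULL && out != NULL && outlen > 0);

    for (i = 0; token[i] != '\0' && i + 1 < outlen; i++)
        out[i] = (char) tolower((unsigned char) token[i]);
    out[i] = '\0';

    /* Empty tokens and stop words are filtered out in either mode. */
    if (out[0] == '\0')
        return LEX_STOPWORD;
    for (i = 0; i < d->nstopwords; i++)
        if (strcmp(d->stopwords[i], out) == 0)
            return LEX_STOPWORD;

    /* Only the treatment of non-stop-words depends on the option. */
    return d->accept_all ? LEX_ACCEPTED : LEX_UNRECOGNIZED;
}
```

With accept_all set to false, a token such as "YeS" comes back as
LEX_UNRECOGNIZED, while "The" is still reported as a stop word as long as
"the" is in the list.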

The patch includes changes to the documentation (which was inconsistent
by the way: it stated that the 'simple' dictionary returns NULL for
stopwords, when in fact it returns an empty array).
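
The difference between NULL and an empty array matters because a text
search configuration consults its dictionaries in order and moves on to
the next one only when the current one returns NULL. A rough model of
that dispatch, assuming tokens are already lower-cased; DictResult,
DictFn, stop_filter, catch_all and consult_chain are all invented for
this illustration and are not the backend's API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result kinds, standing in for what a lexize call returns. */
typedef enum
{
    RES_NULL,       /* not recognized: try the next dictionary */
    RES_EMPTY,      /* stop word: discard the token, stop searching */
    RES_LEXEME      /* recognized: use the returned lexeme, stop searching */
} DictResult;

typedef DictResult (*DictFn) (const char *token);

/* Models a 'simple' dictionary with AcceptAll = false and one stop word. */
static DictResult
stop_filter(const char *token)
{
    assert(token != NULL);
    return strcmp(token, "the") == 0 ? RES_EMPTY : RES_NULL;
}

/* Models some later dictionary that recognizes every token. */
static DictResult
catch_all(const char *token)
{
    assert(token != NULL);
    return RES_LEXEME;
}

/*
 * Ask each dictionary in turn; the first non-NULL answer settles the token.
 * Returns the index of the dictionary that answered, or -1 if none did.
 */
static int
consult_chain(const DictFn *chain, size_t n, const char *token,
              DictResult *result)
{
    size_t      i;

    assert(chain != NULL && token != NULL && result != NULL);

    for (i = 0; i < n; i++)
    {
        DictResult  r = chain[i] (token);

        if (r != RES_NULL)
        {
            *result = r;
            return (int) i;
        }
    }
    *result = RES_NULL;
    return -1;
}
```

In this model "the" is settled by the first dictionary and never reaches
the second, whereas any other token falls through to catch_all. That is
the stopword-only filtering the patch is meant to allow.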

Regards,
Jan Urbanski
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
diff -Naur postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml postgresql-8.3beta2/doc/src/sgml/textsearch.sgml
--- postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml    2007-10-27 02:19:45.000000000 +0200
+++ postgresql-8.3beta2/doc/src/sgml/textsearch.sgml    2007-11-14 03:35:48.000000000 +0100
@@ -2090,9 +2090,10 @@
    <para>
     The <literal>simple</> dictionary template operates by converting the
     input token to lower case and checking it against a file of stop words.
-    If it is found in the file then <literal>NULL</> is returned, causing
-    the token to be discarded.  If not, the lower-cased form of the word
-    is returned as the normalized lexeme.
+    If it is found in the file then an empty array is returned. If not, the
+    return value depends on the configuration. The default is to return the
+    lower-cased form of the word, but one might choose to
+    return <literal>NULL</> instead.
    </para>

    <para>
@@ -2135,6 +2136,34 @@
 </programlisting>
    </para>

+   <para>
+     We can also choose to return <literal>NULL</> instead of the lower-cased
+     lexeme if it is not found in the stop words file. This can be useful if
+     we just want to pass the unchanged lexeme to another dictionary instead
+     of reporting it as recognized. We can control this behaviour through
+     the <literal>AcceptAll</> parameter. Valid values for this parameter
+     are <literal>true</> and <literal>false</>; the default
+     is <literal>true</>.
+   </para>
+
+   <para>
+     Using the same configuration as in the previous example:
+
+<programlisting>
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( AcceptAll = false );
+
+SELECT ts_lexize('public.simple_dict','YeS');
+ ts_lexize
+-----------
+
+
+SELECT ts_lexize('public.simple_dict','The');
+ ts_lexize
+-----------
+ {}
+</programlisting>
+   </para>
+
    <caution>
     <para>
      Most types of dictionaries rely on configuration files, such as files of
diff -Naur postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c postgresql-8.3beta2/src/backend/tsearch/dict_simple.c
--- postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c    2007-08-25 02:03:59.000000000 +0200
+++ postgresql-8.3beta2/src/backend/tsearch/dict_simple.c    2007-11-14 03:39:45.000000000 +0100
@@ -23,6 +23,7 @@
 typedef struct
 {
     StopList    stoplist;
+    bool        acceptAll;
 } DictSimple;


@@ -31,8 +32,12 @@
 {
     List       *dictoptions = (List *) PG_GETARG_POINTER(0);
     DictSimple *d = (DictSimple *) palloc0(sizeof(DictSimple));
-    bool        stoploaded = false;
+    bool        stoploaded = false,
+            acceptloaded = false;
     ListCell   *l;
+    const char    *defstring;
+
+    d->acceptAll = true;

     foreach(l, dictoptions)
     {
@@ -47,6 +52,24 @@
             readstoplist(defGetString(defel), &d->stoplist, lowerstr);
             stoploaded = true;
         }
+        else if (pg_strcasecmp("AcceptAll", defel->defname) == 0)
+        {
+            if (acceptloaded)
+                ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("multiple AcceptAll parameters")));
+            defstring = defGetString(defel);
+            if (pg_strcasecmp(defstring, "True") == 0)
+                d->acceptAll = true;
+            else if (pg_strcasecmp(defstring, "False") == 0)
+                d->acceptAll = false;
+            else
+                ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("invalid value for AcceptAll parameter: \"%s\"",
+                             defstring)));
+            acceptloaded = true;
+        }
         else
         {
             ereport(ERROR,
@@ -71,9 +94,18 @@
     txt = lowerstr_with_len(in, len);

     if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    {
         pfree(txt);
+        PG_RETURN_POINTER(res);
+    }
     else
-        res[0].lexeme = txt;
-
-    PG_RETURN_POINTER(res);
+    {
+        if (d->acceptAll)
+        {
+            res[0].lexeme = txt;
+            PG_RETURN_POINTER(res);
+        }
+        else
+            PG_RETURN_POINTER(NULL);
+    }
 }
