Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords - Mailing list pgsql-patches

From Jan Urbański
Subject Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Date
Msg-id 473AD9B9.4020908@students.mimuw.edu.pl
Whole thread Raw
In response to Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
List pgsql-patches
> This bit should be replaced with defGetBoolean.  Otherwise it looks
> reasonably sane.

Fixed that, thank you.

Regards,
Jan Urbanski
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
diff -Naur postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml postgresql-8.3beta2/doc/src/sgml/textsearch.sgml
--- postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml    2007-10-27 02:19:45.000000000 +0200
+++ postgresql-8.3beta2/doc/src/sgml/textsearch.sgml    2007-11-14 03:35:48.000000000 +0100
@@ -2090,9 +2090,10 @@
    <para>
     The <literal>simple</> dictionary template operates by converting the
     input token to lower case and checking it against a file of stop words.
-    If it is found in the file then <literal>NULL</> is returned, causing
-    the token to be discarded.  If not, the lower-cased form of the word
-    is returned as the normalized lexeme.
+    If it is found in the file then an empty array is returned. If not, the
+    return value depends on the configuration. The default is to return the
+    lower-cased form of the word, but one might choose to
+    return <literal>NULL</> insead.
    </para>

    <para>
@@ -2135,6 +2136,34 @@
 </programlisting>
    </para>

+   <para>
+     We can also choose to return <literal>NULL</> insead of the lower-cased
+     lexeme if it is not found in the stop words file. This can be useful if
+     we just want to pass the unchanged lexeme to another dictionary instead
+     of reporting it as reckognized. We can control this behaviour through
+     the <literal>AcceptAll</> parameter. Correct values for this parameter
+     are <literal>true</> and <literal>false</>, the default
+     is <literal>true</>.
+   </para>
+
+   <para>
+     Using the same configuration as in the previous example:
+
+<programlisting>
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( AcceptAll = false );
+
+SELECT ts_lexize('public.simple_dict','YeS');
+ ts_lexize
+-----------
+
+
+SELECT ts_lexize('public.simple_dict','The');
+ ts_lexize
+-----------
+ {}
+</programlisting>
+   </para>
+
    <caution>
     <para>
      Most types of dictionaries rely on configuration files, such as files of
diff -Naur postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c
postgresql-8.3beta2/src/backend/tsearch/dict_simple.c
--- postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c    2007-08-25 02:03:59.000000000 +0200
+++ postgresql-8.3beta2/src/backend/tsearch/dict_simple.c    2007-11-14 12:17:05.000000000 +0100
@@ -23,6 +23,7 @@
 typedef struct
 {
     StopList    stoplist;
+    bool        acceptAll;
 } DictSimple;


@@ -31,9 +32,12 @@
 {
     List       *dictoptions = (List *) PG_GETARG_POINTER(0);
     DictSimple *d = (DictSimple *) palloc0(sizeof(DictSimple));
-    bool        stoploaded = false;
+    bool        stoploaded = false,
+            acceptloaded = false;
     ListCell   *l;

+    d->acceptAll = true;
+
     foreach(l, dictoptions)
     {
         DefElem    *defel = (DefElem *) lfirst(l);
@@ -47,6 +51,18 @@
             readstoplist(defGetString(defel), &d->stoplist, lowerstr);
             stoploaded = true;
         }
+        else if (pg_strcasecmp("AcceptAll", defel->defname) == 0)
+        {
+            if (acceptloaded)
+                ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("multiple AcceptAll parameters")));
+            if (defGetBoolean(defel))
+                d->acceptAll = true;
+            else
+                d->acceptAll = false;
+            acceptloaded = true;
+        }
         else
         {
             ereport(ERROR,
@@ -71,9 +87,18 @@
     txt = lowerstr_with_len(in, len);

     if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    {
         pfree(txt);
+        PG_RETURN_POINTER(res);
+    }
     else
-        res[0].lexeme = txt;
-
-    PG_RETURN_POINTER(res);
+    {
+        if (d->acceptAll)
+        {
+            res[0].lexeme = txt;
+            PG_RETURN_POINTER(res);
+        }
+        else
+            PG_RETURN_POINTER(NULL);
+    }
 }

Attachment

pgsql-patches by date:

Previous
From: Jan Urbański
Date:
Subject: Re: a tsearch2 (8.2.4) dictionary that only filters out stopwords
Next
From: Heikki Linnakangas
Date:
Subject: Re: Miscalculation in IsCheckpointOnSchedule()