Re: Stack overflow issue - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Stack overflow issue
Date
Msg-id 3802215.1661900226@sss.pgh.pa.us
In response to Re: Stack overflow issue  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Stack overflow issue
List pgsql-hackers

I wrote:
> The upstream recommendation, which seems pretty sane to me, is to
> simply reject any string exceeding some threshold length as not
> possibly being a word.  Apparently it's common to use thresholds
> as small as 64 bytes, but in the attached I used 1000 bytes.

On further thought: that coding treats anything longer than 1000
bytes as a stopword, but maybe we should just accept it unmodified.
The manual says "A Snowball dictionary recognizes everything, whether
or not it is able to simplify the word".  While "recognizes" formally
includes the case of "recognizes as a stopword", people might find
this behavior surprising.  We could alternatively do it as attached,
which accepts overlength words but does nothing to them except
case-fold.  This is closer to the pre-patch behavior, but gives up
the opportunity to avoid useless downstream processing of long words.
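
For readers who want to try the control flow outside the backend, here is a
minimal standalone sketch of the three-way decision described above. It is
not the PostgreSQL code: lexize(), is_stopword(), toy_stem(), LexResult and
MAX_STEM_INPUT_LEN are hypothetical names invented for illustration, the
stoplist is a hard-wired toy, and the "stemmer" merely strips a trailing
's'. Only the 1000-byte threshold and the order of the tests are taken from
the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Threshold used in the patch; every other name here is hypothetical. */
#define MAX_STEM_INPUT_LEN 1000

typedef enum { LEX_STOPWORD, LEX_UNMODIFIED, LEX_STEMMED } LexKind;

typedef struct
{
    LexKind kind;
    char   *lexeme;     /* malloc'd result, or NULL for a stopword */
} LexResult;

/* Toy stand-in for a stoplist lookup (assumption, not searchstoplist). */
static bool
is_stopword(const char *w)
{
    static const char *stops[] = {"the", "a", "an"};

    for (size_t i = 0; i < sizeof(stops) / sizeof(stops[0]); i++)
        if (strcmp(w, stops[i]) == 0)
            return true;
    return false;
}

/* Toy stand-in for a stemmer: drop one trailing 's'. */
static void
toy_stem(char *w)
{
    size_t n = strlen(w);

    if (n > 1 && w[n - 1] == 's')
        w[n - 1] = '\0';
}

static LexResult
lexize(const char *in, size_t len)
{
    LexResult r = {LEX_STOPWORD, NULL};
    char   *txt;

    assert(in != NULL);
    txt = malloc(len + 1);
    assert(txt != NULL);

    /* Case-fold first, as the patched function does before any test. */
    for (size_t i = 0; i < len; i++)
        txt[i] = (char) tolower((unsigned char) in[i]);
    txt[len] = '\0';

    if (len > MAX_STEM_INPUT_LEN)
    {
        /* Overlength: accept it case-folded, never call the stemmer. */
        r.kind = LEX_UNMODIFIED;
        r.lexeme = txt;
    }
    else if (*txt == '\0' || is_stopword(txt))
    {
        /* Empty or stopword: report no lexeme. */
        free(txt);
    }
    else
    {
        toy_stem(txt);
        r.kind = LEX_STEMMED;
        r.lexeme = txt;
    }
    return r;
}
```

Note that the length test comes first, so an overlength string is never
looked up in the stoplist either. The caller owns the returned lexeme and
should free() it.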

            regards, tom lane

diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c
index 68c9213f69..1d5dfff5a0 100644
--- a/src/backend/snowball/dict_snowball.c
+++ b/src/backend/snowball/dict_snowball.c
@@ -275,8 +275,24 @@ dsnowball_lexize(PG_FUNCTION_ARGS)
     char       *txt = lowerstr_with_len(in, len);
     TSLexeme   *res = palloc0(sizeof(TSLexeme) * 2);

-    if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    /*
+     * Do not pass strings exceeding 1000 bytes to the stemmer, as they're
+     * surely not words in any human language.  This restriction avoids
+     * wasting cycles on stuff like base64-encoded data, and it protects us
+     * against possible inefficiency or misbehavior in the stemmer.  (For
+     * example, the Turkish stemmer has an indefinite recursion, so it can
+     * crash on long-enough strings.)  However, Snowball dictionaries are
+     * defined to recognize all strings, so we can't reject the string as an
+     * unknown word.
+     */
+    if (len > 1000)
+    {
+        /* return the lexeme lowercased, but otherwise unmodified */
+        res->lexeme = txt;
+    }
+    else if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
     {
+        /* empty or stopword, so report as stopword */
         pfree(txt);
     }
     else
