
Re: Stack overflow issue - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: Stack overflow issue
Date	August 31, 2022 01:57:06
Msg-id	3802215.1661900226@sss.pgh.pa.us
In response to	Re: Stack overflow issue (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Stack overflow issue
List	pgsql-hackers


I wrote:
> The upstream recommendation, which seems pretty sane to me, is to
> simply reject any string exceeding some threshold length as not
> possibly being a word.  Apparently it's common to use thresholds
> as small as 64 bytes, but in the attached I used 1000 bytes.

On further thought: that coding treats anything longer than 1000
bytes as a stopword, but maybe we should just accept it unmodified.
The manual says "A Snowball dictionary recognizes everything, whether
or not it is able to simplify the word".  While "recognizes" formally
includes the case of "recognizes as a stopword", people might find
this behavior surprising.  We could alternatively do it as attached,
which accepts overlength words but does nothing to them except
case-fold.  This is closer to the pre-patch behavior, but gives up
the opportunity to avoid useless downstream processing of long words.

            regards, tom lane

diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c
index 68c9213f69..1d5dfff5a0 100644
--- a/src/backend/snowball/dict_snowball.c
+++ b/src/backend/snowball/dict_snowball.c
@@ -275,8 +275,24 @@ dsnowball_lexize(PG_FUNCTION_ARGS)
     char       *txt = lowerstr_with_len(in, len);
     TSLexeme   *res = palloc0(sizeof(TSLexeme) * 2);

-    if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    /*
+     * Do not pass strings exceeding 1000 bytes to the stemmer, as they're
+     * surely not words in any human language.  This restriction avoids
+     * wasting cycles on stuff like base64-encoded data, and it protects us
+     * against possible inefficiency or misbehavior in the stemmer.  (For
+     * example, the Turkish stemmer has an indefinite recursion, so it can
+     * crash on long-enough strings.)  However, Snowball dictionaries are
+     * defined to recognize all strings, so we can't reject the string as an
+     * unknown word.
+     */
+    if (len > 1000)
+    {
+        /* return the lexeme lowercased, but otherwise unmodified */
+        res->lexeme = txt;
+    }
+    else if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
     {
+        /* empty or stopword, so report as stopword */
         pfree(txt);
     }
     else
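
The three-way branch in the hunk above can be tried outside the backend with
a minimal standalone sketch. Everything here is an assumption made for
illustration and is not PostgreSQL code: lexize_sketch, is_stopword and
toy_stem are hypothetical stand-ins for dsnowball_lexize, searchstoplist and
the Snowball stemmer, malloc/free replace palloc/pfree, and the ASCII-only
tolower() loop replaces the multibyte-aware lowerstr_with_len. Only the
1000-byte threshold and the order of the tests are taken from the patch.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STEM_LEN 1000       /* threshold used in the patch */

/* Hypothetical stand-in for searchstoplist(): a tiny fixed stopword list. */
static bool
is_stopword(const char *word)
{
    static const char *const stopwords[] = {"the", "a", "of"};
    size_t      i;

    for (i = 0; i < sizeof(stopwords) / sizeof(stopwords[0]); i++)
    {
        if (strcmp(word, stopwords[i]) == 0)
            return true;
    }
    return false;
}

/* Hypothetical stand-in for the stemmer: strips a single trailing 's'. */
static void
toy_stem(char *word)
{
    size_t      n = strlen(word);

    if (n > 1 && word[n - 1] == 's')
        word[n - 1] = '\0';
}

/*
 * Returns a malloc'd lexeme that the caller must free, or NULL when the
 * input is reported as a stopword (empty string or stoplist match).
 */
static char *
lexize_sketch(const char *in, size_t len)
{
    char       *txt = malloc(len + 1);
    size_t      i;

    if (txt == NULL)
        abort();                /* keep NULL reserved for "stopword" */

    /* ASCII-only case folding; the real code is multibyte-aware */
    for (i = 0; i < len; i++)
        txt[i] = (char) tolower((unsigned char) in[i]);
    txt[len] = '\0';

    if (len > MAX_STEM_LEN)
    {
        /* overlength: return lowercased, never call the stemmer */
        return txt;
    }
    else if (*txt == '\0' || is_stopword(txt))
    {
        /* empty or stopword, so report as stopword */
        free(txt);
        return NULL;
    }

    /* ordinary word: hand it to the (toy) stemmer */
    toy_stem(txt);
    return txt;
}
```

With these stand-ins, a 1001-byte input comes back case-folded with its
length unchanged, while a 1000-byte input still reaches the stemmer, so the
comparison is strictly "greater than" as in the patch. The earlier variant
discussed in the message would instead free the buffer and return NULL in
the overlength branch.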
