I wrote:
> The upstream recommendation, which seems pretty sane to me, is to
> simply reject any string exceeding some threshold length as not
> possibly being a word. Apparently it's common to use thresholds
> as small as 64 bytes, but in the attached I used 1000 bytes.
On further thought: that coding treats anything longer than 1000
bytes as a stopword, but maybe we should just accept it unmodified.
The manual says "A Snowball dictionary recognizes everything, whether
or not it is able to simplify the word". While "recognizes" formally
includes the case of "recognizes as a stopword", people might find
this behavior surprising. We could alternatively do it as attached,
which accepts overlength words but does nothing to them except
case-fold. This is closer to the pre-patch behavior, but gives up
the opportunity to avoid useless downstream processing of long words.
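
For illustration only, here is a rough standalone sketch of the three-way
decision described above. It is not the patch itself: toy_lexize(),
is_stopword(), toy_stem() and the LexKind enum are hypothetical stand-ins
for dsnowball_lexize(), searchstoplist() and the real Snowball stemmer, and
it assumes plain ASCII input so that case-folding does not change the length.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STEM_LEN 1000		/* same threshold as in the patch */

typedef enum
{
	LEX_STOPWORD,				/* empty string or stopword: no lexeme */
	LEX_UNMODIFIED,				/* overlength: case-folded only */
	LEX_STEMMED					/* normal word: case-folded and stemmed */
} LexKind;

/* Hypothetical stand-in for searchstoplist(): knows one stopword. */
static int
is_stopword(const char *txt)
{
	return strcmp(txt, "the") == 0;
}

/* Hypothetical stand-in for the stemmer: strips one trailing 's'. */
static void
toy_stem(char *txt)
{
	size_t		n = strlen(txt);

	if (n > 1 && txt[n - 1] == 's')
		txt[n - 1] = '\0';
}

/*
 * Returns a malloc'd lexeme, or NULL if the input is reported as a
 * stopword.  *kind says which branch was taken.
 */
static char *
toy_lexize(const char *in, size_t len, LexKind *kind)
{
	char	   *txt;
	size_t		i;

	assert(in != NULL && kind != NULL);

	/* case-fold into a fresh copy (ASCII only in this sketch) */
	txt = malloc(len + 1);
	if (txt == NULL)
		abort();
	for (i = 0; i < len; i++)
		txt[i] = (char) tolower((unsigned char) in[i]);
	txt[len] = '\0';

	if (len > MAX_STEM_LEN)
	{
		/* overlength: accept it, but never hand it to the stemmer */
		*kind = LEX_UNMODIFIED;
		return txt;
	}
	if (*txt == '\0' || is_stopword(txt))
	{
		/* empty or stopword, so report as stopword */
		free(txt);
		*kind = LEX_STOPWORD;
		return NULL;
	}

	toy_stem(txt);
	*kind = LEX_STEMMED;
	return txt;
}
```

The point of the ordering is that the length test comes first, so an
overlength string is never looked up in the stoplist or passed to the
stemmer; it just comes back lowercased.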
regards, tom lane
diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c
index 68c9213f69..1d5dfff5a0 100644
--- a/src/backend/snowball/dict_snowball.c
+++ b/src/backend/snowball/dict_snowball.c
@@ -275,8 +275,24 @@ dsnowball_lexize(PG_FUNCTION_ARGS)
 	char	   *txt = lowerstr_with_len(in, len);
 	TSLexeme   *res = palloc0(sizeof(TSLexeme) * 2);
 
-	if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+	/*
+	 * Do not pass strings exceeding 1000 bytes to the stemmer, as they're
+	 * surely not words in any human language.  This restriction avoids
+	 * wasting cycles on stuff like base64-encoded data, and it protects us
+	 * against possible inefficiency or misbehavior in the stemmer.  (For
+	 * example, the Turkish stemmer has an indefinite recursion, so it can
+	 * crash on long-enough strings.)  However, Snowball dictionaries are
+	 * defined to recognize all strings, so we can't reject the string as an
+	 * unknown word.
+	 */
+	if (len > 1000)
+	{
+		/* return the lexeme lowercased, but otherwise unmodified */
+		res->lexeme = txt;
+	}
+	else if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
 	{
+		/* empty or stopword, so report as stopword */
 		pfree(txt);
 	}
 	else