I wrote:
> The upstream recommendation, which seems pretty sane to me, is to
> simply reject any string exceeding some threshold length as not
> possibly being a word. Apparently it's common to use thresholds
> as small as 64 bytes, but in the attached I used 1000 bytes.
On further thought: that coding treats anything longer than 1000 bytes as a stopword, but maybe we should just accept it unmodified. The manual says "A Snowball dictionary recognizes everything, whether or not it is able to simplify the word". While "recognizes" formally includes the case of "recognizes as a stopword", people might find this behavior surprising. We could alternatively do it as attached, which accepts overlength words but does nothing to them except case-fold. This is closer to the pre-patch behavior, but gives up the opportunity to avoid useless downstream processing of long words.
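To make the second behavior concrete, here is a minimal sketch in C of "case-fold always, stem only when the word is not overlength". It is not the attached patch: `normalize_word`, `toy_stem`, and `MAX_STEM_LEN` are hypothetical names, `toy_stem` is a trivial stand-in for the real Snowball stemmer, and the case-folding is ASCII-only, whereas real code would have to respect the database encoding and locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Threshold discussed in the thread; the name is hypothetical. */
#define MAX_STEM_LEN 1000

/* Hypothetical stand-in for the real stemmer: strips one trailing 's'. */
static void
toy_stem(char *word)
{
    size_t len = strlen(word);

    if (len > 1 && word[len - 1] == 's')
        word[len - 1] = '\0';
}

/*
 * Return a malloc'd, case-folded copy of "word".  Words of at most
 * MAX_STEM_LEN bytes are also passed to the stemmer; longer ones are
 * returned case-folded but otherwise unmodified, so they never reach
 * the stemmer at all.  Returns NULL only on allocation failure.
 */
char *
normalize_word(const char *word)
{
    size_t len;
    size_t i;
    char  *out;

    assert(word != NULL);

    len = strlen(word);
    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    /* ASCII-only case folding (an assumption made for this sketch) */
    for (i = 0; i < len; i++)
        out[i] = (char) tolower((unsigned char) word[i]);
    out[len] = '\0';

    /* Overlength input skips the stemmer entirely */
    if (len <= MAX_STEM_LEN)
        toy_stem(out);

    return out;
}
```

The stopword variant would differ only in the overlength branch: instead of returning the case-folded copy, it would report the word as rejected.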
This patch looks good to me. It keeps overly long words (> 1000 bytes) from going through the stemmer, so the stack overflow issue in the Turkish stemmer should no longer occur.