While playing with Alexander's pg_trgm regexp patch, I noticed that the
regexp library trips an assertion (if assertions are enabled) or crashes
when passed a pattern that contains more than 32k different characters:

  select 'foo' ~ (select string_agg(chr(x), '')
                  from generate_series(100, 35000) x) as nastyregex;

This is because it uses 'short' as the datatype to identify colors. When
the color number overflows, -32768 is used as an index into the colordesc
array, and you get a crash. AFAICS this can't reliably be used for
anything more sinister than crashing the backend.
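
To illustrate the wraparound, here is a minimal standalone sketch. It is
not the regexp library's code; the typedef and the helper next_color()
are made up for illustration, and the wrap to -32768 assumes a typical
platform with a 16-bit two's complement short (the out-of-range
conversion is implementation-defined in C):

```c
#include <limits.h>

/* Assumption: stands in for the 'short' color type described above. */
typedef short color;

/*
 * Hypothetical helper: hand out the color number following c.
 * c + 1 is computed as int; converting 32768 back to short is
 * implementation-defined and wraps to -32768 on common platforms.
 */
static color
next_color(color c)
{
    return (color) (c + 1);
}
```

With gcc or clang on the usual platforms, next_color(SHRT_MAX) comes
back as SHRT_MIN, which is the negative array index that causes the
crash.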
A regex with that many different colors is an extreme case, so I think
it's enough to turn the assertion in newcolor() into a run-time check,
and throw a "too many colors in regexp" error. Alternatively, we could
expand 'color' from short to int, but that would double the memory usage
of sane regexps with fewer distinct characters.
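
Roughly, the run-time check I have in mind would look like the sketch
below. Everything in it (struct colormap_sketch, new_color_checked(),
MAX_COLOR, ERR_TOO_MANY_COLORS) is a hypothetical stand-in, not the
actual newcolor() code; the point is only that the limit is tested
before a color number is handed out, instead of being asserted:

```c
#include <limits.h>

/* Hypothetical names, for illustration only. */
#define MAX_COLOR            SHRT_MAX   /* largest value a 'short' color can hold */
#define ERR_TOO_MANY_COLORS  (-1)       /* stands in for "too many colors in regexp" */

struct colormap_sketch
{
    int ncolors;                /* number of colors allocated so far */
};

/*
 * Allocate the next color number, or return ERR_TOO_MANY_COLORS if it
 * would not fit in a short. The counter is kept as int so the check
 * itself cannot overflow.
 */
static int
new_color_checked(struct colormap_sketch *cm)
{
    if (cm->ncolors > MAX_COLOR)
        return ERR_TOO_MANY_COLORS;
    return cm->ncolors++;
}
```

The caller would turn the error return into a regular regexp compile
error rather than letting the index go negative.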
Thoughts?
- Heikki