Re: Some regular-expression performance hacking - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Some regular-expression performance hacking |
Date | |
Msg-id | 2845172.1613674385@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Some regular-expression performance hacking (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Some regular-expression performance hacking
|
List | pgsql-hackers |
I thought it was worth looking a little more closely at the error cases in this set of tests, as a form of competition analysis versus Javascript's regex engine. I ran through the cases that gave errors, and pinned down exactly what was causing the error for as many cases as I could. (These results are from your first test corpus, but I doubt the second one would give different conclusions.) We have these errors reported in the test corpus: error | count -----------------------------------+------- invalid escape \ sequence | 39141 invalid character range | 898 invalid backreference number | 816 brackets [] not balanced | 327 invalid repetition count(s) | 76 quantifier operand invalid | 17 parentheses () not balanced | 1 regular expression is too complex | 1 The existing patchset takes care of the one "regular expression is too complex" failure. Of the rest: It turns out that almost 39000 of the "invalid escape \ sequence" errors are due to use of \D, \S, or \W within a character class. We support the positive-class shorthands \d, \s, \w there, but not their negations. I think that this might be something that Henry Spencer just never got around to; I don't see any fundamental reason we can't allow it, although some refactoring might be needed in the regex lexer. Given the apparent popularity of this notation, maybe we should put some work into that. (Having said that, I can't help noticing that a very large fraction of those usages look like, eg, "[\w\W]". It seems to me that that's a very expensive and unwieldy way to spell ".". Am I missing something about what that does in Javascript?) About half of the remaining escape-sequence complaints seem to be due to just randomly backslashing alphanumeric characters that don't need it, as for example "i" in "\itunes\.apple\.com". Apparently Javascript is content to take "\i" as just meaning "i". Our engine rejects that, with a view to keeping such combinations reserved for future definition. That's fine by me so I don't want to change it. Of the rest, many are abbreviated numeric escapes, eg "\u45" where our engine wants to see "\u0045". I don't think being laxer about that would be a great idea either. Lastly, there are some occurrences like "[\1]", which in context look like the \1 might be intended as a back-reference? But I don't really understand what that's supposed to do inside a bracket expression. The "invalid character range" errors seem to be coming from constructs like "[A-Za-z0-9-/]", which our engine rejects because it looks like a messed-up character range. All but 123 of the "invalid backreference number" complaints stem from using backrefs inside lookahead constraints. Some of the rest look like they think you can put capturing parens inside a lookahead constraint and then backref that. I'm not really convinced that such constructs have a well-defined meaning. (I looked at the ECMAscript definition of regexes, and they do say it's allowed, but when trying to define it they resort to handwaving about backtracking; at best that is a particularly lame version of specification by implementation.) Spencer chose to forbid these cases in our engine, and I think there are very good implementation reasons why it won't work. Perhaps we could provide a clearer error message about it, though. 307 of the "brackets [] not balanced" errors, as well as the one "parentheses () not balanced" error, seem to trace to the fact that Javascript considers "[]" to be a legal empty character class, whereas POSIX doesn't allow empty character classes so our engine takes the "]" literally, and then looks for a right bracket it won't find. (That is, in POSIX "[]x]" is a character class matching ']' and 'x'.) Maybe I'm misinterpreting this too, because if I read the documentation correctly, "[]" in Javascript matches nothing, making it impossible for the regex to succeed. Why would such a construct appear this often? The remainder of the bracket errors happen because in POSIX, the sequences "[:", "[=", and "[." within a bracket expression introduce special syntax, whereas in Javascript '[' is just an ordinary data character within a bracket expression. Not much we can do here; the standards are just incompatible. All but 3 of the "invalid repetition count(s)" errors come from quantifiers larger than our implementation limit of 255. A lot of those are exactly 256, though I saw one as high as 3000. The remaining 3 errors are from syntax like "[0-9]{0-3}", which is a syntax error according to our engine ("[0-9]{0,3}" is correct). AFAICT it's not a valid quantifier according to Javascript either; perhaps that engine is just taking the "{0-3}" as literal text? Given this, it seems like there's a fairly strong case for increasing our repetition-count implementation limit, at least to 256, and maybe 1000 or so. I'm hesitant to make the limit *really* large, but if we can handle a regex containing thousands of "x"'s, it's not clear why you shouldn't be able to write that as "x{0,1000}". All of the "quantifier operand invalid" errors come from these three patterns: ((?!\\)?\{0(?!\\)?\}) ((?!\\)?\{1(?!\\)?\}) class="(?!(tco-hidden|tco-display|tco-ellipsis))+.*?"|data-query-source=".*?"|dir=".*?"|rel=".*?" which are evidently trying to apply a quantifier to a lookahead constraint, which is just silly. In short, a lot of this is from incompatible standards, or maybe from varying ideas about whether to throw an error for invalid constructs. But I see a couple things we could improve. regards, tom lane
pgsql-hackers by date: