Hello folks,
I am in the process of accelerating down the rabbit hole of regex
internals. Something that came up during my reading, is that a POSIX
compliant regex engine ought to always prefer the longest possible
match, when multiple matches are possible beginning from the same
location in the string. [1]
I wasn't sure that that was how our regex engine worked, and indeed,
on checking the manual [2] I found that our regex engine uses a
strange sort of "inductive greediness" to determine whether the
longest or the shortest possible match ought to be preferred. The
greediness of individual particles in the regex are taken into
account, and at the top level the entire expression is concluded to be
either greedy, or non-greedy.
I'll admit that this is a pretty obscure point, but we do appear to be
in direct violation of POSIX here. I am getting the impression that
our engine works this way to route around some of the performance
issues that can arise in trying out every possible match with an
NFA-style engine.
I find it a bit unfortunate that POSIX is so unambiguous about this,
while our engine's treatment is, well ... quite arcane by comparison.
At minimum, I think we should be more explicit in the manual that this
behaviour flips POSIX the proverbial bird. Several paragraphs south,
there is a sentence reading "Incompatibilities of note include ... the
longest/shortest-match (rather than first-match) matching semantics",
but in the context it seems as though the author is talking about an
incompatibility with Perl, not with POSIX.
Thoughts?
Cheers,
BJ
[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
[2] http://www.postgresql.org/docs/9.0/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP