Brendan Jurd <direvus@gmail.com> writes:
> I am in the process of accelerating down the rabbit hole of regex
> internals. Something that came up during my reading is that a
> POSIX-compliant regex engine ought to always prefer the longest possible
> match, when multiple matches are possible beginning from the same
> location in the string. [1]
> I wasn't sure that that was how our regex engine worked, and indeed,
> on checking the manual [2] I found that our regex engine uses a
> strange sort of "inductive greediness" to determine whether the
> longest or the shortest possible match ought to be preferred. The
> greediness of the individual particles in the regex is taken into
> account, and at the top level the entire expression is concluded to be
> either greedy or non-greedy.
> I'll admit that this is a pretty obscure point, but we do appear to be
> in direct violation of POSIX here.
How so? POSIX doesn't contain any non-greedy constructs. If you use
only the POSIX-compatible greedy constructs, the behavior is compliant.
The obscure issue is how non-greedy constructs, once you define some,
should act in combination with greedy ones.
I'm not sure to what extent the engine's behavior is driven by
implementation restrictions and to what extent it's really the sanest
behavior Henry could think of. I found a comment from him about it:
http://groups.google.com/group/comp.lang.tcl/msg/c493317cc0d10d50
but it's short on details as to what alternatives he considered.
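
To make the longest-match versus first-match distinction concrete, here is
a rough sketch in Python. It is only an illustration: Python's re module is
a backtracking, Perl-style engine, not Henry's engine, and it does not
reproduce our rules for mixing greedy and non-greedy constructs. The helper
names leftmost_longest and leftmost_first are made up for this example, and
the brute-force search is only meant for simple patterns.

```python
import re


def leftmost_first(pattern, text):
    """Return what Python's backtracking engine picks, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


def leftmost_longest(pattern, text):
    """Emulate a POSIX-style leftmost-longest match by brute force.

    For each start position (leftmost first), try every end position from
    longest to shortest and return the first substring that the pattern
    matches in full. Hypothetical helper, for illustration only.
    """
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        for end in range(len(text), start - 1, -1):
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None


# Alternation: a backtracking engine stops at the first alternative that
# works, while the longest-match rule keeps the longer one.
print(repr(leftmost_first("a|ab", "xab")))
print(repr(leftmost_longest("a|ab", "xab")))

# A non-greedy quantifier prefers the shortest match; under a pure
# longest-match rule the non-greedy marker would make no difference.
print(repr(leftmost_first("x*?", "xxx")))
print(repr(leftmost_longest("x*?", "xxx")))
```

With only greedy constructs the two rules often agree, which is why the
question only gets interesting once non-greedy quantifiers are in the mix.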
regards, tom lane