Re: Our regex vs. POSIX on "longest match" - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Our regex vs. POSIX on "longest match"
Msg-id 6599.1330844013@sss.pgh.pa.us
In response to Our regex vs. POSIX on "longest match"  (Brendan Jurd <direvus@gmail.com>)
Responses Re: Our regex vs. POSIX on "longest match"  (Brendan Jurd <direvus@gmail.com>)
List pgsql-hackers
Brendan Jurd <direvus@gmail.com> writes:
> I am in the process of accelerating down the rabbit hole of regex
> internals.  Something that came up during my reading is that a
> POSIX-compliant regex engine ought to always prefer the longest
> possible match, when multiple matches are possible beginning from the
> same location in the string. [1]

> I wasn't sure that that was how our regex engine worked, and indeed,
> on checking the manual [2] I found that our regex engine uses a
> strange sort of "inductive greediness" to determine whether the
> longest or the shortest possible match ought to be preferred.  The
> greediness of individual particles in the regex is taken into
> account, and at the top level the entire expression is concluded to be
> either greedy, or non-greedy.

> I'll admit that this is a pretty obscure point, but we do appear to be
> in direct violation of POSIX here.

How so?  POSIX doesn't contain any non-greedy constructs.  If you use
only the POSIX-compatible greedy constructs, the behavior is compliant.
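
To make the "longest match" rule concrete, here is a minimal sketch in Python. Python's re module is a backtracking engine, not a POSIX one and not the engine under discussion; it is used here only because it makes the difference visible. The helper longest_match_at_start is a hypothetical name, and it finds the longest match by brute force over prefixes rather than the way a real POSIX engine would.

```python
import re


def longest_match_at_start(pattern, text):
    """Return the longest prefix of text that pattern matches in full, or None.

    Brute-force illustration of the POSIX "longest match" rule for a match
    anchored at position 0; real engines do not work this way.
    """
    for end in range(len(text), -1, -1):
        if re.fullmatch(pattern, text[:end]):
            return text[:end]
    return None


# A backtracking engine takes the first alternative that succeeds.
first_found = re.match(r"a|ab", "ab").group()

# The longest-match rule prefers the longer candidate from the same start.
longest = longest_match_at_start(r"a|ab", "ab")

print(first_found)  # a
print(longest)      # ab
```

Both patterns use only constructs that POSIX defines, so the two answers differ purely because of the matching rule, not because of any non-greedy operator.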

The obscure issue is, once you define some non-greedy constructs, how
they should act in combination with greedy ones.  I'm not sure to what
extent the engine's behavior is driven by implementation restrictions
and to what extent it's really the sanest behavior Henry could think of.
I found a comment from him about it:
http://groups.google.com/group/comp.lang.tcl/msg/c493317cc0d10d50
but it's short on details as to what alternatives he considered.
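
For contrast, one alternative design is the Perl-style rule, where each quantifier keeps its own greediness regardless of its neighbours. The sketch below shows that rule using Python's re module; it does not show how the engine discussed in this thread behaves, and the input string and patterns are chosen purely for illustration.

```python
import re

text = "XY1234Z"

# Greedy Y* followed by a greedy bounded digit run.
greedy = re.search(r"Y*([0-9]{1,3})", text).group(1)

# Non-greedy Y*? in front: under per-quantifier rules the digit
# quantifier is unaffected and still takes as much as it can.
mixed = re.search(r"Y*?([0-9]{1,3})", text).group(1)

# Only marking the digit quantifier itself non-greedy shortens the capture.
lazy_digits = re.search(r"Y*([0-9]{1,3}?)", text).group(1)

print(greedy)       # 123
print(mixed)        # 123
print(lazy_digits)  # 1
```

An engine that instead assigns a single greediness to the whole expression, as described in the quoted text, may give a different answer for the second pattern, which is the combination question raised here.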
        regards, tom lane

