Home > mailing lists

Our regex vs. POSIX on "longest match" - Mailing list pgsql-hackers

From	Brendan Jurd
Subject	Our regex vs. POSIX on "longest match"
Date	March 4, 2012 05:37:14
Msg-id	CADxJZo2p29wHjvk3AL0w4LAhFCQ_-f-gPJcWLyTe+hK2gWevyA@mail.gmail.com Whole thread Raw
Responses	Re: Our regex vs. POSIX on "longest match"
List	pgsql-hackers

Tree view

Hello folks,

I am in the process of accelerating down the rabbit hole of regex
internals. Something that came up during my reading, is that a POSIX
compliant regex engine ought to always prefer the longest possible
match, when multiple matches are possible beginning from the same
location in the string. [1]

I wasn't sure that that was how our regex engine worked, and indeed,
on checking the manual [2] I found that our regex engine uses a
strange sort of "inductive greediness" to determine whether the
longest or the shortest possible match ought to be preferred. The
greediness of individual particles in the regex are taken into
account, and at the top level the entire expression is concluded to be
either greedy, or non-greedy.

I'll admit that this is a pretty obscure point, but we do appear to be
in direct violation of POSIX here. I am getting the impression that
our engine works this way to route around some of the performance
issues that can arise in trying out every possible match with an
NFA-style engine.

I find it a bit unfortunate that POSIX is so unambiguous about this,
while our engine's treatment is, well ... quite arcane by comparison.
At minimum, I think we should be more explicit in the manual that this
behaviour flips POSIX the proverbial bird. Several paragraphs south,
there is a sentence reading "Incompatibilities of note include ... the
longest/shortest-match (rather than first-match) matching semantics",
but in the context it seems as though the author is talking about an
incompatibility with Perl, not with POSIX.

Thoughts?

Cheers,
BJ

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
[2] http://www.postgresql.org/docs/9.0/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP

pgsql-hackers by date:

From: Tom Lane
Date: 04 March 2012, 04:21:09
Subject: Re: Parameterized-path cost comparisons need some work

From: Tom Lane
Date: 04 March 2012, 05:53:55
Subject: Re: Our regex vs. POSIX on "longest match"

Our regex vs. POSIX on "longest match" - Mailing list pgsql-hackers

Previous

Next