Re: Our regex vs. POSIX on "longest match" - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Our regex vs. POSIX on "longest match" |
Date | |
Msg-id | CA+TgmoYiYQyJ-2e-erykmn-z128V8zpGrPxe8XKU4+sCA41nVQ@mail.gmail.com Whole thread Raw |
In response to | Re: Our regex vs. POSIX on "longest match" (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Our regex vs. POSIX on "longest match"
Re: Our regex vs. POSIX on "longest match" |
List | pgsql-hackers |
On Sun, Mar 4, 2012 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > hubert depesz lubaczewski <depesz@depesz.com> writes: >> I stand on position that mixing greedy and non-greedy operators should >> be possible, and that it should work according to POLA - i.e. greedines >> of given atom shouldn't be influenced by other atoms. > > [ shrug... ] That sounds good, but it's pretty much vacuous as far as > defining a principled alternative behavior goes. It's easy to > demonstrate cases where atoms *must* be influenced by other ones. > A trivial example is > (.*)(.*) > It doesn't matter whether the second atom is greedy or not: it's not > going to get to eat anything because the first one does instead. > IOW this is just the same as > (.*)(.*?) > --- they are both overall-greedy. But I don't think this makes Brendan's example not a bug. In this case, we can't eat anything because there's nothing to eat. In his example, there's something to eat, and it's sitting in a place where it logically seems to line up with a greedy quantifier, and yet the quantifier doesn't eat it. Now in fairness, I've seen my fair share of apparently-buggy behavior from Perl's regex engine over the years as well. I think the right way to imagine this is as though the regular expression were being matched to the source text in left-to-right fashion. For example, suppose the RE is [ab]+?[bc]+ and the source text is aabbcc. Clearly there's a match at position 1, but the match could be anywhere between 3 and 6 characters in length, depending on how greedy we think the RE should be, and the exact source text that gets matched to each atom is up for grabs. Enumerating all the possibilities where each atom matches a string that is consistent with its definition, leaving greediness aside, we get this: [aa,b] [aa,bb] [aa,bbc] [aa,bbcc] [aab,b] [aab,bc] [aab,bcc] [aabb,c] [aabb,cc] To distinguish among these, we look first at [ab]+? and decide that, since it is non-greedy, the right overall answer must be one of the ones that assigns to [ab]+? a match of minimal length within the space of possible matches. That narrows the field to just the first four choices listed above. Then we look at [bc]+ and, since it is greedy, give it a match of maximal length within the space of possible matches, leading to [ab]+? = aa and [bc]+ = bbcc. Had the RE been [ab]+?[bc]+?, the same algorithm would assign [ab]+? = aa and [bc]+? = b; had it been [ab]+[bc]+? the same algorithm would assign [ab]+ = aabb and [bc]+? = c. Testing, I find that this matches what Perl does in all of those cases. Things get a bit more complicated when you throw alternation into the picture, but I think it's basically right to view it as a greedy operation. Anything else seems rather unprincipled. It may seem surprising that a+?|b+? matched against aaa will match the first branch to aaa rather than just a, but there's no real reason to suppose that the ? which applies to the plus should somehow bleed up and affect the interpretation of the alternation. The RE might be something like a+b+?|b+?a+ where the sub-REs have different greediness interpretations and there's no obvious principled way to decide which one should apply to the parent; or even something like cat|catfish where the sub-REs don't contain any greedy/non-greedy operators at all and yet we still have to assign some greediness to the alternation. I think it's right to view every RE construct as greedy unless it's got an explicit "not-greedy" flag attached to it; after all, that's the traditional behavior of REs from time immemorial. Someone could invent a non-greedy form of alternation if they were so inclined. This is different from what Perl does, but I think Perl's behavior here is batty: given a+|a+b+ and the string aaabbb, it picks the first branch and matches only aaa. My whole being recoils: that surely looks like a non-greedy interpretation of a regex with only greedy quantifiers. It turns out that if you write a+b+|a+ then it matches the whole string; apparently, it selects the syntactically first branch that can match, regardless of the length of the match, which strikes me as nearly pure evil. PostgreSQL - correctly, IMHO - matches the longest substring available regardless of the syntactic ordering; AIUI, the original charter for REs was to always match the longest possible substring; non-greedy quantifiers were added later, and should not be taken as a license to change the semantics of REs that don't use them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: