Thread: Regexp matching: bug or operator error?
Using Postgres V. 7.4.1, the following query: SELECT substring('X12345X' FROM '.*?([0-9]{1,5}).*?'); Returns '1'. I would expect it to return '12345'. Is this a bug, or am I missing something? Thanks.
Attachment
Ken Tanzer wrote: > Using Postgres V. 7.4.1, the following query: > > SELECT substring('X12345X' FROM '.*?([0-9]{1,5}).*?'); > > Returns '1'. I would expect it to return '12345'. Is this a bug, or am > I missing something? Thanks. > The regexp {1,5} is satisfied with the minimum of 1 digit. It looks ahead and finds your '.*'. That in turn consumes all but the last character. Perhaps what you want is '[^0-9]+([0-9]{1,5})[^0-9]+' Translates to "at least one non digit followed by 1-5 digits and then at least 1 non digit". Regards, Thomas Hallgren
Ken Tanzer <ktanzer@desc.org> writes: > Using Postgres V. 7.4.1, the following query: > SELECT substring('X12345X' FROM '.*?([0-9]{1,5}).*?'); > Returns '1'. I would expect it to return '12345'. Is this a bug, or am > I missing something? Thanks. Hmm ... playing with it, it seems that there may indeed be a bug here ... it's acting like the "non greedy" flag from the .*? is being applied to the {m,n} as well. In other words the behavior would be correct for SELECT substring('X12345X' FROM '.*?([0-9]{1,5}?).*?'); However, aren't you doing this the hard way? Why not just SELECT substring('X12345X' FROM '([0-9]{1,5})'); regards, tom lane
Checking in the Tcl bug tracker reveals that this is an open issue for them as well: http://sourceforge.net/tracker/index.php?func=detail&aid=219219&group_id=10894&atid=110894 http://sourceforge.net/tracker/index.php?func=detail&aid=219358&group_id=10894&atid=110894 The first entry has Henry Spencer claiming that it is operating as designed, but the second one seems to cast doubt on that claim. In any case I tend to agree that the notation implies that greediness should be an independent property of each quantifier. However, if Henry can't or doesn't want to fix it, I'm not sure that I care to wade in ;-) regards, tom lane
From: "Ken Tanzer" <ktanzer@desc.org> > Using Postgres V. 7.4.1, the following query: > > SELECT substring('X12345X' FROM '.*?([0-9]{1,5}).*?'); > > Returns '1'. I would expect it to return '12345'. Is this a bug, or am > I missing something? Thanks. the results of mixing greedy and non-greedy repetitions can be difficult to predict. try SELECT substring('X12345X' FROM '\\D*(\\d{1,5}).*?'); gnari
Thanks for the quick responses yesterday. At a minimum, it seems like this behavior does not match what is described in the Postgres documentation (more detail below). But I still have a hard time understanding the results of these two queries: select SUBSTRING( 'X444X','.*?([0-9]{1,3}).*?'); This is the original query I submitted, with the puzzling non-greedy match. It returns '4'. Adding start and end characters to the query, like so: select SUBSTRING( 'X444X','^.*?([0-9]{1,3}).*?$'); returns '444'. If the "whole RE" was being set non-greedy by the first ".*?", then shouldn't the subsequent "([0-9]{1,3})" also match non-greedily, returning a '4', with the last ".*?" then capturing the balance of "44X"? Either way, I'm not sure why the start and end characters would effect the rest of the match. In terms of the Postgres documentation, it definitely seems at odds with the observed behavior. Here's my attempts to explain why: a) select SUBSTRING( 'X444X','[0-9]{1,3}'); returns '444'. This suggests that a "default" for the {m,n} syntax is greedy. b) Table 9-13 of the docs describes {m,n} syntax, then lists {m,n}? as a "non-greedy" version of the same. That, and the fact that there doesn't seem to be a specific "greedy" modifier, would both also imply that {m,n} should be greedy. Section 9.6.3.5: "A quantified atom with other normal quantifiers (including {m,n} with m equal to n) prefers longest match" I can't find anything else in this section that would say otherwise. I specifically can't find anything that says the whole expression becomes greedy or not. If the regex code isn't going to change, it seems that changing the documentation would be very helpful to avoid confusion. Of course, that's easy for me to say, since I wouldn't have to do the work! For that matter, I'd be willing to try editing the documentation, but I'd have to understand what the actual behavior is before I could try to describe it! :) Either way, thanks for the great DB program! Ken Tanzer p.s., The suggested regex rewrites some people responded with were appreciated, but the regex I used was just a simplified example for this posting. Here's the actual regex we're working on--any help reformulating this would be great! select substring('Searching for log 5376, referenced in this text' FROM '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />| )(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)'); We were able to get this to work by adding start and end characters, like so, but it doesn't seem like it should be necessary: select substring('Searching for log 5376, referenced in this text' FROM '(?i)^(?:.*?)logs?(?:\\s|\\n|<br>|<br />| )(?:entry|no|number|#)?(?:\\s|\\n\<br>|<br /> )?([0-9]{1,7})(.*?)$'); Tom Lane wrote: >Checking in the Tcl bug tracker reveals that this is an open issue >for them as well: > >http://sourceforge.net/tracker/index.php?func=detail&aid=219219&group_id=10894&atid=110894 >http://sourceforge.net/tracker/index.php?func=detail&aid=219358&group_id=10894&atid=110894 > >The first entry has Henry Spencer claiming that it is operating as >designed, but the second one seems to cast doubt on that claim. >In any case I tend to agree that the notation implies that greediness >should be an independent property of each quantifier. > >However, if Henry can't or doesn't want to fix it, I'm not sure that >I care to wade in ;-) > > regards, tom lane > >
Attachment
Ken Tanzer <ktanzer@desc.org> writes: > Thanks for the quick responses yesterday. At a minimum, it seems like > this behavior does not match what is described in the Postgres > documentation (more detail below). After looking at this more, I think that it is actually behaving as Spencer designed it to. The key point is this bit from the fine print in section 9.6.3.5: A branch has the same preference as the first quantified atom in it which has a preference. ("branch" being any regexp with no outer-level | operator) What this apparently means is that if the RE begins with a non-greedy quantifier, then the matching will be done in such a way that the whole RE matches the shortest possible string --- that is, the whole RE is non-greedy. It's still possible for individual items within the RE to be greedy or non-greedy, but that only affects how much of the shortest possible total match they are allowed to eat relative to each other. All the examples I've looked at seem to work "properly" when seen in this light. I can see that this behavior could have some usefulness, and if need be you can always override it by writing (...){1,1} around the whole RE. So at this point I'm disinclined to vary from the Tcl semantics. This does leave us with a documentation problem though, because this behavior is surely not obvious from what it says in 9.6.3.5. If you've got any thoughts about a better explanation, I'm all ears. > Here's the actual regex we're working on--any help > reformulating this would be great! > select substring('Searching for log 5376, referenced in this text' > FROM > '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />| > )(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)'); I don't see that you need either the leading (?:.*?) or the trailing (.*?) here, and if you dropped them then the first quantifier would be the "s?" which is greedy so the curious case goes away. I suppose the idea of adding (?:.*?) was to ensure that "log" will be matched to the first possible place where it could match --- but that is true anyway, per the first sentence of 9.6.3.5. regards, tom lane
OK. I've been trying to get my mind around this, and think about ways to improve the documentation (more about that below). I'm pretty sure that I can see the general concept now, and am almost convinced that it really does work as described. I guess I really don't like the whole RE inheriting it's preference from the first preference it encounters. It seems like it would be better to use a switch preceeding the whole expression, as is done with (?i) for case-insensitivity. Designating (?g) for greedy and (?G) for non-greedy seems like it would be less inviting of human error. But that's mostly a quibble, as long as it works consistently as described! I've still got a few problems, though. Per 9.6.3.5, "An RE consisting of two or more branches connected by the | operator prefers longest match." Apply that to this example: SELECT substring('abc' FROM '.*?|.*?'); which returns a greedy 'abc'. In this case, the whole RE is then supposed to be greedy (because of the "|"). So I suppose that _might_ be said to work as described, even though the | in this case overrides the expressed preference of both of its components (in which case I'm just griping about not liking the way this was implemented. :) ) But what about these two queries: SELECT substring('a' FROM 'a?|a?'); This returns a greedy 'a', similar to the example above. But then why does SELECT substring('ba' FROM 'a?|a?'); return a non-greedy empty string? Using the logic from the previous examples, this should do a greedy match and return 'a' as well. If this isn't a bug, please explain why! With regard to the documentation, after re-reading it many times I'd have to say the information is all there, but it's hard to absorb. I think the main problem is that the term "preference" is used to discuss greedy/non-greediness, as well as the words greedy & non-greedy. This makes it easier for humans to fail to connect the dots. People are more likely to be familiar with greedy & non-greedy (especially those who've used other regex's before), whereas preference isn't as clear. Perhaps the term "greediness preference" could be used in place of "preference", or "preference" could be dropped altogether. In general, I think "prefers shortest match" could be replaced with "is non-greedy", and "prefers longest match" could be replaced with "is greedy". A little bit more context might be helpful as well, as in example c) below. As an example, here's a couple of different possibilities for the second sentence of the section: a) If the RE could match more than one substring starting at that point, its choice is determined by its greediness /preference/: either the longest substring (greedy), or the shortest (non-greedy). b) If the RE could match more than one substring starting at that point, the match can be either greedy (matching the longest substring) or non-greedy (matching the shortest substring). Whether an RE is greedy or not is determined by the following rules... c) Like individual components of an RE, the entire RE can be either greedy (matching the longest substring) or non-greedy (matching the shortest substring). Do you think an edit along these lines would be helpful? If so, I'd be willing to take a shot at re-writing that section. Let me know. Thanks. Ken Tanzer Tom Lane wrote: >Ken Tanzer <ktanzer@desc.org> writes: > > >>Thanks for the quick responses yesterday. At a minimum, it seems like >>this behavior does not match what is described in the Postgres >>documentation (more detail below). >> >> > >After looking at this more, I think that it is actually behaving as >Spencer designed it to. The key point is this bit from the fine print >in section 9.6.3.5: > > A branch has the same preference as the first quantified atom in it > which has a preference. > >("branch" being any regexp with no outer-level | operator) > >What this apparently means is that if the RE begins with a non-greedy >quantifier, then the matching will be done in such a way that the whole >RE matches the shortest possible string --- that is, the whole RE is >non-greedy. It's still possible for individual items within the RE to >be greedy or non-greedy, but that only affects how much of the shortest >possible total match they are allowed to eat relative to each other. >All the examples I've looked at seem to work "properly" when seen in >this light. > >I can see that this behavior could have some usefulness, and if need be >you can always override it by writing (...){1,1} around the whole RE. >So at this point I'm disinclined to vary from the Tcl semantics. > >This does leave us with a documentation problem though, because this >behavior is surely not obvious from what it says in 9.6.3.5. If you've >got any thoughts about a better explanation, I'm all ears. > > > >>Here's the actual regex we're working on--any help >>reformulating this would be great! >> >> > > > >>select substring('Searching for log 5376, referenced in this text' >> FROM >> '(?i)(?:.*?)logs?(?:\\s|\\n|<br>|<br />| >>)(?:entry|no|number|#)?(?:\\s|\\n|<br>|<br /> )?([0-9]{1,7})(.*?)'); >> >> > >I don't see that you need either the leading (?:.*?) or the trailing >(.*?) here, and if you dropped them then the first quantifier would be >the "s?" which is greedy so the curious case goes away. I suppose the >idea of adding (?:.*?) was to ensure that "log" will be matched to the >first possible place where it could match --- but that is true anyway, >per the first sentence of 9.6.3.5. > > regards, tom lane > >
Kenneth Tanzer <ktanzer@desc.org> writes: > But what about these two queries: > SELECT substring('a' FROM 'a?|a?'); > This returns a greedy 'a', similar to the example above. But then why does > SELECT substring('ba' FROM 'a?|a?'); > return a non-greedy empty string? You're ignoring the first rule of matching: when there is more than one possible match, the match starting earliest in the string is chosen. The longer-or-shorter business only applies when there are multiple legal ways to form a match starting at the same place. In this case 'a?' can form a legal match at the beginning of the string (ie, match to no characters) and so the fact that a longer match is available later in the string doesn't enter into it. > With regard to the documentation, after re-reading it many times I'd > have to say the information is all there, but it's hard to absorb. I'd agree. This section was taken nearly verbatim from Henry Spencer's man page for the regexp package, and with all due respect to Henry, it's definitely written in geek reference-page-speak. Maybe a few examples would help. On the other hand, I don't want to try to turn the section into a regexp tutorial --- there are entire books written about regexps (I quite like the O'Reilly one, btw). So there's a bulk-vs-friendliness tradeoff to be made. > I think the main problem is that the term "preference" is used to > discuss greedy/non-greediness, as well as the words greedy & > non-greedy. Good point. It would help to use only one term. > As an example, here's a couple of different possibilities for the second > sentence of the section: I like this one: > b) If the RE could match more than one substring starting at that point, > the match can be either greedy (matching the longest substring) or > non-greedy (matching the shortest substring). Whether an RE is greedy > or not is determined by the following rules... Given that intro, there's no need to use the word "preference" at all. Or almost --- what term will you use for "RE with no preference"? Perhaps you can avoid the question by pointing out that greediness only matters for quantifiers, since unquantified REs can only match fixed-length strings. The point you make here: > c) Like individual components of an RE, the entire RE can be either > greedy (matching the longest substring) or non-greedy (matching the > shortest substring). is also important, but probably needs to be a completely separate paragraph containing its own example. > Do you think an edit along these lines would be helpful? If so, I'd be > willing to take a shot at re-writing that section. Let me know. Thanks. Fire away. Please send whatever you come up with to the pgsql-docs list. regards, tom lane
Kenneth Tanzer <ktanzer@desc.org> writes: > OK. I've been trying to get my mind around this, and think about ways > to improve the documentation (more about that below). I finally got around to revising the regex docs on the basis of this discussion: see patch at http://developer.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/func.sgml or the finished text at http://candle.pha.pa.us/main/writings/pgsql/sgml/functions-matching.html#POSIX-MATCHING-RULES Let me know if you have suggestions for further improvements. regards, tom lane