Thread: Some qualms with the current description of RegExp s,n,w modes.
The current documentation for "n" and "w" is as follows:

[s] If partial newline-sensitive matching is specified, this affects . and bracket expressions as with newline-sensitive matching, but not ^ and $.

[w] If inverse partial newline-sensitive matching is specified, this affects ^ and $ as with newline-sensitive matching, but not . and bracket expressions. This isn't very useful but is provided for symmetry.

I have a specific qualm with the claim that [w] "isn't very useful". I would argue that if a person is appropriately exact in their usage of \A and \Z, there is nothing [s] can do that cannot be done in [w], and that parsing multi-record text documents becomes much cleaner if done in [w] mode.

The terms themselves also do little to help the user understand and remember the nuances of each mode. I simplified ". and bracket expressions" to "wildcard" and "^ and $" to "anchors" though did make use of ^ and $ individually quite a bit. I did not formally define these terms in the body either.

I'm posting mostly to see if anyone else agrees with my opinions on the matter and to gather thoughts both for and against.

Note that true symmetry would require a 4th mode - one where wildcards stop at newlines but where anchors only match at the document level - though this pair is of little value for much the same reason as [n]. In my mind there are two primary modes (s, w) and one "helpful" mode (n) - no symmetry claimed.

Instead of calling these "partial" and "inverse partial" better terms would be "newline-sensitive wildcard matching" and "newline-sensitive anchor matching". The default mode could be called "newline-sensitive full matching".

With those defined correctly elsewhere in the documentation, section 9.7.3.5 (9.3 version) could provide the following definitions:

full matching - the default - causes wildcards to stop matching at a newline (typically denoting end-of-line) and so is often referred to as single-line mode. The beginning and end of each line can be referred to by using ^ and $ respectively. During a global match the document boundaries can be matched using \A and \Z.

anchor-only matching is generally useful, and almost necessary, when newlines are not part of the content but the document being parsed has multiple records separated by newlines (in particular if the number of rows-per-record is variable). The wildcard allows for selecting multiple rows of content from each record while still being able to use the anchors to find the beginning and end of each record. As in full matching mode, the document boundaries can be matched using \A and \Z.

wildcard-only matching is useful when you wish to treat newlines only as content within a single logical document. ^ and $ are left as synonyms for \A and \Z respectively and so do not (typically inadvertently) match near an embedded newline - you have to use a literal \n to do that and then deal with the newline itself being part of the capture. This is best thought of as a compatibility mode since you can get the same behavior, without losing the unique behavior of ^ and $, in anchor-only mode by using \A and \Z to match boundaries and avoiding ^ and $.

David J.
--
View this message in context: http://postgresql.1045698.n5.nabble.com/Some-qualms-with-the-current-description-of-RegExp-s-n-w-modes-tp5806271.html
Sent from the PostgreSQL - docs mailing list archive at Nabble.com.
David G Johnston <david.g.johnston@gmail.com> writes:
> I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
> "anchors" though did make use of ^ and $ individually quite a bit. I did not
> formally define these terms in the body either.

Did you mean to attach a proposed doc patch here, or are you just
armwaving about what a patch might look like?

FWIW, I don't agree with using "wildcard" to mean those particular things
(the term is too generic, and there are other regex constructs that
might be thought to be included); although you could probably get away
with using "anchor" this way as long as you define the term at first use.

The text involved here is more or less verbatim from Henry Spencer's
original man page for the regex library, so you're essentially claiming
you know more than the author did about what his code is good for. Maybe
so, but some examples in support of your thesis would be a good thing.

> Instead of calling these "partial" and "inverse partial" better terms would
> be "newline-sensitive wildcard matching" and "newline-sensitive anchor
> matching".

Agreed that "partial" is not a very good name, but I remain resistant to
"wildcard" here.

> The default mode could be called "newline-sensitive full
> matching".

Or just "newline-sensitive matching" ... does "full" add anything?

regards, tom lane
> David G Johnston <david.g.johnston@gmail.com> writes:
>> I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
>> "anchors" though did make use of ^ and $ individually quite a bit. I did not
>> formally define these terms in the body either.
> Did you mean to attach a proposed doc patch here, or are you just
> armwaving about what a patch might look like?
Armwaving for lack of any current setup to generate doc-patches.
> FWIW, I don't agree with using "wildcard" to mean those particular things
> (the term is too generic, and there are other regex constructs that
> might be thought to be included); although you could probably get away
> with using "anchor" this way as long as you define the term at first use.
I had the same nagging suspicion but figured for a first pass, and defined only within this context, it would suffice. ". and ^ brackets" just rubbed me the wrong way but it does have the merit of being precise.
> The text involved here is more or less verbatim from Henry Spencer's
> original man page for the regex library, so you're essentially claiming
> you know more than the author did about what his code is good for. Maybe
> so, but some examples in support of your thesis would be a good thing.
I can readily support why I found [w] to be most useful; the conclusion that [w] > [s] came from the logic that, since [s] makes "^ and $" useless, using [w] mode and simply avoiding them would have the same effect. I'll admit that people using ^ and $ where they really meant \A and \Z may be an issue worth accounting for...but I personally consider providing that mode a compatibility/help-oriented decision and just decided to state so in my revision.
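To make the "[w] plus \A and \Z covers [s]" argument concrete, here is a minimal sketch using Python's re module as a stand-in. This is an assumption for illustration only: Python's engine is not PostgreSQL's, and the flag mapping ([s] roughly re.DOTALL, [w] roughly re.DOTALL | re.MULTILINE) is an approximation rather than an exact equivalence. DOC is a made-up two-line document.

```python
import re

# Hypothetical two-line document (no trailing newline, to sidestep
# Python-specific handling of "$" before a final newline).
DOC = "first line\nsecond line"

# [s]-like analog: "." crosses newlines; "^" and "$" match only at the
# document boundaries.
s_like = re.search(r"^(.*)$", DOC, re.DOTALL)

# [w]-like analog: "." crosses newlines and "^"/"$" also match at line
# boundaries, so \A and \Z are used to pin the match to the whole document.
w_like = re.search(r"\A(.*)\Z", DOC, re.DOTALL | re.MULTILINE)

# Same [w]-like flags, but keeping "^" and "$": the anchors remain
# available for line-level work, which the [s]-like analog cannot offer.
w_line = re.search(r"^(.*?)$", DOC, re.DOTALL | re.MULTILINE)

print(s_like.group(1) == w_like.group(1))
print(repr(w_line.group(1)))
```

In this analog the first two searches capture the same text, while the third shows the line-level anchoring that only the [w]-like flags keep available.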
Example that prompted this whole journey:
WITH src (filecontent) AS ( VALUES(
$$CDF CORR: DRAIN COOLANT AND REFILL
  ADDITIONAL DLR-OP: BGFLDEX
  PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG TO: EPA CHG: HAZ CHG:
  9999 5
  SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#: ENGINE:
CDR CORR: CUSTOMER ELECTED NOT TO HAVE REPAIRS DONE AT THIS TIME NOS
  PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG TO: EPA CHG: HAZ CHG:
  9999 03 0030
  SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#: ENGINE:
$$::varchar
))
, do_match AS (
SELECT regexp_matches(filecontent,'^(\S.*?)(?=^\S|\Z)','gw') AS match FROM src
)
, explode_match AS (
SELECT unnest(match) FROM do_match
)
SELECT unnest, length(unnest) FROM explode_match;
[s] 1 result because the "^\S" construct attempts to match beginning-of-document instead of beginning-of-line. This is when I started digging deeper since I expected it to behave like [w].
[n] 0 results because the (.*?) never gets beyond the first line and thus cannot match "^\S|\Z" - no problem here, the behavior of "." is as expected.
[w] 2 results as desired/expected. It is possible to replace ^\S with \n\S (and thus allow [s] to work) but the semantic meaning of ^ makes using this form more convenient.
Note that CDF has 5 rows of content while CDR only has 4, which strongly suggests the use of newline-insensitive "wildcard" matching. The choice of anchor mode is of a cosmetic/semantic nature but I argue that in this situation the semantics of [w] are preferred over [n].
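The three outcomes described above can be reproduced in a rough analog using Python's re module. This is a sketch under stated assumptions: Python's engine differs from PostgreSQL's, the MODES mapping is an approximation, and DOC is a shortened, hypothetical stand-in for the real document in which continuation lines are indented so that only record headers start in column 0 (the property the ^\S lookahead relies on).

```python
import re

# Shortened, hypothetical stand-in for the multi-record document:
# record headers start in column 0, continuation lines are indented.
DOC = (
    "CDF CORR: DRAIN COOLANT AND REFILL\n"
    "  ADDITIONAL DLR-OP: BGFLDEX\n"
    "  PAY TYPE: C\n"
    "CDR CORR: CUSTOMER ELECTED NOT TO HAVE REPAIRS DONE\n"
    "  PAY TYPE: C\n"
)

# Same pattern as in the regexp_matches() call.
PATTERN = r"^(\S.*?)(?=^\S|\Z)"

# Approximate Python analogs of the PostgreSQL flags (an assumption,
# not an exact mapping between the two engines).
MODES = {
    "s": re.DOTALL,                 # "." crosses newlines, anchors do not see them
    "n": re.MULTILINE,              # "." stops at newlines, anchors see them
    "w": re.DOTALL | re.MULTILINE,  # "." crosses newlines, anchors see them
}


def records(mode):
    """Return the captured records for the given mode letter."""
    return [m.group(1) for m in re.finditer(PATTERN, DOC, MODES[mode])]


for mode in MODES:
    print(mode, len(records(mode)))
```

With this sample the counts come out as 1, 0 and 2 for the s, n and w analogs respectively, in line with the results reported for the original query.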
In either case I'd rather simply drop the existing commentary that [w] is not that useful and either in words or example explain when it would have use; even if you do not want to go as far as to claim that [w] is superior to [n] as I would.
While it is likely possible to write a working expression in all three modes, my experience - which is largely based on executing these expressions in Java, not PostgreSQL, though that is becoming more common nowadays - led me directly to the regexp provided.
>> Instead of calling these "partial" and "inverse partial" better terms would
>> be "newline-sensitive wildcard matching" and "newline-sensitive anchor
>> matching".
> Agreed that "partial" is not a very good name, but I remain resistant to
> "wildcard" here.
>> The default mode could be called "newline-sensitive full
>> matching".
> Or just "newline-sensitive matching" ... does "full" add anything?
Not much - though after adding "anchor" and "wildcard" to the others the question became if this option is not only one of those then is it both, or neither? Full makes it clear that it means both.
Maybe something like: [s] - single-line mode; [w] - multi-line mode; [n|m] - document-only mode; though I dislike re-associating multi-line with [w] given its current association with [n|m]. "Record Mode [w]" has some merit since that is at least the use case that I have identified where it is particularly useful...
David J.
> Or just "newline-sensitive matching" ... does "full" add anything?
And since I'm nit-picking anyway - the word "sensitive" does nothing for me. Simply "newline-matching" would be sufficient, ideally. i.e., Do ". [^]" and "^$" match the newline character, or not.
[w] anchor newline-matching
[n] dot/inverse-bracket newline-matching
[s] newline-matching
These are precise, what-oriented, names compared to:
[w] record mode
[n] multi-line mode
[s] single-line mode
which are more descriptive, use-oriented, names.
Use of these label sets is not mutually exclusive...
David J.