Some qualms with the current description of RegExp s,n,w modes. - Mailing list pgsql-docs
From | David G Johnston |
---|---|
Subject | Some qualms with the current description of RegExp s,n,w modes. |
Date | |
Msg-id | 1402011281941-5806271.post@n5.nabble.com |
Responses | Re: Some qualms with the current description of RegExp s,n,w modes. |
List | pgsql-docs |
The current documentation for "n" and "w" is as follows:

[s] If partial newline-sensitive matching is specified, this affects . and bracket expressions as with newline-sensitive matching, but not ^ and $.

[w] If inverse partial newline-sensitive matching is specified, this affects ^ and $ as with newline-sensitive matching, but not . and bracket expressions. This isn't very useful but is provided for symmetry.

I have a specific qualm with the claim that [w] "isn't very useful". I would argue that, if a person is appropriately exact in their usage of \A and \Z, there is nothing [s] can do that cannot be done in [w], and that parsing multi-record text documents becomes much cleaner in [w] mode.

The terms themselves also do little to help the user understand and remember the nuances of each mode.

I simplified ". and bracket expressions" to "wildcard" and "^ and $" to "anchors", though I did make use of ^ and $ individually quite a bit. I did not formally define these terms in the body either.

I'm posting mostly to see if anyone else agrees with my opinions on the matter and to gather thoughts both for and against.

Note that true symmetry would require a 4th mode - one where wildcards stop at newlines but where anchors only match at the document level - though this pair is of little value for much the same reason as [n]. In my mind there are two primary modes (s, w) and one "helpful" mode (n) - no symmetry claimed.

Instead of calling these "partial" and "inverse partial", better terms would be "newline-sensitive wildcard matching" and "newline-sensitive anchor matching". The default mode could be called "newline-sensitive full matching".

With those defined correctly elsewhere in the documentation, section 9.7.3.5 (9.3 version) could provide the following definitions:

full matching - the default - causes wildcards to stop matching at a newline (typically denoting end-of-line) and so is often referred to as single-line mode.
The beginning and end of each line can be referred to by using ^ and $ respectively. During a global match the document boundaries can be matched using \A and \Z.

anchor-only matching is generally useful, and almost necessary, when newlines are not part of the content but the document being parsed has multiple records separated by newlines (in particular if the number of rows-per-record is variable). The wildcard allows for selecting multiple rows of content from each record while still being able to use the anchors to find the beginning and end of each record. As in full matching mode, the document boundaries can be matched using \A and \Z.

wildcard-only matching is useful when you wish to treat newlines only as content within a single logical document. ^ and $ are left as synonyms for \A and \Z respectively and so do not (typically inadvertently) match near an embedded newline - you have to use a literal \n to do that and then deal with the newline itself being part of the capture. This is best thought of as a compatibility mode, since you can get the same behavior in anchor-only mode, without losing the unique behavior of ^ and $, by using \A and \Z to match boundaries and avoiding ^ and $.

David J.
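
To make the multi-record case described under anchor-only matching concrete, here is a minimal sketch in Python rather than PostgreSQL. It assumes that Python's re.DOTALL | re.MULTILINE flag combination is a reasonable stand-in for a mode in which the wildcard crosses newlines while ^ and $ match at line boundaries; it is not the PostgreSQL syntax. The BEGIN/END record layout and the names DOCUMENT, RECORD, ANCHOR_FLAGS and split_records are all hypothetical, invented for the illustration.

```python
import re

# Hypothetical multi-record document: each record spans a variable number of lines.
DOCUMENT = (
    "BEGIN a\n"
    "row 1\n"
    "row 2\n"
    "END\n"
    "BEGIN b\n"
    "row 3\n"
    "END\n"
)

# Assumed Python analogue of the mode discussed above: the wildcard crosses
# newlines (DOTALL) while ^ and $ match at every line boundary (MULTILINE).
ANCHOR_FLAGS = re.DOTALL | re.MULTILINE

# ^ and $ locate the record delimiters; the lazy wildcard collects the rows between them.
RECORD = r"^BEGIN (\w+)\n(.*?)^END$"


def split_records(document, flags):
    """Return (name, body) pairs for every record the pattern finds."""
    return re.findall(RECORD, document, flags)


# Line-level anchors plus a newline-crossing wildcard: both records are found.
records = split_records(DOCUMENT, ANCHOR_FLAGS)

# With DOTALL alone, ^ matches only at the start of the document, so the
# second ^ in the pattern can never match and no record is found.
no_line_anchors = split_records(DOCUMENT, re.DOTALL)

# ^ matches at each line start, whereas \A matches only at the document start.
line_starts = len(re.findall(r"^BEGIN", DOCUMENT, ANCHOR_FLAGS))
doc_starts = len(re.findall(r"\ABEGIN", DOCUMENT, ANCHOR_FLAGS))

print(records)
print(no_line_anchors, line_starts, doc_starts)
```

In this sketch the record bodies keep their trailing newline, which mirrors the point about having to deal with the newline when it ends up inside a capture.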