Some qualms with the current description of RegExp s,n,w modes. - Mailing list pgsql-docs

From David G Johnston
Subject Some qualms with the current description of RegExp s,n,w modes.
Date
Msg-id 1402011281941-5806271.post@n5.nabble.com
Responses Re: Some qualms with the current description of RegExp s,n,w modes.
List pgsql-docs
The current documentation for "n" and "w" is as follows:

[s] If partial newline-sensitive matching is specified, this affects . and
bracket expressions as with newline-sensitive matching, but not ^ and $.

[w] If inverse partial newline-sensitive matching is specified, this affects
^ and $ as with newline-sensitive matching, but not . and bracket
expressions. This isn't very useful but is provided for symmetry.

I have a specific qualm with the claim that [w] "isn't very useful".  I
would argue that, for anyone who is appropriately exact in their usage of \A
and \Z, there is nothing [s] can do that cannot also be done in [w], and
that parsing multi-record text documents becomes much cleaner if done in [w]
mode.  The terms themselves also do little to help the user understand and
remember the nuances of each mode.

I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
"anchors", though I did make use of ^ and $ individually quite a bit.  I did
not formally define these terms in the body either.

I'm posting mostly to see if anyone else agrees with my opinions on the
matter and to gather thoughts both for and against.

Note that true symmetry would require a 4th mode - one where wildcards stop
at newlines but where anchors only match at the document level - though this
pair is of little value for much the same reason as [n].  In my mind there
are two primary modes (s, w) and one "helpful" mode (n) - no symmetry
claimed.


Instead of calling these "partial" and "inverse partial", better terms would
be "newline-sensitive wildcard matching" and "newline-sensitive anchor
matching".  The default mode could be called "newline-sensitive full
matching".  With those defined correctly elsewhere in the documentation,
section 9.7.3.5 (9.3 version) could provide the following definitions:

full matching - the default - causes wildcards to stop matching at a newline
(typically denoting end-of-line) and so is often referred to as single-line
mode.  The beginning and end of each line can be referred to by using ^ and
$ respectively.  During a global match the document boundaries can be
matched using \A and \Z.
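A small sketch of the behavior described in this definition, again using Python's re module as an assumed analogue rather than PostgreSQL itself: with re.MULTILINE alone, the wildcard stops at newlines, ^ and $ match at each line, and \A and \Z still refer to the document boundaries. The sample text is hypothetical.

```python
import re

# Hypothetical three-line document.
text = "first\nsecond\nthird"

# '^' and '$' delimit each line; '.' never crosses a newline here.
lines = re.findall(r"^.*$", text, re.MULTILINE)

# '\A' and '\Z' pin the match to the document boundaries instead.
first_line = re.search(r"\A.*$", text, re.MULTILINE).group()
last_line = re.search(r"^.*\Z", text, re.MULTILINE).group()

print(lines)       # ['first', 'second', 'third']
print(first_line)  # first
print(last_line)   # third
```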

anchor-only matching is generally useful and almost necessary for times when
newlines are not part of the content but the document being parsed has
multiple records separated by newlines (in particular if the number of
rows-per-record is variable).  The wildcard allows for selecting multiple
rows of content from each record while still being able to use the anchors
to find the beginning and end of each record.  As in full matching mode,
the document boundaries can be matched using \A and \Z.
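The multi-record use case can be sketched as follows.  This is an illustration under stated assumptions, not a PostgreSQL example: Python's re.MULTILINE | re.DOTALL stands in for anchor-only matching, and the BEGIN/END record format and its contents are invented for the demonstration.  The anchors find where each record starts and stops, while the wildcard spans however many rows the record happens to contain.

```python
import re

# Hypothetical document: records with a variable number of rows,
# each delimited by a BEGIN line and an END line.
text = (
    "BEGIN order-1\n"
    "apples 3\n"
    "pears 2\n"
    "END\n"
    "BEGIN order-2\n"
    "plums 7\n"
    "END\n"
)

# '^' and '$' match at line boundaries (MULTILINE);
# '.' crosses newlines (DOTALL), so '(.*?)' can span several rows.
record_re = re.compile(r"^BEGIN (\S+)$\n(.*?)^END$", re.MULTILINE | re.DOTALL)

records = [
    (m.group(1), m.group(2).splitlines())
    for m in record_re.finditer(text)
]

for name, rows in records:
    print(name, rows)
# order-1 ['apples 3', 'pears 2']
# order-2 ['plums 7']
```

The lazy quantifier matters here: a greedy (.*) would run from the first BEGIN to the last END and merge both records into one match.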

wildcard-only matching is useful when you wish to treat newlines only as
content within a single logical document.  ^ and $ are left as synonyms for
\A and \Z respectively and so do not (typically inadvertently) match near an
embedded newline - you have to use a literal \n to do that and then deal
with the newline itself being part of the capture.  This is best thought of
as a compatibility mode since you can get the same behavior, without losing
the unique behavior of ^ and $, in anchor-only mode with proper use of \A
and \Z to match boundaries and avoid using ^ and $.
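The equivalence claimed here can be checked with a short sketch.  As before, Python's re module is only an assumed analogue: re.DOTALL alone stands in for a mode whose anchors are not newline-sensitive, and re.MULTILINE | re.DOTALL stands in for anchor-only mode.  The sample text is hypothetical.

```python
import re

# Hypothetical document with embedded newlines and no trailing newline.
text = "line one\nline two\nline three"

# Anchors not newline-sensitive: '^' and '$' act like '\A' and '\Z'.
doc_only = re.DOTALL
start_plain = re.findall(r"^line \w+", text, doc_only)
end_plain = re.findall(r"\w+$", text, doc_only)

# Anchors newline-sensitive: '\A' and '\Z' reproduce the results above,
# while '^' and '$' keep their per-line meaning for when it is wanted.
per_line = re.MULTILINE | re.DOTALL
start_explicit = re.findall(r"\Aline \w+", text, per_line)
end_explicit = re.findall(r"\w+\Z", text, per_line)
start_each = re.findall(r"^line \w+", text, per_line)
end_each = re.findall(r"\w+$", text, per_line)

print(start_plain, start_explicit)  # ['line one'] ['line one']
print(end_plain, end_explicit)      # ['three'] ['three']
print(start_each)                   # ['line one', 'line two', 'line three']
print(end_each)                     # ['one', 'two', 'three']
```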

David J.





