Thread: Some qualms with the current description of RegExp s,n,w modes.

Some qualms with the current description of RegExp s,n,w modes.

From: David G Johnston
Date: 05 June 2014, 23:34:52

The current documentation for "p" and "w" is as follows:

[p] If partial newline-sensitive matching is specified, this affects . and
bracket expressions as with newline-sensitive matching, but not ^ and $.

[w] If inverse partial newline-sensitive matching is specified, this affects
^ and $ as with newline-sensitive matching, but not . and bracket
expressions. This isn't very useful but is provided for symmetry.
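
As a quick illustration of the two passages above, here is a minimal sketch using PostgreSQL's embedded-option syntax, where s is the default non-newline-sensitive mode, n is newline-sensitive, p is partial and w is inverse partial. The two-line input string and the column aliases are made up for the example.

```sql
-- Illustrative input only: the string 'foo', a newline, then 'bar'.
SELECT flag,
       E'foo\nbar' ~ ('(?' || flag || ')foo.bar') AS dot_matches_newline,
       E'foo\nbar' ~ ('(?' || flag || ')^bar')    AS caret_matches_after_newline
FROM (VALUES ('s'::text), ('n'), ('p'), ('w')) AS t(flag);

-- Expected, given the documented behaviour of each mode:
--   s: true,  false   (. crosses the newline; ^ only at start of string)
--   n: false, true    (. stops at the newline; ^ matches after it)
--   p: false, false   (. stops at the newline; ^ only at start of string)
--   w: true,  true    (. crosses the newline; ^ matches after it)
```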

I have a specific qualm with the claim that [w] "isn't very useful". I
would argue that if a person is appropriately exact in their usage of \A and
\Z that there is nothing [s] can do that cannot be done in [w] but that
parsing multi-record text documents becomes much cleaner if done in [w]
mode. The terms themselves also do little to help the user understand and
remember the nuances of each mode.

I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
"anchors" though did make use of ^ and $ individually quite a bit. I did not
formally define these terms in the body either.

I'm posting mostly to see if anyone else agrees with my opinions on the
matter and to gather thoughts both for and against.

Note that true symmetry would require a 4th mode - one where wildcards stop
at newlines but where anchors only match at the document level - though this
pair is of little value for much the same reason as [n]. In my mind there
are two primary modes (s, w) and one "helpful" mode (n) - no symmetry
claimed.

Instead of calling these "partial" and "inverse partial" better terms would
be "newline-sensitive wildcard matching" and "newline-sensitive anchor
matching". The default mode could be called "newline-sensitive full
matching". With those defined correctly elsewhere in the documentation,
section 9.7.3.5 (9.3 version) could provide the following definitions:

full matching - the default - causes wildcards to stop matching at a newline
(typically denoting end-of-line) and so is often referred to as single-line
mode. The beginning and end of each line can be referred to by using ^ and
$ respectively. During a global match the document boundaries can be
matched using \A and \Z.

anchor-only matching is generally useful and almost necessary for times when
newlines are not part of the content but the document being parsed has
multiple records separated by newlines (in particular if the number of
rows-per-record is variable). The wildcard allows for selecting multiple
rows of content from each record while still being able to use the anchors
to find the beginning and end of each record. As in full matching mode,
the document boundaries can be matched using \A and \Z.

wildcard-only matching is useful when you wish to treat newlines only as
content within a single logical document. ^ and $ are left as synonyms for
\A and \Z respectively and so do not (typically inadvertently) match near an
embedded newline - you have to use a literal \n to do that and then deal
with the newline itself being part of the capture. This is best thought of
as a compatibility mode since you can get the same behavior, without losing
the unique behavior of ^ and $, in anchor-only mode with proper use of \A
and \Z to match boundaries and avoid using ^ and $.
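
The claim that \A and \Z under the w flag can stand in for what ^ and $ do under the s flag can be sketched roughly as follows. The input string and column aliases are hypothetical, and the pattern literals assume standard_conforming_strings is on so that the backslashes reach the regex engine unchanged.

```sql
-- Illustrative input only: 'a', a newline, then 'b'.
SELECT E'a\nb' ~ '(?s)^b'  AS s_caret,   -- false: ^ only matches at the start of the string
       E'a\nb' ~ '(?w)^b'  AS w_caret,   -- true:  ^ also matches after the newline
       E'a\nb' ~ '(?w)\Ab' AS w_bos,     -- false: \A only matches at the start of the string
       E'a\nb' ~ '(?w)a$'  AS w_dollar,  -- true:  $ also matches before the newline
       E'a\nb' ~ '(?w)a\Z' AS w_eos;     -- false: \Z only matches at the end of the string
```

In other words, a pattern written for s mode should keep its string-level anchoring under w if every ^ is replaced by \A and every $ by \Z.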

David J.


Re: Some qualms with the current description of RegExp s,n,w modes.

From: Tom Lane
Date: 06 June 2014, 00:00:51

David G Johnston <david.g.johnston@gmail.com> writes:
> I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
> "anchors" though did make use of ^ and $ individually quite a bit.  I did not
> formally define these terms in the body either.

Did you mean to attach a proposed doc patch here, or are you just
armwaving about what a patch might look like?

FWIW, I don't agree with using "wildcard" to mean those particular things
(the term is too generic, and there are other regex constructs that
might be thought to be included); although you could probably get away
with using "anchor" this way as long as you define the term at first use.

The text involved here is more or less verbatim from Henry Spencer's
original man page for the regex library, so you're essentially claiming
you know more than the author did about what his code is good for.  Maybe
so, but some examples in support of your thesis would be a good thing.

> Instead of calling these "partial" and "inverse partial" better terms would
> be "newline-sensitive wildcard matching" and "newline-sensitive anchor
> matching".

Agreed that "partial" is not a very good name, but I remain resistant to
"wildcard" here.

> The default mode could be called "newline-sensitive full
> matching".

Or just "newline-sensitive matching" ... does "full" add anything?

            regards, tom lane

Re: Some qualms with the current description of RegExp s,n,w modes.

From: David Johnston
Date: 06 June 2014, 00:32:44

On Thu, Jun 5, 2014 at 8:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> David G Johnston <david.g.johnston@gmail.com> writes:
>> I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
>> "anchors" though did make use of ^ and $ individually quite a bit. I did not
>> formally define these terms in the body either.
>
> Did you mean to attach a proposed doc patch here, or are you just
> armwaving about what a patch might look like?

Armwaving for lack of any current setup to generate doc-patches.

> FWIW, I don't agree with using "wildcard" to mean those particular things
> (the term is too generic, and there are other regex constructs that
> might be thought to be included); although you could probably get away
> with using "anchor" this way as long as you define the term at first use.

I had the same nagging suspicion but figured for a first pass, and defined only within this context, it would suffice. ". and ^ brackets" just rubbed me the wrong way but it does have the merit of being precise.

> The text involved here is more or less verbatim from Henry Spencer's
> original man page for the regex library, so you're essentially claiming
> you know more than the author did about what his code is good for. Maybe
> so, but some examples in support of your thesis would be a good thing.

I can readily support why I found [w] to be most useful; the conclusion that [w] > [s] came from the logic that, since [s] leaves "^ and $" as mere synonyms for \A and \Z, using [w] mode and simply avoiding them would have the same effect. I'll admit that people using ^ and $ where they really meant \A and \Z may be an issue worth accounting for... but I personally consider providing that mode a compatibility/help-oriented decision and just decided to state so in my revision.

Example that prompted this whole journey:

-- Record header lines start in column 1; continuation lines are indented.
WITH src (filecontent) AS ( VALUES(
$$CDF CORR: DRAIN COOLANT AND REFILL
    ADDITIONAL DLR-OP: BGFLDEX
    PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG TO: EPA CHG: HAZ CHG:
    9999 5
    SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#: ENGINE:
CDR CORR: CUSTOMER ELECTED NOT TO HAVE REPAIRS DONE AT THIS TIME NOS
    PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG TO: EPA CHG: HAZ CHG:
    9999 03 0030
    SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#: ENGINE:
$$::varchar
))
-- 'g' returns every match; 'w' lets . cross newlines while ^ matches at each line start
, do_match AS (
SELECT regexp_matches(filecontent,'^(\S.*?)(?=^\S|\Z)','gw') AS match FROM src
)
, explode_match AS (
SELECT unnest(match) FROM do_match
)
SELECT unnest, length(unnest) FROM explode_match;

[s] 1 result because the "^\S" construct attempts to match beginning-of-document instead of beginning-of-line. This is when I started digging deeper since I expected it to behave like [w].

[n] 0 results because the (.*?) never gets beyond the first line and thus cannot match "^\S|\Z" - no problem here, the behavior of "." is as expected.

[w] 2 results as desired/expected. It is possible to replace ^\S with \n\S (and thus allow [s] to work) but the semantic meaning of ^ makes using this form more convenient.
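
For anyone wanting to try the comparison without the full sample, the following is a smaller sketch that runs the same pattern under all three flags in one query. The two-record input, the CTE name and the column aliases are hypothetical stand-ins; the counts it returns can be compared with the 1, 0 and 2 reported above.

```sql
-- Hypothetical two-record input: record headers start in column 1,
-- continuation lines are indented, and the text ends with a newline.
WITH src (doc) AS (
    VALUES (E'A one\n  a1\n  a2\nB two\n  b1\n'::text)
)
SELECT flag,
       -- count the matches found with the global flag plus the mode under test
       (SELECT count(*)
          FROM regexp_matches(doc, '^(\S.*?)(?=^\S|\Z)', 'g' || flag)) AS matches
FROM src, (VALUES ('s'::text), ('n'), ('w')) AS t(flag);
```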

Note that CDF has 5 rows of content while CDR only has 4, which strongly suggests the use of newline-insensitive "wildcard" matching. The choice of anchor mode is of a cosmetic/semantic nature, but I argue that in this situation the semantics of [w] are preferred over [n].

In either case I'd rather simply drop the existing commentary that [w] is not that useful and either in words or example explain when it would have use; even if you do not want to go as far as to claim that [w] is superior to [n] as I would.

While it is likely possible to write a working expression in all three modes, my experience - which is largely based on executing these expressions in Java, not PostgreSQL, though that is becoming more common nowadays - led me directly to the regexp provided.

>> Instead of calling these "partial" and "inverse partial" better terms would
>> be "newline-sensitive wildcard matching" and "newline-sensitive anchor
>> matching".
>
> Agreed that "partial" is not a very good name, but I remain resistant to
> "wildcard" here.
>
>> The default mode could be called "newline-sensitive full
>> matching".
>
> Or just "newline-sensitive matching" ... does "full" add anything?

Not much - though after adding "anchor" and "wildcard" to the others the question became if this option is not only one of those then is it both, or neither? Full makes it clear that it means both.

Maybe something like: [s] - single-line mode; [w] - multi-line mode; [n|m] - document-only mode; though I dislike re-associating multi-line with [w] given its current association with [n|m]. "Record Mode [w]" has some merit since that is at least the use case that I have identified where it is particularly useful...

David J.

Re: Some qualms with the current description of RegExp s,n,w modes.

From: David Johnston
Date: 06 June 2014, 00:56:42

> Or just "newline-sensitive matching" ... does "full" add anything?

And since I'm nit-picking anyway - the word "sensitive" does nothing for me. Simply "newline-matching" would be sufficient, ideally. i.e., Do ". [^]" and "^$" match the newline character, or not.

[w] anchor newline-matching

[n] dot/inverse-bracket newline-matching

[s] newline-matching

These are precise, what-oriented, names compared to:

[w] record mode

[n] multi-line mode

[s] single-line mode

which are more descriptive, use-oriented, names.

Use of these label sets is not mutually exclusive...

David J.