Re: Some qualms with the current description of RegExp s,n,w modes. - Mailing list pgsql-docs

From	David Johnston
Subject	Re: Some qualms with the current description of RegExp s,n,w modes.
Date	June 6, 2014 00:32:44
Msg-id	CAKFQuwY=D+4wK1LpZpxXiP3p_SdEb1pMy8k5Y+sh6m9mhUFCPA@mail.gmail.com
In response to	Re: Some qualms with the current description of RegExp s,n,w modes. (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Some qualms with the current description of RegExp s,n,w modes.
List	pgsql-docs

On Thu, Jun 5, 2014 at 8:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David G Johnston <david.g.johnston@gmail.com> writes:
> I simplified ". and bracket expressions" to "wildcard" and "^ and $" to
> "anchors" though did make use of ^ and $ individually quite a bit. I did not
> formally define these terms in the body either.

Did you mean to attach a proposed doc patch here, or are you just
armwaving about what a patch might look like?

Armwaving for lack of any current setup to generate doc-patches.

FWIW, I don't agree with using "wildcard" to mean those particular things
(the term is too generic, and there are other regex constructs that
might be thought to be included); although you could probably get away
with using "anchor" this way as long as you define the term at first use.

I had the same nagging suspicion but figured for a first pass, and defined only within this context, it would suffice. ". and ^ brackets" just rubbed me the wrong way but it does have the merit of being precise.

The text involved here is more or less verbatim from Henry Spencer's
original man page for the regex library, so you're essentially claiming
you know more than the author did about what his code is good for. Maybe
so, but some examples in support of your thesis would be a good thing.

I can readily support why I found [w] to be most useful. The conclusion that [w] > [s] came from this logic: since [s] makes "^ and $" useless, using [w] mode and simply avoiding them would have the same effect. I'll admit that people using ^ and $ where they really meant \A and \Z may be an issue worth accounting for, but I personally consider providing that mode a compatibility/help-oriented decision and decided to state so in my revision.
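A minimal sketch of that reasoning, not taken from the original message: the two-line sample string and the column names are made up for illustration, and the mode is selected with an embedded (?s), (?n), or (?w) option at the start of each pattern. It checks, per mode, whether "." crosses a newline, whether ^ matches after a newline, and whether \A does.

```sql
-- Hypothetical probe of the s, n, and w modes on the string 'ab<newline>cd'.
WITH modes(flag) AS (VALUES ('s'), ('n'), ('w'))
SELECT flag,
       -- does "." match the newline between "b" and "c"?
       E'ab\ncd' ~ ('(?' || flag || ')b.c')  AS dot_crosses_newline,
       -- does ^ match at the start of the second line?
       E'ab\ncd' ~ ('(?' || flag || ')^c')   AS caret_matches_after_newline,
       -- does \A match at the start of the second line?
       E'ab\ncd' ~ ('(?' || flag || ')\Ac')  AS backslash_a_matches_after_newline
FROM modes
ORDER BY flag;
```

Going by the documented mode semantics, [s] and [w] should agree on the first column, differ only on the second, and \A should stay tied to the start of the string in every mode. That is the sense in which [w] with ^ and $ left unused behaves like [s].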

Example that prompted this whole journey:

-- Each record starts at column 1 (CDF, CDR); its continuation lines start with whitespace.
WITH src (filecontent) AS ( VALUES(
$$CDF CORR: DRAIN COOLANT AND REFILL
    ADDITIONAL DLR-OP: BGFLDEX
    PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG TO: EPA CHG: HAZ CHG:
    9999 5
    SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#: ENGINE:
CDR CORR: CUSTOMER ELECTED NOT TO HAVE REPAIRS DONE AT THIS TIME NOS
    PAY TYPE: C OTH HRS: 0000 FORECAST SERVICE: CHG TO: EPA CHG: HAZ CHG:
    9999 03 0030
    SPG CONVERSION SETTINGS - SPG MFG: -- GEN MOD: -- VIN/MODEL#: ENGINE:
$$::varchar
))
, do_match AS (
    SELECT regexp_matches(filecontent,'^(\S.*?)(?=^\S|\Z)','gw') AS match FROM src
)
, explode_match AS (
    SELECT unnest(match) FROM do_match
)
SELECT unnest, length(unnest) FROM explode_match;

[s] 1 result because the "^\S" construct attempts to match beginning-of-document instead of beginning-of-line. This is when I started digging deeper since I expected it to behave like [w].

[n] 0 results because the (.*?) never gets beyond the first line and thus cannot match "^\S|\Z" - no problem here, the behavior of "." is as expected.

[w] 2 results as desired/expected. It is possible to replace ^\S with \n\S (and thus allow [s] to work), but the semantic meaning of ^ makes using this form more convenient.
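To compare the three modes side by side, here is a small sketch that runs the same expression under each flag and counts the matches. The four-line sample document (two records, each with one indented continuation line) and the names src, modes, and n_matches are assumptions made for illustration; only the regular expression comes from the query above.

```sql
-- Hypothetical two-record document: "A1" and "B2" start in column 1,
-- their continuation lines are indented.
WITH src(doc) AS (
    VALUES (E'A1\n  a\nB2\n  b\n')
), modes(flag) AS (
    VALUES ('s'), ('n'), ('w')
)
SELECT m.flag,
       -- count the global matches of the record-splitting expression in this mode
       (SELECT count(*)
          FROM regexp_matches(s.doc, '^(\S.*?)(?=^\S|\Z)', 'g' || m.flag)
       ) AS n_matches
FROM src s CROSS JOIN modes m
ORDER BY m.flag;
```

If the modes behave as described above, the counts should follow the same pattern as the full example: no match under [n], a single match spanning the whole document under [s], and one match per record under [w].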

Note that CDF has 5 rows of content while CDR only has 4, which strongly suggests the use of newline-insensitive "wildcard" matching. The choice of anchor mode is of a cosmetic/semantic nature, but I argue that in this situation the semantics of [w] are preferred over [n].

In either case I'd rather simply drop the existing commentary that [w] is not that useful and explain, either in words or by example, when it would be useful, even if you do not want to go as far as to claim that [w] is superior to [n] as I would.

While it is likely possible to write a working expression in all three modes, my experience (which is largely based on executing these expressions in Java, not PostgreSQL, though that is becoming more common nowadays) led me directly to the regexp provided.

> Instead of calling these "partial" and "inverse partial" better terms would
> be "newline-sensitive wildcard matching" and "newline-sensitive anchor
> matching".

Agreed that "partial" is not a very good name, but I remain resistant to
"wildcard" here.

> The default mode could be called "newline-sensitive full
> matching".

Or just "newline-sensitive matching" ... does "full" add anything?

Not much - though after adding "anchor" and "wildcard" to the others, the question became: if this option is not only one of those, then is it both, or neither? "Full" makes it clear that it means both.

Maybe something like: [s] - single-line mode; [w] - multi-line mode; [n|m] - document-only mode; though I dislike re-associating multi-line with [w] given its current association with [n|m]. "Record Mode [w]" has some merit since that is at least the use case that I have identified where it is particularly useful...

David J.
