Thread: Can we make regexp processing more friendly by recognizing "\r\n" as a "newline" for "^$" purposes?

Other implementations of regular expressions handle "newline" mechanics related to "^" and "$" semantically instead of literally.  By that I mean that both "\r\n" and "\n" are considered "newlines" instead of just "\n".

If changing behavior is not desirable I would be content with another flag that would toggle such behavior.

In code - both of these subqueries should match whereas presently only the first one does.

SELECT regexp_matches(E'123\n',   E'123$', 'w');
SELECT regexp_matches(E'123\r\n', E'123$', 'w');

I don't know if this is server O/S dependent...but I would not expect it to be so.

Having to say something like:  , 'wr'  (r = combine-\r-with-adjacent-newline) would be OK but not ideal.  I'm not seeing much risk in changing this particular behavior.

Thanks!

David J.

P.S. Forgive me for re-iterating the dislike of calling and describing "w" as "weird" and "rarely useful".  I find it to be quite useful and my source material doesn't seem particularly unusual.

Hi David:

On Sun, Oct 18, 2015 at 7:49 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Other implementation of regular expressions handle "newline" mechanics
> related to "^" and "$" semantically instead of literally.  By that I mean
> that both "\r\n" and "\n" are considered "newlines" instead of just "\n".

Which ones ? AFAIK this kind of thing is usually done by C ( and
related ) runtimes when reading text files.

At least on my machine, Perl does not do it:

censored:~$ perl -e 'print( ("A\r\n" =~ /A$/) ? "matched\n" : "NO MATCH\n");'
NO MATCH
censored:~$ perl -e 'print( ("A\r\n" =~ /A.$/) ? "matched\n" : "NO MATCH\n");'
matched
censored:~$ perl -e 'print( ("A\r\n" =~ /A\s$/) ? "matched\n" : "NO MATCH\n");'
matched

Normally, when reading lines on CP/M and related systems ( MSDOS, Windows ), the
C runtime collapses them ( and sometimes it just zaps \r, or collapses any
run, or considers [\r*]\n[\r*] or.... ). But I normally do not see that
behaviour in regexes.
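
If the data cannot be cleaned first, the pattern itself can tolerate an
optional \r before the line end, much like the /A\s$/ variant above. A
minimal sketch, using Python's re module as a stand-in ( an assumption on
my part: it is not the PostgreSQL engine, but in MULTILINE mode its "$"
also recognizes only \n ). The helper name ends_line_with is made up for
illustration:

```python
import re

# Assumption: Python's re stands in for an engine where only LF ends a line.
# The lookahead accepts an optional CR before the line end without
# including that CR in the matched text.
CRLF_TOLERANT_END = r"(?=\r?$)"  # hypothetical name, not part of any library


def ends_line_with(literal, text):
    """Return True if some line in text ends with literal (LF or CRLF endings)."""
    pattern = re.escape(literal) + CRLF_TOLERANT_END
    return re.search(pattern, text, flags=re.MULTILINE) is not None


print(ends_line_with("123", "123\n"))    # True
print(ends_line_with("123", "123\r\n"))  # True
```

A plain "123$" would still miss the second input; the lookahead only
moves the tolerance for \r into the pattern, it does not change what the
engine considers a newline.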

> If changing behavior is not desirable I would be content with another flag
> that would toggle such behavior.
> In code - both of these subqueries should match whereas presently only the
> first one does.
> SELECT regexp_matches(E'123\n',   E'123$', 'w');
> SELECT regexp_matches(E'123\r\n', E'123$', 'w');
> I don't know if this is server O/S dependent...but I would not expect it to
> be so.

Neither do I ( expect it to be OS dependent ), but I find the current
behaviour correct. I mean, newline stuff is OS dependent, and you
should convert when ingesting data; by the time you match, it should
already have been converted to whatever the language uses for newlines
( in C and Perl that means \n, which need not be \012, BTW. In Unix
\n=\012 on disk, on CP/M it's \015\012, and when I worked with Mac (
before the unixy OS X they use now ) it was \015, and I cannot think of
what they use on EBCDIC machines ).
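
A rough sketch of the convert-on-ingest idea, written in Python only
because it is easy to run ( an assumption: Python's re is a stand-in, not
the PostgreSQL engine, though its MULTILINE "$" likewise stops only at
\n ). Both helper names are hypothetical:

```python
import re


def normalize_newlines(text):
    """Collapse CRLF and lone CR into LF (hypothetical ingest-time helper)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def line_end_matches(pattern, text):
    """Return all matches of pattern, with ^ and $ anchored at every LF."""
    return re.findall(pattern, text, flags=re.MULTILINE)


raw = "123\r\n456\r\n"
print(line_end_matches(r"\d+$", raw))                      # []
print(line_end_matches(r"\d+$", normalize_newlines(raw)))  # ['123', '456']
```

Once the data is normalized at the door, every later pattern can use a
plain "$" and nobody has to remember the \r.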

Francisco Olarte.




On Mon, Oct 19, 2015 at 1:26 AM, Francisco Olarte <folarte@peoplecall.com> wrote:
> Hi David:
>
> On Sun, Oct 18, 2015 at 7:49 PM, David G. Johnston
> <david.g.johnston@gmail.com> wrote:
> > Other implementation of regular expressions handle "newline" mechanics
> > related to "^" and "$" semantically instead of literally.  By that I mean
> > that both "\r\n" and "\n" are considered "newlines" instead of just "\n".
>
> Which ones ? AFAIK this kind of thing is usually done by C ( and
> related ) runtimes when reading text files.

In particular, Java.

There is a third-party program I use, RegEx Buddy, that also behaves in the way described.

> At least on my machine, Perl does not do it:
>
> censored:~$ perl -e 'print( ("A\r\n" =~ /A$/) ? "matched\n" : "NO MATCH\n");'
> NO MATCH
> censored:~$ perl -e 'print( ("A\r\n" =~ /A.$/) ? "matched\n" : "NO MATCH\n");'
> matched
> censored:~$ perl -e 'print( ("A\r\n" =~ /A\s$/) ? "matched\n" : "NO MATCH\n");'
> matched

Yes; and I find this to be an annoyance as well...

> Normally, when reading lines on CP/M and related systems ( MSDOS, Windows ), the
> C runtime collapses them ( and sometimes it just zaps \r, or collapses any
> run, or considers [\r*]\n[\r*] or.... ). But I normally do not see that
> behaviour in regexes.
>
> > If changing behavior is not desirable I would be content with another flag
> > that would toggle such behavior.
> > In code - both of these subqueries should match whereas presently only the
> > first one does.
> > SELECT regexp_matches(E'123\n',   E'123$', 'w');
> > SELECT regexp_matches(E'123\r\n', E'123$', 'w');
> > I don't know if this is server O/S dependent...but I would not expect it to
> > be so.

> Neither do I ( expect it to be OS dependent ), but I find the current
> behaviour correct. I mean, newline stuff is OS dependent, and you
> should convert when ingesting data; by the time you match, it should
> already have been converted to whatever the language uses for newlines
> ( in C and Perl that means \n, which need not be \012, BTW. In Unix
> \n=\012 on disk, on CP/M it's \015\012, and when I worked with Mac (
> before the unixy OS X they use now ) it was \015, and I cannot think of
> what they use on EBCDIC machines ).

The current behavior is correct.  The behavior I describe, however, would be more user-friendly without being "incorrect".

Having started with Windows, and still being reliant upon external sources that use it, I've been (un)fortunate enough to develop habits where 99% of the time I do not have to care about line endings while processing data.  I'll pick up new habits eventually, but not having to deal with a pre-processing step to convert line endings would make ad-hoc use of the PostgreSQL regex engine (Tcl's) less cumbersome.

I'm hoping that Tom Lane at least chimes in with his opinion; given his recent work, that area of the codebase should be fresh in his mind.  It's not a huge deal, but recent pain motivates me to at least put it out there.

David J.