Thread: REGEXP_MATCHES() strange behavior with '^' and '$' pattern

REGEXP_MATCHES() strange behavior with '^' and '$' pattern

From
Jeevan Chalke
Date:
Hi,

While playing with regular expressions I found some strange behavior in the
regexp_matches() function.

Consider the following SQL query and its output:

postgres=# select regexp_matches('1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4', '^', 'mg');
 regexp_matches
----------------
 {""}
 {""}
 {""}
 {""}
 {""}
 {""}
 {""}
(7 rows)

It is supposed to return 4 rows, not 7. Similar behavior is found with the
pattern '$'.

It seems that these start and end anchor characters are not matching
correctly. Or rather, they are matching twice.

To find the root cause, I put elog(INFO,..) into the
setup_regexp_matches() function, where we copy matches into the struct, and
found the following values:

postgres=# select regexp_matches('1' || chr(10) || '2' || chr(10) || '3' || chr(10) || '4', '^', 'mg');
INFO:  start_search: 0  rm_so: 0  rm_eo: 0
INFO:  updated start_search: 1
INFO:  start_search: 1  rm_so: 2  rm_eo: 2
INFO:  updated start_search: 2
INFO:  start_search: 2  rm_so: 2  rm_eo: 2
INFO:  updated start_search: 3
INFO:  start_search: 3  rm_so: 4  rm_eo: 4
INFO:  updated start_search: 4
INFO:  start_search: 4  rm_so: 4  rm_eo: 4
INFO:  updated start_search: 5
INFO:  start_search: 5  rm_so: 6  rm_eo: 6
INFO:  updated start_search: 6
INFO:  start_search: 6  rm_so: 6  rm_eo: 6
INFO:  updated start_search: 7

Certainly, after the second pass, the updated start_search should be 3, since
the last match was at 2 and of zero length (so = eo).

I have modified that logic to look similar to that of the
replace_text_regexp() function, since regexp_replace works well.

Attached is a patch with a test case. Please have a look and let me know if I
assumed something wrong.

Thanks

--
Jeevan B Chalke
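
For readers following along, the search-advance problem described above can be reproduced outside PostgreSQL. The following is only a rough sketch in Python, not the code from setup_regexp_matches() or from the attached patch: it uses Python's re module in place of PostgreSQL's regex engine, the function name scan_matches is made up for illustration, and the "old" advance rule is inferred from the INFO trace rather than copied from the source. The "fixed" rule follows the description in the mail: continue from the end of the match, and step one extra character when the match was of zero length.

```python
import re


def scan_matches(text, pattern, fixed):
    """Collect match start offsets, advancing the search position by one of two rules.

    fixed=False: rule inferred from the INFO trace (an assumption, not the real C code).
    fixed=True:  resume at the match end, plus one character after a zero-length match.
    """
    regex = re.compile(pattern, re.MULTILINE)
    positions = []
    start_search = 0
    while start_search <= len(text):
        # Pattern.search(text, pos) keeps the real string context, so '^'
        # only matches at offset 0 or just after a newline.
        m = regex.search(text, start_search)
        if m is None:
            break
        positions.append(m.start())
        if fixed:
            start_search = m.end()
            if m.start() == m.end():
                start_search += 1  # skip past a zero-length match
        else:
            if start_search < m.end():
                start_search = m.end()  # lands on the same zero-length match again
            else:
                start_search += 1
    return positions


if __name__ == "__main__":
    text = "1\n2\n3\n4"
    print(scan_matches(text, "^", fixed=False))  # -> [0, 2, 2, 4, 4, 6, 6]
    print(scan_matches(text, "^", fixed=True))   # -> [0, 2, 4, 6]
```

With the inferred old rule, the zero-length matches at offsets 2, 4 and 6 are each found twice, giving the 7 rows shown above. With the second rule each line start is reported once, giving the expected 4.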

Re: REGEXP_MATCHES() strange behavior with '^' and '$' pattern

From
Jeevan Chalke
Date:
Oops forgot patch.

Attached now.


--
Jeevan B Chalke

Attachment

Re: REGEXP_MATCHES() strange behavior with '^' and '$' pattern

From
Tom Lane
Date:
Jeevan Chalke <jeevan.chalke@enterprisedb.com> writes:
> Oops forgot patch.
> Attached now.

Hmm ... I think the logic change is good, but two demerits for not fixing
the adjacent comment.
        regards, tom lane



Re: REGEXP_MATCHES() strange behavior with '^' and '$' pattern

From
Jeevan Chalke
Date:



On Wed, Jul 31, 2013 at 7:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeevan Chalke <jeevan.chalke@enterprisedb.com> writes:
> > Oops forgot patch.
> > Attached now.
>
> Hmm ... I think the logic change is good, but two demerits for not fixing
> the adjacent comment.

I had a look over the comments and somehow found them OK.

Anyway, I have updated the comments in this version of the patch.

Thanks



--
Jeevan B Chalke

Attachment

Re: REGEXP_MATCHES() strange behavior with '^' and '$' pattern

From
Jeevan Chalke
Date:



On Thu, Aug 1, 2013 at 12:25 PM, Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
> On Wed, Jul 31, 2013 at 7:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Hmm ... I think the logic change is good, but two demerits for not fixing
> > the adjacent comment.
>
> I had a look over the comments and somehow found them OK.
>
> Anyway, I have updated the comments in this version of the patch.

It looks like you have committed the changes with updated comments and more test cases.

Thanks


--
Jeevan B Chalke
Senior Software Engineer, R&D
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
