Thread: Can Postgres beat Oracle for regexp_count?

Can Postgres beat Oracle for regexp_count?

From
Shaozhong SHI
Date:
It has been found that regexp_count works brilliantly in Oracle.

However, it is not easy to replicate that in Postgres.  The following codes have been experimented but without any luck.

select regexp_matches('My High Street', '([A-Z][a-z]+[\s])', 'g')

select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)', 'g')

County occurrences of 'My High Street' in one of the following strings:

'My High Street'    1
''                             0
'My High Street My High Street'   2

Can anyone enlighten all of us?

Regards,

David

Re: Can Postgres beat Oracle for regexp_count?

From
"David G. Johnston"
Date:
On Wed, Feb 2, 2022 at 1:20 PM Shaozhong SHI <shishaozhong@gmail.com> wrote:
It has been found that regexp_count works brilliantly in Oracle.

What query exactly did you execute in Oracle that you wish to see if an equivalent can be formulated in PostgreSQL?

However, it is not easy to replicate that in Postgres.

Given we don't have a regexp_count function this isn't surprising...

 
  The following codes have been experimented but without any luck.

select regexp_matches('My High Street', '([A-Z][a-z]+[\s])', 'g')

select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)', 'g')

See my first point.

David J.

Re: Can Postgres beat Oracle for regexp_count?

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> Given we don't have a regexp_count function this isn't surprising...

FYI, it's there in HEAD.

In the meantime, you could possibly do something like

=# select count(*) from regexp_matches('My High Street', '([A-Z][a-z]+[\s])', 'g');
 count
-------
     2
(1 row)

(Note that 2 is the correct answer given that there's no space
after the third word; I trust Oracle agrees.)

            regards, tom lane



Re: Can Postgres beat Oracle for regexp_count?

From
Shaozhong SHI
Date:
Hi, Tom, Lane,

On Wed, 2 Feb 2022 at 22:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> Given we don't have a regexp_count function this isn't surprising...

FYI, it's there in HEAD.

In the meantime, you could possibly do something like

=# select count(*) from regexp_matches('My High Street', '([A-Z][a-z]+[\s])', 'g');
 count
-------
     2
(1 row)

(Note that 2 is the correct answer given that there's no space
after the third word; I trust Oracle agrees.)

Can the whole 3 or 4 or 5 to be matched as 1?

The following has been attempted but no luck.

select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)+', 'g')
It is intended to match 'My High Street, but it turned out only 'Street' was matched. 

Regards,  David 

Re: Can Postgres beat Oracle for regexp_count?

From
"David G. Johnston"
Date:
On Wed, Feb 2, 2022 at 10:26 PM Shaozhong SHI <shishaozhong@gmail.com> wrote:

select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)+', 'g')
It is intended to match 'My High Street, but it turned out only 'Street' was matched. 


I'm too tired to find the documentation for why you saw your result but basically you only have a single capturing parentheses pair and since you've quantified that you end up with just the last capture that was found - Street.  If you want to capture the entire found expression you need to capture the quantifier.  So put parentheses around the entire regexp.

select regexp_matches('My High Street', '(([A-Z][a-z]+[\s]*)+)', 'g')

You now have a two element array, slots filled left-to-right based upon the opening parenthesis.  So {"My High Street",Street}

To get rid of the undesired Street and only return a single element array you need to make the inner parentheses non-capturing.

select regexp_matches('My High Street', '((?:[A-Z][a-z]+[\s]*)+)', 'g')

David J.

Re: Can Postgres beat Oracle for regexp_count?

From
Tom Lane
Date:
Shaozhong SHI <shishaozhong@gmail.com> writes:
> The following has been attempted but no luck.

> select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)+', 'g')
> It is intended to match 'My High Street, but it turned out only 'Street'
> was matched.

You've got the parentheses in the wrong place, ie inside not outside the
"+" quantifier.  Per the fine manual [1], the result is determined by the
last match of quantified capturing parens.

You could avoid using any capturing parens, so that the result is
the whole match:

regression=# select regexp_matches('My High Street', '(?:[A-Z][a-z]+[\s]*)+', 'g');
   regexp_matches
--------------------
 {"My High Street"}
(1 row)

or you could do

regression=# select regexp_matches('My High Street', '(([A-Z][a-z]+[\s]*)+)', 'g');
      regexp_matches
---------------------------
 {"My High Street",Street}
(1 row)

but then you have two sets of capturing parens and you get results for
both, so you might prefer

regression=# select regexp_matches('My High Street', '((?:[A-Z][a-z]+[\s]*)+)', 'g');
   regexp_matches
--------------------
 {"My High Street"}
(1 row)

In any case, there's no substitute for reading the manual.

            regards, tom lane

[1] https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP



Re: Can Postgres beat Oracle for regexp_count?

From
Shaozhong SHI
Date:
Many thanks, Tom,

select regexp_matches('My High Street', '(?:[A-Z][a-z]+[\s]*)+', 'g');  looks very interesting.

I did read the documentation, but found it is difficult to read.  Particularly, the documentation on the use ?: does not state clear sense.  There is only limited explanation on ?:.

Is it correct to say that this ?: construction of a regex can be applied for checking whether cell values meet specifications?

Regards,

David

On Thu, 3 Feb 2022 at 05:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Shaozhong SHI <shishaozhong@gmail.com> writes:
> The following has been attempted but no luck.

> select regexp_matches('My High Street', '([A-Z][a-z]+[\s]*)+', 'g')
> It is intended to match 'My High Street, but it turned out only 'Street'
> was matched.

You've got the parentheses in the wrong place, ie inside not outside the
"+" quantifier.  Per the fine manual [1], the result is determined by the
last match of quantified capturing parens.

You could avoid using any capturing parens, so that the result is
the whole match:

regression=# select regexp_matches('My High Street', '(?:[A-Z][a-z]+[\s]*)+', 'g');
   regexp_matches   
--------------------
 {"My High Street"}
(1 row)

or you could do

regression=# select regexp_matches('My High Street', '(([A-Z][a-z]+[\s]*)+)', 'g');
      regexp_matches       
---------------------------
 {"My High Street",Street}
(1 row)

but then you have two sets of capturing parens and you get results for
both, so you might prefer

regression=# select regexp_matches('My High Street', '((?:[A-Z][a-z]+[\s]*)+)', 'g');
   regexp_matches   
--------------------
 {"My High Street"}
(1 row)

In any case, there's no substitute for reading the manual.

                        regards, tom lane

[1] https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP

Re: Can Postgres beat Oracle for regexp_count?

From
"David G. Johnston"
Date:


On Thursday, February 3, 2022, Shaozhong SHI <shishaozhong@gmail.com> wrote:

Is it correct to say that this ?: construction of a regex can be applied for checking whether cell values meet specifications?


It does exactly what our examples shows it does.  I don’t understand what you mean above but if that helps you remember its purpose, great.

David J.

Re: Can Postgres beat Oracle for regexp_count?

From
Shaozhong SHI
Date:
Hi, David,

Many thanks.

I am investigating into transformation of data quality validation through automation with application of Postgres/PostGIS.

Regards,

David

On Thu, 3 Feb 2022 at 13:00, David G. Johnston <david.g.johnston@gmail.com> wrote:


On Thursday, February 3, 2022, Shaozhong SHI <shishaozhong@gmail.com> wrote:

Is it correct to say that this ?: construction of a regex can be applied for checking whether cell values meet specifications?


It does exactly what our examples shows it does.  I don’t understand what you mean above but if that helps you remember its purpose, great.

David J.

Re: Can Postgres beat Oracle for regexp_count?

From
Merlin Moncure
Date:
On Wed, Feb 2, 2022 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
> > Given we don't have a regexp_count function this isn't surprising...
>
> FYI, it's there in HEAD.
>
> In the meantime, you could possibly do something like
>
> =# select count(*) from regexp_matches('My High Street', '([A-Z][a-z]+[\s])', 'g');
>  count
> -------
>      2
> (1 row)

alternate version:
postgres=# select array_upper(regexp_split_to_array('My High Street My
High Street', 'My High Street'), 1) - 1;
 ?column?
──────────
        2

can help to slide this into complex queries a little bit easier by
avoiding the aggregation :-).

merlin