Thread: Match 2 words and more

Match 2 words and more

From
Shaozhong SHI
Date:
this is supposed to find those to have 2 words and more.

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$';

But, it finds only one word as well.

It appears that regex is not robust.

Can anyone shed light on this?

Regards,

David

Re: Match 2 words and more

From
Rob Sargent
Date:

> On Nov 27, 2021, at 5:27 PM, Shaozhong SHI <shishaozhong@gmail.com> wrote:
>
> 
> this is supposed to find those to have 2 words and more.
>
> select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$';
>
> But, it finds only one word as well.
>
> It appears that regex is not robust.
>
> Can anyone shed light on this?
>
> Regards,
>
Show the results of the query. Describe how those results differ from what you desire or from what you expect.


> David



Re: Match 2 words and more

From
Alvaro Herrera
Date:
On 2021-Nov-28, Shaozhong SHI wrote:

> this is supposed to find those to have 2 words and more.
> 
> select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$';
> 
> But, it finds only one word as well.

How about something like this?

'^([[:<:]][[:alpha:]]+[[:>:]]( |$)){2}$'

You have:
- the ^ is a constraint that matches start of string
- you have a ( ... ){2}$ construct which means "match exactly twice" and
  then match end-of-string
- Inside the parens of that construct, you match:
  - [[:<:]] which means start-of-word
  - [[:alpha:]]+ which means "a non-empty set of alphabetical chars"
  - [[:>:]] which means end-of-word
  - ( |$) for "either a space or end-of-string"

You can perhaps simplify by removing the [[:<:]] and [[:>:]]
constraints, so '^([[:alpha:]]+( |$)){2}$'

To mean "between two and four", change the {2} to {2,4}.  If you want
"two or more", try {2,}.

You could change the ( |$) to ([[:white:]]+|$) in order to accept more
than one space between words, or combinations of space and tabs and
newlines and so on.

With a decent set of data, you could probably notice some other problems
in this regexp, but at least it should be a decent start.

> It appears that regex is not robust.

Nah.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/



Re: Match 2 words and more

From
Guyren Howe
Date:
On Nov 27, 2021, at 16:27 , Shaozhong SHI <shishaozhong@gmail.com> wrote:

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$’;

The simplest thing that does what you says is I think:

select name FROM a_table where "STREET_NAME" ~ ‘^([[:alpha:]]+\s)+[[:alpha:]]+$’;

Re: Match 2 words and more

From
"David G. Johnston"
Date:
On Sat, Nov 27, 2021 at 5:27 PM Shaozhong SHI <shishaozhong@gmail.com> wrote:
this is supposed to find those to have 2 words and more.

select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$';

But, it finds only one word as well.

It appears that regex is not robust.

Can anyone shed light on this?


You put the space inside the first character class (the plus sign within that class is also a literal plus, not a "one or more" indicator) thus you've made the space itself optional.  Did you maybe mean to use ( ) parens instead of [ ]?

When asking regex questions it is usually more helpful to formulate them like so:

SELECT 'this should match' ~ '<regex here>';

David J.

Re: Match 2 words and more

From
Ron
Date:
On 11/27/21 6:49 PM, Alvaro Herrera wrote:
On 2021-Nov-28, Shaozhong SHI wrote:
[snip]
It appears that regex is not robust.

This was my attitude when I first started to learn computer programming.  My code is perfect, it can't be my fault!!  There must be a compiler bug!!!

Nah.

Just incomprehensible line noise!!

--
Angular momentum makes the world go 'round.

Re: Match 2 words and more

From
"Peter J. Holzer"
Date:
On 2021-11-28 00:27:34 +0000, Shaozhong SHI wrote:
> this is supposed to find those to have 2 words and more.
>
> select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[:alpha:]+$';

I think you meant

  select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ ]+[[:alpha:]]+$';

Note the extra two brackets.

The character classes (like [:alpha:] or [:digit:] can only be used
within bracket expressions. So you have to put brackets around the
second [[:alpha:], too (like you did for the first one).

But if you look more closely at the first one, you will notice that it
doesn't do what you want, either: It matches any non-empty sequence of
alpabetic characters, plus signs and spaces. But you almost certainly
don't want to match a plus sign, and you don't want space and alphabetic
characters to be interchangable. You want some alphabetic characters
followed by a space. So this becomes

  select name FROM a_table where "STREET_NAME" ~ '^([[:alpha:]]+ )+[[:alpha:]]+$';

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Match 2 words and more

From
"Peter J. Holzer"
Date:
On 2021-11-28 00:27:34 +0000, Shaozhong SHI wrote:
> It appears that regex is not robust.

It does reliably what you tell it to do.

I would agree that after almost 50 years of extending it, the syntax has
become a bit of a mess.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Match 2 words and more

From
Thomas Markus
Date:

Am 28.11.21 um 01:27 schrieb Shaozhong SHI:
> this is supposed to find those to have 2 words and more.
>
> select name FROM a_table where "STREET_NAME" ~ '^[[:alpha:]+ 
> ]+[:alpha:]+$';
>
> But, it finds only one word as well.
>
> It appears that regex is not robust.
>
> Can anyone shed light on this?
>
> Regards,
>
> David
Hi,

It's robust, but syntax is sometimes weired

for words I would use something like (contains numbers too)
"STREET_NAME" ~ '(\w+\s+)+\w+';

or alpha only
"STREET_NAME" ~ '([[:alpha:]]+\s+)+[[:alpha:]]+'

regards
Thomas