Thread: Detecting repeated phrase in a string

Detecting repeated phrase in a string

From
Shaozhong SHI
Date:
Does anyone know how to detect a repeated phrase in a string?

Is there a function for this?

Regards,

David

Re: Detecting repeated phrase in a string

From
"Peter J. Holzer"
Date:
On 2021-12-09 12:38:15 +0000, Shaozhong SHI wrote:
> Does anyone know how to detect repeated phrase in a string?

Use regular expressions with backreferences:

bayes=> select regexp_match('foo wikiwiki bar', '(.+)\1');
╔══════════════╗
║ regexp_match ║
╟──────────────╢
║ {o}          ║
╚══════════════╝
(1 row)

"o" is repeated in "foo".

bayes=> select regexp_match('fo wikiwiki bar', '(.+)\1');
╔══════════════╗
║ regexp_match ║
╟──────────────╢
║ {wiki}       ║
╚══════════════╝
(1 row)

"wiki" is repeated in "wikiwiki".

bayes=> select regexp_match('fo wikiwi bar', '(.+)\1');
╔══════════════╗
║ regexp_match ║
╟──────────────╢
║ (∅)          ║
╚══════════════╝
(1 row)

Nothing is repeated.

Adjust the expression within the parentheses if you want to match something
more specific than any sequence of one or more characters.
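
As a minimal sketch of that adjustment, here is the same idea in Java's java.util.regex instead of PostgreSQL, chosen only so that the snippet runs standalone. The class name, the helper firstRepeat, and the narrower pattern (\w{2,})\1 are illustrative assumptions and not something from this thread. Java's regex flavor is not identical to PostgreSQL's, so treat the results as indicative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedSubstringDemo {
    // Same pattern as in the psql examples: any run of characters
    // immediately followed by a copy of itself.
    static final Pattern ANY_REPEAT = Pattern.compile("(.+)\\1");

    // Hypothetical narrower variant: at least two word characters, doubled.
    static final Pattern WORDISH_REPEAT = Pattern.compile("(\\w{2,})\\1");

    // Returns the repeated unit of the first match, or null if there is none.
    static String firstRepeat(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"foo wikiwiki bar", "fo wikiwiki bar", "fo wikiwi bar"};
        for (String s : inputs) {
            System.out.println(s + " -> " + firstRepeat(ANY_REPEAT, s)
                    + " / " + firstRepeat(WORDISH_REPEAT, s));
        }
    }
}
```

With (.+)\1 this reports o, wiki, and null for the three inputs, in line with the psql output above. The narrower pattern ignores the doubled "o" in "foo" and reports wiki for the first input.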

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Re: Detecting repeated phrase in a string

From
Shaozhong SHI
Date:
Hi Peter,

How can I define a word boundary as any of ^ (start of string), a space,
or $ (end of string), so that the following distinction can be made?

"fox fox" is a repeat.

"foxfox" is not a repeat, just one word.

Regards,

David

Re: Detecting repeated phrase in a string

From
Andreas Joseph Krogh
Date:
On Thursday, 9 December 2021 at 15:46:05, Shaozhong SHI <shishaozhong@gmail.com> wrote:
> Hi, Peter,
>
> How to define word boundary as either by using
> ^  , space, or $
>
> So that the following can be done
>
> fox fox is a repeat
>
> foxfox is not a repeat but just one word.

Do you want to detect a repeated phrase (a list of words) or repeated words?
For repeated words (including Unicode characters) you can use:

(\b\p{L}+\b)(?:\s+\1)+

I'm not quite sure how to translate this to PG, but in Java it works.
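
A minimal Java sketch of how that pattern behaves on the "fox fox" and "foxfox" cases from the question. The class name, the helper repeatedWord, and the STRICT variant with an extra trailing \b are illustrative assumptions and not part of the pattern as posted. The variant is only there to show one way to stop a longer word such as "foxes" from counting as a repeat of "fox".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedWordDemo {
    // The pattern as posted: a whole word made of letters, followed by one
    // or more whitespace-separated copies of it.
    static final Pattern AS_POSTED =
            Pattern.compile("(\\b\\p{L}+\\b)(?:\\s+\\1)+");

    // Hypothetical variant: also require a word boundary after each copy,
    // so a longer word that merely starts with the same letters is rejected.
    static final Pattern STRICT =
            Pattern.compile("(\\b\\p{L}+\\b)(?:\\s+\\1\\b)+");

    // Returns the first repeated word, or null if the pattern finds none.
    static String repeatedWord(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"fox fox", "foxfox", "fox foxes", "the fox fox jumps"};
        for (String s : inputs) {
            System.out.println(s + " -> " + repeatedWord(AS_POSTED, s)
                    + " / " + repeatedWord(STRICT, s));
        }
    }
}
```

Both patterns report fox for "fox fox" and nothing for "foxfox". They differ on "fox foxes", where the pattern as posted still reports fox and the stricter variant reports nothing.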
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Re: Detecting repeated phrase in a string

From
"Peter J. Holzer"
Date:
On 2021-12-09 16:11:31 +0100, Andreas Joseph Krogh wrote:
> For repeated words (including unicode-chars) you can do:
>  
> (\b\p{L}+\b)(?:\s+\1)+
>  
> I'm not quite sure how to translate this to PG, but in JAVA it works.

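See the table of constraint escapes in the PostgreSQL documentation.
PostgreSQL writes word boundaries as \y (either end of a word), \m (start
of a word), and \M (end of a word) instead of \b:

https://www.postgresql.org/docs/11/functions-matching.html#POSIX-CONSTRAINT-ESCAPES-TABLE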

        hp
