Thread: Counting the number of repeated phrases in a column

Counting the number of repeated phrases in a column

From
Shaozhong SHI
Date:
There is a short of a function in the standard Postgres to do the following:

It is easy to count the number of occurrence of words, but it is rather difficult to count the number of occurrence of phrases.

For instance:

A cell of value:  'Hello World' means 1 occurrence a phrase.

A cell of value: 'Hello World World Hello' means no occurrence of any repeated phrase.

But, A cell of value: 'Hello World World Hello Hello World' means 2 occurrences of 'Hello World'.

'The City of London, London' also has no occurrences of any repeated phrase.

Anyone has got such a function to check out the number of occurrence of any repeated phrases?

Regards,

David

Aw: Counting the number of repeated phrases in a column

From
Karsten Hilbert
Date:
> There is a short of a function in the standard Postgres to do the following:
>  
> it is easy to count the number of occurrence of words, but it is rather difficult to count the number of occurrence
ofphrases. 
>  
> For instance:
>  
> A cell of value:  'Hello World' means 1 occurrence a phrase.
>  
> A cell of value: 'Hello World World Hello' means no occurrence of any repeated phrase.
>  
> But, A cell of value: 'Hello World World Hello Hello World' means 2 occurrences of 'Hello World'.
>  
> 'The City of London, London' also has no occurrences of any repeated phrase.
>  
> Anyone has got such a function to check out the number of occurrence of any repeated phrases?

For that to become answerable you may want to define what to
do when facing ambiguity.

Best,
Karsten



Re: Counting the number of repeated phrases in a column

From
Shaozhong SHI
Date:
How about split up the value into individual words and keep their orders?
add words up to form individual phrase and ensure that each phrase only consists unique/distinct words
count repeated phrases afterward

How about this?

Regards,

David

On Tue, 25 Jan 2022 at 17:22, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> There is a short of a function in the standard Postgres to do the following:
>  
> it is easy to count the number of occurrence of words, but it is rather difficult to count the number of occurrence of phrases.
>  
> For instance:
>  
> A cell of value:  'Hello World' means 1 occurrence a phrase.
>  
> A cell of value: 'Hello World World Hello' means no occurrence of any repeated phrase.
>  
> But, A cell of value: 'Hello World World Hello Hello World' means 2 occurrences of 'Hello World'.
>  
> 'The City of London, London' also has no occurrences of any repeated phrase.
>  
> Anyone has got such a function to check out the number of occurrence of any repeated phrases?

For that to become answerable you may want to define what to
do when facing ambiguity.

Best,
Karsten


Aw: Re: Counting the number of repeated phrases in a column

From
Karsten Hilbert
Date:
> How about split up the value into individual words and keep their orders?
> add words up to form individual phrase and ensure that each phrase only consists unique/distinct words
> count repeated phrases afterward
>  
> How about this?

Sure, if that serves your purpose ?

So far, we (I?) can't tell because you have yet to (computably) define "phrase".

Which may or may not solve the previous dilemma.

(Top-posting is not liked on this list, to my knowledge.)

Best,
Karsten




Re: Counting the number of repeated phrases in a column

From
"David G. Johnston"
Date:
On Tue, Jan 25, 2022 at 10:10 AM Shaozhong SHI <shishaozhong@gmail.com> wrote:
Anyone has got such a function to check out the number of occurrence of any repeated phrases?

Not I.  But I wouldn't be surprised that such an algorithm exists and that it has been implemented - in a language other than SQL or pl/pgsql.

David J.

Re: Counting the number of repeated phrases in a column

From
Laura Smith
Date:
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Tuesday, January 25th, 2022 at 17:10, Shaozhong SHI <shishaozhong@gmail.com> wrote:

> 'The City of London, London' also has no occurrences of any repeated phrase.


Not sure the City would be particularly happy with that attribution. ;-)

Its it sits on its own. Its own local authority, its own county. It is an enclave enclosed by Greater London.

A bit like the Vatican really. Except the City isn't its own country - much to the chagrin of some, no doubt !



Re: Counting the number of repeated phrases in a column

From
benj.dev@laposte.net
Date:
Le 25/01/2022 à 18:10, Shaozhong SHI a écrit :
> There is a short of a function in the standard Postgres to do the following:
>
> It is easy to count the number of occurrence of words, but it is rather
> difficult to count the number of occurrence of phrases.
>
> For instance:
>
> A cell of value:  'Hello World' means 1 occurrence a phrase.
>
> A cell of value: 'Hello World World Hello' means no occurrence of any
> repeated phrase.
>
> But, A cell of value: 'Hello World World Hello Hello World' means 2
> occurrences of 'Hello World'.
>
> 'The City of London, London' also has no occurrences of any repeated phrase.
>
> Anyone has got such a function to check out the number of occurrence of
> any repeated phrases?
>
> Regards,
>
> David

Don't know if it's exactly what you want, but you can replace all
occurence of the phrase in the text by empty string and compute the diff
between the initial and the result and next divide by the length of your
phrase.

Example :
WITH x AS (SELECT 'toto like tata and toto like titi and toto like tutu'
, 'toto like' phrase)
SELECT (char_length(texte) - char_length(replace(texte, phrase, ''))) /
char_length(phrase) AS nb_occurence
FROM x


Attachment

Re: Counting the number of repeated phrases in a column

From
Ivan Panchenko
Date:
On 26.01.2022 00:21, benj.dev@laposte.net wrote:
> Le 25/01/2022 à 18:10, Shaozhong SHI a écrit :
>> There is a short of a function in the standard Postgres to do the 
>> following:
>>
>> It is easy to count the number of occurrence of words, but it is 
>> rather difficult to count the number of occurrence of phrases.
>>
>> For instance:
>>
>> A cell of value:  'Hello World' means 1 occurrence a phrase.
>>
>> A cell of value: 'Hello World World Hello' means no occurrence of any 
>> repeated phrase.
>>
>> But, A cell of value: 'Hello World World Hello Hello World' means 2 
>> occurrences of 'Hello World'.
>>
>> 'The City of London, London' also has no occurrences of any repeated 
>> phrase.
>>
>> Anyone has got such a function to check out the number of occurrence 
>> of any repeated phrases?
>>
>> Regards,
>>
>> David
>
> Don't know if it's exactly what you want, but you can replace all 
> occurence of the phrase in the text by empty string and compute the 
> diff between the initial and the result and next divide by the length 
> of your phrase.
>
> Example :
> WITH x AS (SELECT 'toto like tata and toto like titi and toto like 
> tutu' , 'toto like' phrase)
> SELECT (char_length(texte) - char_length(replace(texte, phrase, ''))) 
> / char_length(phrase) AS nb_occurence
> FROM x
>
This works if the user knows the phrase. As far as I understood, the 
phrase is not known, and user wants to count number of repeats of any 
phrases.
Of course this can be done with recursive CTE. Split into words, 
generate all phrases (AFAIK requires recursion), then group and count.

But probably in PL/Perl this could be done  more effectively.





Re: Counting the number of repeated phrases in a column

From
Jian He
Date:

On Wed, Jan 26, 2022 at 3:03 AM Ivan Panchenko <i.panchenko@postgrespro.ru> wrote:

On 26.01.2022 00:21, benj.dev@laposte.net wrote:
> Le 25/01/2022 à 18:10, Shaozhong SHI a écrit :
>> There is a short of a function in the standard Postgres to do the
>> following:
>>
>> It is easy to count the number of occurrence of words, but it is
>> rather difficult to count the number of occurrence of phrases.
>>
>> For instance:
>>
>> A cell of value:  'Hello World' means 1 occurrence a phrase.
>>
>> A cell of value: 'Hello World World Hello' means no occurrence of any
>> repeated phrase.
>>
>> But, A cell of value: 'Hello World World Hello Hello World' means 2
>> occurrences of 'Hello World'.
>>
>> 'The City of London, London' also has no occurrences of any repeated
>> phrase.
>>
>> Anyone has got such a function to check out the number of occurrence
>> of any repeated phrases?
>>
>> Regards,
>>
>> David
>
> Don't know if it's exactly what you want, but you can replace all
> occurence of the phrase in the text by empty string and compute the
> diff between the initial and the result and next divide by the length
> of your phrase.
>
> Example :
> WITH x AS (SELECT 'toto like tata and toto like titi and toto like
> tutu' , 'toto like' phrase)
> SELECT (char_length(texte) - char_length(replace(texte, phrase, '')))
> / char_length(phrase) AS nb_occurence
> FROM x
>
This works if the user knows the phrase. As far as I understood, the
phrase is not known, and user wants to count number of repeats of any
phrases.
Of course this can be done with recursive CTE. Split into words,
generate all phrases (AFAIK requires recursion), then group and count.

But probably in PL/Perl this could be done  more effectively.




Re: Counting the number of repeated phrases in a column

From
Shaozhong SHI
Date:


On Tue, 25 Jan 2022 at 21:33, Ivan Panchenko <i.panchenko@postgrespro.ru> wrote:

On 26.01.2022 00:21, benj.dev@laposte.net wrote:
> Le 25/01/2022 à 18:10, Shaozhong SHI a écrit :
>> There is a short of a function in the standard Postgres to do the
>> following:
>>
>> It is easy to count the number of occurrence of words, but it is
>> rather difficult to count the number of occurrence of phrases.
>>
>> For instance:
>>
>> A cell of value:  'Hello World' means 1 occurrence a phrase.
>>
>> A cell of value: 'Hello World World Hello' means no occurrence of any
>> repeated phrase.
>>
>> But, A cell of value: 'Hello World World Hello Hello World' means 2
>> occurrences of 'Hello World'.
>>
>> 'The City of London, London' also has no occurrences of any repeated
>> phrase.
>>
>> Anyone has got such a function to check out the number of occurrence
>> of any repeated phrases?
>>
>> Regards,
>>
>> David
>
> Don't know if it's exactly what you want, but you can replace all
> occurence of the phrase in the text by empty string and compute the
> diff between the initial and the result and next divide by the length
> of your phrase.
>
> Example :
> WITH x AS (SELECT 'toto like tata and toto like titi and toto like
> tutu' , 'toto like' phrase)
> SELECT (char_length(texte) - char_length(replace(texte, phrase, '')))
> / char_length(phrase) AS nb_occurence
> FROM x
>
This works if the user knows the phrase. As far as I understood, the
phrase is not known, and user wants to count number of repeats of any
phrases.
Of course this can be done with recursive CTE. Split into words,
generate all phrases (AFAIK requires recursion), then group and count.

But probably in PL/Perl this could be done  more effectively.

Is there an example of using recursive CTE to split a text string into words?

Regards,

David 

Re: Counting the number of repeated phrases in a column

From
benj.dev@laposte.net
Date:
>On Tue, 25 Jan 2022 at 21:33, Ivan Panchenko <i(dot)panchenko(at)postgrespro(dot)ru>
>wrote:
>
>
>>
>> On 26.01.2022 00:21, benj(dot)dev(at)laposte(dot)net wrote:
>> > Le 25/01/2022 à 18:10, Shaozhong SHI a écrit :
>> >> There is a short of a function in the standard Postgres to do the
>> >> following:
>> >>
>> >> It is easy to count the number of occurrence of words, but it is
>> >> rather difficult to count the number of occurrence of phrases.
>> >>
>> >> For instance:
>> >>
>> >> A cell of value:  'Hello World' means 1 occurrence a phrase.
>> >>
>> >> A cell of value: 'Hello World World Hello' means no occurrence of any
>> >> repeated phrase.
>> >>
>> >> But, A cell of value: 'Hello World World Hello Hello World' means 2
>> >> occurrences of 'Hello World'.
>> >>
>> >> 'The City of London, London' also has no occurrences of any repeated
>> >> phrase.
>> >>
>> >> Anyone has got such a function to check out the number of occurrence
>> >> of any repeated phrases?
>> >>
>> >> Regards,
>> >>
>> >> David
>> >
>> > Don't know if it's exactly what you want, but you can replace all
>> > occurence of the phrase in the text by empty string and compute the
>> > diff between the initial and the result and next divide by the length
>> > of your phrase.
>> >
>> > Example :
>> > WITH x AS (SELECT 'toto like tata and toto like titi and toto like
>> > tutu' , 'toto like' phrase)
>> > SELECT (char_length(texte) - char_length(replace(texte, phrase, '')))
>> > / char_length(phrase) AS nb_occurence
>> > FROM x
>> >
>> This works if the user knows the phrase. As far as I understood, the
>> phrase is not known, and user wants to count number of repeats of any
>> phrases.
>> Of course this can be done with recursive CTE. Split into words,
>> generate all phrases (AFAIK requires recursion), then group and count.
>>
>> But probably in PL/Perl this could be done  more effectively.
>>
>
>
>Is there an example of using recursive CTE to split a text string into
>words?
>
>
>Regards,
>
>
>David

Without recursive, a "brutal" solution may be something like

WITH original_text AS (SELECT 'Hello World World Hello my friend Hello World' sentence)
, range_to_search AS (SELECT *, generate_series(1,5) gs FROM original_text) -- 1 is the minimal group of word searched, 5 is the maximal grouped word searched
, x AS (
SELECT r.sentence, gs
, array_to_string((array_agg(rstt.word) OVER (PARTITION BY gs ORDER BY rstt.pos ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING))[1:gs],' ') AS search_words
FROM range_to_search r
LEFT JOIN LATERAL regexp_split_to_table(r.sentence, ' ') WITH ORDINALITY rstt(word,pos) ON true
)
SELECT DISTINCT search_words, (char_length(sentence) - char_length(replace(sentence, search_words, '')))
/ NULLIF(char_length(search_words),0) AS nb_occurence
FROM x

It's also possible to define a minimal number of word accepted

 

Re: Counting the number of repeated phrases in a column

From
"Ivan E. Panchenko"
Date:


On 26.01.2022 11:11, Shaozhong SHI wrote:


On Tue, 25 Jan 2022 at 21:33, Ivan Panchenko <i.panchenko@postgrespro.ru> wrote:

On 26.01.2022 00:21, benj.dev@laposte.net wrote:
> Le 25/01/2022 à 18:10, Shaozhong SHI a écrit :
>> There is a short of a function in the standard Postgres to do the
>> following:
>>
>> It is easy to count the number of occurrence of words, but it is
>> rather difficult to count the number of occurrence of phrases.
>>
>> For instance:
>>
>> A cell of value:  'Hello World' means 1 occurrence a phrase.
>>
>> A cell of value: 'Hello World World Hello' means no occurrence of any
>> repeated phrase.
>>
>> But, A cell of value: 'Hello World World Hello Hello World' means 2
>> occurrences of 'Hello World'.
>>
>> 'The City of London, London' also has no occurrences of any repeated
>> phrase.
>>
>> Anyone has got such a function to check out the number of occurrence
>> of any repeated phrases?
>>
>> Regards,
>>
>> David
>
> Don't know if it's exactly what you want, but you can replace all
> occurence of the phrase in the text by empty string and compute the
> diff between the initial and the result and next divide by the length
> of your phrase.
>
> Example :
> WITH x AS (SELECT 'toto like tata and toto like titi and toto like
> tutu' , 'toto like' phrase)
> SELECT (char_length(texte) - char_length(replace(texte, phrase, '')))
> / char_length(phrase) AS nb_occurence
> FROM x
>
This works if the user knows the phrase. As far as I understood, the
phrase is not known, and user wants to count number of repeats of any
phrases.
Of course this can be done with recursive CTE. Split into words,
generate all phrases (AFAIK requires recursion), then group and count.

But probably in PL/Perl this could be done  more effectively.

Is there an example of using recursive CTE to split a text string into words?

Recursion is not needed for splitting into words. This can be done by regexp_split_to_table function.

But generation of all possible phrases from the given list of words probably requires recursion. On the first step the list of words becomes a list of a single-worded phrases. On each iteration then, you add the next word to each existing phrase, if it is possible (i.e. until the last word is reached).


Regards,

David 
Regards,
Ivan

Re: Counting the number of repeated phrases in a column

From
Shaozhong SHI
Date:


On Tue, 25 Jan 2022 at 17:10, Shaozhong SHI <shishaozhong@gmail.com> wrote:
There is a short of a function in the standard Postgres to do the following:

It is easy to count the number of occurrence of words, but it is rather difficult to count the number of occurrence of phrases.

For instance:

A cell of value:  'Hello World' means 1 occurrence a phrase.

A cell of value: 'Hello World World Hello' means no occurrence of any repeated phrase.

But, A cell of value: 'Hello World World Hello Hello World' means 2 occurrences of 'Hello World'.

'The City of London, London' also has no occurrences of any repeated phrase.

Anyone has got such a function to check out the number of occurrence of any repeated phrases?

Regards,

David

Hi, All Friends,

Whatever.   Can we try to build a regex for   'The City of London London Great London UK ' ?

It could be something like '[\w\s]+[\s-]+[a-z]+[\s-][\s\w]+'.   [\s-]+[a-z]+[\s-] is catered for some people think that 'City of London' is 'City-of-London' or 'City-of-London'.

Regards,

David

Counting the number of repeated phrases in a column

From
"David G. Johnston"
Date:
On Wednesday, January 26, 2022, Shaozhong SHI <shishaozhong@gmail.com> wrote:


Whatever.   Can we try to build a regex for   'The City of London London Great London UK ' ?

Not even if you paid me.  I’d probably die before I succeeded if you tortured me.

David J.
 

Re: Counting the number of repeated phrases in a column

From
Rob Sargent
Date:
On 1/26/22 13:35, Shaozhong SHI wrote:


On Tue, 25 Jan 2022 at 17:10, Shaozhong SHI <shishaozhong@gmail.com> wrote:
There is a short of a function in the standard Postgres to do the following:

It is easy to count the number of occurrence of words, but it is rather difficult to count the number of occurrence of phrases.

For instance:

A cell of value:  'Hello World' means 1 occurrence a phrase.

A cell of value: 'Hello World World Hello' means no occurrence of any repeated phrase.

But, A cell of value: 'Hello World World Hello Hello World' means 2 occurrences of 'Hello World'.

'The City of London, London' also has no occurrences of any repeated phrase.

Anyone has got such a function to check out the number of occurrence of any repeated phrases?

Regards,

David

Hi, All Friends,

Whatever.   Can we try to build a regex for   'The City of London London Great London UK ' ?

It could be something like '[\w\s]+[\s-]+[a-z]+[\s-][\s\w]+'.   [\s-]+[a-z]+[\s-] is catered for some people think that 'City of London' is 'City-of-London' or 'City-of-London'.

Regards,

David
Do you really want "The City of", by itself, to be one of the detected phrases?  eg 'The City of London London Great London UK The City of Liverpool'.

Re: Counting the number of repeated phrases in a column

From
Karsten Hilbert
Date:
Am Wed, Jan 26, 2022 at 08:35:06PM +0000 schrieb Shaozhong SHI:

> Whatever.   Can we try to build a regex for   'The City of London London
> Great London UK ' ?

Would you be so kind as do be more specific about that "we" ?

Best,
Karsten
--
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B



Re: Counting the number of repeated phrases in a column

From
Merlin Moncure
Date:
On Tue, Jan 25, 2022 at 11:10 AM Shaozhong SHI <shishaozhong@gmail.com> wrote:
>
> There is a short of a function in the standard Postgres to do the following:
>
> It is easy to count the number of occurrence of words, but it is rather difficult to count the number of occurrence
ofphrases.
 
>
> For instance:
>
> A cell of value:  'Hello World' means 1 occurrence a phrase.
>
> A cell of value: 'Hello World World Hello' means no occurrence of any repeated phrase.
>
> But, A cell of value: 'Hello World World Hello Hello World' means 2 occurrences of 'Hello World'.
>
> 'The City of London, London' also has no occurrences of any repeated phrase.
>
> Anyone has got such a function to check out the number of occurrence of any repeated phrases?

Let's define phase as a sequence of two or more words, delimited by
space.  you could find it with something like:

with s as (select 'Hello World Hello World' as sentence)
select
  phrase,
  array_upper(string_to_array((select sentence from s), phrase), 1) -
1 as occurrances
from
(
  select array_to_string(x, ' ') as phrase
  from
  (
    select distinct v[a:b]  x
    from regexp_split_to_array((select sentence from s), ' ') v
    cross join lateral generate_series(1, array_upper(v, 1)) a
    cross join lateral generate_series(a + 1, array_upper(v, 1)) b
  ) q
) q;

this would be slow for large sentences obviously, and you'd probably
want to prepare the string stripping some characters and such.

merlin



Re: Counting the number of repeated phrases in a column

From
Merlin Moncure
Date:
On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure <mmoncure@gmail.com> wrote:
>
> with s as (select 'Hello World Hello World' as sentence)
> select
>   phrase,
>   array_upper(string_to_array((select sentence from s), phrase), 1) -
> 1 as occurrances
> from
> (
>   select array_to_string(x, ' ') as phrase
>   from
>   (
>     select distinct v[a:b]  x
>     from regexp_split_to_array((select sentence from s), ' ') v
>     cross join lateral generate_series(1, array_upper(v, 1)) a
>     cross join lateral generate_series(a + 1, array_upper(v, 1)) b
>   ) q
> ) q;

Simplified to:
select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
from regexp_split_to_array('Hello World Hello World', ' ') v
cross join lateral generate_series(1, array_upper(v, 1)) a
cross join lateral generate_series(a + 1, array_upper(v, 1)) b
group by 1;

         phrase          │ occurances
─────────────────────────┼────────────
 World Hello             │          1
 Hello World Hello       │          1
 Hello World             │          2
 Hello World Hello World │          1
 World Hello World       │          1

merlin



Re: Counting the number of repeated phrases in a column

From
Rob Sargent
Date:
On 1/27/22 10:03, Merlin Moncure wrote:
On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure <mmoncure@gmail.com> wrote:
with s as (select 'Hello World Hello World' as sentence)
select  phrase,  array_upper(string_to_array((select sentence from s), phrase), 1) -
1 as occurrances
from
(  select array_to_string(x, ' ') as phrase  from  (    select distinct v[a:b]  x    from regexp_split_to_array((select sentence from s), ' ') v    cross join lateral generate_series(1, array_upper(v, 1)) a    cross join lateral generate_series(a + 1, array_upper(v, 1)) b  ) q
) q;
Simplified to:
select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
from regexp_split_to_array('Hello World Hello World', ' ') v
cross join lateral generate_series(1, array_upper(v, 1)) a
cross join lateral generate_series(a + 1, array_upper(v, 1)) b
group by 1;
         phrase          │ occurances
─────────────────────────┼──────────── World Hello             │          1 Hello World Hello       │          1 Hello World             │          2 Hello World Hello World │          1 World Hello World       │          1

merlin


And since we're looking for repeated phrases maybe add
having count(*) > 1

Re: Counting the number of repeated phrases in a column

From
Merlin Moncure
Date:
On Thu, Jan 27, 2022 at 11:09 AM Rob Sargent <robjsargent@gmail.com> wrote:
>
> On 1/27/22 10:03, Merlin Moncure wrote:
>
> On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure <mmoncure@gmail.com> wrote:
>
> with s as (select 'Hello World Hello World' as sentence)
> select
>   phrase,
>   array_upper(string_to_array((select sentence from s), phrase), 1) -
> 1 as occurrances
> from
> (
>   select array_to_string(x, ' ') as phrase
>   from
>   (
>     select distinct v[a:b]  x
>     from regexp_split_to_array((select sentence from s), ' ') v
>     cross join lateral generate_series(1, array_upper(v, 1)) a
>     cross join lateral generate_series(a + 1, array_upper(v, 1)) b
>   ) q
> ) q;
>
> Simplified to:
> select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
> from regexp_split_to_array('Hello World Hello World', ' ') v
> cross join lateral generate_series(1, array_upper(v, 1)) a
> cross join lateral generate_series(a + 1, array_upper(v, 1)) b
> group by 1;
>
>          phrase          │ occurances
> ─────────────────────────┼────────────
>  World Hello             │          1
>  Hello World Hello       │          1
>  Hello World             │          2
>  Hello World Hello World │          1
>  World Hello World       │          1
>
> merlin
>
>
> And since we're looking for repeated phrases maybe add
>
> having count(*) > 1

thanks.  also, testing on actual data, I noticed that a couple other
things are mandatory, mainly doing a bit of massaging before
tokenizing:

select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
from
(
  select array_agg(t) v
  from
  (
    select trim(replace(unnest(v), E'\n', '')) t
    from regexp_split_to_array(<sentence>, ' ') v
  ) q
  where length(t) > 1
) q
cross join lateral generate_series(1, array_upper(v, 1)) a
cross join lateral generate_series(a + 1, array_upper(v, 1)) b
group by 1
having count(*) > 1;

We are definitely in N^2 space here, so look for things to start
breaking down for sentences > 1000 words.

merlin



Re: Counting the number of repeated phrases in a column

From
benj.dev@laposte.net
Date:

Le 27/01/2022 à 18:35, Merlin Moncure a écrit :
> On Thu, Jan 27, 2022 at 11:09 AM Rob Sargent <robjsargent@gmail.com> wrote:
>>
>> On 1/27/22 10:03, Merlin Moncure wrote:
>>
>> On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure <mmoncure@gmail.com> wrote:
>>
>> with s as (select 'Hello World Hello World' as sentence)
>> select
>>    phrase,
>>    array_upper(string_to_array((select sentence from s), phrase), 1) -
>> 1 as occurrances
>> from
>> (
>>    select array_to_string(x, ' ') as phrase
>>    from
>>    (
>>      select distinct v[a:b]  x
>>      from regexp_split_to_array((select sentence from s), ' ') v
>>      cross join lateral generate_series(1, array_upper(v, 1)) a
>>      cross join lateral generate_series(a + 1, array_upper(v, 1)) b
>>    ) q
>> ) q;
>>
>> Simplified to:
>> select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
>> from regexp_split_to_array('Hello World Hello World', ' ') v
>> cross join lateral generate_series(1, array_upper(v, 1)) a
>> cross join lateral generate_series(a + 1, array_upper(v, 1)) b
>> group by 1;
>>
>>           phrase          │ occurances
>> ─────────────────────────┼────────────
>>   World Hello             │          1
>>   Hello World Hello       │          1
>>   Hello World             │          2
>>   Hello World Hello World │          1
>>   World Hello World       │          1
>>
>> merlin
>>
>>
>> And since we're looking for repeated phrases maybe add
>>
>> having count(*) > 1
>
> thanks.  also, testing on actual data, I noticed that a couple other
> things are mandatory, mainly doing a bit of massaging before
> tokenizing:
>
> select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
> from
> (
>    select array_agg(t) v
>    from
>    (
>      select trim(replace(unnest(v), E'\n', '')) t
>      from regexp_split_to_array(<sentence>, ' ') v
>    ) q
>    where length(t) > 1
> ) q
> cross join lateral generate_series(1, array_upper(v, 1)) a
> cross join lateral generate_series(a + 1, array_upper(v, 1)) b
> group by 1
> having count(*) > 1;
>
> We are definitely in N^2 space here, so look for things to start
> breaking down for sentences > 1000 words.
>
> merlin
>

(for better complexity) you may search about "Ukkonen suffix tree"
Similar problem as yours :
https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp



Attachment

Re: Counting the number of repeated phrases in a column

From
Merlin Moncure
Date:
On Thu, Jan 27, 2022 at 11:56 AM <benj.dev@laposte.net> wrote:
> Le 27/01/2022 à 18:35, Merlin Moncure a écrit :
> > select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
> > from
> > (
> >    select array_agg(t) v
> >    from
> >    (
> >      select trim(replace(unnest(v), E'\n', '')) t
> >      from regexp_split_to_array(<sentence>, ' ') v
> >    ) q
> >    where length(t) > 1
> > ) q
> > cross join lateral generate_series(1, array_upper(v, 1)) a
> > cross join lateral generate_series(a + 1, array_upper(v, 1)) b
> > group by 1
> > having count(*) > 1;
> >
> > We are definitely in N^2 space here, so look for things to start
> > breaking down for sentences > 1000 words.
> >
> > merlin
> >
>
> (for better complexity) you may search about "Ukkonen suffix tree"
> Similar problem as yours :
> https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp

Yep.  Many problems like this are well solved in imperative languages
and will fit poorly into SQL quase-functional space.  That
implementation could probably be converted to pl/pgsql pretty easily,
or a 'sql + tables' variant as a fun challenge.  It also slightly
exploits the fact that only the most repeated needle is returned,
rather than all of them.

Having the need to have single statement stateless SQL solutions to
interesting problems comes up all the time in common development
practice though for simplicity's sake even if there are better
approaches out there.  It's also fun.

merlin



Re: Counting the number of repeated phrases in a column

From
Shaozhong SHI
Date:


On Thursday, 27 January 2022, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure <mmoncure@gmail.com> wrote:
>
> with s as (select 'Hello World Hello World' as sentence)
> select
>   phrase,
>   array_upper(string_to_array((select sentence from s), phrase), 1) -
> 1 as occurrances
> from
> (
>   select array_to_string(x, ' ') as phrase
>   from
>   (
>     select distinct v[a:b]  x
>     from regexp_split_to_array((select sentence from s), ' ') v
>     cross join lateral generate_series(1, array_upper(v, 1)) a
>     cross join lateral generate_series(a + 1, array_upper(v, 1)) b
>   ) q
> ) q;

Simplified to:
select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
from regexp_split_to_array('Hello World Hello World', ' ') v
cross join lateral generate_series(1, array_upper(v, 1)) a
cross join lateral generate_series(a + 1, array_upper(v, 1)) b
group by 1;

         phrase          │ occurances
─────────────────────────┼────────────
 World Hello             │          1
 Hello World Hello       │          1
 Hello World             │          2
 Hello World Hello World │          1
 World Hello World       │          1

merlin



How about knock unique words into discrete joint up strings?  Then check whether there is any repeated words?
Regards,
David

Re: Counting the number of repeated phrases in a column

From
Shaozhong SHI
Date:


On Tue, 25 Jan 2022 at 17:10, Shaozhong SHI <shishaozhong@gmail.com> wrote:
There is a short of a function in the standard Postgres to do the following:

It is easy to count the number of occurrence of words, but it is rather difficult to count the number of occurrence of phrases.

For instance:

A cell of value:  'Hello World' means 1 occurrence a phrase.

A cell of value: 'Hello World World Hello' means no occurrence of any repeated phrase.

But, A cell of value: 'Hello World World Hello Hello World' means 2 occurrences of 'Hello World'.

'The City of London, London' also has no occurrences of any repeated phrase.

Anyone has got such a function to check out the number of occurrence of any repeated phrases?



How about knock unique words into discrete joint up strings?  Then check whether there is any repeated words?
Regards,
David 

Re: Counting the number of repeated phrases in a column

From
Karsten Hilbert
Date:
Am Tue, Feb 01, 2022 at 11:29:50PM +0000 schrieb Shaozhong SHI:

> How about knock unique words into discrete joint up strings?  Then check
> whether there is any repeated words?

Does it work when you try ?

Karsten
--
GPG  40BE 5B0E C98E 1713 AFA6  5BC0 3BEA AC80 7D4F C89B