Re: Counting the number of repeated phrases in a column - Mailing list pgsql-general

From benj.dev@laposte.net
Subject Re: Counting the number of repeated phrases in a column
Date
Msg-id 1966163298.951317.1643190779781@wlpnf0218
Whole thread Raw
In response to Counting the number of repeated phrases in a column  (Shaozhong SHI <shishaozhong@gmail.com>)
List pgsql-general
>On Tue, 25 Jan 2022 at 21:33, Ivan Panchenko <i(dot)panchenko(at)postgrespro(dot)ru>
>wrote:
>
>
>>
>> On 26.01.2022 00:21, benj(dot)dev(at)laposte(dot)net wrote:
>> > Le 25/01/2022 à 18:10, Shaozhong SHI a écrit :
>> >> There is a short of a function in the standard Postgres to do the
>> >> following:
>> >>
>> >> It is easy to count the number of occurrence of words, but it is
>> >> rather difficult to count the number of occurrence of phrases.
>> >>
>> >> For instance:
>> >>
>> >> A cell of value:  'Hello World' means 1 occurrence a phrase.
>> >>
>> >> A cell of value: 'Hello World World Hello' means no occurrence of any
>> >> repeated phrase.
>> >>
>> >> But, A cell of value: 'Hello World World Hello Hello World' means 2
>> >> occurrences of 'Hello World'.
>> >>
>> >> 'The City of London, London' also has no occurrences of any repeated
>> >> phrase.
>> >>
>> >> Anyone has got such a function to check out the number of occurrence
>> >> of any repeated phrases?
>> >>
>> >> Regards,
>> >>
>> >> David
>> >
>> > Don't know if it's exactly what you want, but you can replace all
>> > occurence of the phrase in the text by empty string and compute the
>> > diff between the initial and the result and next divide by the length
>> > of your phrase.
>> >
>> > Example :
>> > WITH x AS (SELECT 'toto like tata and toto like titi and toto like
>> > tutu' , 'toto like' phrase)
>> > SELECT (char_length(texte) - char_length(replace(texte, phrase, '')))
>> > / char_length(phrase) AS nb_occurence
>> > FROM x
>> >
>> This works if the user knows the phrase. As far as I understood, the
>> phrase is not known, and user wants to count number of repeats of any
>> phrases.
>> Of course this can be done with recursive CTE. Split into words,
>> generate all phrases (AFAIK requires recursion), then group and count.
>>
>> But probably in PL/Perl this could be done  more effectively.
>>
>
>
>Is there an example of using recursive CTE to split a text string into
>words?
>
>
>Regards,
>
>
>David

Without recursive, a "brutal" solution may be something like

WITH original_text AS (SELECT 'Hello World World Hello my friend Hello World' sentence)
, range_to_search AS (SELECT *, generate_series(1,5) gs FROM original_text) -- 1 is the minimal group of word searched, 5 is the maximal grouped word searched
, x AS (
SELECT r.sentence, gs
, array_to_string((array_agg(rstt.word) OVER (PARTITION BY gs ORDER BY rstt.pos ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING))[1:gs],' ') AS search_words
FROM range_to_search r
LEFT JOIN LATERAL regexp_split_to_table(r.sentence, ' ') WITH ORDINALITY rstt(word,pos) ON true
)
SELECT DISTINCT search_words, (char_length(sentence) - char_length(replace(sentence, search_words, '')))
/ NULLIF(char_length(search_words),0) AS nb_occurence
FROM x

It's also possible to define a minimal number of word accepted

 

pgsql-general by date:

Previous
From: Shaozhong SHI
Date:
Subject: Re: Counting the number of repeated phrases in a column
Next
From: "Ivan E. Panchenko"
Date:
Subject: Re: Counting the number of repeated phrases in a column