Re: Counting the number of repeated phrases in a column - Mailing list pgsql-general

From Merlin Moncure
Subject Re: Counting the number of repeated phrases in a column
Date
Msg-id CAHyXU0w2-jbeQ5HPmJH87gJHwkYCcG=Q0xTOUEuLO1e25wVh-Q@mail.gmail.com
Whole thread Raw
In response to Re: Counting the number of repeated phrases in a column  (benj.dev@laposte.net)
List pgsql-general
On Thu, Jan 27, 2022 at 11:56 AM <benj.dev@laposte.net> wrote:
> Le 27/01/2022 à 18:35, Merlin Moncure a écrit :
> > select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
> > from
> > (
> >    select array_agg(t) v
> >    from
> >    (
> >      select trim(replace(unnest(v), E'\n', '')) t
> >      from regexp_split_to_array(<sentence>, ' ') v
> >    ) q
> >    where length(t) > 1
> > ) q
> > cross join lateral generate_series(1, array_upper(v, 1)) a
> > cross join lateral generate_series(a + 1, array_upper(v, 1)) b
> > group by 1
> > having count(*) > 1;
> >
> > We are definitely in N^2 space here, so look for things to start
> > breaking down for sentences > 1000 words.
> >
> > merlin
> >
>
> (for better complexity) you may search about "Ukkonen suffix tree"
> Similar problem as yours :
> https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp

Yep.  Many problems like this are well solved in imperative languages
and will fit poorly into SQL quase-functional space.  That
implementation could probably be converted to pl/pgsql pretty easily,
or a 'sql + tables' variant as a fun challenge.  It also slightly
exploits the fact that only the most repeated needle is returned,
rather than all of them.

Having the need to have single statement stateless SQL solutions to
interesting problems comes up all the time in common development
practice though for simplicity's sake even if there are better
approaches out there.  It's also fun.

merlin



pgsql-general by date:

Previous
From: Ben Chobot
Date:
Subject: Re: could not open relation with OID
Next
From: Michael Harris
Date:
Subject: Re: Undetected Deadlock