Thread: Improving docs for strict_word_similarity()

Improving docs for strict_word_similarity()

From
Bruce Momjian
Date:
While creating the release notes, I was confused by the description for
strict_word_similarity(), particularly "extent boundaries".  The
attached patch clarifies, at least for me, how word_similarity() and
strict_word_similarity() differ.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Attachment

Re: Improving docs for strict_word_similarity()

From
Alexander Korotkov
Date:
Hi, Bruce!

On Sat, May 26, 2018 at 7:56 PM Bruce Momjian <bruce@momjian.us> wrote:
While creating the release notes, I was confused by the description for
strict_word_similarity(), particularly "extent boundaries".  The
attached patch clarifies, at least for me, how word_similarity() and
strict_word_similarity() differ.

Thank you for your efforts on improving documentation of pg_trgm.
However, I don't think all of them are correct.  I have the following notes regarding
the edits you propose.

--- 112,119 ----
        </entry>
        <entry><type>real</type></entry>
        <entry>
!        Same as <function>word_similarity(text, text)</function>, but
!        considers the set of trigrams to be of the same length.
        </entry>
       </row>
       <row>

This description doesn't look correct.  In short, strict_word_similarity() searches
for the extent of words in the second string that best matches the first string.
So, this function takes care to use whole words from the second string,
not parts of words.  However, this is not a matter of the length of the trigram sets.
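
To make the "whole words" point concrete, here is a minimal Python sketch (not
pg_trgm's actual code) that lists the candidate extents for the string
'two words'.  The helper names word_trigrams and whole_word_extents are
hypothetical, and the tokenizing regex is an assumption about how words are
split; the padding (two leading spaces, one trailing space) follows the
trigram sets shown in the pg_trgm documentation.

```python
import re


def word_trigrams(word):
    """Trigrams of one word, padded with two leading spaces and one trailing space."""
    padded = "  " + word.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def whole_word_extents(text):
    """Every contiguous run of whole words in `text`, as lists of words."""
    # Assumption: a word is a run of alphanumeric characters.
    words = re.findall(r"[^\W_]+", text)
    return [words[i:j]
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]


# Each candidate extent is made of whole words only; no extent stops mid-word.
for extent in whole_word_extents("two words"):
    trigrams = [t for w in extent for t in word_trigrams(w)]
    print(extent, trigrams)
```

In this model the candidates differ in which whole words they cover, not in
how many trigrams they are allowed to contain.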

--- 164,182 ----
     This function returns a value that can be approximately understood as the
     greatest similarity between the first string and any substring of the second
     string.  However, this function does not add padding to the boundaries of
!    the extent.  Thus, the number of additional characters present in the
!    second string is not considered, except for the mismatched word boundary.
    </para>

This looks correct to me.

!    The function <function>strict_word_similarity(text, text)</function>
!    does consider additional characters in the second string.  In the
!    example above, <function>strict_word_similarity(text, text)</function>
!    would use the full trigram for the second string when computing
!    similarity, not just the part of the trigram that matches the
!    first string. For example, it would use the <literal>{" w","
!    wo","wor","ord","rds","ds "}</literal>, which corresponds to the whole
!    word <literal>'words'</literal>.

After your edits, it looks like strict_word_similarity() matches the full
set of first string trigrams against the full set of second string trigrams.  However,
this is a description of just the similarity() function.  Actually,
strict_word_similarity() matches the set of trigrams of the first string against
the set of trigrams of a continuous extent of whole words in the second string.
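
The distinction between the three functions can be sketched as a brute-force
Python model.  This is a simplification under stated assumptions, not the
pg_trgm implementation: the function names ending in _model are hypothetical,
the tokenizing regex is an assumption, and the real code does not enumerate
every extent the way this does.  For the documentation example
('word' against 'two words') the model gives 4/11, 0.8 and 4/7.

```python
import re


def word_trigrams(word):
    """Trigrams of one word, padded with two leading spaces and one trailing space."""
    padded = "  " + word.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def split_words(text):
    # Assumption: a word is a run of alphanumeric characters.
    return re.findall(r"[^\W_]+", text)


def trigrams(text):
    """Ordered list of trigrams for all words of `text`."""
    return [t for w in split_words(text) for t in word_trigrams(w)]


def jaccard(a, b):
    """Shared trigrams divided by distinct trigrams in either set."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a and b else 0.0


def similarity_model(s1, s2):
    # Full trigram set of s1 against full trigram set of s2.
    return jaccard(trigrams(s1), trigrams(s2))


def word_similarity_model(s1, s2):
    # Best match over ANY contiguous extent of s2's ordered trigrams,
    # so an extent may end in the middle of a word.
    first, ordered = trigrams(s1), trigrams(s2)
    return max((jaccard(first, ordered[i:j])
                for i in range(len(ordered))
                for j in range(i + 1, len(ordered) + 1)), default=0.0)


def strict_word_similarity_model(s1, s2):
    # Best match over contiguous extents made of WHOLE words of s2.
    first = trigrams(s1)
    per_word = [word_trigrams(w) for w in split_words(s2)]
    best = 0.0
    for i in range(len(per_word)):
        extent = []
        for j in range(i, len(per_word)):
            extent += per_word[j]
            best = max(best, jaccard(first, extent))
    return best


print(round(similarity_model("word", "two words"), 6))              # 0.363636
print(round(word_similarity_model("word", "two words"), 6))         # 0.8
print(round(strict_word_similarity_model("word", "two words"), 6))  # 0.571429
```

The strict variant scores lower here because its best extent must include all
six trigrams of 'words', while the non-strict variant can stop after "ord".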

--- 189,197 ----
  
    <para>
     Thus, the <function>strict_word_similarity(text, text)</function> function
!    is useful for finding the similarity to whole words, while
     <function>word_similarity(text, text)</function> is more suitable for
!    finding the similarity for parts of words.
    </para>

This also looks correct to me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Improving docs for strict_word_similarity()

From
Alexander Korotkov
Date:
On Fri, Jun 1, 2018 at 6:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, May 26, 2018 at 7:56 PM Bruce Momjian <bruce@momjian.us> wrote:
>>
>> While creating the release notes, I was confused by the description for
>> strict_word_similarity(), particularly "extent boundaries".  The
>> attached patch clarifies, at least for me, how word_similarity() and
>> strict_word_similarity() differ.
>
>
> Thank you for your efforts on improving documentation of pg_trgm.
> However, I don't think all of them are correct.  I have the following notes regarding
> the edits you propose.

I've edited the places that looked incorrect to me.  I tried to do my
best to make them as clear as possible.  Bruce, could you please
take a look at them?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: Improving docs for strict_word_similarity()

From
Bruce Momjian
Date:
On Fri, Jun  1, 2018 at 06:39:11PM +0300, Alexander Korotkov wrote:
> Hi, Bruce!
> 
> On Sat, May 26, 2018 at 7:56 PM Bruce Momjian <bruce@momjian.us> wrote:
> 
>     While creating the release notes, I was confused by the description for
>     strict_word_similarity(), particularly "extent boundaries".  The
>     attached patch clarifies, at least for me, how word_similarity() and
>     strict_word_similarity() differ.
> 
> 
> Thank you for your efforts on improving documentation of pg_trgm.
> However, I don't think all of them are correct.  I have the following notes regarding
> the edits you propose.

Yes, I realize my version was wrong.  Yours looks much better and adds
what is needed.  Thanks.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com


Re: Improving docs for strict_word_similarity()

From
Alexander Korotkov
Date:
On Wed, Jun 13, 2018 at 5:57 PM Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Jun  1, 2018 at 06:39:11PM +0300, Alexander Korotkov wrote:
> > On Sat, May 26, 2018 at 7:56 PM Bruce Momjian <bruce@momjian.us> wrote:
> >
> >     While creating the release notes, I was confused by the description for
> >     strict_word_similarity(), particularly "extent boundaries".  The
> >     attached patch clarifies, at least for me, how word_similarity() and
> >     strict_word_similarity() differ.
> >
> >
> > Thank you for your efforts on improving documentation of pg_trgm.
> > However, I don't think all of them are correct.  I have the following notes regarding
> > the edits you propose.
>
> Yes, I realize my version was wrong.  Yours looks much better and adds
> what is needed.  Thanks.

Pushed, thanks!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company