Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug - Mailing list pgsql-bugs

From Alexander Korotkov
Subject Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
Date
Msg-id CAPpHfdsDtQJ+pYfapH7GMt9n4RLByKMzHEBWcPL99C_g+rkUKQ@mail.gmail.com
Whole thread Raw
In response to [BUGS] pg_trgm word_similarity inconsistencies or bug  (Cristiano Coelho <cristianocca@hotmail.com>)
Responses Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug  (Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com>)
List pgsql-bugs
Hi!

I'd like to forward a feedback from our customer who uses word_similarity() function.
François finds current behavior of word_similarity() to be useful.  Thus, I think we should preserve it.  But documentation correction is needed and option for alternative behavior would be useful too.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

---------- Forwarded message ----------
From: François CHAHUNEAU <Francois.CHAHUNEAU@numen.fr>
Date: Wed, Nov 1, 2017 at 1:04 AM
Subject: RE: [BUGS] pg_trgm word_similarity inconsistencies or bug
To: Alexander Korotkov <a.korotkov@postgrespro.ru>
Cc: Thierry BOUDIERE <Thierry.BOUDIERE@numen.fr>, "foli@numen.mg" <foli@numen.mg>


Hello Alexander,

 

We agree that the current pg_trgm documentation does not correctly reflect the de facto behavior of word_similarity(), and that something has to be changed. But to us, it is more a documentation problem than anything else.

 

What is computed  is still « substring_similarity » as was initially specified between us,  but it is influenced by a strong word boundary bias  caused by the way trigrams are padded at word boundaries. This bias was noticed by early reviewers and you explained that this motivated the name switch to « word_similarity ». As you will remember, at the time we discovered this, we were suprised because we considerd this as a slight misnomer.  Indeed, what is  currently described in the 9.6 pg_trgm documentation is inaccurate (although seemingly consistent with this new name)  and has to be amended.

 

Now, word_similarity() has been out for more than a year and, of course, it is preferable to avoid any breaking changes… In our case, we consider the name « unfortunate » and the explanation buggy, not the function itself.

 

As you may remember from the initial discussion, some other users stressed the importance to be able to matchsub strings. We tend to agree with what Jeff Janes wrote in this discussion :

 

The reason I like the option of not treating word boundaries as

special in this case is that often in scientific vocabulary, and in

catalog part numbers, people are pretty inconsistent about whether

they included spaces.  "HEK 293", "HEK293", and "HEK-293" could be all

the same thing.  So I like to strip out spaces and punctuation on both

sides of operator.  Of course I can't do that if there are invisible

un-removable spaces on the substring side.

 

But, It doesn't sound like I am going to win that debate.  Given that,

I don't think we need a different name for the function. I'm fine with

explaining the word-boundary subtlety in the documentation, and

keeping the function name itself simple.

 

 

Now, considering your proposal :

 

As far as we are concerned, we use <% and %>  everyday  for efficient fuzzy matching on large databases. Our typical usage scenario is matching noisy OCRized text strings against reference databases.

 

> 1) Define GUC variable which specifies whether word_similarity() should force extent boundaries to be at word boundaries,

 

Ok for us, iff default behavior remains the same as now, for backward compatibility reasons. We could take advantage, in some cases, of the new « word rounded » behavior controlled by the GUC variable, but this would not cover all scenarios currently in use.


> 2) Document both cases of word_similarity() behavior.


This is clearly needed anyway.

 

Best regards,

 

François CHAHUNEAU
Directeur des technologies

 
NUMEN DIGITAL24, rue Marc Seguin 75018 Paris Francewww.numen.fr
Tel +33 1 40 37 95 03 | Mob +33 6 07 85 21 79 | Fax +33 1 40 37 94 94
Pensez vert, n’imprimez que nécessaire. Les informations contenues dans le présent e-mail sont exclusivement adressées au(x) destinataire(s) de ce message et peuvent contenir des informations confidentielles, protégées par un secret professionnel. L’utilisation de ces informations par d’autres personnes que le(s) destinataire(s) est strictement interdite. Si vous n’êtes pas destinataire de ce message, la publication, la reproduction, la diffusion et /ou la distribution de ces informations auprès de tiers n’est pas autorisée. Si vous avez reçu cet e-mail par erreur, veuillez nous en informer immédiatement, détruire l'email, ses copies et documents joints et le supprimer.
 

De : Alexander Korotkov [mailto:a.korotkov@postgrespro.ru]
Envoyé : mardi 31 octobre 2017 16:18
À : Thierry BOUDIERE <Thierry.BOUDIERE@numen.fr>; François CHAHUNEAU <Francois.CHAHUNEAU@numen.fr>
Objet : Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

 

Dear, Thierry and François!

 

PostgreSQL users found inconsistency between documentation and implementation of word_similarity().

Possible solution proposed by the reporter is to alter the implementation.

But it's important for me that your interests are not affected but potential further change of implementation of word_similarity().

Could you please share your opinion on changes proposed by Jan in the pgsql-bugs mailing list?

 

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

pgsql-bugs by date:

Previous
From: David Rowley
Date:
Subject: Re: [BUGS] BUG #14889: explain analyze is taking much more time thanactual execution
Next
From: Tom Lane
Date:
Subject: Re: [BUGS] BUG #14889: explain analyze is taking much more time than actual execution