Re: daitch_mokotoff module - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: daitch_mokotoff module
Date
Msg-id 20221225130136.oyfh5jcw3xxkvq34@alvherre.pgsql
In response to Re: daitch_mokotoff module  (Dag Lem <dag@nimrod.no>)
Responses Re: daitch_mokotoff module  (Dag Lem <dag@nimrod.no>)
Re: daitch_mokotoff module  (Dag Lem <dag@nimrod.no>)
List pgsql-hackers
Hello

On 2022-Dec-23, Dag Lem wrote:

> It seems to me like you're trying to use soundex coding for something it
> was never designed for.

I'm not trying to use it for anything, actually.  I'm just reading the
pages your patch links to, to try and understand how this algorithm can
be best implemented in Postgres.

So I got to this page
https://www.avotaynu.com/soundex.htm
which explains that Daitch figured that it would be best if a letter
that can have two possible encodings would be encoded in both ways:

> 5. If a combination of letters could have two possible sounds, then it
> is coded in both manners. For example, the letters ch can have a soft
> sound such as in Chicago or a hard sound as in Christmas.

which I understand as meaning that a single name returns two possible
encodings, which is why these three names
 Barca Barco Parco
have two possible encodings
 795000 and 794000
which is what your algorithm returns.

In fact, using the word Christmas we do get alternative codes for the first
letter (either 4 or 5), precisely as in Daitch's example:

=# select daitch_mokotoff('christmas');
 daitch_mokotoff 
─────────────────
 594364 494364
(1 row)

and if we take out the ambiguous 'ch', we get a single one:

=# select daitch_mokotoff('ristmas');
 daitch_mokotoff 
─────────────────
 943640
(1 row)

and if we add another 'ch', we get the codes for each possibility at each
position of the ambiguous 'ch':

=# select daitch_mokotoff('christmach');
       daitch_mokotoff       
─────────────────────────────
 594365 594364 494365 494364
(1 row)
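
The doubling seen above is a plain cross product: two ambiguous positions
with two alternatives each give 2 x 2 = 4 codes.  Here is a minimal sketch
in stock SQL, assuming (from reading the output above, not from the patch's
source) that each 'ch' contributes either 5 or 4 and that 9436 is the
unambiguous middle.  The aliases lead_ch and tail_ch are made up for
illustration.

```sql
-- Illustration only: this does not call daitch_mokotoff().
-- Assumption read off the output above: each ambiguous 'ch' codes as
-- 5 or 4, and the unambiguous middle of 'christmach' codes as 9436.
SELECT string_agg(lead_ch.digit || '9436' || tail_ch.digit, ' '
                  ORDER BY lead_ch.pos, tail_ch.pos) AS codes
FROM (VALUES (1, '5'), (2, '4')) AS lead_ch(pos, digit)
CROSS JOIN (VALUES (1, '5'), (2, '4')) AS tail_ch(pos, digit);
```

This yields the same four codes, in the same order, as the query above.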


So, yes, I'm proposing that we return those as array elements and that
@> be used to match them.
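
To make the proposal concrete, here is a small sketch that runs on stock
Postgres without the patch.  It takes the space-separated string shown
above for 'christmas' as a literal, splits it with string_to_array(), and
tests the result with @>.  The column aliases are illustrative only, and
whether the function itself should return text[] is precisely what is
being discussed here.

```sql
-- The literal is the output shown above for 'christmas'; nothing here
-- depends on the daitch_mokotoff() patch being installed.
SELECT string_to_array('594364 494364', ' ') AS codes,
       -- true: one of the two alternatives for 'christmas'
       string_to_array('594364 494364', ' ') @> ARRAY['494364']
           AS contains_494364,
       -- false: the code shown above for 'ristmas' is not among them
       string_to_array('594364 494364', ' ') @> ARRAY['943640']
           AS contains_943640;
```

With a space-separated string the same test needs pattern matching or a
split at query time; with an array it is a single operator.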

> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
> however if I understand correctly, you want to index names by single
> sounds, linking all alike sounding names to the same soundex code. I
> fail to see how that is useful - if you want to find matches for a name,
> you simply match against all indexed names. If you only consider one
> sound, you won't find all names that match.

Hmm, I think we're saying the same thing, but from opposite points of
view.  I do want each name to return multiple codes; I just want those
codes to be treated as a multi-element array rather than as a single
string of space-separated codes.

> In any case, as explained in the documentation, the implementation is
> intended to be a companion to Full Text Search, thus text is the natural
> representation for the soundex codes.

Hmm, I don't agree with this point.  The numbers are representations of
the strings, but they don't necessarily have to be strings themselves.


> BTW Vera 790000 does not match Veras 794000, because they don't sound
> the same (up to the maximum soundex code length).

No, and maybe that's okay because they have different codes.  But they
are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
790000 and 794000.  (Any Spanish speaker will readily tell you that
neither Vera nor Veras is similar in any way to Borja, but D-M has
chosen to say that each of them matches one of Borja's codes.  So they
*are* related, even though indirectly, and as a genealogist you *may* be
interested in getting a match for a person called Vera when looking for
relatives of a person called Veras.  And, as a Spanish speaker, that
would make a lot of sense to me.)
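
The indirect relationship can be shown with array operators.  This is a
sketch only: dm_names is a hypothetical temp table, and the codes are the
ones quoted in this thread, typed in by hand rather than computed.  The &&
operator tests whether two arrays share at least one element.

```sql
-- Hypothetical table; codes are copied from this thread, not computed.
CREATE TEMP TABLE dm_names (name text PRIMARY KEY, codes text[]);
INSERT INTO dm_names VALUES
    ('Vera',  ARRAY['790000']),
    ('Veras', ARRAY['794000']),
    ('Borja', ARRAY['790000', '794000']);

-- Direct matches: pairs of names whose code arrays overlap.
-- Returns (Borja, Vera) and (Borja, Veras), but not (Vera, Veras).
SELECT a.name AS name_a, b.name AS name_b
FROM dm_names a
JOIN dm_names b ON a.name < b.name AND a.codes && b.codes
ORDER BY name_a, name_b;

-- Indirect matches: pairs that share no code themselves, but whose
-- codes both overlap those of a third name.
-- Returns (Vera, Veras, Borja).
SELECT DISTINCT a.name AS name_a, b.name AS name_b, via.name AS via
FROM dm_names a
JOIN dm_names b ON a.name < b.name AND NOT (a.codes && b.codes)
JOIN dm_names via ON via.codes && a.codes AND via.codes && b.codes
ORDER BY name_a, name_b;
```

On a real table, a GIN index on the codes column could serve both @> and
&& lookups; that is a design suggestion, not something the patch provides.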


Now, it's true that I've chosen to use Spanish names for my silly little
experiment.  Maybe this isn't terribly useful as a practical example,
because this algorithm seems to have been designed for Jewish surnames and
perhaps not many (or not any) Jews had Spanish surnames.  I don't know;
I'm not a Jew myself (though Noah Gordon tells the tale of a Spanish Jew
called Josep Álvarez in his book "The Winemaker", so I guess it's not
impossible).  Anyway, I suspect if you repeat the experiment with names
of other origins, you'll find pretty much the same results apply there,
and that is the whole reason D-M returns multiple codes and not just
one.


Merry Christmas :-)

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/


