Re: daitch_mokotoff module - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Re: daitch_mokotoff module |
Date | |
Msg-id | 20221225130136.oyfh5jcw3xxkvq34@alvherre.pgsql |
In response to | Re: daitch_mokotoff module (Dag Lem <dag@nimrod.no>) |
Responses |
Re: daitch_mokotoff module
Re: daitch_mokotoff module |
List | pgsql-hackers |
Hello

On 2022-Dec-23, Dag Lem wrote:

> It seems to me like you're trying to use soundex coding for something it
> was never designed for.

I'm not trying to use it for anything, actually. I'm just reading the
pages your patch links to, to try and understand how this algorithm can
be best implemented in Postgres.

So I got to this page
https://www.avotaynu.com/soundex.htm
which explains that Daitch figured that it would be best if a letter
that can have two possible encodings would be encoded in both ways:

> 5. If a combination of letters could have two possible sounds, then it
> is coded in both manners. For example, the letters ch can have a soft
> sound such as in Chicago or a hard sound as in Christmas.

which I understand as meaning that a single name returns two possible
encodings, which is why these three names
  Barca Barco Parco
have two possible encodings
  795000 and 794000
which is what your algorithm returns.

In fact, using the word Christmas we do get alternative codes for the
first letter (either 4 or 5), precisely as in Daitch's example:

=# select daitch_mokotoff('christmas');
 daitch_mokotoff
─────────────────
 594364 494364
(1 fila)

and if we take out the ambiguous 'ch', we get a single one:

=# select daitch_mokotoff('ristmas');
 daitch_mokotoff
─────────────────
 943640
(1 fila)

and if we add another 'ch', we get the codes for each possibility at each
position of the ambiguous 'ch':

=# select daitch_mokotoff('christmach');
       daitch_mokotoff
─────────────────────────────
 594365 594364 494365 494364
(1 fila)

So, yes, I'm proposing that we return those as array elements and that
@> is used to match them.

> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
> however if I understand correctly, you want to index names by single
> sounds, linking all alike sounding names to the same soundex code. I
> fail to see how that is useful - if you want to find matches for a name,
> you simply match against all indexed names. If you only consider one
> sound, you won't find all names that match.

Hmm, I think we're saying the same thing, but from opposite points of
view. No, I want each name to return multiple codes, but those multiple
codes should be treated as a multiple-value array of codes, rather than
as a single string of space-separated codes.

> In any case, as explained in the documentation, the implementation is
> intended to be a companion to Full Text Search, thus text is the natural
> representation for the soundex codes.

Hmm, I don't agree with this point. The numbers are representations of
the strings, but they don't necessarily have to be strings themselves.

> BTW Vera 790000 does not match Veras 794000, because they don't sound
> the same (up to the maximum soundex code length).

No, and maybe that's okay because they have different codes. But they
are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
790000 and 794000. (Any Spanish speaker will readily tell you that
neither Vera nor Veras is similar in any way to Borja, but D-M has
chosen to say that each of them matches one of Borja's codes. So they
*are* related, even though indirectly, and as a genealogist you *may* be
interested in getting a match for a person called Vera when looking for
relatives of a person called Veras. And, as a Spanish speaker, that
would make a lot of sense to me.)

Now, it's true that I've chosen to use Spanish names for my silly little
experiment. Maybe this isn't terribly useful as a practical example,
because this algorithm seems to have been designed for Jewish surnames
and perhaps not many (or not any) Jews had Spanish surnames. I don't
know; I'm not a Jew myself (though Noah Gordon tells the tale of a
Spanish Jew called Josep Álvarez in his book "The Winemaker", so I guess
it's not impossible).

Anyway, I suspect if you repeat the experiment with names of other
origins, you'll find pretty much the same results apply there, and that
is the whole reason D-M returns multiple codes and not just one.

Merry Christmas :-)

--
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
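As a rough illustration of the array-based matching proposed above, here is a minimal sketch. It is not part of the patch: the table `person` and the column `dm_codes` are hypothetical, and the codes are typed in by hand from the values quoted in this message rather than computed by `daitch_mokotoff`. Only built-in PostgreSQL features are used: `string_to_array`, the array operators `@>` (contains) and `&&` (overlaps), and a GIN index.

```sql
-- Hypothetical table; codes are copied by hand from the message above.
CREATE TEMP TABLE person (name text, dm_codes text[]);

INSERT INTO person VALUES
    ('Vera',  ARRAY['790000']),
    ('Veras', ARRAY['794000']),
    ('Borja', ARRAY['790000', '794000']);

-- A GIN index on the array column supports both @> and &&.
CREATE INDEX ON person USING gin (dm_codes);

-- Splitting the space-separated text form into an array.
SELECT string_to_array('594364 494364', ' ');
-- {594364,494364}

-- Containment: names whose code array contains Vera's single code.
SELECT name FROM person
 WHERE dm_codes @> ARRAY['790000']
 ORDER BY name;
-- Borja, Vera

-- Overlap: names sharing at least one code with Borja.
SELECT name FROM person
 WHERE dm_codes && (SELECT dm_codes FROM person WHERE name = 'Borja')
 ORDER BY name;
-- Borja, Vera, Veras

-- Vera and Veras have no code in common, so they do not match directly.
SELECT ARRAY['790000'] && ARRAY['794000'];
-- f
```

Note that `@>` requires every code on the right-hand side to be present, while `&&` only requires one shared code, so which operator fits depends on whether the search name itself has several codes.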