Re: daitch_mokotoff module - Mailing list pgsql-hackers
From | Dag Lem |
---|---|
Subject | Re: daitch_mokotoff module |
Date | |
Msg-id | ygea630vfai.fsf@sid.nimrod.no |
In response to | Re: daitch_mokotoff module (Alvaro Herrera <alvherre@alvh.no-ip.org>) |
List | pgsql-hackers |
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> Hello
>
> On 2022-Dec-23, Dag Lem wrote:
>

[...]

> So, yes, I'm proposing that we returns those as array elements and that
> @> is used to match them.

Looking into the array operators I guess that to match such arrays
directly one would actually use && (overlaps) rather than @> (contains),
but I digress.

The function is changed to return an array of soundex codes - I hope it
is now to your liking :-)

I also improved on the documentation example (using Full Text Search).
AFAIK you can't make general queries like that using arrays, however in
any case I must admit that text arrays seem like more natural building
blocks than space-delimited text here.

Full Text Search is probably still the best match for Daitch-Mokotoff,
but in any case I've changed the function to return arrays now.

>
>> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
>> however if I understand correctly, you want to index names by single
>> sounds, linking all alike sounding names to the same soundex code. I
>> fail to see how that is useful - if you want to find matches for a name,
>> you simply match against all indexed names. If you only consider one
>> sound, you won't find all names that match.
>
> Hmm, I think we're saying the same thing, but from opposite points of
> view. No, I want each name to return multiple codes, but that those
> multiple codes can be treated as a multiple-value array of codes, rather
> than as a single string of space-separated codes.
>
>> In any case, as explained in the documentation, the implementation is
>> intended to be a companion to Full Text Search, thus text is the natural
>> representation for the soundex codes.
>
> Hmm, I don't agree with this point. The numbers are representations of
> the strings, but they don't necessarily have to be strings themselves.
>
>
>> BTW Vera 790000 does not match Veras 794000, because they don't sound
>> the same (up to the maximum soundex code length).
>
> No, and maybe that's okay because they have different codes. But they
> are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
> 790000 and 794000. (Any Spanish speaker will readily tell you that
> neither Vera nor Veras are similar in any way to Borja, but D-M has
> chosen to say that each of them matches one of Borjas' codes. So they
> *are* related, even though indirectly, and as a genealogist you *may* be
> interested in getting a match for a person called Vera when looking for
> relatives to a person called Veras. And, as a Spanish speaker, that
> would make a lot of sense to me.)
>
>
> Now, it's true that I've chosen to use Spanish names for my silly little
> experiment. Maybe this isn't terribly useful as a practical example,
> because this algorithm seems to have been designed for Jew surnames and
> perhaps not many (or not any) Jews had Spanish surnames. I don't know;
> I'm not a Jew myself (though Noah Gordon tells the tale of a Spanish Jew
> called Josep Álvarez in his book "The Winemaker", so I guess it's not
> impossible). Anyway, I suspect if you repeat the experiment with names
> of other origins, you'll find pretty much the same results apply there,
> and that is the whole reason D-M returns multiple codes and not just
> one.
>
>
> Merry Christmas :-)

-- 
Dag
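
The difference between && (overlaps) and @> (contains) discussed above can be illustrated with a minimal sketch. It uses Python sets as a stand-in for PostgreSQL text arrays and does not call the module itself. The codes are the ones quoted in the thread for Vera, Veras and Borja; the helper names `overlaps` and `contains` are hypothetical and only mimic the operators' semantics.

```python
# Soundex codes as quoted in the thread; sets stand in for text arrays.
codes = {
    "Vera": {"790000"},
    "Veras": {"794000"},
    "Borja": {"790000", "794000"},
}


def overlaps(a, b):
    """Rough analogue of array &&: the two sides share at least one code."""
    return not a.isdisjoint(b)


def contains(a, b):
    """Rough analogue of array @>: the left side holds every code on the right."""
    return a >= b


if __name__ == "__main__":
    for left, right in [("Vera", "Borja"), ("Veras", "Borja"), ("Vera", "Veras")]:
        print(
            left,
            right,
            "overlaps:", overlaps(codes[left], codes[right]),
            "contains:", contains(codes[left], codes[right]),
        )
```

Under these assumptions, Vera and Veras each overlap with Borja but not with each other, while a containment test with Vera on the left and Borja on the right fails, which is why overlap looks like the more natural operator for matching names directly.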