Re: daitch_mokotoff module - Mailing list pgsql-hackers
From | Dag Lem |
---|---|
Subject | Re: daitch_mokotoff module |
Date | |
Msg-id | yger0wpuvek.fsf@sid.nimrod.no Whole thread Raw |
In response to | Re: daitch_mokotoff module (Dag Lem <dag@nimrod.no>) |
List | pgsql-hackers |
Dag Lem <dag@nimrod.no> writes: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > >> On 2022-Dec-23, Alvaro Herrera wrote: >> > > [...] > >> I tried downloading a list of surnames from here >> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/ >> pasted that in a text file and \copy'ed it into a table. Then I ran >> this query >> >> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*) >> from apellidos >> group by daitch_mokotoff(a) >> order by count(*) desc; >> >> so I have a first entry like this >> >> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz >> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas >> daitch_mokotoff │ 784000 >> count │ 18 >> >> but then I have a bunch of other entries with the same code 784000 as >> alternative codes, >> >> string_agg │ Velazco >> daitch_mokotoff │ 784500 784000 >> count │ 1 >> >> string_agg │ Palacio >> daitch_mokotoff │ 785000 784000 >> count │ 1 >> >> I suppose I need to group these together somehow, and it would make more >> sense to do that if the values were arrays. >> >> >> If I scroll a bit further down and choose, say, 794000 (a relatively >> popular one), then I have this >> >> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras >> Peraza Peres Perez Porras Varas Veras >> daitch_mokotoff │ 794000 >> count │ 14 >> >> and looking for that code in the result I also get these three >> >> string_agg │ Barca Barco Parco >> daitch_mokotoff │ 795000 794000 >> count │ 3 >> >> string_agg │ Borja >> daitch_mokotoff │ 790000 794000 >> count │ 1 >> >> string_agg │ Borjas >> daitch_mokotoff │ 794000 794400 >> count │ 1 >> >> and then I see that I should also search for possible matches in codes >> 795000, 790000 and 794400, so that gives me >> >> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra >> Perea Vera >> daitch_mokotoff │ 790000 >> count │ 11 >> >> string_agg │ Barriga Borge Borrego Burgo Fraga >> daitch_mokotoff │ 795000 >> count │ 5 >> >> string_agg │ Borjas >> daitch_mokotoff │ 794000 794400 >> count │ 1 >> >> which look closely related (compare "Veras" in the first to "Vera" in >> the later set. If you ignore that pseudo-match, you're likely to miss >> possible family relationships.) >> >> >> I suppose if I were a genealogy researcher, I would be helped by having >> each of these codes behave as a separate unit, rather than me having to >> split the string into the several possible contained values. > > It seems to me like you're trying to use soundex coding for something it > was never designed for. > > As stated in my previous mail, soundex algorithms are designed to index > names on some representation of sound, so that alike sounding names with > different spellings will match, and as shown in the documentation > example, that is exactly what the implementation facilitates. > > Daitch-Mokotoff Soundex indexes alternative sounds for the same name, > however if I understand correctly, you want to index names by single > sounds, linking all alike sounding names to the same soundex code. I > fail to see how that is useful - if you want to find matches for a name, > you simply match against all indexed names. If you only consider one > sound, you won't find all names that match. > > In any case, as explained in the documentation, the implementation is > intended to be a companion to Full Text Search, thus text is the natural > representation for the soundex codes. > > BTW Vera 790000 does not match Veras 794000, because they don't sound > the same (up to the maximum soundex code length). > I've been sleeping on this, and perhaps the normal use case can just as well (or better) be covered by the "@>" array operator? I originally implemented similar functionality using another soundex algorithm more than a decade ago, and either arrays couldn't be GIN indexed back then, or I simply missed it. I'll have to get back to this - now it's Christmas! Merry Christmas! Best regards, Dag Lem
pgsql-hackers by date: