Re: daitch_mokotoff module - Mailing list pgsql-hackers

From Dag Lem
Subject Re: daitch_mokotoff module
Date
Msg-id yger0wpuvek.fsf@sid.nimrod.no
Whole thread Raw
In response to Re: daitch_mokotoff module  (Dag Lem <dag@nimrod.no>)
List pgsql-hackers
Dag Lem <dag@nimrod.no> writes:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>
>> On 2022-Dec-23, Alvaro Herrera wrote:
>>
>
> [...]
>
>> I tried downloading a list of surnames from here
>> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
>> pasted that in a text file and \copy'ed it into a table.  Then I ran
>> this query
>>
>> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
>> from apellidos
>> group by daitch_mokotoff(a)
>> order by count(*) desc;
>>
>> so I have a first entry like this
>>
>> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz
>> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
>> daitch_mokotoff │ 784000
>> count           │ 18
>>
>> but then I have a bunch of other entries with the same code 784000 as
>> alternative codes,
>>
>> string_agg      │ Velazco
>> daitch_mokotoff │ 784500 784000
>> count           │ 1
>>
>> string_agg      │ Palacio
>> daitch_mokotoff │ 785000 784000
>> count           │ 1
>>
>> I suppose I need to group these together somehow, and it would make more
>> sense to do that if the values were arrays.
>>
>>
>> If I scroll a bit further down and choose, say, 794000 (a relatively
>> popular one), then I have this
>>
>> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras
>> Peraza Peres Perez Porras Varas Veras
>> daitch_mokotoff │ 794000
>> count           │ 14
>>
>> and looking for that code in the result I also get these three
>>
>> string_agg      │ Barca Barco Parco
>> daitch_mokotoff │ 795000 794000
>> count           │ 3
>>
>> string_agg      │ Borja
>> daitch_mokotoff │ 790000 794000
>> count           │ 1
>>
>> string_agg      │ Borjas
>> daitch_mokotoff │ 794000 794400
>> count           │ 1
>>
>> and then I see that I should also search for possible matches in codes
>> 795000, 790000 and 794400, so that gives me
>>
>> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra
>> Perea Vera
>> daitch_mokotoff │ 790000
>> count           │ 11
>>
>> string_agg      │ Barriga Borge Borrego Burgo Fraga
>> daitch_mokotoff │ 795000
>> count           │ 5
>>
>> string_agg      │ Borjas
>> daitch_mokotoff │ 794000 794400
>> count           │ 1
>>
>> which look closely related (compare "Veras" in the first to "Vera" in
>> the later set.  If you ignore that pseudo-match, you're likely to miss
>> possible family relationships.)
>>
>>
>> I suppose if I were a genealogy researcher, I would be helped by having
>> each of these codes behave as a separate unit, rather than me having to
>> split the string into the several possible contained values.
>
> It seems to me like you're trying to use soundex coding for something it
> was never designed for.
>
> As stated in my previous mail, soundex algorithms are designed to index
> names on some representation of sound, so that alike sounding names with
> different spellings will match, and as shown in the documentation
> example, that is exactly what the implementation facilitates.
>
> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
> however if I understand correctly, you want to index names by single
> sounds, linking all alike sounding names to the same soundex code. I
> fail to see how that is useful - if you want to find matches for a name,
> you simply match against all indexed names. If you only consider one
> sound, you won't find all names that match.
>
> In any case, as explained in the documentation, the implementation is
> intended to be a companion to Full Text Search, thus text is the natural
> representation for the soundex codes.
>
> BTW Vera 790000 does not match Veras 794000, because they don't sound
> the same (up to the maximum soundex code length).
>

I've been sleeping on this, and perhaps the normal use case can just as
well (or better) be covered by the "@>" array operator? I originally
implemented similar functionality using another soundex algorithm more
than a decade ago, and either arrays couldn't be GIN indexed back then,
or I simply missed it. I'll have to get back to this - now it's
Christmas!

Merry Christmas!

Best regards,

Dag Lem



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Force streaming every change in logical decoding
Next
From: Ted Yu
Date:
Subject: Re: Error-safe user functions