Re: daitch_mokotoff module - Mailing list pgsql-hackers

From Dag Lem
Subject Re: daitch_mokotoff module
Date
Msg-id ygev8m1visd.fsf@sid.nimrod.no
Whole thread Raw
In response to Re: daitch_mokotoff module  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: daitch_mokotoff module  (Dag Lem <dag@nimrod.no>)
Re: daitch_mokotoff module  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> On 2022-Dec-23, Alvaro Herrera wrote:
>

[...]

> I tried downloading a list of surnames from here
> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
> pasted that in a text file and \copy'ed it into a table.  Then I ran
> this query
>
> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
> from apellidos
> group by daitch_mokotoff(a)
> order by count(*) desc;
>
> so I have a first entry like this
>
> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz
> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
> daitch_mokotoff │ 784000
> count           │ 18
>
> but then I have a bunch of other entries with the same code 784000 as
> alternative codes,
>
> string_agg      │ Velazco
> daitch_mokotoff │ 784500 784000
> count           │ 1
>
> string_agg      │ Palacio
> daitch_mokotoff │ 785000 784000
> count           │ 1
>
> I suppose I need to group these together somehow, and it would make more
> sense to do that if the values were arrays.
>
>
> If I scroll a bit further down and choose, say, 794000 (a relatively
> popular one), then I have this
>
> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras
> Peraza Peres Perez Porras Varas Veras
> daitch_mokotoff │ 794000
> count           │ 14
>
> and looking for that code in the result I also get these three
>
> string_agg      │ Barca Barco Parco
> daitch_mokotoff │ 795000 794000
> count           │ 3
>
> string_agg      │ Borja
> daitch_mokotoff │ 790000 794000
> count           │ 1
>
> string_agg      │ Borjas
> daitch_mokotoff │ 794000 794400
> count           │ 1
>
> and then I see that I should also search for possible matches in codes
> 795000, 790000 and 794400, so that gives me
>
> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra
> Perea Vera
> daitch_mokotoff │ 790000
> count           │ 11
>
> string_agg      │ Barriga Borge Borrego Burgo Fraga
> daitch_mokotoff │ 795000
> count           │ 5
>
> string_agg      │ Borjas
> daitch_mokotoff │ 794000 794400
> count           │ 1
>
> which look closely related (compare "Veras" in the first to "Vera" in
> the later set.  If you ignore that pseudo-match, you're likely to miss
> possible family relationships.)
>
>
> I suppose if I were a genealogy researcher, I would be helped by having
> each of these codes behave as a separate unit, rather than me having to
> split the string into the several possible contained values.

It seems to me like you're trying to use soundex coding for something it
was never designed for.

As stated in my previous mail, soundex algorithms are designed to index
names on some representation of sound, so that alike sounding names with
different spellings will match, and as shown in the documentation
example, that is exactly what the implementation facilitates.

Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
however if I understand correctly, you want to index names by single
sounds, linking all alike sounding names to the same soundex code. I
fail to see how that is useful - if you want to find matches for a name,
you simply match against all indexed names. If you only consider one
sound, you won't find all names that match.

In any case, as explained in the documentation, the implementation is
intended to be a companion to Full Text Search, thus text is the natural
representation for the soundex codes.

BTW Vera 790000 does not match Veras 794000, because they don't sound
the same (up to the maximum soundex code length).

Best regards

Dag Lem



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: fixing CREATEROLE
Next
From: Joseph Koshakow
Date:
Subject: Re: Infinite Interval