Re: daitch_mokotoff module - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: daitch_mokotoff module
Date
Msg-id 20221223132559.mauqerlf75d7jnuq@alvherre.pgsql
In response to Re: daitch_mokotoff module  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: daitch_mokotoff module  (Dag Lem <dag@nimrod.no>)
List pgsql-hackers
On 2022-Dec-23, Alvaro Herrera wrote:

> I wonder why do you have it return the multiple alternative codes as a
> space-separated string.  Maybe an array would be more appropriate.  Even
> on your documented example use, the first thing you do is split it on
> spaces.

I tried this with a list of surnames from
https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
which I pasted into a text file and \copy'ed into a table.  Then I ran
this query

select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
from apellidos
group by daitch_mokotoff(a)
order by count(*) desc;

so I have a first entry like this

string_agg      │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
daitch_mokotoff │ 784000
count           │ 18

but then I have a bunch of other entries that carry the same code 784000
as one of their alternative codes,

string_agg      │ Velazco
daitch_mokotoff │ 784500 784000
count           │ 1

string_agg      │ Palacio
daitch_mokotoff │ 785000 784000
count           │ 1

I suppose I need to group these together somehow, and it would make more
sense to do that if the values were arrays.
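
With the current space-separated output, one way to get that grouping is
to split the string and unnest it, so that every code becomes its own
row.  This is only an untested sketch; it assumes the same "apellidos"
table with its text column "a" as in the query above, and that
daitch_mokotoff() keeps returning a space-separated string.

```sql
-- Sketch: one output row per individual code rather than per code string.
-- A surname with two alternative codes is counted under both of them.
select code,
       string_agg(a, ' ' order by a) as surnames,
       count(*)
from apellidos,
     unnest(string_to_array(daitch_mokotoff(a), ' ')) as code
group by code
order by count(*) desc;
```

If the function returned an array, the string_to_array() call would
simply go away.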


If I scroll a bit further down and choose, say, 794000 (a relatively
popular one), then I have this

string_agg      │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras Peraza Peres Perez Porras Varas Veras
daitch_mokotoff │ 794000
count           │ 14

and looking for that code in the result I also get these three

string_agg      │ Barca Barco Parco
daitch_mokotoff │ 795000 794000
count           │ 3

string_agg      │ Borja
daitch_mokotoff │ 790000 794000
count           │ 1

string_agg      │ Borjas
daitch_mokotoff │ 794000 794400
count           │ 1

and then I see that I should also search for possible matches in codes
795000, 790000 and 794400, so that gives me

string_agg      │ Baria Baro Barrio Barro Berra Borra Feria Para Parra Perea Vera
daitch_mokotoff │ 790000
count           │ 11

string_agg      │ Barriga Borge Borrego Burgo Fraga
daitch_mokotoff │ 795000
count           │ 5

string_agg      │ Borjas
daitch_mokotoff │ 794000 794400
count           │ 1

which look closely related (compare "Veras" in the first set to "Vera"
in the later one).  If you ignore that pseudo-match, you're likely to
miss possible family relationships.
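
The manual lookups for the extra codes could be folded into a single
query using the array overlap operator (&&).  Again an untested sketch
against the same assumed "apellidos" table; the three codes are the ones
found above.

```sql
-- Sketch: find every surname that has at least one of the given codes
-- among its alternatives.
select a, daitch_mokotoff(a)
from apellidos
where string_to_array(daitch_mokotoff(a), ' ')
      && array['795000', '790000', '794400']
order by a;
```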


I suppose if I were a genealogy researcher, I would be helped by having
each of these codes behave as a separate unit, rather than me having to
split the string into the several possible contained values.
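
To illustrate what I mean by a separate unit: searching by name rather
than by code would look something like the following, treating the codes
of each name as a set and matching on any overlap.  This is a sketch
under the same assumptions as before ("apellidos" table, column "a",
space-separated return value); with an array return type the two
string_to_array() calls would not be needed.

```sql
-- Sketch: surnames sharing at least one code with 'Borjas', which per
-- the output above has the codes 794000 and 794400.
select a
from apellidos
where string_to_array(daitch_mokotoff(a), ' ')
      && string_to_array(daitch_mokotoff('Borjas'), ' ')
order by a;
```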

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."                                      (E. Dijkstra)


