Re: daitch_mokotoff module - Mailing list pgsql-hackers
From | Alvaro Herrera |
---|---|
Subject | Re: daitch_mokotoff module |
Date | |
Msg-id | 20221223132559.mauqerlf75d7jnuq@alvherre.pgsql |
In response to | Re: daitch_mokotoff module (Alvaro Herrera <alvherre@alvh.no-ip.org>) |
Responses |
Re: daitch_mokotoff module
|
List | pgsql-hackers |
On 2022-Dec-23, Alvaro Herrera wrote:

> I wonder why do you have it return the multiple alternative codes as a
> space-separated string.  Maybe an array would be more appropriate.  Even
> on your documented example use, the first thing you do is split it on
> spaces.

I tried downloading a list of surnames from here
https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
pasted that in a text file and \copy'ed it into a table.  Then I ran this
query

select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
from apellidos
group by daitch_mokotoff(a)
order by count(*) desc;

so I have a first entry like this

string_agg      │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
daitch_mokotoff │ 784000
count           │ 18

but then I have a bunch of other entries with the same code 784000 as
alternative codes,

string_agg      │ Velazco
daitch_mokotoff │ 784500 784000
count           │ 1

string_agg      │ Palacio
daitch_mokotoff │ 785000 784000
count           │ 1

I suppose I need to group these together somehow, and it would make more
sense to do that if the values were arrays.
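To make the grouping problem concrete, here is a minimal sketch in Python (outside the database) of what treating each alternative code as a separate unit could look like. It is an illustration only, not part of the proposed module: the helper name group_by_code is hypothetical, and the sample rows are copied from the query output above.

```python
from collections import defaultdict


def group_by_code(coded_names):
    """Map each individual code to the sorted list of names carrying it.

    coded_names: iterable of (name, codes) pairs, where codes is a
    space-separated string such as "784500 784000".
    (Hypothetical helper, written only for this illustration.)
    """
    groups = defaultdict(list)
    for name, codes in coded_names:
        # Split the space-separated string so that every alternative
        # code is handled as its own grouping key.
        for code in codes.split():
            groups[code].append(name)
    return {code: sorted(names) for code, names in groups.items()}


# A few (surname, codes) pairs taken from the output shown above.
rows = [
    ("Velasco", "784000"),
    ("Velez", "784000"),
    ("Velazco", "784500 784000"),
    ("Palacio", "785000 784000"),
]

groups = group_by_code(rows)
print(groups["784000"])  # ['Palacio', 'Velasco', 'Velazco', 'Velez']
```

With the codes split this way, Velazco and Palacio land in the same 784000 group as Velasco and Velez instead of forming one-row groups of their own.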
If I scroll a bit further down and choose, say, 794000 (a relatively
popular one), then I have this

string_agg      │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras Peraza Peres Perez Porras Varas Veras
daitch_mokotoff │ 794000
count           │ 14

and looking for that code in the result I also get these three

string_agg      │ Barca Barco Parco
daitch_mokotoff │ 795000 794000
count           │ 3

string_agg      │ Borja
daitch_mokotoff │ 790000 794000
count           │ 1

string_agg      │ Borjas
daitch_mokotoff │ 794000 794400
count           │ 1

and then I see that I should also search for possible matches in codes
795000, 790000 and 794400, so that gives me

string_agg      │ Baria Baro Barrio Barro Berra Borra Feria Para Parra Perea Vera
daitch_mokotoff │ 790000
count           │ 11

string_agg      │ Barriga Borge Borrego Burgo Fraga
daitch_mokotoff │ 795000
count           │ 5

string_agg      │ Borjas
daitch_mokotoff │ 794000 794400
count           │ 1

which look closely related (compare "Veras" in the first to "Vera" in
the later set.  If you ignore that pseudo-match, you're likely to miss
possible family relationships.)

I suppose if I were a genealogy researcher, I would be helped by having
each of these codes behave as a separate unit, rather than me having to
split the string into the several possible contained values.

-- 
Álvaro Herrera        48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."                                      (E. Dijkstra)