Re: daitch_mokotoff module - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: daitch_mokotoff module |
Date | |
Msg-id | 20221207185656.nlaj2o5kcbgbadcp@awork3.anarazel.de Whole thread Raw |
In response to | Re: daitch_mokotoff module (Dag Lem <dag@nimrod.no>) |
Responses |
Re: daitch_mokotoff module
|
List | pgsql-hackers |
Hi, On 2022-02-03 15:27:32 +0100, Dag Lem wrote: > Just some minor adjustments to the patch: > > * Removed call to locale-dependent toupper() > * Cleaned up input normalization This patch currently fails in cfbot, likely because meson.build needs to be adjusted (this didn't exist at the time you submitted this version of the patch): [23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File fuzzystrmatch--1.1.sql does not exist. > -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql > +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql > PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing that. In recent years our approach has been to just keep the "base version" of the upgrade script, with extension creation running through the upgrade scripts. > > + > +#include "daitch_mokotoff.h" > + > +#include "postgres.h" Postgres policy is that the include of "postgres.h" has to be the first include in every .c file. > +#include "utils/builtins.h" > +#include "mb/pg_wchar.h" > + > +#include <string.h> > + > +/* Internal C implementation */ > +static char *_daitch_mokotoff(char *word, char *soundex, size_t n); > + > + > +PG_FUNCTION_INFO_V1(daitch_mokotoff); > +Datum > +daitch_mokotoff(PG_FUNCTION_ARGS) > +{ > + text *arg = PG_GETARG_TEXT_PP(0); > + char *string, > + *tmp_soundex; > + text *soundex; > + > + /* > + * The maximum theoretical soundex size is several KB, however in practice > + * anything but contrived synthetic inputs will yield a soundex size of > + * less than 100 bytes. We thus allocate and free a temporary work buffer, > + * and return only the actual soundex result. > + */ > + string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8); > + tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS); Seems that just using StringInfo to hold the soundex output would work better than a static allocation? > + if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS)) We imo shouldn't introduce new functions starting with _. > +/* Mark soundex code tree node as leaf. */ > +static void > +set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node) > +{ > + if (!node->is_leaf) > + { > + node->is_leaf = 1; > + leaves_next[(*num_leaves_next)++] = node; > + } > +} > + > + > +/* Find next node corresponding to code digit, or create a new node. */ > +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes, > + dm_node * node, char code_digit) PG code style is to have a line break between a function defintion's return type and the function name - like you actually do above. > +/* Mapping from ISO8859-1 to upper-case ASCII */ > +static const char tr_iso8859_1_to_ascii_upper[] = > +/* > +"`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ" > +*/ > +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~ ! ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY"; > + > +static char > +iso8859_1_to_ascii_upper(unsigned char c) > +{ > + return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c; > +} > + > + > +/* Convert an UTF-8 character to ISO-8859-1. > + * Unconvertable characters are returned as '?'. > + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț. > + */ > +static char > +utf8_to_iso8859_1(char *str, int *ix) It seems decidedly not great to have custom encoding conversion routines in a contrib module. Is there any way we can avoid this? > +/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */ > +static char * > +_daitch_mokotoff(char *word, char *soundex, size_t n) > +{ > + int i = 0, > + j; > + int letter_no = 0; > + int ix_leaves = 0; > + int num_nodes = 0, > + num_leaves = 0; > + dm_codes *codes, > + *next_codes; > + dm_node *nodes; > + dm_leaves *leaves; > + > + /* First letter. */ > + if (!(codes = read_letter(word, &i))) > + { > + /* No encodable character in input. */ > + return NULL; > + } > + > + /* Allocate memory for node tree. */ > + nodes = palloc(sizeof(dm_nodes)); > + leaves = palloc(2 * sizeof(dm_leaves)); So this allocates the worst case memory usage, is that right? That's quite a bit of memory. Shouldn't nodes be allocated dynamically? Instead of carefully freeing individual memory allocations, I think it be better to create a temporary memory context, allocate the necessary nodes etc on demand, and destroy the temporary memory context at the end. > +/* Codes for letter sequence at start of name, before a vowel, and any other. */ > +static dm_codes codes_0_1_X[2] = Any reason these aren't all const? It's not clear to me where the intended line between the .h and .c file is. > +print <<EOF; > +/* > + * Types and lookup tables for Daitch-Mokotoff Soundex > + * If we generate the code, why is the generated header included in the commit? > +/* Letter in input sequence */ > +struct dm_letter > +{ > + char letter; /* Present letter in sequence */ > + struct dm_letter *letters; /* List of possible successive letters */ > + dm_codes *codes; /* Code sequence(s) for complete sequence */ > +}; > + > +/* Node in soundex code tree */ > +struct dm_node > +{ > + int soundex_length; /* Length of generated soundex code */ > + char soundex[DM_MAX_CODE_DIGITS + 1]; /* Soundex code */ > + int is_leaf; /* Candidate for complete soundex code */ > + int last_update; /* Letter number for last update of node */ > + char code_digit; /* Last code digit, 0 - 9 */ > + > + /* > + * One or two alternate code digits leading to this node. If there are two > + * digits, one of them is always an 'X'. Repeated code digits and 'X' lead > + * back to the same node. > + */ > + char prev_code_digits[2]; > + /* One or two alternate code digits moving forward. */ > + char next_code_digits[2]; > + /* ORed together code index(es) used to reach current node. */ > + int prev_code_index; > + int next_code_index; > + /* Nodes branching out from this node. */ > + struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1]; > +}; > + > +typedef struct dm_letter dm_letter; > +typedef struct dm_node dm_node; Why is all this in the generated header? It needs DM_MAX_ALTERNATE_CODES etc, but it seems that the structs could just be defined in the .c file. > +# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html What does "adapted" mean here? And what's the path to updating the data? Greetings, Andres Freund
pgsql-hackers by date: