Re: daitch_mokotoff module - Mailing list pgsql-hackers

From Andres Freund
Subject Re: daitch_mokotoff module
Date
Msg-id 20221207185656.nlaj2o5kcbgbadcp@awork3.anarazel.de
Whole thread Raw
In response to Re: daitch_mokotoff module  (Dag Lem <dag@nimrod.no>)
Responses Re: daitch_mokotoff module  (Dag Lem <dag@nimrod.no>)
List pgsql-hackers
Hi,

On 2022-02-03 15:27:32 +0100, Dag Lem wrote:
> Just some minor adjustments to the patch:
> 
> * Removed call to locale-dependent toupper()
> * Cleaned up input normalization

This patch currently fails in cfbot, likely because meson.build needs to be
adjusted (this didn't exist at the time you submitted this version of the
patch):

[23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File fuzzystrmatch--1.1.sql does not exist.


> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
>  PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"


The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
that. In recent years our approach has been to just keep the "base version" of
the upgrade script, with extension creation running through the upgrade
scripts.

>  
> +
> +#include "daitch_mokotoff.h"
> +
> +#include "postgres.h"

Postgres policy is that the include of "postgres.h" has to be the first
include in every .c file.


> +#include "utils/builtins.h"
> +#include "mb/pg_wchar.h"
> +
> +#include <string.h>
> +
> +/* Internal C implementation */
> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
> +
> +
> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
> +Datum
> +daitch_mokotoff(PG_FUNCTION_ARGS)
> +{
> +    text       *arg = PG_GETARG_TEXT_PP(0);
> +    char       *string,
> +               *tmp_soundex;
> +    text       *soundex;
> +
> +    /*
> +     * The maximum theoretical soundex size is several KB, however in practice
> +     * anything but contrived synthetic inputs will yield a soundex size of
> +     * less than 100 bytes. We thus allocate and free a temporary work buffer,
> +     * and return only the actual soundex result.
> +     */
> +    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
> +    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);

Seems that just using StringInfo to hold the soundex output would work better
than a static allocation?


> +    if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))

We imo shouldn't introduce new functions starting with _.


> +/* Mark soundex code tree node as leaf. */
> +static void
> +set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
> +{
> +    if (!node->is_leaf)
> +    {
> +        node->is_leaf = 1;
> +        leaves_next[(*num_leaves_next)++] = node;
> +    }
> +}
> +
> +
> +/* Find next node corresponding to code digit, or create a new node. */
> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
> +                                     dm_node * node, char code_digit)

PG code style is to have a line break between a function defintion's return
type and the function name - like you actually do above.




> +/* Mapping from ISO8859-1 to upper-case ASCII */
> +static const char tr_iso8859_1_to_ascii_upper[] =
> +/*
> +"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬
®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
> +*/
> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !
?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
> +
> +static char
> +iso8859_1_to_ascii_upper(unsigned char c)
> +{
> +    return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
> +}
> +
> +
> +/* Convert an UTF-8 character to ISO-8859-1.
> + * Unconvertable characters are returned as '?'.
> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
> + */
> +static char
> +utf8_to_iso8859_1(char *str, int *ix)

It seems decidedly not great to have custom encoding conversion routines in a
contrib module. Is there any way we can avoid this?


> +/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
> +static char *
> +_daitch_mokotoff(char *word, char *soundex, size_t n)
> +{
> +    int            i = 0,
> +                j;
> +    int            letter_no = 0;
> +    int            ix_leaves = 0;
> +    int            num_nodes = 0,
> +                num_leaves = 0;
> +    dm_codes   *codes,
> +               *next_codes;
> +    dm_node    *nodes;
> +    dm_leaves  *leaves;
> +
> +    /* First letter. */
> +    if (!(codes = read_letter(word, &i)))
> +    {
> +        /* No encodable character in input. */
> +        return NULL;
> +    }
> +
> +    /* Allocate memory for node tree. */
> +    nodes = palloc(sizeof(dm_nodes));
> +    leaves = palloc(2 * sizeof(dm_leaves));

So this allocates the worst case memory usage, is that right? That's quite a
bit of memory. Shouldn't nodes be allocated dynamically?

Instead of carefully freeing individual memory allocations, I think it be
better to create a temporary memory context, allocate the necessary nodes etc
on demand, and destroy the temporary memory context at the end.


> +/* Codes for letter sequence at start of name, before a vowel, and any other. */
> +static dm_codes codes_0_1_X[2] =

Any reason these aren't all const?


It's not clear to me where the intended line between the .h and .c file is.


> +print <<EOF;
> +/*
> + * Types and lookup tables for Daitch-Mokotoff Soundex
> + *

If we generate the code, why is the generated header included in the commit?

> +/* Letter in input sequence */
> +struct dm_letter
> +{
> +    char        letter;            /* Present letter in sequence */
> +    struct dm_letter *letters;    /* List of possible successive letters */
> +    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
> +};
> +
> +/* Node in soundex code tree */
> +struct dm_node
> +{
> +    int            soundex_length; /* Length of generated soundex code */
> +    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
> +    int            is_leaf;        /* Candidate for complete soundex code */
> +    int            last_update;    /* Letter number for last update of node */
> +    char        code_digit;        /* Last code digit, 0 - 9 */
> +
> +    /*
> +     * One or two alternate code digits leading to this node. If there are two
> +     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
> +     * back to the same node.
> +     */
> +    char        prev_code_digits[2];
> +    /* One or two alternate code digits moving forward. */
> +    char        next_code_digits[2];
> +    /* ORed together code index(es) used to reach current node. */
> +    int            prev_code_index;
> +    int            next_code_index;
> +    /* Nodes branching out from this node. */
> +    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
> +};
> +
> +typedef struct dm_letter dm_letter;
> +typedef struct dm_node dm_node;

Why is all this in the generated header? It needs DM_MAX_ALTERNATE_CODES etc,
but it seems that the structs could just be defined in the .c file.


> +# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html

What does "adapted" mean here? And what's the path to updating the data?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Transaction timeout
Next
From: Andres Freund
Date:
Subject: Re: Partial aggregates pushdown