Home > mailing lists
Re: daitch_mokotoff module - Mailing list pgsql-hackers

From	Dag Lem
Subject	Re: daitch_mokotoff module
Date	December 21, 2022 09:26:05
Msg-id	ygeh6xp2jma.fsf@sid.nimrod.no Whole thread Raw
In response to	Re: daitch_mokotoff module (Andres Freund <andres@anarazel.de>)
Responses	Re: daitch_mokotoff module
List	pgsql-hackers
Tree view
Hi Andreas,

Thank you for your detailed and constructive review!

I have made a conscientuous effort to address all the issues you point
out, please see comments below.

Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2022-02-03 15:27:32 +0100, Dag Lem wrote:

[...]

> [23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File
> fuzzystrmatch--1.1.sql does not exist.
>
>
>> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
>> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql
>> fuzzystrmatch--1.0--1.1.sql
>>  PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
>
>
> The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
> that. In recent years our approach has been to just keep the "base version" of
> the upgrade script, with extension creation running through the upgrade
> scripts.
>

OK, I have now kept fuzzystrmatch--1.1.sql, and omitted
fuzzystrmatch--1.2.sql

Both the Makefile and meson.build are updated to handle the new files,
including the generated header.

>>
>> +
>> +#include "daitch_mokotoff.h"
>> +
>> +#include "postgres.h"
>
> Postgres policy is that the include of "postgres.h" has to be the first
> include in every .c file.
>
>

OK, fixed.

>> +#include "utils/builtins.h"
>> +#include "mb/pg_wchar.h"
>> +
>> +#include <string.h>
>> +
>> +/* Internal C implementation */
>> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
>> +
>> +
>> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
>> +Datum
>> +daitch_mokotoff(PG_FUNCTION_ARGS)
>> +{
>> +    text       *arg = PG_GETARG_TEXT_PP(0);
>> +    char       *string,
>> +               *tmp_soundex;
>> +    text       *soundex;
>> +
>> +    /*
>> + * The maximum theoretical soundex size is several KB, however in
>> practice
>> +     * anything but contrived synthetic inputs will yield a soundex size of
>> + * less than 100 bytes. We thus allocate and free a temporary work
>> buffer,
>> +     * and return only the actual soundex result.
>> +     */
>> + string = pg_server_to_any(text_to_cstring(arg),
>> VARSIZE_ANY_EXHDR(arg), PG_UTF8);
>> +    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
>
> Seems that just using StringInfo to hold the soundex output would work better
> than a static allocation?
>

OK, fixed.

>
>> +    if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
>
> We imo shouldn't introduce new functions starting with _.
>

OK, fixed. Note that I just followed the existing pattern in
fuzzystrmatch.c there.

[...]

>> +/* Find next node corresponding to code digit, or create a new node. */
>> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
>> + dm_node * node, char code_digit)
>
> PG code style is to have a line break between a function defintion's return
> type and the function name - like you actually do above.
>

OK, fixed. Both pgindent and I must have missed that particular
function.

>> +/* Mapping from ISO8859-1 to upper-case ASCII */
>> +static const char tr_iso8859_1_to_ascii_upper[] =
>> +/*
>> +"`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬
>> ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
>> +*/
>> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~ !
>> ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
>> +
>> +static char
>> +iso8859_1_to_ascii_upper(unsigned char c)
>> +{
>> +    return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
>> +}
>> +
>> +
>> +/* Convert an UTF-8 character to ISO-8859-1.
>> + * Unconvertable characters are returned as '?'.
>> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
>> + */
>> +static char
>> +utf8_to_iso8859_1(char *str, int *ix)
>
> It seems decidedly not great to have custom encoding conversion routines in a
> contrib module. Is there any way we can avoid this?
>

I have now replaced the custom UTF-8 decode with calls to
utf8_to_unicode and pg_utf_mblen, and simplified the subsequent
conversion to ASCII. Hopefully this makes the conversion code more
palatable.

I don't see how the conversion to ASCII could be substantially
simplified further. The conversion maps lowercase and 8 bit ISO8859-1
characters to ASCII via uppercasing, removal of accents, and discarding
of special characters. In addition to that, it maps (the non-ISO8859-1)
Ą, Ę, and Ţ/Ț from the coding chart to [, \, and ]. After this, a simple
O(1) table lookup can be used to retrieve the soundex code tree for a
letter sequence.

>
>> +/* Generate all Daitch-Mokotoff soundex codes for word, separated
>> by space. */
>> +static char *
>> +_daitch_mokotoff(char *word, char *soundex, size_t n)
>> +{
>> +    int            i = 0,
>> +                j;
>> +    int            letter_no = 0;
>> +    int            ix_leaves = 0;
>> +    int            num_nodes = 0,
>> +                num_leaves = 0;
>> +    dm_codes   *codes,
>> +               *next_codes;
>> +    dm_node    *nodes;
>> +    dm_leaves  *leaves;
>> +
>> +    /* First letter. */
>> +    if (!(codes = read_letter(word, &i)))
>> +    {
>> +        /* No encodable character in input. */
>> +        return NULL;
>> +    }
>> +
>> +    /* Allocate memory for node tree. */
>> +    nodes = palloc(sizeof(dm_nodes));
>> +    leaves = palloc(2 * sizeof(dm_leaves));
>
> So this allocates the worst case memory usage, is that right? That's quite a
> bit of memory. Shouldn't nodes be allocated dynamically?
>
> Instead of carefully freeing individual memory allocations, I think it be
> better to create a temporary memory context, allocate the necessary nodes etc
> on demand, and destroy the temporary memory context at the end.
>

Yes, the one-time allocation was intended to cover the worst case memory
usage. This was done to avoid any performance hit incurred by allocating
and deallocating memory for each new node in the soundex code tree.

I have rewritten the bookeeping of nodes in the soundex code tree to use
linked lists, and have followed your advice to use a temporary memory
context for allocation.

I also made an optimization by excluding completed soundex nodes from
the next letter iteration. This seems to offset any allocation overhead
- the performance is more or less the same as before.

>
>> +/* Codes for letter sequence at start of name, before a vowel, and
>> any other. */
>> +static dm_codes codes_0_1_X[2] =
>
> Any reason these aren't all const?
>

No reason why they can't be :-) They are now changed to const.

>
> It's not clear to me where the intended line between the .h and .c file is.
>
>
>> +print <<EOF;
>> +/*
>> + * Types and lookup tables for Daitch-Mokotoff Soundex
>> + *
>
> If we generate the code, why is the generated header included in the commit?
>

This was mainly to have the content available for reference without
having to generate the header. I have removed the file - after the
change you suggest below, the struct declarations are available in the
.c file anyway.

>> +/* Letter in input sequence */
>> +struct dm_letter
>> +{
>> +    char        letter;            /* Present letter in sequence */
>> + struct dm_letter *letters; /* List of possible successive letters
>> */
>> + dm_codes *codes; /* Code sequence(s) for complete sequence */
>> +};
>> +
>> +/* Node in soundex code tree */
>> +struct dm_node
>> +{
>> + int soundex_length; /* Length of generated soundex code */
>> + char soundex[DM_MAX_CODE_DIGITS + 1]; /* Soundex code */
>> + int is_leaf; /* Candidate for complete soundex code */
>> + int last_update; /* Letter number for last update of node */
>> +    char        code_digit;        /* Last code digit, 0 - 9 */
>> +
>> +    /*
>> + * One or two alternate code digits leading to this node. If there
>> are two
>> + * digits, one of them is always an 'X'. Repeated code digits and
>> X' lead
>> +     * back to the same node.
>> +     */
>> +    char        prev_code_digits[2];
>> +    /* One or two alternate code digits moving forward. */
>> +    char        next_code_digits[2];
>> +    /* ORed together code index(es) used to reach current node. */
>> +    int            prev_code_index;
>> +    int            next_code_index;
>> +    /* Nodes branching out from this node. */
>> +    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
>> +};
>> +
>> +typedef struct dm_letter dm_letter;
>> +typedef struct dm_node dm_node;
>
> Why is all this in the generated header? It needs DM_MAX_ALTERNATE_CODES etc,
> but it seems that the structs could just be defined in the .c file.
>

To accomplish this, I had to rearrange the code a bit. The structs are
now all declared in daitch_mokotoff.c, and the generated header is
included inbetween them.

>
>> +# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
>
> What does "adapted" mean here? And what's the path to updating the data?
>

It means that the original soundex coding chart, which is referred to,
has been converted to a machine readable format, with a few
modifications. These modifications are outlined further down in the
comments. I expanded a bit on the comments, hopefully making things
clearer.

I don't think there is much to be said about updating the data - that's
simply a question of modifying the table and regenerating the header
file. It goes without saying that making changes requires an
understanding of the soundex coding, which is explained in the
reference. However if anything should be unclear, please do point out
what should be explained better.

> Greetings,
>
> Andres Freund
>

Thanks again, and a Merry Christmas to you and all the other PostgreSQL
hackers!


Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+    rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..d4ad95c283
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,596 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+#define DM_CODE_DIGITS 6
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    const struct dm_letter *letters;    /* List of possible successive letters */
+    const        dm_codes *codes;    /* Code sequence(s) for complete sequence */
+};
+
+typedef struct dm_letter dm_letter;
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *children[DM_MAX_ALTERNATE_CODES + 1];
+    /* Next node in linked list. Alternating index for each iteration. */
+    struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int    daitch_mokotoff_coding(char *word, StringInfo soundex);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string;
+    StringInfoData soundex;
+    text       *retval;
+    MemoryContext old_ctx,
+                tmp_ctx;
+
+    tmp_ctx = AllocSetContextCreate(CurrentMemoryContext,
+                                    "daitch_mokotoff temporary context",
+                                    ALLOCSET_DEFAULT_SIZES);
+    old_ctx = MemoryContextSwitchTo(tmp_ctx);
+
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    initStringInfo(&soundex);
+
+    if (!daitch_mokotoff_coding(string, &soundex))
+    {
+        /* No encodable characters in input. */
+        MemoryContextSwitchTo(old_ctx);
+        MemoryContextDelete(tmp_ctx);
+        PG_RETURN_NULL();
+    }
+
+    string = pg_any_to_server(soundex.data, soundex.len, PG_UTF8);
+    MemoryContextSwitchTo(old_ctx);
+    retval = cstring_to_text(string);
+    MemoryContextDelete(tmp_ctx);
+
+    PG_RETURN_TEXT_P(retval);
+}
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .children = {NULL},
+    .next = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static const dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+
+        if (first_node[ix_node] == NULL)
+        {
+            first_node[ix_node] = node;
+        }
+        else
+        {
+            last_node[ix_node]->next[ix_node] = node;
+        }
+
+        last_node[ix_node] = node;
+        node->next[ix_node] = NULL;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node *
+find_or_create_child_node(dm_node * parent, char code_digit, StringInfo soundex)
+{
+    dm_node   **nodes;
+    dm_node    *node;
+    int            i;
+
+    for (nodes = parent->children, i = 0; (node = nodes[i]); i++)
+    {
+        if (node->code_digit == code_digit)
+        {
+            /* Found existing child node. Skip completed nodes. */
+            return node->soundex_length < DM_CODE_DIGITS ? node : NULL;
+        }
+    }
+
+    /* Create new child node. */
+    Assert(i < DM_MAX_ALTERNATE_CODES);
+    node = palloc(sizeof(dm_node));
+    nodes[i] = node;
+
+    *node = start_node;
+    memcpy(node->soundex, parent->soundex, sizeof(parent->soundex));
+    node->soundex_length = parent->soundex_length;
+    node->soundex[node->soundex_length++] = code_digit;
+    node->code_digit = code_digit;
+    node->next_code_index = node->prev_code_index;
+
+    if (node->soundex_length < DM_CODE_DIGITS)
+    {
+        return node;
+    }
+    else
+    {
+        /* Append completed soundex code to soundex string. */
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+        return NULL;
+    }
+}
+
+
+/* Update node for next code digit(s). */
+static void
+update_node(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node,
+            int letter_no, int prev_code_index, int next_code_index,
+            const char *next_code_digits, int digit_no,
+            StringInfo soundex)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_child_node(node, next_code_digit, soundex);
+        if (node)
+        {
+            initialize_node(node, letter_no);
+            dirty_nodes[num_dirty_nodes++] = node;
+        }
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(first_node, last_node, dirty_nodes[i], ix_node,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no,
+                        soundex);
+        }
+        else
+        {
+            /* Add incomplete leaf node to linked list. */
+            set_leaf(first_node, last_node, dirty_nodes[i], ix_node);
+        }
+    }
+}
+
+
+/* Update soundex tree leaf nodes. */
+static void
+update_leaves(dm_node * first_node[2], int *ix_node, int letter_no,
+              const dm_codes * codes, const dm_codes * next_codes,
+              StringInfo soundex)
+{
+    int            i,
+                j,
+                code_index;
+    dm_node    *node,
+               *last_node[2];
+    const        dm_code *code,
+               *next_code;
+    int            ix_node_next = (*ix_node + 1) & 1;    /* Alternating index: 0, 1 */
+
+    /* Initialize for new linked list of leaves. */
+    first_node[ix_node_next] = NULL;
+    last_node[ix_node_next] = NULL;
+
+    /* Process all nodes. */
+    for (node = first_node[*ix_node]; node; node = node->next[*ix_node])
+    {
+        /* One or two alternate code sequences. */
+        for (i = 0; i < 2 && (code = codes[i]) && code[0][0]; i++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (j = 0; j < 2 && (next_code = next_codes[j]) && next_code[0][0]; j++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                update_node(first_node, last_node, node, ix_node_next,
+                            letter_no, prev_code_index, code_index,
+                            code[code_index], 0,
+                            soundex);
+            }
+        }
+    }
+
+    *ix_node = ix_node_next;
+}
+
+
+/* Mapping from ISO8859-1 to upper-case ASCII */
+static const char iso8859_1_to_ascii_upper[] =
+/*
+"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬
®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !
?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(unsigned char *str, int *ix)
+{
+    /* Substitute character for skipped code points. */
+    const char    na = '\x1a';
+    pg_wchar    c;
+
+    /* Decode UTF-8 character to ISO 10646 code point. */
+    str += *ix;
+    c = utf8_to_unicode(str);
+    *ix += pg_utf_mblen(str);
+
+    if (c >= (unsigned char) '[' && c <= (unsigned char) ']')
+    {
+        /* ASCII characters [, \, and ] are reserved for Ą, Ę, and Ţ/Ț. */
+        return na;
+    }
+    else if (c < 0x60)
+    {
+        /* Non-lowercase ASCII character. */
+        return c;
+    }
+    else if (c < 0x100)
+    {
+        /* ISO-8859-1 code point, converted to upper-case ASCII. */
+        return iso8859_1_to_ascii_upper[c - 0x60];
+    }
+    else
+    {
+        /* Conversion of non-ASCII characters in the coding chart. */
+        switch (c)
+        {
+            case 0x0104:
+            case 0x0105:
+                /* Ą/ą */
+                return '[';
+            case 0x0118:
+            case 0x0119:
+                /* Ę/ę */
+                return '\\';
+            case 0x0162:
+            case 0x0163:
+            case 0x021A:
+            case 0x021B:
+                /* Ţ/ţ or Ț/ț */
+                return ']';
+            default:
+                return na;
+        }
+    }
+}
+
+
+/* Read next ASCII character, skipping any characters not in [A-\]]. */
+static char
+read_valid_char(char *str, int *ix)
+{
+    char        c;
+
+    while ((c = read_char((unsigned char *) str, ix)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            break;
+        }
+    }
+
+    return c;
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static const dm_codes *
+read_letter(char *str, int *ix)
+{
+    char        c,
+                cmp;
+    int            i,
+                j;
+    const        dm_letter *letters;
+    const        dm_codes *codes;
+
+    /* First letter in sequence. */
+    if (!(c = read_valid_char(str, ix)))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    i = *ix;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = read_valid_char(str, &i)))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
+static int
+daitch_mokotoff_coding(char *word, StringInfo soundex)
+{
+    int            i = 0;
+    int            letter_no = 0;
+    int            ix_node = 0;
+    const        dm_codes *codes,
+               *next_codes;
+    dm_node    *first_node[2],
+               *node;
+
+    /* First letter. */
+    if (!(codes = read_letter(word, &i)))
+    {
+        /* No encodable character in input. */
+        return 0;
+    }
+
+    /* Starting point. */
+    first_node[ix_node] = palloc(sizeof(dm_node));
+    *first_node[ix_node] = start_node;
+
+    /*
+     * Loop until either the word input is exhausted, or all generated soundex
+     * codes are completed to six digits.
+     */
+    while (codes && first_node[ix_node])
+    {
+        next_codes = read_letter(word, &i);
+
+        /* Update leaf nodes. */
+        update_leaves(first_node, &ix_node, letter_no,
+                      codes, next_codes ? next_codes : end_codes,
+                      soundex);
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Append all remaining (incomplete) soundex codes. */
+    for (node = first_node[ix_node]; node; node = node->next[ix_node])
+    {
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string at the final space. */
+    soundex->len--;
+    soundex->data[soundex->len] = '\0';
+
+    return 1;
+}
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..807b5fb8c5
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,260 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+die "Usage: $0 OUTPUT_FILE\n" if @ARGV != 1;
+my $output_file = $ARGV[0];
+
+# Open the output file
+open my $OUTPUT, '>', $output_file
+  or die "Could not open output file $output_file: $!\n";
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character ("start of a name") can never yield more than two alternate codes,
+    # and is not considered here.
+    if (@codes > 1) {
+        for my $j (1..2) {  # Codes for "before a vowel" and "any other"
+            for my $i (0..1) {  # Alternate codes
+                # Identical code digits for adjacent letters are collapsed.
+                # For each possible non-transition due to code digit
+                # collapsing, find all alternate transitions.
+                my ($present, $next) = ($codes[$i][$j], $codes[($i + 1)%2][$j]);
+                next if length($present) != 1;
+                $next = $present ne substr($next, 0, 1) ? substr($next, 0, 1) : substr($next, -1, 1);
+                $alternates{$present}{$next} = 1;
+            }
+        }
+    }
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Add alternates by following transitions to 'X' (not coded).
+my $alt_x = $alternates{"X"};
+delete $alt_x->{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+
+# Find the maximum number of alternate codes in one position.
+# Add two for any additional final code digit transitions.
+my $max_alt = (sort { $b <=> $a } (map { scalar keys %$_ } values %alternates))[0] + 2;
+
+print $OUTPUT <<EOF;
+/*
+ * Constants and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#define DM_MAX_ALTERNATE_CODES $max_alt
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print $OUTPUT "static const dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print $OUTPUT <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print $OUTPUT "static const dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print $OUTPUT "$_,\n";
+    }
+    print $OUTPUT "\t{\n\t\t'\\0'\n\t}\n";
+    print $OUTPUT "};\n";
+}
+
+hash2code($table, '');
+
+close $OUTPUT;
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# The conversion from the coding chart to the table should be self
+# explanatory, but note the differences stated below.
+#
+# X = NC (not coded)
+#
+# The non-ASCII letters in the coding chart are coded with substitute
+# lowercase ASCII letters, which sort after the uppercase ASCII letters:
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" does not correspond to the coding chart, however
+# it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+#
+# Note that the implementation assumes that vowels are assigned code
+# 0 or 1. "J" can be either a vowel or a consonant.
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..f62ddad4ee 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ 743000
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ 710000
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ 566600
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 651900 654900 654190 654490 645190 645490 641900 644900
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 059400 054000 054940 054400 045940 045400 049400 044000
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 794575 794574 794750 794740 745750 745740 747500 747400
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..32d8260383
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/meson.build b/contrib/fuzzystrmatch/meson.build
index e6d06149ce..6b4a13694f 100644
--- a/contrib/fuzzystrmatch/meson.build
+++ b/contrib/fuzzystrmatch/meson.build
@@ -1,7 +1,16 @@
 fuzzystrmatch_sources = files(
-  'fuzzystrmatch.c',
+  'daitch_mokotoff.c',
   'dmetaphone.c',
+  'fuzzystrmatch.c',
+)
+
+daitch_mokotoff_h = custom_target('daitch_mokotoff',
+  input: 'daitch_mokotoff_header.pl',
+  output: 'daitch_mokotoff.h',
+  command: [perl, '@INPUT@', '@OUTPUT@'],
 )
+generated_sources += daitch_mokotoff_h
+fuzzystrmatch_sources += daitch_mokotoff_h

 if host_system == 'windows'
   fuzzystrmatch_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
@@ -19,6 +28,7 @@ install_data(
   'fuzzystrmatch.control',
   'fuzzystrmatch--1.0--1.1.sql',
   'fuzzystrmatch--1.1.sql',
+  'fuzzystrmatch--1.1--1.2.sql',
   kwargs: contrib_data_args,
 )

@@ -29,6 +39,7 @@ tests += {
   'regress': {
     'sql': [
       'fuzzystrmatch',
+      'fuzzystrmatch_utf8',
     ],
   },
 }
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..08781778f8 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,101 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should be preferably a single word or name.
+   Any alternative soundex codes are separated by space, which makes the returned
+   text suited for use in Full Text Search, see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>
pgsql-hackers by date:
From: "Drouvot, Bertrand"
Date: 21 December 2022, 09:06:42
Subject: Re: Minimal logical decoding on standbys
From: "Hayato Kuroda (Fujitsu)"
Date: 21 December 2022, 09:50:12
Subject: RE: Force streaming every change in logical decoding
Re: daitch_mokotoff module - Mailing list pgsql-hackers

Previous

Next