Re: daitch_mokotoff module - Mailing list pgsql-hackers

From Paul Ramsey
Subject Re: daitch_mokotoff module
Msg-id A2C12929-5B04-4202-95AB-A2970FB5016A@cleverelephant.ca
In response to Re: daitch_mokotoff module  (Dag Lem <dag@nimrod.no>)
Responses Re: daitch_mokotoff module
List pgsql-hackers

> On Jan 12, 2023, at 7:30 AM, Dag Lem <dag@nimrod.no> wrote:
>
> Paul Ramsey <pramsey@cleverelephant.ca> writes:
>
>> On Mon, Jan 2, 2023 at 2:03 PM Dag Lem <dag@nimrod.no> wrote:
>>
>>> I also improved on the documentation example (using Full Text Search).
>>> AFAIK you can't make general queries like that using arrays, however in
>>> any case I must admit that text arrays seem like more natural building
>>> blocks than space delimited text here.
>>
>> This is a fun addition to fuzzystrmatch.
>
> I'm glad to hear it! :-)
>
>>
>> While it's a little late in the game, I'll just put it out there:
>> daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
>> how you feel about that.
>
> I chose the name in order to follow the naming of the other functions in
> fuzzystrmatch, which as far as I can tell are given the name which each
> algorithm is known by.
>
> Personally I don't think it's worth it to deviate from the naming of the
> other functions just to avoid typing a few characters, and I certainly
> don't think daitch_mokotoff is any harder to get right than
> levenshtein_less_equal ;-)

Good points :)

>
>>
>> On the documentation, I found the leap directly into the tsquery
>> example a bit too big. Maybe start with a very simple example,
>>
>> --
>> dm=# SELECT daitch_mokotoff('Schwartzenegger'),
>>            daitch_mokotoff('Swartzenegger');
>>
>> daitch_mokotoff | daitch_mokotoff
>> -----------------+-----------------
>> {479465}        | {479465}
>> --
>>
>> Then transition into a more complex example that illustrates the GIN
>> index technique you mention in the text, but do not show:
>>
>> --
>> CREATE TABLE dm_gin (source text, dm text[]);
>>
>> INSERT INTO dm_gin (source) VALUES
>>    ('Swartzenegger'),
>>    ('John'),
>>    ('James'),
>>    ('Steinman'),
>>    ('Steinmetz');
>>
>> UPDATE dm_gin SET dm = daitch_mokotoff(source);
>>
>> CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);
>>
>> SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
>> --
>
> Sure, I can do that. You don't think this much example text will be
> TL;DR?

I can only speak for myself, but examples are the meat of documentation
learning, so as long as they come with enough explanatory context to be
legible, it's worth having them, IMO.

>
>>
>> And only then go into the tsearch example. Incidentally, what does the
>> tsearch approach provide that the simple GIN approach does not?
>
> The example shows how to do a simultaneous match on first AND last
> names, where the first and last names (any number of names) are stored
> in the same indexed column, and the order of the names in the index and
> the search term does not matter.
>
> If you were to use the GIN "&&" operator, you would get a match if
> either the first OR the last name matches. If you were to use the GIN
> "@>" operator, you would *not* get a match if the search term contains
> more soundex codes than the indexed name.
>
> E.g. this yields a correct match:
> SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson');
>
> While this yields a false positive:
> SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && (daitch_mokotoff('John') || daitch_mokotoff('Doe'));
>
> And this yields a false negative:
> SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> (daitch_mokotoff('John') || daitch_mokotoff('Jameson'));
>
> This may be explained better by simply showing the output of
> soundex_tsvector and soundex_tsquery:
>
> SELECT soundex_tsvector('John Yamson');
>         soundex_tsvector
> ----------------------------------
> '160000':1 '164600':3 '460000':2
>
> SELECT soundex_tsquery('John Jameson');
>                  soundex_tsquery
> ---------------------------------------------------
> ( '160000' | '460000' ) & ( '164600' | '464600' )
>
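
To make the difference between the three approaches concrete, here is a
minimal sketch that models them with plain Python sets instead of
PostgreSQL text[] and tsvector values. The codes for John, Yamson and
Jameson are copied from the soundex_tsvector / soundex_tsquery output
quoted above; the code for Doe is an assumption (it is not shown in this
thread), and the helper names are hypothetical, chosen only for this
illustration.

```python
# Sets stand in for text[] / tsvector values; this is a model of the
# semantics, not the PostgreSQL implementation.

JOHN = {"160000", "460000"}      # from the quoted tsquery output
YAMSON = {"164600"}              # from the quoted tsvector output
JAMESON = {"164600", "464600"}   # from the quoted tsquery output
DOE = {"300000"}                 # assumption: not shown in the thread


def overlaps(indexed, search):
    """Model of text[] && text[]: true if any single code is shared."""
    return bool(indexed & search)


def contains(indexed, search):
    """Model of text[] @> text[]: true only if every search code is indexed."""
    return search <= indexed


def matches_all_names(indexed, search_names):
    """Model of an AND-of-ORs tsquery: each name must match via at least one code."""
    return all(indexed & codes for codes in search_names)


indexed = JOHN | YAMSON  # the stored name 'John Yamson'

print(overlaps(indexed, JOHN | DOE))                # True  (false positive)
print(contains(indexed, JOHN | JAMESON))            # False (false negative)
print(matches_all_names(indexed, [JOHN, JAMESON]))  # True  (correct match)
print(matches_all_names(indexed, [JOHN, DOE]))      # False (correct rejection)
```

The third helper keeps the codes grouped per name, which is what the
parenthesized OR groups in the tsquery output express; flattening
everything into one array loses that grouping.
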
>> Ideally explain that briefly before launching into the example. With
>> all the custom functions and so on it's a little involved, so maybe if
>> there's not a huge win in using that approach drop it entirely?
>
> I believe this functionality is quite useful, and that it's actually
> what's called for in many situations. So, I'd rather not drop this
> example.

Sounds good

P

>
>>
>> ATB,
>> P
>>
>
> Best regards,
>
> Dag Lem



