Re: daitch_mokotoff module - Mailing list pgsql-hackers
From | Dag Lem |
---|---|
Subject | Re: daitch_mokotoff module |
Date | |
Msg-id | ygeh6wvsrc0.fsf@sid.nimrod.no Whole thread Raw |
In response to | daitch_mokotoff module (Dag Lem <dag@nimrod.no>) |
Responses |
Re: daitch_mokotoff module
|
List | pgsql-hackers |
Paul Ramsey <pramsey@cleverelephant.ca> writes: > On Mon, Jan 2, 2023 at 2:03 PM Dag Lem <dag@nimrod.no> wrote: > >> I also improved on the documentation example (using Full Text Search). >> AFAIK you can't make general queries like that using arrays, however in >> any case I must admit that text arrays seem like more natural building >> blocks than space delimited text here. > > This is a fun addition to fuzzystrmatch. I'm glad to hear it! :-) > > While it's a little late in the game, I'll just put it out there: > daitch_mokotoff() is way harder to type than soundex_dm(). Not sure > how you feel about that. I chose the name in order to follow the naming of the other functions in fuzzystrmatch, which as far as I can tell are given the name which each algorithm is known by. Personally I don't think it's worth it to deviate from the naming of the other functions just to avoid typing a few characters, and I certainly don't think daitch_mokotoff is any harder to get right than levenshtein_less_equal ;-) So, if I were to decide, I wouldn't change the name of the function. However I'm obviously not calling the shots on what goes into PostgreSQL - perhaps someone else would like to weigh in on this? > > On the documentation, I found the leap directly into the tsquery > example a bit too big. Maybe start with a very simple example, > > -- > dm=# SELECT daitch_mokotoff('Schwartzenegger'), > daitch_mokotoff('Swartzenegger'); > > daitch_mokotoff | daitch_mokotoff > -----------------+----------------- > {479465} | {479465} > -- > > Then transition into a more complex example that illustrates the GIN > index technique you mention in the text, but do not show: > > -- > CREATE TABLE dm_gin (source text, dm text[]); > > INSERT INTO dm_gin (source) VALUES > ('Swartzenegger'), > ('John'), > ('James'), > ('Steinman'), > ('Steinmetz'); > > UPDATE dm_gin SET dm = daitch_mokotoff(source); > > CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm); > > SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger'); > -- Sure, I can do that. You don't think this much example text will be TL;DR? > > And only then go into the tsearch example. Incidentally, what does the > tsearch approach provide that the simple GIN approach does not? The example shows how to do a simultaneous match on first AND last names, where the first and last names (any number of names) are stored in the same indexed column, and the order of the names in the index and the search term does not matter. If you were to use the GIN "&&" operator, you would get a match if either the first OR the last name matches. If you were to use the GIN "@>" operator, you would *not* get a match if the search term contains more soundex codes than the indexed name. E.g. this yields a correct match: SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson'); While this yields a false positive: SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && (daitch_mokotoff('John') || daitch_mokotoff('Doe')); And this yields a false negative: SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> (daitch_mokotoff('John') || daitch_mokotoff('Jameson')); This may explained better by simply showing the output of soundex_tsvector and soundex_tsquery: SELECT soundex_tsvector('John Yamson'); soundex_tsvector ---------------------------------- '160000':1 '164600':3 '460000':2 SELECT soundex_tsquery('John Jameson'); soundex_tsquery --------------------------------------------------- ( '160000' | '460000' ) & ( '164600' | '464600' ) > Ideally explain that briefly before launching into the example. With > all the custom functions and so on it's a little involved, so maybe if > there's not a huge win in using that approach drop it entirely? I believe this functionality is quite useful, and that it's actually what's called for in many situations. So, I'd rather not drop this example. > > ATB, > P > Best regards, Dag Lem
pgsql-hackers by date: