Re: daitch_mokotoff module - Mailing list pgsql-hackers
From | Paul Ramsey |
---|---|
Subject | Re: daitch_mokotoff module |
Date | |
Msg-id | A2C12929-5B04-4202-95AB-A2970FB5016A@cleverelephant.ca Whole thread Raw |
In response to | Re: daitch_mokotoff module (Dag Lem <dag@nimrod.no>) |
Responses |
Re: daitch_mokotoff module
|
List | pgsql-hackers |
> On Jan 12, 2023, at 7:30 AM, Dag Lem <dag@nimrod.no> wrote: > > Paul Ramsey <pramsey@cleverelephant.ca> writes: > >> On Mon, Jan 2, 2023 at 2:03 PM Dag Lem <dag@nimrod.no> wrote: >> >>> I also improved on the documentation example (using Full Text Search). >>> AFAIK you can't make general queries like that using arrays, however in >>> any case I must admit that text arrays seem like more natural building >>> blocks than space delimited text here. >> >> This is a fun addition to fuzzystrmatch. > > I'm glad to hear it! :-) > >> >> While it's a little late in the game, I'll just put it out there: >> daitch_mokotoff() is way harder to type than soundex_dm(). Not sure >> how you feel about that. > > I chose the name in order to follow the naming of the other functions in > fuzzystrmatch, which as far as I can tell are given the name which each > algorithm is known by. > > Personally I don't think it's worth it to deviate from the naming of the > other functions just to avoid typing a few characters, and I certainly > don't think daitch_mokotoff is any harder to get right than > levenshtein_less_equal ;-) Good points :) > >> >> On the documentation, I found the leap directly into the tsquery >> example a bit too big. Maybe start with a very simple example, >> >> -- >> dm=# SELECT daitch_mokotoff('Schwartzenegger'), >> daitch_mokotoff('Swartzenegger'); >> >> daitch_mokotoff | daitch_mokotoff >> -----------------+----------------- >> {479465} | {479465} >> -- >> >> Then transition into a more complex example that illustrates the GIN >> index technique you mention in the text, but do not show: >> >> -- >> CREATE TABLE dm_gin (source text, dm text[]); >> >> INSERT INTO dm_gin (source) VALUES >> ('Swartzenegger'), >> ('John'), >> ('James'), >> ('Steinman'), >> ('Steinmetz'); >> >> UPDATE dm_gin SET dm = daitch_mokotoff(source); >> >> CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm); >> >> SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger'); >> -- > > Sure, I can do that. You don't think this much example text will be > TL;DR? I can only speak for myself, but examples are the meat of documentation learning, so as long as they come with enough explanatorycontext to be legible it's worth having them, IMO. > >> >> And only then go into the tsearch example. Incidentally, what does the >> tsearch approach provide that the simple GIN approach does not? > > The example shows how to do a simultaneous match on first AND last > names, where the first and last names (any number of names) are stored > in the same indexed column, and the order of the names in the index and > the search term does not matter. > > If you were to use the GIN "&&" operator, you would get a match if > either the first OR the last name matches. If you were to use the GIN > "@>" operator, you would *not* get a match if the search term contains > more soundex codes than the indexed name. > > E.g. this yields a correct match: > SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson'); > > While this yields a false positive: > SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && (daitch_mokotoff('John') || daitch_mokotoff('Doe')); > > And this yields a false negative: > SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> (daitch_mokotoff('John') || daitch_mokotoff('Jameson')); > > This may explained better by simply showing the output of > soundex_tsvector and soundex_tsquery: > > SELECT soundex_tsvector('John Yamson'); > soundex_tsvector > ---------------------------------- > '160000':1 '164600':3 '460000':2 > > SELECT soundex_tsquery('John Jameson'); > soundex_tsquery > --------------------------------------------------- > ( '160000' | '460000' ) & ( '164600' | '464600' ) > >> Ideally explain that briefly before launching into the example. With >> all the custom functions and so on it's a little involved, so maybe if >> there's not a huge win in using that approach drop it entirely? > > I believe this functionality is quite useful, and that it's actually > what's called for in many situations. So, I'd rather not drop this > example. Sounds good P > >> >> ATB, >> P >> > > Best regards, > > Dag Lem
pgsql-hackers by date: