Thread: Prefix support for synonym dictionary

Prefix support for synonym dictionary

From
Oleg Bartunov
Date:
Hi there,

Attached is our patch against CVS HEAD, which adds prefix support to the
synonym dictionary.

Quick example:


> cat $SHAREDIR/tsearch_data/synonym_sample.syn
postgres        pgsql
postgresql      pgsql
postgre         pgsql
gogle           googl
indices         index*

=# create text search dictionary syn (template=synonym, synonyms='synonym_sample');
=# select ts_lexize('syn','indices');
 ts_lexize
-----------
 {index}
(1 row)

=# create text search configuration tst (copy=simple);
=# alter text search configuration tst alter mapping for asciiword with syn;
=# select to_tsquery('tst','indices');
 to_tsquery
------------
 'index':*
(1 row)

=# select 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
 ?column?
----------
 t
(1 row)

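To illustrate why the last query returns t: the entry "indices index*" makes to_tsquery emit the prefix query 'index':*, which matches any lexeme that begins with "index", such as "indexes". A minimal standalone sketch of that comparison in C follows. The helper name lexeme_has_prefix is hypothetical and chosen for illustration only; it is not a function from the patch or from the tsearch code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): returns true if 'lexeme'
 * starts with 'prefix', which is the comparison that a prefix query
 * such as 'index':* implies.
 */
static bool
lexeme_has_prefix(const char *lexeme, const char *prefix)
{
    assert(lexeme != NULL && prefix != NULL);

    /* Compare only the first strlen(prefix) bytes of the lexeme. */
    return strncmp(lexeme, prefix, strlen(prefix)) == 0;
}
```

With this helper, lexeme_has_prefix("indexes", "index") is true, while lexeme_has_prefix("indices", "index") is false: without the synonym entry, the word "indices" itself would not lead to a match against "indexes".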
Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Prefix support for synonym dictionary

From
Jeff Davis
Date:
Hi,

The patch looks good.

Comments:

1. The docs should be clarified a little. For instance, it should have a
link back to the definition of a prefix search (12.3.2). I included my
doc suggestions as an attachment.

2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
fragile) way. After calling findwrd(), the "end" pointer is pointing at
either the end of the string, or the *; depending on whether the string
ends in * and whether flags is NULL. I only mention this because I had
to take a more careful look to see what was happening. Perhaps add a
comment to make it more clear?

3. The patch looks for the special byte '*'. I think that's fine,
because we depend on the files being in UTF-8 encoding, where it's the
same byte. However, I thought it was worth mentioning in case we want to
support other encodings for text search files later.

Regards,
    Jeff Davis




Attachment

Re: Prefix support for synonym dictionary

From
Robert Haas
Date:
On Sun, Aug 2, 2009 at 3:05 PM, Jeff Davis<pgsql@j-davis.com> wrote:
> The patch looks good.
>
> Comments:
>
> 1. The docs should be clarified a little. For instance, it should have a
> link back to the definition of a prefix search (12.3.2). I included my
> doc suggestions as an attachment.
>
> 2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
> fragile) way. After calling findwrd(), the "end" pointer is pointing at
> either the end of the string, or the *; depending on whether the string
> ends in * and whether flags is NULL. I only mention this because I had
> to take a more careful look to see what was happening. Perhaps add a
> comment to make it more clear?
>
> 3. The patch looks for the special byte '*'. I think that's fine,
> because we depend on the files being in UTF-8 encoding, where it's the
> same byte. However, I thought it was worth mentioning in case we want to
> support other encodings for text search files later.

Oleg,

Are you planning to update this patch this week?  If not I will set it
to "Returned with Feedback".

Thanks,

...Robert


Re: Prefix support for synonym dictionary

From
Jeff Davis
Date:
On Wed, 2009-08-05 at 12:34 -0400, Robert Haas wrote:
> Oleg,
> 
> Are you planning to update this patch this week?  If not I will set it
> to "Returned with Feedback".

My only comments were related to docs and comments, and I supplied a
patch as a suggested fix for the docs. Also, the patch is very small.

I'd hate to hold it up over such a minor issue, and it seems like a
useful feature. If Oleg is unavailable, would you mind having a second
reviewer look at the patch to see whether they agree with my suggestions,
and then marking it "ready for committer review"?

Regards,
    Jeff Davis



Re: Prefix support for synonym dictionary

From
Teodor Sigaev
Date:
> 1. The docs should be clarified a little. For instance, it should have a
> link back to the definition of a prefix search (12.3.2). I included my
> doc suggestions as an attachment.
Thank you, merged

> 2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
> fragile) way. After calling findwrd(), the "end" pointer is pointing at
> either the end of the string, or the *; depending on whether the string
> ends in * and whether flags is NULL. I only mention this because I had
> to take a more careful look to see what was happening. Perhaps add a
> comment to make it more clear?
Added a comment:
/*
 * Finds the next whitespace-delimited word within the 'in' string.
 * Returns a pointer to the first character of the word, and a pointer
 * to the next byte after the last character in the word (in *end).
 * Character '*' at the end of word will not be treated as word
 * character if flags is not null.
 */
static char *
findwrd(char *in, char **end, uint16 *flags)
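The behavior described in that comment can be sketched as a small standalone C function. This is only an illustration under stated assumptions, not the code from the patch: the name find_word_sketch and the flag value SKETCH_PREFIX are hypothetical, and the sketch treats the input as single-byte characters, whereas real dictionary code would have to step over multibyte characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PREFIX 0x01      /* hypothetical flag value, illustration only */

/*
 * Returns the start of the next whitespace-delimited word in 'in' and
 * stores the position just past the word in *end.  If 'flags' is not
 * NULL and the word ends in '*', the asterisk is excluded from the word
 * and SKETCH_PREFIX is reported in *flags.  Returns NULL (and sets *end
 * to NULL) if no word is left.
 */
static char *
find_word_sketch(char *in, char **end, uint16_t *flags)
{
    char *start;

    assert(in != NULL && end != NULL);

    /* Skip leading whitespace. */
    while (*in && isspace((unsigned char) *in))
        in++;

    /* Nothing but whitespace remained. */
    if (*in == '\0')
    {
        *end = NULL;
        return NULL;
    }

    start = in;

    /* Advance to the first whitespace byte or the terminator. */
    while (*in && !isspace((unsigned char) *in))
        in++;

    if (flags != NULL && in[-1] == '*')
    {
        /* Trailing '*' is a prefix marker, not part of the word. */
        *flags = SKETCH_PREFIX;
        *end = in - 1;
    }
    else
    {
        if (flags != NULL)
            *flags = 0;
        *end = in;
    }

    return start;
}
```

For the line "indices index*", a first call with flags set to NULL yields the 7-byte word "indices"; a second call starting at *end with a non-NULL flags pointer yields the 5-byte word "index" and reports the prefix flag. This is the dependence on flags that the review comment pointed out: with flags NULL, a trailing '*' stays part of the word.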



> 3. The patch looks for the special byte '*'. I think that's fine,
> because we depend on the files being in UTF-8 encoding, where it's the
> same byte. However, I thought it was worth mentioning in case we want to
> support other encodings for text search files later.

tsearch_readline() converts the file's UTF-8 encoding into the server
encoding. pgsql supports only encodings that are supersets of ASCII, so
it's safe to use the asterisk with any encoding.
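A short sketch of why a byte-level check for '*' is safe, using UTF-8 as the example: every byte of a multibyte UTF-8 sequence has its high bit set, so the byte 0x2A can only ever be a literal asterisk. The helper names below (word_ends_with_star, all_bytes_non_ascii) are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the last byte of 'word' is an ASCII '*'. */
static bool
word_ends_with_star(const char *word)
{
    size_t len;

    assert(word != NULL);
    len = strlen(word);
    return len > 0 && word[len - 1] == '*';
}

/*
 * Hypothetical helper: true if each of the first 'n' bytes of 's' has
 * its high bit set, i.e. none of them can be mistaken for an ASCII
 * character such as '*' (0x2A).
 */
static bool
all_bytes_non_ascii(const char *s, size_t n)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < n; i++)
    {
        if (((unsigned char) s[i] & 0x80) == 0)
            return false;
    }
    return true;
}
```

For example, the UTF-8 string "\xd0\xb8\xd0\xbd\xd0\xb4*" (three Cyrillic letters followed by an asterisk) consists of six bytes with the high bit set and one final 0x2A byte, so word_ends_with_star() finds the marker without being misled by the multibyte characters before it.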

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: Prefix support for synonym dictionary

From
Robert Haas
Date:
2009/8/6 Teodor Sigaev <teodor@sigaev.ru>:
>> 1. The docs should be clarified a little. For instance, it should have a
>> link back to the definition of a prefix search (12.3.2). I included my
>> doc suggestions as an attachment.
>
> Thank you, merged
>
>> 2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
>> fragile) way. After calling findwrd(), the "end" pointer is pointing at
>> either the end of the string, or the *; depending on whether the string
>> ends in * and whether flags is NULL. I only mention this because I had
>> to take a more careful look to see what was happening. Perhaps add a
>> comment to make it more clear?
>
> Added a comment:
> /*
>  * Finds the next whitespace-delimited word within the 'in' string.
>  * Returns a pointer to the first character of the word, and a pointer
>  * to the next byte after the last character in the word (in *end).
>  * Character '*' at the end of word will not be treated as word
>  * character if flags is not null.
>  */
> static char *
> findwrd(char *in, char **end, uint16 *flags)
>
>
>
>> 3. The patch looks for the special byte '*'. I think that's fine,
>> because we depend on the files being in UTF-8 encoding, where it's the
>> same byte. However, I thought it was worth mentioning in case we want to
>> support other encodings for text search files later.
>
> tsearch_readline() converts the file's UTF-8 encoding into the server
> encoding. pgsql supports only encodings that are supersets of ASCII, so
> it's safe to use the asterisk with any encoding.

Jeff,

Based on these comments, do you want to go ahead and mark this "Ready
for Committer"?

https://commitfest.postgresql.org/action/patch_view?id=133

...Robert


Re: Prefix support for synonym dictionary

From
Jeff Davis
Date:
On Thu, 2009-08-06 at 12:19 -0400, Robert Haas wrote:
> Based on these comments, do you want to go ahead and mark this "Ready
> for Committer"?

Done, thanks Teodor.

However, on the commitfest page, the patches got updated in the wrong
places: "prefix support" and "filtering dictionary support" are pointing
at each others' patches.

Regards,
    Jeff Davis





Re: Prefix support for synonym dictionary

From
Robert Haas
Date:
On Thu, Aug 6, 2009 at 12:53 PM, Jeff Davis<pgsql@j-davis.com> wrote:
> On Thu, 2009-08-06 at 12:19 -0400, Robert Haas wrote:
>> Based on these comments, do you want to go ahead and mark this "Ready
>> for Committer"?
>
> Done, thanks Teodor.
>
> However, on the commitfest page, the patches got updated in the wrong
> places: "prefix support" and "filtering dictionary support" are pointing
> at each others' patches.

Fixed.

...Robert