Thread: Prefix support for synonym dictionary
Hi there,

attached is our patch for CVS HEAD, which adds prefix support for the synonym dictionary.

Quick example:

> cat $SHAREDIR/tsearch_data/synonym_sample.syn
postgres        pgsql
postgresql      pgsql
postgre         pgsql
gogle           googl
indices         index*

=# create text search dictionary syn( template=synonym,synonyms='synonym_sample');
=# select ts_lexize('syn','indices');
 ts_lexize
-----------
 {index}
(1 row)

=# create text search configuration tst ( copy=simple);
=# alter text search configuration tst alter mapping for asciiword with syn;
=# select to_tsquery('tst','indices');
 to_tsquery
------------
 'index':*
(1 row)

=# select 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
 ?column?
----------
 t
(1 row)

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
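Note on the example: the query item 'index':* matches any lexeme that begins with "index", which is why "indexes" matches even though the query word was "indices". A minimal sketch of that comparison in C, assuming plain byte strings. The name lexeme_has_prefix is hypothetical; it is not part of the patch or of PostgreSQL, and the real matching code is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: returns true when 'lexeme'
 * starts with 'prefix'. This mirrors the idea behind a prefix query item
 * such as 'index':*, not PostgreSQL's actual matching code.
 */
static bool
lexeme_has_prefix(const char *lexeme, const char *prefix)
{
    assert(lexeme != NULL && prefix != NULL);

    /* Compare only the first strlen(prefix) bytes of the lexeme */
    return strncmp(lexeme, prefix, strlen(prefix)) == 0;
}
```

Under this sketch "indexes" has the prefix "index", while "indices" does not, which is the gap the synonym entry "indices index*" fills.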
Hi,

The patch looks good.

Comments:

1. The docs should be clarified a little. For instance, it should have a
link back to the definition of a prefix search (12.3.2). I included my
doc suggestions as an attachment.

2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
fragile) way. After calling findwrd(), the "end" pointer is pointing at
either the end of the string, or the *; depending on whether the string
ends in * and whether flags is NULL. I only mention this because I had
to take a more careful look to see what was happening. Perhaps add a
comment to make it more clear?

3. The patch looks for the special byte '*'. I think that's fine,
because we depend on the files being in UTF-8 encoding, where it's the
same byte. However, I thought it was worth mentioning in case we want to
support other encodings for text search files later.

Regards,
    Jeff Davis
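Note on point 3: scanning for the single byte '*' is safe in UTF-8 because every byte of a multibyte UTF-8 sequence has its high bit set, so the byte 0x2A can only ever be a literal asterisk. A small sketch that demonstrates this on sample strings. The name count_asterisk_bytes is hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustration only (hypothetical helper): counts the bytes equal to '*'
 * in a NUL-terminated UTF-8 string. Lead and continuation bytes of
 * multibyte UTF-8 sequences are all >= 0x80, so every hit is a real
 * ASCII '*' and never a fragment of another character.
 */
static size_t
count_asterisk_bytes(const char *s)
{
    size_t n = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++)
    {
        if ((unsigned char) *s == 0x2A) /* 0x2A is ASCII '*' */
            n++;
    }
    return n;
}
```

A Cyrillic word followed by '*' yields exactly one hit, the same as the ASCII string "index*".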
On Sun, Aug 2, 2009 at 3:05 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> The patch looks good.
>
> Comments:
>
> 1. The docs should be clarified a little. For instance, it should have a
> link back to the definition of a prefix search (12.3.2). I included my
> doc suggestions as an attachment.
>
> 2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
> fragile) way. After calling findwrd(), the "end" pointer is pointing at
> either the end of the string, or the *; depending on whether the string
> ends in * and whether flags is NULL. I only mention this because I had
> to take a more careful look to see what was happening. Perhaps add a
> comment to make it more clear?
>
> 3. The patch looks for the special byte '*'. I think that's fine,
> because we depend on the files being in UTF-8 encoding, where it's the
> same byte. However, I thought it was worth mentioning in case we want to
> support other encodings for text search files later.

Oleg,

Are you planning to update this patch this week? If not I will set it
to "Returned with Feedback".

Thanks,

...Robert
On Wed, 2009-08-05 at 12:34 -0400, Robert Haas wrote:
> Oleg,
>
> Are you planning to update this patch this week? If not I will set it
> to "Returned with Feedback".

My only comments were related to docs and comments, and I supplied a
patch as a suggested fix for the docs. Also, the patch is very small.

I'd hate to hold it up over such a minor issue, and it seems like a
useful feature. If Oleg is unavailable, would you mind just having a
second reviewer look at the patch to see if they agree with my
suggestions, and then mark it "ready for committer review"?

Regards,
    Jeff Davis
> 1. The docs should be clarified a little. For instance, it should have a
> link back to the definition of a prefix search (12.3.2). I included my
> doc suggestions as an attachment.

Thank you, merged.

> 2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
> fragile) way. After calling findwrd(), the "end" pointer is pointing at
> either the end of the string, or the *; depending on whether the string
> ends in * and whether flags is NULL. I only mention this because I had
> to take a more careful look to see what was happening. Perhaps add a
> comment to make it more clear?

Added comments:
/*
 * Finds the next whitespace-delimited word within the 'in' string.
 * Returns a pointer to the first character of the word, and a pointer
 * to the next byte after the last character in the word (in *end).
 * Character '*' at the end of word will not be treated as word
 * character if flags is not null.
 */
static char *
findwrd(char *in, char **end, uint16 *flags)

> 3. The patch looks for the special byte '*'. I think that's fine,
> because we depend on the files being in UTF-8 encoding, where it's the
> same byte. However, I thought it was worth mentioning in case we want to
> support other encodings for text search files later.

tsearch_readline() converts the file's UTF-8 encoding into the server
encoding. pgsql supports only encodings that are supersets of ASCII, so
it's safe to use the asterisk with any encoding.

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
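Note on the findwrd() comment above: the described behaviour can be sketched as follows. This is a simplified illustration and not the patch's code. It assumes single-byte characters and uses isspace() from <ctype.h>, whereas the real function has to step through multibyte characters. The names find_word_sketch and SKETCH_PREFIX are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PREFIX 0x0001    /* hypothetical flag: word ended with '*' */

/*
 * Simplified, single-byte sketch of the documented contract: return the
 * start of the next whitespace-delimited word and set *end to the byte
 * after its last character. When 'flags' is not NULL, a trailing '*' is
 * reported through *flags and excluded from the word.
 */
static char *
find_word_sketch(char *in, char **end, uint16_t *flags)
{
    char *start;

    assert(in != NULL && end != NULL);

    /* Skip leading whitespace */
    while (*in != '\0' && isspace((unsigned char) *in))
        in++;

    /* No word left on this line */
    if (*in == '\0')
    {
        *end = NULL;
        return NULL;
    }

    start = in;

    /* Advance to the first whitespace byte or the end of the string */
    while (*in != '\0' && !isspace((unsigned char) *in))
        in++;

    if (flags != NULL && in[-1] == '*')
    {
        /* Trailing '*' is a prefix marker, so *end stops on the '*' */
        *flags = SKETCH_PREFIX;
        *end = in - 1;
    }
    else
    {
        if (flags != NULL)
            *flags = 0;
        *end = in;
    }

    return start;
}
```

For the line "indices index*", a first call with flags set to NULL returns the 7-byte word "indices"; a second call with a non-NULL flags pointer returns the 5-byte word "index", leaves *end on the '*', and sets the prefix flag. With flags set to NULL the same second word would be 6 bytes long and include the '*', which is the dependence on flags that the review pointed out.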
2009/8/6 Teodor Sigaev <teodor@sigaev.ru>:
>> 1. The docs should be clarified a little. For instance, it should have a
>> link back to the definition of a prefix search (12.3.2). I included my
>> doc suggestions as an attachment.
>
> Thank you, merged.
>
>> 2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
>> fragile) way. After calling findwrd(), the "end" pointer is pointing at
>> either the end of the string, or the *; depending on whether the string
>> ends in * and whether flags is NULL. I only mention this because I had
>> to take a more careful look to see what was happening. Perhaps add a
>> comment to make it more clear?
>
> Added comments:
> /*
>  * Finds the next whitespace-delimited word within the 'in' string.
>  * Returns a pointer to the first character of the word, and a pointer
>  * to the next byte after the last character in the word (in *end).
>  * Character '*' at the end of word will not be treated as word
>  * character if flags is not null.
>  */
> static char *
> findwrd(char *in, char **end, uint16 *flags)
>
>> 3. The patch looks for the special byte '*'. I think that's fine,
>> because we depend on the files being in UTF-8 encoding, where it's the
>> same byte. However, I thought it was worth mentioning in case we want to
>> support other encodings for text search files later.
>
> tsearch_readline() converts the file's UTF-8 encoding into the server
> encoding. pgsql supports only encodings that are supersets of ASCII, so
> it's safe to use the asterisk with any encoding.

Jeff,

Based on these comments, do you want to go ahead and mark this "Ready
for Committer"?

https://commitfest.postgresql.org/action/patch_view?id=133

...Robert
On Thu, 2009-08-06 at 12:19 -0400, Robert Haas wrote:
> Based on these comments, do you want to go ahead and mark this "Ready
> for Committer"?

Done, thanks Teodor.

However, on the commitfest page, the patches got updated in the wrong
places: "prefix support" and "filtering dictionary support" are pointing
at each other's patches.

Regards,
    Jeff Davis
On Thu, Aug 6, 2009 at 12:53 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2009-08-06 at 12:19 -0400, Robert Haas wrote:
>> Based on these comments, do you want to go ahead and mark this "Ready
>> for Committer"?
>
> Done, thanks Teodor.
>
> However, on the commitfest page, the patches got updated in the wrong
> places: "prefix support" and "filtering dictionary support" are pointing
> at each other's patches.

Fixed.

...Robert