Thread: english parser in text search: support for multiple words in the same position
english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
Currently the english parser in text search does not support multiple words in the same position. Consider a word like "wikipedia.org". The text search returns the single token "wikipedia.org", so if someone searches for "wikipedia org" there will not be a match. There are two problems here:

1. We do not have separate tokens "wikipedia" and "org".
2. If we have the two tokens, we should have them at adjacent positions
so that a phrase search for "wikipedia org" works.

It would be nice to have the following tokenization and positioning for "wikipedia.org":

position 0: WORD(wikipedia), URL(wikipedia.org)
position 1: WORD(org)

Take the example of "wikipedia.org/search?q=sushant". Here is the tsvector:

select to_tsvector('english', 'wikipedia.org/search?q=sushant');
                                 to_tsvector
----------------------------------------------------------------------------
 '/search?q=sushant':3 'wikipedia.org':2 'wikipedia.org/search?q=sushant':1

And here are the tokens:

select ts_debug('english', 'wikipedia.org/search?q=sushant');
                                    ts_debug
--------------------------------------------------------------------------------
 (url,URL,wikipedia.org/search?q=sushant,{simple},simple,{wikipedia.org/search?q=sushant})
 (host,Host,wikipedia.org,{simple},simple,{wikipedia.org})
 (url_path,"URL path",/search?q=sushant,{simple},simple,{/search?q=sushant})

The tokenization I would like to see is:

position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
position 1: WORD(org)
position 2: WORD(search), URL_PATH(/search?q=sushant)
position 3: WORD(q), URL_QUERY(q=sushant)
position 4: WORD(sushant)

So what we need is to support multiple tokens at the same position, and I need help in understanding how to realize this. Currently the position assignment happens in make_tsvector, working over the parsed lexemes. The lexemes are obtained from prsd_nexttoken. However, prsd_nexttoken only returns a single token. Would it be possible to store some tokens and return them together? Or can we put a flag on certain tokens saying that the position should not be increased?

-Sushant.
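P.S. For concreteness, here is the mismatch with the stock parser (default english configuration):

    select to_tsvector('english', 'wikipedia.org');
      to_tsvector
    -------------------
     'wikipedia.org':1

    select to_tsvector('english', 'wikipedia.org')
           @@ plainto_tsquery('english', 'wikipedia org');
     ?column?
    ----------
     f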
Re: english parser in text search: support for multiple words in the same position
From
Markus Wanner
Date:
Hi,

On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> 1. We do not have separate tokens "wikipedia" and "org".
> 2. If we have the two tokens, we should have them at adjacent positions
> so that a phrase search for "wikipedia org" works.

This would needlessly increase the number of tokens. Instead you'd better make it work like compound word support, having just "wikipedia" and "org" as tokens.

Searching for "wikipedia.org" or "wikipedia org" should then result in the same search query with the two tokens: "wikipedia" and "org".

> position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)

IMO the differentiation between WORDs and URLs is not something the text search engine should have to take care of much. Let it just do the searching and make it do that well. What does a token "wikipedia.org/search?q=sushant" buy you in terms of text searching? Or even result highlighting? I wouldn't expect anybody to want to search for a full URL, do you?

Regards

Markus Wanner
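P.S. Sketched with today's functions (the first result is current behavior; the second is what I am proposing, not what happens today):

    select plainto_tsquery('english', 'wikipedia org');
      plainto_tsquery
    ---------------------
     'wikipedia' & 'org'

    select plainto_tsquery('english', 'wikipedia.org');
    -- today: 'wikipedia.org'
    -- under this proposal: 'wikipedia' & 'org'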
Re: english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
> On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> > 1. We do not have separate tokens "wikipedia" and "org".
> > 2. If we have the two tokens, we should have them at adjacent positions
> > so that a phrase search for "wikipedia org" works.
>
> This would needlessly increase the number of tokens. Instead you'd
> better make it work like compound word support, having just "wikipedia"
> and "org" as tokens.

The current text parser already returns url and url_path. That already increases the number of unique tokens. I am only asking to add the normal English words as well, so that if someone types only "wikipedia" he gets a match.

> Searching for "wikipedia.org" or "wikipedia org" should then result in
> the same search query with the two tokens: "wikipedia" and "org".

Earlier, people have expressed the need to index urls/emails, and the text parser currently already does so. Reverting that would be a regression in functionality. Further, a ranking function can take advantage of a direct match of a token.

> > position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
>
> IMO the differentiation between WORDs and URLs is not something the text
> search engine should have to take care of much. Let it just do the
> searching and make it do that well.

The Postgres english parser already emits urls as tokens. The only thing I am asking for is improving the tokenization and positioning.

> What does a token "wikipedia.org/search?q=sushant" buy you in terms of
> text searching? Or even result highlighting? I wouldn't expect anybody
> to want to search for a full URL, do you?

There has been a need expressed for this in the past. And an exact token match can result in better ranking. For example, a tf-idf ranking will rank a match on such a unique token significantly higher.

-Sushant.

> Regards
>
> Markus Wanner
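P.S. To make the ranking point concrete: an exact host token already matches today (default english configuration), and the change I propose would keep that working:

    select ts_rank(to_tsvector('english', 'see wikipedia.org/search?q=sushant'),
                   to_tsquery('english', 'wikipedia.org'));
    -- non-zero rank: the host token 'wikipedia.org' matches exactly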
Re: english parser in text search: support for multiple words in the same position
From
Markus Wanner
Date:
Hi,

On 08/02/2010 03:12 PM, Sushant Sinha wrote:
> The current text parser already returns url and url_path. That already
> increases the number of unique tokens.

Well, I think I simply turned that off to be able to search for plain words. It still works for complete URLs; those are just treated like text, then.

> Earlier, people have expressed the need to index urls/emails, and the
> text parser currently already does so. Reverting that would be a
> regression in functionality. Further, a ranking function can take
> advantage of a direct match of a token.

That's a point, yes. However, simply making the same string turn up twice in the tokenizer's output doesn't sound like the right solution to me. Especially considering that the query parser uses the very same tokenizer.

Regards

Markus Wanner
Re: english parser in text search: support for multiple words in the same position
From
Robert Haas
Date:
On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> The current text parser already returns url and url_path. That already
> increases the number of unique tokens. I am only asking to add the
> normal English words as well, so that if someone types only "wikipedia"
> he gets a match.
[...]
> The Postgres english parser already emits urls as tokens. The only thing
> I am asking for is improving the tokenization and positioning.

Can you write a patch to implement your idea?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
On Mon, 2010-08-02 at 09:32 -0400, Robert Haas wrote:
> On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking to add the
> > normal English words as well, so that if someone types only "wikipedia"
> > he gets a match.
> [...]
> > The Postgres english parser already emits urls as tokens. The only thing
> > I am asking for is improving the tokenization and positioning.
>
> Can you write a patch to implement your idea?

Yes, that's what I am planning to do. I just wanted to see if anyone can help me in estimating whether this is doable in the current parser or whether I need to write a new one. If possible, some ideas on how to go about implementing it?

-Sushant.
Re: english parser in text search: support for multiple words in the same position
From
Tom Lane
Date:
Sushant Sinha <sushant354@gmail.com> writes:
>> This would needlessly increase the number of tokens. Instead you'd
>> better make it work like compound word support, having just "wikipedia"
>> and "org" as tokens.

> The current text parser already returns url and url_path. That already
> increases the number of unique tokens. I am only asking to add the
> normal English words as well, so that if someone types only "wikipedia"
> he gets a match.

The suggestion to make it work like compound words is still a good one, ie given wikipedia.org you'd get back

    host       wikipedia.org
    host-part  wikipedia
    host-part  org

not just the "host" token as at present.

Then the user could decide whether he needed to index hostname components or not, by choosing whether to forward hostname-part tokens to a dictionary or just discard them.

If you submit a patch that tries to force the issue by classifying hostname parts as plain words, it'll probably get rejected out of hand on backwards-compatibility grounds.

regards, tom lane
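P.S. In configuration terms (the host_part token type is hypothetical here, it's what such a patch would have to add; the CREATE/ALTER syntax itself already exists):

    CREATE TEXT SEARCH CONFIGURATION myconf (COPY = english);

    -- index hostname components by forwarding them to a dictionary ...
    ALTER TEXT SEARCH CONFIGURATION myconf
        ADD MAPPING FOR host_part WITH simple;

    -- ... or discard them, by simply leaving host_part unmapped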
Re: english parser in text search: support for multiple words in the same position
From
"Kevin Grittner"
Date:
Sushant Sinha <sushant354@gmail.com> wrote:

> Yes, that's what I am planning to do. I just wanted to see if anyone
> can help me in estimating whether this is doable in the current
> parser or whether I need to write a new one. If possible, some ideas
> on how to go about implementing it?

The current tsearch parser is a state machine which does clunky mode switches to handle special cases like you describe. If you're looking at doing very much in there, you might want to consider a rewrite to something based on regular expressions. See the discussion in these threads:

http://archives.postgresql.org/message-id/200912102005.16560.andres@anarazel.de

http://archives.postgresql.org/message-id/4B210D9E020000250002D344@gw.wicourts.gov

That was actually at the top of my personal PostgreSQL TODO list (after my current project is wrapped up), but I wouldn't complain if someone else wanted to take it. :-)

-Kevin
Re: english parser in text search: support for multiple words in the same position
From
Robert Haas
Date:
On Mon, Aug 2, 2010 at 10:21 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Sushant Sinha <sushant354@gmail.com> wrote:
>
>> Yes, that's what I am planning to do. I just wanted to see if anyone
>> can help me in estimating whether this is doable in the current
>> parser or whether I need to write a new one. If possible, some ideas
>> on how to go about implementing it?
>
> The current tsearch parser is a state machine which does clunky mode
> switches to handle special cases like you describe. If you're
> looking at doing very much in there, you might want to consider a
> rewrite to something based on regular expressions. See the discussion
> in these threads:
>
> http://archives.postgresql.org/message-id/200912102005.16560.andres@anarazel.de
>
> http://archives.postgresql.org/message-id/4B210D9E020000250002D344@gw.wicourts.gov
>
> That was actually at the top of my personal PostgreSQL TODO list
> (after my current project is wrapped up), but I wouldn't complain if
> someone else wanted to take it. :-)

If you end up rewriting it, it may be a good idea, in the initial rewrite, to mimic the current results as closely as possible - and then submit a separate patch to change the results. Changing two things at the same time exponentially increases the chance of your patch getting rejected.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
I have attached a patch that emits parts of a host token, a url token, an email token and a file token. Further, it makes sure that a host/url/email/file token and the first part-token are at the same position in the tsvector.

The two major changes are:

1. Tokenization changes: The patch exploits the special handlers in the text parser to reset the parser position to the start of a host/url/email/file token when it finds one. Special handlers were already used for extracting host and urlpath from a full url, so this is more of an extension of the same idea.

2. Position changes: We do not advance the position when we encounter a host/url/email/file token. As a result, the first part of that token aligns with the token itself.

Attachments:
tokens_output.txt: sample queries and results with the patch
token_v1.patch: patch wrt cvs head

Currently, the patch outputs parts of the tokens as normal tokens like WORD, NUMWORD etc. Tom argued earlier that this will break backward compatibility and that they should instead be output as parts of the respective tokens. If there is agreement with what Tom says, then the current patch can be modified to output the subtokens as parts. However, before I complicate the patch with that, I wanted to get feedback on any other major problems with the patch.

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
> >> This would needlessly increase the number of tokens. Instead you'd
> >> better make it work like compound word support, having just "wikipedia"
> >> and "org" as tokens.
>
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking to add the
> > normal English words as well, so that if someone types only "wikipedia"
> > he gets a match.
>
> The suggestion to make it work like compound words is still a good one,
> ie given wikipedia.org you'd get back
>
>     host       wikipedia.org
>     host-part  wikipedia
>     host-part  org
>
> not just the "host" token as at present.
>
> Then the user could decide whether he needed to index hostname
> components or not, by choosing whether to forward hostname-part
> tokens to a dictionary or just discard them.
>
> If you submit a patch that tries to force the issue by classifying
> hostname parts as plain words, it'll probably get rejected out of
> hand on backwards-compatibility grounds.
>
> regards, tom lane
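P.S. Illustrative output with the patch applied (the full set of cases is in the attached tokens_output.txt; this is just the shape of it):

    select to_tsvector('english', 'wikipedia.org');
                  to_tsvector
    -------------------------------------------
     'org':2 'wikipedia':1 'wikipedia.org':1

The host token and its first part share position 1, so both "wikipedia.org" and "wikipedia org" style queries can match.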
Re: english parser in text search: support for multiple words in the same position
From
Robert Haas
Date:
On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> I have attached a patch that emits parts of a host token, a url token,
> an email token and a file token. Further, it makes sure that a
> host/url/email/file token and the first part-token are at the same
> position in the tsvector.

You should probably add this patch here:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
Updating the patch: it now emits a parttoken and registers it with the snowball config.

-Sushant.

On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
> On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> > I have attached a patch that emits parts of a host token, a url token,
> > an email token and a file token. Further, it makes sure that a
> > host/url/email/file token and the first part-token are at the same
> > position in the tsvector.
>
> You should probably add this patch here:
>
> https://commitfest.postgresql.org/action/commitfest_view/open
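P.S. Concretely, "registering it with the snowball config" means a mapping along these lines (parttoken is the new token type added by this patch; english_stem is the existing snowball dictionary):

    ALTER TEXT SEARCH CONFIGURATION english
        ADD MAPPING FOR parttoken WITH english_stem;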
Re: english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
For headline generation to work properly, email/file/url/host need to become skip tokens. Updating the patch with that change.

-Sushant.

On Sat, 2010-09-04 at 13:25 +0530, Sushant Sinha wrote:
> Updating the patch: it now emits a parttoken and registers it with the
> snowball config.
>
> -Sushant.
>
> On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
> > On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> > > I have attached a patch that emits parts of a host token, a url token,
> > > an email token and a file token. Further, it makes sure that a
> > > host/url/email/file token and the first part-token are at the same
> > > position in the tsvector.
> >
> > You should probably add this patch here:
> >
> > https://commitfest.postgresql.org/action/commitfest_view/open
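P.S. The case this targets, sketched (the highlighted result shown is the desired behavior with the patch; the stock parser would not match "wikipedia" inside the host token at all):

    select ts_headline('english',
                       'read more at wikipedia.org/search?q=sushant',
                       to_tsquery('english', 'wikipedia'));
    -- desired: read more at <b>wikipedia</b>.org/search?q=sushant

Since the whole token and its parts cover the same stretch of text, the whole host/url/email/file tokens have to be skipped during headline generation.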
Re: english parser in text search: support for multiple words in the same position
From
Tom Lane
Date:
Sushant Sinha <sushant354@gmail.com> writes:
> For headline generation to work properly, email/file/url/host need
> to become skip tokens. Updating the patch with that change.

I looked at this patch a bit. I'm fairly unhappy that it seems to be inventing a brand new mechanism to do something the ts parser can already do. Why didn't you code the url-part mechanism using the existing support for compound words?

The changes made to parsetext() seem particularly scary: it's not clear at all that that's not breaking unrelated behaviors. In fact, the changes in the regression test results suggest strongly to me that it *is* breaking things. Why are there so many diffs in examples that include no URLs at all?

An issue that's nearly as bad is the 100% lack of documentation, which makes the patch difficult to review because it's hard to tell what it intends to accomplish or whether it's met the intent. The patch is not committable without documentation anyway, but right now I'm not sure it's even usefully reviewable.

In line with the lack of documentation, I would say that the choice of the name "parttoken" for the new token type is not helpful. Part of what? And none of the other token type names include the word "token", so that's not a good decision either. Possibly "url_part" would be a suitable name.

regards, tom lane
Re: english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
> I looked at this patch a bit. I'm fairly unhappy that it seems to be
> inventing a brand new mechanism to do something the ts parser can
> already do. Why didn't you code the url-part mechanism using the
> existing support for compound words?

I am not familiar with the compound word implementation and so I am not sure how to split a url with compound word support. I looked into the documentation for compound words and it does not say much about how to identify the components of a token. Is a compound word split by matching against a list of words? If yes, then we will not be able to use that, as we do not know all the words that can appear in a url/host/email/file.

I think another approach could be to use the dict_regex dictionary support. However, we would have to match the regex with something that the parser is doing.

The current patch is not inventing any new mechanism. It uses the special handler mechanism already present in the parser. For example, when the current parser finds a URL it runs a special handler called SpecialFURL, which resets the parser position to the start of the token in order to find the hostname. After finding the host it moves on to finding the path. So you first get the URL, then the host, and finally the path.

Similarly, we are resetting the parser to the start of the token on finding a url in order to output the url parts. Then, before entering the state that can lead to a url, we output the url part. The state machine modification is similar for the other tokens: file/email/host.

> The changes made to parsetext()
> seem particularly scary: it's not clear at all that that's not breaking
> unrelated behaviors. In fact, the changes in the regression test
> results suggest strongly to me that it *is* breaking things. Why are
> there so many diffs in examples that include no URLs at all?

I think some of the difference is coming from the fact that pos now starts at 0 where it used to start at 1. That is easily fixable though.

> An issue that's nearly as bad is the 100% lack of documentation,
> which makes the patch difficult to review because it's hard to tell
> what it intends to accomplish or whether it's met the intent.
> The patch is not committable without documentation anyway, but right
> now I'm not sure it's even usefully reviewable.

I did not provide any explanation as I could not find a place in the code to put the documentation (the change is just a modification of the state machine). Should I do a separate write-up to explain the desired output and the changes made to achieve it?

> In line with the lack of documentation, I would say that the choice of
> the name "parttoken" for the new token type is not helpful. Part of
> what? And none of the other token type names include the word "token",
> so that's not a good decision either. Possibly "url_part" would be a
> suitable name.

I can modify it to output url-part/host-part/email-part/file-part if there is agreement over the rest of the issues. So let me know if I should go ahead with this.

-Sushant.
Re: english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
Any updates on this?

On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha <sushant354@gmail.com> wrote:
> > I looked at this patch a bit. I'm fairly unhappy that it seems to be
> > inventing a brand new mechanism to do something the ts parser can
> > already do. Why didn't you code the url-part mechanism using the
> > existing support for compound words?
>
> I am not familiar with the compound word implementation and so I am not
> sure how to split a url with compound word support. I looked into the
> documentation for compound words and it does not say much about how to
> identify the components of a token. Is a compound word split by matching
> against a list of words? If yes, then we will not be able to use that,
> as we do not know all the words that can appear in a url/host/email/file.
>
> I think another approach could be to use the dict_regex dictionary
> support. However, we would have to match the regex with something that
> the parser is doing.
>
> The current patch is not inventing any new mechanism. It uses the
> special handler mechanism already present in the parser. For example,
> when the current parser finds a URL it runs a special handler called
> SpecialFURL, which resets the parser position to the start of the token
> in order to find the hostname. After finding the host it moves on to
> finding the path. So you first get the URL, then the host, and finally
> the path.
>
> Similarly, we are resetting the parser to the start of the token on
> finding a url in order to output the url parts. Then, before entering
> the state that can lead to a url, we output the url part. The state
> machine modification is similar for the other tokens: file/email/host.
>
> > The changes made to parsetext()
> > seem particularly scary: it's not clear at all that that's not breaking
> > unrelated behaviors. In fact, the changes in the regression test
> > results suggest strongly to me that it *is* breaking things. Why are
> > there so many diffs in examples that include no URLs at all?
>
> I think some of the difference is coming from the fact that pos now
> starts at 0 where it used to start at 1. That is easily fixable though.
>
> > An issue that's nearly as bad is the 100% lack of documentation,
> > which makes the patch difficult to review because it's hard to tell
> > what it intends to accomplish or whether it's met the intent.
> > The patch is not committable without documentation anyway, but right
> > now I'm not sure it's even usefully reviewable.
>
> I did not provide any explanation as I could not find a place in the
> code to put the documentation (the change is just a modification of the
> state machine). Should I do a separate write-up to explain the desired
> output and the changes made to achieve it?
>
> > In line with the lack of documentation, I would say that the choice of
> > the name "parttoken" for the new token type is not helpful. Part of
> > what? And none of the other token type names include the word "token",
> > so that's not a good decision either. Possibly "url_part" would be a
> > suitable name.
>
> I can modify it to output url-part/host-part/email-part/file-part if
> there is agreement over the rest of the issues. So let me know if I
> should go ahead with this.
>
> -Sushant.
Re: english parser in text search: support for multiple words in the same position
From
Robert Haas
Date:
On Wed, Sep 29, 2010 at 1:29 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> Any updates on this?
>
> On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha <sushant354@gmail.com>
> wrote:
>>
>> > I looked at this patch a bit. I'm fairly unhappy that it seems to be
>> > inventing a brand new mechanism to do something the ts parser can
>> > already do. Why didn't you code the url-part mechanism using the
>> > existing support for compound words?
>>
>> I am not familiar with the compound word implementation and so I am not
>> sure how to split a url with compound word support. I looked into the
>> documentation for compound words and it does not say much about how to
>> identify the components of a token. Is a compound word split by matching
>> against a list of words? If yes, then we will not be able to use that,
>> as we do not know all the words that can appear in a url/host/email/file.

It seems to me that you need to familiarize yourself with this stuff and then post an analysis, or a new patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Re: english parser in text search: support for multiple words in the same position
From
Tom Lane
Date:
[ sorry for not responding on this sooner, it's been hectic the last
couple weeks ]

Sushant Sinha <sushant354@gmail.com> writes:
>> I looked at this patch a bit. I'm fairly unhappy that it seems to be
>> inventing a brand new mechanism to do something the ts parser can
>> already do. Why didn't you code the url-part mechanism using the
>> existing support for compound words?

> I am not familiar with the compound word implementation and so I am not
> sure how to split a url with compound word support. I looked into the
> documentation for compound words and it does not say much about how to
> identify the components of a token.

IIRC, the way that that works is associated with pushing a sub-state of the state machine in order to scan each compound-word part. I don't have the details in my head anymore, though I recall having traced through it in the past. Look at the state machine actions that are associated with producing the compound word tokens and sub-tokens.

> The current patch is not inventing any new mechanism. It uses the
> special handler mechanism already present in the parser.

The fact that that mechanism is there doesn't mean that it's the right one for this task. I think that Teodor meant it for other things altogether. If it were the best way to solve this problem, why wouldn't he have used it for compound words?

>> The changes made to parsetext()
>> seem particularly scary: it's not clear at all that that's not breaking
>> unrelated behaviors. In fact, the changes in the regression test
>> results suggest strongly to me that it *is* breaking things. Why are
>> there so many diffs in examples that include no URLs at all?

> I think some of the difference is coming from the fact that pos now
> starts at 0 where it used to start at 1. That is easily fixable though.

You cannot seriously believe that it's okay for a patch to just arbitrarily change such an easily user-visible behavior. This comes back again to the point that this patch is not going to get in at all unless it makes the absolute minimum amount of change in the established behavior of the parser. I think we can probably accept a patch that produces new tokens (of newly-defined types) in addition to what it already produced for URL-looking input, because any existing dictionary configuration will just drop the newly-defined token types on the floor, leaving you with exactly the same indexing behavior you got in the last three major releases. Changes beyond that are going to need to meet a very high bar of arguable necessity.

>> An issue that's nearly as bad is the 100% lack of documentation,

> I did not provide any explanation as I could not find a place in the
> code to put the documentation (the change is just a modification of the
> state machine).

The code is not the place that I'm complaining about the lack of documentation in. A patch that changes user-visible behavior needs to change the appropriate parts of the SGML documentation also. In the case at hand, I think an absolute minimum level of documentation would involve changing table 12-1 here:

http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html

and probably adding another example to the ones following that table. But there may well be other parts of chapter 12 that need revision also.

Now I will grant that an early-draft patch needn't include final user docs, but if you're omitting that then it's all the more important that you provide clear information with the patch about what it's supposed to do, so that reviewers can understand what the point is. The two sentences of description provided in the commitfest entry were nowhere near enough for intelligent reviewing, IMO.

regards, tom lane
Re: english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
Just a reminder that this patch is discussing how to break urls, emails etc. into their components.

On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [ sorry for not responding on this sooner, it's been hectic the last
> couple weeks ]
>
> Sushant Sinha <sushant354@gmail.com> writes:
> >> I looked at this patch a bit. I'm fairly unhappy that it seems to be
> >> inventing a brand new mechanism to do something the ts parser can
> >> already do. Why didn't you code the url-part mechanism using the
> >> existing support for compound words?
>
> > I am not familiar with the compound word implementation and so I am not
> > sure how to split a url with compound word support. I looked into the
> > documentation for compound words and it does not say much about how to
> > identify the components of a token.
>
> IIRC, the way that that works is associated with pushing a sub-state of
> the state machine in order to scan each compound-word part. I don't
> have the details in my head anymore, though I recall having traced
> through it in the past. Look at the state machine actions that are
> associated with producing the compound word tokens and sub-tokens.

I did look around for compound word support in postgres. In particular, I read the documentation and code in tsearch/spell.c, which seems to implement the compound word support.

So in my understanding, the way it works is:

1. Specify a dictionary of words in which each word has its applicable prefix/suffix flags.
2. Specify a flag file that provides the prefix/suffix operations for those flags.
3. The flag z indicates that a word in the dictionary can participate in compound word splitting.
4. When a token matches words in the dictionary (after applying the affix/suffix operations), the matching words are emitted as sub-words of the token (i.e., a compound word).

If my understanding above is correct, then I think it will not be possible to implement url/email splitting using the compound word support.

The main reason is that the compound word support requires a PRE-DETERMINED dictionary of words. So to split a url/email we would need to provide a list of *all possible* host names and user names. I do not think that is a possibility.

Please correct me if I have mis-understood something.

-Sushant.
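P.S. For reference, this is how such a dictionary is wired up today (CREATE TEXT SEARCH DICTIONARY and the ispell template are existing features; the file names below are placeholders for files that would have to exist under $SHAREDIR/tsearch_data):

    CREATE TEXT SEARCH DICTIONARY url_ispell (
        TEMPLATE = ispell,
        DictFile = url_words,   -- the word list; flag z marks words that may compound
        AffFile  = url_affixes  -- the prefix/suffix rules for the flags
    );

The DictFile is exactly the pre-determined word list that we cannot enumerate for urls/emails.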
Re: english parser in text search: support for multiple words in the same position
From
Sushant Sinha
Date:
Do not know if this mail got lost in between or no one noticed it!

On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote:
> Just a reminder that this patch is discussing how to break urls, emails
> etc. into their components.
>
> On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > [ sorry for not responding on this sooner, it's been hectic the last
> > couple weeks ]
> >
> > Sushant Sinha <sushant354@gmail.com> writes:
> > >> I looked at this patch a bit. I'm fairly unhappy that it seems to be
> > >> inventing a brand new mechanism to do something the ts parser can
> > >> already do. Why didn't you code the url-part mechanism using the
> > >> existing support for compound words?
> >
> > > I am not familiar with the compound word implementation and so I am
> > > not sure how to split a url with compound word support. I looked into
> > > the documentation for compound words and it does not say much about
> > > how to identify the components of a token.
> >
> > IIRC, the way that that works is associated with pushing a sub-state of
> > the state machine in order to scan each compound-word part. I don't
> > have the details in my head anymore, though I recall having traced
> > through it in the past. Look at the state machine actions that are
> > associated with producing the compound word tokens and sub-tokens.
>
> I did look around for compound word support in postgres. In particular,
> I read the documentation and code in tsearch/spell.c, which seems to
> implement the compound word support.
>
> So in my understanding, the way it works is:
>
> 1. Specify a dictionary of words in which each word has its applicable
>    prefix/suffix flags.
> 2. Specify a flag file that provides the prefix/suffix operations for
>    those flags.
> 3. The flag z indicates that a word in the dictionary can participate
>    in compound word splitting.
> 4. When a token matches words in the dictionary (after applying the
>    affix/suffix operations), the matching words are emitted as
>    sub-words of the token (i.e., a compound word).
>
> If my understanding above is correct, then I think it will not be
> possible to implement url/email splitting using the compound word
> support.
>
> The main reason is that the compound word support requires a
> PRE-DETERMINED dictionary of words. So to split a url/email we would
> need to provide a list of *all possible* host names and user names. I
> do not think that is a possibility.
>
> Please correct me if I have mis-understood something.
>
> -Sushant.