Thread: english parser in text search: support for multiple words in the same position

english parser in text search: support for multiple words in the same position

From
Sushant Sinha
Date:
Currently the english parser in text search does not support multiple
words in the same position. Consider a word "wikipedia.org". The text
search would return a single token "wikipedia.org". However if someone
searches for "wikipedia org" then there will not be a match. There are
two problems here:

1. We do not have separate tokens "wikipedia" and "org"
2. If we have the two tokens we should have them at adjacent position so
that a phrase search for "wikipedia org" should work.
It would be nice to have the following tokenization and positioning for
"wikipedia.org":

position 0: WORD(wikipedia), URL(wikipedia.org)
position 1: WORD(org)

Take the example of "wikipedia.org/search?q=sushant"

Here is the TSVECTOR:

select to_tsvector('english', 'wikipedia.org/search?q=sushant');

                                 to_tsvector
----------------------------------------------------------------------------
 '/search?q=sushant':3 'wikipedia.org':2 'wikipedia.org/search?q=sushant':1
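
For instance, with the stock english configuration a search for just the
bare word does not match any of those lexemes (a minimal illustration of
problem 1 above):

select to_tsvector('english', 'wikipedia.org/search?q=sushant')
       @@ to_tsquery('english', 'wikipedia');
-- false: the tsvector contains 'wikipedia.org' but no 'wikipedia' lexeme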

And here are the tokens:

select ts_debug('english', 'wikipedia.org/search?q=sushant');

                                    ts_debug
--------------------------------------------------------------------------------
 (url,URL,wikipedia.org/search?q=sushant,{simple},simple,{wikipedia.org/search?q=sushant})
 (host,Host,wikipedia.org,{simple},simple,{wikipedia.org})
 (url_path,"URL path",/search?q=sushant,{simple},simple,{/search?q=sushant})

The tokenization I would like to see is:

position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
position 1: WORD(org)
position 2: WORD(search), URL_PATH(/search?q=sushant)
position 3: WORD(q), URL_QUERY(q=sushant)
position 4: WORD(sushant)
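
Put as a tsvector, that would be roughly the following (hypothetical output,
shown with the usual 1-based positions; no released parser produces this):

-- select to_tsvector('english', 'wikipedia.org/search?q=sushant');
-- 'wikipedia.org/search?q=sushant':1 'wikipedia':1 'org':2
-- '/search?q=sushant':3 'search':3 'q':4 'q=sushant':4 'sushant':5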

So what we need is to support multiple tokens at the same position, and
I need help in understanding how to realize this. Currently the position
assignment happens in make_tsvector by walking over the parsed lexemes.
Each lexeme is obtained from prsd_nexttoken.

However, prsd_nexttoken only returns a single token. Would it be possible
to store some tokens and return them together? Or can we put a flag on
certain tokens saying that the position should not be increased?

-Sushant.




Hi,

On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> 1. We do not have separate tokens "wikipedia" and "org"
> 2. If we have the two tokens we should have them at adjacent position so
> that a phrase search for "wikipedia org" should work.

This would needlessly increase the number of tokens. Instead you'd 
better make it work like compound word support, having just "wikipedia" 
and "org" as tokens.

Searching for "wikipedia.org" or "wikipedia org" should then result in 
the same search query with the two tokens: "wikipedia" and "org".
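
Today the two forms produce different queries (a quick check against the
stock english configuration):

select plainto_tsquery('english', 'wikipedia.org');
-- 'wikipedia.org'
select plainto_tsquery('english', 'wikipedia org');
-- 'wikipedia' & 'org'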

> position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)

IMO the differentiation between WORDs and URLs is not something the text 
search engine should have to care much about. Let it just do the 
searching and make it do that well.

What does a token "wikipedia.org/search?q=sushant" buy you in terms of 
text searching? Or even result highlighting? I wouldn't expect anybody 
to want to search for a full URL, do you?

Regards

Markus Wanner


> On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> > 1. We do not have separate tokens "wikipedia" and "org"
> > 2. If we have the two tokens we should have them at adjacent position so
> > that a phrase search for "wikipedia org" should work.
> 
> This would needlessly increase the number of tokens. Instead you'd 
> better make it work like compound word support, having just "wikipedia" 
> and "org" as tokens.

The current text parser already returns url and url_path. That already
increases the number of unique tokens. I am only asking for adding of
normal english words as well so that if someone types only "wikipedia"
he gets a match. 

> 
> Searching for "wikipedia.org" or "wikipedia org" should then result in 
> the same search query with the two tokens: "wikipedia" and "org".

Earlier people have expressed the need to index urls/emails and
currently the text parser already does so. Reverting that would be a
regression of functionality. Further, a ranking function can take
advantage of direct match of a token.
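
For example (a rough sketch, assuming the host token stays in the index as
it does today), an exact-token query matches and ranks directly:

select ts_rank(to_tsvector('english', 'visit wikipedia.org for details'),
               to_tsquery('english', 'wikipedia.org'));
-- non-zero rank, because the full host token matches as-is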

> > position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
> 
> IMO the differentiation between WORDs and URLs is not something the text 
> search engine should have to care much about. Let it just do the 
> searching and make it do that well.

The Postgres english parser already emits urls as tokens. The only thing I
am asking for is to improve the tokenization and positioning.

> What does a token "wikipedia.org/search?q=sushant" buy you in terms of 
> text searching? Or even result highlighting? I wouldn't expect anybody 
> to want to search for a full URL, do you?

Such a need has been expressed in the past. And an exact token match can
result in better ranking functions. For example, a tf-idf ranking will
rank matches of such unique tokens significantly higher.

-Sushant.

> Regards
> 
> Markus Wanner




Hi,

On 08/02/2010 03:12 PM, Sushant Sinha wrote:
> The current text parser already returns url and url_path. That already
> increases the number of unique tokens.

Well, I think I simply turned that off to be able to search for plain 
words. It still works for complete URLs, those are just treated like 
text, then.

> Earlier people have expressed the need to index urls/emails and
> currently the text parser already does so. Reverting that would be a
> regression of functionality. Further, a ranking function can take
> advantage of direct match of a token.

That's a point, yes. However, simply making the same string turn up 
twice in the tokenizer's output doesn't sound like the right solution to 
me. Especially considering that the query parser uses the very same 
tokenizer.

Regards

Markus Wanner


On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> The current text parser already returns url and url_path. That already
> increases the number of unique tokens. I am only asking for adding of
> normal english words as well so that if someone types only "wikipedia"
> he gets a match.
[...]
> The Postgres english parser already emits urls as tokens. The only thing I
> am asking for is to improve the tokenization and positioning.

Can you write a patch to implement your idea?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


On Mon, 2010-08-02 at 09:32 -0400, Robert Haas wrote:
> On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
> [...]
> > The Postgres english parser already emits urls as tokens. The only thing I
> > am asking for is to improve the tokenization and positioning.
> 
> Can you write a patch to implement your idea?
> 

Yes, that's what I am planning to do. I just wanted to see if anyone can
help me estimate whether this is doable in the current parser or whether I
need to write a new one. If possible, could someone give me some idea of
how to go about implementing it?

-Sushant.



Sushant Sinha <sushant354@gmail.com> writes:
>> This would needlessly increase the number of tokens. Instead you'd 
>> better make it work like compound word support, having just "wikipedia" 
>> and "org" as tokens.

> The current text parser already returns url and url_path. That already
> increases the number of unique tokens. I am only asking for adding of
> normal english words as well so that if someone types only "wikipedia"
> he gets a match. 

The suggestion to make it work like compound words is still a good one,
ie given wikipedia.org you'd get back
    host        wikipedia.org
    host-part   wikipedia
    host-part   org

not just the "host" token as at present.

Then the user could decide whether he needed to index hostname
components or not, by choosing whether to forward hostname-part
tokens to a dictionary or just discard them.
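
Something like the following sketch, where host_part stands in for the
hypothetical new token type and myconfig is a placeholder configuration:

-- index host-name components with the simple dictionary ...
alter text search configuration myconfig
    add mapping for host_part with simple;
-- ... or leave host_part unmapped, in which case those tokens are
-- simply dropped from the tsvector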

If you submit a patch that tries to force the issue by classifying
hostname parts as plain words, it'll probably get rejected out of
hand on backwards-compatibility grounds.
        regards, tom lane


Re: english parser in text search: support for multiple words in the same position

From
"Kevin Grittner"
Date:
Sushant Sinha <sushant354@gmail.com> wrote:
> Yes, that's what I am planning to do. I just wanted to see if anyone
> can help me estimate whether this is doable in the current parser or
> whether I need to write a new one. If possible, could someone give me
> some idea of how to go about implementing it?
The current tsearch parser is a state machine which does clunky mode
switches to handle special cases like you describe.  If you're
looking at doing very much in there, you might want to consider a
rewrite to something based on regular expressions.  See discussion
in these threads:
http://archives.postgresql.org/message-id/200912102005.16560.andres@anarazel.de
http://archives.postgresql.org/message-id/4B210D9E020000250002D344@gw.wicourts.gov
That was actually at the top of my personal PostgreSQL TODO list
(after my current project is wrapped up), but I wouldn't complain if
someone else wanted to take it.  :-)
-Kevin


On Mon, Aug 2, 2010 at 10:21 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Sushant Sinha <sushant354@gmail.com> wrote:
>
>> Yes, that's what I am planning to do. I just wanted to see if anyone
>> can help me estimate whether this is doable in the current parser or
>> whether I need to write a new one. If possible, could someone give me
>> some idea of how to go about implementing it?
>
> The current tsearch parser is a state machine which does clunky mode
> switches to handle special cases like you describe.  If you're
> looking at doing very much in there, you might want to consider a
> rewrite to something based on regular expressions.  See discussion
> in these threads:
>
> http://archives.postgresql.org/message-id/200912102005.16560.andres@anarazel.de
>
> http://archives.postgresql.org/message-id/4B210D9E020000250002D344@gw.wicourts.gov
>
> That was actually at the top of my personal PostgreSQL TODO list
> (after my current project is wrapped up), but I wouldn't complain if
> someone else wanted to take it.  :-)

If you end up rewriting it, it may be a good idea, in the initial
rewrite, to mimic the current results as closely as possible - and
then submit a separate patch to change the results.  Changing two
things at the same time exponentially increases the chance of your
patch getting rejected.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: english parser in text search: support for multiple words in the same position

From
Sushant Sinha
Date:
I have attached a patch that emits parts of a host token, a url token,
an email token and a file token. Further, it makes sure that a
host/url/email/file token and the first part-token are at the same
position in tsvector.

The two major changes are:

1. Tokenization changes: The patch exploits the special handlers in the
text parser to reset the parser position to the start of a
host/url/email/file token when it finds one. Special handlers were
already used for extracting host and urlpath from a full url. So this is
more of an extension of the same idea.

2. Position changes: We do not advance position when we encounter a
host/url/email/file token. As a result the first part of that token
aligns with the token itself.
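
Roughly, the intended effect is the following (illustrative only; the exact
lexemes depend on the configured dictionaries, see the attached
tokens_output.txt for actual output):

-- with the patch applied:
-- select to_tsvector('english', 'wikipedia.org');
-- 'wikipedia.org':1 'wikipedia':1 'org':2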

Attachments:

tokens_output.txt: sample queries and results with the patch
token_v1.patch:    patch wrt cvs head

Currently, the patch outputs parts of the tokens as normal tokens like
WORD, NUMWORD etc. Tom argued earlier that this will break
backward compatibility and that they should instead be emitted as parts of
the respective tokens. If there is agreement on what Tom says, then the
current patch can be modified to output subtokens as parts. However,
before I complicate the patch with that, I wanted to get feedback on any
other major problems with the patch.

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
> >> This would needlessly increase the number of tokens. Instead you'd
> >> better make it work like compound word support, having just "wikipedia"
> >> and "org" as tokens.
>
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
>
> The suggestion to make it work like compound words is still a good one,
> ie given wikipedia.org you'd get back
>
>     host        wikipedia.org
>     host-part    wikipedia
>     host-part    org
>
> not just the "host" token as at present.
>
> Then the user could decide whether he needed to index hostname
> components or not, by choosing whether to forward hostname-part
> tokens to a dictionary or just discard them.
>
> If you submit a patch that tries to force the issue by classifying
> hostname parts as plain words, it'll probably get rejected out of
> hand on backwards-compatibility grounds.
>
>             regards, tom lane


On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> I have attached a patch that emits parts of a host token, a url token,
> an email token and a file token. Further, it makes sure that a
> host/url/email/file token and the first part-token are at the same
> position in tsvector.

You should probably add this patch here:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: english parser in text search: support for multiple words in the same position

From
Sushant Sinha
Date:
Updating the patch to emit parttoken and register it with the
snowball config.

-Sushant.

On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
> On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> > I have attached a patch that emits parts of a host token, a url token,
> > an email token and a file token. Further, it makes sure that a
> > host/url/email/file token and the first part-token are at the same
> > position in tsvector.
>
> You should probably add this patch here:
>
> https://commitfest.postgresql.org/action/commitfest_view/open
>



Re: english parser in text search: support for multiple words in the same position

From
Sushant Sinha
Date:
For the headline generation to work properly, email/file/url/host need
to become skip tokens. Updating the patch with that change.
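
For context, headline generation runs the document text back through the
same parser, e.g. (stock configuration, just to show the code path
involved):

select ts_headline('english',
                   'details at wikipedia.org/search?q=sushant',
                   plainto_tsquery('english', 'details'));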

-Sushant.

On Sat, 2010-09-04 at 13:25 +0530, Sushant Sinha wrote:
> Updating the patch to emit parttoken and register it with the
> snowball config.
>
> -Sushant.
>
> On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote:
> > On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> > > I have attached a patch that emits parts of a host token, a url token,
> > > an email token and a file token. Further, it makes sure that a
> > > host/url/email/file token and the first part-token are at the same
> > > position in tsvector.
> >
> > You should probably add this patch here:
> >
> > https://commitfest.postgresql.org/action/commitfest_view/open
> >
>


Sushant Sinha <sushant354@gmail.com> writes:
> For the headline generation to work properly, email/file/url/host need
> to become skip tokens. Updating the patch with that change.

I looked at this patch a bit.  I'm fairly unhappy that it seems to be
inventing a brand new mechanism to do something the ts parser can
already do.  Why didn't you code the url-part mechanism using the
existing support for compound words?  The changes made to parsetext()
seem particularly scary: it's not clear at all that that's not breaking
unrelated behaviors.  In fact, the changes in the regression test
results suggest strongly to me that it *is* breaking things.  Why are
there so many diffs in examples that include no URLs at all?

An issue that's nearly as bad is the 100% lack of documentation,
which makes the patch difficult to review because it's hard to tell
what it intends to accomplish or whether it's met the intent.
The patch is not committable without documentation anyway, but right
now I'm not sure it's even usefully reviewable.

In line with the lack of documentation, I would say that the choice of
the name "parttoken" for the new token type is not helpful.  Part of
what?  And none of the other token type names include the word "token",
so that's not a good decision either.  Possibly "url_part" would be a
suitable name.
        regards, tom lane


Re: english parser in text search: support for multiple words in the same position

From
Sushant Sinha
Date:
> I looked at this patch a bit.  I'm fairly unhappy that it seems to be
> inventing a brand new mechanism to do something the ts parser can
> already do.  Why didn't you code the url-part mechanism using the
> existing support for compound words? 

I am not familiar with compound word implementation and so I am not sure
how to split a url with compound word support. I looked into the
documentation for compound words and that does not say much about how to
identify components of a token. Does a compound word split by matching
with a list of words? If yes, then we will not be able to use that as we
do not know all the words that can appear in a url/host/email/file.

I think another approach could be to use the dict_regex dictionary
support. However, we would have to match the regex with something that
the parser is doing.

The current patch is not inventing any new mechanism. It uses the
special handler mechanism already present in the parser. For example,
when the current parser finds a URL it runs a special handler called
SpecialFURL, which resets the parser position to the start of the token to
find the hostname. After finding the host it moves on to finding the path.
So you first get the URL, then the host, and finally the path.

Similarly, we are resetting the parser to the start of the token on
finding a url to output url parts. Then before entering the state that
can lead to a url we output the url part. The state machine modification
is similar for other tokens like file/email/host.


> The changes made to parsetext()
> seem particularly scary: it's not clear at all that that's not breaking
> unrelated behaviors.  In fact, the changes in the regression test
> results suggest strongly to me that it *is* breaking things.  Why are
> there so many diffs in examples that include no URLs at all?
> 

I think some of the difference is coming from the fact that now pos
starts with 0 and it used to be 1 earlier. That is easily fixable
though. 
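
For reference, on an unpatched server the positions are user-visible and
start at 1:

select to_tsvector('english', 'quick brown fox');
-- 'brown':2 'fox':3 'quick':1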

> An issue that's nearly as bad is the 100% lack of documentation,
> which makes the patch difficult to review because it's hard to tell
> what it intends to accomplish or whether it's met the intent.
> The patch is not committable without documentation anyway, but right
> now I'm not sure it's even usefully reviewable.

I did not provide any explanation as I could not find any place in the
code to provide the documentation (that was just a modification of state
machine). Should I do a separate write-up to explain the desired output
and the changes to achieve it?

> 
> In line with the lack of documentation, I would say that the choice of
> the name "parttoken" for the new token type is not helpful.  Part of
> what?  And none of the other token type names include the word "token",
> so that's not a good decision either.  Possibly "url_part" would be a
> suitable name.
> 

I can modify it to output url-part/host-part/email-part/file-part if
there is an agreement over the rest of the issues. So let me know if I
should go ahead with this.

-Sushant.



Re: english parser in text search: support for multiple words in the same position

From
Sushant Sinha
Date:
Any updates on this?

On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha <sushant354@gmail.com> wrote:
> [...]

On Wed, Sep 29, 2010 at 1:29 AM, Sushant Sinha <sushant354@gmail.com> wrote:
> Any updates on this?
>
>
> On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha <sushant354@gmail.com>
> wrote:
>>
>> > I looked at this patch a bit.  I'm fairly unhappy that it seems to be
>> > inventing a brand new mechanism to do something the ts parser can
>> > already do.  Why didn't you code the url-part mechanism using the
>> > existing support for compound words?
>>
>> I am not familiar with compound word implementation and so I am not sure
>> how to split a url with compound word support. I looked into the
>> documentation for compound words and that does not say much about how to
>> identify components of a token. Does a compound word split by matching
>> with a list of words? If yes, then we will not be able to use that as we
>> do not know all the words that can appear in a url/host/email/file.

It seems to me that you need to familiarize yourself with this stuff
and then post an analysis, or a new patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


[ sorry for not responding on this sooner, it's been hectic the last couple weeks ]

Sushant Sinha <sushant354@gmail.com> writes:
>> I looked at this patch a bit.  I'm fairly unhappy that it seems to be
>> inventing a brand new mechanism to do something the ts parser can
>> already do.  Why didn't you code the url-part mechanism using the
>> existing support for compound words? 

> I am not familiar with compound word implementation and so I am not sure
> how to split a url with compound word support. I looked into the
> documentation for compound words and that does not say much about how to
> identify components of a token.

IIRC, the way that that works is associated with pushing a sub-state
of the state machine in order to scan each compound-word part.  I don't
have the details in my head anymore, though I recall having traced
through it in the past.  Look at the state machine actions that are
associated with producing the compound word tokens and sub-tokens.

> The current patch is not inventing any new mechanism. It uses the
> special handler mechanism already present in the parser.

The fact that that mechanism is there doesn't mean that it's the right
one for this task.  I think that Teodor meant it for other things
altogether.  If it were the best way to solve this problem, why wouldn't
he have used it for compound words?

>> The changes made to parsetext()
>> seem particularly scary: it's not clear at all that that's not breaking
>> unrelated behaviors.  In fact, the changes in the regression test
>> results suggest strongly to me that it *is* breaking things.  Why are
>> there so many diffs in examples that include no URLs at all?

> I think some of the difference is coming from the fact that now pos
> starts with 0 and it used to be 1 earlier. That is easily fixable
> though. 

You cannot seriously believe that it's okay for a patch to just
arbitrarily change such an easily user-visible behavior.  This comes
back again to the point that this patch is not going to get in at all
unless it makes the absolute minimum amount of change in the established
behavior of the parser.  I think we can probably accept a patch that
produces new tokens (of newly-defined types) in addition to what it
already produced for URL-looking input, because any existing dictionary
configuration will just drop the newly-defined token types on the floor
leaving you with exactly the same indexing behavior you got in the
last three major releases.  Changes beyond that are going to need to
meet a very high bar of arguable necessity.
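
To illustrate that point, the drop-on-the-floor behavior for unmapped token
types can be seen today with a throwaway copy of the english configuration
(sketch):

create text search configuration english_nourl ( copy = english );
alter text search configuration english_nourl drop mapping for url;
select to_tsvector('english_nourl', 'wikipedia.org/search?q=sushant');
-- the url token disappears; host and url_path are still indexed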

>> An issue that's nearly as bad is the 100% lack of documentation,

> I did not provide any explanation as I could not find any place in the
> code to provide the documentation (that was just a modification of state
> machine).

The code is not the place that I'm complaining about the lack of
documentation in.  A patch that changes user-visible behavior needs to
change the appropriate parts of the SGML documentation also.  In the
case at hand, I think an absolute minimum level of documentation would
involve changing table 12-1 here:
http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
and probably adding another example to the ones following that table.
But there may well be other parts of chapter 12 that need revision also.
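
For what it's worth, the token types that table documents can also be
listed at run time, which is where a newly added type would have to show
up as well:

select alias, description from ts_token_type('default');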

Now I will grant that an early-draft patch needn't include final user
docs, but if you're omitting that then it's all the more important that
you provide clear information with the patch about what it's supposed to
do, so that reviewers can understand what the point is.  The two
sentences of description provided in the commitfest entry were nowhere
near enough for intelligent reviewing, IMO.
        regards, tom lane


Re: english parser in text search: support for multiple words in the same position

From
Sushant Sinha
Date:
Just a reminder that this patch is discussing how to break url, emails etc
into its components.

On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [...]
> IIRC, the way that that works is associated with pushing a sub-state
> of the state machine in order to scan each compound-word part.  I don't
> have the details in my head anymore, though I recall having traced
> through it in the past.  Look at the state machine actions that are
> associated with producing the compound word tokens and sub-tokens.

I did look around for compound word support in postgres. In particular, I
read the documentation and code in tsearch/spell.c that seems to implement
the compound word support.

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word will have applicable
prefix/suffix flags

2. Specify a flag file that provides prefix/suffix operations on those
flags

3. flag z indicates that a word in the dictionary can participate in
compound word splitting

4. When a token matches words specified in the dictionary (after applying
affix/suffix operations), the matching words are emitted as sub-words of
the token (i.e., compound word)

If my above understanding is correct, then I think it will not be possible
to implement url/email splitting using the compound word support.

The main reason is that the compound word support requires the
"PRE-DETERMINED" dictionary of words. So to split a url/email we will need
to provide a list of *all possible* host names and user names. I do not
think that is a possibility.

Please correct me if I have mis-understood something.

-Sushant.

I do not know if this mail got lost in between or no one noticed it!

On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote:
Just a reminder that this patch is discussing  how to break url, emails
etc into its components.
> 
> On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>         [ sorry for not responding on this sooner, it's been hectic
>         the last
>          couple weeks ]
>         
>         Sushant Sinha <sushant354@gmail.com> writes:
>         
>         >> I looked at this patch a bit.  I'm fairly unhappy that it
>         seems to be
>         >> inventing a brand new mechanism to do something the ts
>         parser can
>         >> already do.  Why didn't you code the url-part mechanism
>         using the
>         >> existing support for compound words?
>         
>         > I am not familiar with compound word implementation and so I
>         am not sure
>         > how to split a url with compound word support. I looked into
>         the
>         > documentation for compound words and that does not say much
>         about how to
>         > identify components of a token.
>         
>         
>         IIRC, the way that that works is associated with pushing a
>         sub-state
>         of the state machine in order to scan each compound-word
>         part.  I don't
>         have the details in my head anymore, though I recall having
>         traced
>         through it in the past.  Look at the state machine actions
>         that are
>         associated with producing the compound word tokens and
>         sub-tokens.
> 

I did look around for compound word support in postgres. In particular,
I read the documentation and code in tsearch/spell.c that seems to
implement the compound word support. 

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word will have applicable
prefix/suffix flags

2. Specify a flag file that provides prefix/suffix operations on those
flags

3. flag z indicates that a word in the dictionary can participate in
compound word splitting

4. When a token matches words specified in the dictionary (after
applying affix/suffix operations), the matching words are emitted as
sub-words of the token (i.e., compound word)

If my above understanding is correct, then I think it will not be
possible to implement url/email splitting using the compound word
support.

The main reason is that the compound word support requires the
"PRE-DETERMINED" dictionary of words. So to split a url/email we will
need to provide a list of *all possible* host names and user names. I do
not think that is a possibility.
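
For concreteness, this is roughly how a compound-capable dictionary is
wired up today (the dict/affix file names are placeholders and must exist
under $SHAREDIR/tsearch_data):

create text search dictionary my_ispell (
    template = ispell,
    dictfile = ispell_sample,
    afffile  = ispell_sample
);
-- splitting only works for tokens whose parts appear in that word list,
-- which is exactly what we cannot enumerate for hosts/urls/emails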

Please correct me if I have mis-understood something.

-Sushant.