I do not know whether this mail got lost in transit or no one noticed it!
On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote:
> Just a reminder that this patch is discussing how to break URLs, emails,
> etc. into their components.
>
> On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [ sorry for not responding on this sooner, it's been hectic the last
> couple weeks ]
>
> Sushant Sinha <sushant354@gmail.com> writes:
>
> >> I looked at this patch a bit. I'm fairly unhappy that it seems to be
> >> inventing a brand new mechanism to do something the ts parser can
> >> already do. Why didn't you code the url-part mechanism using the
> >> existing support for compound words?
>
> > I am not familiar with compound word implementation and so I am not sure
> > how to split a url with compound word support. I looked into the
> > documentation for compound words and that does not say much about how to
> > identify components of a token.
>
>
> IIRC, the way that that works is associated with pushing a sub-state
> of the state machine in order to scan each compound-word part. I don't
> have the details in my head anymore, though I recall having traced
> through it in the past. Look at the state machine actions that are
> associated with producing the compound word tokens and sub-tokens.
>
I did look around for compound word support in postgres. In particular,
I read the documentation and code in tsearch/spell.c that seems to
implement the compound word support.
So, in my understanding, the way it works is:
1. Specify a dictionary of words in which each word carries its applicable
prefix/suffix flags
2. Specify an affix file that defines the prefix/suffix operations for
those flags
3. Flag z indicates that a word in the dictionary can participate in
compound word splitting
4. When a token matches words specified in the dictionary (after
applying the prefix/suffix operations), the matching words are emitted as
sub-words of the token (i.e., of the compound word)
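
To make step 4 concrete, here is a toy sketch in Python of
dictionary-driven splitting. It is only an illustration of the idea as
described above, not the algorithm in tsearch/spell.c: the word list and
the function name split_compound are hypothetical, every word is assumed
to carry the compound flag, and prefix/suffix operations are ignored.

```python
# Toy model of dictionary-driven compound splitting.
# Assumption: this is NOT the spell.c algorithm, only a sketch of the idea.
# Hypothetical word list; every entry is assumed to carry the compound flag.
COMPOUND_WORDS = {"sjokolade", "fabrikk"}


def split_compound(token, words=COMPOUND_WORDS):
    """Return one split of token into dictionary words, or None if impossible."""
    if token == "":
        return []
    # Try the longest candidate prefix first, then shorter ones.
    for end in range(len(token), 0, -1):
        head = token[:end]
        if head in words:
            rest = split_compound(token[end:], words)
            if rest is not None:
                return [head] + rest
    # No prefix of the token is a known word, so nothing can be emitted.
    return None


print(split_compound("sjokoladefabrikk"))
print(split_compound("www.example.com"))
```

The first call can be split because both parts are in the word list; the
second cannot, because none of its parts were listed in advance.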
If my above understanding is correct, then I think it will not be
possible to implement url/email splitting using the compound word
support.
The main reason is that the compound word support requires a
"PRE-DETERMINED" dictionary of words. So to split a url/email we would
need to provide a list of *all possible* host names and user names. I do
not think that is feasible.
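
By contrast, a url or email can be broken up purely from its structure,
with no word list at all. A minimal sketch in Python, assuming the
standard library's urllib.parse is an acceptable stand-in for what the
parser would do (the helper names url_parts and email_parts are
hypothetical, and this is not the patch's implementation):

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a url on its delimiters; no dictionary of host names is needed."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}


def email_parts(address):
    """Split an email address at the last '@' into user and host."""
    user, _, host = address.rpartition("@")
    return {"user": user, "host": host}


print(url_parts("http://www.example.com/docs/index.html"))
print(email_parts("someone@example.com"))
```

The components fall out of the delimiters ("://", "/", "@"), so the host
and user names never have to be known beforehand.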
Please correct me if I have misunderstood something.
-Sushant.