Re: english parser in text search: support for multiple words in the same position - Mailing list pgsql-hackers

From	Sushant Sinha
Subject	Re: english parser in text search: support for multiple words in the same position
Date	December 23, 2010 01:35:39
Msg-id	AANLkTin+XiewXD396WMqr-Pnk9QOHday3OTTM3MyS7SR@mail.gmail.com
In response to	Re: english parser in text search: support for multiple words in the same position (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: english parser in text search: support for multiple words in the same position
List	pgsql-hackers

Just a reminder that this patch is about how to break URLs, emails, etc. into their components.

On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

[ sorry for not responding on this sooner, it's been hectic the last
couple weeks ]

Sushant Sinha <sushant354@gmail.com> writes:
>> I looked at this patch a bit. I'm fairly unhappy that it seems to be
>> inventing a brand new mechanism to do something the ts parser can
>> already do. Why didn't you code the url-part mechanism using the
>> existing support for compound words?

> I am not familiar with compound word implementation and so I am not sure
> how to split a url with compound word support. I looked into the
> documentation for compound words and that does not say much about how to
> identify components of a token.

IIRC, the way that that works is associated with pushing a sub-state
of the state machine in order to scan each compound-word part. I don't
have the details in my head anymore, though I recall having traced
through it in the past. Look at the state machine actions that are
associated with producing the compound word tokens and sub-tokens.

I did look around for compound word support in postgres. In particular, I read the documentation and code in tsearch/spell.c that seems to implement the compound word support.

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word has its applicable prefix/suffix flags
2. Specify a flag file that defines the prefix/suffix operations for those flags
3. Flag z indicates that a word in the dictionary can participate in compound word splitting
4. When a token matches words specified in the dictionary (after applying prefix/suffix operations), the matching words are emitted as sub-words of the token (i.e., the token is treated as a compound word)
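
To make the four steps above concrete, here is a minimal sketch of dictionary-driven compound splitting. It is only an illustration of my understanding, not the code in tsearch/spell.c: the dictionary contents, the boolean standing in for flag z, and the function name split_compound are all made up for the example, and prefix/suffix handling is left out entirely.

```python
# Hypothetical toy dictionary: word -> True if it carries the compound flag ("z").
# Real Ispell dictionaries also carry prefix/suffix flags; those are omitted here.
DICTIONARY = {
    "foot": True,
    "ball": True,
    "book": True,
    "shelf": True,
    "the": False,  # known word, but not allowed inside compounds
}


def split_compound(token, dictionary=DICTIONARY):
    """Return the parts of token if it is fully covered by compound-flagged
    dictionary words, otherwise None."""
    token = token.lower()
    if not token:
        return []  # nothing left to cover: the split succeeded
    for end in range(1, len(token) + 1):
        head = token[:end]
        if dictionary.get(head):  # head is known and compound-flagged
            rest = split_compound(token[end:], dictionary)
            if rest is not None:
                return [head] + rest
    return None  # some part of the token is not in the dictionary


print(split_compound("football"))    # parts found in the dictionary
print(split_compound("postgresql"))  # no split: the pieces are not listed
```

The point to notice is that every part must already be listed in the dictionary; a token made of unlisted pieces cannot be split at all.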

If my understanding above is correct, then I think it will not be possible to implement url/email splitting using the compound word support.

The main reason is that the compound word support requires a predetermined dictionary of words. So to split a url/email we would need to provide a list of *all possible* host names and user names. I do not think that is feasible.
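
By contrast, splitting a url or email into components only needs the syntax of the token, not a word list. The sketch below shows that idea using Python's standard urllib.parse; it is not the patch and not how the ts parser's state machine works, and the helper names url_parts and email_parts as well as the sample addresses are hypothetical.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL (with a scheme) into host and path using syntax alone."""
    parts = urlsplit(url)
    out = []
    if parts.hostname:
        out.append(parts.hostname)
    if parts.path and parts.path != "/":
        out.append(parts.path)
    return out


def email_parts(address):
    """Split an email address into user name and host at the last '@'."""
    user, sep, host = address.rpartition("@")
    return [user, host] if sep and user and host else []


# No dictionary of host names or user names is consulted anywhere.
print(url_parts("http://www.example.com/docs/index.html"))
print(email_parts("someone@example.org"))
```

Any host name or user name works here, including ones nobody has seen before, which is exactly what a fixed dictionary cannot provide.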

Please correct me if I have misunderstood something.

-Sushant.
