Re: english parser in text search: support for multiple words in the same position - Mailing list pgsql-hackers

From Sushant Sinha
Subject Re: english parser in text search: support for multiple words in the same position
Date
Msg-id 1283323324.2084.22.camel@dragflick
In response to Re: english parser in text search: support for multiple words in the same position  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: english parser in text search: support for multiple words in the same position
List pgsql-hackers
I have attached a patch that emits the parts of host, url, email, and
file tokens. Further, it makes sure that a host/url/email/file token
and its first part-token are at the same position in the tsvector.
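
For reference, an unpatched server indexes the whole host as a single
token, so a search on just one of its parts does not match. Roughly
(exact psql formatting aside):

    SELECT to_tsvector('english', 'wikipedia.org');
       to_tsvector
    -------------------
     'wikipedia.org':1

    SELECT to_tsvector('english', 'wikipedia.org')
           @@ to_tsquery('english', 'wikipedia');
     ?column?
    ----------
     f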

The two major changes are:

1. Tokenization changes: The patch uses the special handlers in the
text parser to reset the parser position to the start of a
host/url/email/file token when it finds one, so that the token is
re-scanned and its parts are emitted as separate tokens. Special
handlers were already used for extracting the host and url_path from a
full url, so this is an extension of the same idea.

2. Position changes: We do not advance the position when we encounter
a host/url/email/file token. As a result, the first part of that token
aligns with the token itself (see the sketch below).
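
As a sketch of the intended behaviour with the patch applied (this is
illustrative, not copied from tokens_output.txt; the exact lexemes
depend on the configuration and dictionaries used):

    SELECT to_tsvector('english', 'wikipedia.org');
                  to_tsvector
    -------------------------------------------
     'org':2 'wikipedia':1 'wikipedia.org':1

Here 'wikipedia.org' and its first part 'wikipedia' share position 1,
and a query for just 'wikipedia' now matches.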

Attachments:

tokens_output.txt: sample queries and results with the patch
token_v1.patch:    patch against CVS HEAD

Currently, the patch outputs the parts of these tokens as normal token
types like WORD, NUMWORD, etc. Tom argued earlier that this will break
backward compatibility and that the parts should instead be output as
part types of the respective tokens. If there is agreement on what Tom
suggests, the current patch can be modified to output the subtokens as
parts. However, before I complicate the patch with that, I wanted to
get feedback on any other major problems with the patch.
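
For what it's worth, if the subtokens do get their own types as Tom
suggests below, indexing them becomes a per-configuration choice. A
hedged sketch, where host_part is a hypothetical token type name that
neither core nor the current patch defines:

    -- keep the parts, mapping them to the simple dictionary
    ALTER TEXT SEARCH CONFIGURATION english
        ALTER MAPPING FOR host_part WITH simple;

    -- or discard them entirely
    ALTER TEXT SEARCH CONFIGURATION english
        DROP MAPPING FOR host_part;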

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha <sushant354@gmail.com> writes:
> >> This would needlessly increase the number of tokens. Instead you'd
> >> better make it work like compound word support, having just "wikipedia"
> >> and "org" as tokens.
>
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
>
> The suggestion to make it work like compound words is still a good one,
> ie given wikipedia.org you'd get back
>
>     host        wikipedia.org
>     host-part    wikipedia
>     host-part    org
>
> not just the "host" token as at present.
>
> Then the user could decide whether he needed to index hostname
> components or not, by choosing whether to forward hostname-part
> tokens to a dictionary or just discard them.
>
> If you submit a patch that tries to force the issue by classifying
> hostname parts as plain words, it'll probably get rejected out of
> hand on backwards-compatibility grounds.
>
>             regards, tom lane

