Thread: ts_headline

ts_headline

From
Stephen Davies
Date:
I am a bit puzzled by the output of ts_headline (V8.3) for different queries.

I have one record in a test documentation table and am applying different
queries against that table to check out the ts_headline outputs.

The "document" in question has 2553 words, which generate 519 lexemes in the
tsvector.
For most queries, ts_headline returns a string starting with one of the
criterion words and with all criterion words highlighted - as I would expect.

However, some other queries return a string which always seems to start at the
beginning of the "document" and contains no highlighted terms.

It seems that the difference is in the number of occurrences of the criterion
words. If the number of hits is less than some number, the ts_headline result
is "correct" but if the number of hits exceeds that limit, the result is just
the first MinWords of the "document".

I have seen cases with up to 20 hits succeed but cases with 35 hits miss.
The spread of hits does not seem to be relevant.

Is this a bug or am I missing some configuration option?

TIA,
Stephen Davies
--
========================================================================
This email is for the person(s) identified above, and is confidential to
the sender and the person(s).  No one else is authorised to use or
disseminate this email or its contents.

Stephen Davies Consulting                            Voice: 08-8177 1595
Adelaide, South Australia.                             Fax: 08-8177 0133
Computing & Network solutions.                       Mobile:0403 0405 83

Re: ts_headline

From
Richard Huxton
Date:
Stephen Davies wrote:
> I am a bit puzzled by the output of ts_headline (V8.3) for different queries.

> It seems that the difference is in the number of occurrences of the criterion
> words. If the number of hits is less than some number, the ts_headline result
> is "correct" but if the number of hits exceeds that limit, the result is just
> the first MinWords of the "document".
>
> I have seen cases with up to 20 hits succeed but cases with 35 hits miss.
> The spread of hits does not seem to be relevant.

http://www.postgresql.org/docs/8.3/static/textsearch-controls.html#TEXTSEARCH-HEADLINE

Are you bumping into the MaxWords limit? If you've got 35 hits that must
be at least 35 words.

--
   Richard Huxton
   Archonet Ltd

Re: ts_headline

From
Stephen Davies
Date:
G'day Richard.

I don't think so. A sample command is:

ts_headline(abstract, to_tsquery('english','database'),
            'minWords = 99, maxWords = 999')

I have also tried with smaller maxwords without any visible effect.

Cheers,
Stephen


On Thursday 21 February 2008 19:19, Richard Huxton wrote:
> [snip]

Re: ts_headline

From
Richard Huxton
Date:
Stephen Davies wrote:
> G'day Richard.
>
> I don't think so. A sample command is:
>
> ts_headline(abstract, to_tsquery('english','database'),
>             'minWords = 99, maxWords = 999')
>
> I have also tried with smaller maxwords without any visible effect.

Hmm - a simple test seems to work OK.

SELECT ts_headline(repeat('apple banana carrot ', 100),
                   to_tsquery('apple'));
                                                               ts_headline
----------------------------------------------------------------------------------------------------------------------------------------
 <b>apple</b> banana carrot <b>apple</b> banana carrot <b>apple</b> banana carrot <b>apple</b> banana carrot <b>apple</b> banana carrot
(1 row)

It's not just the start of the text either:

SELECT ts_headline(repeat('elephant ', 100) || repeat('apple banana carrot ', 100),
                   to_tsquery('apple'));
                                                               ts_headline
----------------------------------------------------------------------------------------------------------------------------------------
 <b>apple</b> banana carrot <b>apple</b> banana carrot <b>apple</b> banana carrot <b>apple</b> banana carrot <b>apple</b> banana carrot
(1 row)

Can you provide a piece of text that shows the problem?

--
   Richard Huxton
   Archonet Ltd

Re: ts_headline

From
Stephen Davies
Date:
Attached is the "document" in question.

Searches for "norwegian", "thesaurus" and "statement" give good results. A
search for "database" gives the plain text from the beginning.

Cheers and thanks,
Stephen Davies

On Thursday 21 February 2008 20:08, Richard Huxton wrote:
> [snip]


Re: ts_headline

From
Richard Huxton
Date:
Stephen Davies wrote:
> Attached is the "document" in question.
>
> Searches for "norwegian", "thesaurus" and "statement" give good results. A
> search for "database" gives the plain text from the beginning.

Seems OK here - might need to look at your configuration settings.
http://www.postgresql.org/docs/8.3/static/textsearch-debugging.html

I'll make sure I've got a clean setup here and re-run the test.


SELECT ts_headline(t, to_tsquery('database')) FROM tsearch_test;
                                               ts_headline
-------------------------------------------------------------------------------------------------------
 <b>database</b> (using a 2 KB page) to a Large File Support (LFS) <b>database</b> (using an 8 KB page
(1 row)

--
   Richard Huxton
   Archonet Ltd

Re: ts_headline

From
Stephen Davies
Date:
Interesting. I hadn't seen that section before.

As I said in my original post: "Is this a bug or am I missing some
configuration option".

I shall investigate the stuff in 12.8.
Any suggestions as to where to start?

Thanks,
Stephen Davies

 On Thursday 21 February 2008 20:50, Richard Huxton wrote:
> [snip]

Re: ts_headline

From
Richard Huxton
Date:
Stephen Davies wrote:
> Interesting. I hadn't seen that section before.
>
> As I said in my original post: "Is this a bug or am I missing some
> configuration option".
>
> I shall investigate the stuff in 12.8.
> Any suggestions as to where to start?

Well, no-one has been using 8.3 for more than a few weeks, so I don't
think anyone has a lot of experience with it.

I'm getting to_tsquery('database') stemming to "databas" and
to_tsvector(<your text>) gives me the following for that token:


'databas':305,630,642,663,698,719,746,870,872,951,961,974,993,1034,1042,1159,1217,1223,1238,1244,1265,1343,1357,1399,1434,1758,1818,1821,1834,2212,2240,2258,2278,2389,2529


--
   Richard Huxton
   Archonet Ltd

Re: ts_headline

From
Stephen Davies
Date:
I just spotted the difference between your test and mine.

My query says:

select ts_headline(abstract, to_tsquery('english','database'),
                   'minWords = 99, maxWords = 999')
from document where id=21;

where your equivalent does not include the 'english' arg.

If I take out the 'english' from this query, I get the same result as you.

However, the following returns zero rows:

select title,author,ts_headline(abstract,to_tsquery('database')) from document
where clob @@ to_tsquery('database')

It gets more interesting:

select title,author,ts_headline(abstract,to_tsquery('database')) from document
where clob @@ to_tsquery('english','database')

returns the "correct" result - one row with the expected headline.

select title,author,ts_headline(abstract,to_tsquery('english','thesaurus'))
from document where clob @@ to_tsquery('english','thesaurus')

also returns the "correct" result.

I suggest that the above indicates a bug somewhere.

Cheers and thanks,
Stephen Davies


On Thursday 21 February 2008 20:50, Richard Huxton wrote:
> [snip]

Re: ts_headline

From
Richard Huxton
Date:
Stephen Davies wrote:
> I just spotted the difference between your test and mine.
>
> My query says:
>
> select ts_headline(abstract, to_tsquery('english','database'),
>                    'minWords = 99, maxWords = 999')
> from document where id=21;
>
> where your equivalent does not include the 'english' arg.
>
> If I take out the 'english' from this query, I get the same result as you.

What does this give you:
   show default_text_search_config;
I get pg_catalog.english and the same result for the query whether I use:
    to_tsquery('english','database')
or to_tsquery('pg_catalog.english','database')

Could you be picking up a bad "english" configuration (see \dF)?

> However, the following returns zero rows:
>
> select title,author,ts_headline(abstract,to_tsquery('database')) from document
> where clob @@ to_tsquery('database')

I take it "clob" matches "abstract"?

> It gets more interesting:
>
> select title,author,ts_headline(abstract,to_tsquery('database')) from document
> where clob @@ to_tsquery('english','database')
>
> returns the "correct" result - one row with the expected headline.

Now that *is* strange. ts_headline() works without specifying 'english'
but the actual search works the other way.

> select title,author,ts_headline(abstract,to_tsquery('english','thesaurus'))
> from document where clob @@ to_tsquery('english','thesaurus')
>
> also returns the "correct" result.
>
> I suggest that the above indicates a bug somewhere.

Could be - it'd be good to rule out a bad config. You might have an
unexpected list of stopwords or similar.

Let's try:
  SELECT ts_debug('the database and thesaurus');
  SELECT ts_debug('english', 'the database and thesaurus');
  SELECT ts_debug('pg_catalog.english', 'the database and thesaurus');
I'd expect "the", "and" to be stripped out as stopwords and the other
two to get through (database stemmed to "databas").


--
   Richard Huxton
   Archonet Ltd

Re: ts_headline

From
Stephen Davies
Date:
OK. The first level explanation is that my default config is "simple".
This explains the different query results, as "english" reduces "database" to
"databas" while "simple" does not reduce it at all.

The "document" is parsed/indexed using "english" explicitly, so my queries need
to be explicit also (not an issue, as all "real" queries are generated rather
than typed).

However, I still cannot see a reason for the ts_headline results. If anything,
they should be the other way around.

I suspect that ts_headline may only work properly when no configuration is
specified - regardless of the default setting.

Cheers,
Stephen

On Thursday 21 February 2008 22:30, Richard Huxton wrote:
> [snip]

Re: ts_headline

From
Richard Huxton
Date:
Stephen Davies wrote:
> OK. The first level explanation is that my default config is "simple".

Aha! Actually, that's the whole explanation.

> This explains the different query results, as "english" reduces "database" to
> "databas" while "simple" does not reduce it at all.

Exactly.
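
A quick way to see the difference being described is to normalise the same word under each configuration. This is only a sketch and assumes the built-in "english" and "simple" configurations and the stock english_stem and simple dictionaries of an unmodified 8.3 install. The commented results are the ones reported in this thread, not freshly captured output.

```sql
-- Query side: the same word normalised under each configuration.
SELECT to_tsquery('english', 'database');  -- the thread reports the lexeme 'databas'
SELECT to_tsquery('simple',  'database');  -- left unstemmed: 'database'

-- Dictionary side: ask each dictionary directly what it does with the token.
-- (Assumes the stock dictionary names english_stem and simple.)
SELECT ts_lexize('english_stem', 'database');
SELECT ts_lexize('simple', 'database');
```

If the two to_tsquery results differ, text parsed under one configuration cannot be matched by a query built under the other.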

> The "document" is parsed/indexed using "english" explicitly, so my queries need
> to be explicit also (not an issue, as all "real" queries are generated rather
> than typed).

Or change your default configuration to match the one you're using.

> However, I still cannot see a reason for the ts_headline results. If anything,
> they should be the other way around.
>
> I suspect that ts_headline may only work properly when no configuration is
> specified - regardless of the default setting.

No. What's happening is that your tsvector representation of the
document (which gets indexed) contains lexemes processed by your
"english" config. So, it will have something like:
   ... databas: 123, 129, 200 ...
Of course, when you do a tsquery search with the "simple" configuration it
doesn't do any stemming, so it is actually looking for a lexeme called
"database", which it can't find.

Since it can't find anything, it falls back to displaying just the start
of the document. Since the alternative would be to display nothing, that
makes a certain amount of sense.

To check this, try: ts_headline(t, to_tsquery('simple','databas')) and
you should get your database results.


Moral of the story: if you specify a configuration anywhere, specify it
everywhere.

Thanks for working through this Stephen - good question specification btw.

--
   Richard Huxton
   Archonet Ltd

Re: ts_headline

From
Stephen Davies
Date:
Not quite:-(

It is the ts_headline with the explicit "english" configuration that "fails"
rather than the implicit "simple".

That's what is so weird.

As you say, the tsvector has "databas" so the "english" version of
ts_headline should work - but it doesn't. The "simple" version does, despite
the above.

Weird!

Stephen

On Friday 22 February 2008 19:33, Richard Huxton wrote:
> [snip]

Re: ts_headline

From
Richard Huxton
Date:
Stephen Davies wrote:
> Not quite:-(
>
> It is the ts_headline with the explicit "english" configuration that "fails"
> rather than the implicit "simple".

Hmm... arse.

> That's what is so weird.
>
> As you say, the tsvector has "databas" so the "english" version of
> ts_headline should work - but it doesn't. The "simple" version does, despite
> the above.

[goes away, tests some more]

OK, so:

set default_text_search_config = 'simple';
SELECT ts_headline('my database is a database', to_tsquery('database'));
SELECT ts_headline('my database is a database', to_tsquery('simple',
'database'));
SELECT ts_headline('my database is a database', to_tsquery('english',
'database'));

The first two work, the last one doesn't.

set default_text_search_config = 'english';
SELECT ts_headline('my database is a database', to_tsquery('database'));
SELECT ts_headline('my database is a database', to_tsquery('simple',
'database'));
SELECT ts_headline('my database is a database', to_tsquery('english',
'database'));

The middle one doesn't work.

Note that there are no indexes involved here, we're just running against
the raw text.

[light goes on over sluggish London-based database chap]

When the ts_headline function is working on the text, it needs to
convert it from varchar/text type to tsvector so that it can use the
tsquery to find words to highlight.

When it converts the text to a tsvector, it's doing it based on
default_text_search_config - we've not told it otherwise. In an ideal
world, it would look "inside" the tsquery and see what config that was
using, but it can't (or at least doesn't).

Of course, if to_tsquery()'s config doesn't match ts_headline()'s then
we get a problem.
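
The mismatch can be illustrated without ts_headline at all. The sketch below uses to_tsvector as a stand-in for whatever parsing ts_headline does internally; that substitution is an assumption made purely for illustration.

```sql
SET default_text_search_config = 'simple';

-- Text parsed with the default ('simple') config keeps the lexeme 'database',
-- while the query is stemmed by 'english', so no match is expected.
SELECT to_tsvector('my database is a database')
       @@ to_tsquery('english', 'database') AS mismatched_configs;

-- Same config on both sides: both reduce to the same lexeme,
-- so a match is expected.
SELECT to_tsvector('english', 'my database is a database')
       @@ to_tsquery('english', 'database') AS matching_configs;
```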

And, if I actually bother to read an up-to-date copy of the manual,
rather than the beta version I've got linked on my desktop I can see
there's a parameter for ts_headline. So...

set default_text_search_config = 'simple';
SELECT ts_headline('english', 'my database is a database',
   to_tsquery('english','database')
);

set default_text_search_config = 'english';
SELECT ts_headline('simple', 'my database is a database',
   to_tsquery('simple','database')
);


These all work fine. Phew!
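
Applied to the table query from earlier in the thread, the same idea would look roughly like the sketch below. The table and column names (document, title, author, abstract, clob) and the option values come from the queries quoted above. The untested assumption is that passing the same configuration to both ts_headline and to_tsquery is enough for that table.

```sql
-- Sketch only: names are taken from the queries quoted earlier in the thread.
SELECT title,
       author,
       ts_headline('english',                         -- config used to parse the text
                   abstract,
                   to_tsquery('english', 'database'), -- same config for the query
                   'MinWords=99, MaxWords=999')
FROM document
WHERE clob @@ to_tsquery('english', 'database');
```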

--
   Richard Huxton
   Archonet Ltd

Re: ts_headline

From
Stephen Davies
Date:
Unfortunately, my link to the box with the test database is down due to lack
of maintenance by our local telco (Telstra) but I think that I also missed
the optional config arg to ts_headline.

The lack of link also means that I cannot confirm your findings but your logic
looks good.

It raises the question, however, of why ts_headline needs to reparse the raw
text.

At least in my case, I am using a trigger to parse the combination of Title
and Abstract into a tsvector column in the table row (as suggested in 12.2.2
and 12.4.3 in the doco), so the tsvector is already available to
ts_headline.

If ts_headline had the ability to use that pre-parsed ts_vector, my problem
would never have arisen - and the performance of ts_headline would be
improved.

Cheers and thanks,
Stephen

On Friday 22 February 2008 20:00, Richard Huxton wrote:
> [snip]

Re: ts_headline

From
Richard Huxton
Date:
Stephen Davies wrote:
> Unfortunately, my link to the box with the test database is down due to lack
> of maintenance by our local telco (Telstra) but I think that I also missed
> the optional config arg to ts_headline.
>
> The lack of link also means that I cannot confirm your findings but your logic
> looks good.

Looks like ALTER DATABASE <your database> SET
default_text_search_config = 'english' is what you need.
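
Spelled out, that might look like the following. Note that the parameter is named default_text_search_config, as in the SHOW command used earlier in the thread. The database name "docs" is hypothetical; substitute your own.

```sql
-- "docs" is a hypothetical database name used only for illustration.
ALTER DATABASE docs SET default_text_search_config = 'pg_catalog.english';

-- A per-database setting takes effect for new sessions; to change and
-- check the current session as well:
SET default_text_search_config = 'pg_catalog.english';
SHOW default_text_search_config;
```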

> It raises the question, however, of why ts_headline needs to reparse the raw
> text.

It needs to line up tsvector lexemes with actual characters in the text.
The tsvector is missing punctuation, any stopwords (the, it, a) as well
as being stemmed (if your dictionary does that).

Also, it's looking for a short span of words that provide the best
match. That might not be a complete match of course, and is different to
how you'd normally look to use a tsvector.

> At least in my case, I am using a trigger to parse the combination of Title
> and Abstract into a tsvector column in the table row (as suggested in 12.2.2
> and 12.4.3 in the doco), so the tsvector is already available to
> ts_headline.
>
> If ts_headline had the ability to use that pre-parsed ts_vector, my problem
> would never have arisen - and the performance of ts_headline would be
> improved.

Maybe. It would still have to parse the text to some degree though, just
to get the original words & punctuation into the headline.
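
For reference, a trigger of the kind described in the documentation sections mentioned above can pin the configuration explicitly, so that indexing does not depend on default_text_search_config. This is a sketch under assumptions: the column names (clob, title, abstract) are inferred from the queries in this thread, the trigger name is made up, and the actual trigger in use may differ.

```sql
-- Hypothetical trigger name; column names inferred from the thread.
-- tsvector_update_trigger takes the tsvector column, a schema-qualified
-- configuration name, and then the text columns to combine.
CREATE TRIGGER document_clob_update
    BEFORE INSERT OR UPDATE ON document
    FOR EACH ROW
    EXECUTE PROCEDURE
        tsvector_update_trigger(clob, 'pg_catalog.english', title, abstract);
```

Queries and ts_headline calls against that table would then name the same configuration.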

--
   Richard Huxton
   Archonet Ltd

Re: ts_headline

From
Stephen Davies
Date:
Hmmmm!
I think I now understand the ts position better, thank you.

Part of my problem has been that I am used to the functionality of Open Text's
LCS (aka BASIS) product which handles text differently.

It includes the position (and context) information in the index and does
"remember" how the text was parsed so does not need to reparse to insert hit
navigation tags nor need pointers as to how to parse queries. (It also
supports phrase searching.)

Now that I have a better understanding of ts, I think I will be able to make
it do at least most of what I hoped for.

Thank you again for your help with this.

Cheers,
Stephen Davies

On Friday 22 February 2008 20:45, Richard Huxton wrote:
> Stephen Davies wrote:
> > Unfortunately, my link to the box with the test database is down due to
> > lack of maintenance by our local telco (Telstra) but I think that I also
> > missed the optional config arg to ts_headline.
> >
> > The lack of link also means that I cannot confirm your findings but your
> > logic looks good.
>
> Looks like ALTER DATABASE SET default_text_config='english' is what you
> need.
>
> > It begs the question, however, as to why ts-headline needs to reparse the
> > raw text.
>
> It needs to line up tsvector lexemes with actual characters in the text.
> The tsvector is missing punctuation, any stopwords (the, it, a) as well
> as being stemmed (if your dictionary does that).
>
> Also, it's looking for a short span of words that provide the best
> match. That might not be a complete match of course, and is different to
> how you'd normally look to use a tsvector.
>
> > At least in my case, I am using a trigger to parse the combination of
> > Title and Abstract to a ts_vector field in the table row (as suggested in
> > 12.2.2 and 12.4.3 in the doco) so that the ts_vector is already available
> > to ts_headline.
> >
> > If ts_headline had the ability to use that pre-parsed ts_vector, my
> > problem would never have arisen - and the performance of ts_headline
> > would be improved.
>
> Maybe. It would still have to parse the text to some degree though, just
> to get the original words & punctuation into the headline.


Re: ts_headline

From
Oleg Bartunov
Date:
On Fri, 22 Feb 2008, Stephen Davies wrote:

> Hmmmm!
> I think I now understand the ts position better, thank you.
>
> Part of my problem has been that I am used to the functionality of Open Text's
> LCS (aka BASIS) product which handles text differently.
>
> It includes the position (and context) information in the index and does
> "remember" how the text was parsed so does not need to reparse to insert hit
> navigation tags nor need pointers as to how to parse queries. (It also
> supports phrase searching.)
>
> Now that I have a better understanding of ts, I think I will be able to make
> it do at least most of what I hoped for.

I'm wondering whether this was not described in the text search documentation :)



     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: ts_headline

From
Stephen Davies
Date:
As it turns out, all I needed was in the doco but the key element - the first
config arg to ts_headline - was not in any of the examples so I missed it.
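A toy Python model of the kind of mismatch that appears to have been at work here, offered as a sketch rather than a trace of what ts_headline does internally. The assumption is that the query was built with a stemming configuration while the headline parsed the document with a non-stemming default. The two "config" functions are made-up stand-ins, not PostgreSQL configurations.

```python
import re

def simple_config(word):
    """Assumed stand-in for a non-stemming configuration: lowercase only."""
    return word.lower()

def english_config(word):
    """Assumed stand-in for a stemming configuration (crude suffix stripping)."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_headline(config, text, query_lexemes):
    """Highlight words whose normalized form, under `config`, is in the query."""
    def mark(match):
        word = match.group()
        return "<b>" + word + "</b>" if config(word) in query_lexemes else word
    return re.sub(r"[A-Za-z]+", mark, text)

doc = "Indexing large databases"
query = {english_config("databases")}  # query built with the stemming config

mismatched = toy_headline(simple_config, doc, query)   # config differs from the query's
matched = toy_headline(english_config, doc, query)     # same config as the query
print(mismatched)
print(matched)
```

With the mismatched configuration nothing is highlighted, because the stemmed query lexeme never equals an unstemmed word; passing the same configuration to both sides restores the highlighting.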

Would it be possible for ts_headline to work with the pre-parsed ts_vector?

I see references to future plans for phrase searching in ts. Is there a date
for this?

Cheers and thanks,
Stephen Davies




Re: ts_headline

From
Oleg Bartunov
Date:
On Sat, 23 Feb 2008, Stephen Davies wrote:

> As it turns out, all I needed was in the doco but the key element - the first
> config arg to ts_headline - was not in any of the examples so I missed it.

Aha. The original examples were based on the default configuration; the
concept was later changed, but the examples were not updated.

>
> Would it be possible for ts_headline to work with the pre-parsed ts_vector?

It's impossible; Richard has already explained the reasons.

>
> I see references to future plans for phrase searching in ts. Is there a date
> for this?

Not yet. The problem is mostly algebraic :) A simple 'exact search' is doable,
but we need something more, since we support boolean operators,
pluggable dictionaries (which could produce several lexemes, for example),
and document structure (lexeme weights). So we need to define a consistent
algebra for text in order to get predictable results. This is quite a complex
task that requires a lot of dedicated time, which we don't have.


     Regards,
         Oleg

Phrase searching

From
Stephen Davies
Date:
As I understand it, the way that BASIS does phrase searching is based on first
parsing the base text to "context units" (sentences and/or paragraphs) and
then calculating position for tokens within those context units.

That is, a token might have position 3 in context unit 4. All of this is
stored in the index.

There are then multiple operators: phrase any (any token in the query is
matched in a context unit), phrase all (all tokens in the query match within
a context unit), phrase is (all tokens match in order, including any stop
words), and phrase like (as for phrase is, but with stop words acting only as
position holders).
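A rough Python sketch of the scheme described above. It is not BASIS code: the sentence splitter, the stop-word list, and all function names are invented for illustration, and "phrase like" is modelled on the assumption that a stop word in the query matches whatever word sits in that position.

```python
import re

STOP_WORDS = {"the", "a", "of", "in"}  # assumed toy stop-word list

def build_index(text):
    """Map token -> list of (context unit, position); units are sentences here."""
    index = {}
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    for unit, sentence in enumerate(sentences, start=1):
        words = re.findall(r"\w+", sentence.lower())
        for position, word in enumerate(words, start=1):
            index.setdefault(word, []).append((unit, position))
    return index

def units_of(index, token):
    """Context units in which a token occurs."""
    return {unit for unit, _ in index.get(token, [])}

def phrase_any(index, tokens):
    """Units that contain at least one of the tokens."""
    return set().union(*(units_of(index, t) for t in tokens))

def phrase_all(index, tokens):
    """Units that contain every token, in any order (tokens must be non-empty)."""
    return set.intersection(*(units_of(index, t) for t in tokens))

def phrase_is(index, tokens):
    """Units where the tokens occur consecutively, stop words included."""
    hits = set()
    for unit, start in index.get(tokens[0], []):
        if all((unit, start + i) in index.get(t, []) for i, t in enumerate(tokens)):
            hits.add(unit)
    return hits

def phrase_like(index, tokens):
    """As phrase_is, but stop words in the query only hold a position."""
    hits = set()
    content = [(i, t) for i, t in enumerate(tokens) if t not in STOP_WORDS]
    if not content:
        return hits
    first_offset, first_token = content[0]
    for unit, pos in index.get(first_token, []):
        start = pos - first_offset
        if all((unit, start + i) in index.get(t, []) for i, t in content):
            hits.add(unit)
    return hits

index = build_index("The king of Spain arrived. A king in Spain wrote of exile.")
print(phrase_any(index, ["arrived", "exile"]))
print(phrase_all(index, ["king", "exile"]))
print(phrase_is(index, ["king", "of", "spain"]))
print(phrase_like(index, ["king", "of", "spain"]))
```

Every operator is answered from the (unit, position) pairs alone, which is the property being described: nothing here goes back to the original text once the index is built.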

There is also an "includes" operator which supports queries such as:

includes "foo" & "bar" within 3 words

or

includes "foo" & "bar" within 3 sentences
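The two "includes" queries can be sketched on the same kind of index. This is again a hypothetical Python illustration, not BASIS's implementation. It assumes positions restart in every context unit, that "within N words" means at most N positions apart inside one unit, and that "within N sentences" means the units are at most N apart; the real product may define these differently.

```python
import re

def build_index(text):
    """Map token -> list of (context unit, position); units are sentences here."""
    index = {}
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    for unit, sentence in enumerate(sentences, start=1):
        words = re.findall(r"\w+", sentence.lower())
        for position, word in enumerate(words, start=1):
            index.setdefault(word, []).append((unit, position))
    return index

def within_words(index, a, b, n):
    """True if a and b occur in the same unit, at most n positions apart."""
    return any(
        unit_a == unit_b and abs(pos_a - pos_b) <= n
        for unit_a, pos_a in index.get(a, [])
        for unit_b, pos_b in index.get(b, [])
    )

def within_sentences(index, a, b, n):
    """True if a and b occur in units that are at most n apart."""
    return any(
        abs(unit_a - unit_b) <= n
        for unit_a, _ in index.get(a, [])
        for unit_b, _ in index.get(b, [])
    )

index = build_index("Foo is first. Nothing here. Then bar and finally foo again.")
print(within_words(index, "foo", "bar", 3))
print(within_words(index, "foo", "bar", 2))
print(within_sentences(index, "first", "bar", 2))
print(within_sentences(index, "first", "bar", 1))
```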

All of these plus hit highlighting are supported without reparsing the
original text (which might be gigabytes); just using the information in the
index.

Things like thesaurus expansion in queries are handled by adding AND/OR
constructs.

All operators support wild cards in query terms.

HTH,
Stephen

