Thread: BUG #4562: ts_headline() adds space when parsing url

BUG #4562: ts_headline() adds space when parsing url

From: Denis Monsieur

The following bug has been logged online:

Bug reference:      4562
Logged by:          Denis Monsieur
Email address:      dmonsieur@gmail.com
PostgreSQL version: 8.3.4
Operating system:   Debian etch
Description:        ts_headline() adds space when parsing url
Details:

My system is 8.3.4, but people in #postgresql with 8.3.5 have confirmed the
issue.

The problem is a space being added to text in the form of
http://some.url/path
Compare the output:

shs=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
   ts_headline
-----------------
 http://some.url
(1 row)

shs=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
      ts_headline
-----------------------
 http:// some.url/path
(1 row)

Re: BUG #4562: ts_headline() adds space when parsing url

From: gildas prime

Same thing on 8.3.5 Win32

ester=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
      ts_headline
-----------------------
 http:// some.url/path
(1 row)

ester=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
   ts_headline
-----------------
 http://some.url
(1 row)

ester=#

Gildas

-----Original Message-----
From: pgsql-bugs-owner@postgresql.org [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Denis Monsieur
Sent: Thursday, 4 December 2008 00:33
To: pgsql-bugs@postgresql.org
Subject: [BUGS] BUG #4562: ts_headline() adds space when parsing url

The following bug has been logged online:

Bug reference:      4562
Logged by:          Denis Monsieur
Email address:      dmonsieur@gmail.com
PostgreSQL version: 8.3.4
Operating system:   Debian etch
Description:        ts_headline() adds space when parsing url
Details:

My system is 8.3.4, but people in #postgresql with 8.3.5 have confirmed the
issue.

The problem is a space being added to text in the form of
http://some.url/path
Compare the output:

shs=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
   ts_headline
-----------------
 http://some.url
(1 row)

shs=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
      ts_headline
-----------------------
 http:// some.url/path
(1 row)

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: BUG #4562: ts_headline() adds space when parsing url

From: Tom Lane

"Denis Monsieur" <dmonsieur@gmail.com> writes:
> The problem is a space being added to text in the form of
> http://some.url/path
> Compare the output:

> shs=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
>    ts_headline
> -----------------
>  http://some.url
> (1 row)

> shs=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
>       ts_headline
> -----------------------
>  http:// some.url/path
> (1 row)

I looked into this, and it seems that the problem is that
generateHeadline() emits a space for any token marked as replace = 1.
I think it probably shouldn't emit anything at all.  AFAICS the cases
where replace will get set are token types URL, TAG, NUMHWORD,
ASCIIHWORD, HWORD.  For URL and the HWORD variants the space is
certainly undesirable, because these token types are just respecifying
text that is also covered by their component tokens.  The only case
where you could make an argument that the space is useful is TAG,
as in

regression=# SELECT ts_headline('http<foo>blah', to_tsquery('sometext'));
 ts_headline
-------------
 http blah
(1 row)

But it seems to me to be at least as plausible that you should get
nothing as that you should get a space for a removed tag.

Comments?

            regards, tom lane
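
The behaviour described in the message above can be modelled in a few
lines. This is a minimal Python sketch, not the PostgreSQL C source: the
Token type, the generate_headline function, and both token lists are
hypothetical. The lists assume that the default parser splits the inputs
into the overlapping tokens shown, and that the URL and TAG tokens are the
ones marked replace. Passing replacement=" " models the reported behaviour;
replacement="" models the proposed change of emitting nothing.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for one parsed token of the input text."""
    text: str
    replace: bool  # models the "replace = 1" marking from the discussion


def generate_headline(tokens: List[Token], replacement: str = " ") -> str:
    """Join token texts; a token marked replace contributes `replacement`."""
    return "".join(replacement if tok.replace else tok.text for tok in tokens)


# Assumed tokenisation of 'http://some.url/path': the URL token restates
# text that the following host and path tokens already cover.
url_tokens = [
    Token("http://", False),        # protocol
    Token("some.url/path", True),   # URL, marked replace
    Token("some.url", False),       # host
    Token("/path", False),          # URL path
]

# Assumed tokenisation of 'http<foo>blah': the tag is marked replace.
tag_tokens = [
    Token("http", False),
    Token("<foo>", True),           # TAG, marked replace
    Token("blah", False),
]

print(repr(generate_headline(url_tokens, " ")))  # 'http:// some.url/path'
print(repr(generate_headline(url_tokens, "")))   # 'http://some.url/path'
print(repr(generate_headline(tag_tokens, " ")))  # 'http blah'
print(repr(generate_headline(tag_tokens, "")))   # 'httpblah'
```

Under these assumptions the URL case clearly benefits from emitting
nothing, while the TAG case shows the trade-off raised above: dropping the
space joins the words on either side of the removed tag.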

Re: BUG #4562: ts_headline() adds space when parsing url

From: Bruce Momjian

This bug still exists in my testing.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: BUG #4562: ts_headline() adds space when parsing url

From: Oleg Bartunov

On Wed, 14 Jan 2009, Bruce Momjian wrote:

>
> This bug still exists in my testing.

We have fixed all the issues with ts_headline and will submit the fix soon.

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: BUG #4562: ts_headline() adds space when parsing url

From: Bruce Momjian

This has been fixed and will be in the next 8.3 minor release.
