Thread: BUG #4562: ts_headline() adds space when parsing url
The following bug has been logged online:

Bug reference:      4562
Logged by:          Denis Monsieur
Email address:      dmonsieur@gmail.com
PostgreSQL version: 8.3.4
Operating system:   Debian etch
Description:        ts_headline() adds space when parsing url
Details:

My system is 8.3.4, but people in #postgresql with 8.3.5 have confirmed the
issue.

The problem is a space being added to text in the form of
http://some.url/path
Compare the output:

shs=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
   ts_headline
-----------------
 http://some.url
(1 row)

shs=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
      ts_headline
-----------------------
 http:// some.url/path
(1 row)
Same thing on 8.3.5 Win32

ester=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
      ts_headline
-----------------------
 http:// some.url/path
(1 row)

ester=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
   ts_headline
-----------------
 http://some.url
(1 row)

Gildas

-----Original Message-----
From: pgsql-bugs-owner@postgresql.org [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Denis Monsieur
Sent: Thursday, 4 December 2008 00:33
To: pgsql-bugs@postgresql.org
Subject: [BUGS] BUG #4562: ts_headline() adds space when parsing url

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
"Denis Monsieur" <dmonsieur@gmail.com> writes: > The problem is a space being added to text in the form of > http://some.url/path > Compare the output: > shs=# SELECT ts_headline('http://some.url', to_tsquery('sometext')); > ts_headline > ----------------- > http://some.url > (1 row) > shs=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext')); > ts_headline > ----------------------- > http:// some.url/path > (1 row) I looked into this, and it seems that the problem is that generateHeadline() emits a space for any token marked as replace = 1. I think it probably shouldn't emit anything at all. AFAICS the cases where replace will get set are token types URL, TAG, NUMHWORD, ASCIIHWORD, HWORD. For URL and the HWORD variants the space is certainly undesirable, because these token types are just respecifying text that is also covered by their component tokens. The only case where you could make an argument that the space is useful is TAG, as in regression=# SELECT ts_headline('http<foo>blah', to_tsquery('sometext')); ts_headline ------------- http blah (1 row) But it seems to me to be at least as plausible that you should get nothing as that you should get a space for a removed tag. Comments? regards, tom lane
This bug still exists in my testing.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
On Wed, 14 Jan 2009, Bruce Momjian wrote:
> This bug still exists in my testing.

We fixed all issues with ts_headline and will submit soon.

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
This has been fixed and will be in the next 8.3 minor release.

--
  Bruce Momjian  <bruce@momjian.us>