Thread: old bug in full text parser
It looks like there is a very old bug in the full text parser (somebody pointed it out to me), which appeared after tsearch2 was moved into core. The problem is in how the full text parser processes hyphenated words. Our original idea was to report the hyphenated word itself as well as its parts, and to ignore the hyphen. That is how tsearch2 works.
This behaviour was changed after moving tsearch2 into the core:
1. The hyphen is now reported by the parser, which is useless.
2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed differently from ones made of plain text words like 'four-dot'; the hyphenated word itself is not reported.
I think we should consider this a bug and produce a fix for all supported versions.
After investigation we found this commit:
commit 73e6f9d3b61995525785b2f4490b465fe860196b
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Oct 27 19:03:45 2007 +0000
Change text search parsing rules for hyphenated words so that digit strings
containing decimal points aren't considered part of a hyphenated word.
Sync the hyphenated-word lookahead states with the subsequent part-by-part
reparsing states so that we don't get different answers about how much text
is part of the hyphenated word. Per my gripe of a few days ago.
8.2.23

select tok_type, description, token from ts_debug('dot-four');
  tok_type   |          description          |  token
-------------+-------------------------------+----------
 lhword      | Latin hyphenated word         | dot-four
 lpart_hword | Latin part of hyphenated word | dot
 lpart_hword | Latin part of hyphenated word | four
(3 rows)

select tok_type, description, token from ts_debug('dot-4');
  tok_type   |          description          | token
-------------+-------------------------------+-------
 hword       | Hyphenated word               | dot-4
 lpart_hword | Latin part of hyphenated word | dot
 uint        | Unsigned integer              | 4
(3 rows)

select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)

8.3.23

select alias, description, token from ts_debug('dot-four');
      alias      |           description           |  token
-----------------+---------------------------------+----------
 asciihword      | Hyphenated word, all ASCII      | dot-four
 hword_asciipart | Hyphenated word part, all ASCII | dot
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | four
(4 rows)

select alias, description, token from ts_debug('dot-4');
   alias   |   description   | token
-----------+-----------------+-------
 asciiword | Word, all ASCII | dot
 int       | Signed integer  | -4
(2 rows)

select alias, description, token from ts_debug('4-dot');
   alias   |   description    | token
-----------+------------------+-------
 uint      | Unsigned integer | 4
 blank     | Space symbols    | -
 asciiword | Word, all ASCII  | dot
(3 rows)

Regards,
Oleg
On Wed, Feb 10, 2016 at 12:28 PM, Oleg Bartunov <obartunov@gmail.com> wrote:
It looks like there is a very old bug in full text parser (somebody pointed me on it), which appeared after moving tsearch2 into the core. The problem is in how full text parser process hyphenated words. Our original idea was to report hyphenated word itself as well as its parts and ignore hyphen. That was how tsearch2 works.
This behaviour was changed after moving tsearch2 into the core:
1. hyphen now reported by parser, which is useless.
2. Hyphenated words with numbers ('4-dot', 'dot-4') processed differently than ones with plain text words like 'four-dot', no hyphenated word itself reported.
I think we should consider this as a bug and produce fix for all supported versions.
After investigation we found this commit:
commit 73e6f9d3b61995525785b2f4490b465fe860196b
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Oct 27 19:03:45 2007 +0000
Change text search parsing rules for hyphenated words so that digit strings
containing decimal points aren't considered part of a hyphenated word.
Sync the hyphenated-word lookahead states with the subsequent part-by-part
reparsing states so that we don't get different answers about how much text
is part of the hyphenated word. Per my gripe of a few days ago.
8.2.23
select tok_type, description, token from ts_debug('dot-four');
tok_type | description | token
-------------+-------------------------------+----------
lhword | Latin hyphenated word | dot-four
lpart_hword | Latin part of hyphenated word | dot
lpart_hword | Latin part of hyphenated word | four
(3 rows)
select tok_type, description, token from ts_debug('dot-4');
tok_type | description | token
-------------+-------------------------------+-------
hword | Hyphenated word | dot-4
lpart_hword | Latin part of hyphenated word | dot
uint | Unsigned integer | 4
(3 rows)
select tok_type, description, token from ts_debug('4-dot');
tok_type | description | token
----------+------------------+-------
uint | Unsigned integer | 4
lword | Latin word | dot
(2 rows)
8.3.23
select alias, description, token from ts_debug('dot-four');
alias | description | token
-----------------+---------------------------------+----------
asciihword | Hyphenated word, all ASCII | dot-four
hword_asciipart | Hyphenated word part, all ASCII | dot
blank | Space symbols | -
hword_asciipart | Hyphenated word part, all ASCII | four
(4 rows)
select alias, description, token from ts_debug('dot-4');
alias | description | token
-----------+-----------------+-------
asciiword | Word, all ASCII | dot
int | Signed integer | -4
(2 rows)
select alias, description, token from ts_debug('4-dot');
alias | description | token
-----------+------------------+-------
uint | Unsigned integer | 4
blank | Space symbols | -
asciiword | Word, all ASCII | dot
(3 rows)
Oh, one more bug, which existed even in tsearch2.
select tok_type, description, token from ts_debug('4-dot');
tok_type | description | token
----------+------------------+-------
uint | Unsigned integer | 4
lword | Latin word | dot
(2 rows)
Regards,
Oleg
Oleg Bartunov <obartunov@gmail.com> writes:
> It looks like there is a very old bug in full text parser (somebody
> pointed me on it), which appeared after moving tsearch2 into the core. The
> problem is in how full text parser process hyphenated words. Our original
> idea was to report hyphenated word itself as well as its parts and ignore
> hyphen. That was how tsearch2 works.
> This behaviour was changed after moving tsearch2 into the core:
> 1. hyphen now reported by parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') processed differently
> than ones with plain text words like 'four-dot', no hyphenated word itself
> reported.
> I think we should consider this as a bug and produce fix for all supported
> versions.

I don't see anything here that looks like a bug, more like a definition
disagreement. As such, I'd be pretty dubious about back-patching a
change. But it's hard to debate the merits when you haven't said exactly
what you'd do instead.

I believe the commit you mention was intended to fix this inconsistency:
http://www.postgresql.org/message-id/6269.1193184058@sss.pgh.pa.us
so I would be against simply reverting it. In any case, the examples
given there make it look like there was already inconsistency about mixed
words and numbers. Do we really think that "4-dot" should be considered
a hyphenated word? I'm not sure.

regards, tom lane
On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> It looks like there is a very old bug in full text parser (somebody pointed
> me on it), which appeared after moving tsearch2 into the core. The problem
> is in how full text parser process hyphenated words. Our original idea was
> to report hyphenated word itself as well as its parts and ignore hyphen.
> That was how tsearch2 works.
>
> This behaviour was changed after moving tsearch2 into the core:
> 1. hyphen now reported by parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') processed differently
> than ones with plain text words like 'four-dot', no hyphenated word itself
> reported.
>
> I think we should consider this as a bug and produce fix for all supported
> versions.
>

The Evergreen project has long depended on tsearch2 (both as an
extension and in-core FTS), and one thing we've struggled with is date
range parsing such as birth and death years for authors in the form of
1979-2014, for instance. Strings like that end up being parsed as two
lexemes, "1979" and "-2014". We work around this by pre-normalizing
strings matching /(\d+)-(\d+)/ into two numbers separated by a space
instead of a hyphen, but if fixing this bug would remove the need for
such a preprocessing step it would be a great help to us. Would such
strings be parsed "properly" into lexemes of the form of "1979" and
"2014" with your proposed change?

Thanks!

--
Mike Rylander

> After investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date: Sat Oct 27 19:03:45 2007 +0000
>
> Change text search parsing rules for hyphenated words so that digit
> strings
> containing decimal points aren't considered part of a hyphenated word.
> Sync the hyphenated-word lookahead states with the subsequent
> part-by-part
> reparsing states so that we don't get different answers about how much
> text
> is part of the hyphenated word. Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
> tok_type | description | token
> -------------+-------------------------------+----------
> lhword | Latin hyphenated word | dot-four
> lpart_hword | Latin part of hyphenated word | dot
> lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
> tok_type | description | token
> -------------+-------------------------------+-------
> hword | Hyphenated word | dot-4
> lpart_hword | Latin part of hyphenated word | dot
> uint | Unsigned integer | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
> tok_type | description | token
> ----------+------------------+-------
> uint | Unsigned integer | 4
> lword | Latin word | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
> alias | description | token
> -----------------+---------------------------------+----------
> asciihword | Hyphenated word, all ASCII | dot-four
> hword_asciipart | Hyphenated word part, all ASCII | dot
> blank | Space symbols | -
> hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
> alias | description | token
> -----------+-----------------+-------
> asciiword | Word, all ASCII | dot
> int | Signed integer | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
> alias | description | token
> -----------+------------------+-------
> uint | Unsigned integer | 4
> blank | Space symbols | -
> asciiword | Word, all ASCII | dot
> (3 rows)
>
>
> Regards,
> Oleg
On Wed, Feb 10, 2016 at 7:45 PM, Mike Rylander <mrylander@gmail.com> wrote:
On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> It looks like there is a very old bug in full text parser (somebody pointed
> me on it), which appeared after moving tsearch2 into the core. The problem
> is in how full text parser process hyphenated words. Our original idea was
> to report hyphenated word itself as well as its parts and ignore hyphen.
> That was how tsearch2 works.
>
> This behaviour was changed after moving tsearch2 into the core:
> 1. hyphen now reported by parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') processed differently
> than ones with plain text words like 'four-dot', no hyphenated word itself
> reported.
>
> I think we should consider this as a bug and produce fix for all supported
> versions.
>
The Evergreen project has long depended on tsearch2 (both as an
extension and in-core FTS), and one thing we've struggled with is date
range parsing such as birth and death years for authors in the form of
1979-2014, for instance. Strings like that end up being parsed as two
lexemes, "1979" and "-2014". We work around this by pre-normalizing
strings matching /(\d+)-(\d+)/ into two numbers separated by a space
instead of a hyphen, but if fixing this bug would remove the need for
such a preprocessing step it would be a great help to us. Would such
strings be parsed "properly" into lexemes of the form of "1979" and
"2014" with your proposed change?
I'd love to treat all hyphenated "words" the same way, regardless of whether each part is "a word", a number or plain text; namely, 'w1-w2' should be reported as {'w1-w2', 'w1', 'w2'}. The problem is in the definition of "word".
We'll definitely look at the parser again. Fortunately, we could just fork the default parser and develop a new one so as not to break compatibility. You have a chance to help us produce a "consistent" view of which tokens the new parser should recognize and how it should process them.
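For reference, the pre-normalization workaround Mike describes can be sketched directly in SQL. This is only an illustration, assuming the rewrite is done in the database rather than in application code. The sample string is made up, and the 'simple' configuration is used just to keep stemming out of the picture.

```sql
-- Assumes standard_conforming_strings = on (the default since 9.1),
-- so the backslashes in the pattern need no doubling.
-- Rewrite each digits-hyphen-digits run as two space-separated numbers.
select regexp_replace('Smith, John, 1979-2014',
                      '(\d+)-(\d+)', '\1 \2', 'g') as normalized;
-- normalized: Smith, John, 1979 2014

-- Feed the normalized text to the parser instead of the raw string.
select to_tsvector('simple',
       regexp_replace('Smith, John, 1979-2014',
                      '(\d+)-(\d+)', '\1 \2', 'g'));
```

Note that in a chain such as 1979-2014-2020 this pattern replaces only the first hyphen, so real data may need a stricter rule.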
Thanks!
--
Mike Rylander
> After investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date: Sat Oct 27 19:03:45 2007 +0000
>
> Change text search parsing rules for hyphenated words so that digit
> strings
> containing decimal points aren't considered part of a hyphenated word.
> Sync the hyphenated-word lookahead states with the subsequent
> part-by-part
> reparsing states so that we don't get different answers about how much
> text
> is part of the hyphenated word. Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
> tok_type | description | token
> -------------+-------------------------------+----------
> lhword | Latin hyphenated word | dot-four
> lpart_hword | Latin part of hyphenated word | dot
> lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
> tok_type | description | token
> -------------+-------------------------------+-------
> hword | Hyphenated word | dot-4
> lpart_hword | Latin part of hyphenated word | dot
> uint | Unsigned integer | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
> tok_type | description | token
> ----------+------------------+-------
> uint | Unsigned integer | 4
> lword | Latin word | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
> alias | description | token
> -----------------+---------------------------------+----------
> asciihword | Hyphenated word, all ASCII | dot-four
> hword_asciipart | Hyphenated word part, all ASCII | dot
> blank | Space symbols | -
> hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
> alias | description | token
> -----------+-----------------+-------
> asciiword | Word, all ASCII | dot
> int | Signed integer | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
> alias | description | token
> -----------+------------------+-------
> uint | Unsigned integer | 4
> blank | Space symbols | -
> asciiword | Word, all ASCII | dot
> (3 rows)
>
>
> Regards,
> Oleg
On Wed, Feb 10, 2016 at 7:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Oleg Bartunov <obartunov@gmail.com> writes:
> It looks like there is a very old bug in full text parser (somebody
> pointed me on it), which appeared after moving tsearch2 into the core. The
> problem is in how full text parser process hyphenated words. Our original
> idea was to report hyphenated word itself as well as its parts and ignore
> hyphen. That was how tsearch2 works.
> This behaviour was changed after moving tsearch2 into the core:
> 1. hyphen now reported by parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') processed differently
> than ones with plain text words like 'four-dot', no hyphenated word itself
> reported.
> I think we should consider this as a bug and produce fix for all supported
> versions.
I don't see anything here that looks like a bug, more like a definition
disagreement. As such, I'd be pretty dubious about back-patching a
change. But it's hard to debate the merits when you haven't said exactly
what you'd do instead.
Yeah, better to call it not a bug but an inconsistency. We definitely should work on a better,
"consistent" parser with predictable behaviour.
I believe the commit you mention was intended to fix this inconsistency:
http://www.postgresql.org/message-id/6269.1193184058@sss.pgh.pa.us
so I would be against simply reverting it. In any case, the examples
given there make it look like there was already inconsistency about mixed
words and numbers. Do we really think that "4-dot" should be considered
a hyphenated word? I'm not sure.
I agree that we shouldn't just revert it. My idea is to work on a new parser and leave the old one as is for compatibility reasons. Fortunately, FTS is flexible enough that we could add a new parser at any time as an extension.
regards, tom lane
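The "new parser alongside the old one" idea can be sketched with existing DDL. This is a hedged sketch, not a plan agreed in this thread: forked_default and forked_english are hypothetical names, and the prsd_* functions are the ones backing the built-in default parser, so the clone behaves exactly like it until a modified C implementation is substituted. Creating a parser requires superuser privileges.

```sql
-- Hypothetical parser name; for now it reuses the built-in default
-- parser's support functions, so its behaviour is identical.
create text search parser forked_default (
    start    = pg_catalog.prsd_start,
    gettoken = pg_catalog.prsd_nexttoken,
    end      = pg_catalog.prsd_end,
    lextypes = pg_catalog.prsd_lextype,
    headline = pg_catalog.prsd_headline
);

-- A configuration bound to the new parser starts with no mappings.
create text search configuration forked_english (parser = forked_default);

-- Map only word-like token types; unmapped types such as blank
-- are not indexed.
alter text search configuration forked_english
    add mapping for asciiword, asciihword, hword_asciipart
    with english_stem;

-- Compare the reported tokens with those of the default configuration.
select alias, description, token
from ts_debug('forked_english', 'dot-four');
```

Existing configurations keep using the default parser, so nothing changes for current indexes unless a configuration is explicitly switched.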