Re: old bug in full text parser - Mailing list pgsql-hackers
From | Oleg Bartunov |
---|---|
Subject | Re: old bug in full text parser |
Date | |
Msg-id | CAF4Au4ybGJMErZf+CRDX0Y=SRuLhGA0pi8nThjrs2-DhfJo0xQ@mail.gmail.com |
In response to | Re: old bug in full text parser (Mike Rylander <mrylander@gmail.com>) |
List | pgsql-hackers |
On Wed, Feb 10, 2016 at 7:45 PM, Mike Rylander <mrylander@gmail.com> wrote:
On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> It looks like there is a very old bug in the full text parser (somebody pointed
> it out to me), which appeared after tsearch2 was moved into the core. The problem
> is in how the full text parser processes hyphenated words. Our original idea was
> to report the hyphenated word itself as well as its parts, and to ignore the hyphen.
> That is how tsearch2 worked.
>
> This behaviour changed after tsearch2 was moved into the core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed differently
> from ones made of plain text words like 'four-dot'; the hyphenated word itself
> is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.
>
The Evergreen project has long depended on tsearch2 (both as an
extension and in-core FTS), and one thing we've struggled with is date
range parsing such as birth and death years for authors in the form of
1979-2014, for instance. Strings like that end up being parsed as two
lexemes, "1979" and "-2014". We work around this by pre-normalizing
strings matching /(\d+)-(\d+)/ into two numbers separated by a space
instead of a hyphen, but if fixing this bug would remove the need for
such a preprocessing step it would be a great help to us. Would such
strings be parsed "properly" into lexemes of the form of "1979" and
"2014" with your proposed change?
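For readers who want to see the workaround concretely, here is a minimal sketch of the pre-normalization step described above. Only the pattern /(\d+)-(\d+)/ comes from the message; the function name `split_digit_ranges` is hypothetical, and Python is used purely for illustration (this is not Evergreen's actual code).

```python
import re

# The pattern quoted in the message: two digit runs joined by a hyphen.
_DIGIT_RANGE = re.compile(r"(\d+)-(\d+)")


def split_digit_ranges(text: str) -> str:
    """Replace the hyphen between two digit runs with a space.

    Hypothetical helper: it is meant to run on a string before the
    string is handed to full text search, so that '1979-2014' is seen
    as two unsigned integers instead of '1979' and '-2014'.
    """
    return _DIGIT_RANGE.sub(r"\1 \2", text)


# "1979-2014" -> "1979 2014"; text without digit ranges is unchanged.
print(split_digit_ranges("Smith, John, 1979-2014"))
print(split_digit_ranges("dot-four"))
```

The substitution only touches hyphens that sit between digits, so hyphenated words such as 'dot-four' still reach the parser as written.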
I'd love to treat all hyphenated "words" the same way, regardless of whether each part is a number or plain text: 'w1-w2' should be reported as {'w1-w2', 'w1', 'w2'}. The problem is in the definition of "word".
We'll definitely look at the parser again. Fortunately, we can just fork the default parser and develop a new one so as not to break compatibility. You have a chance to help us produce a "consistent" view of which tokens the new parser should recognize and how to process them.
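To make the proposed rule concrete, here is a rough sketch of "report 'w1-w2' as {'w1-w2', 'w1', 'w2'}". It is an assumption-laden illustration only: the function name `expand_hyphenated` is hypothetical, a "word" is taken to be any non-empty run between hyphens (which is exactly the open question), and it has nothing to do with how the real parser is implemented.

```python
from typing import List


def expand_hyphenated(token: str) -> List[str]:
    """Return the whole token followed by its hyphen-separated parts.

    Assumption: every non-empty piece between hyphens counts as a
    "word", whether it is a number or plain text. The hyphen itself
    is never reported.
    """
    parts = [p for p in token.split("-") if p]
    if len(parts) < 2:
        # Not a hyphenated compound (e.g. 'dot' or '-4'): report as is.
        return [token]
    return [token] + parts


# Numbers and plain text are treated the same way under this rule.
for sample in ("dot-four", "dot-4", "4-dot", "1979-2014"):
    print(sample, "->", expand_hyphenated(sample))
```

Under this rule 'dot-4' and '4-dot' yield three tokens each, just like 'dot-four'.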
Thanks!
--
Mike Rylander
> After investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date: Sat Oct 27 19:03:45 2007 +0000
>
> Change text search parsing rules for hyphenated words so that digit strings
> containing decimal points aren't considered part of a hyphenated word.
> Sync the hyphenated-word lookahead states with the subsequent part-by-part
> reparsing states so that we don't get different answers about how much text
> is part of the hyphenated word. Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>
>
> Regards,
> Oleg