Re: old bug in full text parser - Mailing list pgsql-hackers

From Mike Rylander
Subject Re: old bug in full text parser
Date
Msg-id CAO8ar==RC4o7a3Yw_AoQ=TVyH2EmZLx1PRQPGfios+XsXEr+xw@mail.gmail.com
In response to old bug in full text parser  (Oleg Bartunov <obartunov@gmail.com>)
Responses Re: old bug in full text parser
List pgsql-hackers
On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after moving tsearch2 into the core.
> The problem is in how the full text parser processes hyphenated words. Our
> original idea was to report the hyphenated word itself as well as its parts,
> and to ignore the hyphen. That was how tsearch2 worked.
>
> This behaviour was changed after moving tsearch2 into the core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') are processed differently
> from ones with plain text words like 'four-dot': the hyphenated word itself
> is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.
>

The Evergreen project has long depended on tsearch2 (both as an
extension and in-core FTS), and one thing we've struggled with is date
range parsing such as birth and death years for authors in the form of
1979-2014, for instance.  Strings like that end up being parsed as two
lexemes, "1979" and "-2014".  We work around this by pre-normalizing
strings matching /(\d+)-(\d+)/ into two numbers separated by a space
instead of a hyphen, but if fixing this bug would remove the need for
such a preprocessing step it would be a great help to us.  Would such
strings be parsed "properly" into lexemes of the form of "1979" and
"2014" with your proposed change?

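For illustration, the pre-normalization step described above could look
roughly like the sketch below.  It is written in Python purely as an
example: the function name split_numeric_ranges is hypothetical and this
is not Evergreen's actual code.  Only the /(\d+)-(\d+)/ pattern comes
from the description above.

```python
import re

# Pattern taken from the description above: digits, a hyphen, digits.
NUMERIC_RANGE = re.compile(r"(\d+)-(\d+)")


def split_numeric_ranges(text: str) -> str:
    """Replace the hyphen between two digit runs with a single space.

    Hypothetical helper, meant to run before the text is handed to the
    full text parser, so that the second number is not read as a
    signed integer.
    """
    return NUMERIC_RANGE.sub(r"\1 \2", text)


if __name__ == "__main__":
    print(split_numeric_ranges("Doe, Jane, 1979-2014"))
```

Only a hyphen with digits on both sides is touched, so inputs such as
'dot-4' or 'dot-four' pass through unchanged and still depend on how the
parser itself treats them.
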
Thanks!

--
Mike Rylander

> After  investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Sat Oct 27 19:03:45 2007 +0000
>
>     Change text search parsing rules for hyphenated words so that digit strings
>     containing decimal points aren't considered part of a hyphenated word.
>     Sync the hyphenated-word lookahead states with the subsequent part-by-part
>     reparsing states so that we don't get different answers about how much text
>     is part of the hyphenated word.  Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>
>
> Regards,
> Oleg


