Re: old bug in full text parser - Mailing list pgsql-hackers
From | Mike Rylander |
---|---|
Subject | Re: old bug in full text parser |
Date | |
Msg-id | CAO8ar==RC4o7a3Yw_AoQ=TVyH2EmZLx1PRQPGfios+XsXEr+xw@mail.gmail.com |
In response to | old bug in full text parser (Oleg Bartunov <obartunov@gmail.com>) |
Responses | Re: old bug in full text parser |
List | pgsql-hackers |
On Wed, Feb 10, 2016 at 4:28 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> It looks like there is a very old bug in full text parser (somebody pointed
> me on it), which appeared after moving tsearch2 into the core. The problem
> is in how full text parser process hyphenated words. Our original idea was
> to report hyphenated word itself as well as its parts and ignore hyphen.
> That was how tsearch2 works.
>
> This behaviour was changed after moving tsearch2 into the core:
> 1. hyphen now reported by parser, which is useless.
> 2. Hyphenated words with numbers ('4-dot', 'dot-4') processed differently
> than ones with plain text words like 'four-dot', no hyphenated word itself
> reported.
>
> I think we should consider this as a bug and produce fix for all supported
> versions.
>

The Evergreen project has long depended on tsearch2 (both as an
extension and as in-core FTS), and one thing we've struggled with is
parsing date ranges, such as birth and death years for authors in the
form 1979-2014. Strings like that end up being parsed as two lexemes,
"1979" and "-2014". We work around this by pre-normalizing strings
matching /(\d+)-(\d+)/ into two numbers separated by a space instead
of a hyphen, but if fixing this bug removed the need for that
preprocessing step, it would be a great help to us.

Would such strings be parsed "properly" into the lexemes "1979" and
"2014" with your proposed change?

Thanks!

--
Mike Rylander

> After investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Sat Oct 27 19:03:45 2007 +0000
>
>     Change text search parsing rules for hyphenated words so that digit strings
>     containing decimal points aren't considered part of a hyphenated word.
>     Sync the hyphenated-word lookahead states with the subsequent part-by-part
>     reparsing states so that we don't get different answers about how much text
>     is part of the hyphenated word.
>     Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>
>
> Regards,
> Oleg
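
For readers following the thread, the pre-normalization workaround mentioned in the reply can be sketched roughly as follows. This is only a minimal Python illustration under stated assumptions, not Evergreen's actual code: the function name `split_digit_ranges` is hypothetical, and only the pattern /(\d+)-(\d+)/ is taken from the message.

```python
import re

# Pattern quoted in the message: digits, a hyphen, digits.
DIGIT_RANGE = re.compile(r"(\d+)-(\d+)")


def split_digit_ranges(text: str) -> str:
    """Replace the hyphen between two digit runs with a space.

    Hypothetical helper: it is meant to run on text before it reaches the
    full text parser, so that "1979-2014" is not tokenized as the unsigned
    integer 1979 followed by the signed integer -2014.
    """
    return DIGIT_RANGE.sub(r"\1 \2", text)


if __name__ == "__main__":
    print(split_digit_ranges("Author, 1979-2014"))  # Author, 1979 2014
    print(split_digit_ranges("dot-4"))              # unchanged: no digits before the hyphen
```

Because the pattern requires digits on both sides of the hyphen, strings such as 'dot-4', '4-dot' and 'dot-four' pass through untouched, so the sketch does nothing about the hyphenated-word cases discussed in the quoted message.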