Thread: to_tsvector() with hyphens

to_tsvector() with hyphens

From: Brian DeRocher

Hey everyone,

I think it's great that the full text search parser breaks hyphenated words into multiple parts.  I think this
could really help, but something is not right.


rasmas_hackathon=> select * from ts_debug( 'gn-foo' );
      alias      |           description           |  token  |  dictionaries  |  dictionary  | lexemes
-----------------+---------------------------------+---------+----------------+--------------+----------
 asciihword      | Hyphenated word, all ASCII      | gn-foo  | {english_stem} | english_stem | {gn-foo}
 hword_asciipart | Hyphenated word part, all ASCII | gn      | {english_stem} | english_stem | {gn}
 blank           | Space symbols                   | -       | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | foo     | {english_stem} | english_stem | {foo}
 blank           | Space symbols                   |         | {}             |              |
(6 rows)


But why does to_tsquery() AND them?

rasmas_hackathon=> select * from to_tsquery( 'gn-foo | bandage' );
             to_tsquery
------------------------------------
 'gn-foo' & 'gn' & 'foo' | 'bandag'
(1 row)


Perhaps my vector is like this:

rasmas_hackathon=> select to_tsvector( 'gn series bandage' );
         to_tsvector
-----------------------------
 'bandag':3 'gn':1 'seri':2
(1 row)


The rank is so bad.

rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn-foo | bandage' ) );
 ts_rank_cd
------------
        0.1
(1 row)

Without the hyphen the rank is better, despite the process above.

rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn | bandage' ) );
 ts_rank_cd
------------
        0.2
(1 row)


So wouldn't this be a better query for hyphenated words?

 'gn-foo' | 'gn' | 'foo'
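
The OR form above can be tried by hand. The following is only a sketch: it assumes the english configuration (so that 'bandage' stems to 'bandag', as in the output above) and writes the query as a tsquery literal, which bypasses to_tsquery() and therefore requires lexemes that are already normalized.

```sql
-- Sketch: build the OR query by hand with a tsquery literal.
-- The cast does not run the parser or the stemmer, so each lexeme
-- is written in its already-normalized form ('bandag', not 'bandage').
SELECT ts_rank_cd(
         to_tsvector('english', 'gn series bandage'),
         $$'gn-foo' | 'gn' | 'foo' | 'bandag'$$::tsquery
       );
```

Comparing this rank with the two results above shows how much the AND-ed hyphenated term contributes to the difference.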


Aside: As best I can tell, the parser is giving instructions to pushval_morph() to treat hyphenated words as
"same variants".


thanks,
Brian


--
http://brian.derocher.org
http://mappingdc.org
http://about.me/brian.derocher


Re: to_tsvector() with hyphens

From: Tom Lane

Brian DeRocher <brian@derocher.org> writes:
> But why does to_tsquery() AND them?

> rasmas_hackathon=> select * from to_tsquery( 'gn-foo | bandage' );
>              to_tsquery
> ------------------------------------
>  'gn-foo' & 'gn' & 'foo' | 'bandag'
> (1 row)

Because what you're looking for is gn-foo, not either gn alone or foo
alone.  Converting to "OR" would be the wrong thing.
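
A small illustration of that point, offered as a sketch that assumes the english configuration and uses a made-up second document that actually contains the hyphenated word:

```sql
-- The document contains "gn" but not "gn-foo" or "foo", so the
-- hyphenated query term is expected not to match on its own.
SELECT to_tsvector('english', 'gn series bandage')
       @@ to_tsquery('english', 'gn-foo');

-- A hypothetical document that contains the hyphenated word yields the
-- whole word plus both parts, so the same query is expected to match.
SELECT to_tsvector('english', 'gn-foo series bandage')
       @@ to_tsquery('english', 'gn-foo');
```

With an OR between the parts, the first document would match on 'gn' alone, which is the behavior described above as the wrong thing.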

> The rank is so bad.

> rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn-foo | bandage' ) );
>  ts_rank_cd
> ------------
>         0.1
> (1 row)

> Without the hyphen the rank is better, despite the process above.

> rasmas_hackathon=> select ts_rank_cd( to_tsvector( 'gn series bandage' ), to_tsquery( 'gn | bandage' ) );
>  ts_rank_cd
> ------------
>         0.2
> (1 row)

Don't see the problem.  The first case doesn't match the query as well as
the second one does, so I'd fully expect a higher rank for the second.

            regards, tom lane