websearch_to_tsquery() returns queries that don't match to_tsvector() - Mailing list pgsql-bugs

From Valentin Gatien-Baron
Subject websearch_to_tsquery() returns queries that don't match to_tsvector()
Date
Msg-id CA+0DEqiZs7gdOd4ikmg=0UWG+SwWOLxPsk_JW-sx9WNOyrb0KQ@mail.gmail.com
Whole thread Raw
List pgsql-bugs
Hi,

I'm surprised that the following expression is false:

select to_tsvector('english', 'aaa: bbb') @@ websearch_to_tsquery('english', '"aaa: bbb"');
 ?column?
----------
 f
(1 row)

My expectation is that to_tsvector('english', text) @@
websearch_to_tsquery('english', '" || text || "') would be true for
all texts, or pretty close to all texts. Otherwise it makes search
rather unpredictable. The actual example that started this
investigation was searching for '"/path/to/some/exe: no such file or
directory"' (which was failing to find the exact matches that I knew
existed).

Looking at the tsvector and tsquery, we can see that the problem is
that the ":" counts as one position for the ts_query but not the
ts_vector:

select to_tsvector('english', 'aaa: bbb'), websearch_to_tsquery('english', '"aaa: bbb"');
   to_tsvector   | websearch_to_tsquery
-----------------+----------------------
 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'
(1 row)

So I wondered: are there more such cases? Looking at all texts of the
form 'aaa' || maybe-space || one-byte || maybe-space || 'bbb', it
happens quite a bit:

select text, ts_vector, ts_query, matches from unnest(array['', ' ']) as prefix, unnest(array['', ' ']) as suffix, (select chr(a) as char from generate_series(1,192) as s(a)) as zz1, lateral (select 'aaa' || prefix || char || suffix || 'bbb' as text) as zz2, lateral (select to_tsvector('english', text) as ts_vector) as zz3, lateral (select websearch_to_tsquery('english', '"' || text || '"') as ts_query) as zz4, lateral (select ts_vector @@ ts_query as matches) as zz5 where not matches;
      text      |    ts_vector    |     ts_query     | matches
----------------+-----------------+------------------+---------
 aaa \x01 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x02 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x03 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x04 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x05 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x06 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x07 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x08 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x0E bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x0F bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x10 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x11 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x12 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x13 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x14 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x15 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x16 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x17 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x18 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x19 bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x1A bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x1B bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x1C bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x1D bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x1E bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x1F bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa # bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa $ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa % bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ' bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa * bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa + bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa , bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa . bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa / bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa: bbb       | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa : bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ; bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa = bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa > bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ? bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa @ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa [ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ] bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ^ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa _ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ` bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa { bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa } bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ~bbb       | 'aaa':1 'bbb':2 | 'aaa' <-> '~bbb' | f
 aaa ~ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \x7F bbb   | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0080 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0081 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0082 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0083 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0084 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0085 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0086 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0087 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0088 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0089 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u008A bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u008B bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u008C bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u008D bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u008E bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u008F bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0090 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0091 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0092 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0093 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0094 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0095 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0096 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0097 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0098 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u0099 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u009A bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u009B bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u009C bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u009D bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u009E bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa \u009F bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa   bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¡ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¢ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa £ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¤ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¥ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¦ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa § bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¨ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa © bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa « bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¬ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ­ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ® bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¯ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ° bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ± bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ² bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ³ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ´ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¶ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa · bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¸ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¹ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa » bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¼ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ½ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¾ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
 aaa ¿ bbb      | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'  | f
(114 rows)

There is no obvious workaround either:

- there's no function that converts a tsvector like 'aaa':1 'bbb':2
into a tsquery like 'aaa' <-> 'bbb', that one might be able to use to
build a query with exactly the same normalization as tsvector.

- replacing all problematic characters above by spaces seems to work
for most characters but not others, as for instance it fixes 'aaa
. bbb' but breaks 'aaa.bbb'.

select version();
                                                 version                                                
---------------------------------------------------------------------------------------------------------
 PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
(1 row)

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #16969: INSERT of multiple rows into GENERATED ALWAYS AS IDENTITY column with DEFAULT value is broken.
Next
From: "Pete O'Such"
Date:
Subject: Re: BUG #16959: Unnesting null from string_to_array silently removes whole rows from result