Thread: ts_parse reports different between MacOS, FreeBSD/Linux

ts_parse reports different between MacOS, FreeBSD/Linux

From: Mark Felder
Date: 22 December 2020, 21:15:39

Hello,

We have an application whose test suite fails on MacOS when running the search tests on unicode characters.

I've narrowed it down to the following:

macos=# select * from ts_parse('default','天');
 tokid | token
-------+-------
    12 | 天
(1 row)

freebsd=# select * from ts_parse('default','天');
 tokid | token
-------+-------
     2 | 天
(1 row)
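
For reference, the tokid values in the two outputs above can be decoded with ts_token_type, which lists the token types a parser knows about. This is a minimal sketch, assuming the stock 'default' parser; the rows in the comments are the built-in parser's entries for those two ids.

```sql
-- Look up the two token types that appear in the outputs above.
SELECT * FROM ts_token_type('default') WHERE tokid IN (2, 12);
--  tokid | alias | description
--      2 | word  | Word, all letters
--     12 | blank | Space symbols

-- Run on each platform to see the token class by name instead of by number.
SELECT tokid, t.alias, t.description, p.token
FROM ts_parse('default', '天') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```

Read this way, FreeBSD/Linux classify 天 as a word while macOS classifies it as blank, which is presumably why the search tests fail there: blank tokens are normally not indexed.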


This has been bugging me for a while, but it's a test our devs using MacOS just ignore for now, as we know it passes
our CI/CD pipeline on FreeBSD/Linux. It seems that anyone shipping an app on MacOS and bundling Postgres is going
to have a bad time with searching.


Please let me know if there's anything I can do to help. Will gladly test patches.



Thanks,



--
  Mark Felder
  ports-secteam & portmgr alumni
  feld@FreeBSD.org

Re: ts_parse reports different between MacOS, FreeBSD/Linux

From: Tom Lane
Date: 22 December 2020, 21:46:44

"Mark Felder" <feld@FreeBSD.org> writes:
> We have an application whose test suite fails on MacOS when running the search tests on unicode characters.

Yeah, known problem :-(.  The text search parser relies on the C library's
locale data to classify characters as being letters, digits, etc.
Unfortunately, the UTF8 locales on macOS are just horribly bad, and
report many results that are different from other platforms.
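
To see which locale is in play on a given installation, the current database's locale settings can be inspected. This is a minimal sketch, assuming the parser's character classification follows the database's LC_CTYPE (the datctype column of the pg_database catalog); the values returned depend on how the cluster was initialized.

```sql
-- Show the encoding and locale settings of the current database.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();
```

Two servers can report the same datctype name and still disagree on a character, since the classification data behind that name comes from each platform's C library.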

I suppose that Apple has got reasonable Unicode character knowledge
somewhere in their OS; they are just not very interested in making the
POSIX locale APIs work well.  Which leaves us with a bit of a problem
for getting consistent results cross-platform.

            regards, tom lane