Re: Issue Supporting Emojis in Full Text Search on Ubuntu - Mailing list pgsql-novice

From Tom Lane
Subject Re: Issue Supporting Emojis in Full Text Search on Ubuntu
Date
Msg-id 547627.1593563243@sss.pgh.pa.us
Whole thread Raw
In response to Issue Supporting Emojis in Full Text Search on Ubuntu  (Jordan Hurwich <jhurwich@pulsasensors.com>)
Responses Re: Issue Supporting Emojis in Full Text Search on Ubuntu  (Jordan Hurwich <jhurwich@pulsasensors.com>)
Re: Issue Supporting Emojis in Full Text Search on Ubuntu  (Bzzzz <lazyvirus@gmx.com>)
List pgsql-novice
Jordan Hurwich <jhurwich@pulsasensors.com> writes:
> We rely on the Postgres tsvector implementation to enable full text search
> in our app, but we're having some issues with getting the parser to
> recognize emoji characters (like "😀" <U+1F600>) as anything other than
> "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters
> are recognized as "word"/"Word, all letters" characters on Mac; and
> non-english, non-emoji characters (like "我" <U+6211>) are recognized as
> "word" characters on both Mac and Ubuntu.

With a non-C lc_ctype setting, which characters are considered to be
letters is entirely up to the locale code supplied by the platform's libc.
I duplicate your result that U+1F600 is not considered a letter by glibc,
at least not with lc_ctype = en_US.utf8.  (Perhaps there are other locale
settings that would consider it a letter?  Not my expertise though.)

However, with lc_ctype set to C, you should get a platform-independent
result that any non-ASCII character is a letter.

macOS' utf8-based locales are almost completely broken, unfortunately,
so it's hard to predict what they will do except that it'll probably be
wrong.

> We have not modified the 'english' text search configuration on either
> instance, however the query "SELECT * FROM ts_debug('english', '😀');"
> returns different results on MacOS 10.15.5 and our Ubuntu instance:

That's unsurprising in itself, per the above.  There's no standardization
worth mentioning in this area.

> There are minor differences between MacOS and Ubuntu in pg_database as
> follows, however modifications to set datcollate and datctype to 'C' on
> Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for
> ts_debug().

lc_ctype = C should have done what you want (at least for this specific
symbol), so I wonder whether you did the test right.  You gave few details
about how you tested this.

> db=# select datname, encoding, datcollate, datctype, datistemplate from
> pg_database;
>     datname     | encoding | datcollate | datctype | datistemplate
> ----------------+----------+------------+----------+---------------
>  postgres       |        6 | C          | C        | f
>  template0      |        6 | C          | C        | t
>  template1      |        6 | C          | C        | t
>  testdb01       |        6 | C          | C        | f

Hm, how are you getting that result when your prompt says you're
connected to database "db"?

> The result of `$ locale` on both instances is similar, included below for
> Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items
> are returned while only 4 entries are returned on Ubuntu, included below:
> - on Ubuntu
> $ locale -a
> C
> C.UTF-8
> en_US.utf8
> POSIX

This just indicates that you didn't install the package(s) that provide
alternative locales.  On my RHEL box, "locale -a" reports 865 entries,
and I'm pretty sure I don't have all the odder ones.  I do not know,
unfortunately, how Ubuntu/Debian divvy this stuff up into packages.

Anyway, the bottom line is that if you need platform-independent results
then setting lc_ctype to "C" is what to do.  It will not be very bright
about non-ASCII characters, but at least the stupidity will be uniform.

            regards, tom lane



pgsql-novice by date:

Previous
From: Jordan Hurwich
Date:
Subject: Issue Supporting Emojis in Full Text Search on Ubuntu
Next
From: Chamath Sajeewa
Date:
Subject: PostgreSQL Module to use Oracle specific functions as it is.