Issue Supporting Emojis in Full Text Search on Ubuntu - Mailing list pgsql-novice

From Jordan Hurwich
Subject Issue Supporting Emojis in Full Text Search on Ubuntu
Date
Msg-id CAJKqsjs22F-sJVks96MO9PC9fZ-8FjK=GZOvQEvvoFVXCopbDw@mail.gmail.com
Whole thread Raw
Responses Re: Issue Supporting Emojis in Full Text Search on Ubuntu  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-novice
We rely on the Postgres tsvector implementation to enable full text search in our app, but we're having some issues with getting the parser to recognize emoji characters (like "😀" <U+1F600>) as anything other than "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters are recognized as "word"/"Word, all letters" characters on Mac; and non-english, non-emoji characters (like "我" <U+6211>) are recognized as "word" characters on both Mac and Ubuntu.

We greatly appreciate your feedback, debug details below and happy to provide more as requested,
Jordan
pulsasensors.com, jhurwich@

Platform:
- AWS Ubuntu 18.04.2 LTS vs MacOS 10.15.5
- postgres (PostgreSQL) 11.5


* ts_debug() differs on MacOS and Ubuntu *

We have not modified the 'english' text search configuration on either instance, however the query "SELECT * FROM ts_debug('english', '😀');" returns different results on MacOS 10.15.5 and our Ubuntu instance:
- on MacOS:
db=#  select * from  ts_debug('english', '😀');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | 😀     | {english_stem} | english_stem | {😀}


 - on Ubuntu:
db=# SELECT * from ts_debug('english','😀');
 alias |  description  | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
 blank | Space symbols | 😀     | {}           |            |


Notably non-english, non-emoji characters like '我' behave as desired on both instances, with the same result on both MacOS and Ubuntu for "SELECT * FROM ts_debug('english', '我');":
db=# SELECT * FROM ts_debug('english', '我');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | 我    | {english_stem} | english_stem | {我}



* pg_database *

There are minor differences between MacOS and Ubuntu in pg_database as follows, however modifications to set datcollate and datctype to 'C' on Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for ts_debug(). See row for 'testdb01':
- on Mac:
db=# select datname, encoding, datcollate, datctype, datistemplate from pg_database;
    datname     | encoding | datcollate | datctype | datistemplate
----------------+----------+------------+----------+---------------
 postgres       |        6 | C          | C        | f
 template0      |        6 | C          | C        | t
 template1      |        6 | C          | C        | t
 testdb01       |        6 | C          | C        | f


- on Ubuntu:
db=# select datname, encoding, datcollate, datctype, datistemplate from pg_database;
  datname  | encoding | datcollate  |  datctype   | datistemplate
-----------+----------+-------------+-------------+---------------
 postgres  |        6 | C.UTF-8     | C.UTF-8     | f
 template0 |        6 | C.UTF-8     | C.UTF-8     | t
 template1 |        6 | en_US.UTF-8 | en_US.UTF-8 | t
 testdb01  |        6 | en_US.UTF-8 | en_US.UTF-8 | f



* locale *

The result of `$ locale` on both instances is similar, included below for Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items are returned while only 4 entries are returned on Ubuntu, included below:

- on Ubuntu
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=


- on Ubuntu
$ locale -a
C
C.UTF-8
en_US.utf8
POSIX


* Postgres installation *

On Mac, Postgres was installed and is managed by Homebrew via the "postgresql@11" formula.
On Ubuntu, Postgres was installed from source at https://ftp.postgresql.org/pub/source/v11.5/postgresql-11.5.tar.bz2.

pgsql-novice by date:

Previous
From: Laurenz Albe
Date:
Subject: Re: manual for syntax for pg Query Tool
Next
From: Tom Lane
Date:
Subject: Re: Issue Supporting Emojis in Full Text Search on Ubuntu