
PostgreSQL Asian language support for full text search using ICU (and also updating pg_trgm) - Mailing list pgsql-general

From	Chanon Sajjamanochai
Subject	PostgreSQL Asian language support for full text search using ICU (and also updating pg_trgm)
Date	May 1, 2019 01:55:50
Msg-id	CAEV3FNPU8hU_hi=0+QNAbEkc-uO8-K9PB3aAChdmcCyPfWX6rg@mail.gmail.com
List	pgsql-general


Hello,

Currently PostgreSQL doesn't support full text search natively for many Asian languages such as Chinese, Japanese and others. These languages are used by a large portion of the population of the world.

The two key modules that could be modified to support Asian languages are the full text search module (including tsvector) and pg_trgm.

I would like to propose that this support be added to PostgreSQL.

For full text search, PostgreSQL could add a new parser (https://www.postgresql.org/docs/9.2/textsearch-parsers.html) that implements ICU word tokenization. This should be much easier than before, now that PostgreSQL itself already depends on ICU for other things.

Then allow the ICU parser to be chosen at run-time (via a run-time config or an option to to_tsvector). That is all that is needed to support full text search for many more Asian languages natively in PostgreSQL such as Chinese, Japanese and Thai.

For example, Elasticsearch implements this with its ICU Tokenizer plugin:
https://www.elastic.co/guide/en/elasticsearch/guide/current/icu-tokenizer.html

Some information about the related ICU APIs is at:

http://userguide.icu-project.org/boundaryanalysis
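
To illustrate why a whitespace-driven parser is not enough for these languages, here is a minimal Python sketch. It is neither ICU nor PostgreSQL code: `segment` and `TOY_DICT` are hypothetical names, and the greedy longest-match approach is only a stand-in for the far more capable word breaking that a library such as ICU provides.

```python
# Toy lexicon, assumed purely for illustration; a real segmenter uses a large dictionary.
TOY_DICT = {"全文", "检索", "数据库", "支持"}

def segment(text, dictionary=TOY_DICT):
    """Greedy longest-match word segmentation (hypothetical helper, not ICU)."""
    max_len = max(len(w) for w in dictionary)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

sentence = "数据库支持全文检索"  # roughly: "the database supports full text search"
print(sentence.split())   # whitespace splitting yields one giant token
print(segment(sentence))  # dictionary-driven splitting yields searchable words
```

Splitting on whitespace leaves the whole sentence as a single token, so a search for one word inside it can never match. A parser that knows where word boundaries fall produces separate lexemes that can be indexed.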

Another simple improvement, which would give a further option for searching Asian-language text, is to add a run-time setting for pg_trgm that tells it not to drop non-ASCII characters. Currently it indexes only ASCII characters, so all Asian-language characters are dropped.
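
The effect of such a setting can be sketched in a few lines of Python. This is not pg_trgm's C implementation: `trigrams` and its `ascii_only` flag are hypothetical, and the sketch only mimics the documented rule that each word is padded with two spaces in front and one behind before three-character windows are taken.

```python
import re

def trigrams(text, ascii_only):
    """Sketch of pg_trgm-style trigram extraction (hypothetical, for illustration)."""
    # Word characters: ASCII alphanumerics only, or any Unicode alphanumeric.
    pattern = r"[A-Za-z0-9]+" if ascii_only else r"[^\W_]+"
    result = set()
    for word in re.findall(pattern, text.lower()):
        padded = "  " + word + " "  # two spaces before, one after
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result

print(trigrams("東京", ascii_only=True))   # nothing left to index
print(trigrams("東京", ascii_only=False))  # trigrams built from the CJK characters
```

With the ASCII-only rule the Japanese word yields no trigrams at all, so no index entry can ever match it. Once non-ASCII characters are kept, the same word produces three trigrams and becomes searchable.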

I emphasize 'run-time setting' because when using PostgreSQL through a Database-as-a-Service provider, it is usually not possible to change the config files, recompile sources, or add new extensions.

PostgreSQL is an awesome project and probably the best RDBMS right now. I hope the maintainers consider this suggestion.

Best Regards,

Chanon

