Re: fixing tsearch locale support - Mailing list pgsql-hackers

From Daniel Verite
Subject Re: fixing tsearch locale support
Msg-id 15e97660-9e3c-43a2-8cad-7b33fc7f7476@manitou-mail.org
In response to Re: fixing tsearch locale support  (Peter Eisentraut <peter@eisentraut.org>)
List pgsql-hackers

Peter Eisentraut wrote:

> There is a PG18 open item to document this possible upgrade incompatibility.
>
> I think the following text could be added to the release notes:
>
> """
> The locale implementation underlying full-text search was improved.  It
> now observes the locale provider configured for the database.  It was
> previously hardcoded to use the configured libc LC_CTYPE setting
> [...]

That sounds misleading because LC_CTYPE is still used in 18.

To illustrate: in an ICU database, the parser classifies the em dash
(U+2014) as a separator or not, depending on LC_CTYPE.

With LC_CTYPE=C:

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
 alias |   token   |   lexemes
-------+-----------+-------------
 word  | ABCD—EFGH | {abcd—efgh}


With LC_CTYPE=en_US.utf8 (glibc 2.35):

=> select alias, token,lexemes from ts_debug('simple', U&'ABCD\2014EFGH');
   alias   | token | lexemes
-----------+-------+---------
 asciiword | ABCD  | {abcd}
 blank     | —     |
 asciiword | EFGH  | {efgh}


OTOH, lowercasing uses LC_CTYPE in 17 but not in 18, which leads
to better lexemes.

pg17, ICU locale, LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {ÉtÉ}

pg18, ICU locale, LC_CTYPE=C

=> select alias, token,lexemes from ts_debug('simple', 'ÉTÉ');
 alias | token | lexemes
-------+-------+---------
 word  | ÉTÉ   | {été}
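
For anyone wanting to reproduce the outputs above, here is a minimal
sketch of a possible setup. The database names are hypothetical, 'en'
is an arbitrary ICU locale chosen for illustration, and the second
statement assumes the operating system provides an en_US.utf8 locale.
Both databases use the ICU provider and differ only in their libc
settings; the last query just confirms what the current database was
created with.

```sql
-- Hypothetical names; ICU provider in both cases, only the libc settings differ.
CREATE DATABASE ts_icu_c
  TEMPLATE template0 ENCODING 'UTF8'
  LOCALE_PROVIDER icu ICU_LOCALE 'en'
  LC_COLLATE 'C' LC_CTYPE 'C';

CREATE DATABASE ts_icu_utf8
  TEMPLATE template0 ENCODING 'UTF8'
  LOCALE_PROVIDER icu ICU_LOCALE 'en'
  LC_COLLATE 'en_US.utf8' LC_CTYPE 'en_US.utf8';

-- After connecting to one of them, check the provider ('i' means ICU)
-- and the libc LC_CTYPE it was created with.
SELECT datname, datlocprovider, datctype
FROM pg_database
WHERE datname = current_database();
```

Running the ts_debug() calls shown earlier in each database should then
show which behavior follows datctype and which follows the provider.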

So maybe the release notes should say
"now observes the locale provider configured for the database to
convert strings to lower case".

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/


