Home > mailing lists

Re: Question regarding UTF-8 data and "C" collation on definition of field of table - Mailing list pgsql-general

From	Dionisis Kontominas
Subject	Re: Question regarding UTF-8 data and "C" collation on definition of field of table
Date	February 6, 2023 03:48:15
Msg-id	CAB4Evu1mV53U1K-5jGxRq-u2GchP3kq3R97O11CRvQhGXfJuew@mail.gmail.com Whole thread Raw
In response to	Re: Question regarding UTF-8 data and "C" collation on definition of field of table (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Question regarding UTF-8 data and "C" collation on definition of field of table (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-general

Tree view

Hi Tom,

Regarding the different languages in the same column, that is normal if the column is a UTF-8 one, i.e. should be able to hold for example English, Greek and Chinese characters. In this case what is the best approach to define the collation and lctype of the column? Either C or ucs_basic maybe or something else or it does not matter, the characters would be stored correctly as long as the database is UTF8 encoding?
Is there a real difference between C and ucs_basic collation/lctype? I have looked at the documentation and on the internet and most information converges that they are actually the same as behavior is concerned; actually no difference. What I suspect though as a non-similar is that C after the ASCII part sorts the characters according to byte code point, whereas the ucs_basic sorts those characters according to the Unicode code point which may be different I suppose. Can you confirm?
In case the previous is correct I think that the most successful collation for unicode columns would e to set the collation to ucs_basic collation type and lctype accordingly if exists.

Regards,

Dionisis

On Mon, 6 Feb 2023 at 01:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dionisis Kontominas <dkontominas@gmail.com> writes:
> I suppose that affects the outcome of ORDER BY clauses on the field,
> along with the content of the indexes. Is this right?

Yeah.

> Assuming that the requirement exists, to store UTF-8 characters on a
> field that can be from multiple languages, and the database default
> encoding is UTF8 which is the right thing I suppose (please verify), what
> do you think should be the values of the Collation and Ctype for the
> database to behave correctly?

Um ... so define "correct". If you have a mishmash of languages in the
same column, it's likely that they have conflicting rules about sorting,
and there may be no ordering that's not surprising to somebody.

If there's a predominant language in the data, selecting a collation
matching that seems like your best bet. Otherwise, maybe you should
just shrug your shoulders and stick with C collation. It's likely
to be faster than any alternative.

regards, tom lane

pgsql-general by date:

From: Tom Lane
Date: 06 February 2023, 03:19:01
Subject: Re: Question regarding UTF-8 data and "C" collation on definition of field of table

From: Ron
Date: 06 February 2023, 03:57:13
Subject: Re: Question regarding UTF-8 data and "C" collation on definition of field of table

Re: Question regarding UTF-8 data and "C" collation on definition of field of table - Mailing list pgsql-general

Previous

Next