PostgreSQL has very powerful capabilities for storing data in any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added?
Some other people in this community have suggested that, and the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARACTER SET ch.
Text operations should work automatically, since in memory all strings would be converted to the database encoding.
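As a rough sketch of that read/write path (illustrative only, not PostgreSQL internals): a column stored in koi8_r would be decoded on read, handled in the database encoding (utf8 here), and re-encoded on write.

```python
# Illustrative sketch: a column stored on disk in koi8_r, while the server
# works in the database encoding (utf8). Reads transcode in, writes transcode out.
stored = "Привет".encode("koi8_r")        # on-disk column bytes
in_memory = stored.decode("koi8_r")       # read path: koi8_r -> unicode
db_form = in_memory.encode("utf-8")       # server-side representation
written = db_form.decode("utf-8").encode("koi8_r")  # write path back to the column
assert written == stored                  # round trip is lossless
```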
This approach would also open the possibility of implementing custom ENCODINGs for column data storage, such as snappy compression, or even BSON, gobs, or protobufs for much more compact storage of typed data.
Thanks for your idea, it sounds interesting, although I don't understand it that well yet.
The idea is very simple:
CREATE DATABASE utf8_database ENCODING 'utf8';
\c utf8_database
CREATE TABLE a(
id serial,
ascii_data text ENCODING 'ascii', -- will use ascii_to_utf8 to read and utf8_to_ascii to write
koi8_data text ENCODING 'koi8_r', -- will use koi8_r_to_utf8 to read and utf8_to_koi8_r to write
json_data json ENCODING 'bson' -- will use bson_to_json to read and json_to_bson to write
);
The problem with bson_to_json here is that it will probably not be possible to write JSON in koi8_r, for example. But that case is not even considered in these discussions yet anyway.
If the ENCODING machinery received not only the encoding name but also the type OID, it should be possible to write encoders for TYPEs and arrays of TYPEs. (I had to do this using casts to bytea and protobuf to minimize the storage size for an array of types when writing a lot of data, which could afterwards be unpacked directly in the DB as normal database types.)
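To illustrate the compact-array idea (a hedged sketch; Python's struct module stands in for protobuf here, and the record layout is an assumption), an array of typed records can be packed into one binary blob and later unpacked back into normal typed values, much like a cast to bytea and back:

```python
import struct

# Hypothetical record shape: (id int4, value float8) per array element.
records = [(1, 0.5), (2, 1.25), (3, -3.0)]

fmt = "<id"  # little-endian int4 + float8, no alignment padding
blob = b"".join(struct.pack(fmt, i, v) for i, v in records)

# "Unpack afterwards directly" into typed values again.
size = struct.calcsize(fmt)  # 12 bytes per record
unpacked = [struct.unpack_from(fmt, blob, off)
            for off in range(0, len(blob), size)]
assert unpacked == records
```

The point is only the shape of the technique: a fixed, compact binary layout keyed by the element type, which is why the encoder would need the type OID, not just an encoding name.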