PostgreSQL has very powerful capabilities for storing data in any kind of encoding. So maybe it makes sense to add ENCODING as another column property, the same way COLLATION was added?
Some other people in this community have suggested that, and the SQL standard suggests the same -- specifying a character encoding for each column: CHAR(n) CHARACTER SET ch.
Text operations should work automatically, since in memory all strings would be converted to the database encoding.
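As a rough sketch of that read/write path (illustrative only, not PostgreSQL internals): a column stored in koi8_r would be decoded on read, handled in the database encoding (utf8 here), and re-encoded on write.

```python
# Illustrative sketch: a column stored on disk in koi8_r, while the server
# works in the database encoding (utf8). Reads transcode in, writes transcode out.
stored = "Привет".encode("koi8_r")        # on-disk column bytes
in_memory = stored.decode("koi8_r")       # read path: koi8_r -> unicode
db_form = in_memory.encode("utf-8")       # server-side representation
written = db_form.decode("utf-8").encode("koi8_r")  # write path back to the column
assert written == stored                  # round trip is lossless
```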
This approach would also open the possibility of implementing custom ENCODINGs for column data storage, such as snappy compression, or even BSON, gobs, or protobufs for much more compact storage of typed data.
Thanks for your idea, it sounds interesting, although I don't understand it that well yet.
The idea is very simple:
CREATE DATABASE utf8_database ENCODING 'utf8';
\c utf8_database
CREATE TABLE a(
id serial,
ascii_data text ENCODING 'ascii', -- will use ascii_to_utf8 to read and utf8_to_ascii to write
koi8_data text ENCODING 'koi8_r', -- will use koi8_r_to_utf8 to read and utf8_to_koi8_r to write
json_data json ENCODING 'bson' -- will use bson_to_json to read and json_to_bson to write
);
The problem with bson_to_json here is that it will probably not be possible to write JSON in koi8_r, for example. But that case is not even considered in these discussions yet anyway.
If the ENCODING machinery received not only the encoding name but also the type OID, it should be possible to write encoders for TYPEs and arrays of TYPEs. (I had to do this using casts to bytea and protobuf to minimize the storage size for an array of types when writing a lot of data, which could afterwards be unpacked directly in the DB as normal database types.)
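To illustrate the compact-array idea (a hedged sketch; Python's struct module stands in for protobuf here, and the record layout is an assumption), an array of typed records can be packed into one binary blob and later unpacked back into normal typed values, much like a cast to bytea and back:

```python
import struct

# Hypothetical record shape: (id int4, value float8) per array element.
records = [(1, 0.5), (2, 1.25), (3, -3.0)]

fmt = "<id"  # little-endian int4 + float8, no alignment padding
blob = b"".join(struct.pack(fmt, i, v) for i, v in records)

# "Unpack afterwards directly" into typed values again.
size = struct.calcsize(fmt)  # 12 bytes per record
unpacked = [struct.unpack_from(fmt, blob, off)
            for off in range(0, len(blob), size)]
assert unpacked == records
```

The point is only the shape of the technique: a fixed, compact binary layout keyed by the element type, which is why the encoder would need the type OID, not just an encoding name.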