UTF8 national character data type support WIP patch and list of open issues. - Mailing list pgsql-hackers

From Boguk, Maksym
Subject UTF8 national character data type support WIP patch and list of open issues.
Msg-id A756FAD7EDC2E24F8CAB7E2F3B5375E918E2D5B8@FALEX03.au.fjanz.com
List pgsql-hackers
Hi,

As part of my job I have started developing in-core support for the UTF8
National Character types (national character/national character
varying).
I have attached the current WIP patch (against HEAD) for community review.

Target usage: the ability to store UTF8 national characters in some
selected fields inside a single-byte encoded database.
For example, if I have a ru-RU.koi8r encoded database with mostly Russian
text inside, it would be nice to be able to store Japanese text in one
field without converting the whole database to UTF8 (converting such a
database to UTF8 could easily almost double the database size, even if
only one field in the whole database uses any symbols outside of the
ru-RU.koi8r encoding).
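
To make the size argument concrete, here is a minimal Python sketch (not
part of the patch; the sample strings are made up for illustration). It
compares the byte length of the same Russian text in KOI8-R and in UTF8,
and checks whether Japanese text can be represented in KOI8-R at all.

```python
# Illustration only: the sample strings are arbitrary, not from the patch.
russian = "Привет, мир"
japanese = "こんにちは"

# KOI8-R stores every character in one byte.
koi8_size = len(russian.encode("koi8_r"))
# UTF8 needs two bytes for each Cyrillic letter, one for ASCII.
utf8_size = len(russian.encode("utf-8"))


def fits_in_koi8r(text: str) -> bool:
    """Return True if every character of text is representable in KOI8-R."""
    try:
        text.encode("koi8_r")
        return True
    except UnicodeEncodeError:
        return False


print(koi8_size, utf8_size)
print(fits_in_koi8r(russian), fits_in_koi8r(japanese))
```

For mostly Cyrillic text the UTF8 form approaches twice the KOI8-R size,
while the Japanese string cannot be stored in KOI8-R in any form.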

What has been done:

1) Addition of new string data types NATIONAL CHARACTER and NATIONAL
CHARACTER VARYING.
These types differ from the char/varchar data types in one important
respect: NATIONAL string types always have UTF8 encoding
(independent of the database encoding in use).
Of course, that leads to encoding conversion overhead when comparing
NATIONAL string types with common string types (that is expected and
unavoidable).
2) Some ECPG support for these types
3) Some documentation patch (not finished)
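
The conversion overhead mentioned in item 1 can be illustrated with a
small Python sketch. This is an assumption about the general idea, not
the patch's actual code: the function name, the choice of KOI8-R as the
database encoding, and the plain bytewise comparison (no collation) are
all hypothetical.

```python
DATABASE_ENCODING = "koi8_r"  # assumed database encoding for this example


def national_equals_text(national_utf8: bytes, db_text: bytes) -> bool:
    """Compare a UTF8-stored value with a database-encoded value.

    The database-encoded side is converted to UTF8 first, because a
    KOI8-R string always has a UTF8 form, while the reverse is not true.
    This extra conversion is the overhead paid on every mixed comparison.
    """
    db_as_utf8 = db_text.decode(DATABASE_ENCODING).encode("utf-8")
    return national_utf8 == db_as_utf8


stored_national = "Москва".encode("utf-8")   # bytes as kept in a NATIONAL field
stored_varchar = "Москва".encode("koi8_r")   # bytes as kept in a varchar field

print(stored_national == stored_varchar)                      # raw bytes differ
print(national_equals_text(stored_national, stored_varchar))  # equal after conversion
```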

What needs to be done:

1) Full set of string functions and operators for NATIONAL types (we
could not use generic text functions because they assume that the strings
will have database encoding).
Only a basic set is implemented for now.
2) Need to implement some way to define a default collation for
NATIONAL types.
3) Need to implement some way to input UTF8 characters into NATIONAL types
via SQL (there is a serious open problem here; it is described later in
the text).
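
As a rough illustration of item 1, the Python sketch below shows why a
function written for the database encoding gives wrong answers on UTF8
data. Both functions are hypothetical stand-ins, not PostgreSQL code; the
first models a character count in a single-byte database, the second
counts UTF8 code points.

```python
def char_length_single_byte(data: bytes) -> int:
    """Character count as computed under a single-byte encoding:
    every byte is treated as one character."""
    return len(data)


def char_length_utf8(data: bytes) -> int:
    """Character count for UTF8 data: count every byte that is not a
    continuation byte (continuation bytes look like 10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


value = "日本語".encode("utf-8")  # three characters, three bytes each

print(char_length_single_byte(value))
print(char_length_utf8(value))
```

The same mismatch would affect substring, padding and case conversion,
which is why NATIONAL types need their own set of functions.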

The most serious open problem is that the patch in its current state
doesn't allow input/output of UTF8 symbols which cannot be represented in
the database encoding into NATIONAL fields.
This happens because encoding conversion from the client_encoding to the
database encoding happens before the syntax analysis/parse stage and
throws an error for symbols which cannot be represented.
I don't see any good solution to this problem except making the whole
codebase use UTF8 encoding for all internal operations, with a huge
performance hit.
Maybe someone has a good idea how to deal with this issue.
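
The ordering problem can be modelled with a short Python sketch. It is a
simplified model under stated assumptions, not the server's code: the
function names, the table and column names, and the use of Python codecs
in place of the server's conversion routines are all hypothetical.

```python
CLIENT_ENCODING = "utf-8"      # assumed client_encoding
DATABASE_ENCODING = "koi8_r"   # assumed database encoding


def convert_query(raw_query: bytes) -> bytes:
    """Convert the whole query text from client to database encoding.

    This step runs before any parsing, so it cannot know that a literal
    is destined for a NATIONAL column and should stay in UTF8.
    """
    return raw_query.decode(CLIENT_ENCODING).encode(DATABASE_ENCODING)


def reaches_parser(raw_query: bytes) -> bool:
    """Return True if the query survives conversion and could be parsed."""
    try:
        convert_query(raw_query)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical table t with a NATIONAL CHARACTER column named note.
russian_query = "INSERT INTO t (note) VALUES (N'Привет')".encode("utf-8")
japanese_query = "INSERT INTO t (note) VALUES (N'こんにちは')".encode("utf-8")

print(reaches_parser(russian_query))
print(reaches_parser(japanese_query))
```

The Japanese literal is rejected during conversion, before the parser
ever sees that the target column is a NATIONAL type.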

This is really a WIP patch (with lots of things on the todo list and
polish still required).

Please tell me what you think about this idea/patch in general.

PS: It is my first patch to PostgreSQL, so there is certainly a lot of
room for improvement, including in style.


Kind Regards,
Maksym






