
From: Tatsuo Ishii
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Date: 2013-11-12 15:57:52
Msg-id: 20131112.155752.666523035722474275.t-ishii@sraoss.co.jp
In response to: Re: UTF8 national character data type support WIP patch and list of open issues. (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: UTF8 national character data type support WIP patch and list of open issues. (Peter Eisentraut <peter_e@gmx.net>)
           Re: UTF8 national character data type support WIP patch and list of open issues. (Martijn van Oosterhout <kleptog@svana.org>)
List: pgsql-hackers
> I'd be much more impressed by seeing a road map for how we get to a
> useful amount of added functionality --- which, to my mind, would be
> the ability to support N different encodings in one database, for N>2.
> But even if you think N=2 is sufficient, we haven't got a road map, and
> commandeering spec-mandated syntax for an inadequate feature doesn't seem
> like a good first step.  It'll just make our backwards-compatibility
> problems even worse when somebody does come up with a real solution.

I have been thinking about this for years, and I think the key idea is
to implement a "universal encoding". To support N>2 encodings in a
database, the universal encoding should have the following
characteristics:

1) lossless round-trip conversion to and from existing encodings

2) no mapping table is necessary to convert from/to existing encodings

Once we implement the universal encoding, other problems, such as the
"pg_database with multiple encodings" problem, can be solved easily.

Since no such universal encoding currently exists, I think the only
way is to invent one ourselves.

At this point, the design of the encoding I have in mind is:

1) A 1-byte encoding identifier followed by a 7-byte body (8 bytes in
total). The encoding identifier's value is between 0x80 and 0xff and
is assigned to an existing encoding such as UTF-8, ASCII, EUC-JP and
so on. The encodings should be limited to "database safe" encodings.
The body holds the raw bytes of a character in the identified
encoding. This form is called a "word".

2) We also have a "multibyte" representation of the universal
encoding. The first byte gives the total length of the multibyte
character (similar to the first byte of UTF-8). The second byte is the
encoding identifier explained above. The rest of the character is the
same as above.

Forms #1 and #2 are logically equivalent and can be converted to each
other, so we can use whichever form is convenient.
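
To make the two forms concrete, here is a rough C sketch of both
layouts (all type and identifier names are illustrative, not from any
actual patch, and the identifier assignments are arbitrary):

/*
 * Hypothetical sketch of the two layouts described above.  Nothing
 * here is from an actual patch; names and identifier values are
 * made up.
 */
#include <stdint.h>

/* One-byte encoding identifiers, in the range 0x80..0xff. */
enum
{
    UENC_UTF8   = 0x80,     /* assumed assignment */
    UENC_ASCII  = 0x81,
    UENC_EUC_JP = 0x82
    /* ... one identifier per "database safe" encoding ... */
};

/* Form #1: fixed-width "word", 8 bytes in total. */
typedef struct UniversalWord
{
    uint8_t encoding_id;    /* 0x80 .. 0xff */
    uint8_t body[7];        /* raw character bytes, zero padded */
} UniversalWord;

/*
 * Form #2 is a variable-width byte stream:
 *
 *   byte 0:   total length of this character (like UTF-8's first byte)
 *   byte 1:   encoding identifier, same values as above
 *   byte 2..: raw character bytes in the identified encoding
 */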

Form #1 is easy to handle because each word has a fixed length (8
bytes), so it would probably be used for temporary data in
memory. Form #2 saves space and would be used for the stored data
itself.
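
Since the two forms carry exactly the same information, converting
between them is mechanical. A minimal sketch, reusing the hypothetical
types above (and assuming trailing zero bytes in a word's body are
padding, never part of the character itself):

#include <string.h>

/*
 * Convert a fixed-width word into the variable-width form; returns
 * the number of bytes written ("dest" needs room for up to 9 bytes).
 */
static int
word_to_multibyte(const UniversalWord *w, uint8_t *dest)
{
    int body_len = 7;

    while (body_len > 1 && w->body[body_len - 1] == 0)
        body_len--;             /* trim the zero padding */

    dest[0] = (uint8_t) (2 + body_len);  /* total length byte */
    dest[1] = w->encoding_id;
    memcpy(dest + 2, w->body, body_len);
    return 2 + body_len;
}

/* Convert the variable-width form back into a fixed-width word. */
static void
multibyte_to_word(const uint8_t *src, UniversalWord *w)
{
    int body_len = src[0] - 2;

    w->encoding_id = src[1];
    memset(w->body, 0, sizeof(w->body));
    memcpy(w->body, src + 2, body_len);
}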

If we want a table encoded differently from the database encoding, the
table is stored in the universal encoding. pg_class should record this
fact to avoid confusion about which encoding a table uses. I expect
the majority of tables in a database to use the database encoding,
with only a few tables wanting a different one; this design pushes the
penalty onto that minority.
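
For instance, pg_class could grow a per-relation encoding field; the
following sketch is purely hypothetical ("relencoding" is an imagined
column, not an existing one), with 0 meaning "same as the database
encoding" so that the common case stays cheap:

/*
 * Purely hypothetical: "relencoding" is an imagined pg_class column.
 * 0 means "same as the database encoding", which the majority of
 * tables would use, so only tables stored in another encoding pay
 * any extra lookup or conversion cost.
 */
typedef struct RelEncodingInfo
{
    int relencoding;    /* imagined per-relation encoding, 0 = default */
} RelEncodingInfo;

static int
table_encoding(const RelEncodingInfo *rel, int database_encoding)
{
    return (rel->relencoding != 0) ? rel->relencoding : database_encoding;
}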

If we need to join two tables that have different encodings, we must
convert them to the same encoding (this should succeed if the
encodings are "compatible"). If the conversion fails, the join fails
too.
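
A rough sketch of how such a compatibility check might look (all
names are hypothetical; "compatible" here simply means a conversion
path exists, e.g. one registered in pg_conversion):

#include <stdbool.h>

/*
 * Hypothetical: choose a common encoding for a join between two
 * tables.  encoding_conversion_exists() stands in for a lookup of the
 * available conversions.  When this returns false the join would be
 * rejected.
 */
extern bool encoding_conversion_exists(int src, int dst);

static bool
resolve_join_encoding(int enc_a, int enc_b, int *common)
{
    if (enc_a == enc_b)
    {
        *common = enc_a;
        return true;
    }
    if (encoding_conversion_exists(enc_b, enc_a))
    {
        *common = enc_a;    /* convert B's side into A's encoding */
        return true;
    }
    if (encoding_conversion_exists(enc_a, enc_b))
    {
        *common = enc_b;    /* convert A's side into B's encoding */
        return true;
    }
    return false;           /* incompatible encodings: the join fails */
}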

We could extend this technique to a design that allows each column to
have its own encoding.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


