Re: Supporting SJIS as a database encoding - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: Supporting SJIS as a database encoding
Date
Msg-id 20160906.122904.256837704.horiguchi.kyotaro@lab.ntt.co.jp
In response to Re: Supporting SJIS as a database encoding  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Supporting SJIS as a database encoding  ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>)
List pgsql-hackers
Hello,

At Mon, 5 Sep 2016 19:38:33 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<529db688-72fc-1ca2-f898-b0b99e30076f@iki.fi>
> On 09/05/2016 05:47 PM, Tom Lane wrote:
> > "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> >> Before digging into the problem, could you share your impression on
> >> whether PostgreSQL can support SJIS?  Would it be hopeless?
> >
> > I think it's pretty much hopeless.
> 
> Agreed.

+1, even as a user of SJIS:)

> But one thing that would help a little, would be to optimize the UTF-8
> -> SJIS conversion. It uses a very generic routine, with a binary
> search over a large array of mappings. I bet you could do better than
> that, maybe using a hash table or a radix tree instead of the large
> binary-searched array.

I'm very impressed by the idea. The mean number of iterations for
a binary search on the current conversion table with 8000 characters
is about 13, and the table size is probably under 100 kBytes.
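
As a rough check of that iteration count, here is a minimal sketch that counts the probes a textbook binary search makes over a sorted table of 8000 entries. The function name `iterations_to_find` and the use of plain integer positions as keys are assumptions for illustration; this is not the actual conversion routine.

```python
def iterations_to_find(n, target):
    """Count probes of a textbook binary search over positions 0..n-1."""
    lo, hi = 0, n - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if mid == target:
            return steps
        if mid < target:
            lo = mid + 1
        else:
            hi = mid - 1
    raise ValueError("target not in table")

N = 8000  # approximate number of mapped characters, as stated in the mail
counts = [iterations_to_find(N, i) for i in range(N)]
worst = max(counts)
mean = sum(counts) / N
print(f"worst case: {worst} probes, mean: {mean:.2f} probes")
```

Under these assumptions the worst case is 13 probes and the mean for a successful lookup is slightly lower, which is consistent with "about 13".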

A three-level array with 2-byte values will take about 1.6 to 2 MB of
memory.

A radix tree for UTF-8 -> some-encoding conversion requires at most
about the following (using a 1-byte index to point to the next level;
byte values are in hex):

  1 * ((7f + 1)                            1-byte sequences
     + (df - c2 + 1) * (bf - 80 + 1)       2-byte sequences
     + (ef - e0 + 1) * (bf - 80 + 1)^2)    3-byte sequences
  = 67 kbytes.

SJIS characters are at most 2 bytes long, so about 8000 characters
take an extra 16 kBytes, plus some padding space.
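
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the upper bound described in the mail; the variable names are made up for illustration, and padding is not included.

```python
# Upper bound on index slots, one byte each, by UTF-8 sequence length.
CONT = 0xBF - 0x80 + 1                      # continuation byte values: 64

one_byte = 0x7F + 1                         # lead bytes 00..7f
two_byte = (0xDF - 0xC2 + 1) * CONT         # lead bytes c2..df
three_byte = (0xEF - 0xE0 + 1) * CONT ** 2  # lead bytes e0..ef

index_bytes = 1 * (one_byte + two_byte + three_byte)

# Leaf values: about 8000 characters, at most 2 bytes each in SJIS.
value_bytes = 8000 * 2

total_bytes = index_bytes + value_bytes
print(index_bytes, value_bytes, total_bytes)
```

The index comes to 67584 bytes, which matches the 67 kbytes figure, and adding the 16000 bytes of values keeps the total below 100 kBytes before padding.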

As a result, a radix tree seems promising because it needs little
additional memory and far fewer comparisons.  Big5 and other
encodings, including EUC-*, will also benefit from it.
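
To illustrate why a radix tree needs far fewer steps, here is a minimal sketch of a byte-indexed tree keyed on UTF-8 bytes. It is not the proposed implementation: the helpers `new_node`, `insert` and `lookup` are hypothetical, nodes are plain 256-entry Python lists rather than compact arrays with 1-byte indexes, and the handful of mappings is generated with Python's built-in `shift_jis` codec instead of the real mapping tables.

```python
def new_node():
    """One tree level: 256 slots, one per possible byte value."""
    return [None] * 256

def insert(root, utf8, target):
    """Store target bytes under the UTF-8 byte sequence utf8."""
    node = root
    for b in utf8[:-1]:
        if node[b] is None:
            node[b] = new_node()
        node = node[b]
    node[utf8[-1]] = target

def lookup(root, utf8):
    """Follow one slot per input byte; return target bytes or None."""
    node = root
    for b in utf8:
        if not isinstance(node, list):
            return None          # input is longer than any stored key
        node = node[b]
        if node is None:
            return None          # no mapping for this sequence
    return node if isinstance(node, bytes) else None

# Build a tiny tree from a few sample characters (illustrative only).
root = new_node()
for ch in ["A", "あ", "漢"]:
    insert(root, ch.encode("utf-8"), ch.encode("shift_jis"))

sjis = lookup(root, "あ".encode("utf-8"))
print(sjis.hex())                # 82a0
```

A lookup touches one slot per input byte, so a 3-byte UTF-8 character is resolved in 3 indexed reads, compared with roughly 13 comparisons for the binary search.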

This requires implementing the radix tree code, redefining the
format of the mapping tables to support the radix tree, and then
modifying the mapping generator script.

If no one opposes this, I'll do that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




