Re: [HACKERS] Radix tree for character conversion - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: [HACKERS] Radix tree for character conversion
Date
Msg-id 20170323.121307.241436413.horiguchi.kyotaro@lab.ntt.co.jp
In response to Re: [HACKERS] Radix tree for character conversion  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: Radix tree for character conversion  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-hackers
At Tue, 21 Mar 2017 13:10:48 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170321.131048.150321071.horiguchi.kyotaro@lab.ntt.co.jp>
> At Fri, 17 Mar 2017 13:03:35 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<01efd334-b839-0450-1b63-f2dea9326a7e@iki.fi>
> > On 03/17/2017 07:19 AM, Kyotaro HORIGUCHI wrote:
> > > I would like to use convert() function. It can be a large
> > > PL/PgSQL function or a series of "SELECT convert(...)"s. The
> > > latter is doable on-the-fly (by not generating/storing the whole
> > > script).
> > >
> > > | -- Test for SJIS->UTF-8 conversion
> > > | ...
> > > | SELECT convert('\0000', 'SJIS', 'UTF-8'); -- results in error
> > > | ...
> > > | SELECT convert('\897e', 'SJIS', 'UTF-8');
> > 
> > Makes sense.
> > 
> > >> You could then run those SQL statements against old and new server
> > >> version, and verify that you get the same results.
> > >
> > > Including the result files in the repository will make this easy
> > > but unacceptably bloats. Put mb/Unicode/README.sanity_check?
> > 
> > Yeah, a README with instructions on how to do sounds good. No need to
> > include the results in the repository, you can run the script against
> > an older version when you need something to compare with.
> 
> Ok, I'll write a small script to generate a set of "conversion
> dump" and try to write README.sanity_check describing how to use
> it.

I found that the SQL interface offers no way to identify the
character domain of a conversion. Unconditionally feeding every
value from 0 to 0xffffffff as a bytea string yields a far too
bloated result containing many bogus lines.  (If \x40 is a
character, convert() also accepts \x4040, \x404040 and
\x40404040.)
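
To illustrate the bogus-line problem, here is a rough sketch in
Python. It uses Python's built-in shift_jis codec as a stand-in
for convert(), which is an assumption: the two are not guaranteed
to agree on every byte sequence. The helper names are made up for
this example. It tries every two-byte input and separates genuine
two-byte characters from inputs that are accepted only because
they are two one-byte characters in a row.

```python
# Stand-in for SQL convert(): Python's built-in shift_jis codec
# (an assumption; it may not match the server's tables exactly).
def try_decode(raw: bytes):
    """Return the decoded string, or None if the input is rejected."""
    try:
        return raw.decode("shift_jis")
    except UnicodeDecodeError:
        return None


def scan_two_byte_inputs():
    """Try every two-byte input and count genuine characters
    versus concatenations of two shorter characters."""
    single_chars = 0  # input decodes to exactly one character
    bogus = 0         # input decodes to two characters, like \x4040
    for hi in range(0x100):
        for lo in range(0x100):
            out = try_decode(bytes([hi, lo]))
            if out is None:
                continue
            if len(out) == 1:
                single_chars += 1
            else:
                bogus += 1
    return single_chars, bogus


if __name__ == "__main__":
    single_chars, bogus = scan_two_byte_inputs()
    print("two-byte characters:", single_chars)
    print("bogus concatenations:", bogus)
```

Filtering on the number of characters in the output works in this
sketch, but over a 4-byte input range the enumeration itself is
the cost, which is why a blind scan from SQL does not look
practical.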

One more annoyance is that mappings and conversion procedures are
not in one-to-one correspondence. The correspondence is hidden in
the conversion_procs/*.c files, so we would have to either
extract it from them or provide it as knowledge. Neither seems
good.
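
For reference, "extracting it from them" could look roughly like
the following Python sketch. It assumes that each conversion
procedure source pulls in its mapping tables with #include lines
naming *.map files; the regular expression, the function name and
the sample text are all hypothetical and are not copied from the
tree.

```python
import re

# Assumed form of the include lines; the real files may differ.
MAP_INCLUDE = re.compile(r'#include\s+"(?:[^"]*/)?([^"/]+\.map)"')


def maps_used_by(c_source: str):
    """Return the .map file names included by one C source text."""
    return MAP_INCLUDE.findall(c_source)


# Hypothetical excerpt, made up for illustration only.
SAMPLE = '''
#include "postgres.h"
#include "../../Unicode/aaa_to_utf8.map"
#include "../../Unicode/utf8_to_aaa.map"
'''

if __name__ == "__main__":
    for name in maps_used_by(SAMPLE):
        print(name)
```

This only recovers which maps a file includes, not which
conversion function uses which map, so it would still need manual
knowledge on top.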

In the end, it seems that I have no choice but to resurrect
map_checker. Exactly the same one no longer works, but a
map_dumper.c with almost the same structure will.

If no one objects to adding map_dumper.c and
gen_mapdumper_header.pl (tentative names, of course), I'll make a
patch to do that.
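
The idea of a "conversion dump" can be sketched as follows. This
is Python rather than the C of the proposed map_dumper.c, and it
is only an illustration under my own assumptions: the output
format, the function names and the sample entries are made up and
say nothing about what the actual tool would print. Each mapping
is written as sorted "source destination" lines so that dumps
from an old and a new build can be compared directly.

```python
def dump_mapping(mapping):
    """Return one 'source destination' hex line per entry,
    sorted by source code so that dumps are comparable."""
    return [f"{src:08x} {dst:08x}" for src, dst in sorted(mapping.items())]


def diff_dumps(old_lines, new_lines):
    """Return (lines only in old, lines only in new)."""
    old_set, new_set = set(old_lines), set(new_lines)
    return sorted(old_set - new_set), sorted(new_set - old_set)


if __name__ == "__main__":
    # Toy entries for illustration, not a real mapping table.
    old = dump_mapping({0x41: 0x41, 0x8140: 0xE38080})
    new = dump_mapping({0x41: 0x41, 0x8140: 0xE38081})
    removed, added = diff_dumps(old, new)
    print("only in old:", removed)
    print("only in new:", added)
```

An empty result on both sides would mean the two builds carry the
same mapping.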

Any suggestions?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center