Re: [HACKERS] Radix tree for character conversion - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: [HACKERS] Radix tree for character conversion
Date
Msg-id 20170317.141931.188239181.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: [HACKERS] Radix tree for character conversion  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: [HACKERS] Radix tree for character conversion  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Thank you for committing this.

At Mon, 13 Mar 2017 21:07:39 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<d5b70078-9f57-0f63-3462-1e564a57739f@iki.fi>
> On 03/13/2017 08:53 PM, Tom Lane wrote:
> > Heikki Linnakangas <hlinnaka@iki.fi> writes:
> >> It would be nice to run the map_checker tool one more time, though, to
> >> verify that the mappings match those from PostgreSQL 9.6.
> >
> > +1
> >
> >> Just to be sure, and after that the map checker can go to the dustbin.
> >
> > Hm, maybe we should keep it around for the next time somebody has a
> > bright
> > idea in this area?
> 
> The map checker compares old-style maps with the new radix maps. The
> next time 'round, we'll need something that compares the radix maps
> with the next great thing. Not sure how easy it would be to adapt.
> 
> Hmm. A somewhat different approach might be more suitable for testing
> across versions, though. We could modify the perl scripts slightly to
> print out SQL statements that exercise every mapping. For every
> supported conversion, the SQL script could:
> 
> 1. create a database in the source encoding.
> 2. set client_encoding='<target encoding>'
> 3. SELECT a string that contains every character in the source
> encoding.

There are many encodings that can be client-encoding but cannot
be database-encoding. And some encodings such as UTF-8 has
several one-way conversion. If we do something like this, it
would be as the following.

1. Encoding test
1-1. create a database in UTF-8
1-2. set client_encoding='<source encoding>'
1-3. INSERT all characters defined in the source encoding.
1-4. set client_encoding='UTF-8'
1-5. SELECT a string that contains every character in UTF-8.
2. Decoding test

.... sucks!


I would like to use convert() function. It can be a large
PL/PgSQL function or a series of "SELECT convert(...)"s. The
latter is doable on-the-fly (by not generating/storing the whole
script).

| -- Test for SJIS->UTF-8 conversion
| ...
| SELECT convert('\0000', 'SJIS', 'UTF-8'); -- results in error
| ...
| SELECT convert('\897e', 'SJIS', 'UTF-8');

> You could then run those SQL statements against old and new server
> version, and verify that you get the same results.

Including the result files in the repository will make this easy
but unacceptably bloats. Put mb/Unicode/README.sanity_check?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [HACKERS] Speedup twophase transactions
Next
From: Kyotaro HORIGUCHI
Date:
Subject: Re: [HACKERS] Protect syscache from bloating with negative cache entries