Re: Perform COPY FROM encoding conversions in larger chunks - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Perform COPY FROM encoding conversions in larger chunks
Date
Msg-id b9e3167f-f84b-7aa4-5738-be578a4db924@iki.fi
Whole thread Raw
In response to Perform COPY FROM encoding conversions in larger chunks  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
One of the patches in this patch set is worth calling out separately: 
0003-Add-direct-conversion-routines-between-EUC_TW-and-Bi.patch. Per 
commit message:

     Add direct conversion routines between EUC_TW and Big5.

     Conversions between EUC_TW and Big5 were previously implemented by
     converting the whole input to MIC first, and then from MIC to the
     target encoding. Implement functions to convert directly between the
     two.

     The reason to do this now is that the next patch will change the
     change the conversion function signature so that if the input is
     invalid, we convert as much as we can and return the number of bytes
     successfully converted. That's not possible if we use an intermediary
     format, because if an error happens in the intermediary -> final
     conversion, we lose track of the location of the invalid character in
     the original input. Avoiding the intermediate step should be faster
     too.

This patch is fairly independent of the others. It could be reviewed and 
applied separately.


In order to verify that the new code is correct, I wrote some helper 
plpgsql functions to generate all valid EUC_TW and Big5 byte sequences 
that encode one character, and tested converting each of them. Then I 
compared the the results with unpatched server, to check that the new 
code performs the same conversion. This is perhaps overkill, but since 
its pretty straightforward to enumerate all the input characters, might 
as well do it.

For the sake of completeness, I wrote similar helpers for all the other 
encodings and conversions. Except for UTF-8, there are too many formally 
valid codepoints for that to feasible. This does test round-trip 
conversions of all codepoints from all the other encodings to UTF-8 and 
back, though, so there's pretty good coverage of UTF-8 too.

This test suite is probably too large to add to the source tree, but for 
the sake of the archives, I'm attaching it here. The first patch adds 
the test suite, including the expected output of each conversion. The 
second patch contains expected output changes for the above patch to add 
direct conversions between EUC_TW and Big5. It affected the error 
messages for some byte sequences that cannot be converted. For example, 
on unpatched  master:

postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR:  character with byte sequence 0x95 0xfd 0xcc in encoding 
"MULE_INTERNAL" has no equivalent in encoding "BIG5"

With the patch:

postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR:  character with byte sequence 0xfd 0xcc in encoding "EUC_TW" has 
no equivalent in encoding "BIG5"

The old message talked about "MULE_INTERNAL" which exposes the 
implementation detail that we used it as an intermediate in the 
conversion. That can be confusing to a user, the new message makes more 
sense. So that's also nice.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: multi-install PostgresNode
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] [PATCH] Generic type subscripting