Re: Convert from hex to string - Mailing list pgsql-general

From Adrian Klaver
Subject Re: Convert from hex to string
Date
Msg-id 56561B10.3050208@aklaver.com
Whole thread Raw
In response to Re: Convert from hex to string  (Francisco Olarte <folarte@peoplecall.com>)
Responses Re: Convert from hex to string
List pgsql-general
On 11/25/2015 10:49 AM, Francisco Olarte wrote:
> Mail pingpong day. ;-)
>
> On Wed, Nov 25, 2015 at 7:27 PM, Adrian Klaver
> <adrian.klaver@aklaver.com> wrote:
>
>> postgres@latin1_db=# \encoding
>> UTF8
>
> This does not matter, as you are sending/receiving hex data, and the
> encoding done when sending query results to you gets reverted when you
> send queries back to the server.
>
>
>> postgres@latin1_db=# select convert_from('\xe9', 'latin1');
>>   convert_from
>> --------------
>>   é
>> (1 row)
>
> This DOES matter, you've had to change the query for it to work, you
> had to look up ( or know beforehand ) the database encoding and change
> it accordingly. . I mean, when you do encode('text',....) what you are
> really asking for is encode( implicit_text_to_bytea_conversion(text),
> ....), so you save one step, but them you have to change your query to
> the 'latin1' encoding you discovered. This is what I meant, you had to
> look at the database properties. But if you do an explicit
> convert_form with any encoding capable of representing all your data,
> like utf-8 which can represent anything,  the database encoding does
> not matter. And it should not, proper code should work with any
> database encoding. Even more, I can do encode(convert_to(utf8)) in a
> latin1 database conecting with any encoding, then send the hex to
> convert_from(decode(),utf8) to an ebcdic database use another encoding
> ( Of course I may need to transcode the hex, but not to the ebcdic,
> but to the client encoding been used in teh second case ), ant it
> would work as long as all the characters In the source data are
> representable in the destination database ( if they are not you are
> out luck in any scheme you may think off ).


I will grant you that working with encodings is like working with
timestamps, explicit is better. The thing I am having a problem with is
how not knowing the context of the bytea value is different in the
implicit vs explicit case:

create table hex_test(hex_fld bytea);

aklaver@latin1_db=> select convert_to('é', 'utf8');
  convert_to
------------
  \xc3a9
(1 row)

aklaver@latin1_db=> insert into hex_test values ('\xc3a9');


aklaver@latin1_db=> select encode('é', 'hex');
  encode
--------
  e9
(1 row)

aklaver@latin1_db=> select decode('e9', 'hex');
  decode
--------
  \xe9
(1 row)


aklaver@latin1_db=> insert into hex_test values ('\xe9');

aklaver@latin1_db=> select * from hex_test ;
  hex_fld
---------
  \xc3a9
  \xe9
(2 rows)


aklaver@latin1_db=> select convert_from(hex_fld, 'latin1') from hex_test ;
  convert_from
--------------
  é
  é
(2 rows)

aklaver@latin1_db=> select convert_from(hex_fld, 'utf8') from hex_test ;
ERROR:  invalid byte sequence for encoding "UTF8": 0xe9

Granted the above is contrived and bound to fail, but the point is you
need to know what created the bytea however it got there. Now if you are
in charge of both ends of the process, then the above is your own fault.
Otherwise, you are down to detective work on what encoding was used
whether it was implicit or explicit. As the OP was working in a single
context I am not seeing the issue in making use of that context to do
the heavy lifting. For the use cases that you show I agree that a
defined convert_to/encode/decode/convert_from chain is a best practice
and something I had not really thought out, so thanks.

>
> You cannot encode generically an string to hex unless you define an
> encoding. How do you encode '1', "31" or "F1" ? or maybe "0031" or
> "3100"? You can do a quick hack, but normally what you want is first
> to encode a sequence of characters to a sequence of bytes and then
> hex-encode that, as nearly everybody uses the same conversion for
> hex-encoding a byte sequence. This means you can have a '0' in a
> ebcdic database, transform it to to [0x30] byte array, encode this as
> "30" and then transform the later to 00 30 00 10 because you are using
> UTF16-BE wire encoding. Encoding is tricky enough without relying on
> implicit convertion or on a character being the same as a byte.
>
> Francisco Olarte.
>


--
Adrian Klaver
adrian.klaver@aklaver.com


pgsql-general by date:

Previous
From: John R Pierce
Date:
Subject: Re: Error creating Tablespace on Windows
Next
From: Adrian Klaver
Date:
Subject: Re: Taking lot time