Thread: How can I return a UTF8 string from a hex representation in a latin9 database?

How can I return a UTF8 string from a hex representation in a latin9 database?

From
"PGUser2020"
Date:
Hello. Hopefully I can explain myself

I have a database for a legacy application that requires an 8 bit database (i.e. the application itself won't function
ona UTF8 database). 

Looking at ways to extend the functionality to be able to handle a few specified fields in Unicode.

Had the idea to store a UTF8 string as either hex pairs or Base64 inside a VARCHAR field, which is fine. I can do that.

What needs to happen though, is to build a view, that will return the decoded hex (or b64) as a UTF8 string to a client
whichhas specified client encoding UTF8. 

I've tried various combinations of convert_from, and  convert_to, and convert, but I just can't seem to get it to
returnthe string a UTF8 select to the client. 

So if I have this data:

select * from mytable;
                 mycolumn
------------------------------------------
 ceb120ceb220ceb320ceb420ceb520cf83cf84

Then:

select convert_from(decode(mycolumn, 'hex')::bytea, 'utf-8') from mytable where usr='BATCH';

ERROR:  character with byte sequence 0xce 0xb1 in encoding "UTF8" has no equivalent in encoding "LATIN9"

So the database encoding is still relevant , this is expected by the description of convert_from in the documentation
ofcourse. 

Is there some combination of functions I can use to have a client select this column from this table in a LATIN9
databaseand get a UTF8 string back?   Any thoughts appreciated, thank you. 


Re: How can I return a UTF8 string from a hex representation in a latin9 database?

From
Erik Wienhold
Date:
On 2023-11-19 21:37 +0100, PGUser2020 wrote:
> I have a database for a legacy application that requires an 8 bit
> database (i.e. the application itself won't function on a UTF8
> database).

Technically speaking, UTF-8 is an 8-bit encoding.  But I guess that
application would then show mojibake if UTF-8 were stored.

> Looking at ways to extend the functionality to be able to handle a few
> specified fields in Unicode.
> 
> Had the idea to store a UTF8 string as either hex pairs or Base64
> inside a VARCHAR field, which is fine. I can do that.

Do you have to use existing Latin-9 text columns to store UTF-8?  If not
then I'd go with bytea instead of text (varchar) if possible and also
supported by your client.  Otherwise it may be difficult to distinguish
between "normal" Latin-9 text and the hex- or base64-encoded UTF-8.
Although bytea could also store anything, not just UTF-8, so you'd have
to deal with invalid data anyway.

> What needs to happen though, is to build a view, that will return the
> decoded hex (or b64) as a UTF8 string to a client which has specified
> client encoding UTF8.

Is the same client sending and reading that data?  If yes, why can't the
client do the hex-encoding of the UTF-8 string and only send/read those
encoded strings so that database won't event see UTF-8?  Why must the
database be involved in this custom encoding scheme instead of just
storing BLOBs (either as bytea or some encoded text)?

> I've tried various combinations of convert_from, and  convert_to, and
> convert, but I just can't seem to get it to return the string a UTF8
> select to the client.
> 
> So if I have this data:
> 
> select * from mytable; mycolumn
> ------------------------------------------
> ceb120ceb220ceb320ceb420ceb520cf83cf84  
> 
> Then:
> 
> select convert_from(decode(mycolumn, 'hex')::bytea, 'utf-8') from mytable where usr='BATCH';
> 
> ERROR:  character with byte sequence 0xce 0xb1 in encoding "UTF8" has no equivalent in encoding "LATIN9"
> 
> So the database encoding is still relevant , this is expected by the
> description of convert_from in the documentation of course.
> 
> Is there some combination of functions I can use to have a client
> select this column from this table in a LATIN9 database and get a UTF8
> string back?

The client can disable encoding conversion by setting client_encoding to
sql_ascii:

    latin9_test=# show server_encoding;
     server_encoding 
    -----------------
     LATIN9
    (1 row)
    
    latin9_test=# set client_encoding to sql_ascii;
    SET
    latin9_test=# show client_encoding;
     client_encoding 
    -----------------
     SQL_ASCII
    (1 row)
    
    latin9_test=# select convert_from(decode('ceb120ceb220ceb320ceb420ceb520cf83cf84', 'hex'), 'sql_ascii');
        convert_from     
    ---------------------
     α β γ δ ε στ
    (1 row)

Maybe that's also an option for your client.

-- 
Erik



On 2023-11-20 02:59 +00:00 GMT, "Erik Wienhold" <ewie@ewie.name> wrote:
> On 2023-11-19 21:37 +0100, PGUser2020 wrote:

>
> Technically speaking, UTF-8 is an 8-bit encoding.  But I guess that
> application would then show mojibake if UTF-8 were stored.
>

Yes sorry, I should have said single byte rather than 8 bit. There must be no possibility that a single character
occupiesmore than one byte as the (e.g.) varchar(10) and char(5) fields overflow lengths otherwise. 

>
> Do you have to use existing Latin-9 text columns to store UTF-8?  If not
> then I'd go with bytea instead of text (varchar) if possible and also
> supported by your client.  Otherwise it may be difficult to distinguish
> between "normal" Latin-9 text and the hex- or base64-encoded UTF-8.
> Although bytea could also store anything, not just UTF-8, so you'd have
> to deal with invalid data anyway.
>

I do have to use existing columns yes, and they are varchar latin9 columns.

> Is the same client sending and reading that data?  If yes, why can't the
> client do the hex-encoding of the UTF-8 string and only send/read those
> encoded strings so that database won't event see UTF-8?  Why must the
> database be involved in this custom encoding scheme instead of just
> storing BLOBs (either as bytea or some encoded text)?
>

So one of the external clients applications which is interacting with this database will do just that -- it will make a
hexstring from its utf8 input and store that in a varchar 

>
> The client can disable encoding conversion by setting client_encoding to
> sql_ascii:
>
>     latin9_test=# show server_encoding;
>      server_encoding
>     -----------------
>      LATIN9
>     (1 row)
>
>     latin9_test=# set client_encoding to sql_ascii;
>     SET
>     latin9_test=# show client_encoding;
>      client_encoding
>     -----------------
>      SQL_ASCII
>     (1 row)
>
>     latin9_test=# select convert_from(decode('ceb120ceb220ceb320ceb420ceb520cf83cf84', 'hex'), 'sql_ascii');
>         convert_from
>     ---------------------
>      α β γ δ ε στ
>     (1 row)
>
> Maybe that's also an option for your client.
>

It is very useful and exactly what I was looking for thanks.

This technique should allow me to create a login, mask a table with a view containing this decode, and use search_path
toget the view returned in preference to the base table.