Thread: LATIN9 - hex in varchar after convert

LATIN9 - hex in varchar after convert

From
"Steve Tucknott (TuSol)"
Date:
Postgresql 11.7 - client/server encoding LATIN9

Good morning,
I have made a couple of stupid mistakes.

I have a table with a varchar(5000) that contains general text. The table is typically maintained via a GUI, but on this occasion I received a spreadsheet with data and loaded it - via copy - from a csv extracted from that. The data looked fine in psql, but when looking at the data in the GUI, characters such as single quote marks (') appeared as a series of special characters. I assumed that the spreadsheet then had some different encoding (UTF8?) and that I then needed to 'translate' the characters.
I saw there was a convert_to/from function in postgresql, and thought I'd try that to see what it would do, but (the two mistakes) I didn't start a transaction and where I had intended to run the command on one row, I had a brain freeze and ran the update across the table.
The command was:
update gamespubquiz set question = convert_to(question,'LATIN9');

What I now appear to have in the question column is the hex equivalent of what was there originally.

Two questions:
1) How can I convert the hext back to 'text'?
2) How can I convert the (what appears to be) incorrect coding of the single quote (')?


Thanks in advance,
Steve T

Re: LATIN9 - hex in varchar after convert

From
"Steve Tucknott (TuSol)"
Date:



On Sat, 2020-04-25 at 10:41 +0100, Steve Tucknott (TuSol) wrote:
Postgresql 11.7 - client/server encoding LATIN9

Good morning,
I have made a couple of stupid mistakes.

I have a table with a varchar(5000) that contains general text. The table is typically maintained via a GUI, but on this occasion I received a spreadsheet with data and loaded it - via copy - from a csv extracted from that. The data looked fine in psql, but when looking at the data in the GUI, characters such as single quote marks (') appeared as a series of special characters. I assumed that the spreadsheet then had some different encoding (UTF8?) and that I then needed to 'translate' the characters.
I saw there was a convert_to/from function in postgresql, and thought I'd try that to see what it would do, but (the two mistakes) I didn't start a transaction and where I had intended to run the command on one row, I had a brain freeze and ran the update across the table.
The command was:
update gamespubquiz set question = convert_to(question,'LATIN9');

What I now appear to have in the question column is the hex equivalent of what was there originally.

Two questions:
1) How can I convert the hext back to 'text'?
2) How can I convert the (what appears to be) incorrect coding of the single quote (')?


Thanks in advance,
Steve T

...for example ....as taken from psql:

Original text: ‘With the lights out, it’s less dangerous. Here we are now, entertain us.’

After convert: \xe280985769746820746865206c6967687473206f75742c206974e2809973206c6573732064616e6765726f75732e204865726520776520617265206e6f772c20656e7465727461696e2075732ee28099

PS apologies for any inconvenience for posting with 'read receipt' request turned on earlier.

Re: LATIN9 - hex in varchar after convert

From
Tom Lane
Date:
"Steve Tucknott (TuSol)" <steve@tusol.co.uk> writes:
> I saw there was a convert_to/from function in postgresql,  and thought
> I'd try that to see what it would do, but (the two mistakes) I didn't
> start a transaction and where I had intended to run the command on one
> row, I had a brain freeze and ran the update across the table.
> The command was:
> update gamespubquiz set question = convert_to(question,'LATIN9');

Ooops.

> Two questions:
> 1) How can I convert the hext back to 'text'?
> 2) How can I convert the (what appears to be) incorrect coding of the
> single quote (')?

Don't have an answer for (2) offhand, but you can undo (1).
What bit you is that there's an implicit cast from text to
bytea, but not vice versa.

regression=# create table foo (f1 text);
CREATE TABLE
regression=# insert into foo values ('this is a test');
INSERT 0 1
regression=# update foo set f1 = convert_to(f1,'LATIN9');
UPDATE 1
regression=# table foo;
               f1
--------------------------------
 \x7468697320697320612074657374
(1 row)

You can't just apply convert_from:

regression=# update foo set f1 = convert_from(f1,'LATIN9');
ERROR:  function convert_from(text, unknown) does not exist
LINE 1: update foo set f1 = convert_from(f1,'LATIN9');
                            ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

but forcing it with a cast works:

regression=# update foo set f1 = convert_from(f1::bytea,'LATIN9');
UPDATE 1
regression=# table foo;
       f1
----------------
 this is a test
(1 row)

As for (2), maybe replace() would help you.

            regards, tom lane



Re: LATIN9 - hex in varchar after convert

From
"Steve Tucknott (TuSol)"
Date:
On Sat, 2020-04-25 at 10:35 -0400, Tom Lane wrote:
"Steve Tucknott (TuSol)" <
steve@tusol.co.uk
> writes:
I saw there was a convert_to/from function in postgresql,  and thought
I'd try that to see what it would do, but (the two mistakes) I didn't
start a transaction and where I had intended to run the command on one
row, I had a brain freeze and ran the update across the table.
The command was:
update gamespubquiz set question = convert_to(question,'LATIN9');

Ooops.

Two questions:
1) How can I convert the hext back to 'text'?

regression=# update foo set f1 = convert_from(f1::bytea,'LATIN9');
UPDATE 1
regression=# table foo;
       f1       
----------------
 this is a test
(1 row)

As for (2), maybe replace() would help you.

			regards, tom lane


Tom
Thanks. I tried the convert from and got the unknown function error and then tried again and applied the cast, but I applied it after the bracket ie
convert_from(question,'LATIN9')::bytea. I scare myself at times - but my excuse is that I haven't used a database in earnest for a fair few years. Poor excuse I know.

I'll clone the table with the problem before playing with it again to try to fix the quote encoding issue - and I'll make sure I'm in a transaction anyway so I can roll back the attempts.

Thanks again.

Re: LATIN9 - hex in varchar after convert

From
Laurenz Albe
Date:
On Sat, 2020-04-25 at 10:41 +0100, Steve Tucknott (TuSol) wrote:
> I have a table with a varchar(5000) that contains general text. The table is typically
> maintained via a GUI, but on this occasion I received a spreadsheet with data and
> loaded it - via copy - from a csv extracted from that. The data looked fine in psql,
> but when looking at the data in the GUI, characters such as single quote marks (')
> appeared as a series of special characters. I assumed that the spreadsheet then had
> some different encoding (UTF8?) and that I then needed to 'translate' the characters.

Very likely, the characters were not really single quotes, but "curly quotes"
(UNICODE 201C and 201E) characters.

One of the following scenarios must have taken place:

1. The file was encoded in UTF-8, but when you copied the data in, the encoding
   you specified (or had by default) was a single-gyte encoding like LATIN9.

   The curly quotes are more than one byte in UTF-8, but each byte was interpreted as
   a LATIN9 character.

   The solution would be to specify ENCODING 'UTF8' with COPY.

2. The characters are actually fine in the database, and you loaded them correctly,
   and your database client encoding is UTF8, but your terminal is in LATIN9.

   The characters were displayed correctly, but your terminal interpreted each
   byte as a character.

To determine which was the case, look what bytes are in the database:

SELECT badcol, badcol::bytea FROM tab WHERE id = 12345;

Yours,
Laurenz Albe

-- 
Cybertec | https://www.cybertec-postgresql.com




Re: LATIN9 - hex in varchar after convert

From
"Steve Tucknott (TuSol)"
Date:
aaa
On Sun, 2020-04-26 at 04:24 +0200, Laurenz Albe wrote:
On Sat, 2020-04-25 at 10:41 +0100, Steve Tucknott (TuSol) wrote:
I have a table with a varchar(5000) that contains general text. The table is typically
maintained via a GUI, but on this occasion I received a spreadsheet with data and
loaded it - via copy - from a csv extracted from that. The data looked fine in psql,
but when looking at the data in the GUI, characters such as single quote marks (')
appeared as a series of special characters. I assumed that the spreadsheet then had
some different encoding (UTF8?) and that I then needed to 'translate' the characters.

Very likely, the characters were not really single quotes, but "curly quotes"
(UNICODE 201C and 201E) characters.

One of the following scenarios must have taken place:

1. The file was encoded in UTF-8, but when you copied the data in, the encoding
   you specified (or had by default) was a single-gyte encoding like LATIN9.

   The curly quotes are more than one byte in UTF-8, but each byte was interpreted as
   a LATIN9 character.

   The solution would be to specify ENCODING 'UTF8' with COPY.

2. The characters are actually fine in the database, and you loaded them correctly,
   and your database client encoding is UTF8, but your terminal is in LATIN9.

   The characters were displayed correctly, but your terminal interpreted each
   byte as a character.

To determine which was the case, look what bytes are in the database:

SELECT badcol, badcol::bytea FROM tab WHERE id = 12345;

Yours,
Laurenz Albe

Laurenz,
Thank you for the reply. After correcting my 'convert' mistake (thanks again to Tom), I decided to drop the rows loaded from the spreadsheet and correct the source data and reload - only as Libreoffice has a simple 'find and replace' - so I could copy the 'naff' quote character into the find field and then replace with a standard quote. Looks fine so far!

Thank you all for your help,
Steve