Thread: Invalid byte sequence when importing Oracle BLOB

Invalid byte sequence when importing Oracle BLOB

From

Ron

Date:

26 April 2021, 11:49:18

Hi,

The source is an Oracle 12 db with this encoding:
NLS_LANG = AMERICAN_AMERICA.AL32UTF8
NLS_NCHAR = AL32UTF8

The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being 
loaded through COPY commands generated by ora2pg.

The source table has a BLOB column (I think they are scanned images) which 
I'm loading into a Postgresql bytea column.

Seven times out of about 60M rows, I get this error:
Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

Is there anything I can change on the Postgresql side to make these records 
import correctly?

-- 
Angular momentum makes the world go 'round.

Re: Invalid byte sequence when importing Oracle BLOB

From

"Peter J. Holzer"

Date:

26 April 2021, 12:32:59

On 2021-04-26 06:49:18 -0500, Ron wrote:
> The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
> loaded through COPY commands generated by ora2pg.
>
> The source table has a BLOB column (I think they are scanned images) which
> I'm loading into a Postgresql bytea column.
>
> Seven times out of about 60M rows, I get this error:
> Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

Decoding UTF8 doesn't make sense for a bytea column. How does that data
look like in the file generated by ora2pg?

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

signature.asc

Re: Invalid byte sequence when importing Oracle BLOB

From

Ron

Date:

26 April 2021, 12:45:26

On 4/26/21 7:32 AM, Peter J. Holzer wrote:
> On 2021-04-26 06:49:18 -0500, Ron wrote:
>> The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
>> loaded through COPY commands generated by ora2pg.
>>
>> The source table has a BLOB column (I think they are scanned images) which
>> I'm loading into a Postgresql bytea column.
>>
>> Seven times out of about 60M rows, I get this error:
>> Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf
> Decoding UTF8 doesn't make sense for a bytea column. How does that data
> look like in the file generated by ora2pg?

I thought it was weird, too, but COPY has to read text, no?

Anyway, here are the first 8 lines (beware line wrapping) of the input file:

SET client_encoding TO 'UTF8';
SET synchronous_commit TO off;
SET search_path = strans,public;

TRUNCATE TABLE mv_response_attachment_old;

COPY mv_response_attachment_old 

(response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date)

FROM STDIN;
1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f...

It looks like a bog-standard COPY command, with which I've imported a lot of 
data.

-- 
Angular momentum makes the world go 'round.

Re: Invalid byte sequence when importing Oracle BLOB

From

"Peter J. Holzer"

Date:

26 April 2021, 12:58:34

On 2021-04-26 07:45:26 -0500, Ron wrote:
> On 4/26/21 7:32 AM, Peter J. Holzer wrote:
> > On 2021-04-26 06:49:18 -0500, Ron wrote:
> > > The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
> > > loaded through COPY commands generated by ora2pg.
> > >
> > > The source table has a BLOB column (I think they are scanned images) which
> > > I'm loading into a Postgresql bytea column.
> > >
> > > Seven times out of about 60M rows, I get this error:
> > > Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf
> > Decoding UTF8 doesn't make sense for a bytea column. How does that data
> > look like in the file generated by ora2pg?
>
> I thought it was weird, too, but COPY has to read text, no?

Yes, but data for a bytea column would normally be encoded in hex or
something like that ...


> COPY mv_response_attachment_old
(response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date)
> FROM STDIN;
> 1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f...

... Yes, like this. There are only hex digits (plus \ and x) in the
column, nothing which would require decoding UTF-8.

My guess is that the error is actually in the data for another column.

I'd try to identify the broken records and check whether they contain
some other strange content.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

signature.asc

Re: Invalid byte sequence when importing Oracle BLOB

From

Matthias Apitz

Date:

26 April 2021, 13:14:40

El día lunes, abril 26, 2021 a las 06:49:18a. m. -0500, Ron escribió:

> Hi,
> 
> The source is an Oracle 12 db with this encoding:
> NLS_LANG = AMERICAN_AMERICA.AL32UTF8
> NLS_NCHAR = AL32UTF8
> 
> The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
> loaded through COPY commands generated by ora2pg.
> 
> The source table has a BLOB column (I think they are scanned images) which
> I'm loading into a Postgresql bytea column.
> 
> Seven times out of about 60M rows, I get this error:
> Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf

This error can't be caused by a BLOB or bytea column. Only char or vchar
columns can cause (and did cause) such errors in our Oracle/Sybase to
PostgreSQL migrations.

    matthias
-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

Re: Invalid byte sequence when importing Oracle BLOB

From

Ron

Date:

26 April 2021, 13:43:20

On 4/26/21 7:58 AM, Peter J. Holzer wrote:
> On 2021-04-26 07:45:26 -0500, Ron wrote:
>> On 4/26/21 7:32 AM, Peter J. Holzer wrote:
>>> On 2021-04-26 06:49:18 -0500, Ron wrote:
>>>> The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being
>>>> loaded through COPY commands generated by ora2pg.
>>>>
>>>> The source table has a BLOB column (I think they are scanned images) which
>>>> I'm loading into a Postgresql bytea column.
>>>>
>>>> Seven times out of about 60M rows, I get this error:
>>>> Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf
>>> Decoding UTF8 doesn't make sense for a bytea column. How does that data
>>> look like in the file generated by ora2pg?
>> I thought it was weird, too, but COPY has to read text, no?
> Yes, but data for a bytea column would normally be encoded in hex or
> something like that ...
>
>
>> COPY mv_response_attachment_old
(response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date)
>> FROM STDIN;
>> 1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f...
> ... Yes, like this. There are only hex digits (plus \ and x) in the
> column, nothing which would require decoding UTF-8.
>
> My guess is that the error is actually in the data for another column.
>
> I'd try to identify the broken records and check whether they contain
> some other strange content.

That's a good idea.  Thanks.


-- 
Angular momentum makes the world go 'round.

Re: Invalid byte sequence when importing Oracle BLOB

From

Laurenz Albe

Date:

26 April 2021, 15:19:12

On Mon, 2021-04-26 at 06:49 -0500, Ron wrote:
> The source is an Oracle 12 db with this encoding:
> NLS_LANG = AMERICAN_AMERICA.AL32UTF8
> NLS_NCHAR = AL32UTF8
> 
> The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being 
> loaded through COPY commands generated by ora2pg.
> 
> The source table has a BLOB column (I think they are scanned images) which 
> I'm loading into a Postgresql bytea column.
> 
> Seven times out of about 60M rows, I get this error:
> Psql:909242: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf
> 
> Is there anything I can change on the Postgresql side to make these records 
> import correctly?

The "bytea" column is probably a red herring - this error message should
be caused by a "text" or "varchar" or other string data column.

It is surprisingly easy to enter currupt strings into Oracle - just set
client encoding to the same value as server encoding, and it won't check
the integrity of your strings.

If that is your problem, you must identify and fix the data in Oracle.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com