Thread: Invalid byte sequence when importing Oracle BLOB
Hi, The source is an Oracle 12 db with this encoding: NLS_LANG = AMERICAN_AMERICA.AL32UTF8 NLS_NCHAR = AL32UTF8 The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being loaded through COPY commands generated by ora2pg. The source table has a BLOB column (I think they are scanned images) which I'm loading into a Postgresql bytea column. Seven times out of about 60M rows, I get this error: Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf Is there anything I can change on the Postgresql side to make these records import correctly? -- Angular momentum makes the world go 'round.
On 2021-04-26 06:49:18 -0500, Ron wrote: > The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being > loaded through COPY commands generated by ora2pg. > > The source table has a BLOB column (I think they are scanned images) which > I'm loading into a Postgresql bytea column. > > Seven times out of about 60M rows, I get this error: > Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf Decoding UTF8 doesn't make sense for a bytea column. How does that data look like in the file generated by ora2pg? hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
On 4/26/21 7:32 AM, Peter J. Holzer wrote: > On 2021-04-26 06:49:18 -0500, Ron wrote: >> The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being >> loaded through COPY commands generated by ora2pg. >> >> The source table has a BLOB column (I think they are scanned images) which >> I'm loading into a Postgresql bytea column. >> >> Seven times out of about 60M rows, I get this error: >> Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf > Decoding UTF8 doesn't make sense for a bytea column. How does that data > look like in the file generated by ora2pg? I thought it was weird, too, but COPY has to read text, no? Anyway, here are the first 8 lines (beware line wrapping) of the input file: SET client_encoding TO 'UTF8'; SET synchronous_commit TO off; SET search_path = strans,public; TRUNCATE TABLE mv_response_attachment_old; COPY mv_response_attachment_old (response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date) FROM STDIN; 1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f... It looks like a bog-standard COPY command, with which I've imported a lot of data. -- Angular momentum makes the world go 'round.
On 2021-04-26 07:45:26 -0500, Ron wrote: > On 4/26/21 7:32 AM, Peter J. Holzer wrote: > > On 2021-04-26 06:49:18 -0500, Ron wrote: > > > The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being > > > loaded through COPY commands generated by ora2pg. > > > > > > The source table has a BLOB column (I think they are scanned images) which > > > I'm loading into a Postgresql bytea column. > > > > > > Seven times out of about 60M rows, I get this error: > > > Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf > > Decoding UTF8 doesn't make sense for a bytea column. How does that data > > look like in the file generated by ora2pg? > > I thought it was weird, too, but COPY has to read text, no? Yes, but data for a bytea column would normally be encoded in hex or something like that ... > COPY mv_response_attachment_old (response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date) > FROM STDIN; > 1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f... ... Yes, like this. There are only hex digits (plus \ and x) in the column, nothing which would require decoding UTF-8. My guess is that the error is actually in the data for another column. I'd try to identify the broken records and check whether they contain some other strange content. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
El día lunes, abril 26, 2021 a las 06:49:18a. m. -0500, Ron escribió: > Hi, > > The source is an Oracle 12 db with this encoding: > NLS_LANG = AMERICAN_AMERICA.AL32UTF8 > NLS_NCHAR = AL32UTF8 > > The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being > loaded through COPY commands generated by ora2pg. > > The source table has a BLOB column (I think they are scanned images) which > I'm loading into a Postgresql bytea column. > > Seven times out of about 60M rows, I get this error: > Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf This error can't be caused by a BLOB or bytea column. Only char or vchar columns can cause (and did cause) such errors in our Oracle/Sybase to PostgreSQL migrations. matthias -- Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
On 4/26/21 7:58 AM, Peter J. Holzer wrote: > On 2021-04-26 07:45:26 -0500, Ron wrote: >> On 4/26/21 7:32 AM, Peter J. Holzer wrote: >>> On 2021-04-26 06:49:18 -0500, Ron wrote: >>>> The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being >>>> loaded through COPY commands generated by ora2pg. >>>> >>>> The source table has a BLOB column (I think they are scanned images) which >>>> I'm loading into a Postgresql bytea column. >>>> >>>> Seven times out of about 60M rows, I get this error: >>>> Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf >>> Decoding UTF8 doesn't make sense for a bytea column. How does that data >>> look like in the file generated by ora2pg? >> I thought it was weird, too, but COPY has to read text, no? > Yes, but data for a bytea column would normally be encoded in hex or > something like that ... > > >> COPY mv_response_attachment_old (response_attachement_id,binary_data,employer_response_id,attachment_id_code,file_type,attachment_desc,attachment_size,file_name,partition_date,prior_incident_id,part_date) >> FROM STDIN; >> 1583201 \\x255044462d312e330d25e2e3cfd30d0a31362030206f... > ... Yes, like this. There are only hex digits (plus \ and x) in the > column, nothing which would require decoding UTF-8. > > My guess is that the error is actually in the data for another column. > > I'd try to identify the broken records and check whether they contain > some other strange content. That's a good idea. Thanks. -- Angular momentum makes the world go 'round.
On Mon, 2021-04-26 at 06:49 -0500, Ron wrote: > The source is an Oracle 12 db with this encoding: > NLS_LANG = AMERICAN_AMERICA.AL32UTF8 > NLS_NCHAR = AL32UTF8 > > The destination is an (RDS) Postgresql 12.5 with encoding UTF8, and is being > loaded through COPY commands generated by ora2pg. > > The source table has a BLOB column (I think they are scanned images) which > I'm loading into a Postgresql bytea column. > > Seven times out of about 60M rows, I get this error: > Psql:909242: ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xaf 0xbf > > Is there anything I can change on the Postgresql side to make these records > import correctly? The "bytea" column is probably a red herring - this error message should be caused by a "text" or "varchar" or other string data column. It is surprisingly easy to enter currupt strings into Oracle - just set client encoding to the same value as server encoding, and it won't check the integrity of your strings. If that is your problem, you must identify and fix the data in Oracle. Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com