Thread: finding rows with invalid characters

finding rows with invalid characters

From
Sim Zacks
Date:
I am using PG 8.2.17 with UTF8 encoding.
"PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.1
(Gentoo 4.1.1)"

One of my tables somehow has invalid characters in it:
> ERROR:  invalid byte sequence for encoding "UTF8": 0xa9
> HINT:  This error can also happen if the byte sequence does not match
> the encoding expected by the server, which is controlled by
> "client_encoding".
I have already manually found a number of the bad rows by running
queries with text functions (upper) between groups of IDs until I found
the specific bad row.

1) Is there a quicker way to get a list of all rows with invalid characters?
2) Shouldn't the database prevent these rows from being entered in the
first place?
3) I have backups of this database (using -Fc) and I noticed that on
restore, this table is not restored because of this error. Is there a
way to fix the existing backups, or tell the restore to ignore bad rows
instead of erroring out the whole table?

Thanks
Sim

Re: finding rows with invalid characters

From
Dmitriy Igrishin
Date:
Hey Sim,

Maybe this helps:
http://blog.tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



--
// Dmitriy.


Re: finding rows with invalid characters

From
Sim Zacks
Date:

Thanks for your help.

I think a trigger will actually be the easiest. The way I can tell if there is invalid data is simply to do an upper(text); if the data is invalid, it fails.

I dumped the fixed database. Now I have a year's worth of backups from which I can't restore this specific table. It most probably will never matter. Every once in a while I get asked to check what was in the db against what is in there now, but this table will probably never be audited.


Sim



On 11/23/2010 10:33 AM, Dmitriy Igrishin wrote:

Hey Sim,

2010/11/23 Sim Zacks <sim@compulab.co.il>
On 11/21/2010 05:55 PM, Dmitriy Igrishin wrote:
Hey Sim,

Maybe this helps:
http://blog.tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html
That worked to find some of them. I still needed to find a bunch of others manually, such as 0xa0 and 0xd725, which weren't found with that function. I finally figured out that
select * from emaildetails where emailbody like '%\xa0%' and
select * from emaildetails where emailbody like '%\xd7\x25%'
would show me all those rows.
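A more general way to spot such rows, rather than hunting for each byte pattern with LIKE, is to fetch the values as raw bytes on the client side and attempt a strict UTF-8 decode. A minimal Python sketch (the row data here is made up; only the 0xa9 and 0xd7 0x25 byte sequences come from this thread):

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8 (a strict decode succeeds)."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# Toy stand-in for emailbody values fetched as raw bytes.
rows = {1: b"hello", 2: b"caf\xa9", 3: b"\xd7\x25 off"}

# Rows that fail a strict decode are the corrupted ones.
bad_ids = [rid for rid, body in rows.items() if not is_valid_utf8(body)]
```

This flags 0xa9 (a stray continuation byte) and 0xd7 0x25 (a two-byte lead with a bad continuation) alike, with no need to know the offending patterns in advance.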

My 2 big problems now are:

A) How to make sure that these characters are not inserted in the future. The database should prevent them from being inserted.
Consider using domains (base types with constraints) instead of plain generic types, and use regular expressions in the constraints. Alternatively, you can use triggers for more complex validation, but in your case I think domains will work well.
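The kind of whitelist such a domain CHECK constraint (or trigger) would enforce can be sketched outside SQL as well. Here it is in Python; the allowed set (printable ASCII plus common whitespace) is purely illustrative, not something the thread specifies:

```python
import re

# Illustrative whitelist: printable ASCII plus tab, CR, and LF.
ALLOWED = re.compile(r"[\x20-\x7e\t\r\n]*")

def passes_check(value: str) -> bool:
    """Mimic a domain CHECK constraint built on a character whitelist."""
    return ALLOWED.fullmatch(value) is not None
```

Anything outside the whitelist, such as the 0xa9 (copyright sign) byte from the error message, would be rejected before it ever reaches the table.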


B) How to fix the backups that I have so that I can restore them. As I mentioned, they are being taken with -Fc
Oops. Why not dump the fixed database?

Sim






--
// Dmitriy.



Re: finding rows with invalid characters

From
Dmitriy Igrishin
Date:
Sim, are you sure that the encoding of the dump and of the database
into which you are trying to restore it are the same?



--
// Dmitriy.


Re: finding rows with invalid characters

From
Jasen Betts
Date:
On 2010-11-21, Sim Zacks <sim@compulab.co.il> wrote:
> I am using PG 8.2.17 with UTF8 encoding.
> "PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.1
> (Gentoo 4.1.1)"
>
> One of my tables somehow has invalid characters in it:
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xa9
>> HINT:  This error can also happen if the byte sequence does not match
>> the encoding expected by the server, which is controlled by
>> "client_encoding".
> I have already manually found a number of the bad rows by running
> queries with text functions (upper) between groups of IDs until I found
> the specific bad row.
>

> 1) Is there a quicker way to get a list of all rows with invalid characters

Dump the table, run it through iconv, then diff against the original.
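The same clean-and-diff idea can be sketched in Python instead of iconv/diff; decoding with errors="ignore" drops undecodable bytes much like `iconv -c` does. The sample dump text is made up:

```python
def clean_utf8(raw: bytes) -> bytes:
    # Decode leniently, dropping bytes that are not valid UTF-8,
    # then re-encode; roughly what `iconv -c` does.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")

def bad_line_numbers(dump: bytes) -> list:
    # Lines that change under cleaning are the ones holding bad bytes.
    return [i for i, line in enumerate(dump.splitlines(), start=1)
            if clean_utf8(line) != line]

sample = b"good line\ncopyright \xa9 2010\nanother good line\n"
```

Running `bad_line_numbers` over a plain-text dump of the table gives the list of affected lines, which can then be mapped back to rows.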

> 2) Shouldn't the database prevent these rows from being entered in the
> first place?

It should have, but that bug has since been fixed.

> 3) I have backups of this database (using -Fc) and I noticed that on
> restore, this table is not restored because of this error. Is there a
> way to fix the existing backups, or tell the restore to ignore bad rows
> instead of erroring out the whole table?

Translate them to SQL (use pg_restore with no database name);
then you can again use iconv to clean them.

Use iconv something like this:

  iconv --from-code UTF8 --to-code UTF8 -c  < input_file > output_file

This will translate surrogates and drop other invalid characters.

If you have any constraints that place lower bounds on string length,
this has the potential to break them.

--
⚂⚃ 100% natural