Re: finding rows with invalid characters - Mailing list pgsql-general

From Sim Zacks
Subject Re: finding rows with invalid characters
Date
Msg-id 4CEB7DBE.9@compulab.co.il
Whole thread Raw
In response to finding rows with invalid characters  (Sim Zacks <sim@compulab.co.il>)
Responses Re: finding rows with invalid characters
List pgsql-general

Thanks for your help.

I think a trigger will actually be the easiest. The way i can tell if there is invalid data is simply to do an Upper(text) and if it has invalid data it fails.

I dumped the fixed database. Now I have a years worth of backups that I can't restore a specific table from. It most probably will never mean anything. Every once in a while, I get asked to check what was in the db against what is in there, but this table will probably never be audited.


Sim



On 11/23/2010 10:33 AM, Dmitriy Igrishin wrote:

Hey Sim,

2010/11/23 Sim Zacks <sim@compulab.co.il>
On 11/21/2010 05:55 PM, Dmitriy Igrishin wrote:
Hey Sim,

Maybe this helps:
http://blog.tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html
That worked to find some of them. I still needed to find a bunch of others manually, such as 0xa0 and 0xd725 which weren't found with that function. I finally figured out that
select * from emaildetails where emailbody like '%\xa0%' and
select * from emaildetails where emailbody like '%\xd7\x25%' would show me all those rows.

My 2 big problems now are:

A) how to make sure that these chars are not inserted in the future. The database should prevent them from being inserted.
Consider to use domains -- generic-based types with constraints -- instead of generic types.
Use regular expressions in constraints.
As alternative, you can use triggers for more complex validation. But domains in you case
IMO will work good.


B) How to fix the backups that I have so that I can restore them. As I mentioned, they are being taken with -Fc
Oops. Why not dump fixed database ?

Sim



2010/11/21 Sim Zacks <sim@compulab.co.il>
I am using PG 8.2.17 with UTF8 encoding.
"PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.1 (Gentoo 4.1.1)"

One of my tables somehow has invalid characters in it:
ERROR:  invalid byte sequence for encoding "UTF8": 0xa9
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
I have already manually found a number of the bad rows by running queries with text functions (upper) between groups of IDs until I found the specific bad row.

1) Is there a quicker way to get a list of all rows with invalid characters
2) Shouldn't the database prevent these rows from being entered in the first place?
3) I have backups of this database (using -Fc) and I noticed that on restore, this table is not restored because of this error. Is there a way to fix the existing backups, or tell the restore to ignore bad rows instead of erroring out the whole table?

Thanks
Sim

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



--
// Dmitriy.






--
// Dmitriy.



pgsql-general by date:

Previous
From: Mike Christensen
Date:
Subject: Re: json data type
Next
From: Dmitriy Igrishin
Date:
Subject: Re: finding rows with invalid characters