Re: How to remove non-UTF values from a table? - Mailing list pgsql-general

From Phoenix Kiula
Subject Re: How to remove non-UTF values from a table?
Date
Msg-id e373d31e0912151713x4697a862j97d8f7d8d8a1c20a@mail.gmail.com
Whole thread Raw
In response to Re: How to remove non-UTF values from a table?  (Howard Cole <howardnews@selestial.com>)
Responses Re: How to remove non-UTF values from a table?  (Howard Cole <howardnews@selestial.com>)
List pgsql-general
On Tue, Dec 15, 2009 at 9:26 PM, Howard Cole <howardnews@selestial.com> wrote:
> Phoenix Kiula wrote:
>>
>> An easy question for some I hope.
>>
>> I have a DB from 8.2 days that when I now dump and try to take into
>> the 8.3.7, it gives me errors about utf-8 stuff.
>>
>> I tried searching this list's archives but could not come up with an
>> answer.
>>
>> Google returns some sites like these:
>> http://sniptools.com/databases/finding-non-utf8-values-in-postgresql -
>> but I'm not clear on how to use them.
>>
>> Following the SQL on this site I could identify some columns that
>> contain text like this:
>>
>>    "Évolution générale de la situation démographique"
>>
>> So my guess is that the non-English characters were originally not
>> getting written in proper utf-8 variants.
>>
>> Is there any SQL possibility to find these columns and replace them
>> with utf-8 equivalents using some postgresql commands? Couldn't find
>> anything in the "Strings functions" (chapter 9 of manual).
>>
>> We're on CentOS.
>>
>> Thanks!
>>
>>
>
> My recommendation would be to install the iconv utility and run it on a
> plain text (pg_dump -Fp) backup as suggested in the google article - and
> then reimport the clean UTF-8.
>
> I am surprised that you managed to install the original backup on 8.3
> because it seems to be much more strict on encoding - Unless your database
> is not in UTF-8?



Thanks Howard.

I ran the SQL and it finds anything that has non-English characters.
For example:



http://www.amazon.co.jp/%E3%83%A4%E3%83%9E%E3%83%80%E9%9B%BB%E6%A9%9F%E3%81%AE%E5%93%81%E6%A0%BC%E2%80%95No-1%E4%BC%81%E6%A5%AD%E3%81%AE%E6%BF%80%E5%AE%89%E5%93%B2%E5%AD%A6-%E7%AB%8B%E7%9F%B3-%E6%B3%B0%E5%89%87/dp/406214378X/ref=sr_1_1?ie=UTF8&s=books&qid=1199212694&sr=8-1


Part of this URL is actually in Japanese, but when I paste it in this
email it comes up with all these percentage signs. I suppose this is
"url encoded".

Shouldn't this be valid UTF-8? How does PG calculate if something is
not valid UTF-8?

Thanks.

pgsql-general by date:

Previous
From: Scott Marlowe
Date:
Subject: Re: Possible causes for database corruption and solutions
Next
From: Craig Ringer
Date:
Subject: Re: Possible causes for database corruption and solutions