On Sun, Dec 04, 2005 at 12:19:32PM -0500, Gregory Maxwell wrote:
> > That's exactly what's bothering me about it. If we recommend that
> > we had better put a large THIS WILL DESTROY YOUR DATA warning first.
> > The problem is that the data is not "invalid" from the user's point
> > of view --- more likely, it's in some non-UTF8 encoding --- and so
> > just throwing away some of the characters is unlikely to make people
> > happy.
>
> Nor is it even guarenteed to make the data load: If the column is
> unique constrained and the removal of the non-UTF characters makes two
> rows have the same data where they didn't before...
>
> The way to preserve the data is to switch the column to be a bytea.
Additionally, it's hard to suggest anything better without specific
knowledge of the characters that are incorrect and how they got there.
The ideal solution would be a way for people to identify problem data
*before* they dump so they have an opportunity to fix it. Something
like a module they can load and say:
select val from table where not utf8_validate(val);
This would allow people to examine the data while the system is still
running and fix it. Maybe we can code something up in plpgsql? Slow as
molasses but you'll be able to run it anywhere.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.