Thread: Character encoding in database dumps

Character encoding in database dumps

From
Lynna Landstreet
Date:
Hi there,

My database was created with Unicode encoding, as a fair number of the
artists and exhibitions have French names, and thus include accented
characters. I thought I'd solved all the various problems this generates,
but have just discovered one new one.

When I do a dump of the database to back it up (because I'm paranoid and
don't want my web host to be my only source of security on this matter), the
accented characters don't come through correctly. They all display correctly
in phpPgAdmin, and in the PHP pages I've made to search and display the data
- it's just in dumps that they're garbled. I presume this means the Unicode
encoding is somehow being lost and the data read as standard ASCII.

I'm currently using phpPgAdmin to export the data, and it gives me a choice
of downloading it as a .sql file or displaying it in the browser window, but
either way, the special characters are toasted. Does anyone have any idea
how I can do a database dump that keeps the character encoding intact?

Thanks,

Lynna

--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org


Re: Character encoding in database dumps

From
Aarni Ruuhimäki
Date:
Hi Lynna,

Have you tried restoring your data from the dumps ?

I use Latin1 encoding with my dbs. All russian characters are a mess in the
dump files and in the psql terminal as well. However, they restore, display
and distribute ok, even into win-platform. Out of curiosity, I renamed one
dump.file to dump.html and opened it with a browser and set the browser
encoding to cyrillic. Characters display ok.

So could this be more an issue with the client application's ability to use
proper encoding for displaying the characters right ? Dunno for sure, just my
thought out of some experience with cyrillic and Pg. Encodings are a bit of
jungle, it seems.

BR,

Aarni

On Wednesday 09 June 2004 21:11, you wrote:
> Hi there,
>
> My database was created with Unicode encoding, as a fair number of the
> artists and exhibitions have French names, and thus include accented
> characters. I thought I'd solved all the various problems this generates,
> but have just discovered one new one.
>
> When I do a dump of the database to back it up (because I'm paranoid and
> don't want my web host to be my only source of security on this matter),
> the accented characters don't come through correctly. They all display
> correctly in phpPgAdmin, and in the PHP pages I've made to search and
> display the data - it's just in dumps that they're garbled. I presume this
> means the Unicode encoding is somehow being lost and the data read as
> standard ASCII.
>
> I'm currently using phpPgAdmin to export the data, and it gives me a choice
> of downloading it as a .sql file or displaying it in the browser window,
> but either way, the special characters are toasted. Does anyone have any
> idea how I can do a database dump that keeps the character encoding intact?
>
> Thanks,
>
> Lynna

--
-------------------------------------------------
Aarni Ruuhimäki
-------------------------------------------------
This is a bugfree broadcast to you from a linux system.

Re: Character encoding in database dumps

From
Markus Bertheau
Date:
В Срд, 09.06.2004, в 20:11, Lynna Landstreet пишет:
> Hi there,
>
> My database was created with Unicode encoding, as a fair number of the
> artists and exhibitions have French names, and thus include accented
> characters. I thought I'd solved all the various problems this generates,
> but have just discovered one new one.
>
> When I do a dump of the database to back it up (because I'm paranoid and
> don't want my web host to be my only source of security on this matter), the
> accented characters don't come through correctly. They all display correctly
> in phpPgAdmin, and in the PHP pages I've made to search and display the data
> - it's just in dumps that they're garbled. I presume this means the Unicode
> encoding is somehow being lost and the data read as standard ASCII.
>
> I'm currently using phpPgAdmin to export the data, and it gives me a choice
> of downloading it as a .sql file or displaying it in the browser window, but
> either way, the special characters are toasted. Does anyone have any idea
> how I can do a database dump that keeps the character encoding intact?

How do you determine that the character encoding is wrong? Maybe
whatever it is you use to look at the dump just doesn't interpret the
data as UTF-8.

--
Markus Bertheau <twanger@bluetwanger.de>


Re: Character encoding in database dumps

From
Lynna Landstreet
Date:
on 6/10/04 8:19 AM, Aarni Ruuhimäki at aarni@kymi.com wrote:

> Have you tried restoring your data from the dumps ?

No - I wanted to, but my web host's control panel appears to be down so I
can't create a new database to restore it into, and I don't want to risk
restoring it into the existing database in case it screws everything up. And
the dump file was still messed up when I tried viewing it in my browser, as
you suggest.

However, I found the answer. I should have actually figured it out sooner,
because I originally had the same problem in reverse when trying to get data
*into* the database: FTP programs tend to send text files in ASCII format,
which doesn't support Unicode. When I uploaded the text files to \copy my
data into the database from, I had to make sure the FTP program used binary
mode, and in order to make it do that, I had to change the file extension to
something my internet preferences did not brand as ASCII automatically.

Turned out to be what was happening here as well. I had to skip using
phpPgAdmin's Export function and just use pg_dump via a shell connection,
and name the dump something that did not end in .txt or .sql or anything
like that, and then download it as binary, and *then* open it in BBEdit with
the "Read as" option set to UTF-8. Once I did that, all my special
characters were OK.

Major headache, though. I'm hoping as things evolve, Unicode support will be
built into more programs including those where it's currently not thought
necessary like FTP programs. People need to learn that ASCII is *not*
necessarily the correct format for all text files...


Lynna

--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org


Re: Character encoding in database dumps

From
Bruno Wolff III
Date:
On Fri, Jun 11, 2004 at 15:06:08 -0400,
  Lynna Landstreet <lynna@gallery44.org> wrote:
>
> Major headache, though. I'm hoping as things evolve, Unicode support will be
> built into more programs including those where it's currently not thought
> necessary like FTP programs. People need to learn that ASCII is *not*
> necessarily the correct format for all text files...

The ascii mode in ftp is for files that need to be converted to or from
using LF to end lines from or to using CRLF to end lines. Normally you want
to do binary transfers.

Re: Character encoding in database dumps

From
"M. Bastin"
Date:
>Does anyone have any idea
>how I can do a database dump that keeps the character encoding intact?

If you can access your database directly with a TCP/IP PostgreSQL
client then you can use Eduphant which supports copy to stdout/stdin.
This will allow you to copy directly to your local computer in 1 step.

<http://aliacta.com/download>

Cheers,

Marc

Re: Character encoding in database dumps

From
Date:
On Sat, 12 Jun 2004, M. Bastin wrote:

> >Does anyone have any idea
> >how I can do a database dump that keeps the character encoding intact?

[ Sorry, idle mind.  :-) ]

Reading a bunch of these messages over the last while, it appears
that the database will only support a single character encoding
internally/natively.  Maybe I am wrong, and maybe this will change
in the future.

But, if you need to interact in multiple character sets now, this
seems to be something which you are now doing with a front-end of
some kind.  Perhaps the thing to do, is to add a field to your
tables (or make them into 2 column arrays?) involving character
data, where this new field (or 0'th array element?) stores the
character set used when the data was originally input.  If a
front-end makes a query involving character data, it gets back the
character set involved and the character data.  Then the front-end
has to deal with translating from one character set to another.

Just my $0.02 (CDN)
Gord



Re: Character encoding in database dumps

From
Lynna Landstreet
Date:
on 6/11/04 3:33 PM, Bruno Wolff III at bruno@wolff.to wrote:

>> Major headache, though. I'm hoping as things evolve, Unicode support will be
>> built into more programs including those where it's currently not thought
>> necessary like FTP programs. People need to learn that ASCII is *not*
>> necessarily the correct format for all text files...
>
> The ascii mode in ftp is for files that need to be converted to or from
> using LF to end lines from or to using CRLF to end lines. Normally you want
> to do binary transfers.

I'm used to having to use ASCII on anything remotely CGI-related because
normally if I do binary transfers for any sort of CGI file it breaks the
script. But maybe that is the LF issue - I know MacOS, Windows and UNIX all
use different sorts of line returns and since CGIs are usually running on
UNIX systems, Mac or Windows line returns will mess everything up. I don't
know if PHP scripts are as sensitive...

But the bigger problem is that it seems like these days most FTP programs -
as well as browsers and text editors that have some FTP functionality -
automatically set the transfer mode based on the file extension (and text
editors like BBEdit *only* have ASCII transfer mode, presumably on the basis
that, hey, it's text, and text *always* means ASCII, right?). Some will
allow you to override that, some won't. That's why I had to change the
extension on my text data files and database dumps in order for a binary
transfer to even be possible.

It's really annoying - I hate software that thinks it knows what I'm trying
to do better than I do.


Lynna

--
Resource Centre Database Coordinator
Gallery 44
www.gallery44.org