Thread: evil characters #bfef cause dump failure

evil characters #bfef cause dump failure

From
Christian Fowler
Date:
I have been trying to track down the source of why my 7.4.5 database won't
reimport it's own dump ( http://archives.postgresql.org/pgsql-admin/2004-10/msg00213.php )

After much wrestling, it appears the hex byte sequence #bfef in a VARCHAR
column causes a truncated COPY line to be written (and thus the *entire*
COPY block fails). Exporting as inserts did not fix the problem either.

Any thoughts on why this might be so or how it can be avoided? Evil
thought of the day is if someone were to go around and paste this
multi-byte character in various websites' html forms it could cause a lot
of trouble.

Also, the behavior of the restore / psql import to complete the COPY
fields from the *following* line seems not good. It would be nice if the
missing columns could just be written as NULL's. 6 bad rows makes a 6 gig
dump worthless. Or perhaps an option to import each copy row in it's own
transaction so 5+ million copied rows don't fail for 6 bogus ones. Perhaps a
--this_is_an_emergency_so_please_do_everything_you_can_to_restore_as_much_as_possible
option.

If any of the core dev's want some small debug dumps I created, I'd be
happy to pass them on.

[ \ /
[ >X<   Christian Fowler      | spider AT viovio.com
[ / \   http://www.viovio.com | http://www.tikipro.org

Re: evil characters #bfef cause dump failure

From
Tom Lane
Date:
Christian Fowler <spider@viovio.com> writes:
> After much wrestling, it appears the hex byte sequence #bfef in a VARCHAR
> column causes a truncated COPY line to be written (and thus the *entire*
> COPY block fails).

What database encoding and locale are you using?

            regards, tom lane

Re: evil characters #bfef cause dump failure

From
Christian Fowler
Date:
[shell]$ env
PGCLIENTENCODING=UNICODE
LANG=en_US.UTF-8

db=# \encoding
UNICODE


On Mon, 15 Nov 2004, Tom Lane wrote:

> Christian Fowler <spider@viovio.com> writes:
>> After much wrestling, it appears the hex byte sequence #bfef in a VARCHAR
>> column causes a truncated COPY line to be written (and thus the *entire*
>> COPY block fails).
>
> What database encoding and locale are you using?
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

[ \ /
[ >X<   Christian Fowler      | spider AT viovio.com
[ / \   http://www.viovio.com | http://www.tikipro.org

Re: evil characters #bfef cause dump failure

From
Tom Lane
Date:
Christian Fowler <spider@viovio.com> writes:
> [shell]$ env
> PGCLIENTENCODING=UNICODE
> LANG=en_US.UTF-8

> db=# \encoding
> UNICODE

I was more concerned about the database encoding, which the above
doesn't prove.  Try "SHOW server_encoding"

            regards, tom lane

Re: evil characters #bfef cause dump failure

From
Christian Fowler
Date:
db-# ;
  server_encoding
-----------------
  SQL_ASCII
(1 row)

whoa! yikes, I bet this has a lot to do with it? I really wanted to keep
everything UNICODE end-to-end.  I must have forgotten --encoding on my
initdb? Anything I can do at this point?




On Mon, 15 Nov 2004, Tom Lane wrote:

> Christian Fowler <spider@viovio.com> writes:
>> [shell]$ env
>> PGCLIENTENCODING=UNICODE
>> LANG=en_US.UTF-8
>
>> db=# \encoding
>> UNICODE
>
> I was more concerned about the database encoding, which the above
> doesn't prove.  Try "SHOW server_encoding"
>
>             regards, tom lane
>

[ \ /
[ >X<   Christian Fowler      | spider AT viovio.com
[ / \   http://www.viovio.com | http://www.tikipro.org

Re: evil characters #bfef cause dump failure

From
Tom Lane
Date:
Christian Fowler <spider@viovio.com> writes:
>   server_encoding
> -----------------
>   SQL_ASCII

> whoa! yikes, I bet this has a lot to do with it? I really wanted to keep
> everything UNICODE end-to-end.  I must have forgotten --encoding on my
> initdb? Anything I can do at this point?

Hmm ... the safe way would be dump-n-reload but that's not working for
you.  What you can try is to alter the pg_database.encoding value for
that database, then start fresh backends (any existing ones won't notice
the change).  Worst case if that doesn't make life good is to change it
back.

The real problem is that you've got invalid unicode data in the database
(I'm not an expert, but I think that #bf is a 1-byte UTF8 sequence and
then #ef starts a 3-byte sequence, so if this comes within 2 characters
of end-of-line that would explain your dump problem).  You had better
fix the data first before trying to lock down the encoding.  Once you
change the encoding, backend internal operations will start spitting up
on any stored bad data, whereas right now it's just passing it through
unchanged.

The safest way might be a dump-n-reload in any case, since reloading
into a fresh UNICODE database will catch bad data.  If you try manual
repairs you're likely to miss some places :-(

            regards, tom lane

Re: evil characters #bfef cause dump failure

From
Markus Bertheau
Date:
В Пнд, 15/11/2004 в 16:00 -0500, Tom Lane пишет:

> The real problem is that you've got invalid unicode data in the database
> (I'm not an expert, but I think that #bf is a 1-byte UTF8 sequence and
> then #ef starts a 3-byte sequence, so if this comes within 2 characters
> of end-of-line that would explain your dump problem).

FWIW, 1-byte UTF-8 sequences are always < 128. BF can only appear
inside, not at the beginning of, a UTF-8 byte sequence with more than 1
byte.

Compare

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

It has a table that gives anyone who can tell bits from bytes a quick
understanding of how the UTF-8 encoding works.

--
Markus Bertheau <twanger@bluetwanger.de>


Re: evil characters #bfef cause dump failure

From
"Iain"
Date:
It seems that this kind of thing pops up from time to time. I don't have v8
available right now to check, but is SQL_ASCII still the default DB
encoding? I'm wondering is unicode wouldn't be a better choice these days.

regards
Iain
----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Christian Fowler" <spider@viovio.com>
Cc: "pgsql-admin list" <pgsql-admin@postgresql.org>
Sent: Tuesday, November 16, 2004 6:00 AM
Subject: Re: [ADMIN] evil characters #bfef cause dump failure


> Christian Fowler <spider@viovio.com> writes:
>>   server_encoding
>> -----------------
>>   SQL_ASCII
>
>> whoa! yikes, I bet this has a lot to do with it? I really wanted to keep
>> everything UNICODE end-to-end.  I must have forgotten --encoding on my
>> initdb? Anything I can do at this point?
>
> Hmm ... the safe way would be dump-n-reload but that's not working for
> you.  What you can try is to alter the pg_database.encoding value for
> that database, then start fresh backends (any existing ones won't notice
> the change).  Worst case if that doesn't make life good is to change it
> back.
>
> The real problem is that you've got invalid unicode data in the database
> (I'm not an expert, but I think that #bf is a 1-byte UTF8 sequence and
> then #ef starts a 3-byte sequence, so if this comes within 2 characters
> of end-of-line that would explain your dump problem).  You had better
> fix the data first before trying to lock down the encoding.  Once you
> change the encoding, backend internal operations will start spitting up
> on any stored bad data, whereas right now it's just passing it through
> unchanged.
>
> The safest way might be a dump-n-reload in any case, since reloading
> into a fresh UNICODE database will catch bad data.  If you try manual
> repairs you're likely to miss some places :-(
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faqs/FAQ.html


Re: evil characters #bfef cause dump failure

From
Tom Lane
Date:
"Iain" <iain@mst.co.jp> writes:
> It seems that this kind of thing pops up from time to time. I don't have v8
> available right now to check, but is SQL_ASCII still the default DB
> encoding? I'm wondering is unicode wouldn't be a better choice these days.

IIRC you can select the default encoding at build time, so this is
really a question for packagers not the development team.

You make a good point though --- I'm a bit tempted to make it default to
UNICODE for the Red Hat build, since Red Hat is pretty gung-ho on UTF8
support these days.

BTW, SQL_ASCII is not so much an encoding as the absence of any encoding
choice; it just passes 8-bit data with no interpretation.  So it's not
*that* unreasonable a default.  You can store UTF8 data in it without
any problem, you just won't have the niceties like detection of bad
character sequences.

            regards, tom lane

Re: evil characters #bfef cause dump failure

From
Peter Eisentraut
Date:
Tom Lane wrote:
> You make a good point though --- I'm a bit tempted to make it default
> to UNICODE for the Red Hat build, since Red Hat is pretty gung-ho on
> UTF8 support these days.

Recall that in 8.0 the default encoding will be derived from the locale.
So if the postgres account has a reasonable locale set (presumably
chosen somewhere during the system installation), everything will work
out.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: evil characters #bfef cause dump failure

From
Markus Bertheau
Date:
В Пнд, 15/11/2004 в 20:34 -0500, Tom Lane пишет:
> "Iain" <iain@mst.co.jp> writes:
> > It seems that this kind of thing pops up from time to time. I don't have v8
> > available right now to check, but is SQL_ASCII still the default DB
> > encoding? I'm wondering is unicode wouldn't be a better choice these days.
>
> IIRC you can select the default encoding at build time, so this is
> really a question for packagers not the development team.
>
> You make a good point though --- I'm a bit tempted to make it default to
> UNICODE for the Red Hat build, since Red Hat is pretty gung-ho on UTF8
> support these days.
>
> BTW, SQL_ASCII is not so much an encoding as the absence of any encoding
> choice; it just passes 8-bit data with no interpretation.  So it's not
> *that* unreasonable a default.  You can store UTF8 data in it without
> any problem, you just won't have the niceties like detection of bad
> character sequences.

This is, by the way, a reason why this encoding should be renamed to
SQL_8BIT (or something along these lines) and UNICODE to UTF-8.

--
Markus Bertheau <twanger@bluetwanger.de>


Re: evil characters #bfef cause dump failure

From
"Iain"
Date:
Hi,

> Recall that in 8.0 the default encoding will be derived from the locale.
> So if the postgres account has a reasonable locale set (presumably
> chosen somewhere during the system installation), everything will work
> out.

That's seems pretty reasonable, though I think that standardizing on unicode
(and I guess that means UTF-8) is really the way to go. It was designed as
the universal standard after all.

Thanks for the feedback,
Iain









Re: evil characters #bfef cause dump failure

From
Peter Eisentraut
Date:
Am Dienstag, 16. November 2004 09:45 schrieb Iain:
> That's seems pretty reasonable, though I think that standardizing on
> unicode (and I guess that means UTF-8) is really the way to go. It was
> designed as the universal standard after all.

It may have been designed that way, but it is a failure in practice.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: evil characters #bfef cause dump failure

From
"Iain"
Date:
Hi Peter,

I know that the unicode standards are far from perfect, but I'm wondering
why you consider it a failure.

Is it technical, or just an acceptance thing?

From my personal perspective, I never had any interest in such things as
encodings or internationlization until I started working in Japan, then I
realized what a nightmare it is. I expect you can imagine, but most people
(like me a year ago) couldn't. If everyone was already using unicode, I
don't think we'd have anything to worry about.

regards
Iain
----- Original Message -----
From: "Peter Eisentraut" <peter_e@gmx.net>
To: "Iain" <iain@mst.co.jp>
Cc: "Tom Lane" <tgl@sss.pgh.pa.us>; "pgsql-admin list"
<pgsql-admin@postgresql.org>
Sent: Tuesday, November 16, 2004 6:43 PM
Subject: Re: [ADMIN] evil characters #bfef cause dump failure


> Am Dienstag, 16. November 2004 09:45 schrieb Iain:
>> That's seems pretty reasonable, though I think that standardizing on
>> unicode (and I guess that means UTF-8) is really the way to go. It was
>> designed as the universal standard after all.
>
> It may have been designed that way, but it is a failure in practice.
>
> --
> Peter Eisentraut
> http://developer.postgresql.org/~petere/
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org


Re: evil characters #bfef cause dump failure

From
Peter Eisentraut
Date:
Am Dienstag, 16. November 2004 11:27 schrieb Iain:
> I know that the unicode standards are far from perfect, but I'm wondering
> why you consider it a failure.

If it were a success, then everyone would be using it and we wouldn't have
this discussion.  Certainly, Unicode is reasonable and the best option in
many cases.  But the original idea of replacing all other character sets
won't really happen anytime soon.  Since you work in Japan, I would think you
know about the problems, since much of the technical opposition comes from
there.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: evil characters #bfef cause dump failure

From
Christian Fowler
Date:
I strongly agree with this. I have always been uncomfortable selecting
"UNICODE" and never quite sure if it is the UTF8, UTF16, or UTF32
encoding.

SQL_8BIT or SQL_RAW make much more sense than SQL_ASCII given that Tom
said this is a lack of encoding. I fear I might have high-bits chopped off
or something.

However, back to my problem... if a #bfef character is shoved into a
VARCHAR, one's dump is hosed. If I went to various websites and entered
this in, I could cause a lot of pain. I believe I noticed some characters
(like new line and tab) are converted to <80> or similar. Could/should
this be extended to more character ranges - particularly high byte chars
for people with the SQL_ASCII (lackof) encoding?


On Tue, 16 Nov 2004, Markus Bertheau wrote:

>
> This is, by the way, a reason why this encoding should be renamed to
> SQL_8BIT (or something along these lines) and UNICODE to UTF-8.
>
> --
> Markus Bertheau <twanger@bluetwanger.de>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>

[ \ /
[ >X<   Christian Fowler      | spider AT viovio.com
[ / \   http://www.viovio.com | http://www.tikipro.org