Thread: evil characters #bfef cause dump failure
I have been trying to track down the source of why my 7.4.5 database won't reimport it's own dump ( http://archives.postgresql.org/pgsql-admin/2004-10/msg00213.php ) After much wrestling, it appears the hex byte sequence #bfef in a VARCHAR column causes a truncated COPY line to be written (and thus the *entire* COPY block fails). Exporting as inserts did not fix the problem either. Any thoughts on why this might be so or how it can be avoided? Evil thought of the day is if someone were to go around and paste this multi-byte character in various websites' html forms it could cause a lot of trouble. Also, the behavior of the restore / psql import to complete the COPY fields from the *following* line seems not good. It would be nice if the missing columns could just be written as NULL's. 6 bad rows makes a 6 gig dump worthless. Or perhaps an option to import each copy row in it's own transaction so 5+ million copied rows don't fail for 6 bogus ones. Perhaps a --this_is_an_emergency_so_please_do_everything_you_can_to_restore_as_much_as_possible option. If any of the core dev's want some small debug dumps I created, I'd be happy to pass them on. [ \ / [ >X< Christian Fowler | spider AT viovio.com [ / \ http://www.viovio.com | http://www.tikipro.org
Christian Fowler <spider@viovio.com> writes: > After much wrestling, it appears the hex byte sequence #bfef in a VARCHAR > column causes a truncated COPY line to be written (and thus the *entire* > COPY block fails). What database encoding and locale are you using? regards, tom lane
[shell]$ env PGCLIENTENCODING=UNICODE LANG=en_US.UTF-8 db=# \encoding UNICODE On Mon, 15 Nov 2004, Tom Lane wrote: > Christian Fowler <spider@viovio.com> writes: >> After much wrestling, it appears the hex byte sequence #bfef in a VARCHAR >> column causes a truncated COPY line to be written (and thus the *entire* >> COPY block fails). > > What database encoding and locale are you using? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > [ \ / [ >X< Christian Fowler | spider AT viovio.com [ / \ http://www.viovio.com | http://www.tikipro.org
Christian Fowler <spider@viovio.com> writes: > [shell]$ env > PGCLIENTENCODING=UNICODE > LANG=en_US.UTF-8 > db=# \encoding > UNICODE I was more concerned about the database encoding, which the above doesn't prove. Try "SHOW server_encoding" regards, tom lane
db-# ; server_encoding ----------------- SQL_ASCII (1 row) whoa! yikes, I bet this has a lot to do with it? I really wanted to keep everything UNICODE end-to-end. I must have forgotten --encoding on my initdb? Anything I can do at this point? On Mon, 15 Nov 2004, Tom Lane wrote: > Christian Fowler <spider@viovio.com> writes: >> [shell]$ env >> PGCLIENTENCODING=UNICODE >> LANG=en_US.UTF-8 > >> db=# \encoding >> UNICODE > > I was more concerned about the database encoding, which the above > doesn't prove. Try "SHOW server_encoding" > > regards, tom lane > [ \ / [ >X< Christian Fowler | spider AT viovio.com [ / \ http://www.viovio.com | http://www.tikipro.org
Christian Fowler <spider@viovio.com> writes: > server_encoding > ----------------- > SQL_ASCII > whoa! yikes, I bet this has a lot to do with it? I really wanted to keep > everything UNICODE end-to-end. I must have forgotten --encoding on my > initdb? Anything I can do at this point? Hmm ... the safe way would be dump-n-reload but that's not working for you. What you can try is to alter the pg_database.encoding value for that database, then start fresh backends (any existing ones won't notice the change). Worst case if that doesn't make life good is to change it back. The real problem is that you've got invalid unicode data in the database (I'm not an expert, but I think that #bf is a 1-byte UTF8 sequence and then #ef starts a 3-byte sequence, so if this comes within 2 characters of end-of-line that would explain your dump problem). You had better fix the data first before trying to lock down the encoding. Once you change the encoding, backend internal operations will start spitting up on any stored bad data, whereas right now it's just passing it through unchanged. The safest way might be a dump-n-reload in any case, since reloading into a fresh UNICODE database will catch bad data. If you try manual repairs you're likely to miss some places :-( regards, tom lane
В Пнд, 15/11/2004 в 16:00 -0500, Tom Lane пишет: > The real problem is that you've got invalid unicode data in the database > (I'm not an expert, but I think that #bf is a 1-byte UTF8 sequence and > then #ef starts a 3-byte sequence, so if this comes within 2 characters > of end-of-line that would explain your dump problem). FWIW, 1-byte UTF-8 sequences are always < 128. BF can only appear inside, not at the beginning of, a UTF-8 byte sequence with more than 1 byte. Compare http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 It has a table that gives anyone who can tell bits from bytes a quick understanding of how the UTF-8 encoding works. -- Markus Bertheau <twanger@bluetwanger.de>
It seems that this kind of thing pops up from time to time. I don't have v8 available right now to check, but is SQL_ASCII still the default DB encoding? I'm wondering is unicode wouldn't be a better choice these days. regards Iain ----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> To: "Christian Fowler" <spider@viovio.com> Cc: "pgsql-admin list" <pgsql-admin@postgresql.org> Sent: Tuesday, November 16, 2004 6:00 AM Subject: Re: [ADMIN] evil characters #bfef cause dump failure > Christian Fowler <spider@viovio.com> writes: >> server_encoding >> ----------------- >> SQL_ASCII > >> whoa! yikes, I bet this has a lot to do with it? I really wanted to keep >> everything UNICODE end-to-end. I must have forgotten --encoding on my >> initdb? Anything I can do at this point? > > Hmm ... the safe way would be dump-n-reload but that's not working for > you. What you can try is to alter the pg_database.encoding value for > that database, then start fresh backends (any existing ones won't notice > the change). Worst case if that doesn't make life good is to change it > back. > > The real problem is that you've got invalid unicode data in the database > (I'm not an expert, but I think that #bf is a 1-byte UTF8 sequence and > then #ef starts a 3-byte sequence, so if this comes within 2 characters > of end-of-line that would explain your dump problem). You had better > fix the data first before trying to lock down the encoding. Once you > change the encoding, backend internal operations will start spitting up > on any stored bad data, whereas right now it's just passing it through > unchanged. > > The safest way might be a dump-n-reload in any case, since reloading > into a fresh UNICODE database will catch bad data. If you try manual > repairs you're likely to miss some places :-( > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html
"Iain" <iain@mst.co.jp> writes: > It seems that this kind of thing pops up from time to time. I don't have v8 > available right now to check, but is SQL_ASCII still the default DB > encoding? I'm wondering is unicode wouldn't be a better choice these days. IIRC you can select the default encoding at build time, so this is really a question for packagers not the development team. You make a good point though --- I'm a bit tempted to make it default to UNICODE for the Red Hat build, since Red Hat is pretty gung-ho on UTF8 support these days. BTW, SQL_ASCII is not so much an encoding as the absence of any encoding choice; it just passes 8-bit data with no interpretation. So it's not *that* unreasonable a default. You can store UTF8 data in it without any problem, you just won't have the niceties like detection of bad character sequences. regards, tom lane
Tom Lane wrote: > You make a good point though --- I'm a bit tempted to make it default > to UNICODE for the Red Hat build, since Red Hat is pretty gung-ho on > UTF8 support these days. Recall that in 8.0 the default encoding will be derived from the locale. So if the postgres account has a reasonable locale set (presumably chosen somewhere during the system installation), everything will work out. -- Peter Eisentraut http://developer.postgresql.org/~petere/
В Пнд, 15/11/2004 в 20:34 -0500, Tom Lane пишет: > "Iain" <iain@mst.co.jp> writes: > > It seems that this kind of thing pops up from time to time. I don't have v8 > > available right now to check, but is SQL_ASCII still the default DB > > encoding? I'm wondering is unicode wouldn't be a better choice these days. > > IIRC you can select the default encoding at build time, so this is > really a question for packagers not the development team. > > You make a good point though --- I'm a bit tempted to make it default to > UNICODE for the Red Hat build, since Red Hat is pretty gung-ho on UTF8 > support these days. > > BTW, SQL_ASCII is not so much an encoding as the absence of any encoding > choice; it just passes 8-bit data with no interpretation. So it's not > *that* unreasonable a default. You can store UTF8 data in it without > any problem, you just won't have the niceties like detection of bad > character sequences. This is, by the way, a reason why this encoding should be renamed to SQL_8BIT (or something along these lines) and UNICODE to UTF-8. -- Markus Bertheau <twanger@bluetwanger.de>
Hi, > Recall that in 8.0 the default encoding will be derived from the locale. > So if the postgres account has a reasonable locale set (presumably > chosen somewhere during the system installation), everything will work > out. That's seems pretty reasonable, though I think that standardizing on unicode (and I guess that means UTF-8) is really the way to go. It was designed as the universal standard after all. Thanks for the feedback, Iain
Am Dienstag, 16. November 2004 09:45 schrieb Iain: > That's seems pretty reasonable, though I think that standardizing on > unicode (and I guess that means UTF-8) is really the way to go. It was > designed as the universal standard after all. It may have been designed that way, but it is a failure in practice. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Hi Peter, I know that the unicode standards are far from perfect, but I'm wondering why you consider it a failure. Is it technical, or just an acceptance thing? From my personal perspective, I never had any interest in such things as encodings or internationlization until I started working in Japan, then I realized what a nightmare it is. I expect you can imagine, but most people (like me a year ago) couldn't. If everyone was already using unicode, I don't think we'd have anything to worry about. regards Iain ----- Original Message ----- From: "Peter Eisentraut" <peter_e@gmx.net> To: "Iain" <iain@mst.co.jp> Cc: "Tom Lane" <tgl@sss.pgh.pa.us>; "pgsql-admin list" <pgsql-admin@postgresql.org> Sent: Tuesday, November 16, 2004 6:43 PM Subject: Re: [ADMIN] evil characters #bfef cause dump failure > Am Dienstag, 16. November 2004 09:45 schrieb Iain: >> That's seems pretty reasonable, though I think that standardizing on >> unicode (and I guess that means UTF-8) is really the way to go. It was >> designed as the universal standard after all. > > It may have been designed that way, but it is a failure in practice. > > -- > Peter Eisentraut > http://developer.postgresql.org/~petere/ > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
Am Dienstag, 16. November 2004 11:27 schrieb Iain: > I know that the unicode standards are far from perfect, but I'm wondering > why you consider it a failure. If it were a success, then everyone would be using it and we wouldn't have this discussion. Certainly, Unicode is reasonable and the best option in many cases. But the original idea of replacing all other character sets won't really happen anytime soon. Since you work in Japan, I would think you know about the problems, since much of the technical opposition comes from there. -- Peter Eisentraut http://developer.postgresql.org/~petere/
I strongly agree with this. I have always been uncomfortable selecting "UNICODE" and never quite sure if it is the UTF8, UTF16, or UTF32 encoding. SQL_8BIT or SQL_RAW make much more sense than SQL_ASCII given that Tom said this is a lack of encoding. I fear I might have high-bits chopped off or something. However, back to my problem... if a #bfef character is shoved into a VARCHAR, one's dump is hosed. If I went to various websites and entered this in, I could cause a lot of pain. I believe I noticed some characters (like new line and tab) are converted to <80> or similar. Could/should this be extended to more character ranges - particularly high byte chars for people with the SQL_ASCII (lackof) encoding? On Tue, 16 Nov 2004, Markus Bertheau wrote: > > This is, by the way, a reason why this encoding should be renamed to > SQL_8BIT (or something along these lines) and UNICODE to UTF-8. > > -- > Markus Bertheau <twanger@bluetwanger.de> > > > ---------------------------(end of broadcast)--------------------------- > TIP 8: explain analyze is your friend > [ \ / [ >X< Christian Fowler | spider AT viovio.com [ / \ http://www.viovio.com | http://www.tikipro.org