Thread: fatal copy in/out error (6.5.3)
Do this: testdb=> create table foo (word varchar(30)); CREATE testdb=> insert into foo values ('\217\210'); INSERT 23372891 testdb=> copy "foo" to '/usr/local/pgsql/foo.out'; COPY testdb=> \q % od -c foo.out 0000000 217 210 \0 \n 0000004 One run produced: 0000000 217 210 \0 \b \n In all cases, though, it emits a null. This, of course, causes "copy from" to fail when it tries to parse a null-containing field in a multi-field record. As for what the sequence "\217\210" represents, that's not really relevant (it's not even a valid Chinese character, so far as I can tell). Someone just entered this into our web interface and broke our database. -Michael Robinson P.S. As an aside, if "copy from" recognizes the \xxx octal convention, why doesn't "copy to" use it to represent everything outside of \040..\176?
Michael Robinson <robinson@netrinsics.com> writes: > testdb=> create table foo (word varchar(30)); > CREATE > testdb=> insert into foo values ('\217\210'); > INSERT 2337289 1 > testdb=> copy "foo" to '/usr/local/pgsql/foo.out'; > COPY > testdb=> \q > % od -c foo.out > 0000000 217 210 \0 \n > 0000004 Hmm. I get0000000 217 210 \n0000003 with either current sources or 6.5.3. I don't have MULTIBYTE nor USE_LOCALE turned on, though. What is your configuration exactly? This looks to me like something is deciding that \217 must be the start of a 3-byte multibyte character... in which case, it should have appeared that way in your database, I think. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: >Hmm. I get > 0000000 217 210 \n > 0000003 >with either current sources or 6.5.3. I don't have MULTIBYTE nor >USE_LOCALE turned on, though. What is your configuration exactly? My config line:./configure --with-mb=EUC_CN I can forward my include/config.h if that would be helpful. The OS is FreeBSD 3.4-RELEASE. >This looks to me like something is deciding that \217 must be the >start of a 3-byte multibyte character... in which case, it should have >appeared that way in your database, I think. That would be very weird, if true. \217 is certainly not the beginning of a UTF-8 three-byte sequence, and EUC doesn't have three-byte codes. -Michael Robinson
>> This looks to me like something is deciding that \217 must be the >> start of a 3-byte multibyte character... in which case, it should have >> appeared that way in your database, I think. > That would be very weird, if true. \217 is certainly not the beginning of > a UTF-8 three-byte sequence, and EUC doesn't have three-byte codes. Hmm. And I don't suppose it's real likely that \217 is a one-byte code and \210 starts a two-byte code? I'm out of my depth here --- it seems this is almost certainly a MULTIBYTE issue, but I know very little about MULTIBYTE. I'm going to punt and hope some of our MULTIBYTE experts pick it up. regards, tom lane
> My config line: > ./configure --with-mb=EUC_CN > > I can forward my include/config.h if that would be helpful. > > The OS is FreeBSD 3.4-RELEASE. > > >This looks to me like something is deciding that \217 must be the > >start of a 3-byte multibyte character... in which case, it should have > >appeared that way in your database, I think. I suspect that too. > That would be very weird, if true. \217 is certainly not the beginning of > a UTF-8 three-byte sequence, and EUC doesn't have three-byte codes. No. some EUC's (EUC_TW and EUC_JP) has three-byte or even four-byte codes. But you said your database has been configured as EUC_CN. As far as I know, it only uses 1 or 2 byte-code. Another thing I am confused is that ' \217\210' is not a valid EUC_CN data at all. \217 (0x8f) specifies code set 3 which does not exist in EUC_CN. In this case, it is assumed that the multi-byte word to be consisted of 3-byte code in the current implementation of PostgreSQL. In short, the problem you have is caused by: 1) wrong data submitted into the table 2) PostgreSQL assumes the data is consisted of 3 bytes data I would recommend you delete the data since it's not correct anyway. In the mean time I'm going to fix 2) so that it assumes data be consisted of 2 bytes even if wrong data sequence is submitted (needless to say, except ascii). Do you want the backpatch for 6.5.3? -- Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes: >No. some EUC's (EUC_TW and EUC_JP) has three-byte or even four-byte >codes. But you said your database has been configured as EUC_CN. As >far as I know, it only uses 1 or 2 byte-code. Another thing I am >confused is that ' \217\210' is not a valid EUC_CN data at all. \217 >(0x8f) specifies code set 3 which does not exist in EUC_CN. In this >case, it is assumed that the multi-byte word to be consisted of 3-byte >code in the current implementation of PostgreSQL. It could be that one of our users had their input method set to produce EUC_TW or Big5. >In short, the problem you have is caused by: >1) wrong data submitted into the table Kind of hard to control that when data is submitted by random users on the Internet. >I would recommend you delete the data since it's not correct anyway. >In the mean time I'm going to fix 2) so that it assumes data be >consisted of 2 bytes even if wrong data sequence is submitted >(needless to say, except ascii). >Do you want the backpatch for 6.5.3? Very much so. Thank you. -Michael
> It could be that one of our users had their input method set to produce > EUC_TW or Big5. > > >In short, the problem you have is caused by: > >1) wrong data submitted into the table > > Kind of hard to control that when data is submitted by random users on > the Internet. Yes, it's not a PostgreSQL's business but is a really big problem in the real world. Maybe some HTML gurus might have good suggestions on these issues (something like using a language tag?) > >I would recommend you delete the data since it's not correct anyway. > >In the mean time I'm going to fix 2) so that it assumes data be > >consisted of 2 bytes even if wrong data sequence is submitted > >(needless to say, except ascii). > >Do you want the backpatch for 6.5.3? > > Very much so. Thank you. Here it is. With this patch, copy out should be happy even with the wrong data. I'm not sure if it could be displayed correctly, though. -- Tatsuo Ishii ------------------------------- cut here ------------------------------------ *** postgresql-6.5.3/src/backend/utils/mb/wchar.c.orig Tue Jan 25 13:37:21 2000 --- postgresql-6.5.3/src/backend/utils/mb/wchar.c Tue Jan 25 13:37:33 2000 *************** *** 166,176 **** { int len; ! if (*s == SS2) ! len = 3; ! else if (*s == SS3) ! len = 3; ! else if (*s & 0x80) len = 2; else len = 1; --- 166,172 ---- { int len; ! if (*s & 0x80) len = 2; else len = 1;
Tatsuo Ishii <t-ishii@sra.co.jp> writes: >Yes, it's not a PostgreSQL's business but is a really big problem in >the real world. Maybe some HTML gurus might have good suggestions on >these issues (something like using a language tag?) The only solution is defensive programming. Even if there were a standard that everyone followed, if malicious people could break things by not following the standard, then you can be certain that somebody would do so. >Here it is. With this patch, copy out should be happy even with the >wrong data. I'm not sure if it could be displayed correctly, though. Thank you very much. However, I think even this is too optimistic: >! if (*s & 0x80) Shouldn't it be something like: if ((*s & 0x80) && (*(s+1) & 0x80)) Even though "\242\242\242\0" is an invalid EUC sequence, it still shouldn't be allowed to break the software. -Michael Robinson
> Tatsuo Ishii <t-ishii@sra.co.jp> writes: > >Yes, it's not a PostgreSQL's business but is a really big problem in > >the real world. Maybe some HTML gurus might have good suggestions on > >these issues (something like using a language tag?) > > The only solution is defensive programming. Even if there were a standard > that everyone followed, if malicious people could break things by not > following the standard, then you can be certain that somebody would do so. Defensive programming saves the system but does not user. Once corrupted data is stored in the system, it's totally useless for the user anyway. What about validating data *before* inserting it into a table? You expect EUC_CN data, and it should be possible to determine if the data is valid or not by doing some simple checking in most cases. Maybe I could provide a new libpq function something like: bool pg_validate_euc_cn(const unsigned char *euc_str); If it returns false, then euc_str is not a valid EUC_CN. So you show a message: "Sorry, but we only accepts EUC_CN data. Please try another input method..." or jump to other pages for EUC_TW or Big5 or whatever... Of course the function does not guarantee the string is 100% correct EUC_CN (on the other hand it can tell that the string is not valid) because: 1) there are chances that, for example, a EUC_CN string and a EUC_JP string has same bit patterns accidently. 2) I do not have enough information to implement it perfectly. At this point I could only perform minimal checking. However, it can be good a start point for someone who has more knowledge (on the other hand, I could implement pg_validate_euc_jp in much better way, since I have precise info for EUC_JP). > >Here it is. With this patch, copy out should be happy even with the > >wrong data. I'm not sure if it could be displayed correctly, though. > > Thank you very much. However, I think even this is too optimistic: > > >! if (*s & 0x80) > > Shouldn't it be something like: > > if ((*s & 0x80) && (*(s+1) & 0x80)) > > Even though "\242\242\242\0" is an invalid EUC sequence, it still shouldn't be > allowed to break the software. Thanks for the suggestion. More robust code is always good. -- Tatsuo Ishii