Thread: fatal copy in/out error (6.5.3)

fatal copy in/out error (6.5.3)

From
Michael Robinson
Date:
Do this:
   testdb=> create table foo (word varchar(30));   CREATE   testdb=> insert into foo values ('\217\210');   INSERT
23372891   testdb=> copy "foo" to '/usr/local/pgsql/foo.out';   COPY   testdb=> \q   % od -c foo.out   0000000  217 210
\0  \n                                                   0000004
 

One run produced:   0000000  217 210  \0  \b  \n                                                

In all cases, though, it emits a null.  This, of course, causes "copy from" to
fail when it tries to parse a null-containing field in a multi-field record.

As for what the sequence "\217\210" represents, that's not really relevant
(it's not even a valid Chinese character, so far as I can tell).  Someone 
just entered this into our web interface and broke our database.
-Michael Robinson

P.S. As an aside, if "copy from" recognizes the \xxx octal convention, why
doesn't "copy to" use it to represent everything outside of \040..\176?


Re: [HACKERS] fatal copy in/out error (6.5.3)

From
Tom Lane
Date:
Michael Robinson <robinson@netrinsics.com> writes:
>     testdb=> create table foo (word varchar(30));
>     CREATE
>     testdb=> insert into foo values ('\217\210');
>     INSERT 2337289 1
>     testdb=> copy "foo" to '/usr/local/pgsql/foo.out';
>     COPY
>     testdb=> \q
>     % od -c foo.out
>     0000000  217 210  \0  \n                                                
>     0000004

Hmm.  I get0000000 217 210  \n0000003
with either current sources or 6.5.3.  I don't have MULTIBYTE nor
USE_LOCALE turned on, though.  What is your configuration exactly?

This looks to me like something is deciding that \217 must be the
start of a 3-byte multibyte character... in which case, it should have
appeared that way in your database, I think.
        regards, tom lane


Re: [HACKERS] fatal copy in/out error (6.5.3)

From
Michael Robinson
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
>Hmm.  I get
>    0000000 217 210  \n
>    0000003
>with either current sources or 6.5.3.  I don't have MULTIBYTE nor
>USE_LOCALE turned on, though.  What is your configuration exactly?

My config line:./configure  --with-mb=EUC_CN

I can forward my include/config.h if that would be helpful.

The OS is FreeBSD 3.4-RELEASE.

>This looks to me like something is deciding that \217 must be the
>start of a 3-byte multibyte character... in which case, it should have
>appeared that way in your database, I think.

That would be very weird, if true.  \217 is certainly not the beginning of
a UTF-8 three-byte sequence, and EUC doesn't have three-byte codes.
-Michael Robinson



Re: [HACKERS] fatal copy in/out error (6.5.3)

From
Tom Lane
Date:
>> This looks to me like something is deciding that \217 must be the
>> start of a 3-byte multibyte character... in which case, it should have
>> appeared that way in your database, I think.

> That would be very weird, if true.  \217 is certainly not the beginning of
> a UTF-8 three-byte sequence, and EUC doesn't have three-byte codes.

Hmm.  And I don't suppose it's real likely that \217 is a one-byte code
and \210 starts a two-byte code?

I'm out of my depth here --- it seems this is almost certainly a
MULTIBYTE issue, but I know very little about MULTIBYTE.  I'm going
to punt and hope some of our MULTIBYTE experts pick it up.
        regards, tom lane


Re: [HACKERS] fatal copy in/out error (6.5.3)

From
Tatsuo Ishii
Date:
> My config line:
>     ./configure  --with-mb=EUC_CN
> 
> I can forward my include/config.h if that would be helpful.
> 
> The OS is FreeBSD 3.4-RELEASE.
> 
> >This looks to me like something is deciding that \217 must be the
> >start of a 3-byte multibyte character... in which case, it should have
> >appeared that way in your database, I think.

I suspect that too.

> That would be very weird, if true.  \217 is certainly not the beginning of
> a UTF-8 three-byte sequence, and EUC doesn't have three-byte codes.

No. some EUC's (EUC_TW and EUC_JP) has three-byte or even four-byte
codes.  But you said your database has been configured as EUC_CN. As
far as I know, it only uses 1 or 2 byte-code. Another thing I am
confused is that ' \217\210' is not a valid EUC_CN data at all. \217
(0x8f) specifies code set 3 which does not exist in EUC_CN.  In this
case, it is assumed that the multi-byte word to be consisted of 3-byte
code in the current implementation of PostgreSQL.

In short, the problem you have is caused by:

1) wrong data submitted into the table
2) PostgreSQL assumes the data is consisted of 3 bytes data

I would recommend you delete the data since it's not correct anyway.
In the mean time I'm going to fix 2) so that it assumes data be
consisted of 2 bytes even if wrong data sequence is submitted
(needless to say, except ascii).
Do you want the backpatch for 6.5.3?
--
Tatsuo Ishii


Re: [HACKERS] fatal copy in/out error (6.5.3)

From
Michael Robinson
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>No. some EUC's (EUC_TW and EUC_JP) has three-byte or even four-byte
>codes.  But you said your database has been configured as EUC_CN. As
>far as I know, it only uses 1 or 2 byte-code. Another thing I am
>confused is that ' \217\210' is not a valid EUC_CN data at all. \217
>(0x8f) specifies code set 3 which does not exist in EUC_CN.  In this
>case, it is assumed that the multi-byte word to be consisted of 3-byte
>code in the current implementation of PostgreSQL.

It could be that one of our users had their input method set to produce
EUC_TW or Big5.

>In short, the problem you have is caused by:
>1) wrong data submitted into the table

Kind of hard to control that when data is submitted by random users on
the Internet.

>I would recommend you delete the data since it's not correct anyway.
>In the mean time I'm going to fix 2) so that it assumes data be
>consisted of 2 bytes even if wrong data sequence is submitted
>(needless to say, except ascii).
>Do you want the backpatch for 6.5.3?

Very much so.  Thank you.
-Michael



Re: [HACKERS] fatal copy in/out error (6.5.3)

From
Tatsuo Ishii
Date:
> It could be that one of our users had their input method set to produce
> EUC_TW or Big5.
>
> >In short, the problem you have is caused by:
> >1) wrong data submitted into the table
> 
> Kind of hard to control that when data is submitted by random users on
> the Internet.

Yes, it's not a PostgreSQL's business but is a really big problem in
the real world. Maybe some HTML gurus might have good suggestions on
these issues (something like using a language tag?)

> >I would recommend you delete the data since it's not correct anyway.
> >In the mean time I'm going to fix 2) so that it assumes data be
> >consisted of 2 bytes even if wrong data sequence is submitted
> >(needless to say, except ascii).
> >Do you want the backpatch for 6.5.3?
> 
> Very much so.  Thank you.

Here it is. With this patch, copy out should be happy even with the
wrong data. I'm not sure if it could be displayed correctly, though.
--
Tatsuo Ishii
------------------------------- cut here ------------------------------------
*** postgresql-6.5.3/src/backend/utils/mb/wchar.c.orig    Tue Jan 25 13:37:21 2000
--- postgresql-6.5.3/src/backend/utils/mb/wchar.c    Tue Jan 25 13:37:33 2000
***************
*** 166,176 **** {     int            len; 
!     if (*s == SS2)
!         len = 3;
!     else if (*s == SS3)
!         len = 3;
!     else if (*s & 0x80)         len = 2;     else         len = 1;
--- 166,172 ---- {     int            len; 
!     if (*s & 0x80)         len = 2;     else         len = 1;


Re: [HACKERS] fatal copy in/out error (6.5.3)

From
Michael Robinson
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>Yes, it's not a PostgreSQL's business but is a really big problem in
>the real world. Maybe some HTML gurus might have good suggestions on
>these issues (something like using a language tag?)

The only solution is defensive programming.  Even if there were a standard
that everyone followed, if malicious people could break things by not
following the standard, then you can be certain that somebody would do so.

>Here it is. With this patch, copy out should be happy even with the
>wrong data. I'm not sure if it could be displayed correctly, though.

Thank you very much.  However, I think even this is too optimistic:

>!     if (*s & 0x80)

Shouldn't it be something like:
   if ((*s & 0x80) && (*(s+1) & 0x80))

Even though "\242\242\242\0" is an invalid EUC sequence, it still shouldn't be
allowed to break the software.
-Michael Robinson



Re: [HACKERS] fatal copy in/out error (6.5.3)

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >Yes, it's not a PostgreSQL's business but is a really big problem in
> >the real world. Maybe some HTML gurus might have good suggestions on
> >these issues (something like using a language tag?)
> 
> The only solution is defensive programming.  Even if there were a standard
> that everyone followed, if malicious people could break things by not
> following the standard, then you can be certain that somebody would do so.

Defensive programming saves the system but does not user. Once
corrupted data is stored in the system, it's totally useless for the
user anyway.  What about validating data *before* inserting it into a
table? You expect EUC_CN data, and it should be possible to determine
if the data is valid or not by doing some simple checking in most
cases. Maybe I could provide a new libpq function something like:
bool pg_validate_euc_cn(const unsigned char *euc_str);

If it returns false, then euc_str is not a valid EUC_CN.
So you show a message:
"Sorry, but we only accepts EUC_CN data. Please try another input
method..." or jump to other pages for EUC_TW or Big5 or whatever...

Of course the function does not guarantee the string is 100% correct
EUC_CN (on the other hand it can tell that the string is not
valid) because:

1) there are chances that, for example, a EUC_CN string and a EUC_JP
string has same bit patterns accidently.

2) I do not have enough information to implement it perfectly. At this
point I could only perform minimal checking. However, it can be good
a start point for someone who has more knowledge (on the other hand, I
could implement pg_validate_euc_jp in much better way, since I have
precise info for EUC_JP).

> >Here it is. With this patch, copy out should be happy even with the
> >wrong data. I'm not sure if it could be displayed correctly, though.
> 
> Thank you very much.  However, I think even this is too optimistic:
> 
> >!     if (*s & 0x80)
> 
> Shouldn't it be something like:
> 
>     if ((*s & 0x80) && (*(s+1) & 0x80))
> 
> Even though "\242\242\242\0" is an invalid EUC sequence, it still shouldn't be
> allowed to break the software.

Thanks for the suggestion. More robust code is always good.
--
Tatsuo Ishii