Thread: Multibyte still broken
These are exerpts from a message from Tatsuo Ishii dated January 26, on the subject of fragile code in the multibyte routines: ---- begin ---- Defensive programming saves the system but does not user. Once corrupted data is stored in the system, it's totally useless for the user anyway. What about validating data *before* inserting it into a table? ---- end ---- ---- begin ---- > >Here it is. With this patch, copy out should be happy even with the > >wrong data. I'm not sure if it could be displayed correctly, though. > > Thank you very much. However, I think even this is too optimistic: > > >! if (*s & 0x80) > > Shouldn't it be something like: > > if ((*s & 0x80) && (*(s+1) & 0x80)) > > Even though "\242\242\242\0" is an invalid EUC sequence, it still shouldn't be > allowed to break the software. Thanks for the suggestion. More robust code is always good. ---- end ---- More robust code may always be good, but "good" apparently doesn't always go into the tree. Imagine my surprise, while upgrading a production server from 6.5.3 to 7.0, when the data dumped from the old database failed to load into the new database (well, crashed the backend, to be specific). Apparently the "validate your own damn data" sentiment of the first excerpt above has prevailed, because, on inspection, the MB code is just as fragile as it was five months ago. I was forced to perform emergency repairs to my database dump file to fool a non-multibyte 7.0 into accepting it. Since EUC_CN is compatible with Latin-1, and since the benefits of multibyte are small compared to the risks, I intend to stick with unibyte Postgres henceforth. I would, though, recommend a warning in the "INSTALL" file along the lines of: "WARNING: Use of improperly-encoded text with multi-byte support enabled WILL lead to data corruption and/or loss. Donot enable multi-byte support unless you intend to fully validate your own damn data." -Michael Robinson
Michael Robinson <robinson@netrinsics.com> writes: > More robust code may always be good, but "good" apparently doesn't always go > into the tree. Imagine my surprise, while upgrading a production server > from 6.5.3 to 7.0, when the data dumped from the old database failed to load > into the new database (well, crashed the backend, to be specific). Sounds like someone failed to follow through on applying that fix :-(. Patches have been known to slip through the cracks before, especially when not fully worked out and formally submitted. > I would, though, recommend a warning in the "INSTALL" file along the lines of: > "WARNING: Use of improperly-encoded text with multi-byte support enabled > WILL lead to data corruption and/or loss. Do not enable multi-byte support > unless you intend to fully validate your own damn data." Sorry you had trouble, but the above seems uncalled-for. What would be more productive is a submitted patch to apply against 7.0. regards, tom lane
Ok, it appears the database backend is being stupid again, I'm getting all kinds of errors. From my PHP code, I'm getting this kind of error now: heap_beginscan: !RelationIsValid(relation) in classes/db_pgsql.inc on line 63 Database error: Invalid SQL: select user_id_hash,session_id,user_id,agency_id,date_part('epoch',created) as start,date_part('epoch',updated) as age from users_online where user_id_hash='9de5085eac89de8be0169b6dc6801524' PostgreSQL Error: 1 (ERROR: heap_beginscan: !RelationIsValid(relation) ) Session halted I got "Sorry, too many clients already" -- a little while ago.. I'm running at debug level 3, but have nothing to show for it in the log file except this : NOTICE: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. /usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process 44639 /usr/local/pgsql/bin/postmaster: CleanupProc: reinitializing shared memory and semaphores shmem_exit(0) [#0] So.. Now that I've got it to break, any ideas on what I can do to figure out why it's happening? This is still a 6.5.3 server.. (I weas going to upgrade today believe it or not.) - Mitch "The only real failure is quitting."
On Wed, 10 May 2000, Mitch Vincent wrote: > > Ok, it appears the database backend is being stupid again, I'm getting all > kinds of errors. > > >From my PHP code, I'm getting this kind of error now: > > > heap_beginscan: > !RelationIsValid(relation) in classes/db_pgsql.inc on line 63 > Database error: Invalid SQL: select > user_id_hash,session_id,user_id,agency_id,date_part('epoch',created) as > start,date_part('epoch',updated) as age from users_online where > user_id_hash='9de5085eac89de8be0169b6dc6801524' > PostgreSQL Error: 1 (ERROR: heap_beginscan: !RelationIsValid(relation) ) > Session halted > > I got "Sorry, too many clients already" -- a little while ago.. > > I'm running at debug level 3, but have nothing to show for it in the log > file except this : > > NOTICE: Message from PostgreSQL backend: > The Postmaster has informed me that some other backend died > abnormally and possibly corrupted shared memory. > I have rolled back the current transaction and am going to terminate > your database system connection and exit. > Please reconnect to the database system and repeat your query. > /usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process > 44639 > /usr/local/pgsql/bin/postmaster: CleanupProc: reinitializing shared memory > and semaphores > shmem_exit(0) [#0] > > So.. Now that I've got it to break, any ideas on what I can do to figure out > why it's happening? This is still a 6.5.3 server.. (I weas going to upgrade > today believe it or not.) My suggestions at this time, with v7.0 relesed (and assuming you aren't using ident?) is to get the upgrade over and done with, so that we can debug that. Its hard to spend the resources on debugging v6.5.3 when a) you are planning on upgrading anyway and b) it might already be fixed in v7.0 :( At least if you can generate the errors with v7.0, its fresh in ppls minds ... Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Already going... - Mitch "The only real failure is quitting." ----- Original Message ----- From: The Hermit Hacker <scrappy@hub.org> To: Mitch Vincent <mitch@huntsvilleal.com> Cc: <pgsql-hackers@hub.org> Sent: Wednesday, May 10, 2000 1:01 PM Subject: Re: [HACKERS] Great, big errors ... Again. > On Wed, 10 May 2000, Mitch Vincent wrote: > > > > > Ok, it appears the database backend is being stupid again, I'm getting all > > kinds of errors. > > > > >From my PHP code, I'm getting this kind of error now: > > > > > > heap_beginscan: > > !RelationIsValid(relation) in classes/db_pgsql.inc on line 63 > > Database error: Invalid SQL: select > > user_id_hash,session_id,user_id,agency_id,date_part('epoch',created) as > > start,date_part('epoch',updated) as age from users_online where > > user_id_hash='9de5085eac89de8be0169b6dc6801524' > > PostgreSQL Error: 1 (ERROR: heap_beginscan: !RelationIsValid(relation) ) > > Session halted > > > > I got "Sorry, too many clients already" -- a little while ago.. > > > > I'm running at debug level 3, but have nothing to show for it in the log > > file except this : > > > > NOTICE: Message from PostgreSQL backend: > > The Postmaster has informed me that some other backend died > > abnormally and possibly corrupted shared memory. > > I have rolled back the current transaction and am going to terminate > > your database system connection and exit. > > Please reconnect to the database system and repeat your query. > > /usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process > > 44639 > > /usr/local/pgsql/bin/postmaster: CleanupProc: reinitializing shared memory > > and semaphores > > shmem_exit(0) [#0] > > > > So.. Now that I've got it to break, any ideas on what I can do to figure out > > why it's happening? This is still a 6.5.3 server.. (I weas going to upgrade > > today believe it or not.) > > My suggestions at this time, with v7.0 relesed (and assuming you aren't > using ident?) is to get the upgrade over and done with, so that we can > debug that. Its hard to spend the resources on debugging v6.5.3 when > a) you are planning on upgrading anyway and b) it might already be fixed > in v7.0 :( > > At least if you can generate the errors with v7.0, its fresh in ppls minds > ... > > Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy > Systems Administrator @ hub.org > primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org > >
> More robust code may always be good, but "good" apparently doesn't always go > into the tree. Imagine my surprise, while upgrading a production server > from 6.5.3 to 7.0, when the data dumped from the old database failed to load > into the new database (well, crashed the backend, to be specific). > > Apparently the "validate your own damn data" sentiment of the first excerpt > above has prevailed, because, on inspection, the MB code is just as fragile > as it was five months ago. > > I was forced to perform emergency repairs to my database dump file to fool a > non-multibyte 7.0 into accepting it. Since EUC_CN is compatible with > Latin-1, and since the benefits of multibyte are small compared to the > risks, I intend to stick with unibyte Postgres henceforth. > > I would, though, recommend a warning in the "INSTALL" file along the lines of: > > "WARNING: Use of improperly-encoded text with multi-byte support enabled > WILL lead to data corruption and/or loss. Do not enable multi-byte support > unless you intend to fully validate your own damn data." Sorry for the problem. I forgot about issue:-< What I'm thinking now to fix the problem you found is that doing data validataion in the text/var/char input functions, rather than tweaking the mb functions. If corrupted MB string was found, then call elog(ERROR) to abort the transation. Will appear in 7.0.1 unless someone objects. -- Tatsuo Ishii
Tom Lane <tgl@sss.pgh.pa.us> writes: >Sorry you had trouble, Trouble I had foreseen, which I had brought to the mailing list, and for which I had provided an explanation and fix, by the way. >but the above seems uncalled-for. What would be >more productive is a submitted patch to apply against 7.0. Actually, I think there are three separate issues here. 1. There was a potentially serious problem, and it didn't get on Bruce's master TODO list, which is the authoritative referencefor what in Postgres needs to be fixed. I take responsibility for not seeing to it that this was done. 2. By its nature, the multi-byte code is poorly overseen, and poorly exercised. This is understandable, as the overwhelmingmajority of Postgres installations don't want or need it. However, the result is that it's effectively a"contrib"-quality component in the Postgres core. 3. The multi-byte support, as it currently exists, is founded on a particular philosophy, one which I argue is not the mostpragmatic. In the East Asia that I know and love, multibyte support, as a rule, is an ugly hack on top of unibytetools and infrastructure. The exception is end-to-end Unicode systems, which can be relied on to produce predictabledata. However, I've never encountered a native Simplified Chinese GB application in which it was the leastbit difficult to produce "illegal" code sequences. In a real-world environment, with email attachments, cut-and-paste,and whatnot, it's practically inevitable. Thus, I cannot accept a situation where my database aborts on, or otherwise rejects data that is produced and acceptedby all other tools in the work environment, and will stick with a unibyte build as long as this is the case. -Michael Robinson
> Thus, I cannot accept a situation where my database aborts on, or > otherwise rejects data that is produced and accepted by all other tools > in the work environment, and will stick with a unibyte build as long as > this is the case. We are planning on working on NATIONAL CHARACTER and CHARACTER SET features for the next release, which would move MB into a supported feature which can coexist with ascii in the standard installation. Would you be interested in contributing to that, either with code or simply with implementation ideas and testing? I know that Tatsuo is planning on participating, and I'm looking forward to working on it too, though as you point out those of us who can get by with ascii text don't know how to exercise MB capabilities so I'll need suggestions and help. Regards. - Thomas -- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
> > What I'm thinking now to fix the problem you found is that doing data > > validataion in the text/var/char input functions, rather than tweaking > > the mb functions. > > Could you explain why? I'd sure like it better if the mb code checked for > mb things. It's simple: Do not allow corrupted data imported into database. To acomplish this, of course we have to call existing MB functions, though. -- Tatsuo Ishii
> 3. The multi-byte support, as it currently exists, is founded on a particular > philosophy, one which I argue is not the most pragmatic. In the East Asia > that I know and love, multibyte support, as a rule, is an ugly hack on top > of unibyte tools and infrastructure. The exception is end-to-end Unicode > systems, which can be relied on to produce predictable data. However, I've > never encountered a native Simplified Chinese GB application in which it > was the least bit difficult to produce "illegal" code sequences. In a > real-world environment, with email attachments, cut-and-paste, and whatnot, > it's practically inevitable. I am supprised to hear that you have so poor quality tools that produce illegal code sequences of Simplified Chinese. In Japan, as far as I know, we never have such a low quality tools which generate illegal Japanese charaters just because they are not accepted in the market, even in the case of email attachments, or cut-and-past or whatever. I would like to hear from one living in others countries such as Taiwan or Korea about their situations. Anyway, IMHO letting users know that they have corrupted data is much better than making them believe that their data are ok. -- Tatsuo Ishii
Thomas Lockhart <lockhart@alumni.caltech.edu> writes: >We are planning on working on NATIONAL CHARACTER and CHARACTER SET >features for the next release, which would move MB into a supported >feature which can coexist with ascii in the standard installation. >Would you be interested in contributing to that, either with code or >simply with implementation ideas and testing? It will be my pleasure to provide as much support as my circumstances permit. Unfortunately, though, my circumstances lately don't permit a whole lot :-(. -Michael
> Tom Lane <tgl@sss.pgh.pa.us> writes: > >Sorry you had trouble, > > Trouble I had foreseen, which I had brought to the mailing list, and for > which I had provided an explanation and fix, by the way. > > >but the above seems uncalled-for. What would be > >more productive is a submitted patch to apply against 7.0. > > Actually, I think there are three separate issues here. > > 1. There was a potentially serious problem, and it didn't get on Bruce's > master TODO list, which is the authoritative reference for what in Postgres > needs to be fixed. I take responsibility for not seeing to it that this > was done. Please tell me what to add. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Tatsuo Ishii <t-ishii@sra.co.jp> writes: >I am supprised to hear that you have so poor quality tools that >produce illegal code sequences of Simplified Chinese. In Japan, as far >as I know, we never have such a low quality tools which generate >illegal Japanese charaters just because they are not accepted in the >market, even in the case of email attachments, or cut-and-past or >whatever. The problem is not that the tools produce "illegal characters". The problem is that, as an EUC code, GB permits the coexistance of standard ascii characters with double-byte hanzi characters. Furthermore, most Chinese software is an operating-system "hack" on top of English-language software based on a Latin-1 character set (the Chinese software market is underserved compared to Japan, so we have to cope as best we can). The result is that it is possible to, for example, insert a carriage return or ASCII comma into the middle of a hanzi, which breaks the alignment for all the hanzi on the rest of the line. It's also possible, in non-native Chinese applications, to select one byte of a hanzi character in a cut or copy operation. So the problem is that the tools do not uniformly respect the integrity of a double-byte hanzi character, but rather treat it as two individual Latin-1 characters. The important point, though, is that all tools, whether native Chinese or "hacked" English, accept the resulting invalid code sequences consistently, robustly, and without complaint. -Michael
Bruce Momjian <pgman@candle.pha.pa.us> writes: >> 1. There was a potentially serious problem, and it didn't get on Bruce's >> master TODO list, which is the authoritative reference for what in Postgres >> needs to be fixed. I take responsibility for not seeing to it that this >> was done. > >Please tell me what to add. TODO: Multibyte encoding/decoding routines need to handle invalid character sequences safely (and, if possible, gracefully). -Michael Robinson
I am going to need a confirmation from someone else before adding it to the TODO list. > Bruce Momjian <pgman@candle.pha.pa.us> writes: > >> 1. There was a potentially serious problem, and it didn't get on Bruce's > >> master TODO list, which is the authoritative reference for what in Postgres > >> needs to be fixed. I take responsibility for not seeing to it that this > >> was done. > > > >Please tell me what to add. > > TODO: Multibyte encoding/decoding routines need to handle invalid character > sequences safely (and, if possible, gracefully). > > -Michael Robinson > > -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026