Thread: Multibyte still broken

Multibyte still broken

From
Michael Robinson
Date:
These are exerpts from a message from Tatsuo Ishii dated January 26, on
the subject of fragile code in the multibyte routines:

---- begin ----
Defensive programming saves the system but does not user. Once
corrupted data is stored in the system, it's totally useless for the
user anyway.  What about validating data *before* inserting it into a
table?
---- end ----

---- begin ----
> >Here it is. With this patch, copy out should be happy even with the
> >wrong data. I'm not sure if it could be displayed correctly, though.
> 
> Thank you very much.  However, I think even this is too optimistic:
> 
> >!     if (*s & 0x80)
> 
> Shouldn't it be something like:
> 
>     if ((*s & 0x80) && (*(s+1) & 0x80))
> 
> Even though "\242\242\242\0" is an invalid EUC sequence, it still shouldn't be
> allowed to break the software.

Thanks for the suggestion. More robust code is always good.
---- end ----

More robust code may always be good, but "good" apparently doesn't always go
into the tree.  Imagine my surprise, while upgrading a production server
from 6.5.3 to 7.0, when the data dumped from the old database failed to load
into the new database (well, crashed the backend, to be specific).

Apparently the "validate your own damn data" sentiment of the first excerpt
above has prevailed, because, on inspection, the MB code is just as fragile
as it was five months ago.

I was forced to perform emergency repairs to my database dump file to fool a 
non-multibyte 7.0 into accepting it.  Since EUC_CN is compatible with 
Latin-1, and since the benefits of multibyte are small compared to the 
risks, I intend to stick with unibyte Postgres henceforth.

I would, though, recommend a warning in the "INSTALL" file along the lines of:
 "WARNING: Use of improperly-encoded text with multi-byte support enabled  WILL lead to data corruption and/or loss.
Donot enable multi-byte support  unless you intend to fully validate your own damn data."
 
-Michael Robinson



Re: Multibyte still broken

From
Tom Lane
Date:
Michael Robinson <robinson@netrinsics.com> writes:
> More robust code may always be good, but "good" apparently doesn't always go
> into the tree.  Imagine my surprise, while upgrading a production server
> from 6.5.3 to 7.0, when the data dumped from the old database failed to load
> into the new database (well, crashed the backend, to be specific).

Sounds like someone failed to follow through on applying that fix :-(.
Patches have been known to slip through the cracks before, especially
when not fully worked out and formally submitted.

> I would, though, recommend a warning in the "INSTALL" file along the lines of:

>   "WARNING: Use of improperly-encoded text with multi-byte support enabled
>    WILL lead to data corruption and/or loss.  Do not enable multi-byte support
>    unless you intend to fully validate your own damn data."

Sorry you had trouble, but the above seems uncalled-for.  What would be
more productive is a submitted patch to apply against 7.0.
        regards, tom lane


Great, big errors ... Again.

From
"Mitch Vincent"
Date:
Ok, it appears the database backend is being stupid again, I'm getting all
kinds of errors.

From my PHP code, I'm getting this kind of error now:


heap_beginscan:
!RelationIsValid(relation) in classes/db_pgsql.inc on line 63
Database error: Invalid SQL: select
user_id_hash,session_id,user_id,agency_id,date_part('epoch',created) as
start,date_part('epoch',updated) as age from users_online where
user_id_hash='9de5085eac89de8be0169b6dc6801524'
PostgreSQL Error: 1 (ERROR: heap_beginscan: !RelationIsValid(relation) )
Session halted

I got "Sorry, too many clients already" -- a little while ago..

I'm running at debug level 3, but have nothing to show for it in the log
file except this :

NOTICE:  Message from PostgreSQL backend:       The Postmaster has informed me that some other backend died
abnormally and possibly corrupted shared memory.       I have rolled back the current transaction and am going to
terminate
your database system connection and exit.       Please reconnect to the database system and repeat your query.
/usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process
44639
/usr/local/pgsql/bin/postmaster: CleanupProc: reinitializing shared memory
and semaphores
shmem_exit(0) [#0]

So.. Now that I've got it to break, any ideas on what I can do to figure out
why it's happening?  This is still a 6.5.3 server.. (I weas going to upgrade
today believe it or not.)


- Mitch

"The only real failure is quitting."




Re: Great, big errors ... Again.

From
The Hermit Hacker
Date:
On Wed, 10 May 2000, Mitch Vincent wrote:

> 
> Ok, it appears the database backend is being stupid again, I'm getting all
> kinds of errors.
> 
> >From my PHP code, I'm getting this kind of error now:
> 
> 
> heap_beginscan:
> !RelationIsValid(relation) in classes/db_pgsql.inc on line 63
> Database error: Invalid SQL: select
> user_id_hash,session_id,user_id,agency_id,date_part('epoch',created) as
> start,date_part('epoch',updated) as age from users_online where
> user_id_hash='9de5085eac89de8be0169b6dc6801524'
> PostgreSQL Error: 1 (ERROR: heap_beginscan: !RelationIsValid(relation) )
> Session halted
> 
> I got "Sorry, too many clients already" -- a little while ago..
> 
> I'm running at debug level 3, but have nothing to show for it in the log
> file except this :
> 
> NOTICE:  Message from PostgreSQL backend:
>         The Postmaster has informed me that some other backend died
> abnormally and possibly corrupted shared memory.
>         I have rolled back the current transaction and am going to terminate
> your database system connection and exit.
>         Please reconnect to the database system and repeat your query.
> /usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process
> 44639
> /usr/local/pgsql/bin/postmaster: CleanupProc: reinitializing shared memory
> and semaphores
> shmem_exit(0) [#0]
> 
> So.. Now that I've got it to break, any ideas on what I can do to figure out
> why it's happening?  This is still a 6.5.3 server.. (I weas going to upgrade
> today believe it or not.)

My suggestions at this time, with v7.0 relesed (and assuming you aren't
using ident?) is to get the upgrade over and done with, so that we can
debug that.  Its hard to spend the resources on debugging v6.5.3 when
a) you are planning on upgrading anyway and b) it might already be fixed
in v7.0 :(

At least if you can generate the errors with v7.0, its fresh in ppls minds
...

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 




Re: Great, big errors ... Again.

From
"Mitch Vincent"
Date:
Already going...

- Mitch

"The only real failure is quitting."


----- Original Message -----
From: The Hermit Hacker <scrappy@hub.org>
To: Mitch Vincent <mitch@huntsvilleal.com>
Cc: <pgsql-hackers@hub.org>
Sent: Wednesday, May 10, 2000 1:01 PM
Subject: Re: [HACKERS] Great, big errors ... Again.


> On Wed, 10 May 2000, Mitch Vincent wrote:
>
> >
> > Ok, it appears the database backend is being stupid again, I'm getting
all
> > kinds of errors.
> >
> > >From my PHP code, I'm getting this kind of error now:
> >
> >
> > heap_beginscan:
> > !RelationIsValid(relation) in classes/db_pgsql.inc on line 63
> > Database error: Invalid SQL: select
> > user_id_hash,session_id,user_id,agency_id,date_part('epoch',created) as
> > start,date_part('epoch',updated) as age from users_online where
> > user_id_hash='9de5085eac89de8be0169b6dc6801524'
> > PostgreSQL Error: 1 (ERROR: heap_beginscan: !RelationIsValid(relation) )
> > Session halted
> >
> > I got "Sorry, too many clients already" -- a little while ago..
> >
> > I'm running at debug level 3, but have nothing to show for it in the log
> > file except this :
> >
> > NOTICE:  Message from PostgreSQL backend:
> >         The Postmaster has informed me that some other backend died
> > abnormally and possibly corrupted shared memory.
> >         I have rolled back the current transaction and am going to
terminate
> > your database system connection and exit.
> >         Please reconnect to the database system and repeat your query.
> > /usr/local/pgsql/bin/postmaster: CleanupProc: sending SIGUSR1 to process
> > 44639
> > /usr/local/pgsql/bin/postmaster: CleanupProc: reinitializing shared
memory
> > and semaphores
> > shmem_exit(0) [#0]
> >
> > So.. Now that I've got it to break, any ideas on what I can do to figure
out
> > why it's happening?  This is still a 6.5.3 server.. (I weas going to
upgrade
> > today believe it or not.)
>
> My suggestions at this time, with v7.0 relesed (and assuming you aren't
> using ident?) is to get the upgrade over and done with, so that we can
> debug that.  Its hard to spend the resources on debugging v6.5.3 when
> a) you are planning on upgrading anyway and b) it might already be fixed
> in v7.0 :(
>
> At least if you can generate the errors with v7.0, its fresh in ppls minds
> ...
>
> Marc G. Fournier                   ICQ#7615664               IRC Nick:
Scrappy
> Systems Administrator @ hub.org
> primary: scrappy@hub.org           secondary:
scrappy@{freebsd|postgresql}.org
>
>



Re: Multibyte still broken

From
Tatsuo Ishii
Date:
> More robust code may always be good, but "good" apparently doesn't always go
> into the tree.  Imagine my surprise, while upgrading a production server
> from 6.5.3 to 7.0, when the data dumped from the old database failed to load
> into the new database (well, crashed the backend, to be specific).
> 
> Apparently the "validate your own damn data" sentiment of the first excerpt
> above has prevailed, because, on inspection, the MB code is just as fragile
> as it was five months ago.
> 
> I was forced to perform emergency repairs to my database dump file to fool a 
> non-multibyte 7.0 into accepting it.  Since EUC_CN is compatible with 
> Latin-1, and since the benefits of multibyte are small compared to the 
> risks, I intend to stick with unibyte Postgres henceforth.
> 
> I would, though, recommend a warning in the "INSTALL" file along the lines of:
> 
>   "WARNING: Use of improperly-encoded text with multi-byte support enabled
>    WILL lead to data corruption and/or loss.  Do not enable multi-byte support
>    unless you intend to fully validate your own damn data."

Sorry for the problem. I forgot about issue:-<

What I'm thinking now to fix the problem you found is that doing data
validataion in the text/var/char input functions, rather than tweaking
the mb functions. If corrupted MB string was found, then call
elog(ERROR) to abort the transation. Will appear in 7.0.1 unless
someone objects.
--
Tatsuo Ishii


Re: Multibyte still broken

From
Michael Robinson
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
>Sorry you had trouble,

Trouble I had foreseen, which I had brought to the mailing list, and for
which I had provided an explanation and fix, by the way.

>but the above seems uncalled-for.  What would be
>more productive is a submitted patch to apply against 7.0.

Actually, I think there are three separate issues here.

1. There was a potentially serious problem, and it didn't get on Bruce's  master TODO list, which is the authoritative
referencefor what in Postgres  needs to be fixed.  I take responsibility for not seeing to it that this  was done.  
 

2. By its nature, the multi-byte code is poorly overseen, and poorly   exercised.  This is understandable, as the
overwhelmingmajority of   Postgres installations don't want or need it.  However, the result is that  it's effectively
a"contrib"-quality component in the Postgres core.
 

3. The multi-byte support, as it currently exists, is founded on a particular  philosophy, one which I argue is not the
mostpragmatic.  In the East Asia  that I know and love, multibyte support, as a rule, is an ugly hack on top  of
unibytetools and infrastructure.  The exception is end-to-end Unicode  systems, which can be relied on to produce
predictabledata.  However, I've  never encountered a native Simplified Chinese GB application in which it  was the
leastbit difficult to produce "illegal" code sequences.  In a  real-world environment, with email attachments,
cut-and-paste,and whatnot,  it's practically inevitable.
 
  Thus, I cannot accept a situation where my database aborts on, or  otherwise rejects data that is produced and
acceptedby all other tools  in the work environment, and will stick with a unibyte build as long as  this is the case.
 
-Michael Robinson



Re: Multibyte still broken

From
Thomas Lockhart
Date:
>    Thus, I cannot accept a situation where my database aborts on, or
>    otherwise rejects data that is produced and accepted by all other tools
>    in the work environment, and will stick with a unibyte build as long as
>    this is the case.

We are planning on working on NATIONAL CHARACTER and CHARACTER SET
features for the next release, which would move MB into a supported
feature which can coexist with ascii in the standard installation.
Would you be interested in contributing to that, either with code or
simply with implementation ideas and testing? I know that Tatsuo is
planning on participating, and I'm looking forward to working on it
too, though as you point out those of us who can get by with ascii
text don't know how to exercise MB capabilities so I'll need
suggestions and help.

Regards.
                   - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: Multibyte still broken

From
Tatsuo Ishii
Date:
> > What I'm thinking now to fix the problem you found is that doing data
> > validataion in the text/var/char input functions, rather than tweaking
> > the mb functions.
> 
> Could you explain why? I'd sure like it better if the mb code checked for
> mb things.

It's simple: 

Do not allow corrupted data imported into database.

To acomplish this, of course we have to call existing MB functions, though.
--
Tatsuo Ishii


Re: Multibyte still broken

From
Tatsuo Ishii
Date:
> 3. The multi-byte support, as it currently exists, is founded on a particular
>    philosophy, one which I argue is not the most pragmatic.  In the East Asia
>    that I know and love, multibyte support, as a rule, is an ugly hack on top
>    of unibyte tools and infrastructure.  The exception is end-to-end Unicode
>    systems, which can be relied on to produce predictable data.  However, I've
>    never encountered a native Simplified Chinese GB application in which it
>    was the least bit difficult to produce "illegal" code sequences.  In a
>    real-world environment, with email attachments, cut-and-paste, and whatnot,
>    it's practically inevitable.

I am supprised to hear that you have so poor quality tools that
produce illegal code sequences of Simplified Chinese. In Japan, as far
as I know, we never have such a low quality tools which generate
illegal Japanese charaters just because they are not accepted in the
market, even in the case of email attachments, or cut-and-past or
whatever. I would like to hear from one living in others countries
such as Taiwan or Korea about their situations.

Anyway, IMHO letting users know that they have corrupted data is much
better than making them believe that their data are ok.
--
Tatsuo Ishii


Re: Multibyte still broken

From
Michael Robinson
Date:
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
>We are planning on working on NATIONAL CHARACTER and CHARACTER SET
>features for the next release, which would move MB into a supported
>feature which can coexist with ascii in the standard installation.
>Would you be interested in contributing to that, either with code or
>simply with implementation ideas and testing?

It will be my pleasure to provide as much support as my circumstances permit.
Unfortunately, though, my circumstances lately don't permit a whole lot :-(.
-Michael



Re: Multibyte still broken

From
Bruce Momjian
Date:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> >Sorry you had trouble,
> 
> Trouble I had foreseen, which I had brought to the mailing list, and for
> which I had provided an explanation and fix, by the way.
> 
> >but the above seems uncalled-for.  What would be
> >more productive is a submitted patch to apply against 7.0.
> 
> Actually, I think there are three separate issues here.
> 
> 1. There was a potentially serious problem, and it didn't get on Bruce's
>    master TODO list, which is the authoritative reference for what in Postgres
>    needs to be fixed.  I take responsibility for not seeing to it that this
>    was done.  

Please tell me what to add.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Multibyte still broken

From
Michael Robinson
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>I am supprised to hear that you have so poor quality tools that
>produce illegal code sequences of Simplified Chinese. In Japan, as far
>as I know, we never have such a low quality tools which generate
>illegal Japanese charaters just because they are not accepted in the
>market, even in the case of email attachments, or cut-and-past or
>whatever.

The problem is not that the tools produce "illegal characters".  The problem
is that, as an EUC code, GB permits the coexistance of standard ascii
characters with double-byte hanzi characters.  Furthermore, most Chinese 
software is an operating-system "hack" on top of English-language software
based on a Latin-1 character set (the Chinese software market is underserved
compared to Japan, so we have to cope as best we can).

The result is that it is possible to, for example, insert a carriage return
or ASCII comma into the middle of a hanzi, which breaks the alignment for all 
the hanzi on the rest of the line.  It's also possible, in non-native Chinese
applications, to select one byte of a hanzi character in a cut or copy 
operation.

So the problem is that the tools do not uniformly respect the integrity of
a double-byte hanzi character, but rather treat it as two individual Latin-1
characters.

The important point, though, is that all tools, whether native Chinese or
"hacked" English, accept the resulting invalid code sequences consistently,
robustly, and without complaint.
-Michael



Re: Multibyte still broken

From
Michael Robinson
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> 1. There was a potentially serious problem, and it didn't get on Bruce's
>>    master TODO list, which is the authoritative reference for what in Postgres
>>    needs to be fixed.  I take responsibility for not seeing to it that this
>>    was done.  
>
>Please tell me what to add.

TODO: Multibyte encoding/decoding routines need to handle invalid character     sequences safely (and, if possible,
gracefully).
-Michael Robinson



Re: Multibyte still broken

From
Bruce Momjian
Date:
I am going to need a confirmation from someone else before adding it to
the TODO list.

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> 1. There was a potentially serious problem, and it didn't get on Bruce's
> >>    master TODO list, which is the authoritative reference for what in Postgres
> >>    needs to be fixed.  I take responsibility for not seeing to it that this
> >>    was done.  
> >
> >Please tell me what to add.
> 
> TODO: Multibyte encoding/decoding routines need to handle invalid character
>       sequences safely (and, if possible, gracefully).
> 
>     -Michael Robinson
> 
> 


--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026