Thread: psql crashes on encoding mismatch
I found a crash case (assertion failure) when runing psql -f utf8_encoded_script.sql against client_encoding = shift_jis in postgresql.conf. Though encoding mismatch is obviously user's fault, a crash doesn't explain anything to him. The thing is, prepare_buffer() in psqlscan.l assumes PQmblen() always returns the appropriate length, but it actually isn't. newtxt can be overflowed on those cases into the following 2 byte NULLs, which is filled in the beginning of prepare_buffer(). It results in that yy_scan_buffer() returns NULL by design since the input buffer's following 2 bytes are not NULL. Then, psql_assert(state->scanbufhandle) in psql_scan() detects bug later. This bug can be occurred not only in shift_jis but also big5, bgk, etc. "unsafe" encodings. The attached is to fix it. Just double check not to pad overflowed 0xff for the input buffer. If you need unit case I'll send it, but the problem seems clear. Regards, -- Hitoshi Harada
Attachment
Hitoshi Harada <umi.tanuki@gmail.com> writes: > I found a crash case (assertion failure) when runing psql -f > utf8_encoded_script.sql against client_encoding = shift_jis in > postgresql.conf. Though encoding mismatch is obviously user's fault, a > crash doesn't explain anything to him. I'm not too impressed with this patch: it seems like the most it will accomplish is to move the failure to some other, equally obscure, place --- because you'll still have a string that's invalidly encoded. Moreover, if you've got wrongly encoded data, it wouldn't be hard at all for it to mess up psql's lexing; consider cases such as a character-that's-not-as-long-as-we-think just in front of a quote mark. Shouldn't we instead try to verify the multibyte encoding somewhere upstream of here? regards, tom lane
2011/1/13 Tom Lane <tgl@sss.pgh.pa.us>: > Hitoshi Harada <umi.tanuki@gmail.com> writes: >> I found a crash case (assertion failure) when runing psql -f >> utf8_encoded_script.sql against client_encoding = shift_jis in >> postgresql.conf. Though encoding mismatch is obviously user's fault, a >> crash doesn't explain anything to him. > > I'm not too impressed with this patch: it seems like the most it will > accomplish is to move the failure to some other, equally obscure, place > --- because you'll still have a string that's invalidly encoded. > Moreover, if you've got wrongly encoded data, it wouldn't be hard at all > for it to mess up psql's lexing; consider cases such as a > character-that's-not-as-long-as-we-think just in front of a quote mark. > > Shouldn't we instead try to verify the multibyte encoding somewhere > upstream of here? I had thought it before going into the patch, too. However, the fact that psql(fe-misc.c) doesn't have PQverfiymb() although it has PQmblen() implied to me that encoding verification should be done in server side perhaps. I might be too ignorant to imagine the lexing problem of your quote mark, but my crash sample has multibyte characters in sql comment, which is ignored in the server parsing. If we decided that the case raises error, wouldn't some existing applications be broken? I can imagine they are in the same situation of encoding mismatch and are run without problem I found by chance. Just for reference I attach the case sql file. To reproduce it: 1. initdb 2. edit client_encoding = shift_jis in postgresql.conf 3. start postgres 4. psql -f case_utf8.sql Note: the line break should be LF as the file stands. CR-LF cannot reproduce the problem. Regards, -- Hitoshi Harada