Thread: Invalid EUC_JP char seq bug?
I am using PHP with postgreSQL and I have been getting a few rare errors while trying to do selects on a table containing EUC_JP text. I thought it was a bug with PHP not recognizing a string as invalid EUC_JP characters and wrote up a bug report but the PHP developers assure me that the string that is generating the error is a valid EUC_JP string (I don't know anything about character encodings so I am taking them at their word and the fact that the string displays fine in my browser as EUC_JP lends me to suspect they might be right). The offending string is url encoded as such: words=%8f%ac%90%ec%96%be%93%fa%8d%81 When I try and do a SELECT I get the following error: select id from products where name like '??????' Query failed: ERROR: Invalid EUC_JP character sequence found (0x8100) (Where did the 0x00 come from??) Can someone let me know if this truly is a bug in postgres? Thanks, Jean-Christian Imbeault PS I have also had the error pop up with this string: search_words=%B7%F6%BA%7E select id from products where name like '??~' Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)
> I am using PHP with postgreSQL and I have been getting a few rare errors > while trying to do selects on a table containing EUC_JP text. > > I thought it was a bug with PHP not recognizing a string as invalid > EUC_JP characters and wrote up a bug report but the PHP developers > assure me that the string that is generating the error is a valid EUC_JP > string (I don't know anything about character encodings so I am taking > them at their word and the fact that the string displays fine in my > browser as EUC_JP lends me to suspect they might be right). > > The offending string is url encoded as such: > > words=%8f%ac%90%ec%96%be%93%fa%8d%81 > > When I try and do a SELECT I get the following error: > > select id from products where name like '??????' > Query failed: ERROR: Invalid EUC_JP character sequence found (0x8100) Since you did not show us exact query you send to PostgreSQL, I assume the query passed to PostgreSQL is: select id from products where name like 'string'; where string is "0x8fac90ec96be93fa8d81". If the string is supposed to be an EUC_JP, it would be parsed as follows: 8f: single shift 3 (indicates that following 2 bytes are a JIS 0212 character) ac90: a JIS 0212 character ec96: a JIS 0208 character be93: a JIS 0208 character fa8d: a JIS 0208 character 81: ??? The last 0x81 is invalid if the string is assumed as EUC_JP. > (Where did the 0x00 come from??) trailing '\0'. > Can someone let me know if this truly is a bug in postgres? No. > Thanks, > > Jean-Christian Imbeault > > > PS I have also had the error pop up with this string: > > search_words=%B7%F6%BA%7E > select id from products where name like '??~' > Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e) This is definitly a bad EUC_JP. -- Tatsuo Ishii
Tatsuo Ishii wrote: > > Since you did not show us exact query you send to PostgreSQL I can't show the exact query because it is generated by PHP. I can however show you the code that generates the query: $words = $_GET["words"]; $sql = "select id from products where name like '$words'"; $conn = pg_connect("host=$DB_IP port=5432 dbname=$DB_NAME user=postgres"); $res = pg_query($conn, $sql); The GET query string was: words=%8f%ac%90%ec%96%be%93%fa%8d%81 I think that PHP does some internal translation of this before passing it on though. > I assume the query passed to PostgreSQL is: > > select id from products where name like 'string'; Yes. > where string is "0x8fac90ec96be93fa8d81". That I don't know. > If the string is supposed to be an EUC_JP, it would be parsed as follows: > > 8f: single shift 3 (indicates that following 2 bytes are a JIS 0212 character [snip ...] Ah ... so it is not an EUC-JP string but an SJIS string. Postgres was right. That answers my question. Thanks! >>PS I have also had the error pop up with this string: >> >>search_words=%B7%F6%BA%7E >>select id from products where name like '??~' >>Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e) > > > This is definitly a bad EUC_JP. According to a PHP developer in my bug report (http://bugs.php.net/bug.php?id=24309&edit=2): "URL decoded byte sequance of 'search_words=%B7%F6%BA%7E' is B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I believe encoding detection of mbstring works fine in this case. B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is correct EUC-JP byte sequence." I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my bug report to PHP and pointed this out. Hopefully the developer will see that this sequence is incorrect EUC-JP and that PHP failed to detect this :) I *knew* there was nothing wrong with Postgres ;) Thanks! Jean-Christian Imbeault PS I posted to HACKERS a few weeks ago about another bug (a real one :) in the EUC-JP translation having to do with the WAVE DASH. I'll repost here on the BUGS list, could you let me know the status of that BUG? Thanks!
> >>search_words=%B7%F6%BA%7E > >>select id from products where name like '??~' > >>Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e) > > > > > > This is definitly a bad EUC_JP. > > According to a PHP developer in my bug report > (http://bugs.php.net/bug.php?id=24309&edit=2): > > "URL decoded byte sequance of 'search_words=%B7%F6%BA%7E' is > B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I > believe encoding detection of mbstring works fine in this case. > B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is > correct EUC-JP byte sequence." > > I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my > bug report to PHP and pointed this out. Hopefully the developer will see > that this sequence is incorrect EUC-JP and that PHP failed to detect this :) In the EUC_JP encoding there are some rules: 1) if the first byte is 0x8e then second byte is a JIS 0201 character and should be greater than 0x7f 2) else if the first byte is 0x8f then second and third byte is a JIS 0212 character and they should be greater than 0x7f 3) else if the first byte is greater than 0x7f then second and third byte is a JIS 0208 character and they should be greater than 0x7f 4) else the byte is ASII and should be eqaul to or less than 0x7f Apparently: B7F6: this is ok. we can apply rule #3 BA7E: this is not good, since it satisfies non of rule #1 to #4 > Thanks! > > Jean-Christian Imbeault > > PS I posted to HACKERS a few weeks ago about another bug (a real one :) > in the EUC-JP translation having to do with the WAVE DASH. I'll repost > here on the BUGS list, could you let me know the status of that BUG? Thanks! Sorry for the delay. In EUC-JP <--> Unicode translation, WAVE DASH is always a problem since there are several different mappings among different vendors/standards. I think I need more time to solve this. -- Tatsuo Ishii