Combining chars in psql (pre-patch) - Mailing list pgsql-hackers
From | Patrice Hédé |
---|---|
Subject | Combining chars in psql (pre-patch) |
Date | |
Msg-id | 20010926202333.P1316@idf.net |
Responses |
Re: Combining chars in psql (pre-patch)
Re: Combining chars in psql (pre-patch) |
List | pgsql-hackers |
Hi,

I have been working a bit on a patch for that problem in psql. The patch is far from being ready for inclusion or whatever, it's just for comments...

By the way, can someone tell me how to generate nice patches showing the difference between one's version and the cvs code that has been downloaded? I'm new to this (I've only used cvs for personal projects so far, and I don't need to send patches to myself ;) ).

The good things in this patch:

- it works for me :)
- I've used Markus Kuhn's implementation of wcwidth.c: it is locale-independent, and is in the public domain. :) [if we keep it, I'll have to tell him, though!]
- No dependency on the local libc's UTF-8 awareness ;) [I've seen that psql has no such dependency, at least in print.c, so I haven't added any]. Actually, the change is completely self-contained.
- I've made my own utf-8 -> ucs converter, since I couldn't find one without a copyright notice yesterday. It checks for invalid and non-optimal (i.e. not shortest-form) UTF-8 sequences, as required by Unicode 3.0.1 (or 3.1, I don't remember).
- it works for Japanese (and I believe other "full-width" characters).
- if MULTIBYTE is not defined, the code doesn't change from the committed version.

The not so good things:

- I've made my own utf-8 -> ucs converter... It seems to work fine, but it's not tested well enough, so it may not be very robust.
- printf("%*s", width, utfstr) doesn't work as expected (the field width counts bytes, not display columns), so I had to work around it with printf("%*s%s", width - utfstrwidth, "", utfstr);
- everything is inside #ifdef MULTIBYTE/#endif, but the code has no dependency on anything else, not even on the rest of the multibyte implementation (which I haven't had the time to look at in detail).
- I get this warning (for each call to pg_mb_utfs_width) and I don't know why:
    print.c:265: warning: passing arg 1 of `pg_mb_utfs_width' discards qualifiers from pointer target type
- If pg_mb_utfs_width finds an invalid UTF-8 string, it truncates it. I suppose that's what we want to do, but that's probably not the best place to do it.

The bad things:

- If MULTIBYTE is defined, the strings must be in UTF-8; the code doesn't check the encoding.
- it is not integrated at all with the rest of the MB code.
- it doesn't respect the indentation policy ;)

To do:

- integrate better with the rest of the MB code (client-side encoding), and with the rest of the code of print.c.
- seriously verify the robustness of the utf8-to-ucs converter.
- make the code visually nicer :)
- find better function names.

And possibly:

- consolidate the code, in order to remove the need for the #ifdef's in many places.
- make it work with some other multiwidth encodings (but then, I don't know anything about these encodings myself!).
- also check the utf-8 stream at input time, so that no invalid utf-8 is sent to the backend (at least from psql - the backend will also need strict checking for UTF-8).
- add nice UTF-8 borders as an option :)
- add a command-line parameter to treat Unicode Ambiguous characters (characters which can be narrow or wide, depending on the terminal) as wide characters, as seems to be the case for CJK terminals (as per TR#11).
- What else?

BTW, here is the table I had in the first mail. I would have shown the one with all the weird Unicode characters, but my mutt is configured with iso-8859-15, and I doubt many of you have utf-8 as a default yet :)

+------+-------+--------+
| lang | text  | text   |
+------+-------+--------+
| isl  | álíta | áleit  |
| isl  | álíta | álitum |
| isl  | álíta | álitið |
| isl  | maður | mann   |
| isl  | maður | mönnum |
| isl  | maður | manna  |
| isl  | óska  | -aði   |
+------+-------+--------+

The files in attachment:

- a diff for pgsql/src/bin/psql/print.c
- a diff for pgsql/src/bin/psql/Makefile
- two new files:
    pgsql/src/bin/psql/pg_mb_utf8.c
    pgsql/src/bin/psql/pg_mb_utf8.h

Have fun!
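
For readers without the attachment, the kind of check the utf-8 -> ucs converter is described as doing (rejecting invalid and non-shortest-form sequences) can be sketched roughly as below. This is not the code from pg_mb_utf8.c: the name utf8_decode, its signature, and the choice to return 0 on error are assumptions made for illustration. It also applies the present-day limits (at most 4 bytes, no surrogates, nothing above U+10FFFF), which may differ from what the patch accepted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, NOT the converter from the patch.
 * Decodes one UTF-8 sequence starting at s, with len bytes available.
 * On success stores the code point in *cp and returns the number of
 * bytes consumed (1..4). Returns 0 if the sequence is truncated,
 * malformed, or not in shortest form.
 */
size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* smallest code point that legitimately needs n bytes, indexed by n */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t      n, i;
    uint32_t    c;

    if (len == 0)
        return 0;

    if (s[0] < 0x80)
    {
        *cp = s[0];             /* plain ASCII */
        return 1;
    }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else
        return 0;               /* stray continuation or invalid lead byte */

    if (len < n)
        return 0;               /* truncated sequence */

    for (i = 1; i < n; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte (10xxxxxx) */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_cp[n])
        return 0;               /* non-shortest ("overlong") encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;               /* surrogates are not valid in UTF-8 */
    if (c > 0x10FFFF)
        return 0;               /* beyond the Unicode range */

    *cp = c;
    return n;
}
```

A caller would walk a string by repeatedly calling utf8_decode and advancing by the returned count, stopping (or truncating, as pg_mb_utfs_width is said to do) at the first 0.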
Patrice

-- 
Patrice HÉDÉ ------------------------------- patrice à islande org -----
 -- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang"?
 -- What would _you_ call the creation of the universe?
 -- "The HORRENDOUS SPACE KABLOOIE!" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
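
The printf workaround mentioned in the message can be illustrated with the small sketch below. The names utf8_columns and format_right_aligned are hypothetical and do not come from the patch. The width function is a deliberate simplification: it assumes valid UTF-8 and counts one column per code point, so it is wrong for combining and full-width characters, which the patch handles through Markus Kuhn's wcwidth.c.

```c
#include <stdio.h>
#include <string.h>

/*
 * Simplified stand-in for a real display-width function (assumption:
 * valid UTF-8, one column per code point). It counts every byte that is
 * not a UTF-8 continuation byte (10xxxxxx).
 */
int
utf8_columns(const char *s)
{
    int cols = 0;

    for (; *s; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            cols++;
    return cols;
}

/*
 * Right-align s in a field of `width` display columns, writing the
 * result into out. Returns what snprintf returns.
 */
int
format_right_aligned(char *out, size_t outsz, const char *s, int width)
{
    int pad = width - utf8_columns(s);

    if (pad < 0)
        pad = 0;
    /* "%*s" applied to an empty string emits exactly `pad` spaces */
    return snprintf(out, outsz, "%*s%s", pad, "", s);
}
```

With "áleit" (6 bytes, 5 columns) and a width of 8, a plain "%8s" pads with only 2 spaces because it counts bytes, while format_right_aligned pads with 3.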