Thread: \xDD patch for 7.5devel
Hi all, This is my first patch for PostgreSQL against the 7.5devel cvs (please advise if this is the wrong place to post patches). This patch simply enables the \xDD (or \XDD) hexadecimal import in the copy command (im starting with the simplestuff first). I did notice that there may be a need to issue an error if an invalid octal or hex character is foundfollowing a \ or \x. No errors are currently flagged by the octal (or this hex) import. Rgds, Jason cvs server: Diffing . Index: copy.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/backend/commands/copy.c,v retrieving revision 1.213 diff -u -r1.213 copy.c --- copy.c 6 Oct 2003 02:38:53 -0000 1.213 +++ copy.c 5 Nov 2003 11:19:33 -0000 @@ -48,9 +48,10 @@#include "utils/lsyscache.h"#include "utils/syscache.h" -#define ISOCTAL(c) (((c) >= '0') && ((c) <= '7'))#define OCTVALUE(c) ((c) - '0') +#define ISHEX(c) ((((c)>='0') && ((c)<='9')) || (((c)>='A') && ((c)<='F')) || (((c)>='a') && ((c)<='f'))) +#define HEXVALUE(c) (((c)>='a') ? ((c)-87) : (((c)>='A') ? ((c)-55) : ((c)-'0'))) /* * Represents the different source/dest cases we need to worry about at @@ -1947,6 +1948,33 @@ c = line_buf.data[line_buf.cursor++]; switch (c) { + case 'x': + case 'X': + { + if (line_buf.cursor < line_buf.len) + { + int hexval = 0; + + c = line_buf.data[line_buf.cursor]; + if (ISHEX(c)) + { + line_buf.cursor++; + hexval = HEXVALUE(c); + if (line_buf.cursor < line_buf.len) + { + c = line_buf.data[line_buf.cursor]; + line_buf.cursor++; + if (ISHEX(c)) + { + line_buf.cursor++; + hexval = (hexval << 4) + HEXVALUE(c); + } + } + } + c = hexval; + } + } + break; case '0': case'1': case '2':
Jason Godden writes: > This is my first patch for PostgreSQL against the 7.5devel cvs (please advise if this is the wrong place to post patches). This patch simply enables the \xDD (or \XDD) hexadecimal import in the copy command (im starting with the simplestuff first). I did notice that there may be a need to issue an error if an invalid octal or hex character is foundfollowing a \ or \x. No errors are currently flagged by the octal (or this hex) import. I think this belongs into the string literal parser (at least in addition to COPY). -- Peter Eisentraut peter_e@gmx.net
>>This is my first patch for PostgreSQL against the 7.5devel cvs (please advise if this is the wrong place to post patches). This patch simply enables the \xDD (or \XDD) hexadecimal import in the copy command (im starting with the simplestuff first). I did notice that there may be a need to issue an error if an invalid octal or hex character is foundfollowing a \ or \x. No errors are currently flagged by the octal (or this hex) import. > > > I think this belongs into the string literal parser (at least in addition > to COPY). That's what always happens when I start working on something - someone points out something that makes it 100 times harder :P Chris
Jason Godden <jasongodden@optushome.com.au> writes: > This is my first patch for PostgreSQL against the 7.5devel cvs (please > advise if this is the wrong place to post patches). pgsql-patches in future, please. > +#define HEXVALUE(c) (((c)>='a') ? ((c)-87) : (((c)>='A') ? ((c)-55) : ((c)-'0'))) This seems excessively dependent on the assumption that the character set is ASCII. Why have you hard-coded numeric equivalents into this macro? BTW, the patch is incomplete because it is lacking documentation. regards, tom lane
В Срд, 05.11.2003, в 16:25, Tom Lane пишет: > > +#define HEXVALUE(c) (((c)>='a') ? ((c)-87) : (((c)>='A') ? ((c)-55) : ((c)-'0'))) > > This seems excessively dependent on the assumption that the character > set is ASCII. Why have you hard-coded numeric equivalents into this > macro? What not ASCII compatible character sets are out there in use still today? -- Markus Bertheau <twanger@bluetwanger.de>
On Thu, 6 Nov 2003 06:25 am, Markus Bertheau wrote: > В Срд, 05.11.2003, в 16:25, Tom Lane пишет: > > > +#define HEXVALUE(c) (((c)>='a') ? ((c)-87) : (((c)>='A') ? ((c)-55) : > > > ((c)-'0'))) > > > > This seems excessively dependent on the assumption that the character > > set is ASCII. Why have you hard-coded numeric equivalents into this > > macro? > > What not ASCII compatible character sets are out there in use still > today? Ah, yes - didn't even think about the character sets. If thats the case then octal needs attention as well because it makes a similar assumption. Peter Eisentraut commented that this should be in the string literal parser. Should this be the case? and if so should i migrate both octal and hex to this parser? Rgds, Jason
--On Thursday, November 06, 2003 07:43:07 +1100 Jason Godden <jasongodden@optushome.com.au> wrote: > On Thu, 6 Nov 2003 06:25 am, Markus Bertheau wrote: >> ? ???, 05.11.2003, ? 16:25, Tom Lane ?????: >> > > +#define HEXVALUE(c) (((c)>='a') ? ((c)-87) : (((c)>='A') ? ((c)-55) >> > > : ((c)-'0'))) >> > >> > This seems excessively dependent on the assumption that the character >> > set is ASCII. Why have you hard-coded numeric equivalents into this >> > macro? >> >> What not ASCII compatible character sets are out there in use still >> today? > EBCDIC as far as I know is still the default on IBM Mainframes (been 5+ years but...). -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 E-Mail: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
On Wed, Nov 05, 2003 at 02:47:17PM -0600, Larry Rosenman wrote: > > > --On Thursday, November 06, 2003 07:43:07 +1100 Jason Godden > <jasongodden@optushome.com.au> wrote: > > >On Thu, 6 Nov 2003 06:25 am, Markus Bertheau wrote: > >>? ???, 05.11.2003, ? 16:25, Tom Lane ?????: > >>> > +#define HEXVALUE(c) (((c)>='a') ? ((c)-87) : (((c)>='A') ? ((c)-55) > >>> > : ((c)-'0'))) > >>> > >>> This seems excessively dependent on the assumption that the character > >>> set is ASCII. Why have you hard-coded numeric equivalents into this > >>> macro? > >> > >>What not ASCII compatible character sets are out there in use still > >>today? > > > EBCDIC as far as I know is still the default on IBM Mainframes (been 5+ > years but...). Linux on the s390, s390x runs in ASCII mode. MVS, OS/390, z/OS all use EBCDIC though. But I don't think it has anything to do with which OS/hardware you use but rather what charset is used during the communication. It's probably about the charset that is used to send the "\xDD". I guess question is that you can assume that that string is encoded in ASCII. If this is broken, I'd say that the octal encoding and other quotes are broken too. Kurt
On Thu, 6 Nov 2003, Jason Godden wrote: > On Thu, 6 Nov 2003 06:25 am, Markus Bertheau wrote: > > В Срд, 05.11.2003, в 16:25, Tom Lane пишет: > > > > +#define HEXVALUE(c) (((c)>='a') ? ((c)-87) : (((c)>='A') ? ((c)-55) : > > > > ((c)-'0'))) > > > > > > This seems excessively dependent on the assumption that the character > > > set is ASCII. Why have you hard-coded numeric equivalents into this > > > macro? > > > > What not ASCII compatible character sets are out there in use still > > today? > > Ah, yes - didn't even think about the character sets. If thats the case then > octal needs attention as well because it makes a similar assumption. Peter I haven't looked at the code in question, but assuming the digits are contiguous and in order is safe, the C spec mandates that. Assuming that the letters are in order and contiguous is not safe.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > On Thu, 6 Nov 2003, Jason Godden wrote: >> On Thu, 6 Nov 2003 06:25 am, Markus Bertheau wrote: >>> +#define HEXVALUE(c) (((c)>='a') ? ((c)-87) : (((c)>='A') ? ((c)-55) : > ((c)-'0'))) > I haven't looked at the code in question, but assuming the digits are > contiguous and in order is safe, the C spec mandates that. Assuming that > the letters are in order and contiguous is not safe. I believe that's a true statement with respect to the character sets used in the field; I dunno whether the C spec actually says that though. My original concern about this macro had several different levels: 1. I can't see any reason why the subtractions are coded as "-55" and not "-'A' + 10". The existing coding is less understandable than doing it right, and won't save anything in runtime (since the compiler will fold the constants anyway), on top of being a character set dependency. 2. I don't much care for the assumption that lower case letters are greater than upper case are greater than digits. This could be avoided at a fairly small runtime penalty by making range tests: #define HEXVALUE(c) \(((c) >= '0' && (c) <= '9') ? ((c) - '0') : \ (((c) >= 'A' && (c) <= 'F') ? ((c) - 'A' + 10) : \ ((c)- 'a' + 10))) 3. The third level would be to get rid of the assumption that letters are contiguous, which would probably require making a lookup table to map from char code to hex value. I'm not sure level 3 is really worth doing, since AFAIK no one tries to run Postgres on any EBCDIC platform. (It's likely that there are other places that depend on the letters-are-contiguous assumption anyway.) I do think level 1 and probably level 2 are appropriate changes. regards, tom lane
> #define HEXVALUE(c) \ > (((c) >= '0' && (c) <= '9') ? ((c) - '0') : \ > (((c) >= 'A' && (c) <= 'F') ? ((c) - 'A' + 10) : \ > ((c) - 'a' + 10))) > > 3. The third level would be to get rid of the assumption that letters > are contiguous, which would probably require making a lookup table to > map from char code to hex value. > > I'm not sure level 3 is really worth doing, since AFAIK no one tries to > run Postgres on any EBCDIC platform. (It's likely that there are other > places that depend on the letters-are-contiguous assumption anyway.) > I do think level 1 and probably level 2 are appropriate changes. > > regards, tom lane Hi Guys, Thanks for the feedback. Tom I'll make the changes proposed here to the macro and repost the patch to pgsql-patches (and do some reading on Unicode!). I guess at this stage I would like to offer any of my time to any janitorial work that might be needed as until im more knowledgeable about the pg source I think any large scale stuff is off the cards. Rgds, Jason