Thread: Re: [HACKERS] dollar quoting
Andrew Dunstan wrote: > > > Tom Lane wrote: > >> Andrew Dunstan <andrew@dunslane.net> writes: >> >> >>> After staring at the code for a long time, I think I see how to do >>> this. It's complicated a bit by the fact that $ is a valid >>> identifier character. So my current thinking is to say that if we >>> see $ not in a quote and not preceded by a valid identifier char >>> then it is the start of a $foo$ sequence. Or have I missed >>> something? Can we validly see $ in any other context? >>> >> I had missed one, though - numbered params in prepared statements. Fixed in attached patch. >> >> Right, a $ should be considered to start a quote marker only if it's not >> part of an identifier. The backend lexer doesn't have a problem with >> this because it's written in flex, but I can imagine that getting it >> right in psql's ad-hoc parser might be tricky. >> > > I think it's doable, though. I seem to have a working patch, which I > will send out for review soon. > Proof of Concept patch (i.e. not for application) attached for review. The scanner changes are based on Tom's original, with some name/comment changes and a more liberal pattern. The psql changes are all my own work :-). Comments welcome. Reviewers: I am not sure I got multi-byte stuff right in psql/mainloop.c - please pay close attention to that. If I'm not wildly off course I will polish this up and start on docs. cheers andrew Index: src/backend/parser/scan.l =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v retrieving revision 1.112 diff -c -w -r1.112 scan.l *** src/backend/parser/scan.l 29 Nov 2003 19:51:52 -0000 1.112 --- src/backend/parser/scan.l 8 Feb 2004 19:59:24 -0000 *************** *** 39,44 **** --- 39,46 ---- static int xcdepth = 0; /* depth of nesting in slash-star comments */ + static char *dolqstart; /* current $foo$ quote start string */ + /* * literalbuf is used to accumulate literal values when multiple rules * are needed to parse a single literal. Call startlit to reset buffer *************** *** 95,100 **** --- 97,103 ---- * <xd> delimited identifiers (double-quoted identifiers) * <xh> hexadecimal numeric string * <xq> quoted strings + * <dolq> $foo$-style quoted strings */ %x xb *************** *** 102,107 **** --- 105,111 ---- %x xd %x xh %x xq + %x dolq /* Bit string * It is tempting to scan the string for only those characters *************** *** 141,146 **** --- 145,159 ---- xqoctesc [\\][0-7]{1,3} xqcat {quote}{whitespace_with_newline}{quote} + /* $foo$ style quotes ("dollar quoting") + * The quoted string starts with $foo$ where "foo" is an optional string + * in the form of an identifier, except that it may not contain "$", + * and extends to the first occurrence + * of an identical string. There is *no* processing of the quoted text. + */ + dolqdelim \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$ + dolqinside [^$]+ + /* Double quote * Allows embedded spaces and other special characters into identifiers. */ *************** *** 387,392 **** --- 400,434 ---- } <xq><<EOF>> { yyerror("unterminated quoted string"); } + {dolqdelim} { + token_start = yytext; + dolqstart = pstrdup(yytext); + BEGIN(dolq); + startlit(); + } + <dolq>{dolqdelim} { + if (strcmp(yytext, dolqstart) == 0) + { + pfree(dolqstart); + BEGIN(INITIAL); + yylval.str = litbufdup(); + return SCONST; + } + /* + * When we fail to match $...$ to dolqstart, transfer + * the $... part to the output, but put back the final + * $ for rescanning. Consider $delim$...$junk$delim$ + */ + addlit(yytext, yyleng-1); + yyless(yyleng-1); + } + <dolq>{dolqinside} { + addlit(yytext, yyleng); + } + <dolq>. { + addlitchar(yytext[0]); + } + <dolq><<EOF>> { yyerror("unterminated special-quoted string"); } {xdstart} { token_start = yytext; Index: src/bin/psql/mainloop.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v retrieving revision 1.61 diff -c -w -r1.61 mainloop.c *** src/bin/psql/mainloop.c 25 Jan 2004 03:07:22 -0000 1.61 --- src/bin/psql/mainloop.c 8 Feb 2004 19:59:44 -0000 *************** *** 49,54 **** --- 49,58 ---- unsigned int query_start; volatile int count_eof = 0; volatile unsigned int bslash_count = 0; + volatile bool free_dolquote = false; + char *dol_quote = NULL; + char *dol_end; + int i, prevlen, *************** *** 120,125 **** --- 124,130 ---- in_quote = 0; paren_level = 0; count_eof = 0; + free_dolquote = true; slashCmdStatus = CMD_UNKNOWN; } else *************** *** 136,141 **** --- 141,156 ---- pqsignal(SIGINT, handle_sigint); /* control-C => cancel */ #endif /* not WIN32 */ + if (free_dolquote) + { + if(dol_quote) + { + free(dol_quote); + dol_quote = NULL; + } + free_dolquote = false; + } + fflush(stdout); if (slashCmdStatus == CMD_NEWEDIT) *************** *** 150,155 **** --- 165,175 ---- in_xcomment = 0; in_quote = 0; paren_level = 0; + if(dol_quote) + { + free(dol_quote); + dol_quote = NULL; + } slashCmdStatus = CMD_UNKNOWN; } *************** *** 161,167 **** { int prompt_status; ! if (in_quote && in_quote == '\'') prompt_status = PROMPT_SINGLEQUOTE; else if (in_quote && in_quote == '"') prompt_status = PROMPT_DOUBLEQUOTE; --- 181,189 ---- { int prompt_status; ! if (dol_quote) ! prompt_status = PROMPT_DOLLARQUOTE; ! else if (in_quote && in_quote == '\'') prompt_status = PROMPT_SINGLEQUOTE; else if (in_quote && in_quote == '"') prompt_status = PROMPT_DOUBLEQUOTE; *************** *** 268,273 **** --- 290,339 ---- in_quote = 0; } + /* + * start of $foo$ type quote + * must be a $ not preceded by a valid identifier character, + * not inside a quote, not succeeded by a digit, + * and with a terminating $ somewhere on the line. + * + * if an invalid $foo$ meeting these requirements is seen, it won't + * be trapped here but will instead generate an SQL syntax error. + */ + + else if (!dol_quote && line[i] == '$' && + !isdigit(line[i + thislen]) && + (dol_end = strchr(line+i+1,'$')) != NULL && + (i == 0 || + ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) || + line[i-1] == '_'))) + { + char eos; + + dol_end ++; + eos = *(dol_end); + *dol_end = '\0'; + dol_quote = strdup(line+i); + *dol_end = eos; + ADVANCE_1; + while(line[i] != '$') + ADVANCE_1; + + } + + /* in or end of $foo$ type quote? */ + + else if (dol_quote) + { + if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0) + { + ADVANCE_1; + while(line[i] != '$') + ADVANCE_1; + free(dol_quote); + dol_quote = NULL; + } + } + /* start of extended comment? */ else if (line[i] == '/' && line[i + thislen] == '*') { *************** *** 291,297 **** /* single-line comment? truncate line */ else if (line[i] == '-' && line[i + thislen] == '-') { ! line[i] = '\0'; /* remove comment */ break; } --- 357,363 ---- /* single-line comment? truncate line */ else if (line[i] == '-' && line[i + thislen] == '-') { ! line[i] = '\0'; /* removae comment */ break; } *************** *** 458,464 **** /* Put the rest of the line in the query buffer. */ ! if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0') { if (query_buf->len > 0) appendPQExpBufferChar(query_buf, '\n'); --- 524,531 ---- /* Put the rest of the line in the query buffer. */ ! if (in_quote || dol_quote || ! line[query_start + strspn(line + query_start, " \t\n\r")] != '\0') { if (query_buf->len > 0) appendPQExpBufferChar(query_buf, '\n'); Index: src/bin/psql/prompt.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v retrieving revision 1.34 diff -c -w -r1.34 prompt.c *** src/bin/psql/prompt.c 25 Jan 2004 03:07:22 -0000 1.34 --- src/bin/psql/prompt.c 8 Feb 2004 19:59:44 -0000 *************** *** 85,90 **** --- 85,91 ---- case PROMPT_CONTINUE: case PROMPT_SINGLEQUOTE: case PROMPT_DOUBLEQUOTE: + case PROMPT_DOLLARQUOTE: case PROMPT_COMMENT: case PROMPT_PAREN: prompt_name = "PROMPT2"; *************** *** 198,203 **** --- 199,207 ---- break; case PROMPT_DOUBLEQUOTE: buf[0] = '"'; + break; + case PROMPT_DOLLARQUOTE: + buf[0] = '$'; break; case PROMPT_COMMENT: buf[0] = '*'; Index: src/bin/psql/prompt.h =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v retrieving revision 1.13 diff -c -w -r1.13 prompt.h *** src/bin/psql/prompt.h 29 Nov 2003 19:52:07 -0000 1.13 --- src/bin/psql/prompt.h 8 Feb 2004 19:59:44 -0000 *************** *** 15,20 **** --- 15,21 ---- PROMPT_COMMENT, PROMPT_SINGLEQUOTE, PROMPT_DOUBLEQUOTE, + PROMPT_DOLLARQUOTE, PROMPT_PAREN, PROMPT_COPY } promptStatus_t;
Andrew Dunstan <andrew@dunslane.net> writes: > Comments welcome. Reviewers: I am not sure I got multi-byte stuff right > in psql/mainloop.c - please pay close attention to that. The i-1 stuff should generally be i-prevlen. Not sure if there are any other pitfalls. A bigger problem here: > + else if (!dol_quote && line[i] == '$' && > + !isdigit(line[i + thislen]) && > + (dol_end = strchr(line+i+1,'$')) != NULL && > + (i == 0 || > + ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) || > + line[i-1] == '_'))) > + { is that you aren't checking that what comes between the two dollar signs looks like empty-or-an-identifier. The check for next-char-isn't-a-digit is part of that but not the only part. Also I'm not sure about the positioning of these tests relative to the in_quote and in_xcomment tests. As you have it, $foo$ will be recognized within an xcomment, which I think is at variance with the proposed backend lexing behavior. Also, the strdup should be pg_strdup. regards, tom lane
Tom Lane wrote: >A bigger problem here: > > > >>+ else if (!dol_quote && line[i] == '$' && >>+ !isdigit(line[i + thislen]) && >>+ (dol_end = strchr(line+i+1,'$')) != NULL && >>+ (i == 0 || >>+ ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) || >>+ line[i-1] == '_'))) >>+ { >> >> > >is that you aren't checking that what comes between the two dollar signs >looks like empty-or-an-identifier. The check for >next-char-isn't-a-digit is part of that but not the only part. > > Well, I think the right way to do a full check would be with a regex, which I had hoped to avoid. However, I will now try to get one working and to address your other concerns. Thanks for the comments. cheers andrew
I think the attached patch addresses Tom's comments. I ended up not using a regex, which seemed to be a little heavy handed, but just writing a small custom recognition function, that should (and I think does) mimic the pattern recognition for these tokens used by the backend lexer. This patch just puts that function in mainloop.c, but perhaps it belongs elsewhere (string_utils.c maybe?). I don't have strong opinions on that. Enjoy andrew Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>Comments welcome. Reviewers: I am not sure I got multi-byte stuff right >>in psql/mainloop.c - please pay close attention to that. >> >> > >The i-1 stuff should generally be i-prevlen. Not sure if there are any >other pitfalls. > >A bigger problem here: > > > >>+ else if (!dol_quote && line[i] == '$' && >>+ !isdigit(line[i + thislen]) && >>+ (dol_end = strchr(line+i+1,'$')) != NULL && >>+ (i == 0 || >>+ ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) || >>+ line[i-1] == '_'))) >>+ { >> >> > >is that you aren't checking that what comes between the two dollar signs >looks like empty-or-an-identifier. The check for >next-char-isn't-a-digit is part of that but not the only part. > >Also I'm not sure about the positioning of these tests relative to the >in_quote and in_xcomment tests. As you have it, $foo$ will be >recognized within an xcomment, which I think is at variance with the >proposed backend lexing behavior. > >Also, the strdup should be pg_strdup. > > regards, tom lane > >---------------------------(end of broadcast)--------------------------- >TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > > > Index: src/backend/parser/scan.l =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v retrieving revision 1.112 diff -c -w -r1.112 scan.l *** src/backend/parser/scan.l 29 Nov 2003 19:51:52 -0000 1.112 --- src/backend/parser/scan.l 9 Feb 2004 15:26:34 -0000 *************** *** 39,44 **** --- 39,46 ---- static int xcdepth = 0; /* depth of nesting in slash-star comments */ + static char *dolqstart; /* current $foo$ quote start string */ + /* * literalbuf is used to accumulate literal values when multiple rules * are needed to parse a single literal. Call startlit to reset buffer *************** *** 95,100 **** --- 97,103 ---- * <xd> delimited identifiers (double-quoted identifiers) * <xh> hexadecimal numeric string * <xq> quoted strings + * <dolq> $foo$-style quoted strings */ %x xb *************** *** 102,107 **** --- 105,111 ---- %x xd %x xh %x xq + %x dolq /* Bit string * It is tempting to scan the string for only those characters *************** *** 141,146 **** --- 145,159 ---- xqoctesc [\\][0-7]{1,3} xqcat {quote}{whitespace_with_newline}{quote} + /* $foo$ style quotes ("dollar quoting") + * The quoted string starts with $foo$ where "foo" is an optional string + * in the form of an identifier, except that it may not contain "$", + * and extends to the first occurrence + * of an identical string. There is *no* processing of the quoted text. + */ + dolqdelim \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$ + dolqinside [^$]+ + /* Double quote * Allows embedded spaces and other special characters into identifiers. */ *************** *** 387,392 **** --- 400,434 ---- } <xq><<EOF>> { yyerror("unterminated quoted string"); } + {dolqdelim} { + token_start = yytext; + dolqstart = pstrdup(yytext); + BEGIN(dolq); + startlit(); + } + <dolq>{dolqdelim} { + if (strcmp(yytext, dolqstart) == 0) + { + pfree(dolqstart); + BEGIN(INITIAL); + yylval.str = litbufdup(); + return SCONST; + } + /* + * When we fail to match $...$ to dolqstart, transfer + * the $... part to the output, but put back the final + * $ for rescanning. Consider $delim$...$junk$delim$ + */ + addlit(yytext, yyleng-1); + yyless(yyleng-1); + } + <dolq>{dolqinside} { + addlit(yytext, yyleng); + } + <dolq>. { + addlitchar(yytext[0]); + } + <dolq><<EOF>> { yyerror("unterminated special-quoted string"); } {xdstart} { token_start = yytext; Index: src/bin/psql/mainloop.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v retrieving revision 1.61 diff -c -w -r1.61 mainloop.c *** src/bin/psql/mainloop.c 25 Jan 2004 03:07:22 -0000 1.61 --- src/bin/psql/mainloop.c 9 Feb 2004 15:26:51 -0000 *************** *** 21,26 **** --- 21,61 ---- sigjmp_buf main_loop_jmp; #endif + /* + * function to detect a valid $foo$ quote delimiter at the start of the + * parameter dquote. + */ + + static bool valid_dolquote(char * dquote) + { + int i; + + /* must start with a $ */ + if (dquote[0] != '$') + return false; + + /* empty 'identifier' case */ + if (dquote[1] == '$') + return true; + + /* first 'identifier' char must be a letter or have high bit set */ + if (!isalpha(dquote[1]) && (dquote[1] & 0x80) == 0) + return false; + + /* subsequent chars must be alphanumeric or _ or have high bit set */ + for (i = 2; dquote[i] != '$'; i++) + { + if ((dquote[i] & 0x80) == 0 && ! isalnum(dquote[i]) && + dquote[i] != '_') + { + /* we found an invalid character */ + return false; + } + } + + return true; + } + /* * Main processing loop for reading lines of input *************** *** 49,54 **** --- 84,92 ---- unsigned int query_start; volatile int count_eof = 0; volatile unsigned int bslash_count = 0; + volatile bool free_dolquote = false; + char *dol_quote = NULL; + int i, prevlen, *************** *** 120,125 **** --- 158,164 ---- in_quote = 0; paren_level = 0; count_eof = 0; + free_dolquote = true; slashCmdStatus = CMD_UNKNOWN; } else *************** *** 136,141 **** --- 175,190 ---- pqsignal(SIGINT, handle_sigint); /* control-C => cancel */ #endif /* not WIN32 */ + if (free_dolquote) + { + if(dol_quote) + { + free(dol_quote); + dol_quote = NULL; + } + free_dolquote = false; + } + fflush(stdout); if (slashCmdStatus == CMD_NEWEDIT) *************** *** 150,155 **** --- 199,209 ---- in_xcomment = 0; in_quote = 0; paren_level = 0; + if(dol_quote) + { + free(dol_quote); + dol_quote = NULL; + } slashCmdStatus = CMD_UNKNOWN; } *************** *** 161,167 **** { int prompt_status; ! if (in_quote && in_quote == '\'') prompt_status = PROMPT_SINGLEQUOTE; else if (in_quote && in_quote == '"') prompt_status = PROMPT_DOUBLEQUOTE; --- 215,223 ---- { int prompt_status; ! if (dol_quote) ! prompt_status = PROMPT_DOLLARQUOTE; ! else if (in_quote && in_quote == '\'') prompt_status = PROMPT_SINGLEQUOTE; else if (in_quote && in_quote == '"') prompt_status = PROMPT_DOUBLEQUOTE; *************** *** 268,273 **** --- 324,343 ---- in_quote = 0; } + /* in or end of $foo$ type quote? */ + + else if (dol_quote) + { + if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0) + { + ADVANCE_1; + while(line[i] != '$') + ADVANCE_1; + free(dol_quote); + dol_quote = NULL; + } + } + /* start of extended comment? */ else if (line[i] == '/' && line[i + thislen] == '*') { *************** *** 288,297 **** else if (line[i] == '\'' || line[i] == '"') in_quote = line[i]; /* single-line comment? truncate line */ else if (line[i] == '-' && line[i + thislen] == '-') { ! line[i] = '\0'; /* remove comment */ break; } --- 358,395 ---- else if (line[i] == '\'' || line[i] == '"') in_quote = line[i]; + /* + * start of $foo$ type quote? + * + * must not be preceded by a valid identifier character + */ + + else if (!dol_quote && valid_dolquote(line+i) && + (i == 0 || + ! ((line[i-prevlen] & 0x80) != 0 || + isalnum(line[i-prevlen]) || + line[i-prevlen] == '_' || + line[i-prevlen] == '$' ))) + { + char * dol_end; + char eos; + + dol_end = strchr(line+i+1,'$'); + dol_end ++; + eos = *dol_end; + *dol_end = '\0'; + dol_quote = pg_strdup(line+i); + *dol_end = eos; + ADVANCE_1; + while(line[i] != '$') + ADVANCE_1; + + } + /* single-line comment? truncate line */ else if (line[i] == '-' && line[i + thislen] == '-') { ! line[i] = '\0'; /* removae comment */ break; } *************** *** 458,464 **** /* Put the rest of the line in the query buffer. */ ! if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0') { if (query_buf->len > 0) appendPQExpBufferChar(query_buf, '\n'); --- 556,563 ---- /* Put the rest of the line in the query buffer. */ ! if (in_quote || dol_quote || ! line[query_start + strspn(line + query_start, " \t\n\r")] != '\0') { if (query_buf->len > 0) appendPQExpBufferChar(query_buf, '\n'); Index: src/bin/psql/prompt.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v retrieving revision 1.34 diff -c -w -r1.34 prompt.c *** src/bin/psql/prompt.c 25 Jan 2004 03:07:22 -0000 1.34 --- src/bin/psql/prompt.c 9 Feb 2004 15:26:51 -0000 *************** *** 85,90 **** --- 85,91 ---- case PROMPT_CONTINUE: case PROMPT_SINGLEQUOTE: case PROMPT_DOUBLEQUOTE: + case PROMPT_DOLLARQUOTE: case PROMPT_COMMENT: case PROMPT_PAREN: prompt_name = "PROMPT2"; *************** *** 198,203 **** --- 199,207 ---- break; case PROMPT_DOUBLEQUOTE: buf[0] = '"'; + break; + case PROMPT_DOLLARQUOTE: + buf[0] = '$'; break; case PROMPT_COMMENT: buf[0] = '*'; Index: src/bin/psql/prompt.h =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v retrieving revision 1.13 diff -c -w -r1.13 prompt.h *** src/bin/psql/prompt.h 29 Nov 2003 19:52:07 -0000 1.13 --- src/bin/psql/prompt.h 9 Feb 2004 15:26:51 -0000 *************** *** 15,20 **** --- 15,21 ---- PROMPT_COMMENT, PROMPT_SINGLEQUOTE, PROMPT_DOUBLEQUOTE, + PROMPT_DOLLARQUOTE, PROMPT_PAREN, PROMPT_COPY } promptStatus_t;
Andrew Dunstan <andrew@dunslane.net> writes: > I ended up not using a regex, which seemed to be a little heavy handed, > but just writing a small custom recognition function, that should (and I > think does) mimic the pattern recognition for these tokens used by the > backend lexer. I looked at this and realized that it still doesn't do very well at distinguishing $foo$ from other random uses of $. The problem is that looking back at just the immediately preceding character isn't enough context to tell whether a $ is part of an identifier. Consider the input a42$foo$ This is a legal identifier according to PG 7.4. But how about 42$foo$ This is a syntax error in 7.4, and we propose to redefine it as an integer literal '42' followed by a dollar-quote start symbol. There's no way to tell these apart with a single-character lookback, or indeed any fixed number of characters of lookback. I begin to think that we'll really have to bite the bullet and convert psql's input parser to use flex. If we're not scanning with exactly the same rules as the backend uses, we're going to get the wrong answers. regards, tom lane
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>I ended up not using a regex, which seemed to be a little heavy handed, >>but just writing a small custom recognition function, that should (and I >>think does) mimic the pattern recognition for these tokens used by the >>backend lexer. >> >> > >I looked at this and realized that it still doesn't do very well at >distinguishing $foo$ from other random uses of $. The problem is that >looking back at just the immediately preceding character isn't enough >context to tell whether a $ is part of an identifier. Consider the >input > a42$foo$ >This is a legal identifier according to PG 7.4. But how about > 42$foo$ >This is a syntax error in 7.4, and we propose to redefine it as an >integer literal '42' followed by a dollar-quote start symbol. > The test in the patch I sent is this: else if (!dol_quote && valid_dolquote(line+i) && (i == 0 || ! ((line[i-prevlen] & 0x80) != 0 || isalnum(line[i-prevlen]) || line[i-prevlen] == '_' || line[i-prevlen] == '$' ))) The test should not succeed anywhere in the string '42$foo$'. Note that psql does not change any '$foo$' at all - it just passes it to the backend. The reason we need this at all in psql is that it has to detect the end of a statement, and it has to prompt correctly, and to do that it needs to know if we are in a quote (single, double, dollar) or a comment. psql does not detect many syntax errors, or even lexical errors - that is the job of the backend - rightly so, I believe. > >There's no way to tell these apart with a single-character lookback, >or indeed any fixed number of characters of lookback. > I'm still not convinced, although maybe there's something I'm not getting. > >I begin to think that we'll really have to bite the bullet and convert >psql's input parser to use flex. If we're not scanning with exactly the >same rules as the backend uses, we're going to get the wrong answers. > > > Interacting with lexer states would probably be ... unpleasant. Matching a stream oriented lexer with a line oriented CLI would be messy I suspect. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> ... But how about >> 42$foo$ >> This is a syntax error in 7.4, and we propose to redefine it as an >> integer literal '42' followed by a dollar-quote start symbol. > The test should not succeed anywhere in the string '42$foo$'. No, it won't. The problem is that it should, because the backend will see that as '42' followed by a $foo$ quote start. > Interacting with lexer states would probably be ... unpleasant. Matching > a stream oriented lexer with a line oriented CLI would be messy I suspect. I think it would not be that bad. We'd have to run the lexer on the command input buffer and see what state it terminates in. regards, tom lane
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>Tom Lane wrote: >> >> >>>... But how about >>>42$foo$ >>>This is a syntax error in 7.4, and we propose to redefine it as an >>>integer literal '42' followed by a dollar-quote start symbol. >>> >>> > > > >>The test should not succeed anywhere in the string '42$foo$'. >> >> > >No, it won't. The problem is that it should, because the backend will >see that as '42' followed by a $foo$ quote start. > Ok, I see what you are saying. This mismatch would only happen on invalid input, though. I believe that what I did will work on all legal input. I think that this might be cured by having psql recognise a legal identifier or keyword and eating it as a word, rather than treating it as just another set of bytes in the stream. That would enable us to avoid the lookback in the dollar-quote recognition test altogether. The attached patch does it that way - the keyword/id test needs to come right at the end of the loop to avoid clashing with backslash commands, btw. I *think* that this way psql will recognise the start of a dollar quote iff the backend lexer would. > > > >>Interacting with lexer states would probably be ... unpleasant. Matching >>a stream oriented lexer with a line oriented CLI would be messy I suspect. >> >> > >I think it would not be that bad. We'd have to run the lexer on the >command input buffer and see what state it terminates in. > > > Yeah. I am not enough of a flex wizard to undertake the task, though. It would take me lots of time. If we make a decision that we really need this in order to do dollar quoting in psql I would need some substantial help, at least. cheers andrew Index: src/bin/psql/mainloop.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v retrieving revision 1.61 diff -c -r1.61 mainloop.c *** src/bin/psql/mainloop.c 25 Jan 2004 03:07:22 -0000 1.61 --- src/bin/psql/mainloop.c 15 Feb 2004 14:28:02 -0000 *************** *** 21,26 **** --- 21,61 ---- sigjmp_buf main_loop_jmp; #endif + /* + * function to detect a valid $foo$ quote delimiter at the start of the + * parameter dquote. + */ + + static bool valid_dolquote(char * dquote) + { + int i; + + /* must start with a $ */ + if (dquote[0] != '$') + return false; + + /* empty 'identifier' case */ + if (dquote[1] == '$') + return true; + + /* first 'identifier' char must be a letter or have high bit set */ + if (!isalpha(dquote[1]) && (dquote[1] & 0x80) == 0) + return false; + + /* subsequent chars must be alphanumeric or _ or have high bit set */ + for (i = 2; dquote[i] != '$'; i++) + { + if ((dquote[i] & 0x80) == 0 && ! isalnum(dquote[i]) && + dquote[i] != '_') + { + /* we found an invalid character */ + return false; + } + } + + return true; + } + /* * Main processing loop for reading lines of input *************** *** 49,54 **** --- 84,92 ---- unsigned int query_start; volatile int count_eof = 0; volatile unsigned int bslash_count = 0; + volatile bool free_dolquote = false; + char *dol_quote = NULL; + int i, prevlen, *************** *** 120,125 **** --- 158,164 ---- in_quote = 0; paren_level = 0; count_eof = 0; + free_dolquote = true; slashCmdStatus = CMD_UNKNOWN; } else *************** *** 136,141 **** --- 175,190 ---- pqsignal(SIGINT, handle_sigint); /* control-C => cancel */ #endif /* not WIN32 */ + if (free_dolquote) + { + if(dol_quote) + { + free(dol_quote); + dol_quote = NULL; + } + free_dolquote = false; + } + fflush(stdout); if (slashCmdStatus == CMD_NEWEDIT) *************** *** 150,155 **** --- 199,209 ---- in_xcomment = 0; in_quote = 0; paren_level = 0; + if(dol_quote) + { + free(dol_quote); + dol_quote = NULL; + } slashCmdStatus = CMD_UNKNOWN; } *************** *** 161,167 **** { int prompt_status; ! if (in_quote && in_quote == '\'') prompt_status = PROMPT_SINGLEQUOTE; else if (in_quote && in_quote == '"') prompt_status = PROMPT_DOUBLEQUOTE; --- 215,223 ---- { int prompt_status; ! if (dol_quote) ! prompt_status = PROMPT_DOLLARQUOTE; ! else if (in_quote && in_quote == '\'') prompt_status = PROMPT_SINGLEQUOTE; else if (in_quote && in_quote == '"') prompt_status = PROMPT_DOUBLEQUOTE; *************** *** 268,273 **** --- 324,343 ---- in_quote = 0; } + /* in or end of $foo$ type quote? */ + + else if (dol_quote) + { + if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0) + { + ADVANCE_1; + while(line[i] != '$') + ADVANCE_1; + free(dol_quote); + dol_quote = NULL; + } + } + /* start of extended comment? */ else if (line[i] == '/' && line[i + thislen] == '*') { *************** *** 288,293 **** --- 358,383 ---- else if (line[i] == '\'' || line[i] == '"') in_quote = line[i]; + /* + * start of $foo$ type quote? + */ + else if (!dol_quote && valid_dolquote(line+i)) + { + char * dol_end; + char eos; + + dol_end = strchr(line+i+1,'$'); + dol_end ++; + eos = *dol_end; + *dol_end = '\0'; + dol_quote = pg_strdup(line+i); + *dol_end = eos; + ADVANCE_1; + while(line[i] != '$') + ADVANCE_1; + + } + /* single-line comment? truncate line */ else if (line[i] == '-' && line[i + thislen] == '-') { *************** *** 447,452 **** --- 537,566 ---- i = end_of_cmd - line; query_start = i; } + + /* + * keyword or identifier? + * We grab the whole string so that we don't + * mistakenly see $foo$ inside an identifier as the start + * of a dollar quote. + */ + + else if ( (line[i] & 0x80) != 0 || + isalpha(line[i]) || + line[i] == '_') + { + while ((line[i+thislen] & 0x80) != 0 || + isalnum(line[i+thislen]) || + line[i+thislen] == '_' || + line[i+thislen] == '$' ) + { + /* keep going while we still have identifier chars */ + ADVANCE_1; + } + + + } + } /* for (line) */ *************** *** 458,464 **** /* Put the rest of the line in the query buffer. */ ! if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0') { if (query_buf->len > 0) appendPQExpBufferChar(query_buf, '\n'); --- 572,579 ---- /* Put the rest of the line in the query buffer. */ ! if (in_quote || dol_quote || ! line[query_start + strspn(line + query_start, " \t\n\r")] != '\0') { if (query_buf->len > 0) appendPQExpBufferChar(query_buf, '\n'); Index: src/bin/psql/prompt.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v retrieving revision 1.34 diff -c -r1.34 prompt.c *** src/bin/psql/prompt.c 25 Jan 2004 03:07:22 -0000 1.34 --- src/bin/psql/prompt.c 15 Feb 2004 14:28:02 -0000 *************** *** 85,90 **** --- 85,91 ---- case PROMPT_CONTINUE: case PROMPT_SINGLEQUOTE: case PROMPT_DOUBLEQUOTE: + case PROMPT_DOLLARQUOTE: case PROMPT_COMMENT: case PROMPT_PAREN: prompt_name = "PROMPT2"; *************** *** 198,203 **** --- 199,207 ---- break; case PROMPT_DOUBLEQUOTE: buf[0] = '"'; + break; + case PROMPT_DOLLARQUOTE: + buf[0] = '$'; break; case PROMPT_COMMENT: buf[0] = '*'; Index: src/bin/psql/prompt.h =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v retrieving revision 1.13 diff -c -r1.13 prompt.h *** src/bin/psql/prompt.h 29 Nov 2003 19:52:07 -0000 1.13 --- src/bin/psql/prompt.h 15 Feb 2004 14:28:02 -0000 *************** *** 15,20 **** --- 15,21 ---- PROMPT_COMMENT, PROMPT_SINGLEQUOTE, PROMPT_DOUBLEQUOTE, + PROMPT_DOLLARQUOTE, PROMPT_PAREN, PROMPT_COPY } promptStatus_t; Index: src/backend/parser/scan.l =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v retrieving revision 1.112 diff -c -r1.112 scan.l *** src/backend/parser/scan.l 29 Nov 2003 19:51:52 -0000 1.112 --- src/backend/parser/scan.l 15 Feb 2004 14:28:16 -0000 *************** *** 39,44 **** --- 39,46 ---- static int xcdepth = 0; /* depth of nesting in slash-star comments */ + static char *dolqstart; /* current $foo$ quote start string */ + /* * literalbuf is used to accumulate literal values when multiple rules * are needed to parse a single literal. Call startlit to reset buffer *************** *** 95,100 **** --- 97,103 ---- * <xd> delimited identifiers (double-quoted identifiers) * <xh> hexadecimal numeric string * <xq> quoted strings + * <dolq> $foo$-style quoted strings */ %x xb *************** *** 102,107 **** --- 105,111 ---- %x xd %x xh %x xq + %x dolq /* Bit string * It is tempting to scan the string for only those characters *************** *** 141,146 **** --- 145,159 ---- xqoctesc [\\][0-7]{1,3} xqcat {quote}{whitespace_with_newline}{quote} + /* $foo$ style quotes ("dollar quoting") + * The quoted string starts with $foo$ where "foo" is an optional string + * in the form of an identifier, except that it may not contain "$", + * and extends to the first occurrence + * of an identical string. There is *no* processing of the quoted text. + */ + dolqdelim \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$ + dolqinside [^$]+ + /* Double quote * Allows embedded spaces and other special characters into identifiers. */ *************** *** 387,392 **** --- 400,434 ---- } <xq><<EOF>> { yyerror("unterminated quoted string"); } + {dolqdelim} { + token_start = yytext; + dolqstart = pstrdup(yytext); + BEGIN(dolq); + startlit(); + } + <dolq>{dolqdelim} { + if (strcmp(yytext, dolqstart) == 0) + { + pfree(dolqstart); + BEGIN(INITIAL); + yylval.str = litbufdup(); + return SCONST; + } + /* + * When we fail to match $...$ to dolqstart, transfer + * the $... part to the output, but put back the final + * $ for rescanning. Consider $delim$...$junk$delim$ + */ + addlit(yytext, yyleng-1); + yyless(yyleng-1); + } + <dolq>{dolqinside} { + addlit(yytext, yyleng); + } + <dolq>. { + addlitchar(yytext[0]); + } + <dolq><<EOF>> { yyerror("unterminated special-quoted string"); } {xdstart} { token_start = yytext;
Andrew Dunstan <andrew@dunslane.net> writes: >> No, it won't. The problem is that it should, because the backend will >> see that as '42' followed by a $foo$ quote start. > Ok, I see what you are saying. This mismatch would only happen on > invalid input, though. I believe that what I did will work on all legal > input. I'm unconvinced. Even if there are not any current syntaxes in which a numeric literal can be adjacent to a string literal (I'm not totally sure about that), what of the future? We should solve the problem rather than assuming it won't bite us. > I think that this might be cured by having psql recognise a legal > identifier or keyword and eating it as a word, rather than treating it > as just another set of bytes in the stream. Hm, might work ... will think about it ... regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> No, it won't. The problem is that it should, because the backend will > >> see that as '42' followed by a $foo$ quote start. > > > Ok, I see what you are saying. This mismatch would only happen on > > invalid input, though. I believe that what I did will work on all legal > > input. > > I'm unconvinced. Even if there are not any current syntaxes in which a > numeric literal can be adjacent to a string literal (I'm not totally > sure about that), what of the future? We should solve the problem > rather than assuming it won't bite us. > > > I think that this might be cured by having psql recognise a legal > > identifier or keyword and eating it as a word, rather than treating it > > as just another set of bytes in the stream. > > Hm, might work ... will think about it ... I am a little concerned about adding the overhead of lex to psql. Right now, some folks have reported that lex/yacc take a considerable amount of processing time in the backend as part of a query, and adding that to psql just to do $$ seems questionable. Of course, we can alway test and see what the overhead shows. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I am a little concerned about adding the overhead of lex to psql. Right > now, some folks have reported that lex/yacc take a considerable amount > of processing time in the backend as part of a query, and adding that to > psql just to do $$ seems questionable. Of course, we can alway test and > see what the overhead shows. That's not the question to ask --- the question is whether a flex lexer will be faster or slower than the hand-made lexing code that psql is currently using. Lexer generation is a well-understood art, and you have to be pretty tense to beat out flex with hand-made code. It's the same tradeoff as trying to write better assembly code than a compiler does. Look at the lexing loop in psql/mainloop.c (that series of if-tests starting at about line 250) and ask yourself if that's really going to beat out the state-machine implementation flex uses --- which looks to be about two table lookups per character, plus a switch() executed at the end of every token. I'll bet on flex being faster. The reason the lexer shows up in the backend is that it has to grovel over every individual character of a query. For sufficiently large query strings that's gonna take awhile no matter how you do it. But in any case, the argument for moving to flex is not about performance, it is about making the code more understandable and more certain to agree with the backend lexer. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I am a little concerned about adding the overhead of lex to psql. Right > > now, some folks have reported that lex/yacc take a considerable amount > > of processing time in the backend as part of a query, and adding that to > > psql just to do $$ seems questionable. Of course, we can alway test and > > see what the overhead shows. > > That's not the question to ask --- the question is whether a flex lexer > will be faster or slower than the hand-made lexing code that psql is > currently using. Lexer generation is a well-understood art, and you > have to be pretty tense to beat out flex with hand-made code. It's the > same tradeoff as trying to write better assembly code than a compiler > does. Look at the lexing loop in psql/mainloop.c (that series of > if-tests starting at about line 250) and ask yourself if that's really > going to beat out the state-machine implementation flex uses --- which > looks to be about two table lookups per character, plus a switch() > executed at the end of every token. I'll bet on flex being faster. > > The reason the lexer shows up in the backend is that it has to grovel > over every individual character of a query. For sufficiently large > query strings that's gonna take awhile no matter how you do it. > > But in any case, the argument for moving to flex is not about > performance, it is about making the code more understandable and more > certain to agree with the backend lexer. If we go with lex/flex for psql, I would still like someone to test performance to see that we aren't taking a big hit. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073