Thread: dollar quoting
What has become of the "dollar quoting" mechanism that we had so much
discussion about back in August/September?

IIRC, a consensus was reached on the actual format of the quote
delimiters (either $$ or $identifier$), and Tom had a proof-of-concept
patch to the parser to handle it, but work was needed on psql, plpgsql,
pg_dump (and pg_restore?) + docs.

Is anyone working on it?

cheers

andrew
On Thu, 5 Feb 2004, Andrew Dunstan wrote:

> What has become of the "dollar quoting" mechanism that we had so much
> discussion about back in August/September?
>
> IIRC, a consensus was reached on the actual format of the quote
> delimiters (either $$ or $identifier$), and Tom had a proof of concept
> patch to the parser to handle it, but work was needed on psql, plpgsql,
> pg_dump (and pg_restore?) + docs.
>
> Is anyone working on it?

I am, but not very quickly. I hope to have some time in the next month,
but if someone else beats me to it I'll just be happy it got done.

Jon
Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>>IIRC, a consensus was reached on the actual format of the quote
>>delimiters (either $$ or $identifier$), and Tom had a proof of concept
>>patch to the parser to handle it, but work was needed on psql, plpgsql,
>>pg_dump (and pg_restore?) + docs.
>
>I think someone has to fix psql before we can consider applying the
>backend patch.  Fixing the other stuff can come after.

Makes sense.

>>Is anyone working on it?
>
>I kinda thought you had volunteered to work on the psql part...

I don't recall being that specific, but you could be right. In any case,
I didn't want to trip over anyone else, which is why I asked.

I will try to coordinate with Jon.

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> IIRC, a consensus was reached on the actual format of the quote
> delimiters (either $$ or $identifier$), and Tom had a proof of concept
> patch to the parser to handle it, but work was needed on psql, plpgsql,
> pg_dump (and pg_restore?) + docs.

I think someone has to fix psql before we can consider applying the
backend patch.  Fixing the other stuff can come after.

> Is anyone working on it?

I kinda thought you had volunteered to work on the psql part...

			regards, tom lane
Andrew Dunstan wrote:

> Tom Lane wrote:
>
>> I kinda thought you had volunteered to work on the psql part...
>
> I don't recall being that specific, but you could be right. In any
> case, I didn't want to trip over anyone else, which is why I asked.
>
> I will try to coordinate with Jon.

After staring at the code for a long time, I think I see how to do this.
It's complicated a bit by the fact that $ is a valid identifier
character. So my current thinking is to say that if we see a $ not in a
quote and not preceded by a valid identifier char, then it is the start
of a $foo$ sequence. Or have I missed something? Can we validly see $ in
any other context?

BTW, Tom's proof-of-concept patch worked just fine for me. I changed the
allowed pattern to what I think was agreed:

	dolqdelim	\$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$

and changed some names and comments to remove misleading references to
"here docs".

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> After staring at the code for a long time, I think I see how to do this.
> It's complicated a bit by the fact that $ is a valid identifier
> character. So my current thinking is to say that if we see $ not in a
> quote and not preceded by a valid identifier char then it is the start
> of a $foo$ sequence. Or have I missed something? Can we validly see $ in
> any other context?

Right, a $ should be considered to start a quote marker only if it's not
part of an identifier.  The backend lexer doesn't have a problem with
this because it's written in flex, but I can imagine that getting it
right in psql's ad-hoc parser might be tricky.

There was some discussion awhile back of converting psql to use flex
for interpreting its input, but I dunno how practical that really is.
I don't know how you get flex to do reasonable stuff with an incomplete
input string.  Still, it might be worth looking into.

			regards, tom lane
Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>>After staring at the code for a long time, I think I see how to do this.
>>It's complicated a bit by the fact that $ is a valid identifier
>>character. So my current thinking is to say that if we see $ not in a
>>quote and not preceded by a valid identifier char then it is the start
>>of a $foo$ sequence. Or have I missed something? Can we validly see $ in
>>any other context?
>
>Right, a $ should be considered to start a quote marker only if it's not
>part of an identifier. The backend lexer doesn't have a problem with
>this because it's written in flex, but I can imagine that getting it
>right in psql's ad-hoc parser might be tricky.

I think it's doable, though. I seem to have a working patch, which I
will send out for review soon.

>There was some discussion awhile back of converting psql to use flex
>for interpreting its input, but I dunno how practical that really is.
>I don't know how you get flex to do reasonable stuff with an incomplete
>input string. Still, it might be worth looking into.

That's what made me not even think about it. If someone better versed in
this stuff than me wants to do it, then more power to them.

cheers

andrew
Andrew Dunstan wrote:

> Tom Lane wrote:
>
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>
>>> After staring at the code for a long time, I think I see how to do
>>> this. It's complicated a bit by the fact that $ is a valid
>>> identifier character. So my current thinking is to say that if we
>>> see $ not in a quote and not preceded by a valid identifier char
>>> then it is the start of a $foo$ sequence. Or have I missed
>>> something? Can we validly see $ in any other context?

I had missed one, though - numbered params in prepared statements. Fixed
in attached patch.

>> Right, a $ should be considered to start a quote marker only if it's not
>> part of an identifier.  The backend lexer doesn't have a problem with
>> this because it's written in flex, but I can imagine that getting it
>> right in psql's ad-hoc parser might be tricky.
>
> I think it's doable, though. I seem to have a working patch, which I
> will send out for review soon.

Proof-of-concept patch (i.e. not for application) attached for review.
The scanner changes are based on Tom's original, with some name/comment
changes and a more liberal pattern. The psql changes are all my own
work :-).

Comments welcome. Reviewers: I am not sure I got the multi-byte stuff
right in psql/mainloop.c - please pay close attention to that.

If I'm not wildly off course I will polish this up and start on docs.
cheers

andrew

Index: src/backend/parser/scan.l
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v
retrieving revision 1.112
diff -c -w -r1.112 scan.l
*** src/backend/parser/scan.l	29 Nov 2003 19:51:52 -0000	1.112
--- src/backend/parser/scan.l	8 Feb 2004 19:59:24 -0000
***************
*** 39,44 ****
--- 39,46 ----
  static int	xcdepth = 0;	/* depth of nesting in slash-star comments */

+ static char *dolqstart;		/* current $foo$ quote start string */
+
  /*
   * literalbuf is used to accumulate literal values when multiple rules
   * are needed to parse a single literal.  Call startlit to reset buffer
***************
*** 95,100 ****
--- 97,103 ----
   *  <xd> delimited identifiers (double-quoted identifiers)
   *  <xh> hexadecimal numeric string
   *  <xq> quoted strings
+  *  <dolq> $foo$-style quoted strings
   */

  %x xb
***************
*** 102,107 ****
--- 105,111 ----
  %x xd
  %x xh
  %x xq
+ %x dolq

  /* Bit string
   * It is tempting to scan the string for only those characters
***************
*** 141,146 ****
--- 145,159 ----
  xqoctesc		[\\][0-7]{1,3}
  xqcat			{quote}{whitespace_with_newline}{quote}

+ /* $foo$ style quotes ("dollar quoting")
+  * The quoted string starts with $foo$ where "foo" is an optional string
+  * in the form of an identifier, except that it may not contain "$",
+  * and extends to the first occurrence of an identical string.
+  * There is *no* processing of the quoted text.
+  */
+ dolqdelim		\$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$
+ dolqinside		[^$]+
+
  /* Double quote
   * Allows embedded spaces and other special characters into identifiers.
   */
***************
*** 387,392 ****
--- 400,434 ----
  				}
  <xq><<EOF>>		{ yyerror("unterminated quoted string"); }

+ {dolqdelim}		{
+ 					token_start = yytext;
+ 					dolqstart = pstrdup(yytext);
+ 					BEGIN(dolq);
+ 					startlit();
+ 				}
+ <dolq>{dolqdelim} {
+ 					if (strcmp(yytext, dolqstart) == 0)
+ 					{
+ 						pfree(dolqstart);
+ 						BEGIN(INITIAL);
+ 						yylval.str = litbufdup();
+ 						return SCONST;
+ 					}
+ 					/*
+ 					 * When we fail to match $...$ to dolqstart, transfer
+ 					 * the $... part to the output, but put back the final
+ 					 * $ for rescanning.  Consider $delim$...$junk$delim$
+ 					 */
+ 					addlit(yytext, yyleng-1);
+ 					yyless(yyleng-1);
+ 				}
+ <dolq>{dolqinside} {
+ 					addlit(yytext, yyleng);
+ 				}
+ <dolq>.			{
+ 					addlitchar(yytext[0]);
+ 				}
+ <dolq><<EOF>>	{ yyerror("unterminated special-quoted string"); }

  {xdstart}		{
  					token_start = yytext;
Index: src/bin/psql/mainloop.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v
retrieving revision 1.61
diff -c -w -r1.61 mainloop.c
*** src/bin/psql/mainloop.c	25 Jan 2004 03:07:22 -0000	1.61
--- src/bin/psql/mainloop.c	8 Feb 2004 19:59:44 -0000
***************
*** 49,54 ****
--- 49,58 ----
  	unsigned int query_start;
  	volatile int count_eof = 0;
  	volatile unsigned int bslash_count = 0;
+ 	volatile bool free_dolquote = false;
+ 	char	   *dol_quote = NULL;
+ 	char	   *dol_end;
+
  	int			i,
  				prevlen,
***************
*** 120,125 ****
--- 124,130 ----
  				in_quote = 0;
  				paren_level = 0;
  				count_eof = 0;
+ 				free_dolquote = true;
  				slashCmdStatus = CMD_UNKNOWN;
  			}
  			else
***************
*** 136,141 ****
--- 141,156 ----
  		pqsignal(SIGINT, handle_sigint);	/* control-C => cancel */
  #endif   /* not WIN32 */

+ 		if (free_dolquote)
+ 		{
+ 			if (dol_quote)
+ 			{
+ 				free(dol_quote);
+ 				dol_quote = NULL;
+ 			}
+ 			free_dolquote = false;
+ 		}
+
  		fflush(stdout);

  		if (slashCmdStatus == CMD_NEWEDIT)
***************
*** 150,155 ****
--- 165,175 ----
  			in_xcomment = 0;
  			in_quote = 0;
  			paren_level = 0;
+ 			if (dol_quote)
+ 			{
+ 				free(dol_quote);
+ 				dol_quote = NULL;
+ 			}
  			slashCmdStatus = CMD_UNKNOWN;
  		}
***************
*** 161,167 ****
  		{
  			int			prompt_status;

! 			if (in_quote && in_quote == '\'')
  				prompt_status = PROMPT_SINGLEQUOTE;
  			else if (in_quote && in_quote == '"')
  				prompt_status = PROMPT_DOUBLEQUOTE;
--- 181,189 ----
  		{
  			int			prompt_status;

! 			if (dol_quote)
! 				prompt_status = PROMPT_DOLLARQUOTE;
! 			else if (in_quote && in_quote == '\'')
  				prompt_status = PROMPT_SINGLEQUOTE;
  			else if (in_quote && in_quote == '"')
  				prompt_status = PROMPT_DOUBLEQUOTE;
***************
*** 268,273 ****
--- 290,339 ----
  				in_quote = 0;
  			}

+ 			/*
+ 			 * start of $foo$ type quote
+ 			 * must be a $ not preceded by a valid identifier character,
+ 			 * not inside a quote, not succeeded by a digit,
+ 			 * and with a terminating $ somewhere on the line.
+ 			 *
+ 			 * if an invalid $foo$ meeting these requirements is seen, it won't
+ 			 * be trapped here but will instead generate an SQL syntax error.
+ 			 */
+
+ 			else if (!dol_quote && line[i] == '$' &&
+ 					 !isdigit(line[i + thislen]) &&
+ 					 (dol_end = strchr(line+i+1,'$')) != NULL &&
+ 					 (i == 0 ||
+ 					  ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
+ 						 line[i-1] == '_')))
+ 			{
+ 				char		eos;
+
+ 				dol_end++;
+ 				eos = *(dol_end);
+ 				*dol_end = '\0';
+ 				dol_quote = strdup(line+i);
+ 				*dol_end = eos;
+ 				ADVANCE_1;
+ 				while (line[i] != '$')
+ 					ADVANCE_1;
+
+ 			}
+
+ 			/* in or end of $foo$ type quote? */
+
+ 			else if (dol_quote)
+ 			{
+ 				if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0)
+ 				{
+ 					ADVANCE_1;
+ 					while (line[i] != '$')
+ 						ADVANCE_1;
+ 					free(dol_quote);
+ 					dol_quote = NULL;
+ 				}
+ 			}
+
  			/* start of extended comment? */
  			else if (line[i] == '/' && line[i + thislen] == '*')
  			{
***************
*** 291,297 ****
  			/* single-line comment? truncate line */
  			else if (line[i] == '-' && line[i + thislen] == '-')
  			{
! 				line[i] = '\0';		/* remove comment */
  				break;
  			}
--- 357,363 ----
  			/* single-line comment? truncate line */
  			else if (line[i] == '-' && line[i + thislen] == '-')
  			{
! 				line[i] = '\0';		/* removae comment */
  				break;
  			}
***************
*** 458,464 ****

  		/* Put the rest of the line in the query buffer. */
! 		if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
  		{
  			if (query_buf->len > 0)
  				appendPQExpBufferChar(query_buf, '\n');
--- 524,531 ----

  		/* Put the rest of the line in the query buffer. */
! 		if (in_quote || dol_quote ||
! 			line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
  		{
  			if (query_buf->len > 0)
  				appendPQExpBufferChar(query_buf, '\n');
Index: src/bin/psql/prompt.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v
retrieving revision 1.34
diff -c -w -r1.34 prompt.c
*** src/bin/psql/prompt.c	25 Jan 2004 03:07:22 -0000	1.34
--- src/bin/psql/prompt.c	8 Feb 2004 19:59:44 -0000
***************
*** 85,90 ****
--- 85,91 ----
  		case PROMPT_CONTINUE:
  		case PROMPT_SINGLEQUOTE:
  		case PROMPT_DOUBLEQUOTE:
+ 		case PROMPT_DOLLARQUOTE:
  		case PROMPT_COMMENT:
  		case PROMPT_PAREN:
  			prompt_name = "PROMPT2";
***************
*** 198,203 ****
--- 199,207 ----
  							break;
  						case PROMPT_DOUBLEQUOTE:
  							buf[0] = '"';
+ 							break;
+ 						case PROMPT_DOLLARQUOTE:
+ 							buf[0] = '$';
  							break;
  						case PROMPT_COMMENT:
  							buf[0] = '*';
Index: src/bin/psql/prompt.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v
retrieving revision 1.13
diff -c -w -r1.13 prompt.h
*** src/bin/psql/prompt.h	29 Nov 2003 19:52:07 -0000	1.13
--- src/bin/psql/prompt.h	8 Feb 2004 19:59:44 -0000
***************
*** 15,20 ****
--- 15,21 ----
  	PROMPT_COMMENT,
  	PROMPT_SINGLEQUOTE,
  	PROMPT_DOUBLEQUOTE,
+ 	PROMPT_DOLLARQUOTE,
  	PROMPT_PAREN,
  	PROMPT_COPY
  } promptStatus_t;
Andrew Dunstan <andrew@dunslane.net> writes:
> Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
> in psql/mainloop.c - please pay close attention to that.

The i-1 stuff should generally be i-prevlen.  Not sure if there are any
other pitfalls.

A bigger problem here:

> +			else if (!dol_quote && line[i] == '$' &&
> +					 !isdigit(line[i + thislen]) &&
> +					 (dol_end = strchr(line+i+1,'$')) != NULL &&
> +					 (i == 0 ||
> +					  ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
> +						 line[i-1] == '_')))
> +			{

is that you aren't checking that what comes between the two dollar signs
looks like empty-or-an-identifier.  The check for next-char-isn't-a-digit
is part of that but not the only part.

Also I'm not sure about the positioning of these tests relative to the
in_quote and in_xcomment tests.  As you have it, $foo$ will be
recognized within an xcomment, which I think is at variance with the
proposed backend lexing behavior.

Also, the strdup should be pg_strdup.

			regards, tom lane
Tom Lane wrote:

>A bigger problem here:
>
>>+			else if (!dol_quote && line[i] == '$' &&
>>+					 !isdigit(line[i + thislen]) &&
>>+					 (dol_end = strchr(line+i+1,'$')) != NULL &&
>>+					 (i == 0 ||
>>+					  ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
>>+						 line[i-1] == '_')))
>>+			{
>
>is that you aren't checking that what comes between the two dollar signs
>looks like empty-or-an-identifier. The check for
>next-char-isn't-a-digit is part of that but not the only part.

Well, I think the right way to do a full check would be with a regex,
which I had hoped to avoid. However, I will now try to get one working
and to address your other concerns.

Thanks for the comments.

cheers

andrew
I think the attached patch addresses Tom's comments.

I ended up not using a regex, which seemed a little heavy-handed, but
instead wrote a small custom recognition function that should (and I
think does) mimic the pattern recognition the backend lexer uses for
these tokens.

This patch just puts that function in mainloop.c, but perhaps it belongs
elsewhere (string_utils.c maybe?). I don't have strong opinions on that.

Enjoy

andrew

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>>Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
>>in psql/mainloop.c - please pay close attention to that.
>
>The i-1 stuff should generally be i-prevlen. Not sure if there are any
>other pitfalls.
>
>A bigger problem here:
>
>>+			else if (!dol_quote && line[i] == '$' &&
>>+					 !isdigit(line[i + thislen]) &&
>>+					 (dol_end = strchr(line+i+1,'$')) != NULL &&
>>+					 (i == 0 ||
>>+					  ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
>>+						 line[i-1] == '_')))
>>+			{
>
>is that you aren't checking that what comes between the two dollar signs
>looks like empty-or-an-identifier. The check for
>next-char-isn't-a-digit is part of that but not the only part.
>
>Also I'm not sure about the positioning of these tests relative to the
>in_quote and in_xcomment tests. As you have it, $foo$ will be
>recognized within an xcomment, which I think is at variance with the
>proposed backend lexing behavior.
>
>Also, the strdup should be pg_strdup.
>
>			regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Index: src/backend/parser/scan.l
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v
retrieving revision 1.112
diff -c -w -r1.112 scan.l
*** src/backend/parser/scan.l	29 Nov 2003 19:51:52 -0000	1.112
--- src/backend/parser/scan.l	9 Feb 2004 15:26:34 -0000
***************
*** 39,44 ****
--- 39,46 ----
  static int	xcdepth = 0;	/* depth of nesting in slash-star comments */

+ static char *dolqstart;		/* current $foo$ quote start string */
+
  /*
   * literalbuf is used to accumulate literal values when multiple rules
   * are needed to parse a single literal.  Call startlit to reset buffer
***************
*** 95,100 ****
--- 97,103 ----
   *  <xd> delimited identifiers (double-quoted identifiers)
   *  <xh> hexadecimal numeric string
   *  <xq> quoted strings
+  *  <dolq> $foo$-style quoted strings
   */

  %x xb
***************
*** 102,107 ****
--- 105,111 ----
  %x xd
  %x xh
  %x xq
+ %x dolq

  /* Bit string
   * It is tempting to scan the string for only those characters
***************
*** 141,146 ****
--- 145,159 ----
  xqoctesc		[\\][0-7]{1,3}
  xqcat			{quote}{whitespace_with_newline}{quote}

+ /* $foo$ style quotes ("dollar quoting")
+  * The quoted string starts with $foo$ where "foo" is an optional string
+  * in the form of an identifier, except that it may not contain "$",
+  * and extends to the first occurrence of an identical string.
+  * There is *no* processing of the quoted text.
+  */
+ dolqdelim		\$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$
+ dolqinside		[^$]+
+
  /* Double quote
   * Allows embedded spaces and other special characters into identifiers.
   */
***************
*** 387,392 ****
--- 400,434 ----
  				}
  <xq><<EOF>>		{ yyerror("unterminated quoted string"); }

+ {dolqdelim}		{
+ 					token_start = yytext;
+ 					dolqstart = pstrdup(yytext);
+ 					BEGIN(dolq);
+ 					startlit();
+ 				}
+ <dolq>{dolqdelim} {
+ 					if (strcmp(yytext, dolqstart) == 0)
+ 					{
+ 						pfree(dolqstart);
+ 						BEGIN(INITIAL);
+ 						yylval.str = litbufdup();
+ 						return SCONST;
+ 					}
+ 					/*
+ 					 * When we fail to match $...$ to dolqstart, transfer
+ 					 * the $... part to the output, but put back the final
+ 					 * $ for rescanning.  Consider $delim$...$junk$delim$
+ 					 */
+ 					addlit(yytext, yyleng-1);
+ 					yyless(yyleng-1);
+ 				}
+ <dolq>{dolqinside} {
+ 					addlit(yytext, yyleng);
+ 				}
+ <dolq>.			{
+ 					addlitchar(yytext[0]);
+ 				}
+ <dolq><<EOF>>	{ yyerror("unterminated special-quoted string"); }

  {xdstart}		{
  					token_start = yytext;
Index: src/bin/psql/mainloop.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v
retrieving revision 1.61
diff -c -w -r1.61 mainloop.c
*** src/bin/psql/mainloop.c	25 Jan 2004 03:07:22 -0000	1.61
--- src/bin/psql/mainloop.c	9 Feb 2004 15:26:51 -0000
***************
*** 21,26 ****
--- 21,61 ----
  sigjmp_buf	main_loop_jmp;
  #endif

+ /*
+  * function to detect a valid $foo$ quote delimiter at the start of the
+  * parameter dquote.
+  */
+
+ static bool valid_dolquote(char * dquote)
+ {
+ 	int i;
+
+ 	/* must start with a $ */
+ 	if (dquote[0] != '$')
+ 		return false;
+
+ 	/* empty 'identifier' case */
+ 	if (dquote[1] == '$')
+ 		return true;
+
+ 	/* first 'identifier' char must be a letter or have high bit set */
+ 	if (!isalpha(dquote[1]) && (dquote[1] & 0x80) == 0)
+ 		return false;
+
+ 	/* subsequent chars must be alphanumeric or _ or have high bit set */
+ 	for (i = 2; dquote[i] != '$'; i++)
+ 	{
+ 		if ((dquote[i] & 0x80) == 0 && !isalnum(dquote[i]) &&
+ 			dquote[i] != '_')
+ 		{
+ 			/* we found an invalid character */
+ 			return false;
+ 		}
+ 	}
+
+ 	return true;
+ }
+
  /*
   * Main processing loop for reading lines of input
***************
*** 49,54 ****
--- 84,92 ----
  	unsigned int query_start;
  	volatile int count_eof = 0;
  	volatile unsigned int bslash_count = 0;
+ 	volatile bool free_dolquote = false;
+ 	char	   *dol_quote = NULL;
+
  	int			i,
  				prevlen,
***************
*** 120,125 ****
--- 158,164 ----
  				in_quote = 0;
  				paren_level = 0;
  				count_eof = 0;
+ 				free_dolquote = true;
  				slashCmdStatus = CMD_UNKNOWN;
  			}
  			else
***************
*** 136,141 ****
--- 175,190 ----
  		pqsignal(SIGINT, handle_sigint);	/* control-C => cancel */
  #endif   /* not WIN32 */

+ 		if (free_dolquote)
+ 		{
+ 			if (dol_quote)
+ 			{
+ 				free(dol_quote);
+ 				dol_quote = NULL;
+ 			}
+ 			free_dolquote = false;
+ 		}
+
  		fflush(stdout);

  		if (slashCmdStatus == CMD_NEWEDIT)
***************
*** 150,155 ****
--- 199,209 ----
  			in_xcomment = 0;
  			in_quote = 0;
  			paren_level = 0;
+ 			if (dol_quote)
+ 			{
+ 				free(dol_quote);
+ 				dol_quote = NULL;
+ 			}
  			slashCmdStatus = CMD_UNKNOWN;
  		}
***************
*** 161,167 ****
  		{
  			int			prompt_status;

! 			if (in_quote && in_quote == '\'')
  				prompt_status = PROMPT_SINGLEQUOTE;
  			else if (in_quote && in_quote == '"')
  				prompt_status = PROMPT_DOUBLEQUOTE;
--- 215,223 ----
  		{
  			int			prompt_status;

! 			if (dol_quote)
! 				prompt_status = PROMPT_DOLLARQUOTE;
! 			else if (in_quote && in_quote == '\'')
  				prompt_status = PROMPT_SINGLEQUOTE;
  			else if (in_quote && in_quote == '"')
  				prompt_status = PROMPT_DOUBLEQUOTE;
***************
*** 268,273 ****
--- 324,343 ----
  				in_quote = 0;
  			}

+ 			/* in or end of $foo$ type quote? */
+
+ 			else if (dol_quote)
+ 			{
+ 				if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0)
+ 				{
+ 					ADVANCE_1;
+ 					while (line[i] != '$')
+ 						ADVANCE_1;
+ 					free(dol_quote);
+ 					dol_quote = NULL;
+ 				}
+ 			}
+
  			/* start of extended comment? */
  			else if (line[i] == '/' && line[i + thislen] == '*')
  			{
***************
*** 288,297 ****
  			else if (line[i] == '\'' || line[i] == '"')
  				in_quote = line[i];

  			/* single-line comment? truncate line */
  			else if (line[i] == '-' && line[i + thislen] == '-')
  			{
! 				line[i] = '\0';		/* remove comment */
  				break;
  			}
--- 358,395 ----
  			else if (line[i] == '\'' || line[i] == '"')
  				in_quote = line[i];

+ 			/*
+ 			 * start of $foo$ type quote?
+ 			 *
+ 			 * must not be preceded by a valid identifier character
+ 			 */
+
+ 			else if (!dol_quote && valid_dolquote(line+i) &&
+ 					 (i == 0 ||
+ 					  ! ((line[i-prevlen] & 0x80) != 0 ||
+ 						 isalnum(line[i-prevlen]) ||
+ 						 line[i-prevlen] == '_' ||
+ 						 line[i-prevlen] == '$' )))
+ 			{
+ 				char	   *dol_end;
+ 				char		eos;
+
+ 				dol_end = strchr(line+i+1,'$');
+ 				dol_end++;
+ 				eos = *dol_end;
+ 				*dol_end = '\0';
+ 				dol_quote = pg_strdup(line+i);
+ 				*dol_end = eos;
+ 				ADVANCE_1;
+ 				while (line[i] != '$')
+ 					ADVANCE_1;
+
+ 			}
+
  			/* single-line comment? truncate line */
  			else if (line[i] == '-' && line[i + thislen] == '-')
  			{
! 				line[i] = '\0';		/* removae comment */
  				break;
  			}
***************
*** 458,464 ****

  		/* Put the rest of the line in the query buffer. */
! 		if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
  		{
  			if (query_buf->len > 0)
  				appendPQExpBufferChar(query_buf, '\n');
--- 556,563 ----

  		/* Put the rest of the line in the query buffer. */
! 		if (in_quote || dol_quote ||
! 			line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
  		{
  			if (query_buf->len > 0)
  				appendPQExpBufferChar(query_buf, '\n');
Index: src/bin/psql/prompt.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v
retrieving revision 1.34
diff -c -w -r1.34 prompt.c
*** src/bin/psql/prompt.c	25 Jan 2004 03:07:22 -0000	1.34
--- src/bin/psql/prompt.c	9 Feb 2004 15:26:51 -0000
***************
*** 85,90 ****
--- 85,91 ----
  		case PROMPT_CONTINUE:
  		case PROMPT_SINGLEQUOTE:
  		case PROMPT_DOUBLEQUOTE:
+ 		case PROMPT_DOLLARQUOTE:
  		case PROMPT_COMMENT:
  		case PROMPT_PAREN:
  			prompt_name = "PROMPT2";
***************
*** 198,203 ****
--- 199,207 ----
  							break;
  						case PROMPT_DOUBLEQUOTE:
  							buf[0] = '"';
+ 							break;
+ 						case PROMPT_DOLLARQUOTE:
+ 							buf[0] = '$';
  							break;
  						case PROMPT_COMMENT:
  							buf[0] = '*';
Index: src/bin/psql/prompt.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v
retrieving revision 1.13
diff -c -w -r1.13 prompt.h
*** src/bin/psql/prompt.h	29 Nov 2003 19:52:07 -0000	1.13
--- src/bin/psql/prompt.h	9 Feb 2004 15:26:51 -0000
***************
*** 15,20 ****
--- 15,21 ----
  	PROMPT_COMMENT,
  	PROMPT_SINGLEQUOTE,
  	PROMPT_DOUBLEQUOTE,
+ 	PROMPT_DOLLARQUOTE,
  	PROMPT_PAREN,
  	PROMPT_COPY
  } promptStatus_t;
Andrew Dunstan <andrew@dunslane.net> writes:
> I ended up not using a regex, which seemed to be a little heavy handed,
> but just writing a small custom recognition function, that should (and I
> think does) mimic the pattern recognition for these tokens used by the
> backend lexer.

I looked at this and realized that it still doesn't do very well at
distinguishing $foo$ from other random uses of $.  The problem is that
looking back at just the immediately preceding character isn't enough
context to tell whether a $ is part of an identifier.  Consider the input

	a42$foo$

This is a legal identifier according to PG 7.4.  But how about

	42$foo$

This is a syntax error in 7.4, and we propose to redefine it as an
integer literal '42' followed by a dollar-quote start symbol.  There's
no way to tell these apart with a single-character lookback, or indeed
any fixed number of characters of lookback.

I begin to think that we'll really have to bite the bullet and convert
psql's input parser to use flex.  If we're not scanning with exactly the
same rules as the backend uses, we're going to get the wrong answers.

			regards, tom lane
Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>>I ended up not using a regex, which seemed to be a little heavy handed,
>>but just writing a small custom recognition function, that should (and I
>>think does) mimic the pattern recognition for these tokens used by the
>>backend lexer.
>
>I looked at this and realized that it still doesn't do very well at
>distinguishing $foo$ from other random uses of $. The problem is that
>looking back at just the immediately preceding character isn't enough
>context to tell whether a $ is part of an identifier. Consider the input
>	a42$foo$
>This is a legal identifier according to PG 7.4. But how about
>	42$foo$
>This is a syntax error in 7.4, and we propose to redefine it as an
>integer literal '42' followed by a dollar-quote start symbol.

The test in the patch I sent is this:

	else if (!dol_quote && valid_dolquote(line+i) &&
			 (i == 0 ||
			  ! ((line[i-prevlen] & 0x80) != 0 ||
				 isalnum(line[i-prevlen]) ||
				 line[i-prevlen] == '_' ||
				 line[i-prevlen] == '$' )))

The test should not succeed anywhere in the string '42$foo$'. Note that
psql does not change any '$foo$' at all - it just passes it to the
backend. The reason we need this at all in psql is that it has to detect
the end of a statement, and it has to prompt correctly, and to do that
it needs to know whether we are in a quote (single, double, dollar) or
a comment. psql does not detect many syntax errors, or even lexical
errors - that is the job of the backend - rightly so, I believe.

>There's no way to tell these apart with a single-character lookback,
>or indeed any fixed number of characters of lookback.

I'm still not convinced, although maybe there's something I'm not getting.

>I begin to think that we'll really have to bite the bullet and convert
>psql's input parser to use flex. If we're not scanning with exactly the
>same rules as the backend uses, we're going to get the wrong answers.

Interacting with lexer states would probably be unpleasant. Matching a
stream-oriented lexer with a line-oriented CLI would be messy, I suspect.

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> ... But how about
>>	42$foo$
>> This is a syntax error in 7.4, and we propose to redefine it as an
>> integer literal '42' followed by a dollar-quote start symbol.

> The test should not succeed anywhere in the string '42$foo$'.

No, it won't.  The problem is that it should, because the backend will
see that as '42' followed by a $foo$ quote start.

> Interacting with lexer states would probably be ... unpleasant. Matching
> a stream oriented lexer with a line oriented CLI would be messy I suspect.

I think it would not be that bad.  We'd have to run the lexer on the
command input buffer and see what state it terminates in.

			regards, tom lane
Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>>Tom Lane wrote:
>>
>>>... But how about
>>>	42$foo$
>>>This is a syntax error in 7.4, and we propose to redefine it as an
>>>integer literal '42' followed by a dollar-quote start symbol.
>
>>The test should not succeed anywhere in the string '42$foo$'.
>
>No, it won't. The problem is that it should, because the backend will
>see that as '42' followed by a $foo$ quote start.

Ok, I see what you are saying. This mismatch would only happen on
invalid input, though. I believe that what I did will work on all legal
input.

I think this might be cured by having psql recognise a legal identifier
or keyword and eating it as a word, rather than treating it as just
another set of bytes in the stream. That would enable us to avoid the
lookback in the dollar-quote recognition test altogether. The attached
patch does it that way - the keyword/id test needs to come right at the
end of the loop to avoid clashing with backslash commands, btw. I
*think* that this way psql will recognise the start of a dollar quote
iff the backend lexer would.

>>Interacting with lexer states would probably be ... unpleasant. Matching
>>a stream oriented lexer with a line oriented CLI would be messy I suspect.
>
>I think it would not be that bad. We'd have to run the lexer on the
>command input buffer and see what state it terminates in.

Yeah. I am not enough of a flex wizard to undertake the task, though. It
would take me lots of time. If we make a decision that we really need
this in order to do dollar quoting in psql, I would need some
substantial help, at least.
cheers

andrew

Index: src/bin/psql/mainloop.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v
retrieving revision 1.61
diff -c -r1.61 mainloop.c
*** src/bin/psql/mainloop.c	25 Jan 2004 03:07:22 -0000	1.61
--- src/bin/psql/mainloop.c	15 Feb 2004 14:28:02 -0000
***************
*** 21,26 ****
--- 21,61 ----
  sigjmp_buf	main_loop_jmp;
  #endif

+ /*
+  * function to detect a valid $foo$ quote delimiter at the start of the
+  * parameter dquote.
+  */
+
+ static bool valid_dolquote(char * dquote)
+ {
+ 	int i;
+
+ 	/* must start with a $ */
+ 	if (dquote[0] != '$')
+ 		return false;
+
+ 	/* empty 'identifier' case */
+ 	if (dquote[1] == '$')
+ 		return true;
+
+ 	/* first 'identifier' char must be a letter or have high bit set */
+ 	if (!isalpha(dquote[1]) && (dquote[1] & 0x80) == 0)
+ 		return false;
+
+ 	/* subsequent chars must be alphanumeric or _ or have high bit set */
+ 	for (i = 2; dquote[i] != '$'; i++)
+ 	{
+ 		if ((dquote[i] & 0x80) == 0 && !isalnum(dquote[i]) &&
+ 			dquote[i] != '_')
+ 		{
+ 			/* we found an invalid character */
+ 			return false;
+ 		}
+ 	}
+
+ 	return true;
+ }
+
  /*
   * Main processing loop for reading lines of input
***************
*** 49,54 ****
--- 84,92 ----
  	unsigned int query_start;
  	volatile int count_eof = 0;
  	volatile unsigned int bslash_count = 0;
+ 	volatile bool free_dolquote = false;
+ 	char	   *dol_quote = NULL;
+
  	int			i,
  				prevlen,
***************
*** 120,125 ****
--- 158,164 ----
  				in_quote = 0;
  				paren_level = 0;
  				count_eof = 0;
+ 				free_dolquote = true;
  				slashCmdStatus = CMD_UNKNOWN;
  			}
  			else
***************
*** 136,141 ****
--- 175,190 ----
  		pqsignal(SIGINT, handle_sigint);	/* control-C => cancel */
  #endif   /* not WIN32 */

+ 		if (free_dolquote)
+ 		{
+ 			if (dol_quote)
+ 			{
+ 				free(dol_quote);
+ 				dol_quote = NULL;
+ 			}
+ 			free_dolquote = false;
+ 		}
+
  		fflush(stdout);

  		if (slashCmdStatus == CMD_NEWEDIT)
***************
*** 150,155 ****
--- 199,209 ----
  			in_xcomment = 0;
  			in_quote = 0;
  			paren_level = 0;
+ 			if (dol_quote)
+ 			{
+ 				free(dol_quote);
+ 				dol_quote = NULL;
+ 			}
  			slashCmdStatus = CMD_UNKNOWN;
  		}
***************
*** 161,167 ****
  		{
  			int			prompt_status;

! 			if (in_quote && in_quote == '\'')
  				prompt_status = PROMPT_SINGLEQUOTE;
  			else if (in_quote && in_quote == '"')
  				prompt_status = PROMPT_DOUBLEQUOTE;
--- 215,223 ----
  		{
  			int			prompt_status;

! 			if (dol_quote)
! 				prompt_status = PROMPT_DOLLARQUOTE;
! 			else if (in_quote && in_quote == '\'')
  				prompt_status = PROMPT_SINGLEQUOTE;
  			else if (in_quote && in_quote == '"')
  				prompt_status = PROMPT_DOUBLEQUOTE;
***************
*** 268,273 ****
--- 324,343 ----
  				in_quote = 0;
  			}

+ 			/* in or end of $foo$ type quote? */
+
+ 			else if (dol_quote)
+ 			{
+ 				if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0)
+ 				{
+ 					ADVANCE_1;
+ 					while (line[i] != '$')
+ 						ADVANCE_1;
+ 					free(dol_quote);
+ 					dol_quote = NULL;
+ 				}
+ 			}
+
  			/* start of extended comment? */
  			else if (line[i] == '/' && line[i + thislen] == '*')
  			{
***************
*** 288,293 ****
--- 358,383 ----
  			else if (line[i] == '\'' || line[i] == '"')
  				in_quote = line[i];

+ 			/*
+ 			 * start of $foo$ type quote?
+ 			 */
+ 			else if (!dol_quote && valid_dolquote(line+i))
+ 			{
+ 				char	   *dol_end;
+ 				char		eos;
+
+ 				dol_end = strchr(line+i+1,'$');
+ 				dol_end++;
+ 				eos = *dol_end;
+ 				*dol_end = '\0';
+ 				dol_quote = pg_strdup(line+i);
+ 				*dol_end = eos;
+ 				ADVANCE_1;
+ 				while (line[i] != '$')
+ 					ADVANCE_1;
+
+ 			}
+
  			/* single-line comment? truncate line */
  			else if (line[i] == '-' && line[i + thislen] == '-')
  			{
***************
*** 447,452 ****
--- 537,566 ----
  				i = end_of_cmd - line;
  				query_start = i;
  			}
+
+ 			/*
+ 			 * keyword or identifier?
+ 			 * We grab the whole string so that we don't
+ 			 * mistakenly see $foo$ inside an identifier as the start
+ 			 * of a dollar quote.
+ 			 */
+
+ 			else if ( (line[i] & 0x80) != 0 ||
+ 					  isalpha(line[i]) ||
+ 					  line[i] == '_')
+ 			{
+ 				while ((line[i+thislen] & 0x80) != 0 ||
+ 					   isalnum(line[i+thislen]) ||
+ 					   line[i+thislen] == '_' ||
+ 					   line[i+thislen] == '$' )
+ 				{
+ 					/* keep going while we still have identifier chars */
+ 					ADVANCE_1;
+ 				}
+
+
+ 			}
+
  		} /* for (line) */
***************
*** 458,464 ****

  		/* Put the rest of the line in the query buffer. */
! 		if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
  		{
  			if (query_buf->len > 0)
  				appendPQExpBufferChar(query_buf, '\n');
--- 572,579 ----

  		/* Put the rest of the line in the query buffer. */
! 		if (in_quote || dol_quote ||
!
line[query_start + strspn(line + query_start, " \t\n\r")] != '\0') { if (query_buf->len > 0) appendPQExpBufferChar(query_buf, '\n'); Index: src/bin/psql/prompt.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v retrieving revision 1.34 diff -c -r1.34 prompt.c *** src/bin/psql/prompt.c 25 Jan 2004 03:07:22 -0000 1.34 --- src/bin/psql/prompt.c 15 Feb 2004 14:28:02 -0000 *************** *** 85,90 **** --- 85,91 ---- case PROMPT_CONTINUE: case PROMPT_SINGLEQUOTE: case PROMPT_DOUBLEQUOTE: + case PROMPT_DOLLARQUOTE: case PROMPT_COMMENT: case PROMPT_PAREN: prompt_name = "PROMPT2"; *************** *** 198,203 **** --- 199,207 ---- break; case PROMPT_DOUBLEQUOTE: buf[0] = '"'; + break; + case PROMPT_DOLLARQUOTE: + buf[0] = '$'; break; case PROMPT_COMMENT: buf[0] = '*'; Index: src/bin/psql/prompt.h =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v retrieving revision 1.13 diff -c -r1.13 prompt.h *** src/bin/psql/prompt.h 29 Nov 2003 19:52:07 -0000 1.13 --- src/bin/psql/prompt.h 15 Feb 2004 14:28:02 -0000 *************** *** 15,20 **** --- 15,21 ---- PROMPT_COMMENT, PROMPT_SINGLEQUOTE, PROMPT_DOUBLEQUOTE, + PROMPT_DOLLARQUOTE, PROMPT_PAREN, PROMPT_COPY } promptStatus_t; Index: src/backend/parser/scan.l =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v retrieving revision 1.112 diff -c -r1.112 scan.l *** src/backend/parser/scan.l 29 Nov 2003 19:51:52 -0000 1.112 --- src/backend/parser/scan.l 15 Feb 2004 14:28:16 -0000 *************** *** 39,44 **** --- 39,46 ---- static int xcdepth = 0; /* depth of nesting in slash-star comments */ + static char *dolqstart; /* current $foo$ quote start string */ + /* * literalbuf is used to accumulate literal values when multiple rules * are needed to parse a single literal. 
Call startlit to reset buffer *************** *** 95,100 **** --- 97,103 ---- * <xd> delimited identifiers (double-quoted identifiers) * <xh> hexadecimal numeric string * <xq> quoted strings + * <dolq> $foo$-style quoted strings */ %x xb *************** *** 102,107 **** --- 105,111 ---- %x xd %x xh %x xq + %x dolq /* Bit string * It is tempting to scan the string for only those characters *************** *** 141,146 **** --- 145,159 ---- xqoctesc [\\][0-7]{1,3} xqcat {quote}{whitespace_with_newline}{quote} + /* $foo$ style quotes ("dollar quoting") + * The quoted string starts with $foo$ where "foo" is an optional string + * in the form of an identifier, except that it may not contain "$", + * and extends to the first occurrence + * of an identical string. There is *no* processing of the quoted text. + */ + dolqdelim \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$ + dolqinside [^$]+ + /* Double quote * Allows embedded spaces and other special characters into identifiers. */ *************** *** 387,392 **** --- 400,434 ---- } <xq><<EOF>> { yyerror("unterminated quoted string"); } + {dolqdelim} { + token_start = yytext; + dolqstart = pstrdup(yytext); + BEGIN(dolq); + startlit(); + } + <dolq>{dolqdelim} { + if (strcmp(yytext, dolqstart) == 0) + { + pfree(dolqstart); + BEGIN(INITIAL); + yylval.str = litbufdup(); + return SCONST; + } + /* + * When we fail to match $...$ to dolqstart, transfer + * the $... part to the output, but put back the final + * $ for rescanning. Consider $delim$...$junk$delim$ + */ + addlit(yytext, yyleng-1); + yyless(yyleng-1); + } + <dolq>{dolqinside} { + addlit(yytext, yyleng); + } + <dolq>. { + addlitchar(yytext[0]); + } + <dolq><<EOF>> { yyerror("unterminated special-quoted string"); } {xdstart} { token_start = yytext;
Andrew Dunstan <andrew@dunslane.net> writes: >> No, it won't. The problem is that it should, because the backend will >> see that as '42' followed by a $foo$ quote start. > Ok, I see what you are saying. This mismatch would only happen on > invalid input, though. I believe that what I did will work on all legal > input. I'm unconvinced. Even if there are not any current syntaxes in which a numeric literal can be adjacent to a string literal (I'm not totally sure about that), what of the future? We should solve the problem rather than assuming it won't bite us. > I think that this might be cured by having psql recognise a legal > identifier or keyword and eating it as a word, rather than treating it > as just another set of bytes in the stream. Hm, might work ... will think about it ... regards, tom lane
Another interesting thing about psql that I noticed when using '$' in identifiers is this effect:

andrew=# create table ab$cd$ef (ef$cd$ab text);
CREATE TABLE
andrew=# \d ab$cd$ef
Did not find any relation named "ab$cd$ef".
andrew=# \d ab\$cd\$ef
Table "public.ab$cd$ef"
  Column  | Type | Modifiers
----------+------+-----------
 ef$cd$ab | text |

which is perhaps slightly less than intuitive. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > andrew=# create table ab$cd$ef (ef$cd$ab text); > CREATE TABLE > andrew=# \d ab$cd$ef > Did not find any relation named "ab$cd$ef". Hmph. I always thought that "$" was only special at the end of a regex, but that doesn't seem to be how our implementation treats it. Anyway this is not a bug, it is a feature: the argument of \d is a regex. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> No, it won't. The problem is that it should, because the backend will > >> see that as '42' followed by a $foo$ quote start. > > > Ok, I see what you are saying. This mismatch would only happen on > > invalid input, though. I believe that what I did will work on all legal > > input. > > I'm unconvinced. Even if there are not any current syntaxes in which a > numeric literal can be adjacent to a string literal (I'm not totally > sure about that), what of the future? We should solve the problem > rather than assuming it won't bite us. > > > I think that this might be cured by having psql recognise a legal > > identifier or keyword and eating it as a word, rather than treating it > > as just another set of bytes in the stream. > > Hm, might work ... will think about it ... I am a little concerned about adding the overhead of lex to psql. Right now, some folks have reported that lex/yacc take a considerable amount of processing time in the backend as part of a query, and adding that to psql just to do $$ seems questionable. Of course, we can always test and see what the overhead shows. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I am a little concerned about adding the overhead of lex to psql. Right > now, some folks have reported that lex/yacc take a considerable amount > of processing time in the backend as part of a query, and adding that to > psql just to do $$ seems questionable. Of course, we can alway test and > see what the overhead shows. That's not the question to ask --- the question is whether a flex lexer will be faster or slower than the hand-made lexing code that psql is currently using. Lexer generation is a well-understood art, and you have to be pretty tense to beat out flex with hand-made code. It's the same tradeoff as trying to write better assembly code than a compiler does. Look at the lexing loop in psql/mainloop.c (that series of if-tests starting at about line 250) and ask yourself if that's really going to beat out the state-machine implementation flex uses --- which looks to be about two table lookups per character, plus a switch() executed at the end of every token. I'll bet on flex being faster. The reason the lexer shows up in the backend is that it has to grovel over every individual character of a query. For sufficiently large query strings that's gonna take awhile no matter how you do it. But in any case, the argument for moving to flex is not about performance, it is about making the code more understandable and more certain to agree with the backend lexer. regards, tom lane
Bruce Momjian wrote: >Tom Lane wrote: > > >>Andrew Dunstan <andrew@dunslane.net> writes: >> >> >>>>No, it won't. The problem is that it should, because the backend will >>>>see that as '42' followed by a $foo$ quote start. >>>> >>>> >>>Ok, I see what you are saying. This mismatch would only happen on >>>invalid input, though. I believe that what I did will work on all legal >>>input. >>> >>> >>I'm unconvinced. Even if there are not any current syntaxes in which a >>numeric literal can be adjacent to a string literal (I'm not totally >>sure about that), what of the future? We should solve the problem >>rather than assuming it won't bite us. >> >> >> >>>I think that this might be cured by having psql recognise a legal >>>identifier or keyword and eating it as a word, rather than treating it >>>as just another set of bytes in the stream. >>> >>> >>Hm, might work ... will think about it ... >> >> > >I am a little concerned about adding the overhead of lex to psql. Right >now, some folks have reported that lex/yacc take a considerable amount >of processing time in the backend as part of a query, and adding that to >psql just to do $$ seems questionable. Of course, we can alway test and >see what the overhead shows. > > > I think Tom's point is that the custom lexical recognition done by psql is approaching the point of being unmaintainable, and if we have to bite the bullet then maybe we might as well do so now. I'd be surprised if using a flex lexer instead made a huge speed difference, but maybe I'm wrong. I'm more concerned that it will be difficult to write and maintain and keep in sync with the backend's lexical structure - but those are just gut feelings, and I could be way off base. Right now psql does just enough recognition to enable it to work. Making it recognise the whole sql lexical structure instead strikes me as being somewhat redundant - that's what the backend does. 
Maybe we could do a flex lexer somewhat along the lines of the minimalist approach that psql currently employs. Anyway - I did put forward a possible non-flex way to handle the problem Tom saw with $$ quoting - let's see what his verdict is on that first ;-) If he thinks it won't work, I can't see much alternative to using a flex-based lexer. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I'd be surprised if using a flex lexer instead made a huge speed > difference, but maybe I'm wrong. No, I agree --- it's unlikely to make very much difference in the real world. Maybe on huge query strings you could notice the difference. > I'm more concerned that it will be > difficult to write and maintain and keep in sync with the backend's > lexical structure - but those are just gut feelings, and I could be way > off base. Right now psql does just enough recognition to enable it to > work. Making it recognise the whole sql lexical structure instead > strikes me as being somewhat redundant - that's what the backend does. Actually, I thought the way to handle it would be to duplicate the backend lexer as nearly as possible. Most of the productions would have empty bodies probably, but doing it that way would guarantee that in fact psql and the backend would lex a string the same way, which is exactly the problem we are facing here. You'd fall out of the lexer only upon detecting backslash (unless we want to put backslash command lexing into the flex code, which might or might not be a good idea), or upon detecting a ';' at parenthesis depth 0, or upon hitting end of string. In the last case the lexer state would indicate which prompt we need to give. One unsolved issue in my mind is how to deal with multibyte characters in non-backend-compatible encodings (the ones where some bytes of a multibyte character can have values < 128 and thus can look like plain ASCII characters). We'd have to make sure that the flex lexer skips over such bytes properly. I've thought of a couple of kluges but nothing I like ... regards, tom lane
> Actually, I thought the way to handle it would be to duplicate the > backend lexer as nearly as possible. Most of the productions would have > empty bodies probably, but doing it that way would guarantee that in > fact psql and the backend would lex a string the same way, which is > exactly the problem we are facing here. You'd fall out of the lexer > only upon detecting backslash (unless we want to put backslash command > lexing into the flex code, which might or might not be a good idea), > or upon detecting a ';' at parenthesis depth 0, or upon hitting end of > string. In the last case the lexer state would indicate which prompt > we need to give. You know what would be really sweet? If the lexing was made available as a public function. eg. So I could parse queries in phpPgAdmin before sending them to the backend, etc... Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > You know what would be really sweet? If the lexing was made available > as a public function. eg. So I could parse queries in phpPgAdmin before > sending them to the backend, etc... Parsing is a whole nother ball of wax besides lexing. I wasn't planning to put *that* into psql. Remember the only thing psql really wants from this is to detect where end-of-statement is ... regards, tom lane
> Parsing is a whole nother ball of wax besides lexing. I wasn't planning > to put *that* into psql. Remember the only thing psql really wants from > this is to detect where end-of-statement is ... Forgive my lameness, but I've never truly figured out where parsing ends and lexing begins. Anyone care to illuminate me on the difference? Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: >> Parsing is a whole nother ball of wax besides lexing. > Forgive my lameness, but I've never truly figured out where parsing ends > and lexing begins. Anyone care to illuminate me on the difference? The theoretical answer is that you can do lexing with a finite-state machine, but parsing generally requires a stack, because it supports nested constructs. Lexers don't have any way to describe nested constructs --- a series of tokens is the only level of abstraction there is. The practical answer is that you do one with flex and the other with bison ;-). If you can do it with flex, and not cheat by implementing your own state stack, it's lexing. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I am a little concerned about adding the overhead of lex to psql. Right > > now, some folks have reported that lex/yacc take a considerable amount > > of processing time in the backend as part of a query, and adding that to > > psql just to do $$ seems questionable. Of course, we can alway test and > > see what the overhead shows. > > That's not the question to ask --- the question is whether a flex lexer > will be faster or slower than the hand-made lexing code that psql is > currently using. Lexer generation is a well-understood art, and you > have to be pretty tense to beat out flex with hand-made code. It's the > same tradeoff as trying to write better assembly code than a compiler > does. Look at the lexing loop in psql/mainloop.c (that series of > if-tests starting at about line 250) and ask yourself if that's really > going to beat out the state-machine implementation flex uses --- which > looks to be about two table lookups per character, plus a switch() > executed at the end of every token. I'll bet on flex being faster. > > The reason the lexer shows up in the backend is that it has to grovel > over every individual character of a query. For sufficiently large > query strings that's gonna take awhile no matter how you do it. > > But in any case, the argument for moving to flex is not about > performance, it is about making the code more understandable and more > certain to agree with the backend lexer. If we go with lex/flex for psql, I would still like someone to test performance to see that we aren't taking a big hit. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Christopher Kings-Lynne wrote: > > Parsing is a whole nother ball of wax besides lexing. I wasn't planning > > to put *that* into psql. Remember the only thing psql really wants from > > this is to detect where end-of-statement is ... > > Forgive my lameness, but I've never truly figured out where parsing ends > and lexing begins. Anyone care to illuminate me on the difference? The simplistic answer is that lexing breaks a string up into words/tokens, and parsing matches those tokens against patterns and fires actions. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>andrew=# create table ab$cd$ef (ef$cd$ab text); >>CREATE TABLE >>andrew=# \d ab$cd$ef >>Did not find any relation named "ab$cd$ef". >> >> > >Hmph. I always thought that "$" was only special at the end of a regex, >but that doesn't seem to be how our implementation treats it. Anyway >this is not a bug, it is a feature: the argument of \d is a regex. > > > Arguably this at least is a singularly useless feature, since a regex-meaning $ before the end of string is a nonsense, as you rightly imply, and one at the end of the string is redundant, as it is implied - psql turns 'abc' into '^abc$' when constructing the query. I don't care that much - I don't use $ in my identifiers. cheers andrew
> Parsing is a whole nother ball of wax besides lexing. I wasn't planning > to put *that* into psql. Remember the only thing psql really wants from > this is to detect where end-of-statement is ... OK, I'm not that great at understanding where lexing ends and parsing starts. These are the things that it would be really nice to know about in phpPgAdmin: * Start and end of multiline queries, so when executing an SQL script I can execute each query separately. * Knowing how many $n parameters there are in a SELECT statement that is to be prepared. Chris
Doh, sorry. I just realised that the lists just gave me a whole bunch of mails from back in February, which is when Tom made this post... Chris Christopher Kings-Lynne wrote: >> Parsing is a whole nother ball of wax besides lexing. I wasn't planning >> to put *that* into psql. Remember the only thing psql really wants from >> this is to detect where end-of-statement is ... > > > OK, I'm not that great at understanding where lexing ends and parsing > starts. These are the things that it would be really nice to know about > in phpPgAdmin: > > * Start and end of multiline queries, so when executing an SQL script I > can execute each query separately. > > * Knowing how many $n parameters there are in a SELECT statement that is > to be prepared. > > Chris