Thread: Re: [HACKERS] dollar quoting

Re: [HACKERS] dollar quoting

From
Andrew Dunstan
Date:

Andrew Dunstan wrote:

>
>
> Tom Lane wrote:
>
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>
>>
>>> After staring at the code for a long time, I think I see how to do
>>> this. It's complicated a bit by the fact that $ is a valid
>>> identifier character. So my current thinking is to say that if we
>>> see $ not in a quote and not preceded by a valid identifier char
>>> then it is the start of a $foo$ sequence. Or have I missed
>>> something? Can we validly see $ in any other context?
>>>
>>

I had missed one, though - numbered params in prepared statements. Fixed
in attached patch.

>>
>> Right, a $ should be considered to start a quote marker only if it's not
>> part of an identifier.  The backend lexer doesn't have a problem with
>> this because it's written in flex, but I can imagine that getting it
>> right in psql's ad-hoc parser might be tricky.
>>
>
> I think it's doable, though.  I seem to have a working patch, which I
> will send out for review soon.
>

Proof of Concept patch (i.e. not for application) attached for review.
The scanner changes are based on Tom's original, with some name/comment
changes and a more liberal pattern. The psql changes are all my own work
:-).

Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
in psql/mainloop.c - please pay close attention to that.

If I'm not wildly off course I will polish this up and start on docs.

cheers

andrew
Index: src/backend/parser/scan.l
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v
retrieving revision 1.112
diff -c -w -r1.112 scan.l
*** src/backend/parser/scan.l    29 Nov 2003 19:51:52 -0000    1.112
--- src/backend/parser/scan.l    8 Feb 2004 19:59:24 -0000
***************
*** 39,44 ****
--- 39,46 ----

  static int        xcdepth = 0;    /* depth of nesting in slash-star comments */

+ static char    *dolqstart;  /* current $foo$ quote start string */
+
  /*
   * literalbuf is used to accumulate literal values when multiple rules
   * are needed to parse a single literal.  Call startlit to reset buffer
***************
*** 95,100 ****
--- 97,103 ----
   *  <xd> delimited identifiers (double-quoted identifiers)
   *  <xh> hexadecimal numeric string
   *  <xq> quoted strings
+  *  <dolq> $foo$-style quoted strings
   */

  %x xb
***************
*** 102,107 ****
--- 105,111 ----
  %x xd
  %x xh
  %x xq
+ %x dolq

  /* Bit string
   * It is tempting to scan the string for only those characters
***************
*** 141,146 ****
--- 145,159 ----
  xqoctesc        [\\][0-7]{1,3}
  xqcat            {quote}{whitespace_with_newline}{quote}

+ /* $foo$ style quotes ("dollar quoting")
+  * The quoted string starts with $foo$ where "foo" is an optional string
+  * in the form of an identifier, except that it may not contain "$",
+  * and extends to the first occurrence
+  * of an identical string.  There is *no* processing of the quoted text.
+  */
+ dolqdelim   \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$
+ dolqinside  [^$]+
+
  /* Double quote
   * Allows embedded spaces and other special characters into identifiers.
   */
***************
*** 387,392 ****
--- 400,434 ----
                  }
  <xq><<EOF>>        { yyerror("unterminated quoted string"); }

+ {dolqdelim}  {
+      token_start = yytext;
+      dolqstart = pstrdup(yytext);
+      BEGIN(dolq);
+      startlit();
+     }
+ <dolq>{dolqdelim} {
+      if (strcmp(yytext, dolqstart) == 0)
+      {
+       pfree(dolqstart);
+       BEGIN(INITIAL);
+       yylval.str = litbufdup();
+       return SCONST;
+      }
+      /*
+       * When we fail to match $...$ to dolqstart, transfer
+       * the $... part to the output, but put back the final
+       * $ for rescanning.  Consider $delim$...$junk$delim$
+       */
+      addlit(yytext, yyleng-1);
+      yyless(yyleng-1);
+     }
+ <dolq>{dolqinside}  {
+      addlit(yytext, yyleng);
+     }
+ <dolq>.           {
+      addlitchar(yytext[0]);
+     }
+ <dolq><<EOF>>  { yyerror("unterminated special-quoted string"); }

  {xdstart}        {
                      token_start = yytext;
Index: src/bin/psql/mainloop.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v
retrieving revision 1.61
diff -c -w -r1.61 mainloop.c
*** src/bin/psql/mainloop.c    25 Jan 2004 03:07:22 -0000    1.61
--- src/bin/psql/mainloop.c    8 Feb 2004 19:59:44 -0000
***************
*** 49,54 ****
--- 49,58 ----
      unsigned int query_start;
      volatile int count_eof = 0;
      volatile unsigned int bslash_count = 0;
+     volatile bool free_dolquote = false;
+     char *dol_quote = NULL;
+     char *dol_end;
+

      int            i,
                  prevlen,
***************
*** 120,125 ****
--- 124,130 ----
                  in_quote = 0;
                  paren_level = 0;
                  count_eof = 0;
+                 free_dolquote = true;
                  slashCmdStatus = CMD_UNKNOWN;
              }
              else
***************
*** 136,141 ****
--- 141,156 ----
          pqsignal(SIGINT, handle_sigint);        /* control-C => cancel */
  #endif   /* not WIN32 */

+         if (free_dolquote)
+         {
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
+             free_dolquote = false;
+         }
+
          fflush(stdout);

          if (slashCmdStatus == CMD_NEWEDIT)
***************
*** 150,155 ****
--- 165,175 ----
              in_xcomment = 0;
              in_quote = 0;
              paren_level = 0;
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
              slashCmdStatus = CMD_UNKNOWN;
          }

***************
*** 161,167 ****
          {
              int            prompt_status;

!             if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
--- 181,189 ----
          {
              int            prompt_status;

!             if (dol_quote)
!                 prompt_status = PROMPT_DOLLARQUOTE;
!             else if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
***************
*** 268,273 ****
--- 290,339 ----
                      in_quote = 0;
              }

+             /*
+              * start of $foo$ type quote
+              * must be a $ not preceded by a valid identifier character,
+              * not inside a quote, not succeeded by a digit,
+              * and with a terminating $ somewhere on the line.
+              *
+              * if an invalid $foo$ meeting these requirements is seen, it won't
+              * be trapped here but will instead generate an SQL syntax error.
+              */
+
+             else if (!dol_quote && line[i] == '$' &&
+                      !isdigit(line[i + thislen]) &&
+                      (dol_end = strchr(line+i+1,'$')) != NULL &&
+                      (i == 0 ||
+                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
+                          line[i-1] == '_')))
+             {
+                 char eos;
+
+                 dol_end ++;
+                 eos = *(dol_end);
+                 *dol_end = '\0';
+                 dol_quote = strdup(line+i);
+                 *dol_end = eos;
+                 ADVANCE_1;
+                 while(line[i] != '$')
+                     ADVANCE_1;
+
+             }
+
+             /* in or end of $foo$ type quote? */
+
+             else if (dol_quote)
+             {
+                 if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0)
+                 {
+                     ADVANCE_1;
+                     while(line[i] != '$')
+                         ADVANCE_1;
+                     free(dol_quote);
+                     dol_quote = NULL;
+                 }
+             }
+
              /* start of extended comment? */
              else if (line[i] == '/' && line[i + thislen] == '*')
              {
***************
*** 291,297 ****
              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
!                 line[i] = '\0'; /* remove comment */
                  break;
              }

--- 357,363 ----
              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
!                 line[i] = '\0'; /* removae comment */
                  break;
              }

***************
*** 458,464 ****


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
--- 524,531 ----


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || dol_quote ||
!             line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
Index: src/bin/psql/prompt.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v
retrieving revision 1.34
diff -c -w -r1.34 prompt.c
*** src/bin/psql/prompt.c    25 Jan 2004 03:07:22 -0000    1.34
--- src/bin/psql/prompt.c    8 Feb 2004 19:59:44 -0000
***************
*** 85,90 ****
--- 85,91 ----
          case PROMPT_CONTINUE:
          case PROMPT_SINGLEQUOTE:
          case PROMPT_DOUBLEQUOTE:
+         case PROMPT_DOLLARQUOTE:
          case PROMPT_COMMENT:
          case PROMPT_PAREN:
              prompt_name = "PROMPT2";
***************
*** 198,203 ****
--- 199,207 ----
                              break;
                          case PROMPT_DOUBLEQUOTE:
                              buf[0] = '"';
+                             break;
+                         case PROMPT_DOLLARQUOTE:
+                             buf[0] = '$';
                              break;
                          case PROMPT_COMMENT:
                              buf[0] = '*';
Index: src/bin/psql/prompt.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v
retrieving revision 1.13
diff -c -w -r1.13 prompt.h
*** src/bin/psql/prompt.h    29 Nov 2003 19:52:07 -0000    1.13
--- src/bin/psql/prompt.h    8 Feb 2004 19:59:44 -0000
***************
*** 15,20 ****
--- 15,21 ----
      PROMPT_COMMENT,
      PROMPT_SINGLEQUOTE,
      PROMPT_DOUBLEQUOTE,
+     PROMPT_DOLLARQUOTE,
      PROMPT_PAREN,
      PROMPT_COPY
  } promptStatus_t;

Re: [HACKERS] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
> in psql/mainloop.c - please pay close attention to that.

The i-1 stuff should generally be i-prevlen.  Not sure if there are any
other pitfalls.

A bigger problem here:

> +             else if (!dol_quote && line[i] == '$' &&
> +                      !isdigit(line[i + thislen]) &&
> +                      (dol_end = strchr(line+i+1,'$')) != NULL &&
> +                      (i == 0 ||
> +                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
> +                          line[i-1] == '_')))
> +             {

is that you aren't checking that what comes between the two dollar signs
looks like empty-or-an-identifier.  The check for
next-char-isn't-a-digit is part of that but not the only part.

Also I'm not sure about the positioning of these tests relative to the
in_quote and in_xcomment tests.  As you have it, $foo$ will be
recognized within an xcomment, which I think is at variance with the
proposed backend lexing behavior.

Also, the strdup should be pg_strdup.

            regards, tom lane

Re: [HACKERS] dollar quoting

From
Andrew Dunstan
Date:

Tom Lane wrote:

>A bigger problem here:
>
>
>
>>+             else if (!dol_quote && line[i] == '$' &&
>>+                      !isdigit(line[i + thislen]) &&
>>+                      (dol_end = strchr(line+i+1,'$')) != NULL &&
>>+                      (i == 0 ||
>>+                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
>>+                          line[i-1] == '_')))
>>+             {
>>
>>
>
>is that you aren't checking that what comes between the two dollar signs
>looks like empty-or-an-identifier.  The check for
>next-char-isn't-a-digit is part of that but not the only part.
>
>


Well, I think the right way to do a full check would be with a regex,
which I had hoped to avoid. However, I will now try to get one working
and to address your other concerns.

Thanks for the comments.

cheers

andrew


Re: [HACKERS] dollar quoting

From
Andrew Dunstan
Date:
I think the attached patch addresses Tom's comments.

I ended up not using a regex, which seemed to be a little heavy handed,
but just writing a small custom recognition function, that should (and I
think does) mimic the pattern recognition for these tokens used by the
backend lexer. This patch just puts that function in mainloop.c, but
perhaps it belongs elsewhere (string_utils.c maybe?). I don't have
strong opinions on that.

Enjoy

andrew



Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>
>>Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
>>in psql/mainloop.c - please pay close attention to that.
>>
>>
>
>The i-1 stuff should generally be i-prevlen.  Not sure if there are any
>other pitfalls.
>
>A bigger problem here:
>
>
>
>>+             else if (!dol_quote && line[i] == '$' &&
>>+                      !isdigit(line[i + thislen]) &&
>>+                      (dol_end = strchr(line+i+1,'$')) != NULL &&
>>+                      (i == 0 ||
>>+                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
>>+                          line[i-1] == '_')))
>>+             {
>>
>>
>
>is that you aren't checking that what comes between the two dollar signs
>looks like empty-or-an-identifier.  The check for
>next-char-isn't-a-digit is part of that but not the only part.
>
>Also I'm not sure about the positioning of these tests relative to the
>in_quote and in_xcomment tests.  As you have it, $foo$ will be
>recognized within an xcomment, which I think is at variance with the
>proposed backend lexing behavior.
>
>Also, the strdup should be pg_strdup.
>
>            regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
>
>
Index: src/backend/parser/scan.l
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v
retrieving revision 1.112
diff -c -w -r1.112 scan.l
*** src/backend/parser/scan.l    29 Nov 2003 19:51:52 -0000    1.112
--- src/backend/parser/scan.l    9 Feb 2004 15:26:34 -0000
***************
*** 39,44 ****
--- 39,46 ----

  static int        xcdepth = 0;    /* depth of nesting in slash-star comments */

+ static char    *dolqstart;  /* current $foo$ quote start string */
+
  /*
   * literalbuf is used to accumulate literal values when multiple rules
   * are needed to parse a single literal.  Call startlit to reset buffer
***************
*** 95,100 ****
--- 97,103 ----
   *  <xd> delimited identifiers (double-quoted identifiers)
   *  <xh> hexadecimal numeric string
   *  <xq> quoted strings
+  *  <dolq> $foo$-style quoted strings
   */

  %x xb
***************
*** 102,107 ****
--- 105,111 ----
  %x xd
  %x xh
  %x xq
+ %x dolq

  /* Bit string
   * It is tempting to scan the string for only those characters
***************
*** 141,146 ****
--- 145,159 ----
  xqoctesc        [\\][0-7]{1,3}
  xqcat            {quote}{whitespace_with_newline}{quote}

+ /* $foo$ style quotes ("dollar quoting")
+  * The quoted string starts with $foo$ where "foo" is an optional string
+  * in the form of an identifier, except that it may not contain "$",
+  * and extends to the first occurrence
+  * of an identical string.  There is *no* processing of the quoted text.
+  */
+ dolqdelim   \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$
+ dolqinside  [^$]+
+
  /* Double quote
   * Allows embedded spaces and other special characters into identifiers.
   */
***************
*** 387,392 ****
--- 400,434 ----
                  }
  <xq><<EOF>>        { yyerror("unterminated quoted string"); }

+ {dolqdelim}  {
+      token_start = yytext;
+      dolqstart = pstrdup(yytext);
+      BEGIN(dolq);
+      startlit();
+     }
+ <dolq>{dolqdelim} {
+      if (strcmp(yytext, dolqstart) == 0)
+      {
+       pfree(dolqstart);
+       BEGIN(INITIAL);
+       yylval.str = litbufdup();
+       return SCONST;
+      }
+      /*
+       * When we fail to match $...$ to dolqstart, transfer
+       * the $... part to the output, but put back the final
+       * $ for rescanning.  Consider $delim$...$junk$delim$
+       */
+      addlit(yytext, yyleng-1);
+      yyless(yyleng-1);
+     }
+ <dolq>{dolqinside}  {
+      addlit(yytext, yyleng);
+     }
+ <dolq>.           {
+      addlitchar(yytext[0]);
+     }
+ <dolq><<EOF>>  { yyerror("unterminated special-quoted string"); }

  {xdstart}        {
                      token_start = yytext;
Index: src/bin/psql/mainloop.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v
retrieving revision 1.61
diff -c -w -r1.61 mainloop.c
*** src/bin/psql/mainloop.c    25 Jan 2004 03:07:22 -0000    1.61
--- src/bin/psql/mainloop.c    9 Feb 2004 15:26:51 -0000
***************
*** 21,26 ****
--- 21,61 ----
  sigjmp_buf    main_loop_jmp;
  #endif

+ /*
+  * function to detect a valid $foo$ quote delimiter at the start of the
+  * parameter dquote.
+  */
+
+ static bool valid_dolquote(char * dquote)
+ {
+     int i;
+
+     /* must start with a $ */
+     if (dquote[0] != '$')
+         return false;
+
+     /* empty 'identifier' case */
+     if (dquote[1] == '$')
+         return true;
+
+     /* first 'identifier' char must be a letter or have high bit set */
+     if (!isalpha(dquote[1]) && (dquote[1] & 0x80) == 0)
+         return false;
+
+     /* subsequent chars must be alphanumeric or _ or have high bit set */
+     for (i = 2; dquote[i] != '$'; i++)
+     {
+         if ((dquote[i] & 0x80) == 0 && ! isalnum(dquote[i]) &&
+             dquote[i] != '_')
+         {
+             /* we found an invalid character */
+             return false;
+         }
+     }
+
+     return true;
+ }
+

  /*
   * Main processing loop for reading lines of input
***************
*** 49,54 ****
--- 84,92 ----
      unsigned int query_start;
      volatile int count_eof = 0;
      volatile unsigned int bslash_count = 0;
+     volatile bool free_dolquote = false;
+     char *dol_quote = NULL;
+

      int            i,
                  prevlen,
***************
*** 120,125 ****
--- 158,164 ----
                  in_quote = 0;
                  paren_level = 0;
                  count_eof = 0;
+                 free_dolquote = true;
                  slashCmdStatus = CMD_UNKNOWN;
              }
              else
***************
*** 136,141 ****
--- 175,190 ----
          pqsignal(SIGINT, handle_sigint);        /* control-C => cancel */
  #endif   /* not WIN32 */

+         if (free_dolquote)
+         {
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
+             free_dolquote = false;
+         }
+
          fflush(stdout);

          if (slashCmdStatus == CMD_NEWEDIT)
***************
*** 150,155 ****
--- 199,209 ----
              in_xcomment = 0;
              in_quote = 0;
              paren_level = 0;
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
              slashCmdStatus = CMD_UNKNOWN;
          }

***************
*** 161,167 ****
          {
              int            prompt_status;

!             if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
--- 215,223 ----
          {
              int            prompt_status;

!             if (dol_quote)
!                 prompt_status = PROMPT_DOLLARQUOTE;
!             else if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
***************
*** 268,273 ****
--- 324,343 ----
                      in_quote = 0;
              }

+             /* in or end of $foo$ type quote? */
+
+             else if (dol_quote)
+             {
+                 if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0)
+                 {
+                     ADVANCE_1;
+                     while(line[i] != '$')
+                         ADVANCE_1;
+                     free(dol_quote);
+                     dol_quote = NULL;
+                 }
+             }
+
              /* start of extended comment? */
              else if (line[i] == '/' && line[i + thislen] == '*')
              {
***************
*** 288,297 ****
              else if (line[i] == '\'' || line[i] == '"')
                  in_quote = line[i];

              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
!                 line[i] = '\0'; /* remove comment */
                  break;
              }

--- 358,395 ----
              else if (line[i] == '\'' || line[i] == '"')
                  in_quote = line[i];

+             /*
+              * start of $foo$ type quote?
+              *
+              * must not be preceded by a valid identifier character
+              */
+
+             else if (!dol_quote && valid_dolquote(line+i) &&
+                      (i == 0 ||
+                       ! ((line[i-prevlen] & 0x80) != 0 ||
+                          isalnum(line[i-prevlen]) ||
+                          line[i-prevlen] == '_' ||
+                          line[i-prevlen] == '$' )))
+             {
+                 char * dol_end;
+                 char eos;
+
+                 dol_end = strchr(line+i+1,'$');
+                 dol_end ++;
+                 eos = *dol_end;
+                 *dol_end = '\0';
+                 dol_quote = pg_strdup(line+i);
+                 *dol_end = eos;
+                 ADVANCE_1;
+                 while(line[i] != '$')
+                     ADVANCE_1;
+
+             }
+
              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
!                 line[i] = '\0'; /* removae comment */
                  break;
              }

***************
*** 458,464 ****


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
--- 556,563 ----


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || dol_quote ||
!             line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
Index: src/bin/psql/prompt.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v
retrieving revision 1.34
diff -c -w -r1.34 prompt.c
*** src/bin/psql/prompt.c    25 Jan 2004 03:07:22 -0000    1.34
--- src/bin/psql/prompt.c    9 Feb 2004 15:26:51 -0000
***************
*** 85,90 ****
--- 85,91 ----
          case PROMPT_CONTINUE:
          case PROMPT_SINGLEQUOTE:
          case PROMPT_DOUBLEQUOTE:
+         case PROMPT_DOLLARQUOTE:
          case PROMPT_COMMENT:
          case PROMPT_PAREN:
              prompt_name = "PROMPT2";
***************
*** 198,203 ****
--- 199,207 ----
                              break;
                          case PROMPT_DOUBLEQUOTE:
                              buf[0] = '"';
+                             break;
+                         case PROMPT_DOLLARQUOTE:
+                             buf[0] = '$';
                              break;
                          case PROMPT_COMMENT:
                              buf[0] = '*';
Index: src/bin/psql/prompt.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v
retrieving revision 1.13
diff -c -w -r1.13 prompt.h
*** src/bin/psql/prompt.h    29 Nov 2003 19:52:07 -0000    1.13
--- src/bin/psql/prompt.h    9 Feb 2004 15:26:51 -0000
***************
*** 15,20 ****
--- 15,21 ----
      PROMPT_COMMENT,
      PROMPT_SINGLEQUOTE,
      PROMPT_DOUBLEQUOTE,
+     PROMPT_DOLLARQUOTE,
      PROMPT_PAREN,
      PROMPT_COPY
  } promptStatus_t;

Re: [HACKERS] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I ended up not using a regex, which seemed to be a little heavy handed,
> but just writing a small custom recognition function, that should (and I
> think does) mimic the pattern recognition for these tokens used by the
> backend lexer.

I looked at this and realized that it still doesn't do very well at
distinguishing $foo$ from other random uses of $.  The problem is that
looking back at just the immediately preceding character isn't enough
context to tell whether a $ is part of an identifier.  Consider the
input
    a42$foo$
This is a legal identifier according to PG 7.4.  But how about
    42$foo$
This is a syntax error in 7.4, and we propose to redefine it as an
integer literal '42' followed by a dollar-quote start symbol.

There's no way to tell these apart with a single-character lookback,
or indeed any fixed number of characters of lookback.

I begin to think that we'll really have to bite the bullet and convert
psql's input parser to use flex.  If we're not scanning with exactly the
same rules as the backend uses, we're going to get the wrong answers.

            regards, tom lane

Re: [HACKERS] dollar quoting

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>
>>I ended up not using a regex, which seemed to be a little heavy handed,
>>but just writing a small custom recognition function, that should (and I
>>think does) mimic the pattern recognition for these tokens used by the
>>backend lexer.
>>
>>
>
>I looked at this and realized that it still doesn't do very well at
>distinguishing $foo$ from other random uses of $.  The problem is that
>looking back at just the immediately preceding character isn't enough
>context to tell whether a $ is part of an identifier.  Consider the
>input
>    a42$foo$
>This is a legal identifier according to PG 7.4.  But how about
>    42$foo$
>This is a syntax error in 7.4, and we propose to redefine it as an
>integer literal '42' followed by a dollar-quote start symbol.
>

The test in the patch I sent is this:


            else if (!dol_quote && valid_dolquote(line+i) &&
                     (i == 0 ||
                      ! ((line[i-prevlen] & 0x80) != 0 ||
                         isalnum(line[i-prevlen]) ||
                         line[i-prevlen] == '_' ||
                         line[i-prevlen] == '$' )))


The test should not succeed anywhere in the string '42$foo$'.

Note that psql does not change any '$foo$' at all - it just passes it to
the backend. The reason we need this at all in psql is that it has to
detect the end of a statement, and it has to prompt correctly, and to do
that it needs to know if we are in a quote (single, double, dollar) or a
comment.

psql does not detect many syntax errors, or even lexical errors - that
is the job of the backend - rightly so, I believe.

>
>There's no way to tell these apart with a single-character lookback,
>or indeed any fixed number of characters of lookback.
>

I'm still not convinced, although maybe there's something I'm not getting.

>
>I begin to think that we'll really have to bite the bullet and convert
>psql's input parser to use flex.  If we're not scanning with exactly the
>same rules as the backend uses, we're going to get the wrong answers.
>
>
>

Interacting with lexer states would probably be ... unpleasant. Matching
a stream oriented lexer with a line oriented CLI would be messy I suspect.

cheers

andrew


Re: [HACKERS] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> ... But how about
>> 42$foo$
>> This is a syntax error in 7.4, and we propose to redefine it as an
>> integer literal '42' followed by a dollar-quote start symbol.

> The test should not succeed anywhere in the string '42$foo$'.

No, it won't.  The problem is that it should, because the backend will
see that as '42' followed by a $foo$ quote start.

> Interacting with lexer states would probably be ... unpleasant. Matching
> a stream oriented lexer with a line oriented CLI would be messy I suspect.

I think it would not be that bad.  We'd have to run the lexer on the
command input buffer and see what state it terminates in.

            regards, tom lane

Re: [HACKERS] dollar quoting

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>
>>Tom Lane wrote:
>>
>>
>>>... But how about
>>>42$foo$
>>>This is a syntax error in 7.4, and we propose to redefine it as an
>>>integer literal '42' followed by a dollar-quote start symbol.
>>>
>>>
>
>
>
>>The test should not succeed anywhere in the string '42$foo$'.
>>
>>
>
>No, it won't.  The problem is that it should, because the backend will
>see that as '42' followed by a $foo$ quote start.
>

Ok, I see what you are saying. This mismatch would only happen on
invalid input, though. I believe that what I did will work on all legal
input.

I think that this might be cured by having psql recognise a legal
identifier or keyword and eating it as a word, rather than treating it
as just another set of bytes in the stream. That would enable us to
avoid the lookback in the dollar-quote recognition test altogether. The
attached patch does it that way - the keyword/id test needs to come
right at the end of the loop to avoid clashing with backslash commands,
btw.

I *think* that this way psql will recognise the start of a dollar quote
iff the backend lexer would.

>
>
>
>>Interacting with lexer states would probably be ... unpleasant. Matching
>>a stream oriented lexer with a line oriented CLI would be messy I suspect.
>>
>>
>
>I think it would not be that bad.  We'd have to run the lexer on the
>command input buffer and see what state it terminates in.
>
>
>

Yeah. I am not enough of a flex wizard to undertake the task, though. It
would take me lots of time. If we make a decision that we really need
this in order to do dollar quoting in psql I would need some substantial
help, at least.

cheers

andrew
Index: src/bin/psql/mainloop.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v
retrieving revision 1.61
diff -c -r1.61 mainloop.c
*** src/bin/psql/mainloop.c    25 Jan 2004 03:07:22 -0000    1.61
--- src/bin/psql/mainloop.c    15 Feb 2004 14:28:02 -0000
***************
*** 21,26 ****
--- 21,61 ----
  sigjmp_buf    main_loop_jmp;
  #endif

+ /*
+  * function to detect a valid $foo$ quote delimiter at the start of the
+  * parameter dquote.
+  */
+
+ static bool valid_dolquote(char * dquote)
+ {
+     int i;
+
+     /* must start with a $ */
+     if (dquote[0] != '$')
+         return false;
+
+     /* empty 'identifier' case */
+     if (dquote[1] == '$')
+         return true;
+
+     /* first 'identifier' char must be a letter or have high bit set */
+     if (!isalpha(dquote[1]) && (dquote[1] & 0x80) == 0)
+         return false;
+
+     /* subsequent chars must be alphanumeric or _ or have high bit set */
+     for (i = 2; dquote[i] != '$'; i++)
+     {
+         if ((dquote[i] & 0x80) == 0 && ! isalnum(dquote[i]) &&
+             dquote[i] != '_')
+         {
+             /* we found an invalid character */
+             return false;
+         }
+     }
+
+     return true;
+ }
+

  /*
   * Main processing loop for reading lines of input
***************
*** 49,54 ****
--- 84,92 ----
      unsigned int query_start;
      volatile int count_eof = 0;
      volatile unsigned int bslash_count = 0;
+     volatile bool free_dolquote = false;
+     char *dol_quote = NULL;
+

      int            i,
                  prevlen,
***************
*** 120,125 ****
--- 158,164 ----
                  in_quote = 0;
                  paren_level = 0;
                  count_eof = 0;
+                 free_dolquote = true;
                  slashCmdStatus = CMD_UNKNOWN;
              }
              else
***************
*** 136,141 ****
--- 175,190 ----
          pqsignal(SIGINT, handle_sigint);        /* control-C => cancel */
  #endif   /* not WIN32 */

+         if (free_dolquote)
+         {
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
+             free_dolquote = false;
+         }
+
          fflush(stdout);

          if (slashCmdStatus == CMD_NEWEDIT)
***************
*** 150,155 ****
--- 199,209 ----
              in_xcomment = 0;
              in_quote = 0;
              paren_level = 0;
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
              slashCmdStatus = CMD_UNKNOWN;
          }

***************
*** 161,167 ****
          {
              int            prompt_status;

!             if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
--- 215,223 ----
          {
              int            prompt_status;

!             if (dol_quote)
!                 prompt_status = PROMPT_DOLLARQUOTE;
!             else if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
***************
*** 268,273 ****
--- 324,343 ----
                      in_quote = 0;
              }

+             /* in or end of $foo$ type quote? */
+
+             else if (dol_quote)
+             {
+                 if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0)
+                 {
+                     ADVANCE_1;
+                     while(line[i] != '$')
+                         ADVANCE_1;
+                     free(dol_quote);
+                     dol_quote = NULL;
+                 }
+             }
+
              /* start of extended comment? */
              else if (line[i] == '/' && line[i + thislen] == '*')
              {
***************
*** 288,293 ****
--- 358,383 ----
              else if (line[i] == '\'' || line[i] == '"')
                  in_quote = line[i];

+             /*
+              * start of $foo$ type quote?
+              */
+             else if (!dol_quote && valid_dolquote(line+i))
+             {
+                 char * dol_end;
+                 char eos;
+
+                 dol_end = strchr(line+i+1,'$');
+                 dol_end ++;
+                 eos = *dol_end;
+                 *dol_end = '\0';
+                 dol_quote = pg_strdup(line+i);
+                 *dol_end = eos;
+                 ADVANCE_1;
+                 while(line[i] != '$')
+                     ADVANCE_1;
+
+             }
+
              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
***************
*** 447,452 ****
--- 537,566 ----
                  i = end_of_cmd - line;
                  query_start = i;
              }
+
+             /*
+              * keyword or identifier?
+              * We grab the whole string so that we don't
+              * mistakenly see $foo$ inside an identifier as the start
+              * of a dollar quote.
+              */
+
+             else if ( (line[i] & 0x80) != 0 ||
+                       isalpha(line[i]) ||
+                       line[i] == '_')
+             {
+                 while ((line[i+thislen] & 0x80) != 0 ||
+                        isalnum(line[i+thislen]) ||
+                        line[i+thislen] == '_' ||
+                        line[i+thislen] == '$' )
+                 {
+                     /* keep going while we still have identifier chars */
+                     ADVANCE_1;
+                 }
+
+
+             }
+
          }                        /* for (line) */


***************
*** 458,464 ****


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
--- 572,579 ----


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || dol_quote ||
!             line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
Index: src/bin/psql/prompt.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v
retrieving revision 1.34
diff -c -r1.34 prompt.c
*** src/bin/psql/prompt.c    25 Jan 2004 03:07:22 -0000    1.34
--- src/bin/psql/prompt.c    15 Feb 2004 14:28:02 -0000
***************
*** 85,90 ****
--- 85,91 ----
          case PROMPT_CONTINUE:
          case PROMPT_SINGLEQUOTE:
          case PROMPT_DOUBLEQUOTE:
+         case PROMPT_DOLLARQUOTE:
          case PROMPT_COMMENT:
          case PROMPT_PAREN:
              prompt_name = "PROMPT2";
***************
*** 198,203 ****
--- 199,207 ----
                              break;
                          case PROMPT_DOUBLEQUOTE:
                              buf[0] = '"';
+                             break;
+                         case PROMPT_DOLLARQUOTE:
+                             buf[0] = '$';
                              break;
                          case PROMPT_COMMENT:
                              buf[0] = '*';
Index: src/bin/psql/prompt.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v
retrieving revision 1.13
diff -c -r1.13 prompt.h
*** src/bin/psql/prompt.h    29 Nov 2003 19:52:07 -0000    1.13
--- src/bin/psql/prompt.h    15 Feb 2004 14:28:02 -0000
***************
*** 15,20 ****
--- 15,21 ----
      PROMPT_COMMENT,
      PROMPT_SINGLEQUOTE,
      PROMPT_DOUBLEQUOTE,
+     PROMPT_DOLLARQUOTE,
      PROMPT_PAREN,
      PROMPT_COPY
  } promptStatus_t;
Index: src/backend/parser/scan.l
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v
retrieving revision 1.112
diff -c -r1.112 scan.l
*** src/backend/parser/scan.l    29 Nov 2003 19:51:52 -0000    1.112
--- src/backend/parser/scan.l    15 Feb 2004 14:28:16 -0000
***************
*** 39,44 ****
--- 39,46 ----

  static int        xcdepth = 0;    /* depth of nesting in slash-star comments */

+ static char    *dolqstart;  /* current $foo$ quote start string */
+
  /*
   * literalbuf is used to accumulate literal values when multiple rules
   * are needed to parse a single literal.  Call startlit to reset buffer
***************
*** 95,100 ****
--- 97,103 ----
   *  <xd> delimited identifiers (double-quoted identifiers)
   *  <xh> hexadecimal numeric string
   *  <xq> quoted strings
+  *  <dolq> $foo$-style quoted strings
   */

  %x xb
***************
*** 102,107 ****
--- 105,111 ----
  %x xd
  %x xh
  %x xq
+ %x dolq

  /* Bit string
   * It is tempting to scan the string for only those characters
***************
*** 141,146 ****
--- 145,159 ----
  xqoctesc        [\\][0-7]{1,3}
  xqcat            {quote}{whitespace_with_newline}{quote}

+ /* $foo$ style quotes ("dollar quoting")
+  * The quoted string starts with $foo$ where "foo" is an optional string
+  * in the form of an identifier, except that it may not contain "$",
+  * and extends to the first occurrence
+  * of an identical string.  There is *no* processing of the quoted text.
+  */
+ dolqdelim   \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$
+ dolqinside  [^$]+
+
  /* Double quote
   * Allows embedded spaces and other special characters into identifiers.
   */
***************
*** 387,392 ****
--- 400,434 ----
                  }
  <xq><<EOF>>        { yyerror("unterminated quoted string"); }

+ {dolqdelim}  {
+      token_start = yytext;
+      dolqstart = pstrdup(yytext);
+      BEGIN(dolq);
+      startlit();
+     }
+ <dolq>{dolqdelim} {
+      if (strcmp(yytext, dolqstart) == 0)
+      {
+       pfree(dolqstart);
+       BEGIN(INITIAL);
+       yylval.str = litbufdup();
+       return SCONST;
+      }
+      /*
+       * When we fail to match $...$ to dolqstart, transfer
+       * the $... part to the output, but put back the final
+       * $ for rescanning.  Consider $delim$...$junk$delim$
+       */
+      addlit(yytext, yyleng-1);
+      yyless(yyleng-1);
+     }
+ <dolq>{dolqinside}  {
+      addlit(yytext, yyleng);
+     }
+ <dolq>.           {
+      addlitchar(yytext[0]);
+     }
+ <dolq><<EOF>>  { yyerror("unterminated special-quoted string"); }

  {xdstart}        {
                      token_start = yytext;

Re: [HACKERS] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
>> No, it won't.  The problem is that it should, because the backend will
>> see that as '42' followed by a $foo$ quote start.

> Ok, I see what you are saying. This mismatch would only happen on
> invalid input, though. I believe that what I did will work on all legal
> input.

I'm unconvinced.  Even if there are not any current syntaxes in which a
numeric literal can be adjacent to a string literal (I'm not totally
sure about that), what of the future?  We should solve the problem
rather than assuming it won't bite us.

> I think that this might be cured by having psql recognise a legal
> identifier or keyword and eating it as a word, rather than treating it
> as just another set of bytes in the stream.

Hm, might work ... will think about it ...

            regards, tom lane

Re: [HACKERS] dollar quoting

From
Bruce Momjian
Date:
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> >> No, it won't.  The problem is that it should, because the backend will
> >> see that as '42' followed by a $foo$ quote start.
>
> > Ok, I see what you are saying. This mismatch would only happen on
> > invalid input, though. I believe that what I did will work on all legal
> > input.
>
> I'm unconvinced.  Even if there are not any current syntaxes in which a
> numeric literal can be adjacent to a string literal (I'm not totally
> sure about that), what of the future?  We should solve the problem
> rather than assuming it won't bite us.
>
> > I think that this might be cured by having psql recognise a legal
> > identifier or keyword and eating it as a word, rather than treating it
> > as just another set of bytes in the stream.
>
> Hm, might work ... will think about it ...

I am a little concerned about adding the overhead of lex to psql.  Right
now, some folks have reported that lex/yacc take a considerable amount
of processing time in the backend as part of a query, and adding that to
psql just to do $$ seems questionable.  Of course, we can alway test and
see what the overhead shows.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [HACKERS] dollar quoting

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am a little concerned about adding the overhead of lex to psql.  Right
> now, some folks have reported that lex/yacc take a considerable amount
> of processing time in the backend as part of a query, and adding that to
> psql just to do $$ seems questionable.  Of course, we can alway test and
> see what the overhead shows.

That's not the question to ask --- the question is whether a flex lexer
will be faster or slower than the hand-made lexing code that psql is
currently using.  Lexer generation is a well-understood art, and you
have to be pretty tense to beat out flex with hand-made code.  It's the
same tradeoff as trying to write better assembly code than a compiler
does.  Look at the lexing loop in psql/mainloop.c (that series of
if-tests starting at about line 250) and ask yourself if that's really
going to beat out the state-machine implementation flex uses --- which
looks to be about two table lookups per character, plus a switch()
executed at the end of every token.  I'll bet on flex being faster.

The reason the lexer shows up in the backend is that it has to grovel
over every individual character of a query.  For sufficiently large
query strings that's gonna take awhile no matter how you do it.

But in any case, the argument for moving to flex is not about
performance, it is about making the code more understandable and more
certain to agree with the backend lexer.

            regards, tom lane

Re: [HACKERS] dollar quoting

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am a little concerned about adding the overhead of lex to psql.  Right
> > now, some folks have reported that lex/yacc take a considerable amount
> > of processing time in the backend as part of a query, and adding that to
> > psql just to do $$ seems questionable.  Of course, we can alway test and
> > see what the overhead shows.
>
> That's not the question to ask --- the question is whether a flex lexer
> will be faster or slower than the hand-made lexing code that psql is
> currently using.  Lexer generation is a well-understood art, and you
> have to be pretty tense to beat out flex with hand-made code.  It's the
> same tradeoff as trying to write better assembly code than a compiler
> does.  Look at the lexing loop in psql/mainloop.c (that series of
> if-tests starting at about line 250) and ask yourself if that's really
> going to beat out the state-machine implementation flex uses --- which
> looks to be about two table lookups per character, plus a switch()
> executed at the end of every token.  I'll bet on flex being faster.
>
> The reason the lexer shows up in the backend is that it has to grovel
> over every individual character of a query.  For sufficiently large
> query strings that's gonna take awhile no matter how you do it.
>
> But in any case, the argument for moving to flex is not about
> performance, it is about making the code more understandable and more
> certain to agree with the backend lexer.

If we go with lex/flex for psql, I would still like someone to test
performance to see that we aren't taking a big hit.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073