Thread: dollar quoting

dollar quoting

From
Andrew Dunstan
Date:
What has become of the "dollar quoting" mechanism that we had so much 
discussion about back in August/September?

IIRC, a consensus was reached on the actual format of the quote 
delimiters (either $$ or $identifier$), and Tom had a proof of concept 
patch to the parser to handle it, but work was needed on psql, plpgsql, 
pg_dump (and pg_restore?) + docs.

Is anyone working on it?

cheers

andrew



Re: dollar quoting

From
Jon Jensen
Date:
On Thu, 5 Feb 2004, Andrew Dunstan wrote:

> What has become of the "dollar quoting" mechanism that we had so much 
> discussion about back in August/September?
> 
> IIRC, a consensus was reached on the actual format of the quote 
> delimiters (either $$ or $identifier$), and Tom had a proof of concept 
> patch to the parser to handle it, but work was needed on psql, plpgsql, 
> pg_dump (and pg_restore?) + docs.
> 
> Is anyone working on it?

I am, but not very quickly. I hope to have some time in the next month, 
but if someone else beats me to it I'll just be happy it got done.

Jon


Re: dollar quoting

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>  
>
>>IIRC, a consensus was reached on the actual format of the quote 
>>delimiters (either $$ or $identifier$), and Tom had a proof of concept 
>>patch to the parser to handle it, but work was needed on psql, plpgsql, 
>>pg_dump (and pg_restore?) + docs.
>>    
>>
>
>I think someone has to fix psql before we can consider applying the
>backend patch.  Fixing the other stuff can come after.
>

Makes sense.

>
>  
>
>>Is anyone working on it?
>>    
>>
>
>I kinda thought you had volunteered to work on the psql part...
>
>  
>


I don't recall being that specific, but you could be right. In any case, 
I didn't want to trip over anyone else, which is why I asked.

I will try to coordinate with Jon.

cheers

andrew



Re: dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> IIRC, a consensus was reached on the actual format of the quote 
> delimiters (either $$ or $identifier$), and Tom had a proof of concept 
> patch to the parser to handle it, but work was needed on psql, plpgsql, 
> pg_dump (and pg_restore?) + docs.

I think someone has to fix psql before we can consider applying the
backend patch.  Fixing the other stuff can come after.

> Is anyone working on it?

I kinda thought you had volunteered to work on the psql part...
        regards, tom lane


Re: dollar quoting

From
Andrew Dunstan
Date:

Andrew Dunstan wrote:

>
>
> Tom Lane wrote:
>
>>
>> I kinda thought you had volunteered to work on the psql part...
>>
>>  
>>
>
>
> I don't recall being that specific, but you could be right. In any 
> case, I didn't want to trip over anyone else, which is why I asked.
>
> I will try to coordinate with Jon.
>

After staring at the code for a long time, I think I see how to do this. 
It's complicated a bit by the fact that $ is a valid identifier 
character. So my current thinking is to say that if we see $ not in a 
quote and not preceded by a valid identifier char then it is the start 
of a $foo$ sequence. Or have I missed something? Can we validly see $ in 
any other context?

BTW, Tom's proof of concept patch worked just fine for me. I changed the 
allowed pattern to what I think was agreed:

dolqdelim   \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$

and changed some names and comments to remove misleading references to 
"here docs".

cheers

andrew



Re: dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> After staring at the code for a long time, I think I see how to do this. 
> It's complicated a bit by the fact that $ is a valid identifier 
> character. So my current thinking is to say that if we see $ not in a 
> quote and not preceded by a valid identifier char then it is the start 
> of a $foo$ sequence. Or have I missed something? Can we validly see $ in 
> any other context?

Right, a $ should be considered to start a quote marker only if it's not
part of an identifier.  The backend lexer doesn't have a problem with
this because it's written in flex, but I can imagine that getting it
right in psql's ad-hoc parser might be tricky.

There was some discussion awhile back of converting psql to use flex
for interpreting its input, but I dunno how practical that really is.
I don't know how you get flex to do reasonable stuff with an incomplete
input string.  Still, it might be worth looking into.
        regards, tom lane


Re: dollar quoting

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>  
>
>>After staring at the code for a long time, I think I see how to do this. 
>>It's complicated a bit by the fact that $ is a valid identifier 
>>character. So my current thinking is to say that if we see $ not in a 
>>quote and not preceded by a valid identifier char then it is the start 
>>of a $foo$ sequence. Or have I missed something? Can we validly see $ in 
>>any other context?
>>    
>>
>
>Right, a $ should be considered to start a quote marker only if it's not
>part of an identifier.  The backend lexer doesn't have a problem with
>this because it's written in flex, but I can imagine that getting it
>right in psql's ad-hoc parser might be tricky.
>

I think it's doable, though.  I seem to have a working patch, which I 
will send out for review soon.

>
>There was some discussion awhile back of converting psql to use flex
>for interpreting its input, but I dunno how practical that really is.
>I don't know how you get flex to do reasonable stuff with an incomplete
>input string.  Still, it might be worth looking into.
>
>  
>

That's what made me not even think about it. If someone better versed in 
this stuff than me wants to do it then more power to them.

cheers

andrew



Re: dollar quoting

From
Andrew Dunstan
Date:

Andrew Dunstan wrote:

>
>
> Tom Lane wrote:
>
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>
>>
>>> After staring at the code for a long time, I think I see how to do
>>> this. It's complicated a bit by the fact that $ is a valid
>>> identifier character. So my current thinking is to say that if we
>>> see $ not in a quote and not preceded by a valid identifier char
>>> then it is the start of a $foo$ sequence. Or have I missed
>>> something? Can we validly see $ in any other context?
>>>
>>

I had missed one, though - numbered params in prepared statements. Fixed
in attached patch.

>>
>> Right, a $ should be considered to start a quote marker only if it's not
>> part of an identifier.  The backend lexer doesn't have a problem with
>> this because it's written in flex, but I can imagine that getting it
>> right in psql's ad-hoc parser might be tricky.
>>
>
> I think it's doable, though.  I seem to have a working patch, which I
> will send out for review soon.
>

Proof of Concept patch (i.e. not for application) attached for review.
The scanner changes are based on Tom's original, with some name/comment
changes and a more liberal pattern. The psql changes are all my own work
:-).

Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
in psql/mainloop.c - please pay close attention to that.

If I'm not wildly off course I will polish this up and start on docs.

cheers

andrew
Index: src/backend/parser/scan.l
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v
retrieving revision 1.112
diff -c -w -r1.112 scan.l
*** src/backend/parser/scan.l    29 Nov 2003 19:51:52 -0000    1.112
--- src/backend/parser/scan.l    8 Feb 2004 19:59:24 -0000
***************
*** 39,44 ****
--- 39,46 ----

  static int        xcdepth = 0;    /* depth of nesting in slash-star comments */

+ static char    *dolqstart;  /* current $foo$ quote start string */
+
  /*
   * literalbuf is used to accumulate literal values when multiple rules
   * are needed to parse a single literal.  Call startlit to reset buffer
***************
*** 95,100 ****
--- 97,103 ----
   *  <xd> delimited identifiers (double-quoted identifiers)
   *  <xh> hexadecimal numeric string
   *  <xq> quoted strings
+  *  <dolq> $foo$-style quoted strings
   */

  %x xb
***************
*** 102,107 ****
--- 105,111 ----
  %x xd
  %x xh
  %x xq
+ %x dolq

  /* Bit string
   * It is tempting to scan the string for only those characters
***************
*** 141,146 ****
--- 145,159 ----
  xqoctesc        [\\][0-7]{1,3}
  xqcat            {quote}{whitespace_with_newline}{quote}

+ /* $foo$ style quotes ("dollar quoting")
+  * The quoted string starts with $foo$ where "foo" is an optional string
+  * in the form of an identifier, except that it may not contain "$",
+  * and extends to the first occurrence
+  * of an identical string.  There is *no* processing of the quoted text.
+  */
+ dolqdelim   \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$
+ dolqinside  [^$]+
+
  /* Double quote
   * Allows embedded spaces and other special characters into identifiers.
   */
***************
*** 387,392 ****
--- 400,434 ----
                  }
  <xq><<EOF>>        { yyerror("unterminated quoted string"); }

+ {dolqdelim}  {
+      token_start = yytext;
+      dolqstart = pstrdup(yytext);
+      BEGIN(dolq);
+      startlit();
+     }
+ <dolq>{dolqdelim} {
+      if (strcmp(yytext, dolqstart) == 0)
+      {
+       pfree(dolqstart);
+       BEGIN(INITIAL);
+       yylval.str = litbufdup();
+       return SCONST;
+      }
+      /*
+       * When we fail to match $...$ to dolqstart, transfer
+       * the $... part to the output, but put back the final
+       * $ for rescanning.  Consider $delim$...$junk$delim$
+       */
+      addlit(yytext, yyleng-1);
+      yyless(yyleng-1);
+     }
+ <dolq>{dolqinside}  {
+      addlit(yytext, yyleng);
+     }
+ <dolq>.           {
+      addlitchar(yytext[0]);
+     }
+ <dolq><<EOF>>  { yyerror("unterminated special-quoted string"); }

  {xdstart}        {
                      token_start = yytext;
Index: src/bin/psql/mainloop.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v
retrieving revision 1.61
diff -c -w -r1.61 mainloop.c
*** src/bin/psql/mainloop.c    25 Jan 2004 03:07:22 -0000    1.61
--- src/bin/psql/mainloop.c    8 Feb 2004 19:59:44 -0000
***************
*** 49,54 ****
--- 49,58 ----
      unsigned int query_start;
      volatile int count_eof = 0;
      volatile unsigned int bslash_count = 0;
+     volatile bool free_dolquote = false;
+     char *dol_quote = NULL;
+     char *dol_end;
+

      int            i,
                  prevlen,
***************
*** 120,125 ****
--- 124,130 ----
                  in_quote = 0;
                  paren_level = 0;
                  count_eof = 0;
+                 free_dolquote = true;
                  slashCmdStatus = CMD_UNKNOWN;
              }
              else
***************
*** 136,141 ****
--- 141,156 ----
          pqsignal(SIGINT, handle_sigint);        /* control-C => cancel */
  #endif   /* not WIN32 */

+         if (free_dolquote)
+         {
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
+             free_dolquote = false;
+         }
+
          fflush(stdout);

          if (slashCmdStatus == CMD_NEWEDIT)
***************
*** 150,155 ****
--- 165,175 ----
              in_xcomment = 0;
              in_quote = 0;
              paren_level = 0;
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
              slashCmdStatus = CMD_UNKNOWN;
          }

***************
*** 161,167 ****
          {
              int            prompt_status;

!             if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
--- 181,189 ----
          {
              int            prompt_status;

!             if (dol_quote)
!                 prompt_status = PROMPT_DOLLARQUOTE;
!             else if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
***************
*** 268,273 ****
--- 290,339 ----
                      in_quote = 0;
              }

+             /*
+              * start of $foo$ type quote
+              * must be a $ not preceded by a valid identifier character,
+              * not inside a quote, not succeeded by a digit,
+              * and with a terminating $ somewhere on the line.
+              *
+              * if an invalid $foo$ meeting these requirements is seen, it won't
+              * be trapped here but will instead generate an SQL syntax error.
+              */
+
+             else if (!dol_quote && line[i] == '$' &&
+                      !isdigit(line[i + thislen]) &&
+                      (dol_end = strchr(line+i+1,'$')) != NULL &&
+                      (i == 0 ||
+                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
+                          line[i-1] == '_')))
+             {
+                 char eos;
+
+                 dol_end ++;
+                 eos = *(dol_end);
+                 *dol_end = '\0';
+                 dol_quote = strdup(line+i);
+                 *dol_end = eos;
+                 ADVANCE_1;
+                 while(line[i] != '$')
+                     ADVANCE_1;
+
+             }
+
+             /* in or end of $foo$ type quote? */
+
+             else if (dol_quote)
+             {
+                 if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0)
+                 {
+                     ADVANCE_1;
+                     while(line[i] != '$')
+                         ADVANCE_1;
+                     free(dol_quote);
+                     dol_quote = NULL;
+                 }
+             }
+
              /* start of extended comment? */
              else if (line[i] == '/' && line[i + thislen] == '*')
              {
***************
*** 291,297 ****
              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
!                 line[i] = '\0'; /* remove comment */
                  break;
              }

--- 357,363 ----
              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
!                 line[i] = '\0'; /* removae comment */
                  break;
              }

***************
*** 458,464 ****


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
--- 524,531 ----


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || dol_quote ||
!             line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
Index: src/bin/psql/prompt.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v
retrieving revision 1.34
diff -c -w -r1.34 prompt.c
*** src/bin/psql/prompt.c    25 Jan 2004 03:07:22 -0000    1.34
--- src/bin/psql/prompt.c    8 Feb 2004 19:59:44 -0000
***************
*** 85,90 ****
--- 85,91 ----
          case PROMPT_CONTINUE:
          case PROMPT_SINGLEQUOTE:
          case PROMPT_DOUBLEQUOTE:
+         case PROMPT_DOLLARQUOTE:
          case PROMPT_COMMENT:
          case PROMPT_PAREN:
              prompt_name = "PROMPT2";
***************
*** 198,203 ****
--- 199,207 ----
                              break;
                          case PROMPT_DOUBLEQUOTE:
                              buf[0] = '"';
+                             break;
+                         case PROMPT_DOLLARQUOTE:
+                             buf[0] = '$';
                              break;
                          case PROMPT_COMMENT:
                              buf[0] = '*';
Index: src/bin/psql/prompt.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v
retrieving revision 1.13
diff -c -w -r1.13 prompt.h
*** src/bin/psql/prompt.h    29 Nov 2003 19:52:07 -0000    1.13
--- src/bin/psql/prompt.h    8 Feb 2004 19:59:44 -0000
***************
*** 15,20 ****
--- 15,21 ----
      PROMPT_COMMENT,
      PROMPT_SINGLEQUOTE,
      PROMPT_DOUBLEQUOTE,
+     PROMPT_DOLLARQUOTE,
      PROMPT_PAREN,
      PROMPT_COPY
  } promptStatus_t;

Re: [PATCHES] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
> in psql/mainloop.c - please pay close attention to that.

The i-1 stuff should generally be i-prevlen.  Not sure if there are any
other pitfalls.

A bigger problem here:

> +             else if (!dol_quote && line[i] == '$' &&
> +                      !isdigit(line[i + thislen]) &&
> +                      (dol_end = strchr(line+i+1,'$')) != NULL &&
> +                      (i == 0 ||
> +                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
> +                          line[i-1] == '_')))
> +             {

is that you aren't checking that what comes between the two dollar signs
looks like empty-or-an-identifier.  The check for
next-char-isn't-a-digit is part of that but not the only part.

Also I'm not sure about the positioning of these tests relative to the
in_quote and in_xcomment tests.  As you have it, $foo$ will be
recognized within an xcomment, which I think is at variance with the
proposed backend lexing behavior.

Also, the strdup should be pg_strdup.

            regards, tom lane

Re: [PATCHES] dollar quoting

From
Andrew Dunstan
Date:

Tom Lane wrote:

>A bigger problem here:
>
>
>
>>+             else if (!dol_quote && line[i] == '$' &&
>>+                      !isdigit(line[i + thislen]) &&
>>+                      (dol_end = strchr(line+i+1,'$')) != NULL &&
>>+                      (i == 0 ||
>>+                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
>>+                          line[i-1] == '_')))
>>+             {
>>
>>
>
>is that you aren't checking that what comes between the two dollar signs
>looks like empty-or-an-identifier.  The check for
>next-char-isn't-a-digit is part of that but not the only part.
>
>


Well, I think the right way to do a full check would be with a regex,
which I had hoped to avoid. However, I will now try to get one working
and to address your other concerns.

Thanks for the comments.

cheers

andrew


Re: [PATCHES] dollar quoting

From
Andrew Dunstan
Date:
I think the attached patch addresses Tom's comments.

I ended up not using a regex, which seemed to be a little heavy handed,
but just writing a small custom recognition function, that should (and I
think does) mimic the pattern recognition for these tokens used by the
backend lexer. This patch just puts that function in mainloop.c, but
perhaps it belongs elsewhere (string_utils.c maybe?). I don't have
strong opinions on that.

Enjoy

andrew



Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>
>>Comments welcome. Reviewers: I am not sure I got multi-byte stuff right
>>in psql/mainloop.c - please pay close attention to that.
>>
>>
>
>The i-1 stuff should generally be i-prevlen.  Not sure if there are any
>other pitfalls.
>
>A bigger problem here:
>
>
>
>>+             else if (!dol_quote && line[i] == '$' &&
>>+                      !isdigit(line[i + thislen]) &&
>>+                      (dol_end = strchr(line+i+1,'$')) != NULL &&
>>+                      (i == 0 ||
>>+                       ! ((line[i-1] & 0x80) != 0 || isalnum(line[i-1]) ||
>>+                          line[i-1] == '_')))
>>+             {
>>
>>
>
>is that you aren't checking that what comes between the two dollar signs
>looks like empty-or-an-identifier.  The check for
>next-char-isn't-a-digit is part of that but not the only part.
>
>Also I'm not sure about the positioning of these tests relative to the
>in_quote and in_xcomment tests.  As you have it, $foo$ will be
>recognized within an xcomment, which I think is at variance with the
>proposed backend lexing behavior.
>
>Also, the strdup should be pg_strdup.
>
>            regards, tom lane
>
>---------------------------(end of broadcast)---------------------------
>TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>
>
>
Index: src/backend/parser/scan.l
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v
retrieving revision 1.112
diff -c -w -r1.112 scan.l
*** src/backend/parser/scan.l    29 Nov 2003 19:51:52 -0000    1.112
--- src/backend/parser/scan.l    9 Feb 2004 15:26:34 -0000
***************
*** 39,44 ****
--- 39,46 ----

  static int        xcdepth = 0;    /* depth of nesting in slash-star comments */

+ static char    *dolqstart;  /* current $foo$ quote start string */
+
  /*
   * literalbuf is used to accumulate literal values when multiple rules
   * are needed to parse a single literal.  Call startlit to reset buffer
***************
*** 95,100 ****
--- 97,103 ----
   *  <xd> delimited identifiers (double-quoted identifiers)
   *  <xh> hexadecimal numeric string
   *  <xq> quoted strings
+  *  <dolq> $foo$-style quoted strings
   */

  %x xb
***************
*** 102,107 ****
--- 105,111 ----
  %x xd
  %x xh
  %x xq
+ %x dolq

  /* Bit string
   * It is tempting to scan the string for only those characters
***************
*** 141,146 ****
--- 145,159 ----
  xqoctesc        [\\][0-7]{1,3}
  xqcat            {quote}{whitespace_with_newline}{quote}

+ /* $foo$ style quotes ("dollar quoting")
+  * The quoted string starts with $foo$ where "foo" is an optional string
+  * in the form of an identifier, except that it may not contain "$",
+  * and extends to the first occurrence
+  * of an identical string.  There is *no* processing of the quoted text.
+  */
+ dolqdelim   \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$
+ dolqinside  [^$]+
+
  /* Double quote
   * Allows embedded spaces and other special characters into identifiers.
   */
***************
*** 387,392 ****
--- 400,434 ----
                  }
  <xq><<EOF>>        { yyerror("unterminated quoted string"); }

+ {dolqdelim}  {
+      token_start = yytext;
+      dolqstart = pstrdup(yytext);
+      BEGIN(dolq);
+      startlit();
+     }
+ <dolq>{dolqdelim} {
+      if (strcmp(yytext, dolqstart) == 0)
+      {
+       pfree(dolqstart);
+       BEGIN(INITIAL);
+       yylval.str = litbufdup();
+       return SCONST;
+      }
+      /*
+       * When we fail to match $...$ to dolqstart, transfer
+       * the $... part to the output, but put back the final
+       * $ for rescanning.  Consider $delim$...$junk$delim$
+       */
+      addlit(yytext, yyleng-1);
+      yyless(yyleng-1);
+     }
+ <dolq>{dolqinside}  {
+      addlit(yytext, yyleng);
+     }
+ <dolq>.           {
+      addlitchar(yytext[0]);
+     }
+ <dolq><<EOF>>  { yyerror("unterminated special-quoted string"); }

  {xdstart}        {
                      token_start = yytext;
Index: src/bin/psql/mainloop.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v
retrieving revision 1.61
diff -c -w -r1.61 mainloop.c
*** src/bin/psql/mainloop.c    25 Jan 2004 03:07:22 -0000    1.61
--- src/bin/psql/mainloop.c    9 Feb 2004 15:26:51 -0000
***************
*** 21,26 ****
--- 21,61 ----
  sigjmp_buf    main_loop_jmp;
  #endif

+ /*
+  * function to detect a valid $foo$ quote delimiter at the start of the
+  * parameter dquote.
+  */
+
+ static bool valid_dolquote(char * dquote)
+ {
+     int i;
+
+     /* must start with a $ */
+     if (dquote[0] != '$')
+         return false;
+
+     /* empty 'identifier' case */
+     if (dquote[1] == '$')
+         return true;
+
+     /* first 'identifier' char must be a letter or have high bit set */
+     if (!isalpha(dquote[1]) && (dquote[1] & 0x80) == 0)
+         return false;
+
+     /* subsequent chars must be alphanumeric or _ or have high bit set */
+     for (i = 2; dquote[i] != '$'; i++)
+     {
+         if ((dquote[i] & 0x80) == 0 && ! isalnum(dquote[i]) &&
+             dquote[i] != '_')
+         {
+             /* we found an invalid character */
+             return false;
+         }
+     }
+
+     return true;
+ }
+

  /*
   * Main processing loop for reading lines of input
***************
*** 49,54 ****
--- 84,92 ----
      unsigned int query_start;
      volatile int count_eof = 0;
      volatile unsigned int bslash_count = 0;
+     volatile bool free_dolquote = false;
+     char *dol_quote = NULL;
+

      int            i,
                  prevlen,
***************
*** 120,125 ****
--- 158,164 ----
                  in_quote = 0;
                  paren_level = 0;
                  count_eof = 0;
+                 free_dolquote = true;
                  slashCmdStatus = CMD_UNKNOWN;
              }
              else
***************
*** 136,141 ****
--- 175,190 ----
          pqsignal(SIGINT, handle_sigint);        /* control-C => cancel */
  #endif   /* not WIN32 */

+         if (free_dolquote)
+         {
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
+             free_dolquote = false;
+         }
+
          fflush(stdout);

          if (slashCmdStatus == CMD_NEWEDIT)
***************
*** 150,155 ****
--- 199,209 ----
              in_xcomment = 0;
              in_quote = 0;
              paren_level = 0;
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
              slashCmdStatus = CMD_UNKNOWN;
          }

***************
*** 161,167 ****
          {
              int            prompt_status;

!             if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
--- 215,223 ----
          {
              int            prompt_status;

!             if (dol_quote)
!                 prompt_status = PROMPT_DOLLARQUOTE;
!             else if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
***************
*** 268,273 ****
--- 324,343 ----
                      in_quote = 0;
              }

+             /* in or end of $foo$ type quote? */
+
+             else if (dol_quote)
+             {
+                 if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0)
+                 {
+                     ADVANCE_1;
+                     while(line[i] != '$')
+                         ADVANCE_1;
+                     free(dol_quote);
+                     dol_quote = NULL;
+                 }
+             }
+
              /* start of extended comment? */
              else if (line[i] == '/' && line[i + thislen] == '*')
              {
***************
*** 288,297 ****
              else if (line[i] == '\'' || line[i] == '"')
                  in_quote = line[i];

              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
!                 line[i] = '\0'; /* remove comment */
                  break;
              }

--- 358,395 ----
              else if (line[i] == '\'' || line[i] == '"')
                  in_quote = line[i];

+             /*
+              * start of $foo$ type quote?
+              *
+              * must not be preceded by a valid identifier character
+              */
+
+             else if (!dol_quote && valid_dolquote(line+i) &&
+                      (i == 0 ||
+                       ! ((line[i-prevlen] & 0x80) != 0 ||
+                          isalnum(line[i-prevlen]) ||
+                          line[i-prevlen] == '_' ||
+                          line[i-prevlen] == '$' )))
+             {
+                 char * dol_end;
+                 char eos;
+
+                 dol_end = strchr(line+i+1,'$');
+                 dol_end ++;
+                 eos = *dol_end;
+                 *dol_end = '\0';
+                 dol_quote = pg_strdup(line+i);
+                 *dol_end = eos;
+                 ADVANCE_1;
+                 while(line[i] != '$')
+                     ADVANCE_1;
+
+             }
+
              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
!                 line[i] = '\0'; /* removae comment */
                  break;
              }

***************
*** 458,464 ****


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
--- 556,563 ----


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || dol_quote ||
!             line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
Index: src/bin/psql/prompt.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v
retrieving revision 1.34
diff -c -w -r1.34 prompt.c
*** src/bin/psql/prompt.c    25 Jan 2004 03:07:22 -0000    1.34
--- src/bin/psql/prompt.c    9 Feb 2004 15:26:51 -0000
***************
*** 85,90 ****
--- 85,91 ----
          case PROMPT_CONTINUE:
          case PROMPT_SINGLEQUOTE:
          case PROMPT_DOUBLEQUOTE:
+         case PROMPT_DOLLARQUOTE:
          case PROMPT_COMMENT:
          case PROMPT_PAREN:
              prompt_name = "PROMPT2";
***************
*** 198,203 ****
--- 199,207 ----
                              break;
                          case PROMPT_DOUBLEQUOTE:
                              buf[0] = '"';
+                             break;
+                         case PROMPT_DOLLARQUOTE:
+                             buf[0] = '$';
                              break;
                          case PROMPT_COMMENT:
                              buf[0] = '*';
Index: src/bin/psql/prompt.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v
retrieving revision 1.13
diff -c -w -r1.13 prompt.h
*** src/bin/psql/prompt.h    29 Nov 2003 19:52:07 -0000    1.13
--- src/bin/psql/prompt.h    9 Feb 2004 15:26:51 -0000
***************
*** 15,20 ****
--- 15,21 ----
      PROMPT_COMMENT,
      PROMPT_SINGLEQUOTE,
      PROMPT_DOUBLEQUOTE,
+     PROMPT_DOLLARQUOTE,
      PROMPT_PAREN,
      PROMPT_COPY
  } promptStatus_t;

Re: [PATCHES] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I ended up not using a regex, which seemed to be a little heavy handed,
> but just writing a small custom recognition function, that should (and I
> think does) mimic the pattern recognition for these tokens used by the
> backend lexer.

I looked at this and realized that it still doesn't do very well at
distinguishing $foo$ from other random uses of $.  The problem is that
looking back at just the immediately preceding character isn't enough
context to tell whether a $ is part of an identifier.  Consider the
input
    a42$foo$
This is a legal identifier according to PG 7.4.  But how about
    42$foo$
This is a syntax error in 7.4, and we propose to redefine it as an
integer literal '42' followed by a dollar-quote start symbol.

There's no way to tell these apart with a single-character lookback,
or indeed any fixed number of characters of lookback.

I begin to think that we'll really have to bite the bullet and convert
psql's input parser to use flex.  If we're not scanning with exactly the
same rules as the backend uses, we're going to get the wrong answers.

            regards, tom lane

Re: [PATCHES] dollar quoting

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>
>>I ended up not using a regex, which seemed to be a little heavy handed,
>>but just writing a small custom recognition function, that should (and I
>>think does) mimic the pattern recognition for these tokens used by the
>>backend lexer.
>>
>>
>
>I looked at this and realized that it still doesn't do very well at
>distinguishing $foo$ from other random uses of $.  The problem is that
>looking back at just the immediately preceding character isn't enough
>context to tell whether a $ is part of an identifier.  Consider the
>input
>    a42$foo$
>This is a legal identifier according to PG 7.4.  But how about
>    42$foo$
>This is a syntax error in 7.4, and we propose to redefine it as an
>integer literal '42' followed by a dollar-quote start symbol.
>

The test in the patch I sent is this:


            else if (!dol_quote && valid_dolquote(line+i) &&
                     (i == 0 ||
                      ! ((line[i-prevlen] & 0x80) != 0 ||
                         isalnum(line[i-prevlen]) ||
                         line[i-prevlen] == '_' ||
                         line[i-prevlen] == '$' )))


The test should not succeed anywhere in the string '42$foo$'.

Note that psql does not change any '$foo$' at all - it just passes it to
the backend. The reason we need this at all in psql is that it has to
detect the end of a statement, and it has to prompt correctly, and to do
that it needs to know if we are in a quote (single, double, dollar) or a
comment.

psql does not detect many syntax errors, or even lexical errors - that
is the job of the backend - rightly so, I believe.

>
>There's no way to tell these apart with a single-character lookback,
>or indeed any fixed number of characters of lookback.
>

I'm still not convinced, although maybe there's something I'm not getting.

>
>I begin to think that we'll really have to bite the bullet and convert
>psql's input parser to use flex.  If we're not scanning with exactly the
>same rules as the backend uses, we're going to get the wrong answers.
>
>
>

Interacting with lexer states would probably be ... unpleasant. Matching
a stream oriented lexer with a line oriented CLI would be messy I suspect.

cheers

andrew


Re: [PATCHES] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> ... But how about
>> 42$foo$
>> This is a syntax error in 7.4, and we propose to redefine it as an
>> integer literal '42' followed by a dollar-quote start symbol.

> The test should not succeed anywhere in the string '42$foo$'.

No, it won't.  The problem is that it should, because the backend will
see that as '42' followed by a $foo$ quote start.

> Interacting with lexer states would probably be ... unpleasant. Matching
> a stream oriented lexer with a line oriented CLI would be messy I suspect.

I think it would not be that bad.  We'd have to run the lexer on the
command input buffer and see what state it terminates in.

            regards, tom lane

Re: [PATCHES] dollar quoting

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>
>>Tom Lane wrote:
>>
>>
>>>... But how about
>>>42$foo$
>>>This is a syntax error in 7.4, and we propose to redefine it as an
>>>integer literal '42' followed by a dollar-quote start symbol.
>>>
>>>
>
>
>
>>The test should not succeed anywhere in the string '42$foo$'.
>>
>>
>
>No, it won't.  The problem is that it should, because the backend will
>see that as '42' followed by a $foo$ quote start.
>

Ok, I see what you are saying. This mismatch would only happen on
invalid input, though. I believe that what I did will work on all legal
input.

I think that this might be cured by having psql recognise a legal
identifier or keyword and eating it as a word, rather than treating it
as just another set of bytes in the stream. That would enable us to
avoid the lookback in the dollar-quote recognition test altogether. The
attached patch does it that way - the keyword/id test needs to come
right at the end of the loop to avoid clashing with backslash commands,
btw.

I *think* that this way psql will recognise the start of a dollar quote
iff the backend lexer would.

>
>
>
>>Interacting with lexer states would probably be ... unpleasant. Matching
>>a stream oriented lexer with a line oriented CLI would be messy I suspect.
>>
>>
>
>I think it would not be that bad.  We'd have to run the lexer on the
>command input buffer and see what state it terminates in.
>
>
>

Yeah. I am not enough of a flex wizard to undertake the task, though. It
would take me lots of time. If we make a decision that we really need
this in order to do dollar quoting in psql I would need some substantial
help, at least.

cheers

andrew
Index: src/bin/psql/mainloop.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/mainloop.c,v
retrieving revision 1.61
diff -c -r1.61 mainloop.c
*** src/bin/psql/mainloop.c    25 Jan 2004 03:07:22 -0000    1.61
--- src/bin/psql/mainloop.c    15 Feb 2004 14:28:02 -0000
***************
*** 21,26 ****
--- 21,61 ----
  sigjmp_buf    main_loop_jmp;
  #endif

+ /*
+  * function to detect a valid $foo$ quote delimiter at the start of the
+  * parameter dquote.
+  */
+
+ static bool valid_dolquote(char * dquote)
+ {
+     int i;
+
+     /* must start with a $ */
+     if (dquote[0] != '$')
+         return false;
+
+     /* empty 'identifier' case */
+     if (dquote[1] == '$')
+         return true;
+
+     /* first 'identifier' char must be a letter or have high bit set */
+     if (!isalpha(dquote[1]) && (dquote[1] & 0x80) == 0)
+         return false;
+
+     /* subsequent chars must be alphanumeric or _ or have high bit set */
+     for (i = 2; dquote[i] != '$'; i++)
+     {
+         if ((dquote[i] & 0x80) == 0 && ! isalnum(dquote[i]) &&
+             dquote[i] != '_')
+         {
+             /* we found an invalid character */
+             return false;
+         }
+     }
+
+     return true;
+ }
+

  /*
   * Main processing loop for reading lines of input
***************
*** 49,54 ****
--- 84,92 ----
      unsigned int query_start;
      volatile int count_eof = 0;
      volatile unsigned int bslash_count = 0;
+     volatile bool free_dolquote = false;
+     char *dol_quote = NULL;
+

      int            i,
                  prevlen,
***************
*** 120,125 ****
--- 158,164 ----
                  in_quote = 0;
                  paren_level = 0;
                  count_eof = 0;
+                 free_dolquote = true;
                  slashCmdStatus = CMD_UNKNOWN;
              }
              else
***************
*** 136,141 ****
--- 175,190 ----
          pqsignal(SIGINT, handle_sigint);        /* control-C => cancel */
  #endif   /* not WIN32 */

+         if (free_dolquote)
+         {
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
+             free_dolquote = false;
+         }
+
          fflush(stdout);

          if (slashCmdStatus == CMD_NEWEDIT)
***************
*** 150,155 ****
--- 199,209 ----
              in_xcomment = 0;
              in_quote = 0;
              paren_level = 0;
+             if(dol_quote)
+             {
+                 free(dol_quote);
+                 dol_quote = NULL;
+             }
              slashCmdStatus = CMD_UNKNOWN;
          }

***************
*** 161,167 ****
          {
              int            prompt_status;

!             if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
--- 215,223 ----
          {
              int            prompt_status;

!             if (dol_quote)
!                 prompt_status = PROMPT_DOLLARQUOTE;
!             else if (in_quote && in_quote == '\'')
                  prompt_status = PROMPT_SINGLEQUOTE;
              else if (in_quote && in_quote == '"')
                  prompt_status = PROMPT_DOUBLEQUOTE;
***************
*** 268,273 ****
--- 324,343 ----
                      in_quote = 0;
              }

+             /* in or end of $foo$ type quote? */
+
+             else if (dol_quote)
+             {
+                 if (strncmp(line+i,dol_quote,strlen(dol_quote)) == 0)
+                 {
+                     ADVANCE_1;
+                     while(line[i] != '$')
+                         ADVANCE_1;
+                     free(dol_quote);
+                     dol_quote = NULL;
+                 }
+             }
+
              /* start of extended comment? */
              else if (line[i] == '/' && line[i + thislen] == '*')
              {
***************
*** 288,293 ****
--- 358,383 ----
              else if (line[i] == '\'' || line[i] == '"')
                  in_quote = line[i];

+             /*
+              * start of $foo$ type quote?
+              */
+             else if (!dol_quote && valid_dolquote(line+i))
+             {
+                 char * dol_end;
+                 char eos;
+
+                 dol_end = strchr(line+i+1,'$');
+                 dol_end ++;
+                 eos = *dol_end;
+                 *dol_end = '\0';
+                 dol_quote = pg_strdup(line+i);
+                 *dol_end = eos;
+                 ADVANCE_1;
+                 while(line[i] != '$')
+                     ADVANCE_1;
+
+             }
+
              /* single-line comment? truncate line */
              else if (line[i] == '-' && line[i + thislen] == '-')
              {
***************
*** 447,452 ****
--- 537,566 ----
                  i = end_of_cmd - line;
                  query_start = i;
              }
+
+             /*
+              * keyword or identifier?
+              * We grab the whole string so that we don't
+              * mistakenly see $foo$ inside an identifier as the start
+              * of a dollar quote.
+              */
+
+             else if ( (line[i] & 0x80) != 0 ||
+                       isalpha(line[i]) ||
+                       line[i] == '_')
+             {
+                 while ((line[i+thislen] & 0x80) != 0 ||
+                        isalnum(line[i+thislen]) ||
+                        line[i+thislen] == '_' ||
+                        line[i+thislen] == '$' )
+                 {
+                     /* keep going while we still have identifier chars */
+                     ADVANCE_1;
+                 }
+
+
+             }
+
          }                        /* for (line) */


***************
*** 458,464 ****


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
--- 572,579 ----


          /* Put the rest of the line in the query buffer. */
!         if (in_quote || dol_quote ||
!             line[query_start + strspn(line + query_start, " \t\n\r")] != '\0')
          {
              if (query_buf->len > 0)
                  appendPQExpBufferChar(query_buf, '\n');
Index: src/bin/psql/prompt.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.c,v
retrieving revision 1.34
diff -c -r1.34 prompt.c
*** src/bin/psql/prompt.c    25 Jan 2004 03:07:22 -0000    1.34
--- src/bin/psql/prompt.c    15 Feb 2004 14:28:02 -0000
***************
*** 85,90 ****
--- 85,91 ----
          case PROMPT_CONTINUE:
          case PROMPT_SINGLEQUOTE:
          case PROMPT_DOUBLEQUOTE:
+         case PROMPT_DOLLARQUOTE:
          case PROMPT_COMMENT:
          case PROMPT_PAREN:
              prompt_name = "PROMPT2";
***************
*** 198,203 ****
--- 199,207 ----
                              break;
                          case PROMPT_DOUBLEQUOTE:
                              buf[0] = '"';
+                             break;
+                         case PROMPT_DOLLARQUOTE:
+                             buf[0] = '$';
                              break;
                          case PROMPT_COMMENT:
                              buf[0] = '*';
Index: src/bin/psql/prompt.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/psql/prompt.h,v
retrieving revision 1.13
diff -c -r1.13 prompt.h
*** src/bin/psql/prompt.h    29 Nov 2003 19:52:07 -0000    1.13
--- src/bin/psql/prompt.h    15 Feb 2004 14:28:02 -0000
***************
*** 15,20 ****
--- 15,21 ----
      PROMPT_COMMENT,
      PROMPT_SINGLEQUOTE,
      PROMPT_DOUBLEQUOTE,
+     PROMPT_DOLLARQUOTE,
      PROMPT_PAREN,
      PROMPT_COPY
  } promptStatus_t;
Index: src/backend/parser/scan.l
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/parser/scan.l,v
retrieving revision 1.112
diff -c -r1.112 scan.l
*** src/backend/parser/scan.l    29 Nov 2003 19:51:52 -0000    1.112
--- src/backend/parser/scan.l    15 Feb 2004 14:28:16 -0000
***************
*** 39,44 ****
--- 39,46 ----

  static int        xcdepth = 0;    /* depth of nesting in slash-star comments */

+ static char    *dolqstart;  /* current $foo$ quote start string */
+
  /*
   * literalbuf is used to accumulate literal values when multiple rules
   * are needed to parse a single literal.  Call startlit to reset buffer
***************
*** 95,100 ****
--- 97,103 ----
   *  <xd> delimited identifiers (double-quoted identifiers)
   *  <xh> hexadecimal numeric string
   *  <xq> quoted strings
+  *  <dolq> $foo$-style quoted strings
   */

  %x xb
***************
*** 102,107 ****
--- 105,111 ----
  %x xd
  %x xh
  %x xq
+ %x dolq

  /* Bit string
   * It is tempting to scan the string for only those characters
***************
*** 141,146 ****
--- 145,159 ----
  xqoctesc        [\\][0-7]{1,3}
  xqcat            {quote}{whitespace_with_newline}{quote}

+ /* $foo$ style quotes ("dollar quoting")
+  * The quoted string starts with $foo$ where "foo" is an optional string
+  * in the form of an identifier, except that it may not contain "$",
+  * and extends to the first occurrence
+  * of an identical string.  There is *no* processing of the quoted text.
+  */
+ dolqdelim   \$([A-Za-z\200-\377][A-Za-z\200-\377_0-9]*)?\$
+ dolqinside  [^$]+
+
  /* Double quote
   * Allows embedded spaces and other special characters into identifiers.
   */
***************
*** 387,392 ****
--- 400,434 ----
                  }
  <xq><<EOF>>        { yyerror("unterminated quoted string"); }

+ {dolqdelim}  {
+      token_start = yytext;
+      dolqstart = pstrdup(yytext);
+      BEGIN(dolq);
+      startlit();
+     }
+ <dolq>{dolqdelim} {
+      if (strcmp(yytext, dolqstart) == 0)
+      {
+       pfree(dolqstart);
+       BEGIN(INITIAL);
+       yylval.str = litbufdup();
+       return SCONST;
+      }
+      /*
+       * When we fail to match $...$ to dolqstart, transfer
+       * the $... part to the output, but put back the final
+       * $ for rescanning.  Consider $delim$...$junk$delim$
+       */
+      addlit(yytext, yyleng-1);
+      yyless(yyleng-1);
+     }
+ <dolq>{dolqinside}  {
+      addlit(yytext, yyleng);
+     }
+ <dolq>.           {
+      addlitchar(yytext[0]);
+     }
+ <dolq><<EOF>>  { yyerror("unterminated special-quoted string"); }

  {xdstart}        {
                      token_start = yytext;

Re: [PATCHES] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
>> No, it won't.  The problem is that it should, because the backend will
>> see that as '42' followed by a $foo$ quote start.

> Ok, I see what you are saying. This mismatch would only happen on
> invalid input, though. I believe that what I did will work on all legal
> input.

I'm unconvinced.  Even if there are not any current syntaxes in which a
numeric literal can be adjacent to a string literal (I'm not totally
sure about that), what of the future?  We should solve the problem
rather than assuming it won't bite us.

> I think that this might be cured by having psql recognise a legal
> identifier or keyword and eating it as a word, rather than treating it
> as just another set of bytes in the stream.

Hm, might work ... will think about it ...

            regards, tom lane

Re: [PATCHES] dollar quoting

From
Andrew Dunstan
Date:
Another interesting thing abut psql that I noticed when using '$' in 
identifiers is this effect:


andrew=# create table ab$cd$ef (ef$cd$ab text);
CREATE TABLE
andrew=# \d ab$cd$ef
Did not find any relation named "ab$cd$ef".
andrew=# \d ab\$cd\$ef  Table "public.ab$cd$ef" Column  | Type | Modifiers
----------+------+-----------ef$cd$ab | text |


which is perhaps slightly less than intuitive.

cheers

andrew

>  
>



Re: [PATCHES] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> andrew=# create table ab$cd$ef (ef$cd$ab text);
> CREATE TABLE
> andrew=# \d ab$cd$ef
> Did not find any relation named "ab$cd$ef".

Hmph.  I always thought that "$" was only special at the end of a regex,
but that doesn't seem to be how our implementation treats it.  Anyway
this is not a bug, it is a feature: the argument of \d is a regex.
        regards, tom lane


Re: [PATCHES] dollar quoting

From
Bruce Momjian
Date:
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> >> No, it won't.  The problem is that it should, because the backend will
> >> see that as '42' followed by a $foo$ quote start.
>
> > Ok, I see what you are saying. This mismatch would only happen on
> > invalid input, though. I believe that what I did will work on all legal
> > input.
>
> I'm unconvinced.  Even if there are not any current syntaxes in which a
> numeric literal can be adjacent to a string literal (I'm not totally
> sure about that), what of the future?  We should solve the problem
> rather than assuming it won't bite us.
>
> > I think that this might be cured by having psql recognise a legal
> > identifier or keyword and eating it as a word, rather than treating it
> > as just another set of bytes in the stream.
>
> Hm, might work ... will think about it ...

I am a little concerned about adding the overhead of lex to psql.  Right
now, some folks have reported that lex/yacc take a considerable amount
of processing time in the backend as part of a query, and adding that to
psql just to do $$ seems questionable.  Of course, we can alway test and
see what the overhead shows.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [PATCHES] dollar quoting

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am a little concerned about adding the overhead of lex to psql.  Right
> now, some folks have reported that lex/yacc take a considerable amount
> of processing time in the backend as part of a query, and adding that to
> psql just to do $$ seems questionable.  Of course, we can alway test and
> see what the overhead shows.

That's not the question to ask --- the question is whether a flex lexer
will be faster or slower than the hand-made lexing code that psql is
currently using.  Lexer generation is a well-understood art, and you
have to be pretty tense to beat out flex with hand-made code.  It's the
same tradeoff as trying to write better assembly code than a compiler
does.  Look at the lexing loop in psql/mainloop.c (that series of
if-tests starting at about line 250) and ask yourself if that's really
going to beat out the state-machine implementation flex uses --- which
looks to be about two table lookups per character, plus a switch()
executed at the end of every token.  I'll bet on flex being faster.

The reason the lexer shows up in the backend is that it has to grovel
over every individual character of a query.  For sufficiently large
query strings that's gonna take awhile no matter how you do it.

But in any case, the argument for moving to flex is not about
performance, it is about making the code more understandable and more
certain to agree with the backend lexer.

            regards, tom lane

Re: [PATCHES] dollar quoting

From
Andrew Dunstan
Date:
Bruce Momjian wrote:

>Tom Lane wrote:
>  
>
>>Andrew Dunstan <andrew@dunslane.net> writes:
>>    
>>
>>>>No, it won't.  The problem is that it should, because the backend will
>>>>see that as '42' followed by a $foo$ quote start.
>>>>        
>>>>
>>>Ok, I see what you are saying. This mismatch would only happen on 
>>>invalid input, though. I believe that what I did will work on all legal 
>>>input.
>>>      
>>>
>>I'm unconvinced.  Even if there are not any current syntaxes in which a
>>numeric literal can be adjacent to a string literal (I'm not totally
>>sure about that), what of the future?  We should solve the problem
>>rather than assuming it won't bite us.
>>
>>    
>>
>>>I think that this might be cured by having psql recognise a legal 
>>>identifier or keyword and eating it as a word, rather than treating it 
>>>as just another set of bytes in the stream.
>>>      
>>>
>>Hm, might work ... will think about it ...
>>    
>>
>
>I am a little concerned about adding the overhead of lex to psql.  Right
>now, some folks have reported that lex/yacc take a considerable amount
>of processing time in the backend as part of a query, and adding that to
>psql just to do $$ seems questionable.  Of course, we can alway test and
>see what the overhead shows.
>
>  
>

I think Tom's point is that the custom lexical recognition done by psql 
is approaching the point of being unmaintainable, and if we have to bite 
the bullet then maybe we might as well do so now.

I'd be surprised if using a flex lexer instead made a huge speed 
difference, but maybe I'm wrong. I'm more concerned that it will be 
difficult to write and maintain and keep in sync with the backend's 
lexical structure - but those are just gut feelings, and I could be way 
off base. Right now psql does just enough recognition to enable it to 
work. Making it recognise the whole sql lexical structure instead 
strikes me as being somewhat redundant - that's what the backend does. 
Maybe we could do a flex lexer somewhat along the lines of the 
minimalist approach that psql currently employs.

Anyway - I did put forward a possible non-flex way to handle the problem 
Tom saw with $$ quoting - let's see what his verdict is on that first 
;-) If he thinks it won't work, I can't see much alternative to using a 
flex-based lexer.

cheers

andrew



Re: [PATCHES] dollar quoting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I'd be surprised if using a flex lexer instead made a huge speed 
> difference, but maybe I'm wrong.

No, I agree --- it's unlikely to make very much difference in the real
world.  Maybe on huge query strings you could notice the difference.

> I'm more concerned that it will be 
> difficult to write and maintain and keep in sync with the backend's 
> lexical structure - but those are just gut feelings, and I could be way 
> off base. Right now psql does just enough recognition to enable it to 
> work. Making it recognise the whole sql lexical structure instead 
> strikes me as being somewhat redundant - that's what the backend does. 

Actually, I thought the way to handle it would be to duplicate the
backend lexer as nearly as possible.  Most of the productions would have
empty bodies probably, but doing it that way would guarantee that in
fact psql and the backend would lex a string the same way, which is
exactly the problem we are facing here.  You'd fall out of the lexer
only upon detecting backslash (unless we want to put backslash command
lexing into the flex code, which might or might not be a good idea),
or upon detecting a ';' at parenthesis depth 0, or upon hitting end of
string.  In the last case the lexer state would indicate which prompt
we need to give.

One unsolved issue in my mind is how to deal with multibyte characters
in non-backend-compatible encodings (the ones where some bytes of a
multibyte character can have values < 128 and thus can look like plain
ASCII characters).  We'd have to make sure that the flex lexer skips
over such bytes properly.  I've thought of a couple of kluges but
nothing I like ...
        regards, tom lane


Re: [PATCHES] dollar quoting

From
Christopher Kings-Lynne
Date:
> Actually, I thought the way to handle it would be to duplicate the
> backend lexer as nearly as possible.  Most of the productions would have
> empty bodies probably, but doing it that way would guarantee that in
> fact psql and the backend would lex a string the same way, which is
> exactly the problem we are facing here.  You'd fall out of the lexer
> only upon detecting backslash (unless we want to put backslash command
> lexing into the flex code, which might or might not be a good idea),
> or upon detecting a ';' at parenthesis depth 0, or upon hitting end of
> string.  In the last case the lexer state would indicate which prompt
> we need to give.

You know what would be really sweet?  If the lexing was made available 
as a public function.  eg. So I could parse queries in phpPgAdmin before 
sending them to the backend, etc...

Chris


Re: [PATCHES] dollar quoting

From
Tom Lane
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> You know what would be really sweet?  If the lexing was made available 
> as a public function.  eg. So I could parse queries in phpPgAdmin before 
> sending them to the backend, etc...

Parsing is a whole nother ball of wax besides lexing.  I wasn't planning
to put *that* into psql.  Remember the only thing psql really wants from
this is to detect where end-of-statement is ...
        regards, tom lane


Re: [PATCHES] dollar quoting

From
Christopher Kings-Lynne
Date:
> Parsing is a whole nother ball of wax besides lexing.  I wasn't planning
> to put *that* into psql.  Remember the only thing psql really wants from
> this is to detect where end-of-statement is ...

Forgive my lameness, but I've never truly figured out where parsing ends 
and lexing begins.  Anyone care to illuminate me on the difference?

Chris


Re: [PATCHES] dollar quoting

From
Tom Lane
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> Parsing is a whole nother ball of wax besides lexing.

> Forgive my lameness, but I've never truly figured out where parsing ends 
> and lexing begins.  Anyone care to illuminate me on the difference?

The theoretical answer is that you can do lexing with a finite-state
machine, but parsing generally requires a stack, because it supports
nested constructs.  Lexers don't have any way to describe nested
constructs --- a series of tokens is the only level of abstraction there
is.

The practical answer is that you do one with flex and the other with
bison ;-).  If you can do it with flex, and not cheat by implementing
your own state stack, it's lexing.
        regards, tom lane


Re: [PATCHES] dollar quoting

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am a little concerned about adding the overhead of lex to psql.  Right
> > now, some folks have reported that lex/yacc take a considerable amount
> > of processing time in the backend as part of a query, and adding that to
> > psql just to do $$ seems questionable.  Of course, we can alway test and
> > see what the overhead shows.
>
> That's not the question to ask --- the question is whether a flex lexer
> will be faster or slower than the hand-made lexing code that psql is
> currently using.  Lexer generation is a well-understood art, and you
> have to be pretty tense to beat out flex with hand-made code.  It's the
> same tradeoff as trying to write better assembly code than a compiler
> does.  Look at the lexing loop in psql/mainloop.c (that series of
> if-tests starting at about line 250) and ask yourself if that's really
> going to beat out the state-machine implementation flex uses --- which
> looks to be about two table lookups per character, plus a switch()
> executed at the end of every token.  I'll bet on flex being faster.
>
> The reason the lexer shows up in the backend is that it has to grovel
> over every individual character of a query.  For sufficiently large
> query strings that's gonna take awhile no matter how you do it.
>
> But in any case, the argument for moving to flex is not about
> performance, it is about making the code more understandable and more
> certain to agree with the backend lexer.

If we go with lex/flex for psql, I would still like someone to test
performance to see that we aren't taking a big hit.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [PATCHES] dollar quoting

From
Bruce Momjian
Date:
Christopher Kings-Lynne wrote:
> > Parsing is a whole nother ball of wax besides lexing.  I wasn't planning
> > to put *that* into psql.  Remember the only thing psql really wants from
> > this is to detect where end-of-statement is ...
> 
> Forgive my lameness, but I've never truly figured out where parsing ends 
> and lexing begins.  Anyone care to illuminate me on the difference?

The simplistic answer is the lexing breaks a string up into
words/tokens, and parsing matches those tokens against patterns and
fires actions.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: [PATCHES] dollar quoting

From
Andrew Dunstan
Date:

Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>  
>
>>andrew=# create table ab$cd$ef (ef$cd$ab text);
>>CREATE TABLE
>>andrew=# \d ab$cd$ef
>>Did not find any relation named "ab$cd$ef".
>>    
>>
>
>Hmph.  I always thought that "$" was only special at the end of a regex,
>but that doesn't seem to be how our implementation treats it.  Anyway
>this is not a bug, it is a feature: the argument of \d is a regex.
>
>  
>

Arguably this at least is a singularly useless feature, since a 
regex-meaning $ before the end of string is a nonsense, as you rightly 
imply, and one at the end of the string is redundant, as it is implied - 
psql turns 'abc' into '^abc$' when constructing the query.

I don't care that much - I don't use $ in my identifiers.

cheers

andrew



Re: [PATCHES] dollar quoting

From
Christopher Kings-Lynne
Date:
> Parsing is a whole nother ball of wax besides lexing.  I wasn't planning
> to put *that* into psql.  Remember the only thing psql really wants from
> this is to detect where end-of-statement is ...

OK, I'm not that great at understanding where lexing ands and parsing 
starts.  These are the things that it would be really nice to know about  in phpPgAdmin:

* Start and end of multiline queries, so when executing an SQL script I 
can execute each query separately.

* Knowing how many $n parameters there are in a SELECT statement that is 
to be prepared.

Chris



Re: [PATCHES] dollar quoting

From
Christopher Kings-Lynne
Date:
Doh, sorry.  I just realised that the lists just gave me a whole bunch 
of mails from back in February, which is when Tom made this post...

Chris

Christopher Kings-Lynne wrote:

>> Parsing is a whole nother ball of wax besides lexing.  I wasn't planning
>> to put *that* into psql.  Remember the only thing psql really wants from
>> this is to detect where end-of-statement is ...
> 
> 
> OK, I'm not that great at understanding where lexing ands and parsing 
> starts.  These are the things that it would be really nice to know about 
>  in phpPgAdmin:
> 
> * Start and end of multiline queries, so when executing an SQL script I 
> can execute each query separately.
> 
> * Knowing how many $n parameters there are in a SELECT statement that is 
> to be prepared.
> 
> Chris
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>      subscribe-nomail command to majordomo@postgresql.org so that your
>      message can get through to the mailing list cleanly