Re: Win32 patch for COPY - Mailing list pgsql-patches

From Bruce Momjian
Subject Re: Win32 patch for COPY
Date
Msg-id 200304182345.h3INjhc04871@candle.pha.pa.us
Whole thread Raw
In response to Win32 patch for COPY  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-patches
Here is a new version of the patch.  I realized that if you were loading
in a column that had only a single column, the tests wouldn't throw an
error on a literal \r, so I added code to record the end-of-line type
for the first line, and make sure the rest of the lines match.

It is a little more code, but I think COPY is important enough to
cleanly handle this and be 100% reliable.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> Here is a patch to allow COPY FROM to accept line terminators of \r, \n,
> and \r\n, and for COPY TO to output \r\n on Win32.
>
> CHANGES FROM PREVIOUS BEHAVIOR:
>
>     o We used to allow a literal carriage return as a data value,
>       while this patch will assume it is a line terminator.
>
> This was not documented in the COPY manual page, and was not output as
> part of COPY, but it was accepted, while in 7.4 it will not.  You can
> still supply carriage return as \r or backslash-carriage-return.
>
> One trick was to prevent silently ignoring carriage returns at the end
> of a line in non-\r\n files.  The solution was to create a has_crnl
> variable that is set from the first copy line --- if it is false, a
> literal carriage return found as a data value will throw an error, while
> a newline without a preceeding carriage return also throws an error.
> Backslash-literal still works fine.  Literal carriage returns or line
> feeds not at the end of a line will cause the next line to have the
> incorrect number of fields which will throw an error.
>
> Even single-line COPY tables are properly checked when using
> STDIN/STDOUT because the \. must also terminate consistenly.
>
> Another change is that Win32 will output COPY files as native \r\n,
> rather than \n.  Of course, this can be loaded into non-Win32 too.
>
> Should be be outputting \r for OS X?
>
> The good news is that copy.c is the only place where EOL still needs to
> be dealt with.  Other files are either open in text mode (meaning they
> can handle any end-of-line format) or aren't edited/created by users.
> There is no need to change psql \copy because those files are opened in
> text mode.
>
> I also cleaned up the BinarySignature variable usage.
>
> I have tested with \n and \r\n PGEOL values.
>
> Docs updated.
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Index: doc/src/sgml/ref/copy.sgml
===================================================================
RCS file: /cvsroot/pgsql-server/doc/src/sgml/ref/copy.sgml,v
retrieving revision 1.42
diff -c -c -r1.42 copy.sgml
*** doc/src/sgml/ref/copy.sgml    15 Apr 2003 13:25:08 -0000    1.42
--- doc/src/sgml/ref/copy.sgml    18 Apr 2003 23:42:48 -0000
***************
*** 289,295 ****
      otherwise be taken as row or column delimiters. In particular, the
      following characters <emphasis>must</> be preceded by a backslash if
      they appear as part of a column value: backslash itself,
!     newline, and the current delimiter character.
     </para>

     <para>
--- 289,295 ----
      otherwise be taken as row or column delimiters. In particular, the
      following characters <emphasis>must</> be preceded by a backslash if
      they appear as part of a column value: backslash itself,
!     newline, carriage return, and the current delimiter character.
     </para>

     <para>
***************
*** 355,370 ****
      It is strongly recommended that applications generating COPY data convert
      data newlines and carriage returns to the <literal>\n</> and
      <literal>\r</> sequences respectively.  At present it is
!     possible to represent a data carriage return without any special quoting,
!     and to represent a data newline by a backslash and newline.  However,
!     these representations will not be accepted by default in future releases.
     </para>

     <para>
!     Note that the end of each row is marked by a Unix-style newline
!     (<quote><literal>\n</></>).  Presently, <command>COPY FROM</command> will not behave as
!     desired if given a file containing DOS- or Mac-style newlines.
!     This is expected to change in future releases.
     </para>
    </refsect2>

--- 355,370 ----
      It is strongly recommended that applications generating COPY data convert
      data newlines and carriage returns to the <literal>\n</> and
      <literal>\r</> sequences respectively.  At present it is
!     possible to represent a data carriage return by a backslash and carriage
!     return, and to represent a data newline by a backslash and newline.
!     However, these representations might not be accepted in future releases.
     </para>

     <para>
!     <command>COPY TO</command> will terminate each row with a Unix-style
!     newline (<quote><literal>\n</></>),  or carriage return/newline
!     ("\r\n") on  MS Windows.  <command>COPY FROM</command> can handle lines
!     ending with newlines, carriage returns, or carriage return/newlines.
     </para>
    </refsect2>

***************
*** 393,399 ****
  12-byte sequence <literal>PGBCOPY\n\377\r\n\0</> --- note that the zero byte
  is a required part of the signature.  (The signature is designed to allow
  easy identification of files that have been munged by a non-8-bit-clean
! transfer.  This signature will be changed by newline-translation
  filters, dropped zero bytes, dropped high bits, or parity changes.)
         </para>
        </listitem>
--- 393,399 ----
  12-byte sequence <literal>PGBCOPY\n\377\r\n\0</> --- note that the zero byte
  is a required part of the signature.  (The signature is designed to allow
  easy identification of files that have been munged by a non-8-bit-clean
! transfer.  This signature will be changed by end-of-line-translation
  filters, dropped zero bytes, dropped high bits, or parity changes.)
         </para>
        </listitem>
Index: src/backend/commands/copy.c
===================================================================
RCS file: /cvsroot/pgsql-server/src/backend/commands/copy.c,v
retrieving revision 1.191
diff -c -c -r1.191 copy.c
*** src/backend/commands/copy.c    4 Apr 2003 20:42:11 -0000    1.191
--- src/backend/commands/copy.c    18 Apr 2003 23:42:50 -0000
***************
*** 49,54 ****
--- 49,60 ----
  #define ISOCTAL(c) (((c) >= '0') && ((c) <= '7'))
  #define OCTVALUE(c) ((c) - '0')

+ #ifndef WIN32
+ #define PGEOL    "\n"
+ #else
+ #define PGEOL    "\r\n"
+ #endif
+
  /*
   * Represents the type of data returned by CopyReadAttribute()
   */
***************
*** 70,82 ****
  static void CopyAttributeOut(FILE *fp, char *string, char *delim);
  static List *CopyGetAttnums(Relation rel, List *attnamelist);

! static const char BinarySignature[12] = "PGBCOPY\n\377\r\n\0";

  /*
   * Static communication variables ... pretty grotty, but COPY has
   * never been reentrant...
   */
  int            copy_lineno = 0;    /* exported for use by elog() -- dz */
  static bool fe_eof;

  /*
--- 76,97 ----
  static void CopyAttributeOut(FILE *fp, char *string, char *delim);
  static List *CopyGetAttnums(Relation rel, List *attnamelist);

! /* The trailing null is part of the signature */
! static const char BinarySignature[] = "PGBCOPY\n\377\r\n";

  /*
   * Static communication variables ... pretty grotty, but COPY has
   * never been reentrant...
   */
  int            copy_lineno = 0;    /* exported for use by elog() -- dz */
+ static enum
+ {
+     EOL_UNKNOWN,
+     EOL_NL,
+     EOL_CR,
+     EOL_CRNL
+ } eol_type;
+
  static bool fe_eof;

  /*
***************
*** 504,510 ****
      else if (!is_from)
      {
          if (!binary)
!             CopySendData("\\.\n", 3, fp);
          if (IsUnderPostmaster)
              pq_endcopyout(false);
      }
--- 519,528 ----
      else if (!is_from)
      {
          if (!binary)
!         {
!             CopySendString("\\.", fp);
!             CopySendString(fp ? PGEOL : "\n", fp);
!         }
          if (IsUnderPostmaster)
              pq_endcopyout(false);
      }
***************
*** 589,595 ****
          int32        tmp;

          /* Signature */
!         CopySendData((char *) BinarySignature, 12, fp);
          /* Integer layout field */
          tmp = 0x01020304;
          CopySendData(&tmp, sizeof(int32), fp);
--- 607,613 ----
          int32        tmp;

          /* Signature */
!         CopySendData((char *) BinarySignature, sizeof(BinarySignature), fp);
          /* Integer layout field */
          tmp = 0x01020304;
          CopySendData(&tmp, sizeof(int32), fp);
***************
*** 725,731 ****
          }

          if (!binary)
!             CopySendChar('\n', fp);

          MemoryContextSwitchTo(oldcontext);
      }
--- 743,749 ----
          }

          if (!binary)
!             CopySendString(fp ? PGEOL : "\n", fp);

          MemoryContextSwitchTo(oldcontext);
      }
***************
*** 906,912 ****

          /* Signature */
          CopyGetData(readSig, 12, fp);
!         if (CopyGetEof(fp) || memcmp(readSig, BinarySignature, 12) != 0)
              elog(ERROR, "COPY BINARY: file signature not recognized");
          /* Integer layout field */
          CopyGetData(&tmp, sizeof(int32), fp);
--- 924,930 ----

          /* Signature */
          CopyGetData(readSig, 12, fp);
!         if (CopyGetEof(fp) || memcmp(readSig, BinarySignature, sizeof(BinarySignature)) != 0)
              elog(ERROR, "COPY BINARY: file signature not recognized");
          /* Integer layout field */
          CopyGetData(&tmp, sizeof(int32), fp);
***************
*** 937,942 ****
--- 955,961 ----
      nulls = (char *) palloc(num_phys_attrs * sizeof(char));

      copy_lineno = 0;
+     eol_type = EOL_UNKNOWN;
      fe_eof = false;

      /* Make room for a PARAM_EXEC value for domain constraint checks */
***************
*** 1350,1357 ****
--- 1369,1412 ----
              *result = END_OF_FILE;
              goto copy_eof;
          }
+         if (c == '\r')
+         {
+             if (eol_type == EOL_NL)
+                 elog(ERROR, "CopyReadAttribute: Literal carriage return data value\n"
+                             "found in input that has newline termination; use \\r");
+
+             /*    Check for \r\n on first line, _and_ handle \r\n. */
+             if (copy_lineno == 1 || eol_type == EOL_CRNL)
+             {
+                 int c2 = CopyPeekChar(fp);
+                 if (c2 == '\n')
+                 {
+                     CopyDonePeek(fp, c2, true);        /* eat newline */
+                     eol_type = EOL_CRNL;
+                 }
+                 else
+                 {
+                     /* found \r, but no \n */
+                     if (eol_type == EOL_CRNL)
+                         elog(ERROR, "CopyReadAttribute: Literal carriage return data value\n"
+                                     "found in input that has carriage return/newline termination; use \\r");
+                     /* if we got here, it is the first line and we didn't get \n, so put it back */
+                     CopyDonePeek(fp, c2, false);
+                     eol_type = EOL_CR;
+                 }
+             }
+             *result = END_OF_LINE;
+             break;
+         }
          if (c == '\n')
          {
+             if (eol_type == EOL_CRNL)
+                 elog(ERROR, "CopyReadAttribute: Literal newline data value found in input\n"
+                             "that has carriage return/newline termination; use \\n");
+             if (eol_type == EOL_CR)
+                 elog(ERROR, "CopyReadAttribute: Literal newline data value found in input\n"
+                             "that has carriage return termination; use \\n");
+             eol_type = EOL_NL;
              *result = END_OF_LINE;
              break;
          }
***************
*** 1441,1451 ****
                      c = '\v';
                      break;
                  case '.':
                      c = CopyGetChar(fp);
!                     if (c != '\n')
!                         elog(ERROR, "CopyReadAttribute: end of record marker corrupted");
                      *result = END_OF_FILE;
                      goto copy_eof;
              }
          }
          appendStringInfoCharMacro(&attribute_buf, c);
--- 1496,1519 ----
                      c = '\v';
                      break;
                  case '.':
+                     if (eol_type == EOL_CRNL)
+                     {
+                         c = CopyGetChar(fp);
+                         if (c == '\n')
+                             elog(ERROR, "CopyReadAttribute: end-of-copy termination does not match previous input");
+                         if (c != '\r')
+                             elog(ERROR, "CopyReadAttribute: end-of-copy marker corrupt");
+                     }
                      c = CopyGetChar(fp);
!                     if (c != '\r' && c != '\n')
!                         elog(ERROR, "CopyReadAttribute: end-of-copy marker corrupt");
!                     if (((eol_type == EOL_NL || eol_type == EOL_CRNL) && c != '\n') ||
!                         (eol_type == EOL_CR && c != '\r'))
!                         elog(ERROR, "CopyReadAttribute: end-of-copy termination does not match previous input");
                      *result = END_OF_FILE;
                      goto copy_eof;
+
+                 /* Default, fall through with whatever character was just escaped. */
              }
          }
          appendStringInfoCharMacro(&attribute_buf, c);

pgsql-patches by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Win32 defines
Next
From: Sean Chittenden
Date:
Subject: Slightly improved SSL bits...