Re: CSV multiline final fix - Mailing list pgsql-patches
From | Bruce Momjian |
---|---|
Subject | Re: CSV multiline final fix |
Date | |
Msg-id | 200502212140.j1LLeWZ18529@candle.pha.pa.us Whole thread Raw |
In response to | CSV multiline final fix (Andrew Dunstan <andrew@dunslane.net>) |
Responses |
Re: CSV multiline final fix
|
List | pgsql-patches |
Shame we had to duplicate CopyReadLine() in a sense. Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Andrew Dunstan wrote: > > Well, in response to the huge number (0) of comments on my POC patch to > fix this, I prepared the attached patch, which improves on my previous > effort a bit (there was one obscure failure case which is now handled). > > Basically, all the required logic is in a new function CopyReadLineCSV() > which is almost but not quite like CopyReadLine(). The new function > keeps just enough state to know if a line ending sequence (CR, CRLF, or > LF) is part of a quoted field or not. This gets rid of the need for > special casing embedded line endings on input elsewhere, so that is > removed, as is the warning about them on output that we added back in > december (as we then thought just before release). Lastly, the docs are > also patched. > > Also attached is my tiny test file - maybe we need to cover this in > regression tests? > > cheers > > andrew > diff -c -r ../../pgbf/root/REL8_0_STABLE/pgsql/doc/src/sgml/ref/copy.sgml ./doc/src/sgml/ref/copy.sgml > *** ../../pgbf/root/REL8_0_STABLE/pgsql/doc/src/sgml/ref/copy.sgml Mon Jan 3 19:39:53 2005 > --- ./doc/src/sgml/ref/copy.sgml Sun Feb 20 19:18:54 2005 > *************** > *** 496,508 **** > <para> > CSV mode will both recognize and produce CSV files with quoted > values containing embedded carriage returns and line feeds. Thus > ! the files are not strictly one line per table row like text-mode > ! files. However, <productname>PostgreSQL</productname> will reject > ! <command>COPY</command> input if any fields contain embedded line > ! end character sequences that do not match the line ending > ! convention used in the CSV file itself. It is generally safer to > ! import data containing embedded line end characters using the > ! text or binary formats rather than CSV. > </para> > </note> > > --- 496,503 ---- > <para> > CSV mode will both recognize and produce CSV files with quoted > values containing embedded carriage returns and line feeds. Thus > ! the files are not strictly one line per table row as are text-mode > ! files. > </para> > </note> > > *************** > *** 513,518 **** > --- 508,515 ---- > might encounter some files that cannot be imported using this > mechanism, and <command>COPY</> might produce files that other > programs cannot process. > + It is generally safer to import data using the text or binary formats, > + if possible, rather than using CSV format. > </para> > </note> > > diff -c -r ../../pgbf/root/REL8_0_STABLE/pgsql/src/backend/commands/copy.c ./src/backend/commands/copy.c > *** ../../pgbf/root/REL8_0_STABLE/pgsql/src/backend/commands/copy.c Fri Dec 31 16:59:41 2004 > --- ./src/backend/commands/copy.c Sun Feb 20 13:40:56 2005 > *************** > *** 98,104 **** > static EolType eol_type; /* EOL type of input */ > static int client_encoding; /* remote side's character encoding */ > static int server_encoding; /* local encoding */ > - static bool embedded_line_warning; > > /* these are just for error messages, see copy_in_error_callback */ > static bool copy_binary; /* is it a binary copy? */ > --- 98,103 ---- > *************** > *** 140,145 **** > --- 139,145 ---- > char *delim, char *null_print, bool csv_mode, char *quote, char *escape, > List *force_notnull_atts); > static bool CopyReadLine(void); > + static bool CopyReadLineCSV(char * quote, char * escape); > static char *CopyReadAttribute(const char *delim, const char *null_print, > CopyReadResult *result, bool *isnull); > static char *CopyReadAttributeCSV(const char *delim, const char *null_print, > *************** > *** 1191,1197 **** > attr = tupDesc->attrs; > num_phys_attrs = tupDesc->natts; > attr_count = list_length(attnumlist); > - embedded_line_warning = false; > > /* > * Get info about the columns we need to process. > --- 1191,1196 ---- > *************** > *** 1718,1724 **** > ListCell *cur; > > /* Actually read the line into memory here */ > ! done = CopyReadLine(); > > /* > * EOF at start of line means we're done. If we see EOF after > --- 1717,1723 ---- > ListCell *cur; > > /* Actually read the line into memory here */ > ! done = csv_mode ? CopyReadLineCSV(quote, escape) : CopyReadLine(); > > /* > * EOF at start of line means we're done. If we see EOF after > *************** > *** 2194,2199 **** > --- 2193,2448 ---- > return result; > } > > + /* > + * Read a line for CSV copy mode. Differences from standard mode: > + * . CR an NL are not special inside quoted fields - they just get added > + * to the buffer. > + * . \ is not magical except as the start of the end of data marker. > + * > + */ > + > + static bool > + CopyReadLineCSV(char * quote, char * escape) > + { > + bool result; > + bool change_encoding = (client_encoding != server_encoding); > + int c; > + int mblen; > + int j; > + unsigned char s[2]; > + char *cvt; > + bool in_quote = false, last_was_esc = false; > + char quotec = quote[0]; > + char escapec = escape[0]; > + > + s[1] = 0; > + > + /* ignore special escape processing if it's the same as quote */ > + if (quotec == escapec) > + escapec = '\0'; > + > + /* reset line_buf to empty */ > + line_buf.len = 0; > + line_buf.data[0] = '\0'; > + line_buf.cursor = 0; > + > + /* mark that encoding conversion hasn't occurred yet */ > + line_buf_converted = false; > + > + /* set default status */ > + result = false; > + > + /* > + * In this loop we only care for detecting newlines (\r and/or \n) > + * and the end-of-copy marker (\.). These four > + * characters, and only these four, are assumed the same in frontend > + * and backend encodings. We do not assume that second and later bytes > + * of a frontend multibyte character couldn't look like ASCII characters. > + * > + * What about the encoding implications of the quote / excape chars? > + * > + * However, CR and NL characters that are inside a quoted field are > + * not special, and are simply a part of the data value. The parsing rule > + * used is a bit rough and ready, but probably adequate for our purposes. > + */ > + > + for (;;) > + { > + c = CopyGetChar(); > + if (c == EOF) > + { > + result = true; > + break; > + } > + > + /* > + * Dealing with quotes and escapes here is mildly tricky. If the > + * quote char is also the escape char, there's no problem - we > + * just use the char as a toggle. If they are different, we need > + * to ensure that we only take account of an escape inside a quoted > + * field and immediately preceding a quote char, and not the > + * second in a escape-escape sequence. > + */ > + > + if (in_quote && c == escapec) > + last_was_esc = ! last_was_esc; > + > + if (c == quotec && ! last_was_esc) > + in_quote = ! in_quote; > + > + if (c != escapec) > + last_was_esc = false; > + > + /* > + * updating the line count for embedded CR and/or LF chars is > + * necessarily a little fragile - this test is probably about > + * the best we can do. > + */ > + if (in_quote && c == (eol_type == EOL_CR ? '\r' : '\n')) > + copy_lineno++; > + > + if (!in_quote && c == '\r') > + { > + if (eol_type == EOL_NL) > + ereport(ERROR, > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("unquoted carriage return found in CSV data"), > + errhint("Use quoted CSV field to represent carriage return."))); > + /* Check for \r\n on first line, _and_ handle \r\n. */ > + if (eol_type == EOL_UNKNOWN || eol_type == EOL_CRNL) > + { > + int c2 = CopyPeekChar(); > + > + if (c2 == '\n') > + { > + CopyDonePeek(c2, true); /* eat newline */ > + eol_type = EOL_CRNL; > + } > + else > + { > + /* found \r, but no \n */ > + if (eol_type == EOL_CRNL) > + ereport(ERROR, > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("unquoted carriage return found in CSV data"), > + errhint("Use quoted CSV field to represent carriage return."))); > + > + /* > + * if we got here, it is the first line and we didn't > + * get \n, so put it back > + */ > + CopyDonePeek(c2, false); > + eol_type = EOL_CR; > + } > + } > + break; > + } > + if (!in_quote && c == '\n') > + { > + if (eol_type == EOL_CR || eol_type == EOL_CRNL) > + ereport(ERROR, > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("unquoted newline found in CSV data"), > + errhint("Use quoted CSV field to represent newline."))); > + eol_type = EOL_NL; > + break; > + } > + > + /* \ is only potentially magical at the start of a line */ > + if (line_buf.len == 0 && c == '\\') > + { > + int c2 = CopyPeekChar(); > + > + if (c2 == EOF) > + { > + result = true; > + > + CopyDonePeek(c2, true); /* eat it - do we need to? */ > + > + break; > + } > + if (c2 == '.') > + { > + > + CopyDonePeek(c2, true); /* so we can keep calling GetChar() */ > + > + if (eol_type == EOL_CRNL) > + { > + c = CopyGetChar(); > + if (c == '\n') > + ereport(ERROR, > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("end-of-copy marker does not match previous newline style"))); > + if (c != '\r') > + ereport(ERROR, > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("end-of-copy marker corrupt"))); > + } > + c = CopyGetChar(); > + if (c != '\r' && c != '\n') > + ereport(ERROR, > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("end-of-copy marker corrupt"))); > + if ((eol_type == EOL_NL && c != '\n') || > + (eol_type == EOL_CRNL && c != '\n') || > + (eol_type == EOL_CR && c != '\r')) > + ereport(ERROR, > + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > + errmsg("end-of-copy marker does not match previous newline style"))); > + > + /* > + * In protocol version 3, we should ignore anything > + * after \. up to the protocol end of copy data. (XXX > + * maybe better not to treat \. as special?) > + */ > + if (copy_dest == COPY_NEW_FE) > + { > + while (c != EOF) > + c = CopyGetChar(); > + } > + result = true; /* report EOF */ > + break; > + } > + > + CopyDonePeek(c2, false); /* not a dot, so put it back */ > + > + } > + > + appendStringInfoCharMacro(&line_buf, c); > + > + /* > + * When client encoding != server, must be careful to read the > + * extra bytes of a multibyte character exactly, since the encoding > + * might not ensure they don't look like ASCII. When the encodings > + * are the same, we need not do this, since no server encoding we > + * use has ASCII-like following bytes. > + */ > + if (change_encoding) > + { > + s[0] = c; > + mblen = pg_encoding_mblen(client_encoding, s); > + for (j = 1; j < mblen; j++) > + { > + c = CopyGetChar(); > + if (c == EOF) > + { > + result = true; > + break; > + } > + appendStringInfoCharMacro(&line_buf, c); > + } > + if (result) > + break; /* out of outer loop */ > + } > + } /* end of outer loop */ > + > + /* > + * Done reading the line. Convert it to server encoding. > + * > + * Note: set line_buf_converted to true *before* attempting conversion; > + * this prevents infinite recursion during error reporting should > + * pg_client_to_server() issue an error, due to copy_in_error_callback > + * again attempting the same conversion. We'll end up issuing the message > + * without conversion, which is bad but better than nothing ... > + */ > + line_buf_converted = true; > + > + if (change_encoding) > + { > + cvt = (char *) pg_client_to_server((unsigned char *) line_buf.data, > + line_buf.len); > + if (cvt != line_buf.data) > + { > + /* transfer converted data back to line_buf */ > + line_buf.len = 0; > + line_buf.data[0] = '\0'; > + appendBinaryStringInfo(&line_buf, cvt, strlen(cvt)); > + } > + } > + > + return result; > + } > + > /*---------- > * Read the value of a single attribute, performing de-escaping as needed. > * > *************** > *** 2369,2402 **** > > for (;;) > { > - /* handle multiline quoted fields */ > - if (in_quote && line_buf.cursor >= line_buf.len) > - { > - bool done; > - > - switch (eol_type) > - { > - case EOL_NL: > - appendStringInfoString(&attribute_buf, "\n"); > - break; > - case EOL_CR: > - appendStringInfoString(&attribute_buf, "\r"); > - break; > - case EOL_CRNL: > - appendStringInfoString(&attribute_buf, "\r\n"); > - break; > - case EOL_UNKNOWN: > - /* shouldn't happen - just keep going */ > - break; > - } > - > - copy_lineno++; > - done = CopyReadLine(); > - if (done && line_buf.len == 0) > - break; > - start_cursor = line_buf.cursor; > - } > - > end_cursor = line_buf.cursor; > if (line_buf.cursor >= line_buf.len) > break; > --- 2618,2623 ---- > *************** > *** 2629,2653 **** > !use_quote && (c = *test_string) != '\0'; > test_string += mblen) > { > - /* > - * We don't know here what the surrounding line end characters > - * might be. It might not even be under postgres' control. So > - * we simple warn on ANY embedded line ending character. > - * > - * This warning will disappear when we make line parsing field-aware, > - * so that we can reliably read in embedded line ending characters > - * regardless of the file's line-end context. > - * > - */ > - > - if (!embedded_line_warning && (c == '\n' || c == '\r') ) > - { > - embedded_line_warning = true; > - elog(WARNING, > - "CSV fields with embedded linefeed or carriage return " > - "characters might not be able to be reimported"); > - } > - > if (c == delimc || c == quotec || c == '\n' || c == '\r') > use_quote = true; > if (!same_encoding) > --- 2850,2855 ---- > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
pgsql-patches by date: