libpq should have functions for escaping data for use in COPY FROM - Mailing list pgsql-hackers
From | Joey Adams |
---|---|
Subject | libpq should have functions for escaping data for use in COPY FROM |
Date | |
Msg-id | CAARyMpA7Gq3joL=1AqRKBS0FOGAOHb6Fh-wU=hbLghHcaFjOXw@mail.gmail.com Whole thread Raw |
Responses |
Re: libpq should have functions for escaping data for use
in COPY FROM
|
List | pgsql-hackers |
libpq has functions for escaping values in SQL commands (PQescapeStringConn, PQescapeByteaConn, and the new PQescapeLiteral), and it supports parameterizing queries with PQexecParams. But it does not (to my knowledge) have functions for escaping values for COPY FROM. COPY FROM is useful for inserting rows in bulk (though I wonder if constructing massive INSERT statements and using PQexecParams is just as efficient). It is also useful for generating .sql files which can be run on a database elsewhere. I think libpq should include functions for escaping with COPY FROM. Perhaps the API could look like this: typedef struct EscapeCopyInfo { /* * If this is not NULL, the PQescapeCopy functions will take advantage of * connection-specific information to minimize the escaped representation * length. */ PGConn *conn; char delimiter; const char *null_string; } EscapeCopyInfo; void initEscapeCopyInfo(EscapeCopyInfo *); /* If info is NULL, it defaults to {NULL, '\t', "\\N"} */ size_t PQescapeStringCopy( EscapeCopyInfo *info, char *to, const char *from, size_t from_length, int *error ); size_t PQescapeByteaCopy( EscapeCopyInfo *info, unsigned char *to, const unsigned char *from, size_t from_length, int *error ); Using an EscapeCopyInfo structure to specify format information cuts down on parameter counts for the escaping functions, and makes it possible to support more options in the future (e.g. CSV). I'm not terribly attached to the parameter order of the functions. I was just following the lead of PQescapeStringConn. This API writes text into a buffer, rather than allocating a buffer with malloc like PQescapeByteaConn and PQescapeLiteral do. This means rows containing multiple values can be constructed efficiently. On the other hand, it is inconvenient and error-prone, since the programmer has to calculate how big the buffer needs to be (and this varies depending on settings). Imagine this bug: * A programmer allocates a buffer of size 2*length + 2 for a BYTEA, assuming the bytestring will be escaped using the \x... format * A user runs the program with PostgreSQL 8.4, which predates the \x... format, and now the bytestring's escaped representation can be up to 5*length bytes long (but very frequently isn't!) I wrote a basic implementation for a Haskell binding, but it only supports default COPY options (text mode, with tab delimiter and \N null string). https://github.com/joeyadams/haskell-libpq/blob/copy-from/cbits/escape-copy.c Before spending a bunch of time on this, I'd like some input. A few questions: * Should we have corresponding functions for parsing COPY TO data, or is PQexecParams sufficient? * Should we support CSV escaping? Can the CSV format safely encode all characters (in particular, newlines)? * Should we deal with encodings here, or just escape everything that isn't printable ASCII like the code I wrote does? Thanks, -Joey
pgsql-hackers by date: