Thread: Escaping input from COPY
Hi folks,

In a Perl application I would like to frequently bulk load several hundred thousand rows of data into a temporary table before merging the contents into the database proper. I'm currently doing individual INSERTs into the temporary table, which obviously carries a significant performance penalty: the import takes several minutes even on a very fast machine.

I'd like to switch to COPY, which should be orders of magnitude faster. I see that DBD::Pg has an interface for this, which looks just fine. My problem is with how to escape the data. I need to use whatever escaping rules are in use by the server, which I've seen documented in the manual; but in order to cope with any future changes to these rules, and to ensure identical behaviour, are there any standard functions I can use to escape the data before loading it?

Thanks,
Roger

-- 
 .''`.  Roger Leigh
: :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
`. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
  `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
On Mon, Dec 19, 2011 at 6:56 AM, Roger Leigh <rleigh@codelibre.net> wrote:
> I'd like to switch to COPY, which should be orders of
> magnitude faster. I see that DBD::Pg has an interface for
> this, which looks just fine. My problem is with how to
> escape the data. I need to use whatever escaping rules
> are in use by the server, which I've seen documented in
> the manual; but in order to cope with any future changes
> to these rules, and ensure identical behaviour, are there
> any standard functions I can use to escape the data before
> loading it?

This is really a question for the DBD::Pg folks, I think. Looking at:

  http://search.cpan.org/~turnstep/DBD-Pg-2.16.1/Pg.pm#COPY_support

it doesn't look like there is support for escaping COPY data. But incidentally, I recently brought up the same problem with psycopg2 on the psycopg list, and it seems there's no existing solution there, either. Going out on a limb, I'm guessing that connectors don't offer this support because there is no function in libpq for them to wrap, and they don't want to kludge their own.

Anyone else think it might be a good idea for libpq to offer some function to escape text to be used by COPY?

Josh
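For reference, the text-format escaping in question is small enough to sketch by hand. Below is a minimal, unofficial sketch in Python (the function name `copy_escape` is mine, and the rules are my reading of the manual's description of COPY's text format, so verify against your server version before relying on it): NULL becomes `\N`, and backslash, tab, newline and carriage return are backslash-escaped so they cannot be mistaken for the delimiter or the row terminator.

```python
def copy_escape(value):
    """Escape one value for COPY ... FROM STDIN in text format.

    Hand-rolled sketch of the documented text-format rules;
    not an API provided by libpq, DBD::Pg or psycopg2.
    """
    if value is None:
        return r'\N'
    s = str(value)
    # Double existing backslashes first, before introducing new ones.
    s = s.replace('\\', '\\\\')
    s = s.replace('\t', '\\t')
    s = s.replace('\n', '\\n')
    s = s.replace('\r', '\\r')
    return s

# One escaped line for a deliberately awkward row:
row = ('Strange\t\tFirst\\Name', 'Last\nName', 100)
line = '\t'.join(copy_escape(v) for v in row)
```

The ordering matters: doubling backslashes must happen before the control characters are rewritten, or the replacement backslashes would themselves get doubled.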
On Tuesday, December 20, 2011 3:56:14 pm Josh Kupershmidt wrote:
> This is really a question for the DBD::Pg folks, I think. Looking at:
> http://search.cpan.org/~turnstep/DBD-Pg-2.16.1/Pg.pm#COPY_support
>
> It doesn't look like there is support for escaping COPY data. But
> incidentally, I recently brought up the same problem with psycopg2 on
> the psycopg list, and it seems there's no existing solution there,
> either.

As far as I know you did not get an answer, which is not the same as there being no answer :) I think you will find that the escaping is handled for you.

> Going out on a limb, I'm guessing that connectors don't offer
> this support because there is no function in libpq for them to wrap,
> and they don't want to kludge their own.
>
> Anyone else think it might be a good idea for libpq to offer some
> function to escape text to be used by COPY?

-- 
Adrian Klaver
adrian.klaver@gmail.com
On Tue, Dec 20, 2011 at 7:47 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:
> As far as I know you did not get an answer, which is not the same as
> there being no answer :) I think you will find that the escaping is
> handled for you.

I am rather dubious of the claim that "escaping is handled for you" with copy_from(). Let us look at this example from psycopg2's copy_from.py example code:

  data = StringIO.StringIO()
  data.write('\n'.join(['Tom\tJenkins\t37',
                        'Madonna\t\N\t45',
                        'Federico\tDi Gregorio\t\N']))
  data.seek(0)
  curs.copy_from(data, 'test_copy')

This works because the strings have essentially been escaped by hand, and None turned into '\N'. So let's say you had the same data, without the escaping being done by hand, like this:

  rows = [('Tom', 'Jenkins', 37),
          ('Madonna', None, 45),
          ('Federico', 'Di Gregorio', None)]

You could get away with:

  data = StringIO.StringIO()
  for row in rows:
      data.write('\t'.join([str(el) if el is not None else '\\N'
                            for el in row]))
      data.write('\n')
  data.seek(0)
  curs.copy_from(data, 'test_copy')

But only because none of the rows happen to contain any characters which must be escaped. How are you supposed to use copy_from() with arbitrary text, e.g.

  rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]

because that sure doesn't seem to be handled automagically. Yes, I know I can write my own escaping code, but as Roger points out that's not ideal.

Josh
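To make the failure mode concrete, here is a quick check (pure Python, no database needed) of what the naive tab-join above produces for such a row. The embedded tabs in the first field are indistinguishable from column separators, and the embedded newline would additionally be read by COPY as a row terminator:

```python
rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]

# The naive formatting from the copy_from() example above,
# applied to the awkward row:
line = '\t'.join(str(el) if el is not None else '\\N'
                 for el in rows[0])

# Splitting the line on tabs, as COPY's text format would,
# yields five "columns" for a three-column row.
fields = line.split('\t')
```

This is exactly why unescaped data is too fragile: the row happens to parse, just not into the values that went in.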
On Wed, Dec 21, 2011 at 06:16:42PM -0500, Josh Kupershmidt wrote:
> On Tue, Dec 20, 2011 at 7:47 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:
> > As far as I know you did not get an answer, which is not the same as
> > there being no answer :) I think you will find that the escaping is
> > handled for you.
>
> I am rather dubious of the claim that "escaping is handled for you"
> with copy_from(). [...]
>
> This works because the strings have essentially been escaped by hand,
> and None turned into '\N'. So let's say you had the same data, without
> the escaping being done by hand, like this: [...]
>
> But only because none of the rows happen to contain any characters
> which must be escaped. How are you supposed to use copy_from() with
> arbitrary text, e.g.
>
>   rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]
>
> because that sure doesn't seem to be handled automagically. Yes, I
> know I can write my own escaping code, but as Roger points out that's
> not ideal.

Yes, this is exactly the issue I have. Without being able to handle the escaping of arbitrary data, it's too fragile to rely on. At some point I'll get caught out by data containing escapes or tabs, and it will all go horribly wrong.

I did see an example of using CSV format instead here:

  http://www.perlmonks.org/?node_id=847265

and I'm sure there's a Pythonic equivalent which you could also use. Since CSV has somewhat better-defined quoting rules (i.e. not PostgreSQL-specific), I think I'll be trying this first. I'd be happy to use the PostgreSQL format, but to avoid future breakage, I'd need some way of determining what the server-side format is when escaping. If neither is sufficiently robust, I'll need to stick with parameterised inserts, which handle arbitrary data without problems, other than being slow.

Regards,
Roger

-- 
 .''`.  Roger Leigh
: :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
`. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
  `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
On Wednesday, December 21, 2011 3:16:42 pm Josh Kupershmidt wrote:
> On Tue, Dec 20, 2011 at 7:47 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:
> > As far as I know you did not get an answer, which is not the same as
> > there being no answer :) I think you will find that the escaping is
> > handled for you.
>
> I am rather dubious of the claim that "escaping is handled for you"
> with copy_from(). Let us look at this example from psycopg2's
> copy_from.py example code:

The issue is that COPY needs to see data in the proper format, and so yes, some massaging is necessary. For simple cases copy_from() is sufficient; for more complex cases see below.

> But only because none of the rows happen to contain any characters
> which must be escaped. How are you supposed to use copy_from() with
> arbitrary text, e.g.
>
>   rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]

In this case copy_expert is your friend:

  import csv
  import StringIO

  rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]

  f = StringIO.StringIO()
  c = csv.writer(f, delimiter='|')
  for row in rows:
      c.writerow(row)
  f.seek(0)
  cur.copy_expert("copy copy_t from stdin with csv delimiter '|'", f)

  test(5432)aklaver=> SELECT * from copy_t;
             fld_1              | fld_2 | fld_3
  ------------------------------+-------+-------
   Strange          First\Name  | Last +|   100
                                | Name  |

At any rate this is more appropriately handled on the psycopg list, so if you want to explore this further we can continue over there.

-- 
Adrian Klaver
adrian.klaver@gmail.com
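As a standalone sanity check of why the CSV route is the more robust one: Python's csv module quotes the awkward characters for you, and the output parses back to the original values. A small illustration (using Python 3's io and csv modules in place of the thread's Python 2 StringIO; the table and column data are just the example row from above):

```python
import csv
import io

rows = [('Strange\t\tFirst\\Name', 'Last\nName', '100')]

buf = io.StringIO()
writer = csv.writer(buf, delimiter='|')
for row in rows:
    writer.writerow(row)

# Round-trip: the reader recovers the original fields, with the
# tabs, backslashes and embedded newline intact.
buf.seek(0)
recovered = [tuple(r) for r in csv.reader(buf, delimiter='|')]
```

The field containing a newline comes out quoted, so COPY ... WITH CSV can tell an embedded newline from a row terminator, which is exactly what the text format's backslash escaping would otherwise have to guarantee by hand.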