Thread: Escaping input from COPY
Hi folks,

In a Perl application I would like to frequently bulk load several hundred thousand rows of data into a temporary table before merging the contents into the database proper. I'm currently doing individual INSERTs into the temporary table, which obviously carries a significant performance penalty: the import takes several minutes even on a very fast machine.

I'd like to switch to COPY, which should be orders of magnitude faster. I see that DBD::Pg has an interface for this, which looks just fine. My problem is with how to escape the data. I need to use whatever escaping rules are in use by the server, which I've seen documented in the manual; but in order to cope with any future changes to these rules, and to ensure identical behaviour, are there any standard functions I can use to escape the data before loading it?

Thanks,
Roger

-- 
 .''`.  Roger Leigh
: :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
`. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
  `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
On Mon, Dec 19, 2011 at 6:56 AM, Roger Leigh <rleigh@codelibre.net> wrote:
> I'd like to switch to COPY, which should be orders of
> magnitude faster. I see that DBD::Pg has an interface for
> this, which looks just fine. My problem is with how to
> escape the data. I need to use whatever escaping rules
> are in use by the server, which I've seen documented in
> the manual; but in order to cope with any future changes
> to these rules, and ensure identical behaviour, are there
> any standard functions I can use to escape the data before
> loading it?

This is really a question for the DBD::Pg folks, I think. Looking at:

  http://search.cpan.org/~turnstep/DBD-Pg-2.16.1/Pg.pm#COPY_support

it doesn't look like there is support for escaping COPY data. But incidentally, I recently brought up the same problem with psycopg2 on the psycopg list, and it seems there's no existing solution there, either. Going out on a limb, I'm guessing that connectors don't offer this support because there is no function in libpq for them to wrap, and they don't want to kludge their own.

Anyone else think it might be a good idea for libpq to offer some function to escape text to be used by COPY?

Josh
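For reference, the text-format escaping in question is small enough to sketch by hand. Below is a minimal, unofficial sketch in Python (the function name `copy_escape` is mine, and the rules are my reading of the manual's description of COPY's text format, so verify against your server version before relying on it): NULL becomes `\N`, and backslash, tab, newline and carriage return are backslash-escaped so they cannot be mistaken for the delimiter or the row terminator.

```python
def copy_escape(value):
    """Escape one value for COPY ... FROM STDIN in text format.

    Hand-rolled sketch of the documented text-format rules;
    not an API provided by libpq, DBD::Pg or psycopg2.
    """
    if value is None:
        return r'\N'
    s = str(value)
    # Double existing backslashes first, before introducing new ones.
    s = s.replace('\\', '\\\\')
    s = s.replace('\t', '\\t')
    s = s.replace('\n', '\\n')
    s = s.replace('\r', '\\r')
    return s

# One escaped line for a deliberately awkward row:
row = ('Strange\t\tFirst\\Name', 'Last\nName', 100)
line = '\t'.join(copy_escape(v) for v in row)
```

The ordering matters: doubling backslashes must happen before the control characters are rewritten, or the replacement backslashes would themselves get doubled.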
On Tuesday, December 20, 2011 3:56:14 pm Josh Kupershmidt wrote:
> This is really a question for the DBD::Pg folks, I think. Looking at:
> http://search.cpan.org/~turnstep/DBD-Pg-2.16.1/Pg.pm#COPY_support
>
> It doesn't look like there is support for escaping COPY data. But
> incidentally, I recently brought up the same problem with psycopg2 on
> the psycopg list, and it seems there's no existing solution there,
> either.

As far as I know you did not get an answer, which is not the same as there being no answer :) I think you will find that the escaping is handled for you.

> Going out on a limb, I'm guessing that connectors don't offer
> this support because there is no function in libpq for them to wrap,
> and they don't want to kludge their own.
>
> Anyone else think it might be a good idea for libpq to offer some
> function to escape text to be used by COPY?

-- 
Adrian Klaver
adrian.klaver@gmail.com
On Tue, Dec 20, 2011 at 7:47 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:
> As far as I know you did not get an answer, which is not the same as
> there being no answer :) I think you will find that the escaping is
> handled for you.

I am rather dubious of the claim that "escaping is handled for you" with copy_from(). Let us look at this example from psycopg2's copy_from.py example code:

  data = StringIO.StringIO()
  data.write('\n'.join(['Tom\tJenkins\t37',
                        'Madonna\t\N\t45',
                        'Federico\tDi Gregorio\t\N']))
  data.seek(0)
  curs.copy_from(data, 'test_copy')

This works because the strings have essentially been escaped by hand, and None turned into '\N'. So let's say you had the same data, without the escaping being done by hand, like this:

  rows = [('Tom', 'Jenkins', 37),
          ('Madonna', None, 45),
          ('Federico', 'Di Gregorio', None)]

You could get away with:

  data = StringIO.StringIO()
  for row in rows:
      data.write('\t'.join([str(el) if el is not None else '\\N'
                            for el in row]))
      data.write('\n')
  data.seek(0)
  curs.copy_from(data, 'test_copy')

But only because none of the rows happen to contain any characters which must be escaped. How are you supposed to use copy_from() with arbitrary text, e.g.

  rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]

because that sure doesn't seem to be handled automagically. Yes, I know I can write my own escaping code, but as Roger points out that's not ideal.

Josh
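To make the failure mode concrete, here is a quick check (pure Python, no database needed) of what the naive tab-join above produces for such a row. The embedded tabs in the first field are indistinguishable from column separators, and the embedded newline would additionally be read by COPY as a row terminator:

```python
rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]

# The naive formatting from the copy_from() example above,
# applied to the awkward row:
line = '\t'.join(str(el) if el is not None else '\\N'
                 for el in rows[0])

# Splitting the line on tabs, as COPY's text format would,
# yields five "columns" for a three-column row.
fields = line.split('\t')
```

This is exactly why unescaped data is too fragile: the row happens to parse, just not into the values that went in.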
On Wed, Dec 21, 2011 at 06:16:42PM -0500, Josh Kupershmidt wrote:
> On Tue, Dec 20, 2011 at 7:47 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:
> > As far as I know you did not get an answer, which is not the same as
> > there being no answer :) I think you will find that the escaping is
> > handled for you.
>
> I am rather dubious of the claim that "escaping is handled for you"
> with copy_from(). [...]
>
> This works because the strings have essentially been escaped by hand,
> and None turned into '\N'. So let's say you had the same data, without
> the escaping being done by hand, like this: [...]
>
> But only because none of the rows happen to contain any characters
> which must be escaped. How are you supposed to use copy_from() with
> arbitrary text, e.g.
>
>   rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]
>
> because that sure doesn't seem to be handled automagically. Yes, I
> know I can write my own escaping code, but as Roger points out that's
> not ideal.

Yes, this is exactly the issue I have. Without being able to handle the escaping of arbitrary data, it's too fragile to rely on. At some point I'll get caught out by data containing escapes or tabs, and it will all go horribly wrong.

I did see an example of using CSV format instead here:

  http://www.perlmonks.org/?node_id=847265

and I'm sure there's a Pythonic equivalent which you could also use. Since CSV has somewhat better-defined quoting rules (i.e. not PostgreSQL-specific), I think I'll be trying this first. I'd be happy to use the PostgreSQL format, but to avoid future breakage, I'd need some way of determining what the server-side format is when escaping. If neither is sufficiently robust, I'll need to stick with parameterised inserts, which handle arbitrary data without problems, other than being slow.

Regards,
Roger

-- 
 .''`.  Roger Leigh
: :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
`. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
  `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
On Wednesday, December 21, 2011 3:16:42 pm Josh Kupershmidt wrote:
> On Tue, Dec 20, 2011 at 7:47 PM, Adrian Klaver <adrian.klaver@gmail.com> wrote:
> > As far as I know you did not get an answer, which is not the same as
> > there being no answer :) I think you will find that the escaping is
> > handled for you.
>
> I am rather dubious of the claim that "escaping is handled for you"
> with copy_from(). Let us look at this example from psycopg2's
> copy_from.py example code:

The issue is that COPY needs to see data in the proper format, and so yes, some massaging is necessary. For simple cases copy_from() is sufficient; for more complex cases see below.

> But only because none of the rows happen to contain any characters
> which must be escaped. How are you supposed to use copy_from() with
> arbitrary text, e.g.
>
>   rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]

In this case copy_expert is your friend:

  import csv
  import StringIO

  rows = [('Strange\t\tFirst\\Name', 'Last\nName', 100)]

  f = StringIO.StringIO()
  c = csv.writer(f, delimiter='|')
  for row in rows:
      c.writerow(row)
  f.seek(0)
  cur.copy_expert("copy copy_t from stdin with csv delimiter '|'", f)

  test(5432)aklaver=> SELECT * from copy_t;
             fld_1              | fld_2 | fld_3
  ------------------------------+-------+-------
   Strange          First\Name  | Last +|   100
                                | Name  |

At any rate this is more appropriately handled on the psycopg list, so if you want to explore this further we can continue over there.

-- 
Adrian Klaver
adrian.klaver@gmail.com
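As a standalone sanity check of why the CSV route is the more robust one: Python's csv module quotes the awkward characters for you, and the output parses back to the original values. A small illustration (using Python 3's io and csv modules in place of the thread's Python 2 StringIO; the table and column data are just the example row from above):

```python
import csv
import io

rows = [('Strange\t\tFirst\\Name', 'Last\nName', '100')]

buf = io.StringIO()
writer = csv.writer(buf, delimiter='|')
for row in rows:
    writer.writerow(row)

# Round-trip: the reader recovers the original fields, with the
# tabs, backslashes and embedded newline intact.
buf.seek(0)
recovered = [tuple(r) for r in csv.reader(buf, delimiter='|')]
```

The field containing a newline comes out quoted, so COPY ... WITH CSV can tell an embedded newline from a row terminator, which is exactly what the text format's backslash escaping would otherwise have to guarantee by hand.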