Re: [psycopg] speed concerns with executemany() - Mailing list psycopg

From Daniele Varrazzo
Subject Re: [psycopg] speed concerns with executemany()
Date
Msg-id CA+mi_8Y+Xuhy8tMjT0BsWZCk8OT_HTj789cj4JC7dOofZh+UUg@mail.gmail.com
In response to [psycopg] speed concerns with executemany()  (mike bayer <mike_mp@zzzcomputing.com>)
Responses Re: [psycopg] speed concerns with executemany()  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
List psycopg
Hi Mike

On Sat, Dec 24, 2016 at 1:05 AM, mike bayer <mike_mp@zzzcomputing.com> wrote:
> I'm getting more and more regular complaints among users of SQLAlchemy of
> the relatively slow speed of the cursor.executemany() call in psycopg2.   In
> almost all cases, these users have discovered that Postgresql is entirely
> capable of running an INSERT or UPDATE of many values with a high degree of
> speed using a single statement with a form like this:
>
> INSERT INTO table (a, b, c)
> VALUES  (1, 2, 3), (4, 5, 6), (7, 8, 9), ...
>
> whereas if they run the same form using a single VALUES insert and sending
> the parameters as a sequence via executemany(), they will observe
> hyperbolically slow speeds - today, someone claims they can run
> approximately 200 sets of three integers each using the multiple VALUES
> approach in approximately .02 seconds, whereas running 200 values into a
> single executemany() call of the otherwise identical INSERT statement, they
> are claiming takes 20 seconds; that is, 100000% slower.   I'm not really
> sure how that's even possible, considering the single INSERT with many
> VALUES is a much larger string to send over the network and be parsed by the
> server, if the overhead of a single INSERT is .02 seconds, we would think an
> executemany() of 200 INSERT statements each with a single parameter set
> would be at most four seconds.

A x1000 slowdown does sound dodgy, yes.

> Obviously something is wrong with these users' environment, although I will
> note that the relative speed of psycopg2 executemany() over a 1G network is
> still pretty bad, compared to both sending a single INSERT with a large
> VALUES clause as well as compared to the executemany() speed of DBAPIs (even
> pure Python) for other databases like MySQL, I can provide some quick
> benchmarks if that's helpful.

You know what could be worked out quickly? Currently executemany()
boils down to this pseudocode:

    def executemany(self, stmt, argslist):
        for args in argslist:
            self.execute(stmt, args)

This is inefficient because the statement is not prepared and because
it takes one network roundtrip per parameter set. We could save on the
second cost by running something like:

    def executemany(self, stmt, argslist):
        # TODO: do this in pages of e.g. 100 args, not all in one string
        stmts = []
        for args in argslist:
            stmts.append(self.mogrify(stmt, args))
        # mogrify() returns bytes on Python 3, hence the bytes separator
        self.execute(b";".join(stmts))

If you are in a mood for benchmarks, could you please check if this
approach is so noticeably faster that we ought to change the
executemany implementation in 2.7?
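To make the idea concrete, here is a minimal self-contained sketch of the paged ';'.join() batching. The `mogrify()` below is a toy stand-in for `cursor.mogrify()` that only inlines integers (the real one does full adaptation and quoting), and `batch_sql` is a hypothetical helper name, not part of psycopg2:

```python
# Minimal sketch of the paged ";".join() batching idea.

def mogrify(stmt, args):
    # Toy stand-in for cursor.mogrify(): inline the parameters into the
    # statement. Only safe for integers; the real mogrify() performs
    # proper quoting and escaping.
    return stmt % tuple(args)

def batch_sql(stmt, argslist, page_size=100):
    # Join each page of parameter sets into one multi-statement string,
    # so a single roundtrip carries up to page_size statements.
    for i in range(0, len(argslist), page_size):
        page = argslist[i:i + page_size]
        yield ";".join(mogrify(stmt, args) for args in page)

stmt = "INSERT INTO t (a, b, c) VALUES (%s, %s, %s)"
batches = list(batch_sql(stmt, [(1, 2, 3), (4, 5, 6)]))
```

Each yielded string would then go to a single execute() call, turning N roundtrips into ceil(N / page_size).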

This is something that only works with the old protocol, i.e. what
psycopg 2.x implements. In a version implementing the "new protocol"
we would likely implement executemany with a prepared statement, so
this shouldn't be a concern.

> I understand that psycopg2 does not use prepared statements, and I have dim
> recollections that internal use of prepared statements for executemany is
> not on the roadmap for psycopg2.  However, I'm still not sure what I should
> be telling my users when I get reports of these vastly slower results with
> executemany().
>
> I'm not asking that psycopg2 change anything, I'm just looking to understand
> what the heck is going on when people are reporting this. Should I:
>
> 1. tell them they have a network issue that is causing executemany() to have
> a problem?  (even though I can also observe executemany() is kind of slow,
> though not as slow as these people are reporting)

As I said, while executemany() is no speed demon, I wouldn't expect
it to be x1000 slower either.

> 2. tell them there's some known issue, vacuuming / indexes/ or something
> that is known to have this effect?

That could be, but rather than asking you or me, they should have a
chat on pgsql-general or -performance to sort that out.

> 3. tell them that yes, they should use multiple-VALUES within a single
> INSERT (which would eventually lead to strong pressure on me to reinvent
> executemany() within SQLAlchemy's psycopg2 dialect to use this form) ? (also
> if so, why is this the case?   can this claimed 100000% slowdown be real?)

Even if we had a better executemany() implementation, I doubt it would
be based on an INSERT ... VALUES list, because that would mean
manipulating the SQL anyway. What we would do instead is prepare the
statement and then execute the prepared statement repeatedly. To my
knowledge there is no libpq function to execute a prepared statement
repeatedly against an array of arrays of values
(https://www.postgresql.org/docs/9.3/static/libpq-exec.html). I'd be
very happy if someone could prove me wrong on that.

My bet is that `cur.execute(";".join(cur.mogrify(stmt, args) for args
in argslist))` on the old protocol will be faster than prepare + `for
args in argslist: execute(prepared, args)` on the new protocol: we
would be trading parsing time for network roundtrips and, while I
don't have benchmarks, the latter smell more expensive than the
former.

> 4. use a hack to actually make my own prepared statements within
> executemany() (I vaguely recall some recipe that you can get a prepared
> statement going with psycopg2 by rolling it on the outside) ?

Yes, I have a sort of recipe for a preparing cursor here:
https://gist.github.com/dvarrazzo/3797445, but I don't think it would
make executemany() faster, because it still costs one network
roundtrip per parameter set and one statement to parse (an EXECUTE
instead of an INSERT).
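For reference, the core of such a recipe is just string rewriting; here is a hedged sketch (the helper names `to_prepare`/`to_execute` are made up for illustration, and the placeholder rewrite only handles positional `%s`):

```python
import itertools
import re

def to_prepare(name, stmt):
    # Rewrite psycopg2-style %s placeholders as PostgreSQL $1, $2, ...
    # and wrap the query in a PREPARE statement.
    counter = itertools.count(1)
    body = re.sub(r"%s", lambda m: "$%d" % next(counter), stmt)
    return "PREPARE %s AS %s" % (name, body)

def to_execute(name, nargs):
    # The EXECUTE statement keeps %s placeholders, so psycopg2 can
    # adapt the Python values as usual when it is run.
    return "EXECUTE %s (%s)" % (name, ", ".join(["%s"] * nargs))
```

With a live connection one would run `cur.execute(to_prepare("ins", stmt))` once, then `cur.execute(to_execute("ins", 3), args)` per row; as noted above, that still pays one roundtrip per parameter set.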


> 5. Other reasons that executemany() is known to sometimes be extremely slow?
>
> I'm purposely trying to stay out of the realm of picking apart the libpq
> internals, assuming psycopg2 devs can shed some light what's going on here.
> Thanks for any guidance you can offer!

Hope my notes are useful. I haven't really thrown much brain juice at
the slow executemany() problem, because COPY will always be faster and
that's what I tend to use... but that has adaptation and interface
problems of its own. If you can help me with some testing, and if
nobody thinks my ';'.join() solution is totally daft, I think we could
roll out this trick quite quickly.
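For completeness, the COPY route looks roughly like this: serialize the rows into the tab-separated text format that COPY FROM expects and load them in one shot. A minimal sketch, assuming a hypothetical helper `rows_to_copy_buffer` and ignoring the escaping of tabs, newlines and backslashes that real code must handle:

```python
import io

def rows_to_copy_buffer(rows):
    # Serialize rows into the tab-separated text format COPY FROM
    # expects. (Real code must also escape \t, \n and backslashes,
    # and represent NULL as \N; omitted here for brevity.)
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(v) for v in row) + "\n")
    buf.seek(0)
    return buf

# With a live connection, the whole buffer goes over in one COPY:
#   cur.copy_from(rows_to_copy_buffer(rows), "t", columns=("a", "b", "c"))
```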

As for a definition of "quickly": I am currently on holiday, which
means I'm working at wrapping up psycopg 2.7 instead of $JOB. My plan
is to release 2.7 within the first days of January.


-- Daniele

