Thread: Fastest way to insert/update many rows
Hi,

I'd like to use psycopg2 to fetch a large number of rows (hundreds of
millions), perform some computations and put them back into the database.

I can fetch about 130k rows/sec with

    cur.execute('select * from stuff')
    keyvals = list(cur)

and 100k/sec with

    f = io.StringIO()
    cur.copy_to(f, 'stuff')
    f.seek(0)
    keyvals = list(tuple(map(int, l.split('\t'))) for l in f)

but inserting using

    cur.executemany('insert into stuff values (%s, %s)', keyvals)

only has a throughput of 23k/sec, with ca. 20% CPU used by Python and 80% by
Postgres, while

    cur.copy_from(io.StringIO('\n'.join('{}\t{}'.format(*r) for r in keyvals)), 'stuff')

manages to insert 1.8M rows/sec.

I can't quite believe that generating a string should be the fastest method;
am I missing something?

What I'd really like to do is

    cur.executemany('update stuff set value = %s where key = %s', ...)

but that was orders of magnitude slower still, probably because the order is
random, so it performs an index lookup for each key. Populating a temporary
table and using 'update stuff ... from temptable ...' is quicker.

I have to set one column in each row. Is there a way to update cursors like
in PL/pgSQL's

    update <table> set ... where current of <cursor>

i.e. iterate through the rows in the most efficient way for the database?

Or would it be wiser to use PL/Python for this kind of task instead?

-- 
Pascal Germroth
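[Editor's note: the string-building step above can be factored into a helper.
This is a minimal sketch, assuming the same (key, value) integer pairs as in
the post; `rows_to_copy_buffer` is a hypothetical name. Note that COPY's text
format would require escaping backslashes, tabs and newlines if the columns
were strings rather than integers.]

```python
import io

def rows_to_copy_buffer(rows):
    """Build an in-memory file in COPY text format (tab-separated columns,
    newline-terminated rows) from an iterable of (key, value) int pairs.
    Integer data needs no escaping; string data would need \\, tab and
    newline characters escaped per the COPY text format."""
    buf = io.StringIO()
    for key, value in rows:
        buf.write('{}\t{}\n'.format(key, value))
    buf.seek(0)  # rewind so copy_from() reads from the start
    return buf

# Usage, assuming an open psycopg2 cursor `cur`:
# cur.copy_from(rows_to_copy_buffer(keyvals), 'stuff')
```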
On Tue, Aug 12, 2014 at 2:46 PM, <pascal@ensieve.org> wrote:
> Hi,
>
> I'd like to use psycopg2 to fetch a large number of rows (hundreds of
> millions), perform some computations and put them back into the database.
>
> I can fetch about 130k rows/sec with
>     cur.execute('select * from stuff')
>     keyvals = list(cur)
> and 100k/sec with
>     f = io.StringIO()
>     cur.copy_to(f, 'stuff')
>     f.seek(0)
>     keyvals = list(tuple(map(int, l.split('\t'))) for l in f)
>
> but inserting using
>     cur.executemany('insert into stuff values (%s, %s)', keyvals)
> only has a throughput of 23k/sec with ca. 20% CPU used by Python, 80% by
> Postgres, while
>     cur.copy_from(io.StringIO('\n'.join('{}\t{}'.format(*r) for r in keyvals)), 'stuff')
> manages to insert 1.8M/sec.
>
> I can't quite believe that generating a string should be the fastest
> method, am I missing something?

No: copy is by far the fastest method of loading data into Postgres, but
psycopg doesn't currently offer adaptation support for copy. In the current
version, composing a string in a file-like object is what it takes.

> What I'd really like to do is
>     cur.executemany('update stuff set value = %s where key = %s', ...)
> but that was orders of magnitude slower still; probably because the order
> is random, so it performs an index lookup for each key.
> Populating a temporary table and using 'update stuff ... from temptable
> ...' is quicker.

Yes: using copy to populate a temp table and then updating via a query is
the fastest way to bulk-update in Postgres, regardless of the psycopg usage.

> I have to set one column in each row, is there a way to update cursors
> like in PL/pgSQL's
>     update <table> set ... where current of <cursor>
> i.e. iterate through the rows in the most efficient way for the database.
>
> Or would it be wiser to use PL/Python for this kind of task instead?

Well, if you can do everything server-side, using a server-side language
would save a double round-trip for the whole dataset and should be a win.
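[Editor's note: the temp-table pattern recommended above can be sketched as
follows. The table name `stuff_upd`, the function name `bulk_update`, and the
`stuff(key integer, value integer)` layout are assumptions inferred from the
thread, not part of the original posts.]

```python
import io

# Column layout assumed from the thread: stuff(key integer, value integer).
CREATE_TEMP_SQL = (
    'CREATE TEMP TABLE stuff_upd (key integer, value integer) ON COMMIT DROP'
)
UPDATE_SQL = (
    'UPDATE stuff SET value = stuff_upd.value '
    'FROM stuff_upd WHERE stuff.key = stuff_upd.key'
)

def bulk_update(conn, keyvals):
    """COPY the new (key, value) pairs into a temp table, then apply them
    to `stuff` with a single set-based UPDATE ... FROM join, instead of
    one indexed UPDATE per row."""
    buf = io.StringIO('\n'.join('{}\t{}'.format(k, v) for k, v in keyvals))
    with conn.cursor() as cur:
        cur.execute(CREATE_TEMP_SQL)
        cur.copy_from(buf, 'stuff_upd')  # bulk-load at COPY speed
        cur.execute(UPDATE_SQL)          # one server-side join, no round-trips
    conn.commit()
```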
You are probably I/O bound anyway, so PL/pgSQL or PL/Python shouldn't make
much difference performance-wise: you may choose based on other language
features (Python is more expressive, but it's an untrusted language, and from
the docs I don't think it supports cursors for update).

-- 
Daniele
Pascal,

On 12/08/14 09:46, pascal@ensieve.org wrote:
> What I'd really like to do is
>     cur.executemany('update stuff set value = %s where key = %s', ...)
> but that was orders of magnitude slower still; probably because the order
> is random, so it performs an index lookup for each key.
> Populating a temporary table and using 'update stuff ... from temptable
> ...' is quicker.
>
> I have to set one column in each row, is there a way to update cursors
> like in PL/pgSQL's
>     update <table> set ... where current of <cursor>
> i.e. iterate through the rows in the most efficient way for the database.
>
> Or would it be wiser to use PL/Python for this kind of task instead?

In any relational database, the fastest way to do something is to have the
server do the work on a *set* of rows at a time. So, ideally, I'd try to have
the client issue a single statement like:

    UPDATE stuff SET value = value_expression WHERE some_expression;

If you're going to update all the rows, then eliminate the WHERE clause and
let the server scan and update in whichever way it finds most efficient. If
you do need a WHERE, then try to use a some_expression that can take
advantage of sequential scanning as much as possible.
If value_expression is too complicated, in PG you can use functions to deal
with the complexity (functions can also be used for some_expression). You can
use this technique even if you have to update more than one column value.

If the decision logic is too complex, in PG you can use

    SELECT * FROM some_func(arg1, arg2);

where some_func implements the logic for deciding which rows to update and
what values to set.

Cheers,

Joe
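[Editor's note: the set-based principle above doesn't require Postgres to
demonstrate. A minimal sketch using Python's stdlib sqlite3 as a stand-in
database: one UPDATE statement lets the engine scan and update all matching
rows itself, instead of the client issuing one statement per key. The table
name and data are invented for illustration.]

```python
import sqlite3

# In-memory stand-in database (sqlite3 is in the stdlib, so this runs as-is).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE stuff (key INTEGER PRIMARY KEY, value INTEGER)')
conn.executemany('INSERT INTO stuff VALUES (?, ?)',
                 [(i, i) for i in range(5)])

# One set-based statement: the engine finds and updates every matching row
# in a single pass, with no per-row round-trips from the client.
conn.execute('UPDATE stuff SET value = value * 2 WHERE key >= 2')

result = sorted(conn.execute('SELECT key, value FROM stuff'))
print(result)  # [(0, 0), (1, 1), (2, 4), (3, 6), (4, 8)]
```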