Thread: Fastest way to insert/update many rows
Hi,

I'd like to use psycopg2 to fetch a large number of rows (hundreds of
millions), perform some computations and put them back into the database.

I can fetch about 130k rows/sec with

    cur.execute('select * from stuff')
    keyvals = list(cur)

and 100k/sec with

    f = io.StringIO()
    cur.copy_to(f, 'stuff')
    f.seek(0)
    keyvals = list(tuple(map(int, l.split('\t'))) for l in f)

but inserting using

    cur.executemany('insert into stuff values (%s, %s)', keyvals)

only has a throughput of 23k/sec, with ca. 20% CPU used by Python and 80% by
Postgres, while

    cur.copy_from(io.StringIO('\n'.join('{}\t{}'.format(*r) for r in keyvals)), 'stuff')

manages to insert 1.8M rows/sec.

I can't quite believe that generating a string should be the fastest method;
am I missing something?

What I'd really like to do is

    cur.executemany('update stuff set value = %s where key = %s', ...)

but that was orders of magnitude slower still, probably because the order is
random, so it performs an index lookup for each key. Populating a temporary
table and using 'update stuff ... from temptable ...' is quicker.

I have to set one column in each row. Is there a way to update cursors like
in PL/pgSQL's

    update <table> set ... where current of <cursor>

i.e. iterate through the rows in the most efficient way for the database?

Or would it be wiser to use PL/Python for this kind of task instead?

-- 
Pascal Germroth
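[Editor's note: the string-building step above can be factored into a helper.
This is a minimal sketch, assuming the same (key, value) integer pairs as in
the post; `rows_to_copy_buffer` is a hypothetical name. Note that COPY's text
format would require escaping backslashes, tabs and newlines if the columns
were strings rather than integers.]

```python
import io

def rows_to_copy_buffer(rows):
    """Build an in-memory file in COPY text format (tab-separated columns,
    newline-terminated rows) from an iterable of (key, value) int pairs.
    Integer data needs no escaping; string data would need \\, tab and
    newline characters escaped per the COPY text format."""
    buf = io.StringIO()
    for key, value in rows:
        buf.write('{}\t{}\n'.format(key, value))
    buf.seek(0)  # rewind so copy_from() reads from the start
    return buf

# Usage, assuming an open psycopg2 cursor `cur`:
# cur.copy_from(rows_to_copy_buffer(keyvals), 'stuff')
```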
On Tue, Aug 12, 2014 at 2:46 PM, <pascal@ensieve.org> wrote:
> Hi,
>
> I'd like to use psycopg2 to fetch a large number of rows (hundreds of
> millions), perform some computations and put them back into the database.
>
> I can fetch about 130k rows/sec with
>     cur.execute('select * from stuff')
>     keyvals = list(cur)
> and 100k/sec with
>     f = io.StringIO()
>     cur.copy_to(f, 'stuff')
>     f.seek(0)
>     keyvals = list(tuple(map(int, l.split('\t'))) for l in f)
>
> but inserting using
>     cur.executemany('insert into stuff values (%s, %s)', keyvals)
> only has a throughput of 23k/sec with ca. 20% CPU used by Python, 80% by
> Postgres, while
>     cur.copy_from(io.StringIO('\n'.join('{}\t{}'.format(*r) for r in keyvals)), 'stuff')
> manages to insert 1.8M/sec.
>
> I can't quite believe that generating a string should be the fastest
> method, am I missing something?

No: copy is by far the fastest method of loading data into Postgres, but
psycopg doesn't currently offer adaptation support for copy. In the current
version, composing a string in a file-like object is what it takes.

> What I'd really like to do is
>     cur.executemany('update stuff set value = %s where key = %s', ...)
> but that was orders of magnitude slower still; probably because the order
> is random, so it performs an index lookup for each key.
> Populating a temporary table and using 'update stuff ... from temptable
> ...' is quicker.

Yes: using copy to populate a temp table and then updating via a query is
the fastest way to bulk-update in Postgres, regardless of the psycopg usage.

> I have to set one column in each row, is there a way to update cursors
> like in PL/pgSQL's
>     update <table> set ... where current of <cursor>
> i.e. iterate through the rows in the most efficient way for the database.
>
> Or would it be wiser to use PL/Python for this kind of task instead?

Well, if you can do everything server-side, using a server-side language
would save a double round-trip for the whole dataset and should be a win.
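[Editor's note: the temp-table pattern recommended above can be sketched as
follows. The table name `stuff_upd`, the function name `bulk_update`, and the
`stuff(key integer, value integer)` layout are assumptions inferred from the
thread, not part of the original posts.]

```python
import io

# Column layout assumed from the thread: stuff(key integer, value integer).
CREATE_TEMP_SQL = (
    'CREATE TEMP TABLE stuff_upd (key integer, value integer) ON COMMIT DROP'
)
UPDATE_SQL = (
    'UPDATE stuff SET value = stuff_upd.value '
    'FROM stuff_upd WHERE stuff.key = stuff_upd.key'
)

def bulk_update(conn, keyvals):
    """COPY the new (key, value) pairs into a temp table, then apply them
    to `stuff` with a single set-based UPDATE ... FROM join, instead of
    one indexed UPDATE per row."""
    buf = io.StringIO('\n'.join('{}\t{}'.format(k, v) for k, v in keyvals))
    with conn.cursor() as cur:
        cur.execute(CREATE_TEMP_SQL)
        cur.copy_from(buf, 'stuff_upd')  # bulk-load at COPY speed
        cur.execute(UPDATE_SQL)          # one server-side join, no round-trips
    conn.commit()
```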
You are probably I/O bound anyway, so PL/pgSQL or PL/Python shouldn't make
much difference performance-wise: you may choose based on other language
features (Python is more expressive, but it's an untrusted language, and from
the docs I don't think it supports cursors for update).

-- 
Daniele
Pascal,

On 12/08/14 09:46, pascal@ensieve.org wrote:
> What I'd really like to do is
>     cur.executemany('update stuff set value = %s where key = %s', ...)
> but that was orders of magnitude slower still; probably because the order
> is random, so it performs an index lookup for each key.
> Populating a temporary table and using 'update stuff ... from temptable
> ...' is quicker.
>
> I have to set one column in each row, is there a way to update cursors
> like in PL/pgSQL's
>     update <table> set ... where current of <cursor>
> i.e. iterate through the rows in the most efficient way for the database.
>
> Or would it be wiser to use PL/Python for this kind of task instead?

In any relational database, the fastest way to do something is to have the
server do the work on a *set* of rows at a time. So, ideally, I'd try to have
the client issue a single statement like:

    UPDATE stuff SET value = value_expression WHERE some_expression;

If you're going to update all the rows, then eliminate the WHERE clause and
let the server scan and update in whichever way it finds most efficient. If
you do need a WHERE, then try to use a some_expression that can take
advantage of sequential scanning as much as possible.
If value_expression is too complicated, in PG you can use functions to deal
with the complexity (functions can also be used for some_expression). You can
use this technique even if you have to update more than one column value.

If the decision logic is too complex, in PG you can use

    SELECT * FROM some_func(arg1, arg2);

where some_func implements the logic for deciding which rows to update and
what values to set.

Cheers,

Joe
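[Editor's note: the set-based principle above doesn't require Postgres to
demonstrate. A minimal sketch using Python's stdlib sqlite3 as a stand-in
database: one UPDATE statement lets the engine scan and update all matching
rows itself, instead of the client issuing one statement per key. The table
name and data are invented for illustration.]

```python
import sqlite3

# In-memory stand-in database (sqlite3 is in the stdlib, so this runs as-is).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE stuff (key INTEGER PRIMARY KEY, value INTEGER)')
conn.executemany('INSERT INTO stuff VALUES (?, ?)',
                 [(i, i) for i in range(5)])

# One set-based statement: the engine finds and updates every matching row
# in a single pass, with no per-row round-trips from the client.
conn.execute('UPDATE stuff SET value = value * 2 WHERE key >= 2')

result = sorted(conn.execute('SELECT key, value FROM stuff'))
print(result)  # [(0, 0), (1, 1), (2, 4), (3, 6), (4, 8)]
```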