Thread: Bulk Insert/Update Scenario

Bulk Insert/Update Scenario

From
Mana M
Date:
I am setting up a data processing pipeline whose end result gets stored in Postgres. I have not been a heavy DB user in general, so I had a question about how best to handle a bulk insert/update scenario with Postgres.

Here is my use case:
* I get a file with thousands of entries (lines) periodically.
* I process each entry (line) from the file, and the data is split and stored across different Postgres tables. Some tables have foreign keys referencing other tables. There is "no" straight mapping from an entry in the file to the Postgres tables.
* Some of the data will be updates to existing rows in the Postgres tables, while the rest will be inserts.
* I would like to ensure atomicity (either all rows get stored in all tables, or nothing gets stored on a failure from Postgres).
* I would also like to make sure there are no concurrency issues in case two different processes try to perform the above at the same time.
* Ideally, I would want to avoid an individual upsert after processing every single entry (line) from the file.


I thought this would be a fairly common use case. What is the best way to handle the above? What performance issues should I keep in mind, and what are the pitfalls? I tried looking around for articles on this kind of use case - any pointers would be greatly appreciated.


By the way, the application is in Python running on Apache Spark, and it can use any Python libraries that help simplify the above.

Thanks in advance.

Re: Bulk Insert/Update Scenario

From
Jordan Deitch
Date:
Hi Mana, 

A starting point would be reading about the batch upsert functionality.

You would do something like:
INSERT INTO some_table ... ON CONFLICT ... DO UPDATE ...

This operation is atomic. You can also look into deferrable constraints, so that you can perform all of your insert / update operations in one transaction block and have the constraints checked only at the end of the transaction.
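For instance, here is a rough (untested) sketch of a multi-row upsert from Python with psycopg2 - the table, columns, and connection string below are just placeholders, not something from your schema:

# Hypothetical sketch: many rows upserted in a single INSERT ... ON CONFLICT
# statement, inside one transaction. Table/column names and DSN are made up.
import psycopg2
from psycopg2.extras import execute_values

rows = [(1, "alice"), (2, "bob"), (3, "carol")]   # values produced by your processing step

conn = psycopg2.connect("dbname=mydb")            # connection parameters are assumptions
with conn:                                        # commits on success, rolls back on error
    with conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO accounts (id, name)
            VALUES %s
            ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name
            """,
            rows,
        )
conn.close()

A single statement like this is atomic by itself; running one such statement per target table inside the same transaction extends the all-or-nothing behaviour across tables.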

I hope this helps to get you on the right track!

Thanks,
Jordan Deitch


Re: Bulk Insert/Update Scenario

From
legrand legrand
Date:
Hi,

Check the documentation chapter "Populating a Database":
<https://www.postgresql.org/docs/10/static/populate.html>

This explains how to create a dummy (staging) table,
load it using the COPY command,
and then INSERT / UPDATE the target tables (using ON CONFLICT if needed).
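A rough, untested sketch of that recipe from Python with psycopg2 - the staging layout, column names, target table, and file path are invented for illustration. Note that a plain COPY ... FROM 'file' reads the file on the database server's filesystem and needs the right privileges:

import psycopg2

conn = psycopg2.connect("dbname=mydb")       # connection parameters are assumptions
with conn:                                   # one transaction: all-or-nothing
    with conn.cursor() as cur:
        # 1. dummy/staging table matching the file layout, dropped at commit
        cur.execute("CREATE TEMP TABLE staging (id int, name text) ON COMMIT DROP")

        # 2. bulk-load the file with server-side COPY; the path is a placeholder
        #    and must be readable by the Postgres server process
        cur.execute("COPY staging FROM '/path/on/db/server/entries.csv' WITH (FORMAT csv)")

        # 3. upsert from the staging table into the real table(s)
        cur.execute("""
            INSERT INTO accounts (id, name)
            SELECT id, name FROM staging
            ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name
        """)
conn.close()

One pitfall to keep in mind: a single INSERT ... ON CONFLICT cannot touch the same target row twice, so if the file can contain the same key more than once you need to de-duplicate in the staging table first.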

You can also investigate:
-  the file_fdw extension <https://www.postgresql.org/docs/current/static/file-fdw.html>
   (which lets you use text files as if they were tables)
-  the pg_bulkload extension <https://github.com/ossc-db/pg_bulkload/>
   (which bulk-loads data, similar to Oracle's loader)

Regards
PAscal





Re: Bulk Insert/Update Scenario

From
Mana M
Date:
Thanks Jordan.

One more question I had was - is there any way to avoid doing individual INSERT ... ON CONFLICT statements? I was thinking about dumping everything into a TEMP table and using that as the source for INSERT ... ON CONFLICT. However, I was not sure how to get thousands of rows from my Python application into a TEMP table in one shot. Are there any better alternatives?

Thanks.




Re: Bulk Insert/Update Scenario

From
Jordan Deitch
Date:
INSERT ... ON CONFLICT can be run as a bulk operation:

create table test(id int primary key, val text);
insert into test(id, val) values (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')
    on conflict (id) do update set val = excluded.val;

Unless you mean you don't want to run this for every table?


Thanks,
Jordan Deitch

Re: Bulk Insert/Update Scenario

From
"Peter J. Holzer"
Date:
On 2018-01-04 13:47:49 -0800, Mana M wrote:
> I was thinking about dumping everything into a TEMP table and using
> that as the source for INSERT ... ON CONFLICT. However, I was not sure
> how to get thousands of rows from my Python application into a TEMP
> table in one shot. Are there any better alternatives?

Psycopg2 has decent support for COPY:
http://initd.org/psycopg/docs/cursor.html#cursor.copy_from
(but I had problems copying more than 2GB from a StringIO buffer. Not
sure whether that was a problem with psycopg or StringIO.)

Alternatively you can just invoke psql -c '\copy ...'
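For example, a minimal (untested) sketch of the copy_from approach - the temp-table layout, column names, and target table are invented for illustration:

import io
import psycopg2

conn = psycopg2.connect("dbname=mydb")                 # connection parameters are assumptions

# Build one tab-separated buffer in memory from the processed entries
buf = io.StringIO()
for entry_id, name in [(1, "alice"), (2, "bob")]:      # rows from your processing step
    buf.write(f"{entry_id}\t{name}\n")
buf.seek(0)

with conn:                                             # single transaction
    with conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE staging (id int, name text) ON COMMIT DROP")
        # one COPY round trip for all rows instead of thousands of INSERTs
        cur.copy_from(buf, "staging", columns=("id", "name"))
        # then upsert from the temp table into the real table(s), as discussed earlier
        cur.execute(
            "INSERT INTO accounts (id, name) SELECT id, name FROM staging "
            "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name"
        )
conn.close()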

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
