SV: copy csv into partitioned table with unique index - Mailing list pgsql-performance

From Lars Aksel Opsahl
Subject SV: copy csv into partitioned table with unique index
Date
Msg-id af5249b8499940ea801e54bdb0c85713@nibio.no
In response to SV: copy csv into partitioned table with unique index  (Lars Aksel Opsahl <Lars.Opsahl@nibio.no>)
List pgsql-performance

Hi


Sorry, that was one zero too many; it should be 30-40 million weather observations per second.


Lars


From: Lars Aksel Opsahl <Lars.Opsahl@nibio.no>
Sent: 28 January 2018 12:57
To: Mariel Cherkassky; PostgreSQL mailing lists
Subject: SV: copy csv into partitioned table with unique index
 

Hi


We had the same performance problems when testing with more than 100 billion weather observations. We now have a solution where we can push between 300 and 400 million weather observations per second into the database.


We download data from NetCDF files. The data are joined based on time and geolocation, so data from many different NetCDF files should end up in the same row and table.


Doing this with database joins and table updates was taking a long time, as you have also noticed.


To get high performance we ended up with this solution (a rough sketch follows the list):

- Use OS commands like awk, sed, join, cut, ... to prepare the CSV files for the database.

- Use multiple threads.

- Insert data directly into child tables.

- No triggers, constraints or indexes on the working tables.

- Don’t update rows.

- Unlogged tables.
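As a rough illustration of the points above (the table and column names here are made up, not our actual schema), the child tables can be created as plain unlogged tables with no triggers, constraints or indexes:

-- Hypothetical sketch (made-up table and column names): a parent table
-- plus one unlogged child table per year/month, with no triggers,
-- constraints or indexes on the tables we load into.
create table observations (
    obs_time timestamp,
    geo_id   int,
    value_1  real,
    value_2  real
    -- ... one column per observation type
);

create unlogged table observations_2018_01 () inherits (observations);
create unlogged table observations_2018_02 () inherits (observations);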


We first download the NetCDF files and make CSV files that fit the COPY command perfectly, with a complete file for each child table they are created for; this is a time-consuming operation.


So before the COPY into the database we just do a TRUNCATE on the selected child table. We are then able to insert between 300 and 400 million weather observations per second. We have 11 observations per row, so that means around 35 million rows per second. We have one child table for each year and month.
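A minimal sketch of that load step, reusing the made-up names from the sketch above (the file path is a placeholder too):

-- Placeholder names and path: truncate the child table for the period,
-- then bulk load the prepared CSV file straight into it with COPY.
truncate table observations_2018_01;
copy observations_2018_01 from '/data/csv/observations_2018_01.csv' with (format csv);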


The database server we are working on has 16 dual-core CPUs and SSD disks. When testing I was running 11 threads in parallel.


Indexes and constraints are added later based on needs.
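For example (again with placeholder names), after a child table is fully loaded:

-- Placeholder names: add indexes and constraints only after the bulk load.
create index on observations_2018_01 (obs_time, geo_id);
alter table observations_2018_01
    add constraint observations_2018_01_uniq unique (obs_time, geo_id);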


How can you take the chance of using something like unlogged tables?


Linux systems are quite stable, and if we keep a copy of the CSV files it does not take long to re-insert the data after a crash.


You can also change your table to logged later if you need to secure your data in the database.
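That is a single ALTER TABLE, for example (placeholder table name):

-- Placeholder name: turn WAL logging back on for a table
-- once the data in it needs to survive a crash.
alter table observations_2018_01 set logged;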


Lars





From: Mariel Cherkassky <mariel.cherkassky@gmail.com>
Sent: 28 January 2018 10:11
To: PostgreSQL mailing lists
Subject: copy csv into partitioned table with unique index
 
Hi,
I configured a master table that is called "year_2018":
create table year_2018(a int, b int, c date);

The master table has a unique constraint on those 3 columns so that I won't have any duplicated rows. Moreover, I configured a BEFORE INSERT trigger on that table that creates a child table for each day in the year. The child should include all the data related to that specific day.
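A rough sketch of the kind of setup described here, assuming inheritance-based partitioning and the column names above (the trigger body is only illustrative, not the actual code):

-- Illustrative only: a master table with a unique constraint and a
-- BEFORE INSERT trigger that routes each row into a per-day child
-- table, creating the child on first use.
create table year_2018 (a int, b int, c date, unique (a, b, c));

create or replace function year_2018_route_insert() returns trigger as $$
declare
    child text := 'year_2018_' || to_char(new.c, 'YYYY_MM_DD');
begin
    execute format(
        'create table if not exists %I () inherits (year_2018)', child);
    execute format('insert into %I select ($1).*', child) using new;
    return null;  -- the row goes to the child, not the master table
end;
$$ language plpgsql;

create trigger year_2018_before_insert
    before insert on year_2018
    for each row execute procedure year_2018_route_insert();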

Now, every day I get a csv file that I need to load into the year table. I must load the data as fast as possible, but I have 2 problems:
1) I must load the data in bulk via the COPY command. However, the COPY command fails because sometimes I have duplicated rows.
2) I tried to use pgloader, but it fails because I have a trigger before each insert.

- I can't load all the data into a temp table and then run "insert into year_2018 select * from temp" because it takes too much time.
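For reference, a minimal sketch of the two load paths described above (the file path is a placeholder):

-- Placeholder path: direct COPY into the partitioned table; as described
-- above, this fails when the file contains duplicated rows.
copy year_2018 from '/path/to/daily_file.csv' with (format csv);

-- The staging-table variant that is too slow here: load into an
-- unconstrained temp table first, then move the rows in one statement.
create temp table staging (like year_2018);
copy staging from '/path/to/daily_file.csv' with (format csv);
insert into year_2018 select * from staging;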

Any idea?

