On 10/26/22 17:30, Siddharth Jain wrote:
> Hello,
>
> Given a long list like this:
>
> 1,2
> 2,4 --> 2 appears once
> 7,9
> 8,9
> 5,3
> 2,5 --> note 2 appears twice
>
> I want to load it into this table:
>
> create table tbl (
> id integer primary key,
> fids integer[]
> )
>
> so we will have 2 -> [4,5] where 2 is id and [4,5] are the fids
>
> My actual dataset is very large and has 100M rows in it. How can it be
> efficiently loaded into postgres?
>
> I think I have to use a program for this and am trying to use the pg
> library that comes with Node.js. I am reading the data in batches of 1M
> or 100k rows for example. I have created a dictionary in Node.js where I
> am storing the hashmap. The part where I am stuck is how to generate the
> SQL command?
>
> 1. I want to make one call to the server for the batch of 1M rows, not
> 1M calls
> 2. Some keys in the hashmap might already exist in the database. For
> these keys we want to append to the array
> 3. Some keys will not exist and for these we want to insert new rows
>
> Can someone help me please? I am using Postgres for the first time.
The best way to deal with importing large batches of data is to use COPY
https://www.postgresql.org/docs/current/sql-copy.html
But that will not play well with modifying the data as you input it.
What I can see doing is:
1) COPY the data into a staging table:
create table staging_tbl (id integer, fid integer)
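For example, assuming the raw pairs are in a CSV file (the file name
below is just a placeholder), from psql:

\copy staging_tbl (id, fid) from 'pairs.csv' with (format csv)

A client driver can do the same by streaming to
COPY staging_tbl (id, fid) FROM STDIN WITH (FORMAT csv), which avoids
sending one INSERT per row for each batch.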
2) Then use SQL statements to move the data to the final table.
As an example of one possibility, using ON CONFLICT from here:
https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT
insert into tbl (id, fids) select id, array_agg(fid)
from staging_tbl group by id
on conflict (id) DO UPDATE SET fids = tbl.fids || excluded.fids;
The group by collapses duplicate ids within the batch into one array
first; without it, a batch that contains the same id twice (like 2 in
your sample) would make ON CONFLICT try to update the same row twice in
one statement, which is an error. For ids already present in tbl, the
DO UPDATE appends the new fids to the existing array.
I would test with a smaller example data set to verify.
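For instance, with the six sample rows above loaded into staging_tbl, I
would expect something like:

select id, fids from tbl order by id;

 id | fids
----+-------
  1 | {2}
  2 | {4,5}
  5 | {3}
  7 | {9}
  8 | {9}

(The order of elements inside each array follows the aggregation order;
add an ORDER BY inside array_agg() if that matters to you.)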
--
Adrian Klaver
adrian.klaver@aklaver.com