Thread: Importing a Large .ndjson file

Importing a Large .ndjson file

From
Sankar P
Date:
Hi

I have a .ndjson file. It is a new-line-delimited JSON file. It is
about 10GB and has about 100,000 records.

Some sample records:
```
{ "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": {
"k111": "v111" } } \n\r
{ "key21": "value21", "key22": [ "value22.1", "value22.2"] }
```
Now I want to INSERT these json records into my postgres table of the
following schema:

```
CREATE TABLE myTable (id BIGSERIAL, content JSONB);
```

I want the records to be inserted into the `content` field of my
postgres table.

What is the best way to do this on a PostgreSQL database, deployed in
Kubernetes, with 1 GB of RAM allocated?

I can probably write a script that would read this file line by line
and INSERT each record into the database, in a transaction. But I
believe that would generate a lot of network traffic, and I want to
know if there is a better way to do this.

Thanks.

-- 
Sankar P
http://psankar.blogspot.com



Re: Importing a Large .ndjson file

From
Tom Lane
Date:
Sankar P <sankar.curiosity@gmail.com> writes:
> I have a .ndjson file. It is a new-line-delimited JSON file. It is
> about 10GB and has about 100,000 records.
> Some sample records:
> { "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": {
> "k111": "v111" } } \n\r
> { "key21": "value21", "key22": [ "value22.1", "value22.2"] }

> What is the best way to do this on a PostgreSQL database, deployed in
> Kubernetes, with 1 GB of RAM allocated?

It looks like plain old COPY would do this just fine, along the lines
of (in psql)

\copy myTable(content) from 'myfile.ndjson'

If the newlines actually are \n\r rather than the more usual \r\n,
you might have to clean that up to stop COPY from thinking they
represent two line endings not one.

I'd advise extracting the first hundred or so lines of the file and doing
a test import into a temporary table, just to verify the process.
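
For example, a minimal test along those lines might look like the following
(a sketch only: it assumes the file is named myfile.ndjson, lives on the
client machine where psql runs, and that head/tr are available to strip any
stray \r characters). First, in a shell on the client:

```
# take the first 100 lines and drop carriage returns
head -n 100 myfile.ndjson | tr -d '\r' > sample.ndjson
```

Then, in psql:

```
-- scratch table for the trial run
CREATE TEMP TABLE mytable_test (id BIGSERIAL, content JSONB);

-- one JSON document per input line goes into the jsonb column
\copy mytable_test(content) from 'sample.ndjson'

-- sanity checks: row count and a peek at the parsed JSON
SELECT count(*) FROM mytable_test;
SELECT content FROM mytable_test LIMIT 5;
```

If the sample loads cleanly, the same \copy against myTable and the full
file should behave the same way, just for longer.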

            regards, tom lane



Re: Importing a Large .ndjson file

From
Sankar P
Date:
> It looks like plain old COPY would do this just fine, along the lines
> of (in psql)
>
> \copy myTable(content) from 'myfile.ndjson'
>

Indeed. Thanks.

-- 
Sankar P
http://psankar.blogspot.com



Re: Importing a Large .ndjson file

From
Sankar P
Date:
> > It looks like plain old COPY would do this just fine, along the lines
> > of (in psql)
> >
> > \copy myTable(content) from 'myfile.ndjson'

I spoke too soon. While this worked fine when there were no indexes
and finished within 10 minutes, with a GIN index on the jsonb column
it is taking hours and still not completing.

-- 
Sankar P
http://psankar.blogspot.com



Re: Importing a Large .ndjson file

From
Michael Lewis
Date:
> I spoke too soon. While this worked fine when there were no indexes
> and finished within 10 minutes, with a GIN index on the jsonb column
> it is taking hours and still not completing.

It is always recommended to create indexes AFTER loading the data. Sometimes it can be faster to drop all indexes on the table, load the huge data set, and re-create the indexes, but there's no hard-and-fast rule. If you are adding 100k records to an empty or nearly empty table, I would remove all indexes and create them afterwards. Be sure you have sufficient maintenance_work_mem also.
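
A rough sketch of that sequence for this case (the index name
mytable_content_gin is assumed here; substitute the real index definition):

```
-- drop the (assumed) GIN index before the bulk load
DROP INDEX IF EXISTS mytable_content_gin;

-- bulk-load the newline-delimited JSON with COPY
\copy myTable(content) from 'myfile.ndjson'

-- re-create the index once all the rows are in place
CREATE INDEX mytable_content_gin ON myTable USING gin (content);
```

With only ~100k rows, rebuilding the index once at the end should be much
cheaper than maintaining it row by row during the load.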

Re: Importing a Large .ndjson file

From
Tom Lane
Date:
Sankar P <sankar.curiosity@gmail.com> writes:
> I spoke too soon. While this worked fine when there were no indexes
> and finished within 10 minutes, with a GIN index on the jsonb column
> it is taking hours and still not completing.

There's the generic advice that building an index after-the-fact
is often cheaper than updating it incrementally.  For GIN indexes
specifically, see also

https://www.postgresql.org/docs/current/gin-tips.html
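
One point from that page worth calling out here: GIN build time is very
sensitive to maintenance_work_mem, so raising it just for the CREATE INDEX
step helps. A sketch, with an assumed index name and a value that has to
stay within the pod's 1 GB:

```
-- session-local bump; choose a value that fits inside the pod's 1 GB allocation
SET maintenance_work_mem = '256MB';
CREATE INDEX mytable_content_gin ON myTable USING gin (content);
RESET maintenance_work_mem;
```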

            regards, tom lane



Re: Importing a Large .ndjson file

From
Sankar P
Date:
> Sankar P <sankar.curiosity@gmail.com> writes:
> > I spoke too soon. While this worked fine when there were no indexes
> > and finished within 10 minutes, with a GIN index on the jsonb column
> > it is taking hours and still not completing.
>
> There's the generic advice that building an index after-the-fact
> is often cheaper than updating it incrementally.  For GIN indexes
> specifically, see also
>
> https://www.postgresql.org/docs/current/gin-tips.html

Thanks guys.

-- 
Sankar P
http://psankar.blogspot.com