Thread: Importing a Large .ndjson file
Hi, I have a .ndjson file. It is a new-line-delimited JSON file. It is about 10GB and has about 100,000 records. Some sample records: ``` { "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": { "k111": "v111" } } \n\r { "key21": "value21", "key22": [ "value22.1", "value22.2"] } ``` Now I want to INSERT these JSON records into my postgres table with the following schema: ``` CREATE TABLE myTable (id BIGSERIAL, content JSONB); ``` I want each record to go into the `content` field of my postgres table. What is the best way to do this on a postgresql database, deployed in kubernetes, with 1 GB of RAM allocated? I can probably write a script that would read this file line-by-line and INSERT into the database in a transaction, but I believe that would generate a lot of network traffic and I want to know if there is a better way to do this. Thanks. -- Sankar P http://psankar.blogspot.com
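For illustration, with the schema above each line of the file becomes the `content` of one row; inserting one of the sample records by hand would look roughly like the sketch below (the replies that follow cover the bulk case, which is what the question is really about):

```
-- One sample record inserted by hand, for illustration only; repeating this
-- 100,000 times over the network is exactly what the question wants to avoid.
INSERT INTO myTable (content)
VALUES ('{ "key21": "value21", "key22": [ "value22.1", "value22.2" ] }'::jsonb);
```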
Sankar P <sankar.curiosity@gmail.com> writes: > I have a .ndjson file. It is a new-line-delimited JSON file. It is > about 10GB and has about 100,000 records. > Some sample records: > { "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": { > "k111": "v111" } } \n\r > { "key21": "value21", "key22": [ "value22.1", "value22.2"] } > What is the best way to do this on a postgresql database, deployed in > kubernetes, with a 1 GB RAM allocated ? It looks like plain old COPY would do this just fine, along the lines of (in psql) \copy myTable(content) from 'myfile.ndjson' If the newlines actually are \n\r rather than the more usual \r\n, you might have to clean that up to stop COPY from thinking they represent two line endings not one. I'd advise extracting the first hundred or so lines of the file and doing a test import into a temporary table, just to verify the process. regards, tom lane
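A minimal sketch of the suggested test run, assuming the first hundred or so lines have been extracted into a file called sample.ndjson (a hypothetical name) and that any stray \r characters have already been stripped from the line endings:

```
-- Test import into a temporary table before touching the real one.
-- "sample.ndjson" (the first ~100 lines of the input) is a hypothetical file name.
CREATE TEMP TABLE mytable_test (id BIGSERIAL, content JSONB);
\copy mytable_test(content) from 'sample.ndjson'
-- Spot-check that the JSON landed where expected.
SELECT id, content -> 'key11' AS key11 FROM mytable_test LIMIT 5;
```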
> It looks like plain old COPY would do this just fine, along the lines > of (in psql) > > \copy myTable(content) from 'myfile.ndjson' > Indeed. Thanks. -- Sankar P http://psankar.blogspot.com
> > It looks like plain old COPY would do this just fine, along the lines > > of (in psql) > > > > \copy myTable(content) from 'myfile.ndjson' I spoke too soon. While this worked fine when there were no indexes and finished within 10 minutes, with a GIN index on the jsonb column, it is taking hours and still not completing. -- Sankar P http://psankar.blogspot.com
> I spoke too soon. While this worked fine when there were no indexes > and finished within 10 minutes, with a GIN index on the jsonb column, it > is taking hours and still not completing. It is always recommended to create indexes AFTER loading data. Sometimes it can be faster to drop all indexes on the table, load the huge data set, and re-create the indexes, but there's no hard-and-fast rule. If you are adding 100k records to an empty or near-empty table, I would remove all indexes and create them after the load. Be sure you have sufficient maintenance_work_mem also.
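A rough sketch of that drop / load / re-create sequence, using the table and file names from earlier in the thread; the index name and the maintenance_work_mem value are only illustrative and have to fit inside the 1 GB pod:

```
-- Drop the GIN index (hypothetical name), bulk-load, then rebuild it.
DROP INDEX IF EXISTS mytable_content_idx;
\copy myTable(content) from 'myfile.ndjson'
SET maintenance_work_mem = '512MB';   -- session-local; speeds up the index build
CREATE INDEX mytable_content_idx ON myTable USING gin (content);
```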
Sankar P <sankar.curiosity@gmail.com> writes: > I spoke too soon. While this worked fine when there were no indexes > and finished within 10 minutes, with GIN index on the jsonb column, it > is taking hours and still not completing. There's the generic advice that building an index after-the-fact is often cheaper than updating it incrementally. For GIN indexes specifically, see also https://www.postgresql.org/docs/current/gin-tips.html regards, tom lane
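If the index has to stay in place during the load, the tips page also describes the GIN pending-list mechanism; a sketch with illustrative values, reusing the hypothetical index name from above:

```
-- Buffer new entries in the pending list instead of updating the index tree for every row.
ALTER INDEX mytable_content_idx SET (fastupdate = on);
-- Let the pending list grow larger before it is flushed into the main index (session-local).
SET gin_pending_list_limit = '64MB';
```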
> Sankar P <sankar.curiosity@gmail.com> writes: > > I spoke too soon. While this worked fine when there were no indexes > > and finished within 10 minutes, with GIN index on the jsonb column, it > > is taking hours and still not completing. > > There's the generic advice that building an index after-the-fact > is often cheaper than updating it incrementally. For GIN indexes > specifically, see also > > https://www.postgresql.org/docs/current/gin-tips.html Thanks guys. -- Sankar P http://psankar.blogspot.com