Thread: Importing a Large .ndjson file
Hi, I have a .ndjson file. It is a new-line-delimited JSON file. It is about 10GB and has about 100,000 records. Some sample records: ``` { "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": { "k111": "v111" } } \n\r { "key21": "value21", "key22": [ "value22.1", "value22.2"] } ``` Now I want to INSERT these JSON records into my postgres table with the following schema: ``` CREATE TABLE myTable (id BIGSERIAL, content JSONB); ``` I want each record to go into the `content` field of my postgres table. What is the best way to do this on a postgresql database, deployed in kubernetes, with 1 GB of RAM allocated? I can probably write a script that would read this file line-by-line and INSERT into the database in a transaction, but I believe that would generate a lot of network traffic and I want to know if there is a better way to do this. Thanks. -- Sankar P http://psankar.blogspot.com
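For illustration, with the schema above each line of the file becomes the `content` of one row; inserting one of the sample records by hand would look roughly like the sketch below (the replies that follow cover the bulk case, which is what the question is really about):

```
-- One sample record inserted by hand, for illustration only; repeating this
-- 100,000 times over the network is exactly what the question wants to avoid.
INSERT INTO myTable (content)
VALUES ('{ "key21": "value21", "key22": [ "value22.1", "value22.2" ] }'::jsonb);
```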
Sankar P <sankar.curiosity@gmail.com> writes: > I have a .ndjson file. It is a new-line-delimited JSON file. It is > about 10GB and has about 100,000 records. > Some sample records: > { "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": { > "k111": "v111" } } \n\r > { "key21": "value21", "key22": [ "value22.1", "value22.2"] } > What is the best way to do this on a postgresql database, deployed in > kubernetes, with a 1 GB RAM allocated ? It looks like plain old COPY would do this just fine, along the lines of (in psql) \copy myTable(content) from 'myfile.ndjson' If the newlines actually are \n\r rather than the more usual \r\n, you might have to clean that up to stop COPY from thinking they represent two line endings not one. I'd advise extracting the first hundred or so lines of the file and doing a test import into a temporary table, just to verify the process. regards, tom lane
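A minimal sketch of the suggested test run, assuming the first hundred or so lines have been extracted into a file called sample.ndjson (a hypothetical name) and that any stray \r characters have already been stripped from the line endings:

```
-- Test import into a temporary table before touching the real one.
-- "sample.ndjson" (the first ~100 lines of the input) is a hypothetical file name.
CREATE TEMP TABLE mytable_test (id BIGSERIAL, content JSONB);
\copy mytable_test(content) from 'sample.ndjson'
-- Spot-check that the JSON landed where expected.
SELECT id, content -> 'key11' AS key11 FROM mytable_test LIMIT 5;
```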
> It looks like plain old COPY would do this just fine, along the lines > of (in psql) > > \copy myTable(content) from 'myfile.ndjson' > Indeed. Thanks. -- Sankar P http://psankar.blogspot.com
> > It looks like plain old COPY would do this just fine, along the lines > > of (in psql) > > > > \copy myTable(content) from 'myfile.ndjson' I spoke too soon. While this worked fine when there were no indexes and finished within 10 minutes, with a GIN index on the jsonb column, it is taking hours and still not completing. -- Sankar P http://psankar.blogspot.com
> I spoke too soon. While this worked fine when there were no indexes > and finished within 10 minutes, with a GIN index on the jsonb column, it > is taking hours and still not completing. It is always recommended to create indexes AFTER loading data. Sometimes it can be faster to drop all indexes on the table, load the huge data set, and re-create the indexes, but there's no hard-and-fast rule. If you are adding 100k records to an empty or near-empty table, I would remove all indexes and create them after the load. Be sure you have sufficient maintenance_work_mem also.
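A rough sketch of that drop / load / re-create sequence, using the table and file names from earlier in the thread; the index name and the maintenance_work_mem value are only illustrative and have to fit inside the 1 GB pod:

```
-- Drop the GIN index (hypothetical name), bulk-load, then rebuild it.
DROP INDEX IF EXISTS mytable_content_idx;
\copy myTable(content) from 'myfile.ndjson'
SET maintenance_work_mem = '512MB';   -- session-local; speeds up the index build
CREATE INDEX mytable_content_idx ON myTable USING gin (content);
```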
Sankar P <sankar.curiosity@gmail.com> writes: > I spoke too soon. While this worked fine when there were no indexes > and finished within 10 minutes, with GIN index on the jsonb column, it > is taking hours and still not completing. There's the generic advice that building an index after-the-fact is often cheaper than updating it incrementally. For GIN indexes specifically, see also https://www.postgresql.org/docs/current/gin-tips.html regards, tom lane
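If the index has to stay in place during the load, the tips page also describes the GIN pending-list mechanism; a sketch with illustrative values, reusing the hypothetical index name from above:

```
-- Buffer new entries in the pending list instead of updating the index tree for every row.
ALTER INDEX mytable_content_idx SET (fastupdate = on);
-- Let the pending list grow larger before it is flushed into the main index (session-local).
SET gin_pending_list_limit = '64MB';
```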
> Sankar P <sankar.curiosity@gmail.com> writes: > > I spoke too soon. While this worked fine when there were no indexes > > and finished within 10 minutes, with GIN index on the jsonb column, it > > is taking hours and still not completing. > > There's the generic advice that building an index after-the-fact > is often cheaper than updating it incrementally. For GIN indexes > specifically, see also > > https://www.postgresql.org/docs/current/gin-tips.html Thanks guys. -- Sankar P http://psankar.blogspot.com