Re: Using COPY to import large xml file - Mailing list pgsql-general

From Anto Aravinth
Subject Re: Using COPY to import large xml file
Date
Msg-id CANtp6RLW-UuRr2xQkVt74S_+u3Ad_bVM7npGNzF8BE1YyZwEWA@mail.gmail.com
Whole thread Raw
In response to Re: Using COPY to import large xml file  (Tim Cross <theophilusx@gmail.com>)
Responses Re: Using COPY to import large xml file  (Tim Cross <theophilusx@gmail.com>)
Re: Using COPY to import large xml file  (Christoph Moench-Tegeder <cmt@burggraben.net>)
List pgsql-general


On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross <theophilusx@gmail.com> wrote:

Anto Aravinth <anto.aravinth.cse@gmail.com> writes:

> Thanks for the response. I'm not sure, how long does this tool takes for
> the 70GB data.
>
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took
> around 2 hrs using nodejs. Hence, I thought will use COPY command, as
> suggested on the internet.
>
> Definitely, will try the code and let you know.. But looks like it uses the
> same INSERT, not copy.. interesting if it runs quick on my machine.
>
> On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien.nayrat@anayrat.info>
> wrote:
>
>> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> > Hello Everyone,
>> >
>> > I have downloaded the Stackoverflow posts xml (contains all SO questions
>> till
>> > date).. the file is around 70GB.. I wanna import the data in those xml
>> to my
>> > table.. is there a way to do so in postgres?
>> >
>> >
>> > Thanks,
>> > Anto.
>>
>> Hello Anto,
>>
>> I used this tool :
>> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>>

If you are using nodejs, then you can easily use the pg-copy-streams
module to insert the records into your database. I've been using this
for inserting large numbers of records from NetCDF files. Takes between
40 to 50 minutes to insert 60 Million+ records and we are doing
additional calculations on the values, not just inserting them,
plus we are inserting into a database over the network and into a database which is
also performing other processing.

We found a significant speed improvement with COPY over blocks of insert
transactions, which was faster than just individual inserts. The only
downside with using COPY is that it either completely works or
completely fails and when it fails, it can be tricky to work out which
record is causing the failure. A benefit of using blocks of transactions
is that you have more fine grained control, allowing you to recover from
some errors or providing more specific detail regarding the cause of the
error.

Sure, let me try that.. I have a question here, COPY usually works when you move data from files to your postgres instance, right? Now in node.js, processing the whole file, can I use COPY
programmatically like COPY Stackoverflow <calculated value at run time>? Because from doc:


I don't see its possible. May be I need to convert the files to copy understandable first? 

Anto. 

Be wary of what indexes your defining on your table. Depending on the
type and number, these can have significant impact on insert times as
well.


--
Tim Cross

pgsql-general by date:

Previous
From: Tim Cross
Date:
Subject: Re: Using COPY to import large xml file
Next
From: Tim Cross
Date:
Subject: Re: Using COPY to import large xml file