Re: Using COPY to import large xml file - Mailing list pgsql-general

From Adrian Klaver
Subject Re: Using COPY to import large xml file
Date
Msg-id d3c834a0-edb0-4363-d882-e36422ebcdc9@aklaver.com
In response to Re: Using COPY to import large xml file  (Anto Aravinth <anto.aravinth.cse@gmail.com>)
List pgsql-general
On 06/25/2018 07:25 AM, Anto Aravinth wrote:
> Thanks a lot. But I ran into a lot of challenges! It looks like the SO data 
> contains a lot of tabs within itself, so a tab delimiter didn't work for 
> me. I thought I could use a special delimiter, but it looks like PostgreSQL's 
> COPY allows only one character as the delimiter :(

I use | as it is rarely found in the data itself.
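
For example (the table and file names here are just placeholders):

  COPY so_posts FROM '/path/to/posts.txt' WITH (DELIMITER '|');

If even | can show up in the data, CSV format with quoting avoids the
single-character-delimiter problem entirely, e.g. WITH (FORMAT csv).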

> 
> Sad; I guess the only way is to use INSERTs, or do a thorough serialization of my 
> data into something that COPY can understand.
> 
> On Mon, Jun 25, 2018 at 8:09 AM, Tim Cross <theophilusx@gmail.com> wrote:
> 
> 
> 
>     On Mon, 25 Jun 2018 at 11:38, Anto Aravinth
>     <anto.aravinth.cse@gmail.com> wrote:
> 
> 
> 
>         On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross
>         <theophilusx@gmail.com> wrote:
> 
> 
>             Anto Aravinth <anto.aravinth.cse@gmail.com> writes:
> 
>             > Thanks for the response. I'm not sure how long this tool takes for
>             > the 70GB of data.
>             >
>             > I used node to stream the XML files into inserts, which was very slow.
>             > The XML actually contains 40 million records, of which 10 million took
>             > around 2 hrs using nodejs. Hence, I thought I would use the COPY command, as
>             > suggested on the internet.
>             >
>             > I will definitely try the code and let you know. But it looks like it uses
>             > plain INSERTs, not COPY; interesting to see if it runs quickly on my machine.
>             >
>             > On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat
>             > <adrien.nayrat@anayrat.info> wrote:
>             >
>             >> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>             >> > Hello Everyone,
>             >> >
>             >> > I have downloaded the Stack Overflow posts XML (it contains all SO
>             >> > questions to date). The file is around 70GB. I want to import the data
>             >> > from that XML into my table. Is there a way to do so in Postgres?
>             >> >
>             >> >
>             >> > Thanks,
>             >> > Anto.
>             >>
>             >> Hello Anto,
>             >>
>             >> I used this tool:
>             >> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>             >>
> 
>             If you are using nodejs, then you can easily use the pg-copy-streams
>             module to insert the records into your database. I've been using this
>             for inserting large numbers of records from NetCDF files. It takes
>             between 40 and 50 minutes to insert 60 million+ records, and we are
>             doing additional calculations on the values, not just inserting them;
>             plus we are inserting over the network into a database which is also
>             performing other processing.
> 
>             We found a significant speed improvement with COPY over blocks of
>             insert transactions, which in turn were faster than individual inserts.
>             The only downside with using COPY is that it either completely works or
>             completely fails, and when it fails, it can be tricky to work out which
>             record is causing the failure. A benefit of using blocks of transactions
>             is that you have more fine-grained control, allowing you to recover from
>             some errors or to provide more specific detail regarding the cause of
>             the error.
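> 
>             For illustration, a batched version might look roughly like this (a
>             sketch using node-postgres; the so_posts table and its columns are just
>             placeholders):
> 
>             const { Pool } = require("pg");
> 
>             // Insert rows (arrays of values) in batches, one transaction per
>             // batch, so a failure only rolls back and reports that batch.
>             async function insertInBatches(pool, rows, batchSize = 1000) {
>               for (let start = 0; start < rows.length; start += batchSize) {
>                 const client = await pool.connect();
>                 try {
>                   await client.query("BEGIN");
>                   for (const r of rows.slice(start, start + batchSize)) {
>                     await client.query(
>                       "INSERT INTO so_posts (id, title, body) VALUES ($1, $2, $3)", r);
>                   }
>                   await client.query("COMMIT");
>                 } catch (err) {
>                   await client.query("ROLLBACK");
>                   throw new Error(`batch starting at row ${start} failed: ${err.message}`);
>                 } finally {
>                   client.release();
>                 }
>               }
>             }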
> 
> 
>         Sure, let me try that. I have a question here: COPY usually
>         works when you move data from files into your postgres instance,
>         right? Now, processing the whole file in node.js, can I use COPY
>         programmatically, like COPY Stackoverflow <value calculated at
>         run time>? Because from the doc:
> 
>         https://www.postgresql.org/docs/9.2/static/sql-copy.html
> 
>         I don't see that it's possible. Maybe I need to convert the files into
>         something COPY understands first?
> 
>         Anto.
> 
> 
> 
> 
>     Yes. Essentially what you do is create a stream and feed whatever
>     information you want to copy into that stream. PG sees the data as
>     if it were reading each line in a file, so you push data onto the
>     stream where each item is separated by a tab (or whatever). Here is
>     the basic low-level function I use (don't know how the formatting
>     will go!):
> 
>     // Assumes Readable from Node's "stream" module, copyFrom being the `from`
>     // export of the pg-copy-streams package, VError from the verror package,
>     // and getClient/releaseClient being our own connection pool helpers.
>     async function copyInsert(sql, stringifyFN, records) {
>        const logName = `${moduleName}.copyInsert`;
>        var client;
> 
>        assert.ok(Array.isArray(records), "The records arg must be an array");
>        assert.ok(typeof(stringifyFN) === "function", "The stringifyFN arg must be a function");
>        return getClient()
>          .then(c => {
>            client = c;
>            return new Promise(function(resolve, reject) {
>              var stream, rs;
>              var idx = 0;
>              function done() {
>                releaseClient(client);
>                client = undefined;
>                resolve(idx + 1);
>              }
> 
>              function onError(err) {
>                if (client !== undefined) {
>                  releaseClient(client);
>                }
>                reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
>              }
> 
>              // Pushes the next stringified record onto the readable stream,
>              // or null to signal the end of the input.
>              function arrayRead() {
>                if (idx === records.length) {
>                  rs.push(null);
>                } else {
>                  let rec = records[idx];
>                  rs.push(stringifyFN(rec));
>                  idx += 1;
>                }
>              }
> 
>              rs = new Readable;
>              rs._read = arrayRead;
>              rs.on("error", onError);
>              stream = client.query(copyFrom(sql));
>              stream.on("error", onError);
>              stream.on("end", done);
>              rs.pipe(stream);
>            });
>          })
>          .catch(err => {
>            throw new VError(err, `${logName} Failed COPY insert`);
>          });
>     }
> 
>     and I call it like this:
> 
>     copyInsert(sql, stringifyClimateRecord, records)
> 
>     where the sql and stringifyClimateRecord arguments are:
> 
>        const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
>              + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
>              + "vprp_09,vprp_15,wind_speed) FROM STDIN";
> 
>        function stringifyClimateRecord(rec) {
>          return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
>            + `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
>        }
> 
>     stringifyClimateRecord turns a record into the 'line' that gets pushed
>     onto the stream, with values separated by tabs. records is an array of
>     data records, where each record is itself an array.
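> 
>     For a 70GB file you probably don't want to build the whole records array
>     in memory first; you can instead pipe whatever emits parsed records
>     straight into the COPY stream through a small transform. A rough sketch
>     (assuming the pg and pg-copy-streams packages; the so_posts table, its
>     columns and the recordSource object stream stand in for your XML parser's
>     output):
> 
>     const { Client } = require("pg");
>     const { from: copyFrom } = require("pg-copy-streams");
>     const { Transform, pipeline } = require("stream");
> 
>     async function copyRecords(recordSource) {
>       const client = new Client();
>       await client.connect();
> 
>       // Turn each parsed record object into one tab-separated COPY line.
>       const toCopyLine = new Transform({
>         writableObjectMode: true,
>         transform(rec, _enc, cb) {
>           cb(null, `${rec.id}\t${rec.title}\t${rec.body}\n`);
>         }
>       });
> 
>       const copyStream = client.query(
>         copyFrom("COPY so_posts (id, title, body) FROM STDIN"));
> 
>       await new Promise((resolve, reject) =>
>         pipeline(recordSource, toCopyLine, copyStream,
>                  err => (err ? reject(err) : resolve())));
> 
>       await client.end();
>     }
> 
>     One thing to watch with COPY's text format: any literal tabs, newlines or
>     backslashes inside the values have to be escaped (as \t, \n and \\) before
>     they go down the stream, or you can emit CSV and use COPY ... WITH
>     (FORMAT csv) instead.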
> 
> 
>     -- 
>     regards,
> 
>     Tim
> 
>     --
>     Tim Cross
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com

