Re: moving CSV data into table? - Mailing list pgsql-novice
From: Steve Crawford
Subject: Re: moving CSV data into table?
Date:
Msg-id: 4E611D6F.20708@pinpointresearch.com
In response to: moving CSV data into table? (James Hartley <jjhartley@gmail.com>)
Responses: Re: moving CSV data into table? (James Hartley <jjhartley@gmail.com>)
List: pgsql-novice
On 09/02/2011 10:08 AM, James Hartley wrote:
> I need to ask about best practices. I now have periodic bulk CSV data
> needing to be added to an existing table. I assume it is best to
> first move the data into a temporary table & scrub it before moving
> into the final table. However, the incoming data does not have the
> same schema as the permanent target as additional attributes (columns)
> have been added. Is it better to massage the temporary table into the
> same target's schema first, & use COPY to copy all temporary contents
> into the permanent table, or is scripting a better solution? Is there
> a better method?
>
> Your insights would be appreciated.
>
> Thanks.

As always, the answer is "it depends". Is performance a concern? What type of scrubbing must be done? Does scrubbing require comparison to existing database data? Is lights-out automation required? A combination of the above?

You may find it preferable to do pre-scrubbing (sed/grep/awk-type work that doesn't require access to existing data in the database) before the initial import.

I would typically create a temporary table that matches the structure of the incoming data, then craft a query or set of queries to scrub and import it.

Notes:

You don't need to alter the temporary table's structure to copy the data - just select the columns you want. Suppose your temporary import table has five columns a, b, c, d and e, but your permanent table has only a, b and c. The import is basically:

    insert into permtable (a, b, c)
    select a, b, c from temptable;

The select can be as complicated as you want. You can eliminate duplicates, perform calculations, add static data like a batch number, etc. For example, the following would import "a" unchanged, put b+d from the temp table into b in the permanent table, import "c" unchanged, and put the integer 1 into batchnum.
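The pre-scrubbing step mentioned above can be sketched without any database at all. A minimal example in Python (the file layout and rules here are assumptions for illustration: a five-column CSV where rows with a wrong column count or entirely blank fields should be dropped before the COPY into the temporary table):

```python
import csv
import io

def prescrub(raw_csv, expected_cols=5):
    """Purely textual scrubbing (the sed/grep/awk style of work).

    No database access is needed at this stage; the goal is just to
    reject rows that would make the bulk COPY fail or load junk.
    """
    cleaned = []
    for row in csv.reader(io.StringIO(raw_csv)):
        if len(row) != expected_cols:           # wrong column count: skip
            continue
        row = [field.strip() for field in row]  # trim stray whitespace
        if not any(row):                        # entirely blank row: skip
            continue
        cleaned.append(row)
    return cleaned

# Second row has too few columns; third row is all-blank.
raw = "1,2,3,4,5\n9,9\n,,,,\n6,7,8,9,0\n"
print(prescrub(raw))
# → [['1', '2', '3', '4', '5'], ['6', '7', '8', '9', '0']]
```

The surviving rows can then be written back out and fed to COPY as usual.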
It also eliminates records in the import table where e is zero, and records where a would create a duplicate a in the permanent table:

    insert into permtable (a, b, c, batchnum)
    select a, b+d, c, 1::int
    from temptable t
    where e != 0
      and not exists (select 1 from permtable p where p.a = t.a);

Also note that for performance, the temporary table should really be temporary (i.e. "create temporary table foo..."). Temporary tables are unlogged, visible only to the current connection, and deleted when the connection closes. Since they are unlogged, you eliminate the performance cost of maintaining the write-ahead log. (One upcoming feature is the ability to declare a permanent table unlogged - useful for certain tables where data loss is not a problem.)

Cheers,
Steve
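Since the insert ... select with "not exists" is plain SQL, its behavior can be checked outside PostgreSQL. A small sketch using Python's built-in sqlite3 module (table and column names follow the example above; sqlite has no "::int" cast, so a plain literal 1 is used for batchnum):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table permtable (a integer primary key, b integer, "
            "c integer, batchnum integer)")
cur.execute("create temporary table temptable "
            "(a integer, b integer, c integer, d integer, e integer)")

# a=1 already exists in permtable; a=3 has e=0 and should be filtered out.
cur.execute("insert into permtable values (1, 10, 10, 0)")
cur.executemany("insert into temptable values (?, ?, ?, ?, ?)",
                [(1, 2, 3, 4, 5), (2, 2, 3, 4, 5), (3, 2, 3, 4, 0)])

# The scrub-and-import query: skip e=0 rows and duplicate keys,
# fold b+d into b, and tag the batch with a static 1.
cur.execute("""
    insert into permtable (a, b, c, batchnum)
    select a, b + d, c, 1
    from temptable t
    where e != 0
      and not exists (select 1 from permtable p where p.a = t.a)
""")
conn.commit()
print(cur.execute("select a, b, c, batchnum from permtable order by a").fetchall())
# → [(1, 10, 10, 0), (2, 6, 3, 1)]
```

Only the a=2 row makes it through: a=1 is rejected by the not exists test and a=3 by the e != 0 filter, exactly as described above.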