Re: Better way to bulk-load millions of CSV records into - Mailing list pgsql-novice
From: Marc Spitzer
Subject: Re: Better way to bulk-load millions of CSV records into
Date:
Msg-id: 20020522180743.A38860@oscar.eng.cv.net
In response to: Re: Better way to bulk-load millions of CSV records into (Ron Johnson <ron.l.johnson@cox.net>)
List: pgsql-novice
On Wed, May 22, 2002 at 01:51:45PM -0500, Ron Johnson wrote:
> On Wed, 2002-05-22 at 13:11, Marc Spitzer wrote:
> > On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> > > On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> > > > On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> [snip]
> > for i in load_data/* ;do
> >     echo "datafile $i"
> >     awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"} print $0}' $i >$i.tmp
> >     mv $i.tmp $i
> >     grep -E "[0-9]+([.][0-9]+)+" $i
> >     grep -vE "[0-9]+([.][0-9]+)+" $i >$i.tmp
> >     mv $i.tmp $i
> >     echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
> > done
> [snip]
>
> I'm not an awk programmer. What does that command do?

It splits each line on the commas (-F,) and prints the output with a field separator of comma (OFS=","). Then, for each row, it checks the 15th field for junk (things with a '.' in them) and substitutes -1 for the junk.

> Also, all my fields have double-quotes around them. Is there
> a tool (or really clever use of sed) that will strip them

tr will take care of that, if you just want to delete them:

    tr -d "\"" <file >newfile

should do it; my syntax might be a little off, so check the man page.

Another question is: do you need the quotes around any of the fields? Is it possible that you have a field separator (a comma, for example) embedded in any of the fields? If the answer is yes, or perhaps, then it gets harder to deal with. A quick check can be done using this script (it assumes the field separator is a comma):

    awk -F, '{print NF}' <data_file | sort | uniq -c

This will tell you how many lines have a given number of fields. If things are good, they should all have the same number, and it should agree with what you think it is. NF is defined in the awk man page.

> away from the fields that don't need them? I actually have
> _comma_ delimited files, and any fields with commas in them
> need the double quotes...
In that case there is Perl with DBI + DBD::CSV, which handles all the special cases for you. I think there is an awk library that does this as well, if you want to look for it. Or, if possible, you could use a different character as the field separator; semicolon, colon, and pipe come to mind.

Good luck

marc

> --
> +---------------------------------------------------------+
> | Ron Johnson, Jr.    Home: ron.l.johnson@cox.net         |
> | Jefferson, LA  USA  http://ronandheather.dhs.org:81     |
> |                                                         |
> | "I have created a government of whirled peas..."        |
> |   Maharishi Mahesh Yogi, 12-May-2002,                   |
> |   CNN, Larry King Live                                  |
> +---------------------------------------------------------+
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
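[Editor's note] The quote-stripping and junk-substitution steps discussed above can be combined into a short pipeline. A minimal sketch; the file names are made up, and field 3 stands in for Ron's field 15 so the toy data stays small:

```shell
# Toy file (made-up name); the second row has junk ('1.5') in field 3.
printf '%s\n' '"1","2","3"' '"4","5","1.5"' > /tmp/demo.csv

# Strip the surrounding double quotes, as suggested with tr above.
tr -d '"' < /tmp/demo.csv > /tmp/demo.clean

# Same idea as the $15 substitution in the quoted loop, adapted to
# field 3: any field value containing a '.' is replaced with -1.
awk -F, 'BEGIN{OFS=","}{if ($3 ~ /[.]/) $3="-1"; print $0}' /tmp/demo.clean
```

This prints `1,2,3` and `4,5,-1`: comma-separated rows with the quotes removed and the junk value replaced.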
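[Editor's note] Before swapping in a semicolon or pipe as the separator, the field-count check from earlier in the thread shows whether commas are already embedded inside quoted fields. A toy sketch (file name made up):

```shell
# The second row hides a comma inside quotes; a naive split on commas
# sees four fields where the CSV really has three.
printf '%s\n' '"a","b","c"' '"a","b,x","c"' > /tmp/embed.csv

# Count fields per line, as in the awk one-liner quoted above.
awk -F, '{print NF}' /tmp/embed.csv | sort | uniq -c
```

Mixed counts (some lines with 3 fields, some with 4) are the warning sign that a plain -F, split, or a blind delimiter swap, would corrupt fields; that is the case where a CSV-aware parser such as DBD::CSV earns its keep.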