Re: Better way to bulk-load millions of CSV records into - Mailing list pgsql-novice
From: Marc Spitzer
Subject: Re: Better way to bulk-load millions of CSV records into
Date:
Msg-id: 20020522180743.A38860@oscar.eng.cv.net
In response to: Re: Better way to bulk-load millions of CSV records into (Ron Johnson <ron.l.johnson@cox.net>)
List: pgsql-novice
On Wed, May 22, 2002 at 01:51:45PM -0500, Ron Johnson wrote:
> On Wed, 2002-05-22 at 13:11, Marc Spitzer wrote:
> > On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> > > On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> > > > On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> [snip]
> > for i in load_data/* ;do
> >     echo "datafile $i"
> >     awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"} print $0}' $i >$i.tmp
> >     mv $i.tmp $i
> >     grep -E "[0-9]+([.][0-9]+)+" $i
> >     grep -vE "[0-9]+([.][0-9]+)+" $i >$i.tmp
> >     mv $i.tmp $i
> >     echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
> > done
> [snip]
>
> I'm not an awk programmer. What does that command do?

It splits each line on the commas (-F,) and prints the output with a field separator of comma (OFS=","). Then, for each row, it checks the 15th field for junk (things with a '.' in them) and substitutes -1 for the junk.

> Also, all my fields have double-quotes around them. Is there
> a tool (or really clever use of sed) that will strip them

tr will take care of that, if you just want to delete them:

    tr -d "\"" <file >newfile

should do it; my syntax might be a little off, so check the man page.

Another question is: do you need the quotes around any of the fields? Is it possible that you have a field separator (a comma, for example) embedded in any of the fields? If the answer is yes, or perhaps, then it gets harder to deal with. A quick check can be done using this script (it assumes the field separator is a comma):

    awk -F, '{print NF}' <data_file | sort | uniq -c

This will tell you how many lines have a given number of fields. If things are good, they should all have the same number, and it should agree with what you think it is. NF is defined in the awk man page.

> away from the fields that don't need them? I actually have
> _comma_ delimited files, and any fields with commas in them
> need the double quotes...
In that case there is Perl with DBI + DBD::CSV, which handles all the special cases for you. I think there is an awk library that does this as well, if you want to look for it. Or, if possible, you could use a different character as the field separator; semicolon, colon, and pipe come to mind.

Good luck

marc

> --
> +---------------------------------------------------------+
> | Ron Johnson, Jr.    Home: ron.l.johnson@cox.net         |
> | Jefferson, LA  USA  http://ronandheather.dhs.org:81     |
> |                                                         |
> | "I have created a government of whirled peas..."        |
> |   Maharishi Mahesh Yogi, 12-May-2002,                   |
> |   CNN, Larry King Live                                  |
> +---------------------------------------------------------+
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
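[Editor's note] The quote-stripping and junk-substitution steps discussed above can be combined into a short pipeline. A minimal sketch; the file names are made up, and field 3 stands in for Ron's field 15 so the toy data stays small:

```shell
# Toy file (made-up name); the second row has junk ('1.5') in field 3.
printf '%s\n' '"1","2","3"' '"4","5","1.5"' > /tmp/demo.csv

# Strip the surrounding double quotes, as suggested with tr above.
tr -d '"' < /tmp/demo.csv > /tmp/demo.clean

# Same idea as the $15 substitution in the quoted loop, adapted to
# field 3: any field value containing a '.' is replaced with -1.
awk -F, 'BEGIN{OFS=","}{if ($3 ~ /[.]/) $3="-1"; print $0}' /tmp/demo.clean
```

This prints `1,2,3` and `4,5,-1`: comma-separated rows with the quotes removed and the junk value replaced.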
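[Editor's note] Before swapping in a semicolon or pipe as the separator, the field-count check from earlier in the thread shows whether commas are already embedded inside quoted fields. A toy sketch (file name made up):

```shell
# The second row hides a comma inside quotes; a naive split on commas
# sees four fields where the CSV really has three.
printf '%s\n' '"a","b","c"' '"a","b,x","c"' > /tmp/embed.csv

# Count fields per line, as in the awk one-liner quoted above.
awk -F, '{print NF}' /tmp/embed.csv | sort | uniq -c
```

Mixed counts (some lines with 3 fields, some with 4) are the warning sign that a plain -F, split, or a blind delimiter swap, would corrupt fields; that is the case where a CSV-aware parser such as DBD::CSV earns its keep.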