Re: Better way to bulk-load millions of CSV records into - Mailing list pgsql-novice

From Marc Spitzer
Subject Re: Better way to bulk-load millions of CSV records into
Date
Msg-id 20020522141122.A38174@oscar.eng.cv.net
In response to Re: Better way to bulk-load millions of CSV records into  (Ron Johnson <ron.l.johnson@cox.net>)
Responses Re: Better way to bulk-load millions of CSV records into  (Ron Johnson <ron.l.johnson@cox.net>)
List pgsql-novice
On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> > On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> > > Have you looked at the COPY command in psql for this?  There are the 'FROM
> > > { 'filename' | stdin }' and '[ [USING] DELIMITERS ]' clauses that may be of use.
> > > It seems to me that bulk loading is what they were designed for.
> > >
> > > ts
> >
> > For very large datasets that can be a problem: it is one transaction, and
> > the transaction logs must be kept until it finishes or aborts.  This
> > can be a big disk hit.
> >
> > If it is just a plain csv file you can use split to make one file into
> > several smaller files and load each one separately.
>
> That's a possibility, but that would create _lots_ of little
> files...  Much more convenient to have 1 big file, instead of
> 1 big file and many dozens of small files.

Well, you do not keep the small files; just create, load, delete.  I
routinely load 1 million+ rows of data using the COPY command from psql,
here is a sample:

for i in load_data/* ; do
    echo "datafile $i"
    # if field 15 contains a '.', replace it with -1 (the row is rebuilt with ',' as OFS)
    awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"} print $0}' "$i" > "$i.tmp"
    mv "$i.tmp" "$i"
    # show any rows that still contain dotted numbers, then drop them from the file
    grep -E  "[0-9]+([.][0-9]+)+" "$i"
    grep -vE "[0-9]+([.][0-9]+)+" "$i" > "$i.tmp"
    mv "$i.tmp" "$i"
    # generate the copy statement for this file
    echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
done
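
If you want to go the split route from earlier in the thread, so that each chunk
is loaded (and committed) in its own transaction, a rough sketch would be
something like this -- the file name, chunk size, and table name here are made
up for illustration:

split -l 100000 big_file.csv chunk_    # 100,000-line pieces: chunk_aa, chunk_ab, ...
for f in chunk_* ; do
    # \copy runs client side in psql; each call is its own transaction/commit
    psql call_me_bob -c "\copy some_table from '$f' using delimiters ','"
    rm "$f"
done

That keeps the transaction log for any one chunk small, and you never end up
with dozens of little files sitting around, since each one is deleted as soon
as it loads.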

This is part of a production job doing 1 million+ rows/day on PostgreSQL 7.1.3 in
under 1 hour.  After everything is done I clean up like this:

cat <<EOT|/usr/local/bin/psql call_me_bob

delete from ubr_info
where sampletime < current_date - '3 days'::interval;

vacuum analyze;
-- I think the 7.2.x equivalent would be "vacuum full analyze", or close to it
reindex table ubr_info;
EOT
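
For what it's worth, I think the 7.2 version of that heavier cleanup would look
roughly like this -- I have not run it on 7.2 myself, so check the docs before
trusting it:

cat <<EOT|/usr/local/bin/psql call_me_bob
-- vacuum full actually reclaims the space freed by the deletes
vacuum full analyze;
reindex table ubr_info;
EOT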

Reindexing also helps a lot, since I delete a lot of rows daily.

This is running on a PII 450 with IDE RAID, on FreeBSD 4.4-RC; I really
need to update that.
>
> > You can look at transactions and do roughly the same thing from
> > Python, committing every 10,000 rows.
>
> That's exactly what I'm doing, but with a commit count of 4,000.
>
> > There is a variable in the config file that allows you to turn off flushing
> > to disk.  If you do that for the load you will get better load
> > speed.  Turn it back on when you are done with the load.
>
> This is presuming that one could bounce postmaster (not always
> feasible).  Also, as many have said, if An Accident Happens,
> and postmaster terminates for whatever reason, you are left
> with a corrupt database, and must reload _everything_ from
> _every_ table.  Blech...
>
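
For reference, the flush-to-disk knob being discussed above is the fsync setting
in postgresql.conf; roughly like this, though check the admin guide for the exact
spelling on your version:

# illustration only -- turning fsync off skips forcing WAL writes to disk,
# which speeds up bulk loads a lot, but a crash mid-load can corrupt the
# database, so set it back to on as soon as the load finishes
fsync = off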

Well, you could always do it in C; Python is slow, from
what I hear.

good luck

marc

> > and read the admin guide twice, it will help.
