Re: Better way to bulk-load millions of CSV records into - Mailing list pgsql-novice
From | Marc Spitzer
---|---
Subject | Re: Better way to bulk-load millions of CSV records into
Date |
Msg-id | 20020522141122.A38174@oscar.eng.cv.net
In response to | Re: Better way to bulk-load millions of CSV records into (Ron Johnson <ron.l.johnson@cox.net>)
Responses | Re: Better way to bulk-load millions of CSV records into
List | pgsql-novice

On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> > On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> > > Have you looked at the COPY command in psql for this?  There are the
> > > 'FROM { 'filename' | stdin }' and '[ [USING] DELIMITERS ]' clauses that
> > > may be of use.  It seems to me that bulk loading is what they were
> > > designed for.
> > >
> > > ts
> >
> > For very large datasets that can be a problem: it is one transaction, and
> > the transaction logs must be kept until it finishes or aborts.  This can
> > be a big disk hit.
> >
> > If it is just a plain CSV file you can use split to make one file into
> > several smaller files and load each one separately.
>
> That's a possibility, but that would create _lots_ of little
> files...  Much more convenient to have 1 big file, instead of
> 1 big file and many dozens of small files.

Well, you do not keep the small files, just create, load, delete.  And I
routinely load 1 million+ rows of data using "copy into" from psql.  Here is
a sample:

for i in load_data/* ;do
echo "datafile $i"
awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"; $0=$0} print $0}' $i >$i.tmp
mv $i.tmp $i
grep -E "[0-9]+([.][0-9]+)+" $i
grep -vE "[0-9]+([.][0-9]+)+" $i >$i.tmp
mv $i.tmp $i
echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
done

This is part of a production job doing 1 million+ rows/day on 7.1.3 in under
1 hour.  After everything is done I clean up like this:

cat <<EOT|/usr/local/bin/psql call_me_bob
delete from ubr_info where sampletime < current_date - '3 days'::interval;
vacuum analyze;
-- I think that the 7.2.X command would be vacuum full analyze, or close
reindex table ubr_info;
EOT

Reindexing also helps a lot, since I delete a lot of rows daily.  This is
running on a PII 450 with IDE RAID, OS FreeBSD 4.4 RC; I really need to
update that.

> > You can look at transactions and do roughly the same thing from
> > python, commit every 10,000 rows.
>
> That's exactly what I'm doing, but with a commit count of 4,000.
>
> > There is a variable in the config files that allows you to turn off flush
> > to disk.  If you do that for the load you will have better load
> > speed.  Turn it back on when you are done with the load.
>
> This is presuming that one could bounce postmaster (not always
> feasible).  Also, as many have said, if An Accident Happens,
> and postmaster terminates for whatever reason, you are left
> with a corrupt database, and must reload _everything_ from
> _every_ table.  Blech...

Well, you could always do it in C; python is slow, from what I hear.

good luck

marc

> > and read the admin guide twice, it will help.
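
A minimal sketch of the split / load / delete cycle described above (not
Marc's actual job): the table name big_table, the database mydb, the file
bulk.csv, and the 100,000-line chunk size are made-up placeholders, and the
COPY option syntax here is the modern client-side \copy form rather than the
7.x "using delimiters" form shown in the message.

# split the big file, load each piece as its own COPY (so each piece is
# its own transaction and the transaction-log hit stays small), then
# throw the piece away
split -l 100000 bulk.csv chunk_               # ~100k data rows per piece
for f in chunk_*; do
    echo "loading $f"
    psql mydb -c "\copy big_table from '$f' csv" || exit 1   # client-side copy, assumes well-formed CSV
    rm "$f"                                   # create, load, delete -- keep nothing
done

Because each chunk is a separate COPY, an abort only rolls back that one
chunk instead of the whole multi-million-row load.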
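And a hedged sketch of the "turn off flush to disk" idea: fsync is the
postgresql.conf setting in question, but, as Ron points out, a crash while it
is off can leave a corrupt database, so this only makes sense for a load you
can redo from scratch.  The data directory path below is an assumption.

PGDATA=/usr/local/pgsql/data                  # assumed data directory; adjust to your install

# before the load: append an override (the last entry in postgresql.conf wins)
echo "fsync = off" >> "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" restart

# ... run the bulk load here ...

# after the load: drop the override and restart so fsync is back on
grep -v '^fsync = off$' "$PGDATA/postgresql.conf" > "$PGDATA/postgresql.conf.tmp"
mv "$PGDATA/postgresql.conf.tmp" "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" restart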