Thread: Better way to bulk-load millions of CSV records into postgres?
Hi,

Currently, I've got a python script using pyPgSQL that parses each CSV
record, creates a string that is a big "INSERT INTO ... VALUES (...)"
command, then execute()s it.

top shows that this method uses postmaster with ~70% CPU utilization,
and python with ~15% utilization.

Still, it's only inserting ~190 recs/second.  Is there a better way to
do this, or am I constrained by the hardware?

Instead of python and postmaster having to do a ton of data xfer over
sockets, I'm wondering if there's a way to send a large number of csv
records (4000, for example) in one big chunk to a stored procedure and
have the engine process it all.

Linux 2.4.18
PostgreSQL 7.2.1
python 2.1.3
csv file on /dev/hda
table on /dev/hde (ATA/100)

-- 
+---------------------------------------------------------+
| Ron Johnson, Jr.    Home: ron.l.johnson@cox.net          |
| Jefferson, LA  USA  http://ronandheather.dhs.org:81      |
|                                                          |
| "I have created a government of whirled peas..."         |
|   Maharishi Mahesh Yogi, 12-May-2002,                    |
!   CNN, Larry King Live                                   |
+---------------------------------------------------------+
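[For reference, a minimal sketch of the row-at-a-time approach Ron
describes, assuming pyPgSQL's DB-API interface; the table name, file
name, quoting helper, and connect() arguments are illustrative, and the
naive comma split is exactly what breaks on quoted fields, as comes up
later in the thread.]

# Row-at-a-time INSERTs with periodic commits (a batch size of 4,000 is
# mentioned later in the thread).
from pyPgSQL import PgSQL

def sql_quote(val):
    # very naive quoting, for illustration only
    return "'" + val.replace("'", "''") + "'"

conn = PgSQL.connect(database="mydb")   # check the pyPgSQL docs for exact args
cur = conn.cursor()

count = 0
for line in open("data.csv"):
    fields = line.rstrip("\n").split(",")        # breaks on "quoted, fields"
    values = ", ".join([sql_quote(f) for f in fields])
    cur.execute("INSERT INTO my_table VALUES (%s)" % values)
    count = count + 1
    if count % 4000 == 0:
        conn.commit()                            # commit in batches

conn.commit()
conn.close()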
Ron,

> Currently, I've got a python script using pyPgSQL that parses each CSV
> record, creates a string that is a big "INSERT INTO ... VALUES (...)"
> command, then execute()s it.

What's wrong with the COPY command?

> top shows that this method uses postmaster with ~70% CPU utilization,
> and python with ~15% utilization.
>
> Still, it's only inserting ~190 recs/second.  Is there a better way to
> do this, or am I constrained by the hardware?

This sounds pretty good for an ATA system.  Upgrading to SCSI-RAID will
also improve your performance.

-Josh Berkus
> -----Original Message-----
> From: pgsql-novice-owner@postgresql.org
> [mailto:pgsql-novice-owner@postgresql.org] On Behalf Of Ron Johnson
> Sent: Tuesday, May 21, 2002 4:40 PM
> To: PgSQL Novice ML
> Subject: [NOVICE] Better way to bulk-load millions of CSV records into
> postgres?
>
> Currently, I've got a python script using pyPgSQL that parses each CSV
> record, creates a string that is a big "INSERT INTO ... VALUES (...)"
> command, then execute()s it.
>
> Instead of python and postmaster having to do a ton of data xfer over
> sockets, I'm wondering if there's a way to send a large number of csv
> records (4000, for example) in one big chunk to a stored procedure and
> have the engine process it all.

You could change your Python script to output a COPY command, which is
*much* faster than individual INSERT commands.

- J. Joel BURTON | joel@joelburton.com | joelburton.com | aim: wjoelburton
  Knowledge Management & Technology Consultant
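[A sketch of Joel's suggestion: instead of issuing INSERTs, have the
Python script emit a COPY script that psql can replay.  The table name,
file names, NULL handling, and the use of the standard csv module
(added in Python releases after the 2.1.3 in this thread) are
assumptions.]

import csv

def copy_escape(value):
    # escape a value for COPY's tab-delimited text format
    if value == "":
        return "\\N"                 # treat empty fields as NULL (a choice)
    value = value.replace("\\", "\\\\")
    value = value.replace("\t", "\\t")
    return value.replace("\n", "\\n")

out = open("load.sql", "w")
out.write("COPY my_table FROM stdin;\n")
for row in csv.reader(open("data.csv")):      # csv.reader copes with quoted commas
    out.write("\t".join([copy_escape(f) for f in row]) + "\n")
out.write("\\.\n")                            # end-of-data marker for COPY
out.close()

# then load it with:  psql mydb -f load.sql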
Have you looked at the COPY command in psql for this?  There are the
FROM { 'filename' | stdin } and [ [USING] DELIMITERS ] clauses that may
be of use.  It seems to me that bulk loading is what they were designed
for.

ts

----- Original Message -----
From: "Ron Johnson" <ron.l.johnson@cox.net>
To: "PgSQL Novice ML" <pgsql-novice@postgresql.org>
Sent: Tuesday, May 21, 2002 3:40 PM
Subject: [NOVICE] Better way to bulk-load millions of CSV records into postgres?

> Still, it's only inserting ~190 recs/second.  Is there a better way to
> do this, or am I constrained by the hardware?
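[A sketch of driving Tom's server-side COPY from the existing Python
script, using the 7.2-era syntax that appears later in the thread.  The
driver import, connect() arguments, table name, and path are
placeholders; COPY from a named file also requires superuser rights and
a path readable by the backend, and it assumes the file's quoting has
already been cleaned up.]

from pyPgSQL import PgSQL

conn = PgSQL.connect(database="mydb")   # check the pyPgSQL docs for exact args
cur = conn.cursor()

# one statement loads the whole file, inside a single transaction
cur.execute("COPY my_table FROM '/var/tmp/data.unquoted' USING DELIMITERS ','")
conn.commit()
conn.close()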
On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> Have you looked at the COPY command in psql for this?  There are the
> FROM { 'filename' | stdin } and [ [USING] DELIMITERS ] clauses that may
> be of use.  It seems to me that bulk loading is what they were designed
> for.
>
> ts

For very large datasets that can be a problem: it is 1 transaction, and
the transaction logs must be kept until it finishes or aborts.  This can
be a big disk hit.

If it is just a plain csv file you can use split to make 1 file into
several smaller files and load each one separately.

You can look at transactions and do roughly the same thing from python,
commit every 10,000 rows.

There is a variable in the config file that allows you to turn off flush
to disk.  If you do that for the load you will have better load speed.
Turn it back on when you are done with the load.

And read the admin guide twice, it will help.

marc
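[A sketch of Marc's split-and-load idea driven from Python: write the
big CSV back out in chunks and COPY each chunk in its own transaction,
so an abort only loses one chunk and the transaction log stays small.
The driver import, connect() arguments, table name, paths, and chunk
size are placeholders; file-based COPY needs superuser rights and a path
the backend can read, and the data is assumed to be quote-free already.
The "flush to disk" variable Marc mentions is the fsync setting in
postgresql.conf.]

from pyPgSQL import PgSQL

CHUNK = 10000

def load_chunk(conn, cur, rows, n):
    path = "/var/tmp/chunk_%d.csv" % n
    f = open(path, "w")
    f.writelines(rows)
    f.close()                       # make sure the chunk is on disk before COPY reads it
    # 7.2-era COPY syntax; one transaction per chunk
    cur.execute("COPY my_table FROM '%s' USING DELIMITERS ','" % path)
    conn.commit()

conn = PgSQL.connect(database="mydb")
cur = conn.cursor()

rows = []
n = 0
for line in open("data.csv"):
    rows.append(line)
    if len(rows) == CHUNK:
        load_chunk(conn, cur, rows, n)
        rows = []
        n = n + 1
if rows:
    load_chunk(conn, cur, rows, n)
conn.close()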
On Wed, 2002-05-22 at 09:19, Tom Sheehan wrote:
> Have you looked at the COPY command in psql for this?  There are the
> FROM { 'filename' | stdin } and [ [USING] DELIMITERS ] clauses that may
> be of use.  It seems to me that bulk loading is what they were designed
> for.

It doesn't handle csv files where there are double-quotes around each
field.  Also, it runs in 1 big transaction, and if it fails, you must
load the whole thing over again...

From the mailing list archives, I see where COPY was only designed for
pg_dump, but has had a few features added to it.  It was not designed to
be a full-featured bulk loader/unloader.

-- 
+---------------------------------------------------------+
| Ron Johnson, Jr.    Home: ron.l.johnson@cox.net          |
| Jefferson, LA  USA  http://ronandheather.dhs.org:81      |
|                                                          |
| "I have created a government of whirled peas..."         |
|   Maharishi Mahesh Yogi, 12-May-2002,                    |
!   CNN, Larry King Live                                   |
+---------------------------------------------------------+
On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> For very large datasets that can be a problem: it is 1 transaction, and
> the transaction logs must be kept until it finishes or aborts.  This
> can be a big disk hit.
>
> If it is just a plain csv file you can use split to make 1 file into
> several smaller files and load each one separately.

That's a possibility, but that would create _lots_ of little files...
Much more convenient to have 1 big file, instead of 1 big file and many
dozens of small files.

> You can look at transactions and do roughly the same thing from python,
> commit every 10,000 rows.

That's exactly what I'm doing, but with a commit count of 4,000.

> There is a variable in the config file that allows you to turn off
> flush to disk.  If you do that for the load you will have better load
> speed.  Turn it back on when you are done with the load.

This is presuming that one could bounce postmaster (not always
feasible).  Also, as many have said, if An Accident Happens, and
postmaster terminates for whatever reason, you are left with a corrupt
database, and must reload _everything_ from _every_ table.  Blech...

> And read the admin guide twice, it will help.

-- 
+---------------------------------------------------------+
| Ron Johnson, Jr.    Home: ron.l.johnson@cox.net          |
| Jefferson, LA  USA  http://ronandheather.dhs.org:81      |
|                                                          |
| "I have created a government of whirled peas..."         |
|   Maharishi Mahesh Yogi, 12-May-2002,                    |
!   CNN, Larry King Live                                   |
+---------------------------------------------------------+
On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> That's a possibility, but that would create _lots_ of little files...
> Much more convenient to have 1 big file, instead of 1 big file and many
> dozens of small files.

Well, you do not keep the small files, just create, load, delete.  And I
routinely load 1 million+ rows of data using "copy into" from psql.
Here is a sample:

for i in load_data/* ;do
    echo "datafile $i"
    awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"; $0=$0} print $0}' $i >$i.tmp
    mv $i.tmp $i
    grep -E "[0-9]+([.][0-9]+)+" $i
    grep -vE "[0-9]+([.][0-9]+)+" $i >$i.tmp
    mv $i.tmp $i
    echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
done

This is part of a production job doing 1 million+ rows/day on 7.1.3 in
under 1 hour.  After everything is done I clean up like this:

cat <<EOT | /usr/local/bin/psql call_me_bob
delete from ubr_info where sampletime < current_date - '3 days'::interval;
vacuum analyze;
-- I think that the 7.2.X command would be vacuum full analyze, or close
reindex table ubr_info;
EOT

Reindexing also helps a lot, since I delete a lot of rows daily.  This
is running on a PII 450 with ide raid, os freebsd 4.4 rc; I really need
to update that.

> > You can look at transactions and do roughly the same thing from
> > python, commit every 10,000 rows.
>
> That's exactly what I'm doing, but with a commit count of 4,000.
>
> > There is a variable in the config file that allows you to turn off
> > flush to disk.  If you do that for the load you will have better
> > load speed.  Turn it back on when you are done with the load.
>
> This is presuming that one could bounce postmaster (not always
> feasible).  Also, as many have said, if An Accident Happens, and
> postmaster terminates for whatever reason, you are left with a corrupt
> database, and must reload _everything_ from _every_ table.  Blech...

Well, you could always do it in C; python is slow, from what I hear.

good luck

marc
On Wed, 2002-05-22 at 13:11, Marc Spitzer wrote:
[snip]
> for i in load_data/* ;do
>     echo "datafile $i"
>     awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"; $0=$0} print $0}' $i >$i.tmp
>     mv $i.tmp $i
>     grep -E "[0-9]+([.][0-9]+)+" $i
>     grep -vE "[0-9]+([.][0-9]+)+" $i >$i.tmp
>     mv $i.tmp $i
>     echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
> done
[snip]

I'm not an awk programmer.  What does that command do?

Also, all my fields have double-quotes around them.  Is there a tool (or
really clever use of sed) that will strip them away from the fields that
don't need them?  I actually have _comma_ delimited files, and any
fields with commas in them need the double quotes...

-- 
+---------------------------------------------------------+
| Ron Johnson, Jr.    Home: ron.l.johnson@cox.net          |
| Jefferson, LA  USA  http://ronandheather.dhs.org:81      |
|                                                          |
| "I have created a government of whirled peas..."         |
|   Maharishi Mahesh Yogi, 12-May-2002,                    |
!   CNN, Larry King Live                                   |
+---------------------------------------------------------+
On Wed, May 22, 2002 at 01:51:45PM -0500, Ron Johnson wrote:
[snip]
> I'm not an awk programmer.  What does that command do?

It splits a line on the commas (-F,) and prints the output with a field
separator of comma (OFS=",").  Then, for each row, it checks the 15th
field for junk (things with a '.' in them) and substitutes -1 for the
junk.

> Also, all my fields have double-quotes around them.  Is there a tool
> (or really clever use of sed) that will strip them

tr will take care of that, if you just want to delete them:

tr -d "\"" <file >newfile

should do it; my syntax might be a little off, check the man page.

Another question is: do you need the quotes around any of the fields?
Is it possible that you have a field separator (a comma, for example)
embedded in any of the fields?  If the answer is yes or perhaps, then it
gets harder to deal with.

A quick check can be done using this script (it assumes the field
separator is a comma):

awk -F, '{print NF}' <data_file | sort | uniq -c

This will tell you how many lines have a given number of fields.  If
things are good, they should all have the same number, and it should
agree with what you think it is.  NF is defined in the awk man page.

> away from the fields that don't need them?  I actually have _comma_
> delimited files, and any fields with commas in them need the double
> quotes...

In that case there is perl and DBI + DBD::CSV, which handles all the
special cases for you.  I think there is an awk library that does this
also, if you want to look for it.  Or, if possible, you could use a
different character as a field separator; semi-colon, colon, and pipe
come to mind.

Good luck

marc
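[A sketch of the same "parse the quotes properly, then re-emit with a
safer delimiter" idea in Python rather than perl/DBI, since the loader
in this thread is already Python.  It uses the standard csv module
(added in Python releases after the 2.1.3 mentioned here); the file
names, the pipe delimiter, and the table name are illustrative.]

import csv

reader = csv.reader(open("data.csv"))     # handles "quoted, fields" correctly
out = open("data.pipe", "w")

for row in reader:
    for field in row:
        if "|" in field:
            raise ValueError("field contains the chosen delimiter: %r" % field)
    out.write("|".join(row) + "\n")
out.close()

# then, in psql (7.2-era syntax):
#   COPY my_table FROM '/path/to/data.pipe' USING DELIMITERS '|'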