Thread: Better way to bulk-load millions of CSV records into postgres?

Better way to bulk-load millions of CSV records into postgres?

From
Ron Johnson
Date:
Hi,

Currently, I've got a python script using pyPgSQL that
parses the CSV record, creates a string that is a big
"INSERT INTO VALUES (...)" command, then, execute() it.

top shows that this method uses postmaster with ~70% CPU
utilization, and python with ~15% utilization.

Still, it's only inserting ~190 recs/second.  Is there a
better way to do this, or am I constrained by the hardware?

Instead of python and postmaster having to do a ton of data
xfer over sockets, I'm wondering if there's a way to send a
large number of csv records (4000, for example) in one big
chunk to a stored procedure and have the engine process it
all.
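
Roughly, the per-record loop looks like this (a simplified sketch; the
table, file, and helper names are just stand-ins, and the real CSV
parsing and SQL quoting are more involved):

from pyPgSQL import PgSQL

db = PgSQL.connect(database='mydb')     # stand-in connection details
cur = db.cursor()

for line in open('data.csv'):
    fields = parse_csv_record(line)     # stand-in for the real CSV parsing
    sql = "INSERT INTO mytable VALUES (%s)" % ", ".join(fields)
    cur.execute(sql)                    # one statement per record
db.commit()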

Linux 2.4.18
PostgreSQL 7.2.1
python 2.1.3
csv file on /dev/hda
table on /dev/hde  (ATA/100)

--
+---------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81 |
|                                                         |
| "I have created a government of whirled peas..."        |
|   Maharishi Mahesh Yogi, 12-May-2002,                   |
!   CNN, Larry King Live                                  |
+---------------------------------------------------------+


Re: Better way to bulk-load millions of CSV records into postgres?

From
Josh Berkus
Date:
Ron,

> Currently, I've got a python script using pyPgSQL that
> parses the CSV record, creates a string that is a big
> "INSERT INTO VALUES (...)" command, then, execute() it.

What's wrong with the COPY command?

> top shows that this method uses postmaster with ~70% CPU
> utilization, and python with ~15% utilization.
>
> Still, it's only inserting ~190 recs/second.  Is there a
> better way to do this, or am I constrained by the hardware?

This sounds pretty good for an ATA system.   Upgrading to SCSI-RAID will also
improve your performance.

-Josh Berkus

Re: Better way to bulk-load millions of CSV records into postgres?

From
"Joel Burton"
Date:
> Currently, I've got a python script using pyPgSQL that
> parses the CSV record, creates a string that is a big
> "INSERT INTO VALUES (...)" command, then, execute() it.
>
> top shows that this method uses postmaster with ~70% CPU
> utilization, and python with ~15% utilization.
>
> Still, it's only inserting ~190 recs/second.  Is there a
> better way to do this, or am I constrained by the hardware?
>
> Instead of python and postmaster having to do a ton of data
> xfer over sockets, I'm wondering if there's a way to send a
> large number of csv records (4000, for example) in one big
> chunk to a stored procedure and have the engine process it
> all.

You could change your Python script to output a COPY command, which is
*much* faster than individual INSERT commands.
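
Something along these lines (a rough sketch; the table name and the
parsing helper are placeholders for whatever the script uses now):

print "COPY mytable FROM stdin;"        # psql reads the rows that follow
for line in open('data.csv'):
    fields = parse_csv_record(line)     # placeholder for the existing parsing
    print '\t'.join(fields)             # COPY's default delimiter is tab
print '\\.'                             # end-of-data marker
# (fields containing tabs, backslashes or newlines still need escaping,
#  and empty fields should become \N if they are meant to be NULL)

Then pipe the output into psql, e.g.:  python gencopy.py | psql mydb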

- J.

Joel BURTON | joel@joelburton.com | joelburton.com | aim: wjoelburton
Knowledge Management & Technology Consultant


Re: Better way to bulk-load millions of CSV records into postgres?

From
Tom Sheehan
Date:
Have you looked at the COPY command in psql for this?  There are the
'FROM { 'filename' | stdin }' and '[ [USING] DELIMITERS ]' clauses that may
be of use.  It seems to me that bulk loading is what they were designed for.
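
For example, with placeholder names (the 'filename' form is read by the
backend on the server, so the file must be readable there; stdin goes
through the client connection instead):

from pyPgSQL import PgSQL

db = PgSQL.connect(database='mydb')
cur = db.cursor()
# server-side file, comma-delimited
cur.execute("COPY mytable FROM '/path/to/data.csv' USING DELIMITERS ','")
db.commit()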

ts


Re: Better way to bulk-load millions of CSV records into postgres?

From
Marc Spitzer
Date:
On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> Have you looked at the COPY command in psql for this?  There are the
> 'FROM { 'filename' | stdin }' and '[ [USING] DELIMITERS ]' clauses that
> may be of use.  It seems to me that bulk loading is what they were
> designed for.
>
> ts

For very large datasets that can be a problem: it is 1 transaction, and
the transaction logs must be kept until it finishes or aborts.  This
can be a big disk hit.

If it is just a plain csv file you can use split to make 1 file into
several smaller files and load each one separately.

you can look at transactions and do roughly the same thing from
python, commit every 10,000 rows.
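
roughly like this (sketch only, names made up):

from pyPgSQL import PgSQL

db = PgSQL.connect(database='mydb')
cur = db.cursor()
count = 0
for line in open('data.csv'):
    cur.execute(build_insert(line))   # build_insert stands for whatever
    count = count + 1                 # builds the INSERT string today
    if count % 10000 == 0:
        db.commit()                   # keep each transaction to 10,000 rows
db.commit()                           # commit the final partial batch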

There is a variable in the config file that allows you to turn off
flushing to disk.  If you do that for the load you will get better load
speed.  Turn it back on when you are done with the load.

and read the admin guide twice, it will help.

marc




Re: Better way to bulk-load millions of CSV records into postgres?

From
Ron Johnson
Date:
On Wed, 2002-05-22 at 09:19, Tom Sheehan wrote:
> Have you looked at the COPY command in psql for this?  There are the
> 'FROM { 'filename' | stdin }' and '[ [USING] DELIMITERS ]' clauses that
> may be of use.  It seems to me that bulk loading is what they were
> designed for.

It doesn't handle csv files where there are double-quotes
around each field.  Also, it runs in 1 big transaction,
and if it fails, you must load the whole thing over again...

From the mailing list archives, I see that COPY was only
designed for pg_dump, but has had a few features added to
it.  It was not designed to be a full-featured bulk loader/unloader.
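
One workaround would be to strip the quotes in python first, something
like this (sketch only; 'csv' here stands for a CSV-parsing module that
understands the quoting, which the standard csv module, where available,
does):

import csv

out = open('data.tab', 'w')
for row in csv.reader(open('data.csv')):
    # csv.reader strips the double-quotes and copes with commas inside
    # fields; re-emit tab-delimited so COPY's default format accepts it
    # (backslashes and tabs inside fields would still need escaping)
    out.write('\t'.join(row) + '\n')
out.close()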


--
+---------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81 |
|                                                         |
| "I have created a government of whirled peas..."        |
|   Maharishi Mahesh Yogi, 12-May-2002,                   |
!   CNN, Larry King Live                                  |
+---------------------------------------------------------+


Re: Better way to bulk-load millions of CSV records into postgres?

From
Ron Johnson
Date:
On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> > Have you looked at the COPY command in psql for this?  There are the
> > 'FROM { 'filename' | stdin }' and '[ [USING] DELIMITERS ]' clauses that
> > may be of use.  It seems to me that bulk loading is what they were
> > designed for.
> >
> > ts
>
> For very large datasets that can be a problem: it is 1 transaction, and
> the transaction logs must be kept until it finishes or aborts.  This
> can be a big disk hit.
>
> If it is just a plain csv file you can use split to make 1 file into
> several smaller files and load each one separately.

That's a possibility, but that would create _lots_ of little
files...  Much more convenient to have 1 big file, instead of
1 big file and many dozens of small files.

> you can look at transactions and do roughly the same thing from
> python, commit every 10,000 rows.

That's exactly what I'm doing, but with commit count of 4,000.

> There is a variable in the config file that allows you to turn off
> flushing to disk.  If you do that for the load you will get better load
> speed.  Turn it back on when you are done with the load.

This is presuming that one could bounce postmaster (not always
feasible).  Also, as many have said, if An Accident Happens,
and postmaster terminates for whatever reason, you are left
with a corrupt database, and must reload _everything_ from
_every_ table.  Blech...

> and read the admin guide twice, it will help.


--
+---------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81 |
|                                                         |
| "I have created a government of whirled peas..."        |
|   Maharishi Mahesh Yogi, 12-May-2002,                   |
!   CNN, Larry King Live                                  |
+---------------------------------------------------------+


Re: Better way to bulk-load millions of CSV records into postgres?

From
Marc Spitzer
Date:
On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
[snip]
>
> That's a possibility, but that would create _lots_ of little
> files...  Much more convenient to have 1 big file, instead of
> 1 big file and many dozens of small files.

Well, you do not keep the small files, just create, load, delete.  And I
routinely load 1 million+ rows of data using COPY from psql;
here is a sample:

for i in load_data/* ;do
echo "datafile $i"
awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"; $0=$0} print $0}' $i >$i.tmp
mv $i.tmp $i
grep -E "[0-9]+([.][0-9]+)+" $i
grep -vE "[0-9]+([.][0-9]+)+" $i >$i.tmp
mv $i.tmp $i
echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
done

This is part of a production job doing 1 million+ rows/day on 7.1.3 in
under 1 hour.  After everything is done I clean up like this:

cat <<EOT|/usr/local/bin/psql call_me_bob

delete from ubr_info
where sampletime < current_date - '3 days'::interval;

vacuum analyze;
-- I think that the 7.2.x command would be vacuum full analyze, or close
reindex table ubr_info;
EOT

Reindexing also helps a lot; I delete a lot of rows daily.

This is running on a PII 450 with IDE RAID, OS FreeBSD 4.4-RC; I really
need to update that.
>
> > you can look at transactions and do roughly the same thing from
> > python, commit every 10,000 rows.
>
> That's exactly what I'm doing, but with commit count of 4,000.
>
> > the is a varable in config files that allows you to turn off flush
> > to disk.  If you do that for the load you will have better load
> > speed.  Turn it back on when you are done with the load.
>
> This is presuming that one could bounce postmaster (not always
> feasible).  Also, as many have said, if An Accident Happens,
> and postmaster terminates for what ever reason, you are left
> with a corrupt database, and must reload _everything_ from
> _every_ table.  Blech...
>

Well, you could always do it in C; python is slow, from
what I hear.

good luck

marc


Re: Better way to bulk-load millions of CSV records into postgres?

From
Ron Johnson
Date:
On Wed, 2002-05-22 at 13:11, Marc Spitzer wrote:
> On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> > On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> > > On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
[snip]
> for i in load_data/* ;do
> echo "datafile $i"
> awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"; $0=$0} print $0}' $i >$i.tmp
> mv $i.tmp $i
> grep -E "[0-9]+([.][0-9]+)+" $i
> grep -vE "[0-9]+([.][0-9]+)+" $i >$i.tmp
> mv $i.tmp $i
> echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
> done
[snip]

I'm not an awk programmer.  What does that command do?

Also, all my fields have double-quotes around them.  Is there
a tool (or really clever use of sed) that will strip them
away from the fields that don't need them?  I actually have
_comma_ delimited files, and any fields with commas in them
need the double quotes...

--
+---------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81 |
|                                                         |
| "I have created a government of whirled peas..."        |
|   Maharishi Mahesh Yogi, 12-May-2002,                   |
!   CNN, Larry King Live                                  |
+---------------------------------------------------------+


Re: Better way to bulk-load millions of CSV records into postgres?

From
Marc Spitzer
Date:
On Wed, May 22, 2002 at 01:51:45PM -0500, Ron Johnson wrote:
> On Wed, 2002-05-22 at 13:11, Marc Spitzer wrote:
> > On Wed, May 22, 2002 at 12:48:58PM -0500, Ron Johnson wrote:
> > > On Wed, 2002-05-22 at 11:18, Marc Spitzer wrote:
> > > > On Wed, May 22, 2002 at 09:19:31AM -0500, Tom Sheehan wrote:
> [snip]
> > for i in load_data/* ;do
> > echo "datafile $i"
> > awk -F, 'BEGIN{OFS=","}{if ($15~/[.]/){$15="-1"; $0=$0} print $0}' $i >$i.tmp
> > mv $i.tmp $i
> > grep -E "[0-9]+([.][0-9]+)+" $i
> > grep -vE "[0-9]+([.][0-9]+)+" $i >$i.tmp
> > mv $i.tmp $i
> > echo "copy call_me_bob from '/home/marc/projects/bobs_house/$i' using Delimiters ',' with null $
> > done
> [snip]
>
> I'm not an awk programmer.  What does that command do?
>

It splits a line on the commas ( -F, ) and prints the output with a
field separator of comma (OFS=",").  Then, for each row, it checks the
15th field for junk (things with a '.' in them) and substitutes -1 for
the junk.

> Also, all my fields have double-quotes around them.  Is there
> a tool (or really clever use of sed) that will strip them

tr will take care of that, if you just want to delete them:

tr -d "\"" <file >newfile

should do it; my syntax might be a little off, check the man page.

Another question: do you need the quotes around any of the fields?  Is
it possible that you have a field separator (a comma, for example)
embedded in any of the fields?  If the answer is yes or perhaps, then it
gets harder to deal with.

a quick check can be done using this script:
(assumes the field separator is a comma)
awk -F, '{print NF }' <data_file |sort |uniq -c

This will tell you how many lines have a given number of fields; if
things are good, they should all have the same number, and it should
agree with what you think it is.  NF is defined in the awk man page.

> away from the fields that don't need them?  I actually have
> _comma_ delimited files, and any fields with commas in them
> need the double quotes...
>

In that case there is Perl and DBI + DBD::CSV, which handles all
the special cases for you.  I think there is an awk library that
does this also, if you want to look for it.

Or, if possible, you could use a different character as a field
separator; semicolon, colon, and pipe come to mind.

Good luck

marc


