Thread: bulk inserts

bulk inserts

From
Dave Huber
Date:
Hi, I'm fairly new to postgres and am having trouble finding what I'm looking for. Is there a feature that allows bulk inserts into tables? My setup is Win XPe 2002 SP3 and PostgreSQL 8.3. I need to add entries from a file where each file contains 250 - 500 records. The files are created by a third-party and read/written as binary. The table looks like the following:
 
CREATE TABLE data_log_20msec_table
(
  log_id bigserial NOT NULL,
  timestamp_dbl double precision,
  data bytea,
  CONSTRAINT data_log_20msec_table_pkey PRIMARY KEY (log_id)
)
WITH (OIDS=FALSE);
ALTER TABLE data_log_20msec_table OWNER TO postgres;
-- Index: data_log_20msec_table_timestamp_index
-- DROP INDEX data_log_20msec_table_timestamp_index;
CREATE INDEX data_log_20msec_table_timestamp_index
  ON data_log_20msec_table
  USING btree
  (timestamp_dbl);
The current method for transferring records from the file to postgres is using a prepared statement that is called iteratively on each record read from the file:
 
INSERT INTO data_log_20msec_table (timestamp_dbl,data) VALUES ($1::double precision,$2::bytea)
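
For reference, the loop driving that statement looks roughly like the following (a simplified sketch with made-up names, not the actual code):

/* Simplified sketch of the per-record path, with made-up names (not the
   actual application code): prepare once, then one PQexecPrepared() call
   -- and hence one round trip -- per record read from the file. */
#include <stdio.h>
#include <libpq-fe.h>

static int insert_records(PGconn *conn, int nrecords,
                          const double *timestamps,
                          const unsigned char *const *blobs,
                          const int *blob_lens)
{
    PGresult *res = PQprepare(conn, "ins_log",
        "INSERT INTO data_log_20msec_table (timestamp_dbl, data) "
        "VALUES ($1::double precision, $2::bytea)", 2, NULL);
    if (PQresultStatus(res) != PGRES_COMMAND_OK) {
        fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }
    PQclear(res);

    for (int i = 0; i < nrecords; i++) {
        char ts[64];
        const char *values[2];
        int lengths[2] = { 0, blob_lens[i] };
        int formats[2] = { 0, 1 };   /* text timestamp, binary bytea */

        snprintf(ts, sizeof ts, "%.17g", timestamps[i]);
        values[0] = ts;
        values[1] = (const char *) blobs[i];

        res = PQexecPrepared(conn, "ins_log", 2, values, lengths, formats, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }
    return 0;
}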
 
Using COPY is out of the question as the file is not formatted for that and since other operations need to occur, the file needs to be read sequentially anyway.
 
Any amount of help would be gladly accepted, even if it's pointing me to another thread or somewhere in the manual. Thanks,
 
Dave Huber



Re: bulk inserts

From
Martijn van Oosterhout
Date:
On Mon, Sep 28, 2009 at 10:38:05AM -0500, Dave Huber wrote:
> Hi, I'm fairly new to postgres and am having trouble finding what I'm
> looking for. Is there a feature that allows bulk inserts into tables?
> My setup is Win XPe 2002 SP3 and PostgreSQL 8.3. I need to add
> entries from a file where each file contains 250 - 500 records. The
> files are created by a third-party and read/written as binary. The
> table looks like the following:

<snip>

> The current method for transferring records from the file to postgres is
> using a prepared statement that is called iteratively on each record read
> from the file:
>
> INSERT INTO data_log_20msec_table (timestamp_dbl,data) VALUES ($1::double precision,$2::bytea)
>
> Using COPY is out of the question as the file is not formatted for that and
> since other operations need to occur, the file needs to be read sequentially
> anyway.

The usual approach is to use COPY FROM STDIN, then use PQputCopyData
(or whatever it's called). That way you can perform any necessary
munging and don't require the file to be on disk at all.
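
A rough, untested sketch of that approach (assuming an open connection and records already munged into text form; all names here are made up):

/* Rough, untested sketch: start a text-format COPY, stream one
   tab-separated line per record with PQputCopyData(), then finish with
   PQputCopyEnd().  Assumes the data field has already been escaped for
   COPY's text format. */
#include <stdio.h>
#include <libpq-fe.h>

static int copy_records(PGconn *conn, int nrecords,
                        const double *timestamps,
                        const char *const *escaped_data)
{
    PGresult *res = PQexec(conn,
        "COPY data_log_20msec_table (timestamp_dbl, data) FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN) {
        fprintf(stderr, "COPY did not start: %s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }
    PQclear(res);

    for (int i = 0; i < nrecords; i++) {
        char line[8192];   /* real code should check for truncation */
        int len = snprintf(line, sizeof line, "%.17g\t%s\n",
                           timestamps[i], escaped_data[i]);
        if (PQputCopyData(conn, line, len) != 1) {
            fprintf(stderr, "PQputCopyData failed: %s", PQerrorMessage(conn));
            return -1;
        }
    }

    if (PQputCopyEnd(conn, NULL) != 1) {
        fprintf(stderr, "PQputCopyEnd failed: %s", PQerrorMessage(conn));
        return -1;
    }

    /* The COPY's final status arrives as one more result. */
    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
    PQclear(res);
    return 0;
}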

Have a nice day,

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


Re: bulk inserts

From
Sam Mason
Date:
On Mon, Sep 28, 2009 at 10:38:05AM -0500, Dave Huber wrote:
> Using COPY is out of the question as the file is not formatted for
> that and since other operations need to occur, the file needs to be
> read sequentially anyway.

Just to expand on what Martijn said; if you can generate a set of EXECUTE
commands, you can certainly generate a COPY command to insert the same
data.  The advantage is a large drop in parse time for inserting larger
numbers of rows.  As you're saying you want to insert 500 rows, I'd
suggest at least trying to get COPY working.

--
  Sam  http://samason.me.uk/

Re: bulk inserts

From
Dave Huber
Date:
Thanks, Sam and Martijn. I am attempting to use the COPY command now. I had
misunderstood what was meant by STDIN and assumed I could only use a file for
my application and wasn't aware of PQputCopyData(). One assumption I am
operating under right now is that the format of the binary file is the same
as the buffer in PQputCopyData, including the header. If I am wrong, could
someone please let me know? Thanks,

Dave

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Sam Mason
Sent: Monday, September 28, 2009 3:53 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] bulk inserts

On Mon, Sep 28, 2009 at 10:38:05AM -0500, Dave Huber wrote:
> Using COPY is out of the question as the file is not formatted for
> that and since other operations need to occur, the file needs to be
> read sequentially anyway.

Just to expand on what Martin said; if you can generate a set of EXECUTE
commands, you can certainly generate a COPY command to insert the same
data.  The advantage is a large drop in parse time for inserting larger
numbers of rows.  As you're saying you want to insert 500 rows, I'd
suggest at least trying to get COPY working.

--
  Sam  http://samason.me.uk/





Re: bulk inserts

From
Sam Mason
Date:
On Mon, Sep 28, 2009 at 04:35:53PM -0500, Dave Huber wrote:
> One assumption I am operating under right now is
> that the format of the binary file is the same as the buffer in
> PQputCopyData, including the header.  If I am wrong, could someone
> please let me know? Thanks,

I've always used ASCII representations of the data; no need to worry
about formats then.  Not sure what sort of performance considerations
you have, but it's certainly less of a worry for me.  I'm reasonably
sure the binary format changes more often than you'd like, maybe
something like:

  http://libpqtypes.esilo.com/

would help.  I can't see anything about COPY support, but it should help
with other things.
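
For the bytea column that means escaping twice: once for bytea's escape
syntax and once for COPY's own backslash handling in text format.  A rough,
untested helper (the name is made up) might look like:

/* Untested sketch: render raw bytes as a bytea field for text-format COPY.
   Every byte is written as "\\nnn" (octal): COPY's own parser turns "\\"
   into "\", and the bytea input routine then turns "\nnn" back into the
   raw byte.  Escaping every byte is wasteful but keeps the logic trivial. */
#include <stdio.h>
#include <stdlib.h>

char *escape_bytea_for_copy(const unsigned char *buf, int len)
{
    char *out = malloc((size_t) len * 5 + 1);   /* "\\nnn" is 5 chars per byte */
    char *p = out;

    if (out == NULL)
        return NULL;
    for (int i = 0; i < len; i++)
        p += sprintf(p, "\\\\%03o", buf[i]);
    *p = '\0';
    return out;                                 /* caller frees */
}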

--
  Sam  http://samason.me.uk/

Re: bulk inserts

From
Martin Gainty
Date:
INSERTs/UPDATEs are historically slow, especially when autocommit is on (which is the implied default):
the database writer actually stops any other processing and applies that one record to the database.
Most bulk operations such as import/export and COPY are well worth their weight as they apply en masse
before any commit .. remember, the DB actually stops, flushes its buffers to disk,
and then resumes .. the only solution here is to disable autocommit, but be wary: you may have hundreds of statements waiting to be committed, and then someone quits your session .. all your work is lost.

good call on copy
http://www.postgresql.org/docs/8.1/interactive/populate.html

cheers,
Martin Gainty





> Date: Mon, 28 Sep 2009 21:52:36 +0100
> From: sam@samason.me.uk
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] bulk inserts
>
> On Mon, Sep 28, 2009 at 10:38:05AM -0500, Dave Huber wrote:
> > Using COPY is out of the question as the file is not formatted for
> > that and since other operations need to occur, the file needs to be
> > read sequentially anyway.
>
> Just to expand on what Martin said; if you can generate a set of EXECUTE
> commands, you can certainly generate a COPY command to insert the same
> data. The advantage is a large drop in parse time for inserting larger
> numbers of rows. As you're saying you want to insert 500 rows, I'd
> suggest at least trying to get COPY working.
>
> --
> Sam http://samason.me.uk/
>



Re: bulk inserts

From
Sam Mason
Date:
On Mon, Sep 28, 2009 at 08:33:45PM -0400, Martin Gainty wrote:
> INSERTs/UPDATEs are historically slow, especially when autocommit is
> on (which is the implied default): the database writer actually stops
> any other processing and applies that one record to the database

That seems to be overstating the issue somewhat.  Each connection is
still independent and (assuming no locks are taken) will carry on as
such.

> Most bulk operations such as import/export and COPY are well worth
> their weight as they apply en masse before any commit ..

?? I'm not sure what you're implying about the semantics here, but it
doesn't seem right.  COPY doesn't somehow break out of ACID semantics,
it's only an *optimization* that allows you to get large quantities of
data into the database faster.  The main reason it's faster is because
parsing CSV data is easier than parsing SQL.

At least I think that's the only difference; anybody know better?

> remember, the DB actually stops, flushes its buffers to disk, and then
> resumes ..

The DB as a whole does not stop if you issue a commit; just your
session/connection.

> the only solution here is to disable autocommit, but be wary: you may
> have hundreds of statements waiting to be committed, and then someone
> quits your session .. all your work is lost

I'm not sure what you're saying here.  These are normal transactional
semantics and are what all ACID databases are specified to do.  You need
to issue a "COMMIT" for data to be committed.

--
  Sam  http://samason.me.uk/

Re: bulk inserts

From
Dave Huber
Date:

All I have to say is wow! COPY works sooo much faster than the iterative method I was using. Even after having to read the entire binary file and reformat the data into the binary format that postgres needs, it is an order of magnitude faster than using a prepared INSERT. At least that’s what my prelim testing is showing me. I will have to let it run for a while with a larger table to see how it performs in the long run, but initially this is very positive.

 

Thanks, guys!

 

Dave




Re: bulk inserts

From
Grzegorz Jaśkiewicz
Date:


On Tue, Sep 29, 2009 at 3:31 PM, Dave Huber <DHuber@letourneautechnologies.com> wrote:

All I have to say is wow! COPY works sooo much faster than the iterative method I was using. Even after having to read the entire binary file and reformat the data into the binary format that postgres needs, it is an order of magnitude faster than using a prepared INSERT. At least that’s what my prelim testing is showing me. I will have to let it run for a while with a larger table to see how it performs in the long run, but initially this is very positive.

You can also try wrapping the whole bunch of inserts in a transaction block.
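
For example (a sketch only; the per-record helper is hypothetical):

/* Sketch only: commit once per file instead of once per record.  The
   per-record helper here is hypothetical -- e.g. the PQexecPrepared()
   loop body from earlier in the thread. */
#include <libpq-fe.h>

void insert_one_record(PGconn *conn, const void *record);   /* hypothetical */

void insert_file(PGconn *conn, const void *const *records, int nrecords)
{
    PQclear(PQexec(conn, "BEGIN"));
    for (int i = 0; i < nrecords; i++)
        insert_one_record(conn, records[i]);
    PQclear(PQexec(conn, "COMMIT"));
}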


--
GJ

Re: bulk inserts

From
Alan Hodgson
Date:
On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
> ?? I'm not sure what you're implying about the semantics here, but it
> doesn't seem right.  COPY doesn't somehow break out of ACID semantics,
> it's only an *optimization* that allows you to get large quantities of
> data into the database faster.  The main reason it's faster is because
> parsing CSV data is easier than parsing SQL.
>
> At least I think that's the only difference; anybody know better?

I think a big reason is also that the client can stream the data without
waiting for a network round trip ack on every statement.


--
"No animals were harmed in the recording of this episode. We tried but that
damn monkey was just too fast."

Re: bulk inserts

From
Sam Mason
Date:
On Tue, Sep 29, 2009 at 08:45:55AM -0700, Alan Hodgson wrote:
> On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
> > it's faster is because
> > parsing CSV data is easier than parsing SQL.
> >
> > At least I think that's the only difference; anybody know better?
>
> I think a big reason is also that the client can stream the data without
> waiting for a network round trip ack on every statement.

I don't think so.  I'm pretty sure you can send multiple statements in a
single round trip.  libpq is defined to work in such cases anyway:

  http://www.postgresql.org/docs/current/static/libpq-exec.html
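
For example (only a sketch, with made-up values):

/* Just a sketch (the values here are made up): several statements packed
   into one PQexec() call travel in a single round trip.  PQexec() returns
   only the last statement's result, and the whole string runs as one
   implicit transaction unless it contains explicit BEGIN/COMMIT. */
#include <libpq-fe.h>

void batched_inserts(PGconn *conn)
{
    PGresult *res = PQexec(conn,
        "INSERT INTO data_log_20msec_table (timestamp_dbl, data) VALUES (1.0, '');"
        "INSERT INTO data_log_20msec_table (timestamp_dbl, data) VALUES (2.0, '');"
        "INSERT INTO data_log_20msec_table (timestamp_dbl, data) VALUES (3.0, '');");
    PQclear(res);
}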

--
  Sam  http://samason.me.uk/

Re: bulk inserts

From
Alan Hodgson
Date:
On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
> > I think a big reason is also that the client can stream the data
> > without waiting for a network round trip ack on every statement.
>
> I don't think so.  I'm pretty sure you can send multiple statements in a
> single round trip.  libpq is defined to work in such cases anyway:
>
>   http://www.postgresql.org/docs/current/static/libpq-exec.html
>

I'm sure you probably _can_, but how many programming loops do so?


--
"No animals were harmed in the recording of this episode. We tried but that
damn monkey was just too fast."

Re: bulk inserts

From
Tom Lane
Date:
Sam Mason <sam@samason.me.uk> writes:
> On Tue, Sep 29, 2009 at 08:45:55AM -0700, Alan Hodgson wrote:
>> I think a big reason is also that the client can stream the data without
>> waiting for a network round trip ack on every statement.

> I don't think so.  I'm pretty sure you can send multiple statements in a
> single round trip.

You can, but that doesn't scale to megabytes of data (at least not
well).

I think the big points are elimination of per-row network and
transaction commit overhead ... but there are some other optimizations
in the COPY path too.

            regards, tom lane

Re: bulk inserts

From
Sam Mason
Date:
On Tue, Sep 29, 2009 at 09:11:19AM -0700, Alan Hodgson wrote:
> On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
> > I'm pretty sure you can send multiple statements in a
> > single round trip.  libpq is defined to work in such cases anyway:
> >
> >   http://www.postgresql.org/docs/current/static/libpq-exec.html
>
> I'm sure you probably _can_, but how many programming loops do so?

It's not a very big sample, but I'm pretty sure I do! :)

--
  Sam  http://samason.me.uk/

Re: bulk inserts

From
Sam Mason
Date:
On Tue, Sep 29, 2009 at 12:17:51PM -0400, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > On Tue, Sep 29, 2009 at 08:45:55AM -0700, Alan Hodgson wrote:
> >> I think a big reason is also that the client can stream the data without
> >> waiting for a network round trip ack on every statement.
>
> > I don't think so.  I'm pretty sure you can send multiple statements in a
> > single round trip.
>
> You can, but that doesn't scale to megabytes of data (at least not
> well).

No, but I didn't think that was being talked about.  I was thinking that
network round-trip time becomes a thousand times less important when
you're putting a thousand statements together.  This would seem to imply
that network latency can be almost arbitrarily reduced.

> I think the big points are elimination of per-row network and
> transaction commit overhead ...

Well, if you start including transaction commit then you've just changed
semantics away from COPY.  I was implicitly thinking of what changes
when you keep the same semantics as COPY.

> but there are some other optimizations
> in the COPY path too.

Cool, I'll continue to prefer COPY then!

--
  Sam  http://samason.me.uk/

Re: bulk inserts

From
Jasen Betts
Date:
On 2009-09-29, Alan Hodgson <ahodgson@simkin.ca> wrote:
> On Tuesday 29 September 2009, Sam Mason <sam@samason.me.uk> wrote:
>> ?? I'm not sure what you're implying about the semantics here, but it
>> doesn't seem right.  COPY doesn't somehow break out of ACID semantics,
>> it's only an *optimization* that allows you to get large quantities of
>> data into the database faster.  The main reason it's faster is because
>> parsing CSV data is easier than parsing SQL.
>>
>> At least I think that's the only difference; anybody know better?
>
> I think a big reason is also that the client can stream the data without
> waiting for a network round trip ack on every statement.

A single INSERT statement can insert many rows.

The win with COPY is more that the data part can be parsed one record
at a time, whereas for an INSERT the whole statement must be parsed;
also, the quoting rules are simpler for CSV or pg's tab-delimited format.