Thread: slow speeds after 2 million rows inserted

From
James Neff
Date:
Greetings,

I've got a Java application that reads data from a flat file and
inserts it into a table.  The first 2 million rows (each file
contained about 1 million lines) went in pretty fast: less than 40 minutes
to insert into the database.

After that, the insert speed is slow.  On the third file, I think I may
be able to type the data in faster than the Java application is
inserting it.

Table looks like this:

CREATE TABLE data_archive
(
  id serial NOT NULL,
  batchid integer NOT NULL,
  claimid character varying(25) NOT NULL,
  memberid character varying(45) NOT NULL,
  raw_data text NOT NULL,
  status integer DEFAULT 0,
  line_number integer,
  CONSTRAINT data_archive_pkey PRIMARY KEY (id)
)


there is also an index on batchid.

The insert command is like so:

"INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status,
line_number) VALUES ('" + commandBatchID + "', '', '', '" + raw_data +
"', '1', '" + myFilter.claimLine + "');";

where the raw_data variable is the line from the file.

How can I find out what is causing this slow down and how do I speed it up?

Database is 8.2.0 on x86_64-unknown-linux-gnu.

There is nothing else running on this database server (other than
standard Linux background processes).  ps ax did not show anything else
running.  No locks other than the occasional lock taken by the INSERT query.

I have done a VACUUM FULL on this table but not a REINDEX (that is running now).

Thanks in advance,
James

Re: slow speeds after 2 million rows inserted

From
"Joshua D. Drake"
Date:
On Fri, 2006-12-29 at 12:39 -0500, James Neff wrote:
> Greetings,
>
> I've got a Java application I am reading data from a flat file and
> inserting it into a table.  The first 2 million rows (each file
> contained about 1 million lines) went pretty fast.  Less than 40 mins to
> insert into the database.
>

You need to vacuum during the inserts :)

Joshua D. Drake



> After that the insert speed is slow.  I think I may be able to type the
> data faster than what is being done by the java application on the third
> file.
>
> Table looks like this:
>
> CREATE TABLE data_archive
> (
>   id serial NOT NULL,
>   batchid integer NOT NULL,
>   claimid character varying(25) NOT NULL,
>   memberid character varying(45) NOT NULL,
>   raw_data text NOT NULL,
>   status integer DEFAULT 0,
>   line_number integer,
>   CONSTRAINT data_archive_pkey PRIMARY KEY (id)
> )
>
>
> there is also an index on batchid.
>
> The insert command is like so:
>
> "INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status,
> line_number) VALUES ('" + commandBatchID + "', '', '', '" + raw_data +
> "', '1', '" + myFilter.claimLine + "');";
>
> where the raw_data variable is the line from the file.
>
> How can I find out what is causing this slow down and how do I speed it up?
>
> Database is 8.2.0 on x86_64-unknown-linux-gnu.
>
> There is nothing else running on this database server (other than
> standard linux background programs).  PS ax did not show anything else
> running.  No locks other than the occasional lock by the INSERT query.
>
> I have done a FULL vacuum on this table but not reindex (running now).
>
> Thanks in advance,
> James
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>
--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




Re: slow speeds after 2 million rows inserted

From
Rodrigo Gonzalez
Date:
James Neff wrote:
> Greetings,
>
> I've got a Java application I am reading data from a flat file and
> inserting it into a table.  The first 2 million rows (each file
> contained about 1 million lines) went pretty fast.  Less than 40 mins to
> insert into the database.
>
> After that the insert speed is slow.  I think I may be able to type the
> data faster than what is being done by the java application on the third
> file.
>
> Table looks like this:
>
> CREATE TABLE data_archive
> (
>  id serial NOT NULL,
>  batchid integer NOT NULL,
>  claimid character varying(25) NOT NULL,
>  memberid character varying(45) NOT NULL,
>  raw_data text NOT NULL,
>  status integer DEFAULT 0,
>  line_number integer,
>  CONSTRAINT data_archive_pkey PRIMARY KEY (id)
> )
>
>
> there is also an index on batchid.
>
> The insert command is like so:
>
> "INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status,
> line_number) VALUES ('" + commandBatchID + "', '', '', '" + raw_data +
> "', '1', '" + myFilter.claimLine + "');";
>
> where the raw_data variable is the line from the file.
>
> How can I find out what is causing this slow down and how do I speed it up?
>
> Database is 8.2.0 on x86_64-unknown-linux-gnu.
>
> There is nothing else running on this database server (other than
> standard linux background programs).  PS ax did not show anything else
> running.  No locks other than the occasional lock by the INSERT query.
> I have done a FULL vacuum on this table but not reindex (running now).
>
> Thanks in advance,
> James
>
>

Consider using COPY.

Re: slow speeds after 2 million rows inserted

From
"Joshua D. Drake"
Date:
> > there is also an index on batchid.
> >
> > The insert command is like so:
> >
> > "INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status,
> > line_number) VALUES ('" + commandBatchID + "', '', '', '" + raw_data +
> > "', '1', '" + myFilter.claimLine + "');";


Also, as you are running 8.2, you can use multi-valued inserts:

INSERT INTO data_archive VALUES (...), (...), (...)

> >
> > where the raw_data variable is the line from the file.
> >
> > How can I find out what is causing this slow down and how do I speed it up?
> >
> > Database is 8.2.0 on x86_64-unknown-linux-gnu.
> >
> > There is nothing else running on this database server (other than
> > standard linux background programs).  PS ax did not show anything else
> > running.  No locks other than the occasional lock by the INSERT query.
> > I have done a FULL vacuum on this table but not reindex (running now).
> >
> > Thanks in advance,
> > James
> >
> >
>
> Consider using copy
>
>

Re: slow speeds after 2 million rows inserted

From
James Neff
Date:
Joshua D. Drake wrote:
> Also, as you are running 8.2, you can use multi-valued inserts:
>
> INSERT INTO data_archive VALUES (...), (...), (...)
>

Would this speed things up? Or is that just another way to do it?

Thanks,
James


Re: slow speeds after 2 million rows inserted

From
Frank Finner
Date:
When do you commit these inserts? I occasionally found similar problems when I do heavy inserting/updating within one
single transaction. First everything runs fast; after some time everything slows down. If I commit the inserts every
1000 rows or so (large rows, small engine), this phenomenon does not occur. Maybe some buffer chokes if the transaction
is too big.

In your case I'd recommend committing after every one or two million rows (if possible).

Regards, Frank.


On Fri, 29 Dec 2006 12:39:03 -0500 James Neff <jneff@tethyshealth.com> thought long, then sat down and wrote:

> Greetings,
> 
> I've got a Java application I am reading data from a flat file and
> inserting it into a table.  The first 2 million rows (each file 
> contained about 1 million lines) went pretty fast.  Less than 40 mins to 
> insert into the database.
> 
> After that the insert speed is slow.  I think I may be able to type the 
> data faster than what is being done by the java application on the third 
> file.

-- 
Frank Finner

Invenius - Lösungen mit Linux
Köpfchenstraße 36
57072 Siegen
Telefon: 0271 231 8606    Mail: frank.finner@invenius.de
Telefax: 0271 231 8608    Web:  http://www.invenius.de
Key fingerprint = 90DF FF40 582E 6D6B BADF  6E6A A74E 67E4 E788 2651



Re: slow speeds after 2 million rows inserted

From
James Neff
Date:
Joshua D. Drake wrote:
> You need to vacuum during the inserts :)
>
> Joshua D. Drake
>

I ran the vacuum during the INSERT and it seemed to help a little, but
it's still relatively slow compared to the first 2 million records.

Any other ideas?

Thanks,
James

Re: slow speeds after 2 million rows inserted

From
"Joshua D. Drake"
Date:
On Fri, 2006-12-29 at 13:21 -0500, James Neff wrote:
> Joshua D. Drake wrote:
> > Also, as you are running 8.2, you can use multi-valued inserts:
> >
> > INSERT INTO data_archive VALUES (...), (...), (...)
> >
>
> Would this speed things up? Or is that just another way to do it?

The fastest way will be COPY.
The second fastest will be multi-valued inserts in batches, e.g.:

INSERT INTO data_archive VALUES (...), (...), (...) (I don't know what the max is)

but commit every 1000 inserts or so.
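A rough sketch of building that multi-row statement from Java (hypothetical helper; parameter placeholders rather than string concatenation, so the driver handles quoting of raw_data):

```java
public class MultiInsert {
    // Build "INSERT ... VALUES (?, ?, ?, ?, ?, ?), (?, ?, ?, ?, ?, ?), ..."
    // with one placeholder group per row; bind the 6 * rows values
    // afterwards with PreparedStatement.setObject before executing.
    static String buildSql(int rows) {
        StringBuilder sb = new StringBuilder(
            "INSERT INTO data_archive "
          + "(batchid, claimid, memberid, raw_data, status, line_number) VALUES ");
        for (int i = 0; i < rows; i++) {
            if (i > 0) sb.append(", ");
            sb.append("(?, ?, ?, ?, ?, ?)");
        }
        return sb.toString();
    }
}
```

Prepare the result once per batch size and reuse it; the whole multi-row statement still runs inside whatever transaction is currently open.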

Sincerely,

Joshua D. Drake


>
> Thanks,
> James
>
>
>


Re: slow speeds after 2 million rows inserted

From
Rodrigo Gonzalez
Date:
Joshua D. Drake wrote:
> On Fri, 2006-12-29 at 13:21 -0500, James Neff wrote:
>> Joshua D. Drake wrote:
>>> Also, as you are running 8.2, you can use multi-valued inserts:
>>>
>>> INSERT INTO data_archive VALUES (...), (...), (...)
>>>
>> Would this speed things up? Or is that just another way to do it?
>
> The fastest way will be COPY.
> The second fastest will be multi-valued inserts in batches, e.g.:
>
> INSERT INTO data_archive VALUES (...), (...), (...) (I don't know what the max is)
>
> but commit every 1000 inserts or so.
>
> Sincerely,
>
> Joshua D. Drake
>
>
>> Thanks,
>> James
>>
>>
>>

Another thing: if you have to run validations and so on, creating
a temp file with the validated information and then using COPY will be
faster. That way you can validate and use COPY at the same time.
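A sketch of that idea in Java (hypothetical class name): write the validated rows to a tab-delimited temp file, escaping the characters that COPY's text format treats specially:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CopyFileWriter {
    // Escape backslash first, then the delimiter/newline characters,
    // matching the escapes COPY's default text format understands.
    static String escape(String field) {
        return field.replace("\\", "\\\\")
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    // Write each row as one tab-delimited line; returns the temp file path.
    static Path writeRows(List<String[]> rows) throws IOException {
        Path tmp = Files.createTempFile("data_archive", ".tsv");
        try (BufferedWriter w = Files.newBufferedWriter(tmp)) {
            for (String[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    if (i > 0) w.write('\t');
                    w.write(escape(row[i]));
                }
                w.write('\n');
            }
        }
        return tmp;
    }
}
```

The file can then be loaded with COPY data_archive (batchid, claimid, memberid, raw_data, status, line_number) FROM '/path/to/file' on the server, or with psql's \copy from the client.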

Re: slow speeds after 2 million rows inserted

From
Frank Finner
Date:
In Java, assuming you have a Connection c, you simply say "c.commit();" after doing some action on the database. After
every commit, the transaction will be executed and closed and a new one opened, which runs until the next commit.
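As a rough sketch (hypothetical names, and assuming you already obtained the Connection elsewhere), the commit-every-N pattern looks like:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchLoader {
    static final int COMMIT_EVERY = 10_000;  // tune for your row size

    // True on every COMMIT_EVERY-th row; kept separate so the cadence is easy to test.
    static boolean commitDue(long rowsSoFar) {
        return rowsSoFar % COMMIT_EVERY == 0;
    }

    static void load(Connection c, Iterable<Object[]> rows) throws SQLException {
        c.setAutoCommit(false);  // without this, every INSERT is its own transaction
        String sql = "INSERT INTO data_archive "
                   + "(batchid, claimid, memberid, raw_data, status, line_number) "
                   + "VALUES (?, ?, ?, ?, ?, ?)";
        long n = 0;
        try (PreparedStatement ps = c.prepareStatement(sql)) {
            for (Object[] r : rows) {
                for (int i = 0; i < r.length; i++) ps.setObject(i + 1, r[i]);
                ps.executeUpdate();
                if (commitDue(++n)) c.commit();  // close the transaction, open a new one
            }
        }
        c.commit();  // flush the final partial batch
    }
}
```

setAutoCommit(false) is what makes the explicit commit() meaningful; with autocommit on, every single INSERT is its own transaction.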
 

Regards, Frank.


On Fri, 29 Dec 2006 13:23:37 -0500 James Neff <jneff@tethyshealth.com> thought long, then sat down and wrote:

> I'm sorry to ask a stupid question, but how do I 'commit' the transactions?
> 
> Thanks,
> James
> 
> 
> 
> Frank Finner wrote:
> > When do you commit these inserts? I occasionally found similar problems, when I do heavy inserting/updating within
> > one single transaction. First all runs fast, after some time everything slows down. If I commit the inserts every
> > 1000 rows or so (large rows, small engine), this phenomenon does not occur. Maybe some buffer chokes if the
> > transaction is too big.
> >
> > In your case I'd recommend to commit after every one or two million rows (if possible).
> >
> > Regards, Frank.
> >
> >
> > On Fri, 29 Dec 2006 12:39:03 -0500 James Neff <jneff@tethyshealth.com> thought long, then sat down and wrote:
> >
> >   
> >> Greetings,
> >>
> >> I've got a Java application I am reading data from a flat file and
> >> inserting it into a table.  The first 2 million rows (each file 
> >> contained about 1 million lines) went pretty fast.  Less than 40 mins to 
> >> insert into the database.
> >>
> >> After that the insert speed is slow.  I think I may be able to type the 
> >> data faster than what is being done by the java application on the third 
> >> file.
> >>     
> >
> >   
> 
> 
> -- 
> 
> James Neff
> Technology Specialist
> 
> Tethys Health Ventures
> 4 North Park Drive, Suite 203
> Hunt Valley, MD  21030
> 
> office:  410.771.0692 x103
> cell:    443.865.7874
> 
> 


-- 
Frank Finner

Invenius - Lösungen mit Linux
Köpfchenstraße 36
57072 Siegen
Telefon: 0271 231 8606    Mail: frank.finner@invenius.de
Telefax: 0271 231 8608    Web:  http://www.invenius.de
Key fingerprint = 90DF FF40 582E 6D6B BADF  6E6A A74E 67E4 E788 2651



Re: slow speeds after 2 million rows inserted

From
James Neff
Date:
Frank Finner wrote:
> In Java, assuming you have a Connection c, you simply say "c.commit();" after doing some action on the database.
> After every commit, the transaction will be executed and closed and a new one opened, which runs until the next commit.
>
> Regards, Frank.
>
>

That did it, thank you!


--James


Re: slow speeds after 2 million rows inserted

From
Nikola Milutinovic
Date:
> The fastest way will be COPY.
> The second fastest will be multi-valued inserts in batches, e.g.:
>
> INSERT INTO data_archive VALUES (...), (...), (...) (I don't know what the max is)
>
> but commit every 1000 inserts or so.

Is this some empirical value? Can someone give heuristics as to how to calculate the optimal number of transactions
after which to commit? Or at least guidelines.

Nix.




__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

Re: slow speeds after 2 million rows inserted

From
Guy Rouillier
Date:
Frank Finner wrote:
> In Java, assuming you have a Connection c, you simply say
> "c.commit();" after doing some action on the database. After every
> commit, the transaction will be executed and closed and a new one
> opened, which runs until the next commit.

Assuming, of course, you started with c.setAutoCommit(false);

--
Guy Rouillier

Re: slow speeds after 2 million rows inserted

From
Nikola Milutinovic
Date:
> 1. There is no difference (speed-wise) between committing every 1K or every 250K rows.

It was really some time ago since I last experimented with this. My last experiment was on PG 7.2 or 7.3. I was inserting circa 800,000 rows. Inserting without transactions took 25 hrs. Inserting with 10,000 rows per transaction took about 2.5 hrs. So, the speedup was 10x. I have not experimented with the transaction batch size, but I suspect that 1,000 would not show much speedup.

> 2. Vacuuming also makes no difference for a heavy insert-only table, only slows it down.

Makes sense. Since my application was dumping all records each month and inserting new ones, vacuum was really needed, but no speedup.

> 3. Table size plays no real factor.

The reason I saw a speedup must have to do with the fact that without transactions, each insert was its own transaction. That was eating resources.

Nix.


Re: slow speeds after 2 million rows inserted

From
Richard Broersma Jr
Date:
> It was really some time ago since I last experimented with this. My last experiment was on PG
> 7.2 or 7.3. I was inserting circa 800,000 rows. Inserting without transactions took 25 hrs.
> Inserting with 10,000 rows per transaction took about 2.5 hrs. So, the speedup was 10x. I have
> not experimented with the transaction batch size, but I suspect that 1,000 would not show much
> speedup.
>
> > 2. Vacuuming also makes no difference for a heavy insert-only table, only slows it down.
>
> Makes sense. Since my application was dumping all records each month and inserting new ones,
> vacuum was really needed, but no speedup.
>
> > 3. Table size plays no real factor.

Maybe this link may be useful; it contains additional links to various PostgreSQL performance
tests:
http://archives.postgresql.org/pgsql-general/2006-10/msg00662.php

Regards,

Richard Broersma Jr.

Re: slow speeds after 2 million rows inserted

From
"Chad Wagner"
Date:


On 12/31/06, Nikola Milutinovic <alokin1@yahoo.com> wrote:
> 1. There is no difference (speed-wise) between committing every 1K or every 250K rows.

It was really some time ago since I last experimented with this. My last experiment was on PG 7.2 or 7.3. I was inserting circa 800,000 rows. Inserting without transactions took 25 hrs. Inserting with 10,000 rows per transaction took about 2.5 hrs. So, the speedup was 10x. I have not experimented with the transaction batch size, but I suspect that 1,000 would not show much speedup.

I would imagine the overhead here is sending the BEGIN/COMMIT (or the behavior could be version dependent), and the commit is forcing a log flush.  According to the documentation, multiple inserts can potentially be flushed in one shot when the database wakes up to do a log flush automatically, so committing more frequently would actually appear to slow you down.

The time to commit is dependent on hardware; on my hardware it was around 40ms (though you have to question the precision of that calculation: it is probably less than 40ms, an upper limit under load).

My experiment was with 8.2, default configuration, so there is room for improvement.  It was installed from Devrim's RPM packages.

> 2. Vacuuming also makes no difference for a heavy insert-only table, only slows it down.

Makes sense. Since my application was dumping all records each month and inserting new ones, vacuum was really needed, but no speedup.

I agree, vacuuming is clearly important.  I would also think if you are going to do a massive one-time update/delete as a maintenance item on a normally read only table that you should plan on doing a vacuum full to recover the space used by the "old" rows.  Logically the fewer pages on disk, the less I/O that will result in scenarios where you are doing sequential scans and probably even many index scans.

It seems that the MVCC implementation would introduce fragmentation (with respect to a btree-indexed field) if your table design had an indexed creation-date field and you often range-scanned on that field but also updated the records; over time the optimizer would favor the index less as the correlation approached 0.  Obviously this is a great feature for a "last update date" field.  :)  Not so great if your primary queries are on a creation date field.


> 3. Table size plays no real factor.

The reason I saw a speedup must have to do with the fact that without transactions, each insert was its own transaction. That was eating resources.


I would agree.  I am also reusing the same statement handles (prepare once, execute many) with DBD::Pg.  The benefit here appears to be that it prepares the cursor once (one time to parse and generate the execution plan), and executes the same plan multiple times.  The difference in inserts was about 2000 inserts/s!

This is one of the reasons why everyone keeps saying to use COPY instead of INSERT: COPY is essentially a one-time prepare and execute-many.
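For the Java side of this thread, a sketch of the same prepare-once idiom over JDBC (hypothetical names; JDBC batching also cuts down client/server round trips):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class PreparedLoader {
    static final String SQL =
        "INSERT INTO data_archive "
      + "(batchid, claimid, memberid, raw_data, status, line_number) "
      + "VALUES (?, ?, ?, ?, ?, ?)";

    // Prepare once outside the loop (one parse/plan), then only rebind and
    // execute inside it. Preparing inside the loop repeats that work per row.
    static void load(Connection c, List<Object[]> rows) throws SQLException {
        try (PreparedStatement ps = c.prepareStatement(SQL)) {
            for (Object[] r : rows) {
                for (int i = 0; i < r.length; i++) ps.setObject(i + 1, r[i]);
                ps.addBatch();
            }
            ps.executeBatch();  // far fewer round trips than row-at-a-time execution
        }
    }
}
```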

Test #1 (prepare once, execute many):
4000 inserts 0.92 secs, 4368.84 inserts/s, commit 0.04 secs.
4000 inserts 0.93 secs, 4303.47 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4319.78 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4306.38 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4299.53 inserts/s, commit 0.02 secs.
4000 inserts 0.92 secs, 4345.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4298.67 inserts/s, commit 0.03 secs.
4000 inserts 0.91 secs, 4382.13 inserts/s, commit 0.04 secs.
4000 inserts 0.92 secs, 4347.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4314.66 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 28.7435
inserts/s (Mean) = 4328.6351

Test #2 (prepare/execute each time):
4000 inserts 1.92 secs, 2086.21 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2092.67 inserts/s, commit 0.02 secs.
4000 inserts 1.91 secs, 2094.54 inserts/s, commit 0.05 secs.
4000 inserts 1.96 secs, 2042.55 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.57 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2098.91 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.38 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2083.94 inserts/s, commit 0.02 secs.
4000 inserts 1.95 secs, 2050.07 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2086.14 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 19.2360
inserts/s (Mean) = 2083.1987