Thread: slow speeds after 2 million rows inserted
Greetings,

I've got a Java application that reads data from a flat file and inserts it into a table. The first 2 million rows (each file contained about 1 million lines) went pretty fast, less than 40 minutes to insert into the database.

After that the insert speed is slow. I think I may be able to type the data in faster than the Java application is managing on the third file.

The table looks like this:

CREATE TABLE data_archive
(
  id serial NOT NULL,
  batchid integer NOT NULL,
  claimid character varying(25) NOT NULL,
  memberid character varying(45) NOT NULL,
  raw_data text NOT NULL,
  status integer DEFAULT 0,
  line_number integer,
  CONSTRAINT data_archive_pkey PRIMARY KEY (id)
)

There is also an index on batchid.

The insert command is built like so:

"INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status,
line_number) VALUES ('" + commandBatchID + "', '', '', '" + raw_data +
"', '1', '" + myFilter.claimLine + "');";

where the raw_data variable is the line from the file.

How can I find out what is causing this slowdown, and how do I speed it up?

The database is 8.2.0 on x86_64-unknown-linux-gnu.

There is nothing else running on this database server (other than standard Linux background programs); ps ax did not show anything else running. There are no locks other than the occasional lock held by the INSERT query.

I have done a VACUUM FULL on this table but not a REINDEX (running now).

Thanks in advance,
James
On Fri, 2006-12-29 at 12:39 -0500, James Neff wrote:
> I've got a Java application that reads data from a flat file and
> inserts it into a table. The first 2 million rows (each file contained
> about 1 million lines) went pretty fast, less than 40 minutes to
> insert into the database.
> [...]

You need to vacuum during the inserts :)

Joshua D. Drake
James Neff wrote:
> After that the insert speed is slow. I think I may be able to type the
> data in faster than the Java application is managing on the third file.
> [...]

Consider using COPY.
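For reference, one way to drive COPY from the Java side is the copy API in the PostgreSQL JDBC driver (org.postgresql.copy.CopyManager); older driver versions may not have it, in which case psql's COPY does the same job. The sketch below is an illustration only, not code from the thread: the connection details, file name and batch id are made up, it buffers the whole file in memory for brevity, and it glosses over the escaping of backslashes, tabs and newlines inside raw_data that the COPY text format would require.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class CopyLoadSketch {
        public static void main(String[] args) throws Exception {
            // Connection details, file name and batch id are placeholders.
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/archive", "user", "password");
            CopyManager copier = ((PGConnection) conn).getCopyAPI();

            int batchId = 1;
            int lineNo = 0;
            StringBuilder buf = new StringBuilder();
            BufferedReader in = new BufferedReader(new FileReader("claims.dat"));
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                // Tab-separated columns: batchid, claimid, memberid, raw_data, status, line_number.
                // claimid and memberid are left empty, matching the original INSERT.
                buf.append(batchId).append('\t').append('\t').append('\t')
                   .append(line).append('\t').append(1).append('\t').append(lineNo).append('\n');
            }
            in.close();

            copier.copyIn(
                "COPY data_archive (batchid, claimid, memberid, raw_data, status, line_number) FROM STDIN",
                new StringReader(buf.toString()));
            conn.close();
        }
    }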
> > There is also an index on batchid.
> >
> > The insert command is like so:
> >
> > "INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status,
> > line_number) VALUES ('" + commandBatchID + "', '', '', '" + raw_data +
> > "', '1', '" + myFilter.claimLine + "');";

Also, as you are running 8.2 you can use multi-valued inserts...

INSERT INTO data_archive values () () ()

Joshua D. Drake
Joshua D. Drake wrote:
> Also, as you are running 8.2 you can use multi-valued inserts...
>
> INSERT INTO data_archive values () () ()

Would this speed things up? Or is that just another way to do it?

Thanks,
James
When do you commit these inserts? I have occasionally seen similar problems when doing heavy inserting/updating within one single transaction. At first everything runs fast, then after some time everything slows down. If I commit the inserts every 1000 rows or so (large rows, small engine), this phenomenon does not occur. Maybe some buffer chokes if the transaction is too big.

In your case I'd recommend committing after every one or two million rows (if possible).

Regards, Frank.


On Fri, 29 Dec 2006 12:39:03 -0500 James Neff <jneff@tethyshealth.com> thought long, then sat down and wrote:

> Greetings,
>
> I've got a Java application that reads data from a flat file and
> inserts it into a table. The first 2 million rows (each file contained
> about 1 million lines) went pretty fast, less than 40 minutes to
> insert into the database.
>
> After that the insert speed is slow. I think I may be able to type the
> data in faster than the Java application is managing on the third file.

--
Frank Finner
Invenius - Lösungen mit Linux
Köpfchenstraße 36
57072 Siegen
Telefon: 0271 231 8606    Mail: frank.finner@invenius.de
Telefax: 0271 231 8608    Web: http://www.invenius.de
Key fingerprint = 90DF FF40 582E 6D6B BADF 6E6A A74E 67E4 E788 2651
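A minimal JDBC sketch of that suggestion; the connection details, file name, batch id and the 5,000-row batch size are invented for illustration, and the INSERT is built by string concatenation exactly as in the original post (so quoting problems in raw_data are ignored here).

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CommitEveryNSketch {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/archive", "user", "password");
            conn.setAutoCommit(false);      // otherwise every single INSERT is its own transaction

            int commandBatchID = 1;
            Statement st = conn.createStatement();
            BufferedReader in = new BufferedReader(new FileReader("claims.dat"));
            String raw_data;
            int rows = 0;
            while ((raw_data = in.readLine()) != null) {
                rows++;
                st.executeUpdate("INSERT INTO data_archive "
                        + "(batchid, claimid, memberid, raw_data, status, line_number) VALUES ('"
                        + commandBatchID + "', '', '', '" + raw_data + "', '1', '" + rows + "')");
                if (rows % 5000 == 0) {
                    conn.commit();          // one transaction per 5,000 rows; the size is a knob to tune
                }
            }
            conn.commit();                  // commit the final partial batch
            in.close();
            st.close();
            conn.close();
        }
    }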
Joshua D. Drake wrote:
> You need to vacuum during the inserts :)

I ran the vacuum during the INSERT and it seemed to help a little, but it's still relatively slow compared to the first 2 million records.

Any other ideas?

Thanks,
James
On Fri, 2006-12-29 at 13:21 -0500, James Neff wrote:
> Joshua D. Drake wrote:
> > Also, as you are running 8.2 you can use multi-valued inserts...
> >
> > INSERT INTO data_archive values () () ()
>
> Would this speed things up? Or is that just another way to do it?

The fastest way will be COPY. The second fastest will be multi-value inserts in batches, e.g.:

INSERT INTO data_archive values () () ()

(I don't know what the max is), but commit every 1000 inserts or so.

Sincerely,

Joshua D. Drake
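Spelled out for this table, a multi-row VALUES insert (new in 8.2) looks like the sketch below; the literals and connection details are made up, and note the commas between the row groups. There is no small fixed limit on the number of rows per statement, but batches of a few hundred to a few thousand rows per INSERT, combined with the periodic commit suggested above, are typical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MultiRowInsertSketch {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/archive", "user", "password");
            Statement st = conn.createStatement();

            // One INSERT carrying three rows; the row groups are separated by commas.
            st.executeUpdate(
                "INSERT INTO data_archive "
              + "(batchid, claimid, memberid, raw_data, status, line_number) VALUES "
              + "(42, '', '', 'first line from the file',  1, 1), "
              + "(42, '', '', 'second line from the file', 1, 2), "
              + "(42, '', '', 'third line from the file',  1, 3)");

            st.close();
            conn.close();
        }
    }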
Joshua D. Drake wrote:
> The fastest way will be COPY. The second fastest will be multi-value
> inserts in batches, e.g.:
>
> INSERT INTO data_archive values () () ()
>
> (I don't know what the max is), but commit every 1000 inserts or so.

Another thing: if you have to make validations and so on, creating a temp file with the validated information and using COPY will be faster. That way you can validate and use COPY at the same time.
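A sketch of that workflow, with invented names throughout: validate each line in the application, write the clean rows to a tab-delimited temporary file, then load the whole file with one server-side COPY. The same raw_data escaping caveat as in the earlier COPY sketch applies, and COPY ... FROM 'file' reads the file on the database server and (on 8.2) requires a superuser, so this only fits when the loader runs on, or can write to, that machine; otherwise COPY FROM STDIN, as sketched earlier, is the alternative.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ValidateThenCopySketch {
        // Stand-in for the application's real validation rules.
        static boolean isValid(String line) {
            return line.length() > 0;
        }

        public static void main(String[] args) throws Exception {
            int batchId = 1;
            File tmp = File.createTempFile("data_archive_load", ".tsv");

            BufferedReader in = new BufferedReader(new FileReader("claims.dat"));
            BufferedWriter out = new BufferedWriter(new FileWriter(tmp));
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (!isValid(line)) {
                    continue;               // or log the rejected line somewhere
                }
                // Tab-separated: batchid, claimid, memberid, raw_data, status, line_number.
                out.write(batchId + "\t\t\t" + line + "\t1\t" + lineNo + "\n");
            }
            out.close();
            in.close();

            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/archive", "user", "password");
            Statement st = conn.createStatement();
            st.execute("COPY data_archive (batchid, claimid, memberid, raw_data, status, line_number) "
                     + "FROM '" + tmp.getAbsolutePath() + "'");
            st.close();
            conn.close();
        }
    }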
In Java, assuming you have a Connection c, you simply say "c.commit();" after doing some action on the database. After every commit, the transaction will be executed and closed and a new one opened, which runs until the next commit.

Regards, Frank.


On Fri, 29 Dec 2006 13:23:37 -0500 James Neff <jneff@tethyshealth.com> thought long, then sat down and wrote:

> I'm sorry to ask a stupid question, but how do I 'commit' the transactions?
>
> Thanks,
> James
> [...]

--
Frank Finner
Frank Finner wrote:
> In Java, assuming you have a Connection c, you simply say "c.commit();" after doing some
> action on the database. After every commit, the transaction will be executed and closed and
> a new one opened, which runs until the next commit.

That did it, thank you!

--James
> The fastest way will be COPY.
> The second fastest will be multi-value inserts in batches, e.g.:
>
> INSERT INTO data_archive values () () ()
>
> (I don't know what the max is), but commit every 1000 inserts or so.

Is this some empirical value? Can someone give heuristics, or at least guidelines, as to how to calculate the optimal number of inserts after which to commit?

Nix.
Frank Finner wrote:
> In Java, assuming you have a Connection c, you simply say "c.commit();" after doing some
> action on the database. After every commit, the transaction will be executed and closed and
> a new one opened, which runs until the next commit.

Assuming, of course, you started with c.setAutoCommit(false);

--
Guy Rouillier
> 1. There is no difference (speed-wise) between committing every 1K or every 250K rows.
It was really some time ago that I experimented with this. My last experiment was on PG 7.2 or 7.3. I was inserting circa 800,000 rows. Inserting without explicit transactions took 25 hrs. Inserting with 10,000 rows per transaction took about 2.5 hrs. So the speedup was 10x. I have not experimented with the transaction batch size, but I suspect that 1,000 would not show much speedup.
> 2. Vacuuming also makes no difference for a heavy insert-only table, only slows it down.
Makes sense. Since my application was dumping all records each month and inserting new ones, vacuum was really needed, but no speedup.
> 3. Table size plays no real factor.
The reason I saw a speedup must have to do with the fact that, without explicit transactions, each insert was its own transaction. That was eating resources.
Nix.
> It was really some time ago that I experimented with this. My last experiment was on PG 7.2
> or 7.3. I was inserting circa 800,000 rows. Inserting without explicit transactions took 25 hrs.
> Inserting with 10,000 rows per transaction took about 2.5 hrs. So the speedup was 10x.
> [...]

This link may be useful; it contains additional links to various PostgreSQL performance tests.

http://archives.postgresql.org/pgsql-general/2006-10/msg00662.php

Regards,
Richard Broersma Jr.
On 12/31/06, Nikola Milutinovic <alokin1@yahoo.com> wrote:
I would imagine the overhead here is sending the BEGIN/COMMIT (or the behavior could be version dependent), and that the commit is forcing a log flush. According to the documentation, multiple inserts have the potential of being flushed in one shot when the database wakes up to do a log flush automatically, so committing more frequently would actually appear to slow you down.
The time to commit depends on hardware; on my hardware it was around 40 ms (and you have to question the precision of that measurement: it is probably less than 40 ms, and that is an upper limit under load).
My experiment was with 8.2, default configuration, so there is room for improvement. It was installed from Devrim's RPM packages.
I agree, vacuuming is clearly important. I would also think that if you are going to do a massive one-time update/delete as a maintenance item on a normally read-only table, you should plan on doing a VACUUM FULL to recover the space used by the "old" rows. Logically, the fewer pages on disk, the less I/O will result in scenarios where you are doing sequential scans, and probably even many index scans.
It seems that the MVCC implementation would introduce fragmentation (with respect to a btree-indexed field): if your table had an indexed creation-date field that you often range-scanned, but you also updated the records, then over time the optimizer would favor the index less as the correlation approached 0. Obviously this is a great feature for a "last update date" field. :) Not so great if your primary queries are on a creation-date field.
I would agree. I am also reusing the same statement handles (prepare once, execute many) with DBD::Pg. The benefit here appears to be that it prepares the cursor once (one time to parse and generate the execution plan), and executes the same plan multiple times. The difference in inserts was about 2000 inserts/s!
This is one of the reasons why everyone keeps saying to use COPY instead of INSERT: COPY is essentially a one-time prepare and execute-many.
Test #1 (prepare once, execute many):
4000 inserts 0.92 secs, 4368.84 inserts/s, commit 0.04 secs.
4000 inserts 0.93 secs, 4303.47 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4319.78 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4306.38 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4299.53 inserts/s, commit 0.02 secs.
4000 inserts 0.92 secs, 4345.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4298.67 inserts/s, commit 0.03 secs.
4000 inserts 0.91 secs, 4382.13 inserts/s, commit 0.04 secs.
4000 inserts 0.92 secs, 4347.44 inserts/s, commit 0.02 secs.
4000 inserts 0.93 secs, 4314.66 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 28.7435
inserts/s (Mean) = 4328.6351
Test #2 (prepare/execute each time):
4000 inserts 1.92 secs, 2086.21 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2092.67 inserts/s, commit 0.02 secs.
4000 inserts 1.91 secs, 2094.54 inserts/s, commit 0.05 secs.
4000 inserts 1.96 secs, 2042.55 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.57 inserts/s, commit 0.04 secs.
4000 inserts 1.91 secs, 2098.91 inserts/s, commit 0.03 secs.
4000 inserts 1.91 secs, 2098.38 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2083.94 inserts/s, commit 0.02 secs.
4000 inserts 1.95 secs, 2050.07 inserts/s, commit 0.03 secs.
4000 inserts 1.92 secs, 2086.14 inserts/s, commit 0.02 secs.
inserts/s (StdDev) = 19.2360
inserts/s (Mean) = 2083.1987
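For comparison with the numbers above: the benchmark used Perl's DBD::Pg, but the prepare-once versus prepare-per-row distinction can be sketched in JDBC terms as below (invented names, not the benchmark code). Committing is left to the caller, per the batching discussion earlier in the thread, and note that the PostgreSQL JDBC driver may only switch a PreparedStatement to a server-side plan after a few executions, governed by its prepareThreshold setting.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class PrepareOnceVsEachTime {

        static final String INSERT_SQL =
            "INSERT INTO data_archive (batchid, claimid, memberid, raw_data, status, line_number) "
          + "VALUES (?, ?, ?, ?, ?, ?)";

        // Test #2 style: parse and plan the statement again for every row.
        static void insertPreparingEachTime(Connection conn, String[] lines, int batchId) throws SQLException {
            for (int i = 0; i < lines.length; i++) {
                PreparedStatement ps = conn.prepareStatement(INSERT_SQL);
                bind(ps, batchId, lines[i], i + 1);
                ps.executeUpdate();
                ps.close();
            }
        }

        // Test #1 style: prepare once, execute many; roughly twice as fast in the figures above.
        static void insertPreparingOnce(Connection conn, String[] lines, int batchId) throws SQLException {
            PreparedStatement ps = conn.prepareStatement(INSERT_SQL);
            for (int i = 0; i < lines.length; i++) {
                bind(ps, batchId, lines[i], i + 1);
                ps.executeUpdate();
            }
            ps.close();
        }

        static void bind(PreparedStatement ps, int batchId, String rawData, int lineNo) throws SQLException {
            ps.setInt(1, batchId);
            ps.setString(2, "");        // claimid
            ps.setString(3, "");        // memberid
            ps.setString(4, rawData);
            ps.setInt(5, 1);            // status
            ps.setInt(6, lineNo);
        }
    }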