Thread: insert vs select into performance
Hi

I was doing some testing on "insert" compared to "select into". I inserted 100 000 rows (with 8 column values) into a table, which took 14 seconds, compared to a select into, which took 0.8 seconds. (fyi, the inserts were batched, autocommit was turned off and it all happened on the local machine)

Now I am wondering why the select into is that much faster? Does the select into translate into a specially optimised function in C that can cut corners which an insert can not do (e.g. lazy copying), or is it some other reason?

The reason I am asking is that select into shows that a number of rows can be inserted into a table quite a lot faster than one would think was possible with ordinary SQL. If that is the case, it means that if I write my insert function in C instead of SQL or PL/pgSQL, I can make my db perform an order of magnitude faster.

Any comments?

regards

thomas
On Jul 17, 2007, at 14:38, Thomas Finneid wrote:

> I was doing some testing on "insert" compared to "select into". I inserted 100 000 rows (with 8 column values) into a table, which took 14 seconds, compared to a select into, which took 0.8 seconds.
> (fyi, the inserts were batched, autocommit was turned off and it all happened on the local machine)
>
> Now I am wondering why the select into is that much faster?

It would be helpful if you included the actual queries you're using, as there are a number of variables:

1) If there are any constraints on the original table, the INSERT will be checking those constraints. AIUI, SELECT INTO does not generate any table constraints.

2a) Are you using INSERT INTO foo (foo1, foo2, foo3) SELECT foo1, foo2, foo3 FROM pre_foo or individual inserts for each row? The former would be faster than the latter.

2b) If you are doing individual inserts, are you wrapping them in a transaction? Doing so would be faster than committing each insert separately.

Michael Glaesemann
grzm seespotcode net
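For illustration, a minimal sketch of the two approaches Michael describes; the tables foo and pre_foo and their columns are hypothetical names taken from his example:

-- Set-based insert: one statement, all rows copied server-side
INSERT INTO foo (foo1, foo2, foo3)
SELECT foo1, foo2, foo3 FROM pre_foo;

-- Individual inserts: wrap them in one transaction so they share a single commit
BEGIN;
INSERT INTO foo (foo1, foo2, foo3) VALUES (1, 2, 3);
INSERT INTO foo (foo1, foo2, foo3) VALUES (4, 5, 6);
-- ... one INSERT per row ...
COMMIT;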
Have you also tried the COPY statement? Afaik select into is similar to what happens in there.

Best regards,

Arjen

On 17-7-2007 21:38 Thomas Finneid wrote:
> Hi
>
> I was doing some testing on "insert" compared to "select into". I inserted 100 000 rows (with 8 column values) into a table, which took 14 seconds, compared to a select into, which took 0.8 seconds.
> (fyi, the inserts were batched, autocommit was turned off and it all happened on the local machine)
>
> Now I am wondering why the select into is that much faster?
> Does the select into translate into a specially optimised function in C that can cut corners which an insert can not do (e.g. lazy copying), or is it some other reason?
>
> The reason I am asking is that select into shows that a number of rows can be inserted into a table quite a lot faster than one would think was possible with ordinary SQL. If that is the case, it means that if I write my insert function in C instead of SQL or PL/pgSQL, I can make my db perform an order of magnitude faster.
>
> Any comments?
>
> regards
>
> thomas
Michael Glaesemann <grzm@seespotcode.net> writes:
> It would be helpful if you included the actual queries you're using, as there are a number of variables:

Not to mention which PG version he's testing. Since (I think) 8.1, SELECT INTO knows that it can substitute one fsync for WAL-logging the individual row inserts, since if there's a crash the new table will disappear anyway.

regards, tom lane
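For reference, a minimal sketch of the statement being discussed; SELECT INTO creates the target table and fills it in one step, and CREATE TABLE ... AS is the equivalent spelling (the table names here are hypothetical):

-- Both create new_table from scratch and populate it entirely server-side.
-- Because the table did not exist before the transaction, a crash makes the
-- whole table disappear anyway, which is why per-row WAL records can be skipped.
SELECT * INTO new_table FROM source_table;

CREATE TABLE new_table_2 AS
SELECT * FROM source_table;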
Michael Glaesemann wrote:
>
> On Jul 17, 2007, at 14:38, Thomas Finneid wrote:
>
>> I was doing some testing on "insert" compared to "select into". I inserted 100 000 rows (with 8 column values) into a table, which took 14 seconds, compared to a select into, which took 0.8 seconds.
>> (fyi, the inserts were batched, autocommit was turned off and it all happened on the local machine)
>>
>> Now I am wondering why the select into is that much faster?
>
> It would be helpful if you included the actual queries you're using, as there are a number of variables:

create table ciu_data_type
(
    id        integer,
    loc_id    integer,
    value1    integer,
    value2    real,
    value3    integer,
    value4    real,
    value5    real,
    value6    char(2),
    value7    char(3),
    value8    bigint,
    value9    bigint,
    value10   real,
    value11   bigint,
    value12   smallint,
    value13   double precision,
    value14   real,
    value15   real,
    value16   char(1),
    value17   varchar(18),
    value18   bigint,
    value19   char(4)
);

Performed with JDBC:

insert into ciu_data_type (id, loc_id, value3, value5, value8, value9, value10, value11)
values (?,?,?,?,?,?,?,?)

select * into ciu_data_type_copy from ciu_data_type

> 1) If there are any constraints on the original table, the INSERT will be checking those constraints. AIUI, SELECT INTO does not generate any table constraints.

No constraints in this test.

> 2a) Are you using INSERT INTO foo (foo1, foo2, foo3) SELECT foo1, foo2, foo3 FROM pre_foo or individual inserts for each row? The former would be faster than the latter.
>
> 2b) If you are doing individual inserts, are you wrapping them in a transaction? Doing so would be faster than committing each insert separately.

Disabling autocommit, but nothing more than that.

I haven't done this test in a stored function yet, nor have I tried it with a C client so far, so there is the chance that it is java/jdbc that makes the insert so slow. I'll get to that test soon, if there is any chance my theory makes sense.

regards

thomas
Tom Lane wrote:
> Michael Glaesemann <grzm@seespotcode.net> writes:
>> It would be helpful if you included the actual queries you're using, as there are a number of variables:
>
> Not to mention which PG version he's testing.

It's pg 8.1, for now. I'll be upgrading to a compile-optimised 8.2 when I do the real test on the real server. (It's on Kubuntu 6.10 running on a Thinkpad T60 with a dual core, 1.5 GB RAM and a 100 GB SATA disk, just in case anybody feels that is of any interest.)

> Since (I think) 8.1, SELECT INTO knows that it can substitute one fsync for WAL-logging the individual row inserts, since if there's a crash the new table will disappear anyway.

I am not sure I understand you correctly here: are you saying that SELECT INTO in 8.1 disables WAL logging and uses just a single fsync at the end? In that case it means that I could disable WAL as well and achieve the same performance, does it not?

regards

thomas
On Tue, Jul 17, 2007 at 10:50:22PM +0200, Thomas Finneid wrote:
> I haven't done this test in a stored function yet, nor have I tried it with a C client so far, so there is the chance that it is java/jdbc that makes the insert so slow. I'll get to that test soon, if there is any chance my theory makes sense.

What you're seeing is perfectly normal. Switch to COPY for fast inserts. (When you use inserts you need to wait for a round-trip for each row, instead of sending data to the server as fast as possible.)

Mike Stone
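For illustration, a minimal sketch of the COPY approach Mike suggests, assuming the ciu_data_type table from earlier in the thread and a hypothetical tab-separated data file:

-- Server-side COPY: the file is read directly on the database server
COPY ciu_data_type (id, loc_id, value3, value5, value8, value9, value10, value11)
FROM '/tmp/ciu_data.tsv';

-- Client-side equivalent in psql (the file is read by the client):
-- \copy ciu_data_type (id, loc_id, value3, value5, value8, value9, value10, value11) from 'ciu_data.tsv'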
If you're performing via JDBC, are you using addBatch/executeBatch, or are you directly executing each insert? If you directly execute each insert, then your code will wait for a server round-trip between each insert.

That still won't get you to the speed of select into, but it should help. You could also look at the pgsql-jdbc archives for the JDBC driver patches which allow you to use COPY-style bulk loading, which should get you to the performance level of COPY, which should be reasonably close to the performance of select into.

-- Mark Lewis

On Tue, 2007-07-17 at 22:50 +0200, Thomas Finneid wrote:
> Michael Glaesemann wrote:
> >
> > On Jul 17, 2007, at 14:38, Thomas Finneid wrote:
> >
> >> I was doing some testing on "insert" compared to "select into". I inserted 100 000 rows (with 8 column values) into a table, which took 14 seconds, compared to a select into, which took 0.8 seconds.
> >> (fyi, the inserts were batched, autocommit was turned off and it all happened on the local machine)
> >>
> >> Now I am wondering why the select into is that much faster?
> >
> > It would be helpful if you included the actual queries you're using, as there are a number of variables:
>
> create table ciu_data_type
> (
>     id        integer,
>     loc_id    integer,
>     value1    integer,
>     value2    real,
>     value3    integer,
>     value4    real,
>     value5    real,
>     value6    char(2),
>     value7    char(3),
>     value8    bigint,
>     value9    bigint,
>     value10   real,
>     value11   bigint,
>     value12   smallint,
>     value13   double precision,
>     value14   real,
>     value15   real,
>     value16   char(1),
>     value17   varchar(18),
>     value18   bigint,
>     value19   char(4)
> );
>
> Performed with JDBC:
>
> insert into ciu_data_type (id, loc_id, value3, value5, value8, value9, value10, value11)
> values (?,?,?,?,?,?,?,?)
>
> select * into ciu_data_type_copy from ciu_data_type
>
> > 1) If there are any constraints on the original table, the INSERT will be checking those constraints. AIUI, SELECT INTO does not generate any table constraints.
>
> No constraints in this test.
>
> > 2a) Are you using INSERT INTO foo (foo1, foo2, foo3) SELECT foo1, foo2, foo3 FROM pre_foo or individual inserts for each row? The former would be faster than the latter.
> >
> > 2b) If you are doing individual inserts, are you wrapping them in a transaction? Doing so would be faster than committing each insert separately.
>
> Disabling autocommit, but nothing more than that.
>
> I haven't done this test in a stored function yet, nor have I tried it with a C client so far, so there is the chance that it is java/jdbc that makes the insert so slow. I'll get to that test soon, if there is any chance my theory makes sense.
>
> regards
>
> thomas
Arjen van der Meijden wrote:
> Have you also tried the COPY statement? Afaik select into is similar to what happens in there.

No, because COPY only works from file to db or vice versa, not from table to table.

regards

thomas
On Jul 17, 2007, at 15:50, Thomas Finneid wrote:

> Michael Glaesemann wrote:
>> 2a) Are you using INSERT INTO foo (foo1, foo2, foo3) SELECT foo1, foo2, foo3 FROM pre_foo or individual inserts for each row? The former would be faster than the latter.

> Performed with JDBC:
>
> insert into ciu_data_type (id, loc_id, value3, value5, value8, value9, value10, value11) values (?,?,?,?,?,?,?,?)

As they're individual inserts, I think what you're seeing is overhead from calling this statement 100,000 times, not just on the server but also the overhead through JDBC. For comparison, try

CREATE TABLE ciu_data_type_copy (LIKE ciu_data_type);

INSERT INTO ciu_data_type_copy (id, loc_id, value3, value5, value8, value9, value10, value11)
SELECT id, loc_id, value3, value5, value8, value9, value10, value11
FROM ciu_data_type;

I think this would be more comparable to what you're seeing.

> I haven't done this test in a stored function yet, nor have I tried it with a C client so far, so there is the chance that it is java/jdbc that makes the insert so slow. I'll get to that test soon, if there is any chance my theory makes sense.

Just testing in psql with \timing should be fairly easy.

Michael Glaesemann
grzm seespotcode net
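A minimal psql session for the \timing suggestion (timing output is elided rather than invented):

\timing
Timing is on.

CREATE TABLE ciu_data_type_copy (LIKE ciu_data_type);

INSERT INTO ciu_data_type_copy (id, loc_id, value3, value5, value8, value9, value10, value11)
SELECT id, loc_id, value3, value5, value8, value9, value10, value11
FROM ciu_data_type;
-- psql prints a "Time: ... ms" line after each statement while \timing is on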
Mark Lewis wrote:
> If you're performing via JDBC, are you using addBatch/executeBatch, or are you directly executing each insert? If you directly execute each insert, then your code will wait for a server round-trip between each insert.

I tested both and I found almost no difference in the time it took to perform it. Mind you, this was on a local machine, but I still thought that it was a bit strange.

> That still won't get you to the speed of select into, but it should help. You could also look at the pgsql-jdbc archives for the JDBC driver patches which allow you to use COPY-style bulk loading, which should get you to the performance level of COPY, which should be reasonably close to the performance of select into.

Yes, someone else on the list suggested this a couple of weeks ago. I haven't had a chance to test it yet, but I am hopeful that I can use it.

The only issue I have is that the tests I have done are rather simplistic, because they are just speed trials. The real system will probably use 5-10 tables, with up to 100 columns across all tables, which means I need a stored function that goes through all the bulk-loaded data and reinserts it into the real tables. I am worried that this might hurt performance so much that almost the entire bulk copy advantage disappears. This is why I am wondering about the details of SELECT INTO, C functions, etc.

regards

thomas
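As a sketch of the staging approach being discussed here, one could bulk-load into a single staging table and then redistribute with set-based inserts. The table and column names below are hypothetical, and only a handful of columns are shown:

-- Hypothetical staging table that receives the bulk load (via COPY or batched inserts)
CREATE TABLE staging_bulk (
    id        integer,
    loc_id    integer,
    value3    integer,
    value8    bigint
    -- ... remaining columns for all target tables ...
);

-- Redistribute into the real tables, one set-based statement per target table
INSERT INTO target_table_a (id, loc_id, value3)
SELECT id, loc_id, value3 FROM staging_bulk;

INSERT INTO target_table_b (id, value8)
SELECT id, value8 FROM staging_bulk;

TRUNCATE staging_bulk;  -- ready for the next batch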
> I was doing some testing on "insert" compared to "select into". I inserted 100 000 rows (with 8 column values) into a table, which took 14 seconds, compared to a select into, which took 0.8 seconds.
> (fyi, the inserts were batched, autocommit was turned off and it all happened on the local machine)

Did you use prepared statements?
Did you use INSERT INTO ... VALUES () with a long list of values, or just 100K insert statements?

It's the time to parse statements, plan, execute, roundtrips with the client, context switches, time for your client library to escape the data and encode it and for postgres to decode it, etc. In a word: OVERHEAD.

By the way, which language and client library are you using?

FYI 14 s / 100K = 140 microseconds per individual SQL query. That ain't slow at all.

> Does the select into translate into a specially optimised function in C that can cut corners which an insert can not do (e.g. lazy copying), or is it some other reason?

Yeah: instead of your client having to encode 100K * 8 values, send them over a socket, and postgres decoding them, INSERT INTO ... SELECT just takes the data and writes the data. Same thing as writing a file a byte at a time versus using a big buffer.

> The reason I am asking is that select into shows that a number of rows can be inserted into a table quite a lot faster than one would think was possible with ordinary SQL. If that is the case, it means that if I write my insert function in C instead of SQL or PL/pgSQL, I can make my db perform an order of magnitude faster.

Fortunately this is already done for you: there is the PREPARE statement, which will remove the parsing overhead. If you must insert many rows, use VALUES (),(),()...
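A minimal sketch of the two features PFC mentions, using the ciu_data_type table and columns from earlier in the thread; note that multi-row VALUES lists in INSERT require PostgreSQL 8.2 or later, and the data values below are made up:

-- PREPARE parses and plans the statement once; each EXECUTE reuses the plan
PREPARE ins (integer, integer, integer, real, bigint, bigint, real, bigint) AS
    INSERT INTO ciu_data_type (id, loc_id, value3, value5, value8, value9, value10, value11)
    VALUES ($1, $2, $3, $4, $5, $6, $7, $8);

EXECUTE ins (1, 10, 3, 0.5, 8, 9, 1.0, 11);
EXECUTE ins (2, 20, 3, 0.5, 8, 9, 1.0, 11);

-- Multi-row VALUES (8.2+): many rows per statement, one round-trip
INSERT INTO ciu_data_type (id, loc_id, value3, value5, value8, value9, value10, value11)
VALUES (3, 30, 3, 0.5, 8, 9, 1.0, 11),
       (4, 40, 3, 0.5, 8, 9, 1.0, 11);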
PFC wrote:
>
>> I was doing some testing on "insert" compared to "select into". I inserted 100 000 rows (with 8 column values) into a table, which took 14 seconds, compared to a select into, which took 0.8 seconds.
>> (fyi, the inserts were batched, autocommit was turned off and it all happened on the local machine)
>
> Did you use prepared statements?
> Did you use INSERT INTO ... VALUES () with a long list of values, or just 100K insert statements?

It was prepared statements, and I tried it both batched and non-batched (not much difference on a local machine).

> It's the time to parse statements, plan, execute, roundtrips with the client, context switches, time for your client library to escape the data and encode it and for postgres to decode it, etc. In a word: OVERHEAD.

I know there is some overhead, but that much when running it batched...?

> By the way, which language and client library are you using?
>
> FYI 14 s / 100K = 140 microseconds per individual SQL query. That ain't slow at all.

Unfortunately it's not fast enough; it needs to be done in no more than 1-2 seconds (and in production it will be maybe 20-50 columns of data, perhaps divided over 5-10 tables). Additionally it needs to scale to perhaps three times as many columns, and perhaps 2-3 times as many rows in some situations, within 1 second. Further on, it needs to allow for about 20-50 clients reading much of that data before the next batch of data arrives.

I know the computer is going to be a much faster one than the one I am testing with, but I need to make sure the solution scales well.

regards

thomas
>> It's the time to parse statements, plan, execute, roundtrips with the client, context switches, time for your client library to escape the data and encode it and for postgres to decode it, etc. In a word: OVERHEAD.
>
> I know there is some overhead, but that much when running it batched...?

Well, yeah ;)

> Unfortunately it's not fast enough; it needs to be done in no more than 1-2 seconds (and in production it will be maybe 20-50 columns of data, perhaps divided over 5-10 tables).
> Additionally it needs to scale to perhaps three times as many columns, and perhaps 2-3 times as many rows in some situations, within 1 second.
> Further on, it needs to allow for about 20-50 clients reading much of that data before the next batch of data arrives.

Wow. What is the application?

Test run on a desktop PC, Athlon 64 3200+, 2 IDE disks in RAID1 (pretty slow):

test=> CREATE TABLE test (a INT, b INT, c INT, d INT, e INT, f INT);
CREATE TABLE
Time: 11.463 ms

test=> INSERT INTO test SELECT 1,2,3,4,5,a FROM generate_series( 1, 100000 ) as a;
INSERT 0 100000
Time: 721.579 ms

OK, so you see, insert speed is pretty fast. With a better CPU and faster disks, you can get a lot more.

test=> TRUNCATE TABLE test;
TRUNCATE TABLE
Time: 30.010 ms

test=> ALTER TABLE test ADD PRIMARY KEY (f);
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "test_pkey" for table "test"
ALTER TABLE
Time: 100.577 ms

test=> INSERT INTO test SELECT 1,2,3,4,5,a FROM generate_series( 1, 100000 ) as a;
INSERT 0 100000
Time: 1915.928 ms

This includes the time to update the index.

test=> DROP TABLE test;
DROP TABLE
Time: 28.804 ms

test=> CREATE TABLE test (a INT, b INT, c INT, d INT, e INT, f INT);
CREATE TABLE
Time: 1.626 ms

test=> CREATE OR REPLACE FUNCTION test_insert( )
RETURNS VOID
LANGUAGE plpgsql
AS
$$
DECLARE
    _i INTEGER;
BEGIN
    FOR _i IN 0..100000 LOOP
        INSERT INTO test (a,b,c,d,e,f) VALUES (1,2,3,4,5, _i);
    END LOOP;
END;
$$;
CREATE FUNCTION
Time: 51.948 ms

test=> SELECT test_insert();
 test_insert
-------------

(1 row)

Time: 1885.382 ms

Now you see, performing 100K individual inserts inside a plpgsql function is also fast. The postgres engine is pretty damn fast; it's the communication overhead that you feel, especially switching between client and server processes.

Another example:

=> INSERT INTO test (a,b,c,d,e,f) VALUES (... 100000 integer tuples)
INSERT 0 100000
Time: 1836.458 ms

VALUES is actually pretty fast. Here, there is no context switch, everything is done in one INSERT.

However COPY is much faster because the parsing overhead and de-escaping of data is faster. COPY is optimized for throughput.

So, advice: for optimum throughput, have your application build chunks of data into text files and use COPY. Or if your client lib supports the copy interface, use it.

You will need a fast disk system with xlog and data on separate disks, several CPU cores (1 insert thread will max out 1 core, use the others for selects), lots of RAM so index updates don't need to seek, and tuning of bgwriter and checkpoints to avoid load spikes.
On Tue, Jul 17, 2007 at 10:58:01PM +0200, Thomas Finneid wrote:
> I am not sure I understand you correctly here: are you saying that SELECT INTO in 8.1 disables WAL logging and uses just a single fsync at the end? In that case it means that I could disable WAL as well and achieve the same performance, does it not?

Yes. The difference is that the select into optimization just means that if the system crashes the data you're inserting is invalid (and is properly cleaned up), whereas disabling the WAL means that if the system crashes everything is invalid (and can't be cleaned up).

Mike Stone
On Tue, Jul 17, 2007 at 11:01:15PM +0200, Thomas Finneid wrote:
> Arjen van der Meijden wrote:
>> Have you also tried the COPY statement? Afaik select into is similar to what happens in there.
>
> No, because COPY only works from file to db or vice versa, not from table to table.

I don't understand how the insert you described is table to table?

Mike Stone
PFC wrote:
>> Unfortunately it's not fast enough; it needs to be done in no more than 1-2 seconds (and in production it will be maybe 20-50 columns of data, perhaps divided over 5-10 tables).
>> Additionally it needs to scale to perhaps three times as many columns, and perhaps 2-3 times as many rows in some situations, within 1 second.
>> Further on, it needs to allow for about 20-50 clients reading much of that data before the next batch of data arrives.
>
> Wow. What is the application?

Geological surveys, where they perform realtime geo/hydro-phone shots of areas of the size of 10x10 km every 3-15 seconds.

> test=> CREATE OR REPLACE FUNCTION test_insert( )
> RETURNS VOID
> LANGUAGE plpgsql
> AS
> $$
> DECLARE
>     _i INTEGER;
> BEGIN
>     FOR _i IN 0..100000 LOOP
>         INSERT INTO test (a,b,c,d,e,f) VALUES (1,2,3,4,5, _i);
>     END LOOP;
> END;
> $$;
> CREATE FUNCTION
> Time: 51.948 ms
>
> test=> SELECT test_insert();
>  test_insert
> -------------
>
> (1 row)
>
> Time: 1885.382 ms

I tested this one and it took 4 seconds, compared to the jdbc insert which took 14 seconds, so it's a lot faster, but not as fast as the SELECT INTO. I also tested an INSERT INTO ... SELECT FROM, which took 1.8 seconds; now we are starting to talk about real performance.

> However COPY is much faster because the parsing overhead and de-escaping of data is faster. COPY is optimized for throughput.
>
> So, advice: for optimum throughput, have your application build chunks of data into text files and use COPY. Or if your client lib supports the copy interface, use it.

I did test COPY, i.e. the jdbc COPY patch for pg 8.1; it performs at approx 1.8 seconds :) The test was done with text input; I am going to test it with binary input, which I expect will increase the performance by 20-50%.

All these tests have been performed on a laptop with a Kubuntu 6.10 version of pg 8.1 without any special pg performance tuning. So I expect that compiling the latest pg, doing some tuning on it and testing it on a representative server will give it an additional boost in performance.

The key here is that with an abundance of performance, I can experiment with the solution in a completely different way than if I had any "artificial" restrictions.

> You will need a fast disk system with xlog and data on separate disks, several CPU cores (1 insert thread will max out 1 core, use the others for selects), lots of RAM so index updates don't need to seek, and tuning of bgwriter and checkpoints to avoid load spikes.

Will have a look at it.

regards

thomas
Michael Glaesemann wrote:
>
> As they're individual inserts, I think what you're seeing is overhead from calling this statement 100,000 times, not just on the server but also the overhead through JDBC. For comparison, try
>
> CREATE TABLE ciu_data_type_copy (LIKE ciu_data_type);
>
> INSERT INTO ciu_data_type_copy (id, loc_id, value3, value5, value8, value9, value10, value11)
> SELECT id, loc_id, value3, value5, value8, value9, value10, value11
> FROM ciu_data_type;
>
> I think this would be more comparable to what you're seeing.

This is much faster than my previous solution, but I also tested two other solutions:

- a stored function with array arguments, which performed 3 times better
- jdbc with the COPY patch, which performed 8.4 times faster with text input; I expect binary input to be even faster

regards

thomas
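Thomas doesn't show his array-based function, but as a rough sketch of what such a stored function could look like on 8.1, looping over parallel arrays (the function name, argument layout and looping style are assumptions, and only four of the columns are shown):

CREATE OR REPLACE FUNCTION insert_ciu_rows(
    p_ids integer[], p_loc_ids integer[], p_value3 integer[], p_value8 bigint[])
RETURNS void
LANGUAGE plpgsql AS
$$
DECLARE
    _i integer;
BEGIN
    -- One array element per row; all arrays are assumed to be the same length
    FOR _i IN array_lower(p_ids, 1) .. array_upper(p_ids, 1) LOOP
        INSERT INTO ciu_data_type (id, loc_id, value3, value8)
        VALUES (p_ids[_i], p_loc_ids[_i], p_value3[_i], p_value8[_i]);
    END LOOP;
END;
$$;

-- Example call with two rows' worth of data:
SELECT insert_ciu_rows(ARRAY[1,2], ARRAY[10,20], ARRAY[3,3], ARRAY[8,8]::bigint[]);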
Michael Stone wrote:
> On Tue, Jul 17, 2007 at 11:01:15PM +0200, Thomas Finneid wrote:
>> Arjen van der Meijden wrote:
>>> Have you also tried the COPY statement? Afaik select into is similar to what happens in there.
>>
>> No, because COPY only works from file to db or vice versa, not from table to table.
>
> I don't understand how the insert you described is table to table?

SELECT INTO is table to table; so is INSERT INTO ... SELECT FROM.

regards

thomas
Michael Stone wrote:
> On Tue, Jul 17, 2007 at 10:58:01PM +0200, Thomas Finneid wrote:
>> I am not sure I understand you correctly here: are you saying that SELECT INTO in 8.1 disables WAL logging and uses just a single fsync at the end? In that case it means that I could disable WAL as well and achieve the same performance, does it not?
>
> Yes. The difference is that the select into optimization just means that if the system crashes the data you're inserting is invalid (and is properly cleaned up), whereas disabling the WAL means that if the system crashes everything is invalid (and can't be cleaned up).

So, how does one (temporarily) disable WAL logging? Or, for example, disable WAL logging for a temporary table?

Regards,

Adriaan van Os
Adriaan van Os wrote:
> So, how does one (temporarily) disable WAL logging? Or, for example, disable WAL logging for a temporary table?

Operations on temporary tables are never WAL logged. Operations on other tables are, and there's no way to disable it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
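For illustration, a minimal sketch of using a temporary table as an un-WAL-logged staging area, along the lines Heikki's answer permits; the staging table name is hypothetical, and note the final insert into the permanent table is still WAL-logged as usual:

-- The temporary table is private to the session and never WAL-logged
CREATE TEMPORARY TABLE staging_tmp (LIKE ciu_data_type);

-- bulk-load into staging_tmp here (COPY, batched inserts, ...), then:
INSERT INTO ciu_data_type SELECT * FROM staging_tmp;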
On Wed, Jul 18, 2007 at 09:13:14PM +0200, Thomas Finneid wrote:
> Michael Stone wrote:
>> I don't understand how the insert you described is table to table?
>
> SELECT INTO is table to table; so is INSERT INTO ... SELECT FROM.

I could have sworn that at least one of the examples you gave didn't have any select. Doesn't really matter.

Mike Stone