Thread: psql vs perl prepared inserts
I've been experimenting with loading a large amount of data into a fairly simple database using both psql and perl prepared statements. Unfortunately I'm seeing no appreciable difference between the two methods, where I was under the impression that prepared statements should be much faster (in my case, they are slightly slower). I have included a pseudocode/subset of the perl code I use below. You can see the prepare statement outside the loop and the execute statement inside the loop. Alternatively you can see that I write every INSERT statement to a text file which I then load by using `psql dbname -f bulksql_load.sql`. Normally I only have either the prepared-statement execute or the print-to-file in the loop, but I've included both to make the pseudocode smaller.

Using a simple `time` command from my system it looks like the execution time for loading all the data in both scenarios is about 50 seconds. FYI, the infile has 18k lines, and after parsing and the db enforcing uniqueness there are 15.5k rows in the results table of the db. This works out to ~300 inserts per second on pgsql 7.3.2 with fsync turned off. I think that is decent performance for this old box; I'm just confused as to why the prepared statements don't seem to give any speed boost as advertised.

Could the fact that many of my inserts have 15 columns slow down the prepared statements to the point where they're no longer useful as a speed enhancement? Or perhaps it's because I'm explicitly inserting each field/value pair, even if many of them are null (which I think is generally considered a good habit). Any other suggestions you might have would be welcome. Please forgive me if I should have posted this to pgsql-performance or some perl list, but I think the comparison with psql makes it relevant. Hopefully this won't be too messy:

#!/usr/bin/perl
use DBI;

open (IN,"scan.nbe");
open (OUT,">bulksql_load.sql");

$dbh = DBI->connect("dbi:Pg:dbname=nessus");

# prepare once, outside the loop
$sth_res = $dbh->prepare("INSERT INTO results (scan_id,subnet,host,service,plugin_id,type,description,see_also,solution,risk_factor,cve_id,bid_id,port,proto,service_name) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
$sth_tim = $dbh->prepare("INSERT INTO timestamps (scan_id,hostip_or_nul,start_or_end,scan_time) VALUES (?,?,?,?)");

while (<IN>){
  if (/^timestamps/){
    parse_ts();
    # method 1: execute the prepared statement
    $sth_tim->execute($scan_id,$hostip_or_nul,$start_or_end,$scan_time);
    # method 2: write the literal INSERT for later loading via psql
    # (values are assumed to come back from parse_ts() already quoted/escaped)
    print OUT "INSERT INTO timestamps (scan_id,hostip_or_nul,start_or_end,scan_time) VALUES ($scan_id,$hostip_or_nul,$start_or_end,$scan_time);\n";
  } elsif (/^results/) {
    parse_res();
    $sth_res->execute($scan_id,$subnet,$host,$service,$plugin_id,$type,$description,$see_also,$solution,$risk_factor,$cve_id,$bid_id,$port,$proto,$service_name);
    print OUT "INSERT INTO results (scan_id,subnet,host,service,plugin_id,type,description,see_also,solution,risk_factor,cve_id,bid_id,port,proto,service_name) VALUES ($scan_id,$subnet,$host,$service,$plugin_id,$type,$description,$see_also,$solution,$risk_factor,$cve_id,$bid_id,$port,$proto,$service_name);\n";
  }
}
Matt Van Mater <matt.vanmater@gmail.com> writes:
> I've been experimenting with loading a large amount of data into a
> fairly simple database using both psql and perl prepared statements.
> Unfortunately I'm seeing no appreciable difference between the two
> methods, where I was under the impression that prepared statements
> should be much faster (in my case, they are slightly slower).

They should be faster ... if the client library is really using
server-side prepared statements, and not trying to fake it. Up till 7.4
came out there wasn't any very useful way to use server-side prepared
statements in a driver, and I think only the very latest version of
DBD::Pg knows anything about it. See this thread for some recent
discussion:
http://archives.postgresql.org/pgsql-interfaces/2005-04/msg00029.php

			regards, tom lane
On 4/12/05, Matt Van Mater <matt.vanmater@gmail.com> wrote:
> I've been experimenting with loading a large amount of data into a
> fairly simple database using both psql and perl prepared statements.
> Unfortunately I'm seeing no appreciable difference between the two
> methods, where I was under the impression that prepared statements
> should be much faster (in my case, they are slightly slower).

I've been playing with a similar issue, and in my case the best solution
for bulk inserts was using perl to format the data in a form suitable
for the COPY command. I believe there may be problems if you have RULEs
on the table you want to copy data into (IIRC such RULEs will be
ignored).

For a test you might want to try also this approach (both from perl and
from psql):

$dbh->do('PREPARE sth_tim (int,inet,boolean,timestamptz) AS INSERT
  INTO timestamps VALUES ($1,$2,$3,$4)');
$sth_tim = $dbh->prepare("EXECUTE sth_tim(?,?,?,?)");

...and later execute it (and likewise with psql). If you see a gain in
speed with perl, it means your DBD::Pg wasn't using server-side prepared
statements. (And there is quite a chance that your psql will outperform
perl using this syntax.)

Regards,
Dawid

PS: I have not tested these statements, yet they should be ok; make sure
the argument list (int,inet,boolean...) is correct for your data.
On Apr 13, 2005, at 4:12 AM, Dawid Kuroczko wrote:
> On 4/12/05, Matt Van Mater <matt.vanmater@gmail.com> wrote:
>> I've been experimenting with loading a large amount of data into a
>> fairly simple database using both psql and perl prepared statements.
>> Unfortunately I'm seeing no appreciable difference between the two
>> methods, where I was under the impression that prepared statements
>> should be much faster (in my case, they are slightly slower).
>
> I've been playing with a similar issue, and in my case the best
> solution for bulk inserts was using perl to format the data in a form
> suitable for the COPY command.

I second this approach. Generally, getting the data into the database
can be done VERY quickly (for the 18k rows you have, it would likely be
instantaneous to copy them). I often create a separate "loader" schema
into which I load text files. Then, I can use SQL, triggers, or
functions to "clean up" the data, enforce referential integrity, etc.
within the database. If you have perl code to do this, you can probably
modify it just slightly to be used in a pl/perl function to do the same
thing as before, but now it is done on the server side and will probably
be significantly faster.

Sean
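[Editor's note: a minimal sketch of the loader-schema idea described above, assuming a hypothetical staging table named loader_results_raw with an abbreviated column list, and a pre-formatted COPY file at an illustrative path. COPY FROM a file path is read by the server process and needs superuser rights; use COPY FROM STDIN through psql if that is a problem.]

#!/usr/bin/perl
# Sketch only: table layout, column subset, and file path are illustrative.
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=nessus", "", "", { RaiseError => 1 });

# 1. Stage the raw data into an unconstrained loader table.
$dbh->do("CREATE TABLE loader_results_raw (scan_id int, subnet text, host text, plugin_id int, description text)");
$dbh->do("COPY loader_results_raw FROM '/tmp/results.copy'");   # server-side path

# 2. Clean up inside the database: de-duplicate while moving rows into
#    the real table, so constraints are checked in one pass.
$dbh->do("INSERT INTO results (scan_id, subnet, host, plugin_id, description)
          SELECT DISTINCT scan_id, subnet, host, plugin_id, description
          FROM loader_results_raw");

$dbh->do("DROP TABLE loader_results_raw");
$dbh->disconnect;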
Dawid Kuroczko wrote:
> For a test you might want to try also this approach (both from perl and
> from psql):
>
> $dbh->do('PREPARE sth_tim (int,inet,boolean,timestamptz) AS INSERT
>   INTO timestamps VALUES ($1,$2,$3,$4)');
> $sth_tim = $dbh->prepare("EXECUTE sth_tim(?,?,?,?)");
>
> ...and later execute it (and likewise with psql). If you see a gain in
> speed with perl, it means your DBD::Pg wasn't using server-side
> prepared statements.

The intent of prepared statements is to reduce the overhead of running
the parser, rewriter and planner multiple times for a statement that is
executed multiple times. For an INSERT query without any sub-selects
that is not rewritten by any rules, the cost to parse, rewrite and plan
the statement is trivial. So I wouldn't expect prepared statements to be
a big win -- you would gain a lot more from batching multiple inserts
into a single transaction, and more still from using COPY.

-Neil
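[Editor's note: a minimal sketch of the batched-transaction suggestion, assuming DBD::Pg with AutoCommit turned off. The timestamps column list comes from the original post; the '|'-split parsing and the 1000-row commit interval are illustrative stand-ins, not the poster's parse_ts() code.]

#!/usr/bin/perl
# Sketch: batch many INSERTs into one transaction instead of letting
# AutoCommit wrap every row in its own transaction.
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=nessus", "", "",
                       { AutoCommit => 0, RaiseError => 1 });

my $sth = $dbh->prepare(
    "INSERT INTO timestamps (scan_id,hostip_or_nul,start_or_end,scan_time)
     VALUES (?,?,?,?)");

my $n = 0;
while (my $line = <STDIN>) {
    chomp $line;
    my @fields = split /\|/, $line;      # hypothetical input format
    $sth->execute(@fields[0..3]);        # a failed INSERT aborts the current batch
    $dbh->commit if ++$n % 1000 == 0;    # commit in batches of 1000 rows
}
$dbh->commit;                            # commit the final partial batch
$dbh->disconnect;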
Thanks to all who replied. Thanks for the tip on that last thread Tom, I
don't know how I missed it. I have a hunch that it's not applicable to
me at this time because I'm running a year and a half old software
(included in OpenBSD 3.4), but I will have to check which version of
DBD::Pg was installed.

> The intent of prepared statements is to reduce the overhead of running
> the parser, rewriter and planner multiple times for a statement that is
> executed multiple times. For an INSERT query without any sub-selects
> that is not rewritten by any rules, the cost to parse, rewrite and plan
> the statement is trivial. So I wouldn't expect prepared statements to
> be a big win -- you would gain a lot more from batching multiple
> inserts into a single transaction, and more still from using COPY.

I was thinking something along the same lines, and was considering using
the COPY statement as my next step, but as someone mentioned then I have
to load it into a temporary table and then do some more internal magic
to preserve referential integrity, unique constraints, etc. For that
reason I was hoping to keep it in perl, and it's always nice to keep
everything in a single neat portable package.

Also, I forgot to mention earlier that I tried using transactions to
speed things up, but since I expect to see certain inserts fail I would
need to rework my code so the whole transaction doesn't fail if one
insert goes bad. This is somewhat contrary to the purpose of
transactions so I'm not sure how to accomplish this. I saw roughly a 20%
speed improvement by turning autocommit off and only committing at the
end of parsing each file.

I think in the end I need to check the version of my Pg driver and
perhaps upgrade to 7.4 or 8.0 in order to take advantage of server-side
prepared statements. This is only a development box and I'm doing this
mostly as an academic exercise that will someday help me speed up the
production side, so upgrading isn't out of the question.

Matt
On Wed, Apr 13, 2005 at 09:57:09 -0400, Matt Van Mater <matt.vanmater@gmail.com> wrote:
>
> Also, I forgot to mention earlier that I tried using transactions to
> speed things up, but since I expect to see certain inserts fail I
> would need to rework my code so the whole transaction doesn't fail if
> one insert goes bad. This is somewhat contrary to the purpose of
> transactions so I'm not sure how to accomplish this. I saw roughly a
> 20% speed improvement by turning autocommit off and only committing at
> the end of parsing each file.

You might get a small benefit using savepoints in version 8.
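[Editor's note: a sketch of the savepoint pattern, assuming a PostgreSQL 8.0 server and plain SQL SAVEPOINT commands issued through DBI; the '|'-split parsing is again an illustrative stand-in. A failed INSERT then only rolls back to the savepoint instead of poisoning the whole transaction.]

#!/usr/bin/perl
# Sketch: keep one big transaction but survive individual bad rows by
# wrapping each INSERT in a savepoint (requires PostgreSQL 8.0 or later).
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=nessus", "", "",
                       { AutoCommit => 0, RaiseError => 1 });
my $sth = $dbh->prepare(
    "INSERT INTO timestamps (scan_id,hostip_or_nul,start_or_end,scan_time)
     VALUES (?,?,?,?)");

while (my $line = <STDIN>) {
    chomp $line;
    my @fields = split /\|/, $line;            # hypothetical input format
    $dbh->do("SAVEPOINT per_row");
    eval { $sth->execute(@fields[0..3]) };
    if ($@) {
        # duplicate key or other error: discard only this row's work
        $dbh->do("ROLLBACK TO SAVEPOINT per_row");
    }
    $dbh->do("RELEASE SAVEPOINT per_row");     # drop the savepoint either way
}
$dbh->commit;
$dbh->disconnect;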
Neil Conway wrote:
> For an INSERT query without any sub-selects that is not rewritten by
> any rules, the cost to parse, rewrite and plan the statement is
> trivial. So I wouldn't expect prepared statements to be a big win --
> you would gain a lot more from batching multiple inserts into a single
> transaction, and more still from using COPY.

FWIW, when testing pgstream [1] I typically see a 50% increase in
execution speed when switching to prepared statements in such a
scenario. I'm attaching a small test program that inserts 10000 rows
into 5 columns, first without and then with prepared statements, and
displays the elapsed time.

Example of results:
elapsed time in loop 0 is 1873 ms (PQexec)
elapsed time in loop 1 is 1136 ms (PQexecPrepared)

That's with unix domain sockets and an 8.0.1 server.

[1] a thin C++ layer on top of libpq (http://manitou-mail.org/pgstream)
that happens to have a unified API for prepared/non-prepared statements.

-- Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

#include "pgstream.h"
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char** argv)
{
  const int loops=10000;
  pg_cnx cnx;
  char buf[]="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
  try {
    struct timeval tv1, tv2;
    if (argc < 2) {
      std::cerr << "Usage: " << argv[0] << " connection_string\n";
      exit(1);
    }
    cnx.connect(argv[1]);
    for (int j=0; j<2; j++) {
      cnx.set_option("bind_variables", j>0);
      cnx.set_option("prepare_statements", j>0);
      {
        pg_trans trans(cnx);
        pg_stmt("CREATE TABLE txt1(pk1 int,t1 text,t2 text,"
                "t3 text, t4 text)", cnx);
        gettimeofday(&tv1,NULL);
        pg_stream s1("INSERT INTO txt1(pk1,t1,t2,t3,t4) "
                     "VALUES(:i,:p1,:p2,:p3,:p4)", cnx);
        for (int i=0; i<loops; i++) {
          s1 << i << buf << buf << buf << buf;
        }
        gettimeofday(&tv2,NULL);
        pg_stmt("DROP TABLE txt1", cnx);
        trans.commit();
        printf("elapsed time in loop %d is %d ms\n", j,
               (tv2.tv_sec-tv1.tv_sec)*1000+(tv2.tv_usec-tv1.tv_usec)/1000);
      }
    }
  }
  catch(pg_excpt p) {
    std::cerr << p.full_error_txt();
  }
  return 0;
}
Neil Conway <neilc@samurai.com> writes:
> The intent of prepared statements is to reduce the overhead of running
> the parser, rewriter and planner multiple times for a statement that is
> executed multiple times. For an INSERT query without any sub-selects
> that is not rewritten by any rules, the cost to parse, rewrite and plan
> the statement is trivial.

You'd be surprised ... I was looking into this just the other day, with
a test case that looks like

create table bench1 (id int NOT NULL, id2 int NOT NULL, id3 int NOT NULL, dummy1 char(30));
create unique index bench1_index_ on bench1 (id,id2);
create index bench1_index_1 on bench1 (id3);
begin;
insert into bench1 values (0,0,0,'ABCDEFGHIJ');
... 300,000 inserts ...
insert into bench1 values (167151,167151,167151,'CDEFGHIJKL');
commit;

According to gprof, the above-1% functions are

  %   cumulative    self              self     total
 time    seconds  seconds     calls   s/call   s/call  name
 6.62      53.38    53.38    300007     0.00     0.00  yyparse
 5.31      96.19    42.81  11808781     0.00     0.00  hash_search
 4.44     132.00    35.81  61268959     0.00     0.00  AllocSetAlloc
 4.23     166.13    34.13    902304     0.00     0.00  XLogInsert
 3.20     191.99    25.86  13688735     0.00     0.00  _bt_compare
 2.94     215.75    23.76  12158347     0.00     0.00  LWLockAcquire
 2.67     237.32    21.58   4500066     0.00     0.00  base_yylex
 2.56     258.00    20.68   6000510     0.00     0.00  SearchCatCache
 1.99     274.07    16.07  12160856     0.00     0.00  LWLockRelease
 1.88     289.25    15.18  13008925     0.00     0.00  hash_any
 1.49     301.25    12.01   2452386     0.00     0.00  PinBuffer
 1.36     312.25    11.00   1201324     0.00     0.00  fmgr_info_cxt_security
 1.36     323.24    10.98    300000     0.00     0.00  planner
 1.19     332.81     9.57  20700142     0.00     0.00  MemoryContextAllocZeroAligned

I don't trust gprof's tree-structured breakdown entirely, but it puts a
pretty significant fraction of the blame on parse/plan activities:

                3.66  767.69  300007/300007     PostgresMain [4]
[5]    95.6     3.66  767.69  300007            exec_simple_query [5]
                6.13  283.12  300007/300007     PortalRun [6]
                0.48  167.39  300007/300007     pg_analyze_and_rewrite [9]
                0.47  122.85  300007/300007     pg_plan_queries [16]
                1.56   93.29  300007/300007     pg_parse_query [23]
                0.62   34.78  300007/300007     pg_rewrite_queries [52]
                0.99   17.39  300007/300007     PortalDrop [79]
                0.56   16.26  600014/600014     finish_xact_command [84]
                1.19    6.89  300007/300007     CreatePortal [126]
                1.08    2.29  600014/600014     start_xact_command [186]
                1.88    0.36  300007/300007     PortalStart [218]
                0.76    1.44  300007/300007     pq_puttextmessage [220]
                1.35    0.00  300007/600017     set_ps_display [210]
                1.27    0.00  300007/300007     CreateCommandTag [271]
                0.89    0.22  300007/300007     printtup_create_DR [286]
                0.74    0.00  300007/300007     CreateDestReceiver [328]
                0.61    0.00  300007/600015     pgstat_report_activity [277]
                0.10    0.14  300007/13864259   pfree [112]
                0.23    0.00  300007/300007     PortalSetResultFormat [449]
                0.19    0.00  300007/300007     IsAbortedTransactionBlockState [474]
                0.07    0.00  300007/300007     printtup_destroy [564]
                0.05    0.00  300007/300007     PortalDefineQuery [580]
                0.03    0.00  300007/300007     EndCommand [617]
                0.01    0.00  300007/300007     BeginCommand [670]

That adds up to over 50% of the runtime spent in parse/rewrite/plan.

I haven't gotten around to converting the test case into a program that
can use a prepared INSERT command, but it looks plausible to expect a
factor of 2 or so speedup ... of course, using COPY would completely
blow this away, anyway ...

			regards, tom lane
On Apr 13, 2005, at 9:57 AM, Matt Van Mater wrote:
> Thanks to all who replied. Thanks for the tip on that last thread
> Tom, I don't know how I missed it. I have a hunch that it's not
> applicable to me at this time because I'm running a year and a half
> old software (included in OpenBSD 3.4), but I will have to check which
> version of DBD::Pg was installed.
>
>> The intent of prepared statements is to reduce the overhead of running
>> the parser, rewriter and planner multiple times for a statement that
>> is executed multiple times. For an INSERT query without any
>> sub-selects that is not rewritten by any rules, the cost to parse,
>> rewrite and plan the statement is trivial. So I wouldn't expect
>> prepared statements to be a big win -- you would gain a lot more from
>> batching multiple inserts into a single transaction, and more still
>> from using COPY.
>
> I was thinking something along the same lines, and was considering
> using the COPY statement as my next step, but as someone mentioned
> then I have to load it into a temporary table and then do some more
> internal magic to preserve referential integrity, unique constraints,
> etc. For that reason I was hoping to keep it in perl, and it's always
> nice to keep everything in a single neat portable package.

You can use pl/perl stored functions to do exactly what you do in perl
on the client side, but it happens much faster since the data is already
on the server. Then, your neat little portable package IS the database.

Sean
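[Editor's note: a rough sketch of what such a pl/perl function could look like, assuming PostgreSQL 8.0's pl/perl, which provides spi_exec_query and dollar quoting. The function name, the '|'-delimited input format, and the naive quoting are illustrative, not taken from the original code.]

-- Sketch only: hypothetical function and field layout.
CREATE FUNCTION load_timestamp_line(text) RETURNS integer AS $$
    # split one raw line of the scan file into its fields
    my @f = split /\|/, $_[0];
    # naive quoting for the sketch; real code needs proper escaping
    my @q = map { my $v = $_; $v =~ s/'/''/g; "'$v'" } @f;
    spi_exec_query("INSERT INTO timestamps "
        . "(scan_id,hostip_or_nul,start_or_end,scan_time) "
        . "VALUES (" . join(',', @q[0..3]) . ")");
    return 1;
$$ LANGUAGE plperl;

It could then be called once per staged line, for example from a driving query over a loader table, so the parsing and inserting both happen server-side.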
On Wed, Apr 13, 2005 at 09:57:09AM -0400, Matt Van Mater wrote:
> Also, I forgot to mention earlier that I tried using transactions to
> speed things up, but since I expect to see certain inserts fail I
> would need to rework my code so the whole transaction doesn't fail if
> one insert goes bad. This is somewhat contrary to the purpose of
> transactions so I'm not sure how to accomplish this.

Try the pgloader project at pgfoundry. It tries to insert all rows using
COPY and the rejected ones go to a file, using something akin to binary
search. It may be of some use. I haven't tried it so I'm not sure of its
maturity.

Let us know how it goes if you try it!

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"And as an added bonus, now my computer goes to the toilet for me,
leaving me free to spend time on more useful activities! yay slug
codefests!" (C. Parker)
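[Editor's note: the "akin to binary search" idea can also be approximated in plain DBI without COPY: try a whole chunk of rows in one transaction and, if it fails, split the chunk and recurse until the bad rows are isolated. A sketch under those assumptions; the '|'-delimited input and chunk handling are illustrative.]

#!/usr/bin/perl
# Sketch: insert a chunk of rows inside one transaction; on failure,
# split the chunk in half and retry, so a single bad row costs only
# O(log n) extra attempts instead of failing the whole load.
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=nessus", "", "",
                       { AutoCommit => 0, RaiseError => 1, PrintError => 0 });
my $sth = $dbh->prepare(
    "INSERT INTO timestamps (scan_id,hostip_or_nul,start_or_end,scan_time)
     VALUES (?,?,?,?)");

sub load_chunk {
    my @rows = @_;                       # each row is an arrayref of 4 values
    return unless @rows;
    my $ok = eval {
        $sth->execute(@$_) for @rows;
        $dbh->commit;
        1;
    };
    return if $ok;
    $dbh->rollback;
    if (@rows == 1) {
        warn "rejected row: @{$rows[0]}\n";   # single bad row isolated
        return;
    }
    my $mid = int(@rows / 2);
    load_chunk(@rows[0 .. $mid - 1]);
    load_chunk(@rows[$mid .. $#rows]);
}

# hypothetical '|'-delimited input on stdin (fine to slurp at 18k lines)
my @rows = map { chomp; [ (split /\|/)[0 .. 3] ] } <STDIN>;
load_chunk(@rows);
$dbh->disconnect;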
On 4/13/05, Matt Van Mater <matt.vanmater@gmail.com> wrote:
> > The intent of prepared statements is to reduce the overhead of running
> > the parser, rewriter and planner multiple times for a statement that is
> > executed multiple times. For an INSERT query without any sub-selects
> > that is not rewritten by any rules, the cost to parse, rewrite and plan
> > the statement is trivial. So I wouldn't expect prepared statements to be
> > a big win -- you would gain a lot more from batching multiple inserts
> > into a single transaction, and more still from using COPY.
> I was thinking something along the same lines, and was considering
> using the COPY statement as my next step, but as someone mentioned
> then I have to load it into a temporary table and then do some more
> internal magic to preserve referential integrity, unique constraints,
> etc. For that reason I was hoping to keep it in perl, and it's always
> nice to keep everything in a single neat portable package.

I believe that COPY handles things like unique constraints and
referential integrity just fine (of the latter I am not sure) -- well,
it's hard to imagine PostgreSQL allowing itself to lose the integrity of
the data, don't you agree? A simple test:

qnex=# CREATE TABLE a (a int PRIMARY KEY);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "a_pkey" for table "a"
CREATE TABLE
qnex=# COPY a FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1
>> 2
>> 3
>> 1
>> \.
ERROR:  duplicate key violates unique constraint "a_pkey"
CONTEXT:  COPY a, line 4: "1"

...you may have to rethink your design if there are any RULEs or
TRIGGERs associated with said table. Apart from that, there should be no
problems beyond the risk that one "bad" row will cancel the whole COPY
command, so if you trust your data, it will be fine. :)

Regards,
Dawid
Just to add some clarity to Dawid's fine comments:

On 4/14/05, Dawid Kuroczko <qnex42@gmail.com> wrote:
> On 4/13/05, Matt Van Mater <matt.vanmater@gmail.com> wrote:
> I believe that COPY handles things like unique constraints and
> referential integrity just fine (of the latter I am not sure) -- well,
> it's hard to imagine PostgreSQL allowing itself to lose the integrity
> of the data, don't you agree?

I do, and so does PG. :) Everything that would normally happen with an
INSERT, including firing TRIGGERs, checking FKEYs, setting DEFAULTs,
running CHECK constraints -- everything -- happens with a COPY import.

The *one and only* exception to the above statement is that RULEs do not
get used on COPY. The reason for this is that RULEs rewrite queries, and
after the COPY FROM header there isn't a query to rewrite, it's just a
pile of data.

This is in the docs, but it's not easy to find. It's actually just one
line in the "Notes" section of
http://www.postgresql.org/docs/8.0/static/sql-copy.html . Perhaps we
should add some more verbiage (he says, non-voluntarily...)?

-- 
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org