Thread: psql vs perl prepared inserts

psql vs perl prepared inserts

From
Matt Van Mater
Date:
I've been experimenting with loading a large amount of data into a
fairly simple database using both psql and perl prepared statements.
Unfortunately I'm seeing no appreciable differences between the two
methods, where I was under the impression that prepared statements
should be much faster (in my case, they are slightly slower).

I have included a pseudocode/subset of the perl code I use below.  You
can see the prepare statement outside the loop and the execute
statement inside the loop.  Alternatively you can see that I write
every INSERT statement to a text file which I then load by using `psql
dbname -f bulksql_load.sql`.  Normally I only have either the prepare
or the print-to-file in the loop, but I've included both to make the
pseudocode smaller.

Using a simple `time` command from my system it looks like the
execution time for loading all the data in both scenarios is about 50
seconds.  FYI, the infile has 18k lines, and after parsing and the db
enforcing uniqueness there are 15.5k rows in the results table of the
db.  This works out to ~300 inserts per second on pgsql 7.3.2
with fsync turned off.  I think that is a decent performance for this
old box, I'm just confused as to why the prepared statements don't
seem to give any speed boost as advertised.

Could the fact that many of my inserts have 15 columns slow down the
prepared statements to the point where they're no longer useful as a
speed enhancement?  Or perhaps it's because I'm explicitly inserting
each field/value pair, even if many of them are null (which I think is
generally considered a good habit).  Any other suggestions you might
have would be welcome.  Please forgive me if I should have posted this
to pgsql-performance or some perl list, but I think the comparison
with psql makes it relevant.


Hopefully this won't be too messy:

#!/usr/bin/perl

use DBI;

open (IN, "scan.nbe")           or die "can't read scan.nbe: $!";
open (OUT, ">bulksql_load.sql") or die "can't write bulksql_load.sql: $!";

$dbh = DBI->connect("dbi:Pg:dbname=nessus");

# Statements are prepared once, outside the loop ...
$sth_res = $dbh->prepare("INSERT INTO results
    (scan_id,subnet,host,service,plugin_id,type,description,see_also,
     solution,risk_factor,cve_id,bid_id,port,proto,service_name)
    VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
$sth_tim = $dbh->prepare("INSERT INTO timestamps
    (scan_id,hostip_or_nul,start_or_end,scan_time) VALUES (?,?,?,?)");

while (<IN>) {

    if (/^timestamps/) {
        parse_ts();    # sets $scan_id, $hostip_or_nul, $start_or_end, $scan_time

        # ... and executed once per input line.
        $sth_tim->execute($scan_id, $hostip_or_nul, $start_or_end, $scan_time);

        # The psql variant: the same row written out as a plain INSERT
        # (note: non-numeric values would need quoting/escaping here).
        print OUT "INSERT INTO timestamps
            (scan_id,hostip_or_nul,start_or_end,scan_time) VALUES
            ($scan_id,$hostip_or_nul,$start_or_end,$scan_time);\n";

    } elsif (/^results/) {
        parse_res();   # sets the fifteen result fields

        $sth_res->execute($scan_id, $subnet, $host, $service, $plugin_id,
                          $type, $description, $see_also, $solution,
                          $risk_factor, $cve_id, $bid_id, $port, $proto,
                          $service_name);

        print OUT "INSERT INTO results
            (scan_id,subnet,host,service,plugin_id,type,description,see_also,
             solution,risk_factor,cve_id,bid_id,port,proto,service_name) VALUES
            ($scan_id,$subnet,$host,$service,$plugin_id,$type,$description,
             $see_also,$solution,$risk_factor,$cve_id,$bid_id,$port,$proto,
             $service_name);\n";
    }
}

Re: psql vs perl prepared inserts

From
Tom Lane
Date:
Matt Van Mater <matt.vanmater@gmail.com> writes:
> I've been experimenting with loading a large amount of data into a
> fairly simple database using both psql and perl prepared statements.
> Unfortunately I'm seeing no appreciable differences between the two
> methods, where I was under the impression that prepared statements
> should be much faster (in my case, they are slightly slower).

They should be faster ... if the client library is really using
server-side prepared statements, and not trying to fake it.  Up till
7.4 came out there wasn't any very useful way to use server-side
prepared statements in a driver, and I think only the very latest
version of DBD::Pg knows anything about it.  See this thread for
some recent discussion:
http://archives.postgresql.org/pgsql-interfaces/2005-04/msg00029.php

            regards, tom lane

Re: psql vs perl prepared inserts

From
Dawid Kuroczko
Date:
On 4/12/05, Matt Van Mater <matt.vanmater@gmail.com> wrote:
> I've been experimenting with loading a large amount of data into a
> fairly simple database using both psql and perl prepared statements.
> Unfortunately I'm seeing no appreciable differences between the two
> methods, where I was under the impression that prepared statements
> should be much faster (in my case, they are slightly slower).

I've been playing with a similar issue, and in my case the best solution
for bulk inserts was using perl to format the data in a form suitable for
the COPY command.

I believe there may be problems if you have RULEs on the table you want
to copy data into (IIRC such RULEs will be ignored).

For a test you might want to try also this approach (both from perl and
from psql):

$dbh->do('PREPARE sth_tim (int,inet,boolean,timestamptz) AS INSERT
INTO timestamps VALUES ($1,$2,$3,$4)');
$sth_tim  = $dbh->prepare("EXECUTE sth_tim(?,?,?,?)");

...and later execute it (and likewise with psql).  If you see a gain in speed
with perl, it means your DBD::Pg wasn't using server-side prepared
statements.  (And there is quite a chance that psql will outperform
perl using this syntax.)

   Regards,
      Dawid

PS: I have not tested these statements, yet they should be ok; make
sure the argument list (int,inet,boolean...) is correct for your data.
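
For instance, plugged into your existing loop, the approach might look
roughly like this (same caveat applies -- untested, the parameter types are
only a guess at your schema, and parse_ts() plus its variables come from
your own script):

use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=nessus", undef, undef,
                       { RaiseError => 1 });

# Server-side prepared statement, created once per connection.
$dbh->do('PREPARE sth_tim (int, inet, boolean, timestamptz) AS
          INSERT INTO timestamps VALUES ($1, $2, $3, $4)');

# From here on the driver only ships an EXECUTE plus the parameter values.
my $sth_tim = $dbh->prepare('EXECUTE sth_tim (?, ?, ?, ?)');

while (<IN>) {
    next unless /^timestamps/;
    parse_ts();
    $sth_tim->execute($scan_id, $hostip_or_nul, $start_or_end, $scan_time);
}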

Re: psql vs perl prepared inserts

From
Sean Davis
Date:
On Apr 13, 2005, at 4:12 AM, Dawid Kuroczko wrote:

> On 4/12/05, Matt Van Mater <matt.vanmater@gmail.com> wrote:
>> I've been experimenting with loading a large amount of data into a
>> fairly simple database using both psql and perl prepared statements.
>> Unfortunately I'm seeing no appreciable differences between the two
>> methods, where I was under the impression that prepared statements
>> should be much faster (in my case, they are slightly slower).
>
> I've been playing with a similar issue, and in my case the best solution
> for bulk inserts was using perl to format the data in a form suitable for
> the COPY command.


I second this approach.  Generally, getting the data into the database
can be done VERY quickly (for the 18k rows you have, it would likely be
instantaneous to copy them).  I often create a separate "loader" schema
into which I load text files.  Then, I can use SQL, triggers, or
functions to "clean up" the data, enforce referential integrity, etc.
within the database.  If you have perl code to do this, you can
probably modify it just slightly to be used in a pl/perl function to do
the same thing as before, but now it is done on the server side and
will probably be significantly faster.
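
As a very rough sketch of the idea (untested; it assumes a "loader" schema
already exists, picks an arbitrary natural key, and uses a server-side COPY,
which needs superuser rights and a file that lives on the database server):

use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=nessus", undef, undef,
                       { RaiseError => 1, AutoCommit => 0 });

# 1. Stage the raw rows in a constraint-free copy of the target table.
$dbh->do("CREATE TABLE loader.results_raw AS SELECT * FROM results LIMIT 0");
$dbh->do("COPY loader.results_raw FROM '/tmp/results.tsv'");

# 2. Clean up inside the database: de-duplicate, skip rows that already
#    exist, then move the survivors into the real table.
$dbh->do(q{
    INSERT INTO results
    SELECT DISTINCT ON (scan_id, host, plugin_id) *
      FROM loader.results_raw r
     WHERE NOT EXISTS (SELECT 1 FROM results t
                        WHERE t.scan_id   = r.scan_id
                          AND t.host      = r.host
                          AND t.plugin_id = r.plugin_id)
});

$dbh->do("DROP TABLE loader.results_raw");
$dbh->commit;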

Sean


Re: psql vs perl prepared inserts

From
Neil Conway
Date:
Dawid Kuroczko wrote:
> For a test you might want to try also this approach (both from perl and
> from psql):
>
> $dbh->do('PREPARE sth_tim (int,inet,boolean,timestamptz) AS INSERT
> INTO timestamps VALUES ($1,$2,$3,$4)');
> $sth_tim  = $dbh->prepare("EXECUTE sth_tim(?,?,?,?)");
>
> ...and later execute it (and likewise with psql).  If you see a gain in speed
> with perl, it means your DBD::Pg wasn't using server-side prepared
> statements.

The intent of prepared statements is to reduce the overhead of running
the parser, rewriter and planner multiple times for a statement that is
executed multiple times. For an INSERT query without any sub-selects
that is not rewritten by any rules, the cost to parse, rewrite and plan
the statement is trivial. So I wouldn't expect prepared statements to be
a big win -- you would gain a lot more from batching multiple inserts
into a single transaction, and more still from using COPY.
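
As a rough illustration of the COPY route from Perl (an untested sketch
reusing parse_ts() from the original post; pg_putcopydata/pg_putcopyend
need a reasonably recent DBD::Pg -- older releases spell them
pg_putline/pg_endcopy):

my $dbh = DBI->connect("dbi:Pg:dbname=nessus", undef, undef,
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do("COPY timestamps (scan_id,hostip_or_nul,start_or_end,scan_time)
          FROM STDIN");

while (<IN>) {
    next unless /^timestamps/;
    parse_ts();
    # COPY's text format: tab-separated columns, \N for NULL; embedded
    # tabs or backslashes in the data would still need escaping.
    my @row = map { defined $_ ? $_ : '\N' }
              ($scan_id, $hostip_or_nul, $start_or_end, $scan_time);
    $dbh->pg_putcopydata(join("\t", @row) . "\n");
}
$dbh->pg_putcopyend;
$dbh->commit;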

-Neil

Re: psql vs perl prepared inserts

From
Matt Van Mater
Date:
Thanks to all who replied.  Thanks for the tip on that last thread
Tom, I don't know how I missed it.  I have a hunch that it's not
applicable to me at this time because I'm running a year and a half
old software (included in OpenBSD 3.4), but I will have to check which
version of DBD::Pg was installed.

> The intent of prepared statements is to reduce the overhead of running
> the parser, rewriter and planner multiple times for a statement that is
> executed multiple times. For an INSERT query without any sub-selects
> that is not rewritten by any rules, the cost to parse, rewrite and plan
> the statement is trivial. So I wouldn't expect prepared statements to be
> a big win -- you would gain a lot more from batching multiple inserts
> into a single transaction, and more still from using COPY.

I was thinking something along the same lines, and was considering
using the COPY statement as my next step, but as someone mentioned
then I have to load it into a temporary database and then do some more
internal magic to preserve referential integrity, unique constraints,
etc.  For that reason I was hoping to keep it in perl, and it's always
nice to keep everything in a single neat portable package.

Also, I forgot to mention earlier that I tried using transactions to
speed things up, but since I expect to see certain inserts fail I
would need to rework my code so the whole transaction doesn't fail if
one insert goes bad.  This is somewhat contrary to the purpose of
transactions so I'm not sure how to accomplish this.  I saw roughly a
20% speed improvement by turning autocommit off and only committing at
the end of parsing each file.

I think in the end I need to check the version of my Pg driver and
perhaps upgrade to 7.4 or 8.0 in order to take advantage of the server
side prepared statements.  This is only a development box and I'm
doing this mostly as an academic exercise that will someday help me
speed up the production side, so upgrading isn't out of the question.
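
For reference, checking the driver is easy enough, and newer DBD::Pg
releases are supposed to expose a pg_server_prepare attribute that shows
(and controls) whether real server-side prepares are used -- I still need
to verify what the OpenBSD 3.4 package actually supports:

use DBI;
use DBD::Pg;

print "DBD::Pg version: $DBD::Pg::VERSION\n";

my $dbh = DBI->connect("dbi:Pg:dbname=nessus");
my $sth = $dbh->prepare("INSERT INTO timestamps VALUES (?,?,?,?)");

# Only present in newer DBD::Pg; on old versions this is simply undef.
print "server-side prepare: ",
      ($sth->{pg_server_prepare} ? "yes" : "no"), "\n";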

Matt

Re: psql vs perl prepared inserts

From
Bruno Wolff III
Date:
On Wed, Apr 13, 2005 at 09:57:09 -0400,
  Matt Van Mater <matt.vanmater@gmail.com> wrote:
>
> Also, I forgot to mention earlier that I tried using transactions to
> speed things up, but since I expect to see certain inserts fail I
> would need to rework my code so the whole transaction doesn't fail if
> one insert goes bad.  This is somewhat contrary to the purpose of
> transactions so I'm not sure how to accomplish this.  I saw roughly a
> 20% speed improvement by turning autocommit off and only committing at
> the end of parsing each file.

You might get a small benefit using savepoints in version 8.
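
Something along these lines (an untested sketch reusing the parse_ts() loop
from your script; the SAVEPOINT syntax is 8.0-only):

my $dbh = DBI->connect("dbi:Pg:dbname=nessus", undef, undef,
                       { RaiseError => 1, PrintError => 0, AutoCommit => 0 });

my $sth = $dbh->prepare("INSERT INTO timestamps
    (scan_id,hostip_or_nul,start_or_end,scan_time) VALUES (?,?,?,?)");

while (<IN>) {
    next unless /^timestamps/;
    parse_ts();

    # A savepoint per row: a duplicate-key error rolls back only that one
    # INSERT instead of aborting the whole transaction.
    $dbh->do("SAVEPOINT sp");
    eval { $sth->execute($scan_id, $hostip_or_nul, $start_or_end, $scan_time) };
    if ($@) { $dbh->do("ROLLBACK TO SAVEPOINT sp") }
    else    { $dbh->do("RELEASE SAVEPOINT sp") }
}
$dbh->commit;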

Re: psql vs perl prepared inserts

From
"Daniel Verite"
Date:
    Neil Conway wrote:

> For an INSERT query without any sub-selects
> that is not rewritten by any rules, the cost to parse, rewrite and plan
> the statement is trivial. So I wouldn't expect prepared statements to be
> a big win -- you would gain a lot more from batching multiple inserts
> into a single transaction, and more still from using COPY.

FWIW, when testing pgstream [1] I typically see a 50% increase in execution
speed when switching to prepared statements in such a scenario.
I'm attaching a small test program that inserts 10000 rows into a 5-column
table, first without and then with prepared statements, and displays the
elapsed time.

Example of results:
  elapsed time in loop 0 is 1873 ms (PQexec)
  elapsed time in loop 1 is 1136 ms (PQexecPrepared)
That's with Unix domain sockets and an 8.0.1 server.

[1] a thin C++ layer on top of libpq (http://manitou-mail.org/pgstream) that
happens to have a unified API for prepared/non-prepared statements.

--
 Daniel
 PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

#include "pgstream.h"
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char** argv)
{
  const int loops=10000;
  pg_cnx cnx;
  char buf[]="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
  try {
    struct timeval tv1, tv2;
    if (argc < 2) {
      std::cerr << "Usage: " << argv[0] << " connection_string\n";
      exit(1);
    }
    cnx.connect(argv[1]);

    for (int j=0; j<2; j++) {
      cnx.set_option("bind_variables", j>0);
      cnx.set_option("prepare_statements", j>0);
      {
        pg_trans trans(cnx);
        pg_stmt("CREATE TABLE txt1(pk1 int,t1 text,t2 text,"
                "t3 text, t4 text)", cnx);
        gettimeofday(&tv1,NULL);
        pg_stream s1("INSERT INTO txt1(pk1,t1,t2,t3,t4) "
                     "VALUES(:i,:p1,:p2,:p3,:p4)", cnx);
        for (int i=0; i<loops; i++) {
          s1 << i << buf << buf << buf << buf;
        }
        gettimeofday(&tv2,NULL);
        pg_stmt("DROP TABLE txt1", cnx);
        trans.commit();
        printf("elapsed time in loop %d is %d ms\n", j,
               (tv2.tv_sec-tv1.tv_sec)*1000+(tv2.tv_usec-tv1.tv_usec)/1000);
      }
    }
  }
  catch(pg_excpt p) {
    std::cerr << p.full_error_txt();
  }
  return 0;
}

Re: psql vs perl prepared inserts

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> The intent of prepared statements is to reduce the overhead of running
> the parser, rewriter and planner multiple times for a statement that is
> executed multiple times. For an INSERT query without any sub-selects
> that is not rewritten by any rules, the cost to parse, rewrite and plan
> the statement is trivial.

You'd be surprised ...

I was looking into this just the other day, with a test case that looks
like

 create table bench1 (id int NOT NULL,id2 int NOT NULL,id3 int NOT NULL,
                      dummy1 char(30));
 create unique index bench1_index_ on bench1  (id,id2);
 create index bench1_index_1 on bench1  (id3);
 begin;
 insert into bench1 values (0,0,0,'ABCDEFGHIJ');
 ... 300,000 inserts ...
 insert into bench1 values (167151,167151,167151,'CDEFGHIJKL');
 commit;

According to gprof, the above-1% functions are

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  6.62     53.38    53.38   300007     0.00     0.00  yyparse
  5.31     96.19    42.81 11808781     0.00     0.00  hash_search
  4.44    132.00    35.81 61268959     0.00     0.00  AllocSetAlloc
  4.23    166.13    34.13   902304     0.00     0.00  XLogInsert
  3.20    191.99    25.86 13688735     0.00     0.00  _bt_compare
  2.94    215.75    23.76 12158347     0.00     0.00  LWLockAcquire
  2.67    237.32    21.58  4500066     0.00     0.00  base_yylex
  2.56    258.00    20.68  6000510     0.00     0.00  SearchCatCache
  1.99    274.07    16.07 12160856     0.00     0.00  LWLockRelease
  1.88    289.25    15.18 13008925     0.00     0.00  hash_any
  1.49    301.25    12.01  2452386     0.00     0.00  PinBuffer
  1.36    312.25    11.00  1201324     0.00     0.00  fmgr_info_cxt_security
  1.36    323.24    10.98   300000     0.00     0.00  planner
  1.19    332.81     9.57 20700142     0.00     0.00  MemoryContextAllocZeroAligned

I don't trust gprof's tree-structured breakdown entirely, but it puts
a pretty significant fraction of the blame on parse/plan activities:

                3.66  767.69  300007/300007      PostgresMain [4]
[5]     95.6    3.66  767.69  300007         exec_simple_query [5]
                6.13  283.12  300007/300007      PortalRun [6]
                0.48  167.39  300007/300007      pg_analyze_and_rewrite [9]
                0.47  122.85  300007/300007      pg_plan_queries [16]
                1.56   93.29  300007/300007      pg_parse_query [23]
                0.62   34.78  300007/300007      pg_rewrite_queries [52]
                0.99   17.39  300007/300007      PortalDrop [79]
                0.56   16.26  600014/600014      finish_xact_command [84]
                1.19    6.89  300007/300007      CreatePortal [126]
                1.08    2.29  600014/600014      start_xact_command [186]
                1.88    0.36  300007/300007      PortalStart [218]
                0.76    1.44  300007/300007      pq_puttextmessage [220]
                1.35    0.00  300007/600017      set_ps_display [210]
                1.27    0.00  300007/300007      CreateCommandTag [271]
                0.89    0.22  300007/300007      printtup_create_DR [286]
                0.74    0.00  300007/300007      CreateDestReceiver [328]
                0.61    0.00  300007/600015      pgstat_report_activity [277]
                0.10    0.14  300007/13864259     pfree [112]
                0.23    0.00  300007/300007      PortalSetResultFormat [449]
                0.19    0.00  300007/300007      IsAbortedTransactionBlockState [474]
                0.07    0.00  300007/300007      printtup_destroy [564]
                0.05    0.00  300007/300007      PortalDefineQuery [580]
                0.03    0.00  300007/300007      EndCommand [617]
                0.01    0.00  300007/300007      BeginCommand [670]

That adds up to over 50% of the runtime spent in parse/rewrite/plan.

I haven't gotten around to converting the test case into a program that
can use a prepared INSERT command, but it looks plausible to expect
a factor of 2 or so speedup ... of course, using COPY would completely
blow this away, anyway ...

            regards, tom lane

Re: psql vs perl prepared inserts

From
Sean Davis
Date:
On Apr 13, 2005, at 9:57 AM, Matt Van Mater wrote:

> Thanks to all who replied.  Thanks for the tip on that last thread
> Tom, I don't know how I missed it.  I have a hunch that it's not
> applicable to me at this time because I'm running a year and a half
> old software (included in OpenBSD 3.4), but I will have to check which
> version of DBD::Pg was installed.
>
>> The intent of prepared statements is to reduce the overhead of running
>> the parser, rewriter and planner multiple times for a statement that
>> is
>> executed multiple times. For an INSERT query without any sub-selects
>> that is not rewritten by any rules, the cost to parse, rewrite and
>> plan
>> the statement is trivial. So I wouldn't expect prepared statements to
>> be
>> a big win -- you would gain a lot more from batching multiple inserts
>> into a single transaction, and more still from using COPY.
>
> I was thinking something along the same lines, and was considering
> using the COPY statement as my next step, but as someone mentioned
> then I have to load it into a temporary database and then do some more
> internal magic to preserve referential integrity, unique constraints,
> etc.  For that reason I was hoping to keep it in perl, and it's always
> nice to keep everything in a single neat portable package.

You can use pl/perl stored functions to do exactly what you do in perl
on the client side, but it happens much faster since the data is
already on the server.  Then, your neat little portable package IS the
database.

Sean


Re: psql vs perl prepared inserts

From
Alvaro Herrera
Date:
On Wed, Apr 13, 2005 at 09:57:09AM -0400, Matt Van Mater wrote:

> Also, I forgot to mention earlier that I tried using transactions to
> speed things up, but since I expect to see certain inserts fail I
> would need to rework my code so the whole transaction doesn't fail if
> one insert goes bad.  This is somewhat contrary to the purpose of
> transactions so I'm not sure how to accomplish this.

Try the pgloader project at pgfoundry.  It tries to insert all rows
using COPY and the rejected ones go to a file, using something akin to
binary search.  It may be of some use.  I haven't tried it so I'm not
sure of its maturity.  Let us know how it goes if you try it!

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"And as an added bonus, now my computer goes to the toilet for me, leaving me
free to spend time on more useful activities! yay slug codefests!" (C. Parker)

Re: psql vs perl prepared inserts

From
Dawid Kuroczko
Date:
On 4/13/05, Matt Van Mater <matt.vanmater@gmail.com> wrote:
> > The intent of prepared statements is to reduce the overhead of running
> > the parser, rewriter and planner multiple times for a statement that is
> > executed multiple times. For an INSERT query without any sub-selects
> > that is not rewritten by any rules, the cost to parse, rewrite and plan
> > the statement is trivial. So I wouldn't expect prepared statements to be
> > a big win -- you would gain a lot more from batching multiple inserts
> > into a single transaction, and more still from using COPY.
> I was thinking something along the same lines, and was considering
> using the COPY statement as my next step, but as someone mentioned
> then I have to load it into a temporary database and then do some more
> internal magic to preserve referential integrity, unique constraints,
> etc.  For that reason I was hoping to keep it in perl, and it's always
> nice to keep everything in a single neat portable package.

I believe that COPY handles things like unique constraints and
referential integrity just fine (of the latter I am not sure) -- well,
it's hard to imagine PostgreSQL allowing itself to lose the integrity
of the data, don't you agree?

A simple test:
qnex=# CREATE TABLE a (a int PRIMARY KEY);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"a_pkey" for table "a"
CREATE TABLE
qnex=# COPY a FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1
>> 2
>> 3
>> 1
>> \.
ERROR:  duplicate key violates unique constraint "a_pkey"
CONTEXT:  COPY a, line 4: "1"

...you may have to rethink your design if there are any RULEs or
TRIGGERs associated with said table.  Apart from that, there
should be no problems, beyond the risk that one "bad" row will cancel
the whole COPY command, so if you trust your data, it will be fine. :)

   Regards,
     Dawid

Re: psql vs perl prepared inserts

From
Mike Rylander
Date:
Just to add some clarity to Dawid's fine comments:

On 4/14/05, Dawid Kuroczko <qnex42@gmail.com> wrote:
> On 4/13/05, Matt Van Mater <matt.vanmater@gmail.com> wrote:
> I believe that COPY handles things like unique constraints and
> referential integrity just fine (of the latter I am not sure) -- well,
> it's hard to imagine PostgreSQL allowing itself to lose the integrity
> of the data, don't you agree?
>

I do, and so does PG. :)  Everything that would normally happen with
an INSERT, including firing TRIGGERs, checking FKEYs, setting
DEFAULTs, running CHECK constraints -- everything -- happens with a
COPY import.

The *one and only* exception to the above statement is that RULEs do
not get applied on COPY.  The reason for this is that RULEs rewrite
queries, and after the COPY FROM header there isn't a query to rewrite,
it's just a pile of data.

This is in the docs, but it's not easy to find.  It's actually just
one line in the "Notes" section of
http://www.postgresql.org/docs/8.0/static/sql-copy.html .  Perhaps we
should add some more verbiage (he says, non-voluntarily...)?

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org